On-policy concurrent reinforcement learning

Bikramjit Banerjee, Sandip Sen, Jing Peng

Research output: Contribution to journal › Article › Research › peer-review

5 Citations (Scopus)

Abstract

When an agent learns in a multi-agent environment, the payoff it receives depends on the behaviour of the other agents. If the other agents are also learning, its reward distribution becomes non-stationary, which makes learning in multi-agent systems more difficult than single-agent learning. Prior attempts at value-function-based learning in such domains have used off-policy Q-learning, which does not scale well, as the cornerstone, with limited success. This paper studies on-policy modifications of such algorithms, which promise scalability and efficiency. In particular, it is proven that these hybrid techniques are guaranteed to converge to their desired fixed points under some restrictions. It is also shown experimentally that the new techniques can learn (from self-play) better policies than the previous algorithms (also in self-play) during some phases of exploration.
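
To make the off-policy/on-policy distinction in the abstract concrete, the sketch below contrasts a tabular Q-learning update with a SARSA-style on-policy update under an epsilon-greedy behaviour policy. This is a minimal, generic illustration only, assuming a tabular single-agent setting; the names (ALPHA, GAMMA, EPS, epsilon_greedy) are placeholders chosen here, and it does not reproduce the paper's hybrid multi-agent algorithms or their convergence construction.

import random
from collections import defaultdict

# Minimal sketch (not the paper's algorithms): tabular value estimates
# Q[(state, action)] and the two kinds of bootstrapped update.

ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.95  # discount factor (illustrative value)
EPS = 0.1     # exploration rate of the epsilon-greedy behaviour policy

Q = defaultdict(float)

def epsilon_greedy(state, actions):
    """Behaviour policy: explore with probability EPS, otherwise act greedily on Q."""
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    """Off-policy: bootstrap on the greedy next action, regardless of which
    action the behaviour policy will actually take in s_next."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstrap on the action actually selected in s_next, so the
    learned values reflect the exploring policy being followed."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])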

Original language: English
Pages (from-to): 245-260
Number of pages: 16
Journal: Journal of Experimental and Theoretical Artificial Intelligence
Volume: 16
Issue number: 4
DOIs: 10.1080/09528130412331297956
State: Published - 1 Oct 2004

Keywords

  • Game theory
  • Multi-agent learning
  • On-policy reinforcement learning

Cite this

Banerjee, Bikramjit; Sen, Sandip; Peng, Jing. On-policy concurrent reinforcement learning. In: Journal of Experimental and Theoretical Artificial Intelligence. 2004; Vol. 16, No. 4, pp. 245-260.
@article{16bc7705af694660a2bc07016e8a498a,
title = "On-policy concurrent reinforcement learning",
keywords = "Game theory, Multi-agent learning, On-policy reinforcement learning",
author = "Bikramjit Banerjee and Sandip Sen and Jing Peng",
year = "2004",
month = "10",
day = "1",
doi = "10.1080/09528130412331297956",
language = "English",
volume = "16",
pages = "245--260",
journal = "Journal of Experimental and Theoretical Artificial Intelligence",
issn = "0952-813X",
publisher = "Taylor and Francis Ltd.",
number = "4",

}

Link to publication in Scopus: http://www.scopus.com/inward/record.url?scp=9144256373&partnerID=8YFLogxK