R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning

Cited by: 472
Authors
Brafman, RI [1 ]
Tennenholtz, M
Affiliations
[1] Ben Gurion Univ Negev, Dept Comp Sci, IL-84105 Beer Sheva, Israel
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[3] Technion Israel Inst Technol, Fac Ind Engn & Management, IL-32000 Haifa, Israel
Keywords
reinforcement learning; learning in games; stochastic games; Markov decision processes; provably efficient learning;
DOI
10.1162/153244303765208377
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate, model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, it is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E-3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the "optimism under uncertainty" bias used in many RL algorithms. (4) It is simpler, more general, and more efficient than Brafman and Tennenholtz's LSG algorithm for learning in single controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) It is the only algorithm for learning in repeated games, to date, which is provably efficient, considerably improving and simplifying previous algorithms by Banos and by Megiddo.
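The mechanism described in the abstract can be sketched in a few dozen lines. In the hedged sketch below, every state-action pair starts "unknown" and optimistically yields the maximal reward `r_max` while transitioning to a fictitious absorbing state; once a pair has been observed `m` times, the model switches to the empirical estimates and the agent re-plans. The `env_step` interface, the parameters `m`, `gamma`, `episodes`, and `horizon`, and the use of discounted value iteration (the paper's analysis is in terms of average reward over T-step phases) are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rmax_sketch(env_step, n_states, n_actions, r_max,
                m=10, gamma=0.95, episodes=200, horizon=50):
    """Hedged sketch of R-MAX. `env_step(s, a) -> (reward, next_state)`
    is an assumed environment interface."""
    S = n_states + 1                     # one extra fictitious absorbing state
    absorb = n_states
    # Optimistic initialization: every (s, a) returns r_max and
    # leads to the absorbing state until it has been seen m times.
    R = np.full((S, n_actions), r_max, dtype=float)
    T = np.zeros((S, n_actions, S))
    T[:, :, absorb] = 1.0
    counts = np.zeros((n_states, n_actions), dtype=int)
    r_sum = np.zeros((n_states, n_actions))
    t_counts = np.zeros((n_states, n_actions, n_states))

    def solve():
        # Value iteration on the current (optimistic) model.
        Q = np.zeros((S, n_actions))
        for _ in range(200):
            V = Q.max(axis=1)
            Q = R + gamma * T @ V
        return Q

    s = 0
    Q = solve()
    for _ in range(episodes * horizon):
        a = int(np.argmax(Q[s]))         # act greedily w.r.t. the model
        r, s2 = env_step(s, a)
        if counts[s, a] < m:
            counts[s, a] += 1
            r_sum[s, a] += r
            t_counts[s, a, s2] += 1
            if counts[s, a] == m:        # (s, a) becomes "known"
                R[s, a] = r_sum[s, a] / m
                T[s, a, :n_states] = t_counts[s, a] / m
                T[s, a, absorb] = 0.0
                Q = solve()              # re-plan only when the model changes
        s = s2
    return Q[:n_states]
```

Because unknown pairs look maximally rewarding, the greedy policy is automatically drawn toward unexplored regions, which is the built-in exploration/exploitation mechanism point (2) of the abstract refers to.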
Pages: 213 - 231 (19 pages)
Related Papers
20 items total
  • [1] AUMANN R. J., 1995, Repeated games with incomplete information
  • [2] ON PSEUDO-GAMES
    BANOS, A
    [J]. ANNALS OF MATHEMATICAL STATISTICS, 1968, 39 (06): : 1932 - &
  • [3] A near-optimal polynomial time algorithm for learning in certain classes of stochastic games
    Brafman, RI
    Tennenholtz, M
    [J]. ARTIFICIAL INTELLIGENCE, 2000, 121 (1-2) : 31 - 47
  • [4] SELF-CONFIRMING EQUILIBRIUM
    FUDENBERG, D
    LEVINE, DK
    [J]. ECONOMETRICA, 1993, 61 (03) : 523 - 545
  • [5] HART S, 2000, REINFORCEMENT PROCED
  • [6] Hoffman A., 1966, MANAGE SCI, V12, P359, DOI DOI 10.1287/MNSC.12.5.359
  • [7] Hu J., 1998, P 15 INT C MACHINE L
  • [8] KAELBLING LP, 1993, LEARNING EMBEDDED SY
  • [9] Reinforcement learning: A survey
    Kaelbling, LP
    Littman, ML
    Moore, AW
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 1996, 4 : 237 - 285
  • [10] Kearns M, 1999, IJCAI-99: PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 & 2, P740