Near-optimal reinforcement learning in polynomial time

Cited by: 371
Authors
Kearns, M
Singh, S
Affiliations
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] Syntek Capital, New York, NY 10019 USA
[3] AT&T Labs, Florham Pk, NJ USA
Funding
U.S. National Science Foundation;
Keywords
reinforcement learning; Markov decision processes; exploration versus exploitation;
DOI
10.1023/A:1017984413808
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.
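The abstract's mention of explicitly handling the exploration-exploitation trade-off refers to the idea, developed in the paper, of distinguishing "known" states, whose transitions and rewards have been estimated from enough visits, from unknown ones, and then planning either to exploit within the known states or to escape to unknown ones. The Python sketch below is only an illustration of that idea under simplifying assumptions: the class name E3Sketch, the visit threshold m_known, the discounted finite-horizon planner, and the rule of simply comparing the two planned values are hypothetical choices, not the paper's exact construction or its guarantees.

```python
import numpy as np

# Hypothetical sketch of the "known state" exploration/exploitation idea.
# Thresholds, names, and the exploit-vs-explore decision rule are assumptions
# for illustration, not the algorithm from the paper.
class E3Sketch:
    def __init__(self, n_states, n_actions, m_known=50, horizon=20, gamma=0.95):
        self.nS, self.nA = n_states, n_actions
        self.m_known = m_known        # visits per (s, a) before a state counts as "known"
        self.horizon = horizon        # planning horizon T
        self.gamma = gamma
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.trans = np.zeros((n_states, n_actions, n_states))   # transition counts
        self.rew_sum = np.zeros((n_states, n_actions))            # summed rewards

    def known(self, s):
        # A state is "known" once every action has been tried m_known times.
        return bool((self.counts[s] >= self.m_known).all())

    def update(self, s, a, r, s_next):
        # Record one observed transition (s, a, r, s_next).
        self.counts[s, a] += 1
        self.trans[s, a, s_next] += 1
        self.rew_sum[s, a] += r

    def act(self, s):
        if not self.known(s):
            # Balanced wandering: in an unknown state, try the least-tried action.
            return int(np.argmin(self.counts[s]))
        # In a known state, plan on the empirical model. Exploitation uses the
        # estimated rewards; exploration instead rewards reaching unknown states.
        # Unvisited (s, a) pairs are handled crudely here (zero rows), which is
        # acceptable only as a sketch.
        denom = np.maximum(self.counts, 1)
        P = self.trans / denom[..., None]
        R_exploit = self.rew_sum / denom
        unknown = np.array([not self.known(sp) for sp in range(self.nS)], dtype=float)
        R_explore = P @ unknown   # expected chance of escaping the known set
        a_exploit, v_exploit = self._plan(P, R_exploit, s)
        a_explore, v_explore = self._plan(P, R_explore, s)
        return a_exploit if v_exploit >= v_explore else a_explore

    def _plan(self, P, R, s):
        # Finite-horizon value iteration on the empirical model.
        V = np.zeros(self.nS)
        Q = np.zeros((self.nS, self.nA))
        for _ in range(self.horizon):
            Q = R + self.gamma * (P @ V)
            V = Q.max(axis=1)
        return int(Q[s].argmax()), float(V[s])
```

In the paper itself, the decision between exploiting and exploring is made by checking whether a policy on the known-state model already achieves near-optimal return over the horizon T; the simple value comparison above merely stands in for that test.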
Pages: 209-232
Page count: 24