Near-optimal reinforcement learning in polynomial time

Cited by: 371
Authors
Kearns, M
Singh, S
Affiliations
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] Syntek Capital, New York, NY 10019 USA
[3] AT&T Labs, Florham Pk, NJ USA
Funding
U.S. National Science Foundation;
Keywords
reinforcement learning; Markov decision processes; exploration versus exploitation;
DOI
10.1023/A:1017984413808
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.
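The abstract's mention of explicitly handling the exploration-exploitation trade-off refers to the idea, developed in the paper, of distinguishing "known" states, whose transitions and rewards have been estimated from enough visits, from unknown ones, and then planning either to exploit within the known states or to escape to unknown ones. The Python sketch below is only an illustration of that idea under simplifying assumptions: the class name E3Sketch, the visit threshold m_known, the discounted finite-horizon planner, and the rule of simply comparing the two planned values are hypothetical choices, not the paper's exact construction or its guarantees.

```python
import numpy as np

# Hypothetical sketch of the "known state" exploration/exploitation idea.
# Thresholds, names, and the exploit-vs-explore decision rule are assumptions
# for illustration, not the algorithm from the paper.
class E3Sketch:
    def __init__(self, n_states, n_actions, m_known=50, horizon=20, gamma=0.95):
        self.nS, self.nA = n_states, n_actions
        self.m_known = m_known        # visits per (s, a) before a state counts as "known"
        self.horizon = horizon        # planning horizon T
        self.gamma = gamma
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.trans = np.zeros((n_states, n_actions, n_states))   # transition counts
        self.rew_sum = np.zeros((n_states, n_actions))            # summed rewards

    def known(self, s):
        # A state is "known" once every action has been tried m_known times.
        return bool((self.counts[s] >= self.m_known).all())

    def update(self, s, a, r, s_next):
        # Record one observed transition (s, a, r, s_next).
        self.counts[s, a] += 1
        self.trans[s, a, s_next] += 1
        self.rew_sum[s, a] += r

    def act(self, s):
        if not self.known(s):
            # Balanced wandering: in an unknown state, try the least-tried action.
            return int(np.argmin(self.counts[s]))
        # In a known state, plan on the empirical model. Exploitation uses the
        # estimated rewards; exploration instead rewards reaching unknown states.
        # Unvisited (s, a) pairs are handled crudely here (zero rows), which is
        # acceptable only as a sketch.
        denom = np.maximum(self.counts, 1)
        P = self.trans / denom[..., None]
        R_exploit = self.rew_sum / denom
        unknown = np.array([not self.known(sp) for sp in range(self.nS)], dtype=float)
        R_explore = P @ unknown   # expected chance of escaping the known set
        a_exploit, v_exploit = self._plan(P, R_exploit, s)
        a_explore, v_explore = self._plan(P, R_explore, s)
        return a_exploit if v_exploit >= v_explore else a_explore

    def _plan(self, P, R, s):
        # Finite-horizon value iteration on the empirical model.
        V = np.zeros(self.nS)
        Q = np.zeros((self.nS, self.nA))
        for _ in range(self.horizon):
            Q = R + self.gamma * (P @ V)
            V = Q.max(axis=1)
        return int(Q[s].argmax()), float(V[s])
```

In the paper itself, the decision between exploiting and exploring is made by checking whether a policy on the known-state model already achieves near-optimal return over the horizon T; the simple value comparison above merely stands in for that test.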
Pages: 209-232
Page count: 24