Convergence results for single-step on-policy reinforcement-learning algorithms

Cited by: 412
Authors
Singh, S
Jaakkola, T
Littman, ML
Szepesvári, C
Affiliations
[1] AT&T Labs Res, Florham Pk, NJ 07932 USA
[2] MIT, Dept Comp Sci, Cambridge, MA 02139 USA
[3] Duke Univ, Dept Comp Sci, Durham, NC 27708 USA
[4] Mindmaker Ltd, H-1121 Budapest, Hungary
Funding
Hungarian Scientific Research Fund; U.S. National Science Foundation
Keywords
reinforcement learning; on-policy; convergence; Markov decision processes
DOI
10.1023/A:1007678930559
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning and that result in convergence to both optimal values and optimal policies.
Pages: 287-308
Page count: 22
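
The canonical single-step on-policy algorithm the abstract refers to is Sarsa(0). Below is a minimal tabular sketch with a decaying epsilon-greedy exploration schedule, for illustration only: the env.reset()/env.step() interface, the 1/t decay, and the constant step size alpha are assumptions not taken from the paper (the actual convergence results impose conditions such as appropriately decaying learning rates and greedy-in-the-limit exploration).

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=1000, alpha=0.1, gamma=0.95):
    """Tabular Sarsa(0) with a decaying epsilon-greedy behaviour policy.

    States are assumed hashable; env.reset() returns a state and
    env.step(a) returns (next_state, reward, done). This interface is
    an illustrative assumption, not part of the paper.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

    def epsilon_greedy(state, eps):
        if random.random() < eps:                  # explore
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])  # exploit

    for t in range(1, episodes + 1):
        eps = 1.0 / t      # decaying exploration: greedy in the limit
        s = env.reset()
        a = epsilon_greedy(s, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next, eps)
            # On-policy update: bootstrap on the action the behaviour
            # policy actually takes next (this is what makes it Sarsa).
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```

Because the same epsilon-greedy policy both generates behaviour and is the target of evaluation, exploration cannot be separated from learning; decaying eps toward zero is one example of the "decaying exploration" strategies the paper analyzes.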