ON THE CONVERGENCE OF STOCHASTIC ITERATIVE DYNAMIC PROGRAMMING ALGORITHMS

Cited by: 376
Authors
Jaakkola, T. [1]
Jordan, M. I. [1]
Singh, S. P. [1]
Affiliation
[1] University of Massachusetts, Department of Computer Science, Amherst, MA 01003
DOI
10.1162/neco.1994.6.6.1185
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
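To make the abstract concrete: the paper's central result (its Theorem 1, paraphrased here in simplified form; the exact statement uses separate step-size sequences and a weighted maximum norm \(\|\cdot\|_W\)) says that a stochastic iterative process

\[
\Delta_{t+1}(x) \;=\; \bigl(1 - \alpha_t(x)\bigr)\,\Delta_t(x) \;+\; \alpha_t(x)\,F_t(x)
\]

converges to zero with probability 1 provided the step sizes satisfy the Robbins-Monro conditions

\[
\sum_t \alpha_t(x) = \infty, \qquad \sum_t \alpha_t^2(x) < \infty,
\]

the expected update contracts, \(\bigl\|\mathbb{E}[F_t(x) \mid P_t]\bigr\|_W \le \gamma\,\|\Delta_t\|_W\) with \(0 \le \gamma < 1\), and the conditional variance is bounded, \(\mathrm{Var}[F_t(x) \mid P_t] \le C\,(1 + \|\Delta_t\|_W)^2\), where \(P_t\) denotes the history of the process up to step t. Q-learning fits the template with \(\Delta_t = Q_t - Q^*\), because the Bellman optimality operator is a max-norm contraction with modulus equal to the discount factor.

The following Python sketch shows the template in action. It is an illustration, not code from the paper: the two-state MDP (transition table P, reward table R) and the visit-count step size are assumptions made for the example. The per-pair step size \(\alpha = 1/n(s,a)\) satisfies the summability conditions above.

    import numpy as np

    rng = np.random.default_rng(0)

    n_states, n_actions = 2, 2
    P = np.array([[0, 1], [1, 0]])          # P[s, a]: deterministic next state (illustrative)
    R = np.array([[0.0, 1.0], [0.5, 0.0]])  # R[s, a]: immediate reward (illustrative)
    gamma = 0.9                             # discount factor

    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))

    s = 0
    for t in range(200_000):
        a = rng.integers(n_actions)         # uniform exploration: every (s, a) visited infinitely often
        s_next, r = P[s, a], R[s, a]
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]          # sum(alpha) = inf, sum(alpha^2) < inf for each pair
        # Stochastic-approximation form of Q-learning:
        # Q(s, a) <- (1 - alpha) Q(s, a) + alpha * (sampled Bellman optimality backup)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

    print(np.round(Q, 3))

Run long enough, Q approaches the optimal action values Q*; convergence of exactly this kind of iteration is what the paper's theorem guarantees.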
Pages: 1185-1201
Number of pages: 17
References
17 items in total
[1] Aoki, M. (1967). Optimization of Stochastic Systems. Academic Press.
[2] Barto, A. G. (1990). Advances in Neural Information Processing Systems, 2, 686.
[3] Barto, A. G. (1994). Artificial Intelligence (in press).
[4] Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.
[5] Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall.
[6] Dayan, P. (1993). TD(λ) converges with probability 1.
[7] Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3-4), 341-362.
[8] Dvoretzky, A. (1956). On stochastic approximation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability.
[9] Peng, J. (1993). TD(λ) converges with probability 1.
[10] Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400-407.