ON THE CONVERGENCE OF STOCHASTIC ITERATIVE DYNAMIC PROGRAMMING ALGORITHMS

Cited by: 376
Authors
Jaakkola, T. [1]
Jordan, M. I. [1]
Singh, S. P. [1]
Affiliation
[1] University of Massachusetts, Department of Computer Science, Amherst, MA 01003
DOI
10.1162/neco.1994.6.6.1185
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
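To make the abstract concrete: the paper's central result (its Theorem 1, paraphrased here in simplified form; the exact statement uses separate step-size sequences and a weighted maximum norm \(\|\cdot\|_W\)) says that a stochastic iterative process

\[
\Delta_{t+1}(x) \;=\; \bigl(1 - \alpha_t(x)\bigr)\,\Delta_t(x) \;+\; \alpha_t(x)\,F_t(x)
\]

converges to zero with probability 1 provided the step sizes satisfy the Robbins-Monro conditions

\[
\sum_t \alpha_t(x) = \infty, \qquad \sum_t \alpha_t^2(x) < \infty,
\]

the expected update contracts, \(\bigl\|\mathbb{E}[F_t(x) \mid P_t]\bigr\|_W \le \gamma\,\|\Delta_t\|_W\) with \(0 \le \gamma < 1\), and the conditional variance is bounded, \(\mathrm{Var}[F_t(x) \mid P_t] \le C\,(1 + \|\Delta_t\|_W)^2\), where \(P_t\) denotes the history of the process up to step t. Q-learning fits the template with \(\Delta_t = Q_t - Q^*\), because the Bellman optimality operator is a max-norm contraction with modulus equal to the discount factor.

The following Python sketch shows the template in action. It is an illustration, not code from the paper: the two-state MDP (transition table P, reward table R) and the visit-count step size are assumptions made for the example. The per-pair step size \(\alpha = 1/n(s,a)\) satisfies the summability conditions above.

    import numpy as np

    rng = np.random.default_rng(0)

    n_states, n_actions = 2, 2
    P = np.array([[0, 1], [1, 0]])          # P[s, a]: deterministic next state (illustrative)
    R = np.array([[0.0, 1.0], [0.5, 0.0]])  # R[s, a]: immediate reward (illustrative)
    gamma = 0.9                             # discount factor

    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))

    s = 0
    for t in range(200_000):
        a = rng.integers(n_actions)         # uniform exploration: every (s, a) visited infinitely often
        s_next, r = P[s, a], R[s, a]
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]          # sum(alpha) = inf, sum(alpha^2) < inf for each pair
        # Stochastic-approximation form of Q-learning:
        # Q(s, a) <- (1 - alpha) Q(s, a) + alpha * (sampled Bellman optimality backup)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

    print(np.round(Q, 3))

Run long enough, Q approaches the optimal action values Q*; convergence of exactly this kind of iteration is what the paper's theorem guarantees.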
Pages: 1185-1201
Number of pages: 17
References
17 items in total
[1] Aoki, M. (1967). Optimization of Stochastic Systems. Academic Press.
[2] Barto, A. G. (1990). Advances in Neural Information Processing Systems, 2, 686.
[3] Barto, A. G. (1994). Artificial Intelligence (in press).
[4] Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.
[5] Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall.
[6] Dayan, P. (1993). TD(λ) converges with probability 1.
[7] Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3-4), 341-362.
[8] Dvoretzky, A. (1956). On stochastic approximation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability.
[9] Peng, J. (1993). TD(λ) converges with probability 1.
[10] Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400-407.