THE CONVERGENCE OF TD(LAMBDA) FOR GENERAL LAMBDA

Cited by: 154
Author
DAYAN, P
Affiliations
[1] UNIV EDINBURGH,CTR COGNIT SCI,EDINBURGH EH8 9LW,SCOTLAND
[2] UNIV EDINBURGH,DEPT PHYS,EDINBURGH EH8 9LW,SCOTLAND
Keywords
REINFORCEMENT LEARNING; TEMPORAL DIFFERENCES; ASYNCHRONOUS DYNAMIC PROGRAMMING
DOI
10.1007/BF00992701
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case that uses only information from adjacent time steps to the case involving information from arbitrary ones. It also considers how this version of TD behaves in the face of linearly dependent representations for states, demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally, it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
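The abstract contrasts one-step TD with the general TD(lambda) family, in which an eligibility trace spreads each temporal-difference error back over recently visited states. The sketch below is only an illustration of that mechanism with a linear function approximator; the environment interface (env.reset, env.step), the feature map phi, and the step-size and trace constants are assumptions made for the example, not details taken from the paper.

import numpy as np

def td_lambda(env, phi, num_features, episodes=100,
              alpha=0.05, gamma=1.0, lam=0.5):
    # Estimate state values V(s) ~ w . phi(s) by TD(lambda) with
    # accumulating eligibility traces.
    w = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()                      # assumed environment API
        e = np.zeros(num_features)           # eligibility trace
        done = False
        while not done:
            s_next, r, done = env.step(s)    # assumed to return (state, reward, done)
            x = phi(s)
            v = w @ x
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - v   # temporal-difference error
            e = gamma * lam * e + x          # decay old credit, add current features
            w += alpha * delta * e           # update every recently active feature
            s = s_next
    return w

With lam = 0 the trace reduces to the current feature vector and the update uses only information from adjacent time steps, as in Sutton (1988); larger values of lam mix in information from more distant steps, which is the general case the paper's convergence result covers.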
Pages: 341 - 362
Number of pages: 22
Related references
21 references in total
  • [1] Albus J. S., 1975, Transactions of the ASME. Series G, Journal of Dynamic Systems, Measurement and Control, V97, P220, DOI 10.1115/1.3426922
  • [2] [Anonymous], 1989, LEARNING DELAYED REW
  • [3] Barto A., 1990, LEARNING COMPUTATION
  • [4] Barto A. G., Sutton R. S., Anderson C. W., 1983, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, V13, P834-846
  • [5] Bellman Richard, 1962, APPL DYNAMIC PROGRAM
  • [6] DAYAN P, 1991, THESIS U EDINBURGH S
  • [7] Hampson S. E., 1990, CONNECTIONISTIC PROB
  • [8] Hampson S. E., 1983, THESIS U CALIFORNIA
  • [9] Holland J. H., 1986, MACHINE LEARNING ART, V2
  • [10] KLOPF AH, 1972, AFCRL720164 AIR FORC