THE CONVERGENCE OF TD(LAMBDA) FOR GENERAL LAMBDA

Cited by: 154
Author
DAYAN, P
Affiliations
[1] UNIV EDINBURGH,CTR COGNIT SCI,EDINBURGH EH8 9LW,SCOTLAND
[2] UNIV EDINBURGH,DEPT PHYS,EDINBURGH EH8 9LW,SCOTLAND
Keywords
REINFORCEMENT LEARNING; TEMPORAL DIFFERENCES; ASYNCHRONOUS DYNAMIC PROGRAMMING
DOI
10.1007/BF00992701
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case that uses only information from adjacent time steps to the case involving information from arbitrary ones. It also considers how this version of TD behaves in the face of linearly dependent representations for states, demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally, it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
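The abstract contrasts one-step TD with the general TD(lambda) family, in which an eligibility trace spreads each temporal-difference error back over recently visited states. The sketch below is only an illustration of that mechanism with a linear function approximator; the environment interface (env.reset, env.step), the feature map phi, and the step-size and trace constants are assumptions made for the example, not details taken from the paper.

import numpy as np

def td_lambda(env, phi, num_features, episodes=100,
              alpha=0.05, gamma=1.0, lam=0.5):
    # Estimate state values V(s) ~ w . phi(s) by TD(lambda) with
    # accumulating eligibility traces.
    w = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()                      # assumed environment API
        e = np.zeros(num_features)           # eligibility trace
        done = False
        while not done:
            s_next, r, done = env.step(s)    # assumed to return (state, reward, done)
            x = phi(s)
            v = w @ x
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - v   # temporal-difference error
            e = gamma * lam * e + x          # decay old credit, add current features
            w += alpha * delta * e           # update every recently active feature
            s = s_next
    return w

With lam = 0 the trace reduces to the current feature vector and the update uses only information from adjacent time steps, as in Sutton (1988); larger values of lam mix in information from more distant steps, which is the general case the paper's convergence result covers.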
Pages: 341 - 362
Number of pages: 22
Related references
21 references in total
  • [1] Albus J. S., 1975, Transactions of the ASME. Series G, Journal of Dynamic Systems, Measurement and Control, V97, P220, DOI 10.1115/1.3426922
  • [2] [Anonymous], 1989, LEARNING DELAYED REW
  • [3] Barto A., 1990, LEARNING COMPUTATION
  • [4] Barto A. G., Sutton R. S., Anderson C. W., 1983, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, V13, P834-846
  • [5] Bellman Richard, 1962, APPL DYNAMIC PROGRAM
  • [6] DAYAN P, 1991, THESIS U EDINBURGH S
  • [7] Hampson S. E., 1990, CONNECTIONISTIC PROB
  • [8] Hampson S. E., 1983, THESIS U CALIFORNIA
  • [9] Holland J. H., 1986, MACHINE LEARNING ART, V2
  • [10] KLOPF AH, 1972, AFCRL720164 AIR FORC