An analysis of experience replay in temporal difference learning

Cited by: 16
Authors
Cichosz, P [1 ]
Affiliation
[1] Warsaw Univ Technol, Inst Elect Syst, PL-00665 Warsaw, Poland
Keywords
DOI
10.1080/019697299125127
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Subject classification code
0812;
Abstract
Temporal difference (TD) methods are used by reinforcement learning algorithms for predicting future rewards. This article analyzes theoretically and illustrates experimentally the effects of performing TD(lambda) prediction updates backwards over a number of past experiences. More precisely, two related techniques described in the literature are examined, referred to as replayed TD and backwards TD. The former is essentially an online learning method which performs a regular TD(0) update at each time step and then replays updates backwards for a number of previous states. The latter operates in offline mode, updating the predictions for all visited states backwards after the end of a trial. Both are shown to be approximately equivalent to TD(lambda) with variable lambda values selected in a particular way, even though they perform only TD(0) updates. The experimental results show that replayed TD(0) is competitive with TD(lambda) with respect to both learning speed and learning quality.
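The two update schemes described in the abstract can be illustrated with a minimal Python sketch for tabular prediction. This is an illustration based only on the abstract's description, not the paper's implementation: the function names, the toy random-walk environment, and the parameter values (alpha, gamma, replay_depth) are assumptions introduced for this example.

```python
import numpy as np


def td0_update(V, s, r, s_next, alpha, gamma, done):
    """Apply one regular TD(0) update to the value estimate of state s."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])


def replayed_td0(env_step, V, s0, alpha=0.1, gamma=1.0, replay_depth=5, max_steps=1000):
    """Online replayed TD(0): at each step, perform the regular TD(0) update
    for the current transition, then replay TD(0) updates backwards over the
    most recent transitions so that new reward information propagates to
    earlier states within the same step."""
    history = []  # sliding window of recent transitions (s, r, s_next, done)
    s = s0
    for _ in range(max_steps):
        s_next, r, done = env_step(s)
        history.append((s, r, s_next, done))
        if len(history) > replay_depth:
            history.pop(0)
        # First iteration below is the regular update for the current
        # transition; the remaining iterations are the backwards replay.
        for si, ri, sni, di in reversed(history):
            td0_update(V, si, ri, sni, alpha, gamma, di)
        if done:
            break
        s = s_next
    return V


def backwards_td0(trajectory, V, alpha=0.1, gamma=1.0):
    """Offline backwards TD: after a trial ends, sweep once backwards over all
    visited transitions, applying TD(0) updates in reverse order."""
    for s, r, s_next, done in reversed(trajectory):
        td0_update(V, s, r, s_next, alpha, gamma, done)
    return V


# Toy usage on a 7-state random walk (an assumed environment, not from the
# paper): states 0..6, start in the middle, reward 1 on reaching state 6.
def random_walk_step(s, rng=np.random.default_rng(0)):  # one shared generator
    s_next = s + rng.choice([-1, 1])
    done = s_next in (0, 6)
    r = 1.0 if s_next == 6 else 0.0
    return s_next, r, done


V = np.zeros(7)
for _ in range(200):  # 200 trials of online replayed TD(0)
    replayed_td0(random_walk_step, V, s0=3)
```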
Pages: 341-363
Number of pages: 23
References
18 in total
[1] Barto, A. G. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13: 835
[2] Cichosz, P. Journal of Artificial Intelligence Research, 1995, 2: 287
[3] Cichosz, P. Proceedings of the 12th International Conference on Machine Learning, 1995: 99
[4] Cichosz, P. PhD thesis, Warsaw University of Technology, 1997
[5] Kaelbling, L. P.; Littman, M. L.; Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996, 4: 237-285
[6] Klopf, A. H. The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence, 1982
[7] Lin, L.-J. PhD thesis, Carnegie Mellon University, 1993
[8] Mahadevan, S.; Connell, J. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 1992, 55(2-3): 311-365
[9] Peng, J. Proceedings of the 11th International Conference on Machine Learning, 1994: 226
[10] Rummery, G. A. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994