An analysis of temporal-difference learning with function approximation

Cited by: 651
Authors
Tsitsiklis, JN
Van Roy, B
Affiliation
[1] Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge
Funding
National Science Foundation (USA);
Keywords
dynamic programming; function approximation; Markov chains; neuro-dynamic programming; reinforcement learning; temporal-difference learning;
DOI
10.1109/9.580874
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability one), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
Pages: 674-690
Number of pages: 17
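The abstract describes online TD(λ) updates to a linear approximation of the cost-to-go function, applied along a single trajectory of the Markov chain. Below is a minimal sketch of that setting; the chain, costs, feature vectors, discount factor, λ, and step-size schedule are illustrative assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np

# Minimal sketch of TD(lambda) with a linear function approximator,
# updated online along a single trajectory of a finite, irreducible,
# aperiodic Markov chain. All numerical choices below are assumptions.

rng = np.random.default_rng(0)

n_states = 5                                          # finite state space {0, ..., 4}
P = rng.dirichlet(np.ones(n_states), size=n_states)   # transition probability matrix
cost = rng.standard_normal(n_states)                  # per-state cost g(x)
phi = rng.standard_normal((n_states, 3))              # feature vectors phi(x)

alpha = 0.9            # discount factor
lam = 0.7              # eligibility-trace parameter lambda
theta = np.zeros(3)    # parameters; approximate cost-to-go is phi(x) @ theta

x = 0
z = np.zeros_like(theta)   # eligibility trace
for t in range(100_000):
    x_next = rng.choice(n_states, p=P[x])
    # temporal difference: d_t = g(x_t) + alpha * J(x_{t+1}) - J(x_t)
    d = cost[x] + alpha * phi[x_next] @ theta - phi[x] @ theta
    # accumulate the eligibility trace and take a diminishing-step-size update
    z = alpha * lam * z + phi[x]
    step = 1.0 / (1 + t)   # step sizes satisfying the usual stochastic-approximation conditions
    theta = theta + step * d * z
    x = x_next

print("learned parameters:", theta)
```

Note that the updates here are driven by states visited along one continuing trajectory; as the abstract points out, replacing this trajectory-based sampling with updates at arbitrarily chosen states can cause the parameters to diverge.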