On the convergence of temporal-difference learning with linear function approximation

Cited by: 29

Authors:

Tadic, V [1]

Affiliations:

[1] Univ Melbourne, Dept Elect & Elect Engn, Parkville, Vic 3010, Australia

Source:

MACHINE LEARNING | 2001 / Vol. 42 / No. 3

Keywords:

temporal-difference learning; reinforcement learning; neuro-dynamic programming; almost sure convergence; Markov chains; positive Harris recurrence;

DOI:

10.1023/A:1007609817671

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The asymptotic properties of temporal-difference learning algorithms with linear function approximation are analyzed in this paper. The analysis is carried out in the context of the approximation of a discounted cost-to-go function associated with an uncontrolled Markov chain with an uncountable finite-dimensional state-space. Under mild conditions, the almost sure convergence of temporal-difference learning algorithms with linear function approximation is established and an upper bound for their asymptotic approximation error is determined. The results generalize and extend existing results on the asymptotic behavior of temporal-difference learning. Moreover, they cover cases to which the existing results cannot be applied, while the adopted assumptions seem to be the weakest possible under which the almost sure convergence of temporal-difference learning algorithms can still be demonstrated.
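
To make the setting in the abstract concrete, the following is a minimal sketch of TD(0) with linear function approximation on a continuous state space. It is not the paper's algorithm or analysis. Everything specific here is an assumption chosen for illustration: the toy chain (the next state is drawn uniformly on [0, 1], independently of the current state), the cost g(x) = x, the features (1, x), the discount factor 0.9, the step-size schedule, and the name `td0_linear`.

```python
import random


def td0_linear(num_steps=200_000, gamma=0.9, seed=0):
    """TD(0) with linear function approximation on a toy chain (illustrative only).

    Assumed setup, not taken from the paper:
      - state x in [0, 1]; the next state is uniform on [0, 1]
      - one-step cost g(x) = x
      - feature vector phi(x) = (1, x), so J(x) is approximated by theta[0] + theta[1] * x
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    x = rng.random()
    for n in range(num_steps):
        x_next = rng.random()
        cost = x
        # Linear value estimates at the current and next state.
        value = theta[0] + theta[1] * x
        value_next = theta[0] + theta[1] * x_next
        # Temporal-difference error for the discounted cost-to-go.
        delta = cost + gamma * value_next - value
        # Decreasing step size: the sum diverges, the sum of squares converges.
        step = 20.0 / (n + 100.0)
        # Update along the feature vector phi(x) = (1, x).
        theta[0] += step * delta * 1.0
        theta[1] += step * delta * x
        x = x_next
    return theta


if __name__ == "__main__":
    print(td0_linear())
```

For this toy chain the discounted cost-to-go can be worked out by hand: J(x) = x + 0.9 E[J(X')] gives J(x) = 4.5 + x, which lies in the span of the chosen features, so the iterates should settle near theta = (4.5, 1.0). In general the true cost-to-go is not in the span of the features, which is why the abstract speaks of an upper bound on the asymptotic approximation error.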


Pages: 241-267

Page count: 27
