Actor-critic-type learning algorithms for Markov decision processes

Cited by: 138
Authors
Konda, VR [1 ]
Borkar, VS [2]
Affiliations
[1] MIT, Informat & Decis Syst Lab, Cambridge, MA 02139 USA
[2] Tata Inst Fundamental Res, Sch Technol & Comp Sci, Bombay 400005, Maharashtra, India
Keywords
reinforcement learning; Markov decision processes; actor-critic algorithms; stochastic approximation; asynchronous iterations
DOI
10.1137/S036301299731669X
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm from the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis rests on two-time-scale stochastic approximation.
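To make the two-time-scale structure concrete, here is a minimal illustrative sketch, not code from the paper: the critic runs on a faster step-size schedule than the actor, so it effectively tracks the value function of the slowly changing policy. The toy MDP, the softmax policy parameterization, the TD(0) critic, and the step-size exponents are all assumptions made for this example; the paper itself analyzes more general distributed asynchronous variants.

```python
# Illustrative two-time-scale actor-critic on a random tabular MDP.
# All environment details and update forms are assumptions for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.95

# Random MDP: P[s, a] is a distribution over next states,
# R[s, a] the expected one-step reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))  # actor: policy parameters
V = np.zeros(n_states)                   # critic: value estimates

def policy(s):
    """Softmax (Boltzmann) policy induced by the actor parameters."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(1, 100_000):
    # Two time scales: the critic step size beta decays more slowly
    # than the actor step size alpha, so alpha/beta -> 0.
    beta = 1.0 / t ** 0.6   # critic (fast) step size
    alpha = 1.0 / t         # actor (slow) step size

    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # TD(0) error drives both updates.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += beta * delta

    # Policy-gradient-style actor update (one common variant; the
    # paper's actor updates differ in form).
    grad_log = -probs
    grad_log[a] += 1.0      # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += alpha * delta * grad_log

    s = s_next

print("greedy policy:", theta.argmax(axis=1))
```

The step-size choice is the essential point: with alpha = 1/t and beta = 1/t^0.6, both schedules satisfy the usual Robbins-Monro conditions while alpha/beta vanishes, which is the standing assumption behind two-time-scale convergence arguments.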
Pages: 94-123
Page count: 30