Actor-critic-type learning algorithms for Markov decision processes

Cited by: 138
Authors
Konda, VR [1 ]
Borkar, VS [2]
Affiliations
[1] MIT, Informat & Decis Syst Lab, Cambridge, MA 02139 USA
[2] Tata Inst Fundamental Res, Sch Technol & Comp Sci, Bombay 400005, Maharashtra, India
Keywords
reinforcement learning; Markov decision processes; actor-critic algorithms; stochastic approximation; asynchronous iterations
DOI
10.1137/S036301299731669X
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm from the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis rests on two-time-scale stochastic approximation.
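To make the two-time-scale structure concrete, here is a minimal illustrative sketch, not code from the paper: the critic runs on a faster step-size schedule than the actor, so it effectively tracks the value function of the slowly changing policy. The toy MDP, the softmax policy parameterization, the TD(0) critic, and the step-size exponents are all assumptions made for this example; the paper itself analyzes more general distributed asynchronous variants.

```python
# Illustrative two-time-scale actor-critic on a random tabular MDP.
# All environment details and update forms are assumptions for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.95

# Random MDP: P[s, a] is a distribution over next states,
# R[s, a] the expected one-step reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))  # actor: policy parameters
V = np.zeros(n_states)                   # critic: value estimates

def policy(s):
    """Softmax (Boltzmann) policy induced by the actor parameters."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(1, 100_000):
    # Two time scales: the critic step size beta decays more slowly
    # than the actor step size alpha, so alpha/beta -> 0.
    beta = 1.0 / t ** 0.6   # critic (fast) step size
    alpha = 1.0 / t         # actor (slow) step size

    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # TD(0) error drives both updates.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += beta * delta

    # Policy-gradient-style actor update (one common variant; the
    # paper's actor updates differ in form).
    grad_log = -probs
    grad_log[a] += 1.0      # gradient of log pi(a|s) w.r.t. theta[s]
    theta[s] += alpha * delta * grad_log

    s = s_next

print("greedy policy:", theta.argmax(axis=1))
```

The step-size choice is the essential point: with alpha = 1/t and beta = 1/t^0.6, both schedules satisfy the usual Robbins-Monro conditions while alpha/beta vanishes, which is the standing assumption behind two-time-scale convergence arguments.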
Pages: 94-123
Page count: 30