Natural actor-critic algorithms

Cited by: 402
Authors
Bhatnagar, Shalabh [1 ]
Sutton, Richard S. [2 ]
Ghavamzadeh, Mohammad [3 ]
Lee, Mark [2 ]
Affiliations
[1] Indian Inst Sci, Dept Comp Sci & Automat, Bangalore 560012, Karnataka, India
[2] Univ Alberta, Dept Comp Sci, RLAI Lab, Edmonton, AB T6G 2E8, Canada
[3] INRIA Lille Nord Europe, Team SequeL, Lille, France
Keywords
Actor-critic reinforcement learning algorithms; Policy-gradient methods; Approximate dynamic programming; Function approximation; Two-timescale stochastic approximation; Temporal difference learning; Natural gradient; STOCHASTIC-APPROXIMATION; LEARNING ALGORITHMS; CONVERGENCE;
DOI
10.1016/j.automatica.2009.07.008
Chinese Library Classification: TP [Automation Technology, Computer Technology]
Discipline code: 080201 [Mechanical Manufacturing and Automation]
Abstract
We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. (C) 2009 Elsevier Ltd. All rights reserved.
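The abstract describes the basic actor-critic structure: a critic estimates values by temporal difference learning, and an actor adjusts policy parameters by stochastic gradient ascent using the TD error as an advantage estimate, with the critic on a faster timescale. The following is a minimal, illustrative tabular sketch of that structure on a toy two-state MDP; the MDP, step sizes, and all names are assumptions for illustration, and it uses the vanilla rather than the natural gradient, so it is not one of the paper's four algorithms.

```python
import math
import random

random.seed(0)

# Toy two-state MDP (illustrative, not from the paper):
# action 0 keeps the current state, action 1 switches it.
# Reward +1 only for taking action 1 in state 0.
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.9

def step(s, a):
    s2 = s if a == 0 else 1 - s
    r = 1.0 if (s == 0 and a == 1) else 0.0
    return s2, r

def softmax_policy(theta, s):
    # Boltzmann policy over action preferences theta[s].
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    u, c = random.random(), 0.0
    for a, p in enumerate(probs):
        c += p
        if u < c:
            return a
    return len(probs) - 1

V = [0.0] * N_STATES                              # critic: tabular state values
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor: softmax preferences
alpha_v, alpha_th = 0.1, 0.01  # critic steps faster than the actor (two timescales)

s = 0
for _ in range(20000):
    probs = softmax_policy(theta, s)
    a = sample(probs)
    s2, r = step(s, a)
    delta = r + GAMMA * V[s2] - V[s]  # TD error, used as the advantage estimate
    V[s] += alpha_v * delta           # critic: TD(0) update
    for b in range(N_ACTIONS):        # actor: policy-gradient step via the score function
        grad = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_th * delta * grad
    s = s2
```

After training, the learned policy should strongly prefer the rewarding switch action in state 0. A natural-gradient variant, as in the paper, would additionally precondition the actor update with an estimate of the inverse Fisher information matrix of the policy.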
Pages: 2471-2482
Page count: 12
References (66 in total)
[1]
Reinforcement learning based algorithms for average cost Markov Decision Processes [J].
Abdulla, Mohammed Shahid ;
Bhatnagar, Shalabh .
DISCRETE EVENT DYNAMIC SYSTEMS-THEORY AND APPLICATIONS, 2007, 17 (01) :23-52
[2]
Learning algorithms for Markov decision processes with average cost [J].
Abounadi, J ;
Bertsekas, D ;
Borkar, VS .
SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2001, 40 (03) :681-698
[3]
ALEKSANDROV VM, 1968, ENG CYBERN, P11
[4]
Natural gradient works efficiently in learning [J].
Amari, S .
NEURAL COMPUTATION, 1998, 10 (02) :251-276
[5]
[Anonymous], 2009, Advances in Neural Information Processing Systems
[6]
[Anonymous], 1999, Nonlinear Programming
[7]
[Anonymous], 2007, Control Techniques for Complex Networks
[8]
[Anonymous], 2008, Proc. Advances in Neural Information Processing Systems (NIPS)
[9]
Bagnell J. A., 2003, INT JOINT C ART INT
[10]
Baird L. C., 1993, Tech. Rep. WL-TR-93-1146