Infinite-horizon policy-gradient estimation

Cited by: 394
Authors
Baxter, J
Bartlett, PL
Affiliations
[1] WhizBang Labs, Pittsburgh, PA 15213 USA
[2] BIOwulf Technol, Berkeley, CA 94704 USA
[3] Australian Natl Univ, Res Sch Informat Sci & Engn, Canberra, ACT, Australia
DOI
10.1613/jair.806
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
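The abstract's storage claim follows from the shape of the update: GPOMDP keeps only a discounted eligibility trace z of the policy's log-likelihood gradients and a running average Δ of reward times trace, each the size of the parameter vector. The sketch below is illustrative only; it assumes a Gym-style environment with reset()/step(), observations that can be used directly as feature vectors, and a softmax (Gibbs) policy over discrete actions, none of which is prescribed by the paper.

```python
import numpy as np

def gpomdp_gradient(env, theta, beta=0.9, T=100_000, rng=None):
    """Sketch of a GPOMDP-style estimate of the average-reward gradient.

    theta has shape (n_actions, n_features); the (assumed) policy is
    mu(u | theta, y) = softmax(theta @ phi(y))[u].
    Besides theta itself, only two arrays of the same size are stored:
    the eligibility trace z and the running estimate delta.
    """
    rng = rng or np.random.default_rng()
    z = np.zeros_like(theta)       # discounted eligibility trace
    delta = np.zeros_like(theta)   # running average of reward * trace

    obs = env.reset()
    for t in range(T):
        phi = np.asarray(obs, dtype=float)     # observation as feature vector
        logits = theta @ phi
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        u = rng.choice(len(probs), p=probs)

        # grad of log mu(u | theta, y) for the softmax parameterization:
        # (e_u - probs) outer phi
        grad_log = -np.outer(probs, phi)
        grad_log[u] += phi

        obs, reward, done, _ = env.step(u)

        # Core updates: z_{t+1} = beta * z_t + grad_log,
        # delta_{t+1} = delta_t + (reward * z_{t+1} - delta_t) / (t + 1)
        z = beta * z + grad_log
        delta += (reward * z - delta) / (t + 1)

        if done:
            # The infinite-horizon setting assumes one long trajectory;
            # resetting episodic environments here is a practical convenience.
            obs = env.reset()
    return delta
```

Larger β reduces the bias of the returned estimate but increases its variance, which is the trade-off the abstract refers to; the paper relates the appropriate choice of β to the mixing time of the controlled POMDP.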
Pages: 319-350
Page count: 32
References
49 entries in total
[1] Aberdeen D, 2001, POLICY GRADIENT LEAR
[2] Aleksandrov VM, 1968, ENG CYBERN, P11
[3] [Anonymous], 1989, REAL ANAL PROBABILIT
[4] Baird LC, 1999, ADV NEURAL INFORMATI, V11
[5] Bartlett PL, 1999, HEBBIAN SYNAPTIC MOD
[6] Bartlett PL, 2000, J COMPUTER SYSTEMS S, V62
[7] Barto AG, Sutton RS, Anderson CW. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13(5): 834-846.
[8] Baxter J, Tridgell A, Weaver L. Learning to play chess using temporal differences. Machine Learning, 2000, 40(3): 243-263.
[9] Baxter J, 2001, IN PRESS J ARTIFICIA
[10] Bertsekas DP, 1996, Neuro-Dynamic Programming, 1st ed.