Least-squares policy iteration

Cited by: 180
Authors
Lagoudakis, MG [1]
Parr, R [1]
Affiliation
[1] Duke Univ, Dept Comp Sci, Durham, NC 27708 USA
Keywords
reinforcement learning; Markov decision processes; approximate policy iteration; value-function approximation; least-squares methods
DOI
10.1162/1532443041827907
Chinese Library Classification (CLC)
TP [automation technology, computer technology]
Subject classification
0812
Abstract
We propose a new approach to reinforcement learning for control problems which combines value-function approximation using linear architectures with approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems, mainly because LSTD learns the state value function of a fixed policy, which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function, which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method that can efficiently use (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value-function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants are rarely able to balance for more than a small fraction of the time needed to reach the target location.
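For concreteness, below is a minimal Python sketch of the LSTD-Q evaluation step and the LSPI loop the abstract describes. It assumes a batch of samples of the form (s, a, r, s'), a user-supplied linear feature map phi(s, a) returning a NumPy vector, and a finite action set; the small ridge term and all names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma, n_features):
    """One LSTD-Q evaluation step: fit weights w so that Q(s, a) ~ phi(s, a) . w
    for the given policy, from a batch of (s, a, r, s') samples."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        phi_sa = phi(s, a)
        # For terminal transitions, phi_next would be a zero vector instead.
        phi_next = phi(s_next, policy(s_next))
        A += np.outer(phi_sa, phi_sa - gamma * phi_next)
        b += r * phi_sa
    # Small ridge term (an assumption here) keeps the solve well defined
    # when the sample set is small or the features are redundant.
    return np.linalg.solve(A + 1e-6 * np.eye(n_features), b)

def lspi(samples, phi, actions, gamma, n_features, n_iter=20, tol=1e-4):
    """Approximate policy iteration: alternate LSTD-Q evaluation of the
    current greedy policy with greedy improvement, reusing the same batch
    of samples at every iteration, until the weights stop changing."""
    w = np.zeros(n_features)
    # Greedy policy w.r.t. the current weights (closure over w).
    greedy = lambda s: max(actions, key=lambda a: float(phi(s, a) @ w))
    for _ in range(n_iter):
        w_new = lstdq(samples, phi, greedy, gamma, n_features)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w, greedy
```

In this sketch, the same fixed batch of samples is passed to every call of lstdq, which mirrors the paper's point that LSPI can reuse arbitrarily collected, off-policy experience at each iteration.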
Pages: 1107-1149
Page count: 43