Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Cited by: 202
Authors
Antos, Andras [1 ]
Szepesvari, Csaba [1 ]
Munos, Remi [2 ]
Affiliations
[1] Hungarian Acad Sci, Comp & Automat Res Inst, H-1111 Budapest, Hungary
[2] INRIA Lille, Inst Natl Rech Informat & Automat, F-59650 Villeneuve Dascq, France
Funding
Hungarian Scientific Research Fund;
Keywords
reinforcement learning; policy iteration; Bellman-residual minimization; least-squares temporal difference learning; off-policy learning; nonparametric regression; least-squares regression; finite-sample bounds;
DOI
10.1007/s10994-007-5038-2
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used, the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
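As a rough sketch of the kind of per-iteration objective the abstract describes (a simplified form with illustrative notation; the paper's actual risk modifies the plain empirical Bellman residual to remove its bias when only a single sample path is available), each policy-iteration step fits an action-value function by minimizing the squared Bellman residual over a function class and then acts greedily with respect to it:
\[
\hat{Q}_{k+1} \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{t=1}^{N}
\Bigl( f(X_t, A_t) - R_t - \gamma\, f\bigl(X_{t+1}, \hat{\pi}_k(X_{t+1})\bigr) \Bigr)^2,
\qquad
\hat{\pi}_{k+1}(x) \in \arg\max_{a} \hat{Q}_{k+1}(x, a),
\]
where $(X_t, A_t, R_t, X_{t+1})$ are consecutive transitions from the single trajectory, $\gamma$ is the discount factor, $\mathcal{F}$ is the chosen function class, and $\hat{\pi}_k$ is the policy evaluated at iteration $k$; the symbols here are illustrative rather than the paper's own notation.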
Pages: 89-129
Number of pages: 41