Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Cited by: 202
Authors
Antos, Andras [1 ]
Szepesvari, Csaba [1 ]
Munos, Remi [2 ]
Affiliations
[1] Hungarian Acad Sci, Comp & Automat Res Inst, H-1111 Budapest, Hungary
[2] INRIA Lille, Inst Natl Rech Informat & Automat, F-59650 Villeneuve Dascq, France
Funding
Hungarian Scientific Research Fund;
Keywords
reinforcement learning; policy iteration; Bellman-residual minimization; least-squares temporal difference learning; off-policy learning; nonparametric regression; least-squares regression; finite-sample bounds;
DOI
10.1007/s10994-007-5038-2
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used, the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
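As a rough sketch of the kind of per-iteration objective the abstract describes (a simplified form with illustrative notation; the paper's actual risk modifies the plain empirical Bellman residual to remove its bias when only a single sample path is available), each policy-iteration step fits an action-value function by minimizing the squared Bellman residual over a function class and then acts greedily with respect to it:
\[
\hat{Q}_{k+1} \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{t=1}^{N}
\Bigl( f(X_t, A_t) - R_t - \gamma\, f\bigl(X_{t+1}, \hat{\pi}_k(X_{t+1})\bigr) \Bigr)^2,
\qquad
\hat{\pi}_{k+1}(x) \in \arg\max_{a} \hat{Q}_{k+1}(x, a),
\]
where $(X_t, A_t, R_t, X_{t+1})$ are consecutive transitions from the single trajectory, $\gamma$ is the discount factor, $\mathcal{F}$ is the chosen function class, and $\hat{\pi}_k$ is the policy evaluated at iteration $k$; the symbols here are illustrative rather than the paper's own notation.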
Pages: 89-129
Number of pages: 41