Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Cited by: 10
Authors
Antos, Andras
Szepesvari, Csaba
Munos, Remi
Institutions
[1] Hungarian Acad Sci, Comp & Automat Res Inst, H-1111 Budapest, Hungary
[2] Ecole Polytech, Ctr Math Appl, F-91128 Palaiseau, France
Source
LEARNING THEORY, PROCEEDINGS | 2006, Vol. 4005
DOI
10.1007/11776420_42
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration, where in successive iterations the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
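The sketch below is a minimal, illustrative rendering of the kind of procedure the abstract describes: fitted policy iteration from a single pre-collected trajectory, where each policy-evaluation step fits a linear Q-function by minimizing an empirical Bellman-residual loss. It is not the authors' exact algorithm; in particular, the paper minimizes a modified Bellman-residual error to correct the bias of the plain single-sample residual used here, and all names (brm_fitted_policy_iteration, phi, n_actions, gamma, learning-rate and iteration counts) are assumptions introduced for illustration.

```python
# Minimal sketch (assumed, not the paper's exact method): fitted policy
# iteration from one behaviour-policy trajectory, with linear Q-functions
# fit by gradient descent on a (naive) empirical Bellman-residual loss.
import numpy as np

def brm_fitted_policy_iteration(trajectory, phi, n_actions, gamma=0.95,
                                n_iterations=10, lr=0.1, n_epochs=200):
    """trajectory: list of (s, a, r, s_next) tuples from a single run of
    some behaviour policy.  phi(s, a): feature vector (np.ndarray) of the
    linear Q-function class."""
    d = phi(trajectory[0][0], trajectory[0][1]).shape[0]
    w = np.zeros(d)                      # weights of the current Q-function

    def q(s, a, weights):
        return phi(s, a) @ weights

    def greedy(s, weights):
        return max(range(n_actions), key=lambda a: q(s, a, weights))

    for _ in range(n_iterations):        # outer policy-iteration loop
        policy_w = w.copy()              # freeze the policy to be evaluated
        w = np.zeros(d)                  # re-fit Q for that frozen policy
        for _ in range(n_epochs):
            grad = np.zeros(d)
            for s, a, r, s_next in trajectory:
                a_next = greedy(s_next, policy_w)
                # single-sample Bellman residual for the frozen greedy policy
                residual = q(s, a, w) - (r + gamma * q(s_next, a_next, w))
                grad += residual * (phi(s, a) - gamma * phi(s_next, a_next))
            w -= lr * grad / len(trajectory)

    return lambda s: greedy(s, w)        # greedy policy w.r.t. the final Q
```

Freezing the policy while the next Q-function is fit mirrors the policy-iteration structure named in the abstract: evaluation of the current greedy policy by residual minimization, followed by a greedy improvement step.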
Pages: 574-588
Number of pages: 15