Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Cited by: 2
Authors
Satinder Singh
Tommi Jaakkola
Michael L. Littman
Csaba Szepesvári
Affiliations
[1] AT&T Labs-Research, Department of Computer Science
[2] Massachusetts Institute of Technology, Department of Computer Science
[3] Duke University
[4] Mindmaker Ltd.
Source
Machine Learning | 2000, Vol. 38
Keywords
reinforcement-learning; on-policy; convergence; Markov decision processes
DOI
Not available
Abstract
An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
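The single-step on-policy algorithm studied here is tabular Sarsa(0), and the decaying-exploration results rely on schedules that become greedy in the limit while still trying every action infinitely often (GLIE), together with standard step-size conditions. The sketch below is only illustrative: the toy MDP, the 1/n decay schedules, and all identifiers are assumptions chosen for the example, not details taken from the paper.

```python
import random

# Toy deterministic two-state, two-action MDP (an illustrative assumption,
# not an example from the paper). transitions[s][a] = (next_state, reward).
transitions = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 1.0), 1: (1, 0.1)},
}
GAMMA = 0.9  # discount factor

def epsilon_greedy(q, s, eps):
    """Pick a random action with probability eps, otherwise a greedy action."""
    actions = list(q[s].keys())
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q[s][a])

def sarsa0(steps=200_000):
    """Tabular Sarsa(0) with a decaying (GLIE-style) epsilon-greedy policy.

    Exploration decays with the number of visits to each state, and the step
    size decays with the number of updates to each state-action pair.
    """
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    n_sa = {(s, a): 0 for s in transitions for a in transitions[s]}  # update counts
    n_s = {s: 0 for s in transitions}                                # visit counts
    s = 0
    n_s[s] += 1
    a = epsilon_greedy(q, s, 1.0)
    for _ in range(steps):
        s2, r = transitions[s][a]
        n_s[s2] += 1
        eps2 = 1.0 / n_s[s2]              # decaying exploration at the next state
        a2 = epsilon_greedy(q, s2, eps2)
        n_sa[(s, a)] += 1
        alpha = 1.0 / n_sa[(s, a)]        # decaying step size for this pair
        # On-policy single-step update: bootstrap on the action actually taken next.
        q[s][a] += alpha * (r + GAMMA * q[s2][a2] - q[s][a])
        s, a = s2, a2
    return q

if __name__ == "__main__":
    q = sarsa0()
    for s in sorted(q):
        print(s, {a: round(v, 3) for a, v in q[s].items()})
```

Because the update bootstraps on the action the exploring policy actually takes next (rather than the greedy action, as in Q-learning), the exploration schedule itself determines what the values converge to, which is why decaying versus persistent exploration is treated separately in the paper.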
Pages: 287-308 (21 pages)