Exploration bonuses and dual control

Cited by: 46
Authors
Dayan, P
Sejnowski, TJ
Affiliations
[1] Salk Institute, Howard Hughes Medical Institute, San Diego, CA 92186, USA
[2] University of California San Diego, Department of Biology, La Jolla, CA 92093, USA
Keywords
reinforcement learning; dynamic programming; exploration bonuses; certainty equivalence; nonstationary environment
DOI
10.1023/A:1018357105171
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Finding the Bayesian balance between exploration and exploitation in adaptive optimal control is in general intractable. This paper shows how to compute suboptimal estimates based on a certainty equivalence approximation (Cozzolino, Gonzalez-Zubieta & Miller, 1965) arising from a form of dual control. This systematizes and extends existing uses of exploration bonuses in reinforcement learning (Sutton, 1990). The approach has two components: a statistical model of uncertainty in the world and a way of turning this into exploratory behavior. This general approach is applied to two-dimensional mazes with moveable barriers and its performance is compared with Sutton's DYNA system.
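The abstract only outlines the approach, so the sketch below illustrates the kind of exploration bonus (Sutton, 1990) that the paper systematizes and extends: during model-based planning, a bonus proportional to the square root of the time since a state-action pair was last tried is added to the modelled reward, encouraging revisits to neglected parts of a nonstationary environment. This is a minimal, hypothetical illustration, not the authors' certainty-equivalence computation; the class name, hyperparameters (alpha, gamma, kappa), and the toy chain environment are assumptions introduced here.

```python
import numpy as np

class DynaQPlus:
    """Tabular Dyna-style agent with a time-since-last-visit exploration bonus."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 kappa=1e-3, planning_steps=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.Q = np.zeros((n_states, n_actions))
        self.model = {}                                  # (s, a) -> (r, s')
        self.last_tried = np.zeros((n_states, n_actions))
        self.t = 0
        self.alpha, self.gamma, self.kappa = alpha, gamma, kappa
        self.planning_steps = planning_steps

    def act(self, s, eps=0.1):
        # Epsilon-greedy action selection on the current value estimates.
        if self.rng.random() < eps:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s2):
        self.t += 1
        self.last_tried[s, a] = self.t
        # Direct reinforcement-learning update from the real transition.
        self.Q[s, a] += self.alpha * (r + self.gamma * self.Q[s2].max() - self.Q[s, a])
        self.model[(s, a)] = (r, s2)
        # Planning: replay stored transitions, adding an exploration bonus
        # that grows with the time since the pair was last tried.
        keys = list(self.model.keys())
        for _ in range(self.planning_steps):
            ps, pa = keys[self.rng.integers(len(keys))]
            pr, ps2 = self.model[(ps, pa)]
            tau = self.t - self.last_tried[ps, pa]
            bonus = self.kappa * np.sqrt(tau)
            self.Q[ps, pa] += self.alpha * (pr + bonus +
                                            self.gamma * self.Q[ps2].max() - self.Q[ps, pa])

# Toy usage on a 5-state chain (hypothetical environment, for illustration only).
agent = DynaQPlus(n_states=5, n_actions=2)
s = 0
for step in range(200):
    a = agent.act(s)
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)   # action 1 moves right, 0 moves left
    r = 1.0 if s2 == 4 else 0.0                        # reward only at the goal state
    agent.update(s, a, r, s2)
    s = 0 if s2 == 4 else s2                           # reset after reaching the goal
```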
Pages: 5-22
Number of pages: 18
Cited References
28 items in total
[1] [Anonymous]. Technical Report FKI-149-91, TU München.
[2] [Anonymous]. Dynamic Programming. 1965.
[3] Barto AG, Bradtke SJ, Singh SP. Learning to act using real-time dynamic programming. Artificial Intelligence, 1995, 72(1-2): 81-138.
[4] Bertsekas DP. Stochastic Optimal Control. 1978.
[5] Cohn DA. Advances in Neural Information Processing Systems, 1994, p. 679.
[6] Cozzolino JM. Technical Report 11, MIT Operations Research Center, 1965.
[7] Dersin PL, Athans M, Kendrick DA. Some properties of the dual adaptive stochastic-control algorithm. IEEE Transactions on Automatic Control, 1981, 26(5): 1001-1008.
[8] Fedorov V. Theory of Optimal Experiments. 1972.
[9] Feldbaum A. Optimal Control Systems. 1965.
[10] Howard RA. Dynamic Programming and Markov Processes. 1960.