Control of exploitation-exploration meta-parameter in reinforcement learning

Cited by: 143
Authors
Ishii, S
Yoshida, W
Yoshimoto, J
Affiliations
[1] Nara Inst Sci & Technol, Nara, Japan
[2] Japan Sci & Technol Corp, CREST, Tokyo, Japan
Keywords
reinforcement learning; exploitation-exploration problem; neuromodulator; attention; partially observable Markov decision process
DOI
10.1016/S0893-6080(02)00056-4
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In reinforcement learning (RL), the duality between exploitation and exploration has long been an important issue. This paper presents a new method that controls the balance between exploitation and exploration. Our learning scheme is based on model-based RL, in which Bayesian inference with a forgetting effect estimates the state-transition probability of the environment. The balance parameter, which corresponds to the randomness in action selection, is controlled based on variation of action results and perception of environmental change. When applied to maze tasks, our method successfully obtains good controls by adapting to environmental changes. Recently, Usher et al. [Science 283 (1999) 549] have suggested that noradrenergic neurons in the locus coeruleus may control the exploitation-exploration balance in a real brain and that the balance may correspond to the level of the animal's selective attention. According to this scenario, we also discuss a possible implementation in the brain. © 2002 Elsevier Science Ltd. All rights reserved.
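As an illustration of the two ingredients named in the abstract, the following is a minimal sketch and not the paper's actual algorithm. It assumes the balance parameter acts as an inverse temperature `beta` in Boltzmann (softmax) action selection, and that "forgetting" can be approximated by decaying Dirichlet-style pseudo-counts before each update. All function names, the `forgetting` factor, and the `prior` value are hypothetical choices made for this example.

```python
import math


def softmax_policy(q_values, beta):
    """Boltzmann action probabilities.

    beta is an assumed inverse temperature: beta = 0 gives uniform
    (fully exploratory) selection, large beta approaches greedy selection.
    """
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def update_transition_counts(counts, next_state, forgetting=0.95):
    """Decay old pseudo-counts, then add the newly observed transition.

    The multiplicative decay is an assumed stand-in for a forgetting
    effect: older observations lose weight, so the estimate can track
    a changing environment.
    """
    decayed = [forgetting * c for c in counts]
    decayed[next_state] += 1.0
    return decayed


def transition_probabilities(counts, prior=1.0):
    """Posterior-mean transition probabilities under a symmetric prior."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]
```

With `beta = 0` every action is equally likely, while a larger `beta` concentrates probability on the highest-valued action; a smaller `forgetting` factor makes the transition estimate respond faster to environmental change at the cost of noisier estimates.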
Pages: 665-687
Number of pages: 23