NONPARAMETRIC BANDIT METHODS

被引:23
作者
Yakowitz, Sid [1 ]
Lowe, Wing [1 ]
机构
[1] Univ Arizona, Dept Syst & Ind Engn, Tucson, AZ 85721 USA
关键词
Sequential design; bandit; nonparametric; artificial intelligence;
D O I
10.1007/BF02055587
中图分类号
C93 [管理学]; O22 [运筹学];
学科分类号
070105 ; 12 ; 1201 ; 1202 ; 120202 ;
摘要
Bandits are a finite collection of random variables. Bandit problems are Markov decision problems in which, at each decision time, the decision maker selects a random variable (referred to as a bandit "arm") and observes an outcome. The selection is based on the observation history. The objective is to sequentially choose arms so as to minimize growth (with decision time) rate of the number of suboptimal selections. The appellation "bandit" refers to mechanical gambling machines, and the tradition stems from the question of allocating competing treatments to a sequence of patients having the same disease. Our motivation is "machine learning" in which a game-playing or assembly-line adjusting computer is faced with a sequence of statistically-similar decision problems and, as resource, has access to an expanding data base relevant to these problems. The setting for the present study is nonparametric and infinite horizon. The central aim is to relate a methodology which postulates finite moments or, alternatively, bounded bandit arms. Under these circumstances, strategies proposed are shown to be asymptotically optimal and converge at guaranteed rates. In the bounded-arm case, the rate is optimal. We extend the theory to the case in which the bandit population is infinite, and share some computational experience.
引用
收藏
页码:297 / 312
页数:16
相关论文
共 19 条
[1]   RANDOMIZED ALLOCATION OF TREATMENTS IN SEQUENTIAL TRIALS [J].
BATHER, J .
ADVANCES IN APPLIED PROBABILITY, 1980, 12 (01) :174-182
[2]  
Bellman RE, 1961, ADAPTIVE CONTROL PRO
[3]  
Berry D, 1985, BANDIT PROBLEMS SEQU
[4]  
Chernoff H., SEQUENTIAL ANAL OPTI, P1, DOI DOI 10.1137/1.9781611970593.CH1
[5]   SOME ONE-SIDED THEOREMS ON TAIL DISTRIBUTION OF SAMPLE SUMS WITH APPLICATIONS TO LAST TIME AND LARGEST EXCESS OF BOUNDARY CROSSINGS [J].
CHOW, YS ;
LAI, TL .
TRANSACTIONS OF THE AMERICAN MATHEMATICAL SOCIETY, 1975, 208 (JUL) :51-72
[6]   BAYESIAN NONPARAMETRIC BANDITS [J].
CLAYTON, MK ;
BERRY, DA .
ANNALS OF STATISTICS, 1985, 13 (04) :1523-1534
[7]   UNIFORM CONVERGENCE OF NEAREST NEIGHBOR REGRESSION FUNCTION ESTIMATORS AND THEIR APPLICATION IN OPTIMIZATION [J].
DEVROYE, LP .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1978, 24 (02) :142-151
[8]   PROBABILITY INEQUALITIES FOR SUMS OF INDEPENDENT RANDOM-VARIABLES [J].
FUK, DK ;
NAGAEV, SV .
THEORY OF PROBILITY AND ITS APPLICATIONS,USSR, 1971, 16 (04) :643-660
[9]  
Hernandez-Lerma O., 1989, ADAPTIVE MARKOV CONT
[10]  
Holland J. H., 1973, SIAM Journal on Computing, V2, P88, DOI 10.1137/0202009