SAMPLE-MEAN BASED INDEX POLICIES WITH O(LOG-N) REGRET FOR THE MULTIARMED BANDIT PROBLEM

被引:303
作者
AGRAWAL, R
机构
关键词
UPPER CONFIDENCE BOUNDS; ASYMPTOTICALLY EFFICIENT; LARGE DEVIATIONS; STOCHASTIC ADAPTIVE CONTROL;
D O I
10.2307/1427934
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins had obtained a O(log n) lower bound on the regret with a constant that depends on the Kullback-Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieved the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are also applicable much more generally. ?hey achieve a O(log n) regret with a constant that is also based on the Kullback-Leibler number. This constant turns out to be optimal for one-parameter exponential families; however, in general it is derived from the optimal one via a 'contraction' principle. Our results rely entirely on a few key lemmas from the theory of large deviations.
引用
收藏
页码:1054 / 1078
页数:25
相关论文
共 15 条
[1]   CERTAINTY EQUIVALENCE CONTROL WITH FORCING - REVISITED [J].
AGRAWAL, R ;
TENEKETZIS, D .
SYSTEMS & CONTROL LETTERS, 1989, 13 (05) :405-412
[2]   ASYMPTOTICALLY EFFICIENT ADAPTIVE ALLOCATION RULES FOR THE MULTIARMED BANDIT PROBLEM WITH SWITCHING COST [J].
AGRAWAL, R ;
HEGDE, MV ;
TENEKETZIS, D .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1988, 33 (10) :899-905
[3]   ASYMPTOTICALLY EFFICIENT ADAPTIVE ALLOCATION SCHEMES FOR CONTROLLED MARKOV-CHAINS - FINITE PARAMETER SPACE [J].
AGRAWAL, R ;
TENEKETZIS, D ;
ANANTHARAM, V .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1989, 34 (12) :1249-1259
[4]  
Agrawal R., 1990, Stochastics and Stochastics Reports, V29, P437, DOI 10.1080/17442509008833627
[6]  
AGRAWAL R, 1989, IEEE T AUTOMAT CONTR, P258
[7]   ASYMPTOTICALLY EFFICIENT ALLOCATION RULES FOR THE MULTIARMED BANDIT PROBLEM WITH MULTIPLE PLAYS .1. IID REWARDS [J].
ANANTHARAM, V ;
VARAIYA, P ;
WALRAND, J .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1987, 32 (11) :968-976
[8]  
ANANTHARAM V, 1987, IEEE T AUTOMAT CONTR, V32, P975
[9]  
Billingsley P., 1985, PROBABILITY MEASURE
[10]  
BROWN LD, 1986, FUNDAMENTALS STATIST