On the generalization ability of on-line learning algorithms

Cited by: 259
Authors
Cesa-Bianchi, N [1]
Conconi, A
Gentile, C
Affiliations
[1] Univ Milan, DSI, I-20135 Milan, Italy
[2] Univ Insubria, Dipartimento Informat & Commun, I-21100 Varese, Italy
Keywords
kernel functions; on-line learning; pattern recognition; Perceptron algorithm; statistical learning theory;
DOI
10.1109/TIT.2004.833339
CLC number
TP [automation technology; computer technology];
Discipline code
0812;
Abstract
In this paper, it is shown how to extract a hypothesis with small risk from the ensemble of hypotheses generated by an arbitrary on-line learning algorithm run on an independent and identically distributed (i.i.d.) sample of data. Using a simple large deviation argument, we prove tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic M_n associated with the on-line performance of the ensemble. Via sharp pointwise bounds on M_n, we then obtain risk tail bounds for kernel Perceptron algorithms in terms of the spectrum of the empirical kernel matrix. These bounds reveal that the linear hypotheses found via our approach achieve optimal tradeoffs between hinge loss and margin size over the class of all linear functions, an issue that was left open by previous results. A distinctive feature of our approach is that the key tools for our analysis come from the model of prediction of individual sequences; i.e., a model making no probabilistic assumptions on the source generating the data. In fact, these tools turn out to be so powerful that we only need very elementary statistical facts to obtain our final risk bounds.
Pages: 2050-2057
Page count: 8