Parameter convergence and learning curves for neural networks

Cited by: 33
Authors
Fine, TL [1]
Mukherjee, S [2]
Affiliations
[1] Cornell Univ, Sch Elect Engn, Ithaca, NY 14853 USA
[2] Lucent Technol, Bell Labs, Holmdel, NJ 07733 USA
DOI
10.1162/089976699300016647
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
We revisit the oft-studied asymptotic (in sample size) behavior of the parameter or weight estimate returned by any member of a large family of neural network training algorithms. By properly accounting for the characteristic property of neural networks that their empirical and generalization errors possess multiple minima, we rigorously establish conditions under which the parameter estimate converges strongly into the set of minima of the generalization error; convergence of the estimate to a particular value cannot be guaranteed under our assumptions. We then evaluate the asymptotic distribution of the distance between the parameter estimate and its nearest neighbor among the set of minima of the generalization error. Results on this question have appeared numerous times and generally assert asymptotic normality, the conclusion expected from familiar statistical arguments about maximum likelihood estimators. These conclusions are usually reached through somewhat informal calculations, although we shall see that the situation is in fact delicate. The preceding results then yield a derivation of learning curves for the generalization and empirical errors, leading to bounds on rates of convergence.
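A minimal formalization of the abstract's main claim may help fix notation; the symbols below (empirical error \nu_n, generalization error \mathcal{E}_g, minimizing set W^*) are assumed here for illustration and are not drawn verbatim from the paper.

\[
  \nu_n(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i, f(x_i; w)\bigr),
  \qquad
  \mathcal{E}_g(w) \;=\; \mathbb{E}\,\ell\bigl(y, f(x; w)\bigr),
  \qquad
  W^* \;=\; \operatorname*{arg\,min}_{w} \mathcal{E}_g(w).
\]

Because W^* is generally not a single point for neural networks, the convergence statement is set-valued: the estimate \hat{w}_n enters the set of minima,

\[
  d\bigl(\hat{w}_n, W^*\bigr)
  \;=\; \inf_{w^* \in W^*} \bigl\lVert \hat{w}_n - w^* \bigr\rVert
  \;\xrightarrow[n\to\infty]{\text{a.s.}}\; 0,
\]

while \hat{w}_n itself need not converge to any particular w^*. The learning curves discussed in the abstract then track \mathcal{E}_g(\hat{w}_n) - \min_w \mathcal{E}_g(w) and \nu_n(\hat{w}_n) as functions of the sample size n.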
Pages: 747-769
Page count: 23