The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network

Cited by: 788
Authors
Bartlett, PL [1]
Affiliation
[1] Australian Natl Univ, Res Sch Informat Sci & Engn, Dept Syst Engn, Canberra, ACT 0200, Australia
Keywords
computational learning theory; neural networks; pattern recognition; scale-sensitive dimensions; weight decay;
DOI
10.1109/18.661502
Chinese Library Classification (CLC)
TP [automation technology; computer technology];
Discipline classification code
0812;
Abstract
Sample complexity results from computational learning theory, when applied to neural network learning for pattern classification problems, suggest that for good generalization performance the number of training examples should grow at least linearly with the number of adjustable parameters in the network. Results in this paper show that if a large neural network is used for a pattern classification problem and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. For example, consider a two-layer feedforward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by A and the input dimension is n. We show that the misclassification probability is no more than a certain error estimate (that is related to squared error on the training set) plus A³√((log n)/m) (ignoring log A and log m factors), where m is the number of training patterns. This may explain the generalization performance of neural networks, particularly when the number of training examples is considerably smaller than the number of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training. The proof techniques appear to be useful for the analysis of other pattern classifiers: when the input domain is a totally bounded metric space, we use the same approach to give upper bounds on misclassification probability for classifiers with decision boundaries that are far from the training examples.
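The scaling described in the abstract can be illustrated numerically. The Python sketch below is not from the paper; the constants, the ignored log A and log m factors, and the chosen values of A, n, and m are assumptions for illustration only. It evaluates the weight-dependent term A³√((log n)/m) and shows how it shrinks as the number of training patterns m grows, without ever referencing the number of weights in the network.

```python
import numpy as np

# Minimal sketch (not the paper's exact theorem): the complexity term from the
# abstract, A^3 * sqrt(log(n) / m), where A bounds the sum of weight magnitudes
# per unit, n is the input dimension, and m is the number of training patterns.
# Multiplicative constants and the log A, log m factors are ignored, so the
# printed values only indicate scaling behaviour.

def complexity_term(A, n, m):
    """Weight-size-dependent term in the generalization bound (up to constants)."""
    return A**3 * np.sqrt(np.log(n) / m)

if __name__ == "__main__":
    n = 100  # input dimension (hypothetical value)
    for A in (1.0, 2.0, 4.0):              # per-unit bound on weight magnitudes
        for m in (1_000, 10_000, 100_000):  # number of training patterns
            print(f"A={A:4.1f}  m={m:7d}  term={complexity_term(A, n, m):.4f}")
```

Note that the number of hidden units never enters the computation; only the weight bound A, the input dimension n, and the sample size m matter, which is the sense in which the size of the weights is more important than the size of the network.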
Pages: 525-536
Number of pages: 12