Deterministic annealing for clustering, compression, classification, regression, and related optimization problems

Cited by: 523
Author
Rose, K [1 ]
Affiliation
[1] Univ Calif Santa Barbara, Dept Elect & Comp Engn, Santa Barbara, CA 93106 USA
Funding
US National Science Foundation
Keywords
classification; clustering; compression; deterministic annealing; maximum entropy; optimization methods; regression; vector quantization;
DOI
10.1109/5.726788
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Classification Code
0808; 0809
Abstract
The deterministic annealing approach to clustering and its extensions has demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression. The method offers three important features: 1) the ability to avoid many poor local optima; 2) applicability to many different structures/architectures; and 3) the ability to minimize the right cost function even when its gradients vanish almost everywhere, as in the case of the empirical classification error. It is derived within a probabilistic framework from basic information-theoretic principles (e.g., maximum entropy and random coding). The application-specific cost is minimized subject to a constraint on the randomness (Shannon entropy) of the solution, which is gradually lowered. We emphasize intuition gained from analogy to statistical physics, where this is an annealing process that avoids many shallow local minima of the specified cost and, at the limit of zero "temperature," produces a nonrandom (hard) solution. Alternatively, the method is derived within rate-distortion theory, where the annealing process is equivalent to computation of Shannon's rate-distortion function, and the annealing temperature is inversely proportional to the slope of the curve. This provides new insights into the method and its performance, as well as new insights into rate-distortion theory itself. The basic algorithm is extended by incorporating structural constraints to allow optimization of numerous popular structures including vector quantizers, decision trees, multilayer perceptrons, radial basis functions, and mixtures of experts. Experimental results show considerable performance gains over standard structure-specific and application-specific training methods. The paper concludes with a brief discussion of extensions of the method that are currently under investigation.
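To make the annealing procedure described in the abstract concrete, the following is a minimal Python sketch of deterministic-annealing clustering under simplifying assumptions: squared-error distortion and a uniform-prior variant rather than the paper's full mass-constrained algorithm; the function name deterministic_annealing and the parameters (n_clusters, T_min, cooling, n_sweeps) are illustrative choices, not the paper's notation. At each temperature T, association probabilities follow a Gibbs distribution over distances, codevectors are re-estimated as probability-weighted centroids, and T is lowered geometrically toward the hard, zero-temperature limit.

import numpy as np

def deterministic_annealing(X, n_clusters=4, T_min=1e-3, cooling=0.9,
                            n_sweeps=50, seed=0):
    # Sketch of deterministic-annealing clustering (uniform-prior variant;
    # illustrative, not the paper's exact mass-constrained algorithm).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Start all codevectors near the data mean, slightly perturbed so they
    # can separate ("split") as the temperature drops.
    Y = X.mean(axis=0) + 1e-3 * rng.standard_normal((n_clusters, d))
    # Rough starting temperature; per the paper, the first critical
    # temperature is twice the largest eigenvalue of the data covariance.
    T = 2.0 * X.var()
    P = np.full((n, n_clusters), 1.0 / n_clusters)
    while T > T_min:
        for _ in range(n_sweeps):
            # Squared Euclidean distances, shape (n, n_clusters).
            D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            # Gibbs association probabilities p(j | x_i) at temperature T.
            logits = -D / T
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)
            # Centroid step: each codevector moves to its weighted mean
            # (small epsilon guards against an empty cluster at low T).
            Y = (P.T @ X) / (P.sum(axis=0)[:, None] + 1e-12)
        T *= cooling  # geometric annealing schedule
    return Y, P

As the temperature passes through critical values, initially coincident codevectors split apart in phase transitions; it is this gradual tracking of the free-energy minimum, rather than a greedy descent, that lets the method bypass many shallow local minima of the distortion.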
Pages: 2210-2239 (30 pages)