Variable selection for model-based clustering

Cited by: 317
Authors
Raftery, AE [1]
Dean, N [1]
Affiliations
[1] Univ Washington, Dept Stat, Seattle, WA 98195 USA
Funding
National Institutes of Health (USA);
Keywords
Bayes factor; BIC; feature selection; model-based clustering; unsupervised learning; variable selection;
DOI
10.1198/016214506000000113
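Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes
020208; 070103; 0714;
Abstract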
We consider the problem of variable or feature selection for model-based clustering. The problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples and found that removing irrelevant variables often improved performance. Compared with methods based on all of the variables, our variable selection method consistently yielded more accurate estimates of the number of groups and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
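
As an illustration of the idea in the abstract, the sketch below shows how a BIC difference can stand in for twice the log Bayes factor, and how a greedy search might use such differences to pick variables. It is a simplified, forward-only sketch and not the authors' algorithm: the function names are hypothetical, and `bic_difference` is an assumed user-supplied callable that returns the evidence for treating a candidate variable as relevant to the clustering, given the variables already selected.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' convention: 2*logL - k*log(n)."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


def approx_two_log_bayes_factor(bic_model_1, bic_model_2):
    """Approximate 2*log(B12) by BIC1 - BIC2; positive values favour model 1."""
    return bic_model_1 - bic_model_2


def greedy_forward_selection(candidates, bic_difference):
    """Greedily add the variable with the largest positive BIC difference.

    `bic_difference(selected, var)` is a hypothetical callable (an assumption
    of this sketch) scoring the evidence that `var` carries clustering
    information beyond the already selected variables.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {v: bic_difference(tuple(selected), v) for v in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] <= 0:
            break  # no remaining variable improves the model; local optimum
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice `bic_difference` would fit clustering models with and without the candidate variable and compare their BIC values; the search stops at a local optimum in model space, as the abstract notes.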
Pages: 168-178
Number of pages: 11