Variable selection for model-based clustering

Cited by: 317
Authors
Raftery, AE [1 ]
Dean, N [1 ]
Affiliations
[1] Univ Washington, Dept Stat, Seattle, WA 98195 USA
Funding
National Institutes of Health (USA);
Keywords
Bayes factor; BIC; feature selection; model-based clustering; unsupervised learning; variable selection;
DOI
10.1198/016214506000000113
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract
We consider the problem of variable or feature selection for model-based clustering. The problem of comparing two nested subsets of variables is recast as a model comparison problem and addressed using approximate Bayes factors. A greedy search algorithm is proposed for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples and found that removing irrelevant variables often improved performance. Compared with methods based on all of the variables, our variable selection method consistently yielded more accurate estimates of the number of groups and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
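The greedy search mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `greedy_select` and the callback `evidence_gain` are hypothetical, and the callback only stands in for an approximate log Bayes factor (for example, a BIC difference) in favor of a candidate variable carrying clustering information beyond the variables already selected. Fitting the clustering models and computing that quantity is left to the caller.

```python
from typing import Callable, FrozenSet, Iterable, List, Set


def greedy_select(
    variables: Iterable[str],
    evidence_gain: Callable[[FrozenSet[str], str], float],
    max_steps: int = 100,
) -> List[str]:
    """Greedy inclusion/exclusion search over variable subsets.

    evidence_gain(selected, v) is a hypothetical callback (an assumption of
    this sketch): it returns the evidence for "v is useful for clustering
    given `selected`". A positive value favors including v.
    """
    selected: Set[str] = set()
    remaining: Set[str] = set(variables)

    for _ in range(max_steps):  # hard cap guards against cycling
        changed = False

        # Inclusion step: add the candidate with the strongest evidence,
        # but only if that evidence is positive.
        if remaining:
            best = max(
                sorted(remaining),
                key=lambda v: evidence_gain(frozenset(selected), v),
            )
            if evidence_gain(frozenset(selected), best) > 0:
                selected.add(best)
                remaining.discard(best)
                changed = True

        # Exclusion step: drop the selected variable with the weakest
        # evidence given the other selected variables, if it is not positive.
        if selected:
            worst = min(
                sorted(selected),
                key=lambda v: evidence_gain(frozenset(selected - {v}), v),
            )
            if evidence_gain(frozenset(selected - {worst}), worst) <= 0:
                selected.discard(worst)
                remaining.add(worst)
                changed = True

        # Stop at a local optimum: nothing was added or removed.
        if not changed:
            break

    return sorted(selected)
```

In practice the callback would fit clustering models with and without the candidate variable and compare them. The search stops at a local optimum in model space, so the result can depend on the order in which variables are considered.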
Pages: 168-178
Page count: 11