Cluster validation by prediction strength

Cited by: 366
Authors
Tibshirani, R [1]
Walther, G
Affiliations
[1] Stanford Univ, Dept Hlth Res & Policy, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
Keywords
number of clusters; prediction; unsupervised learning
DOI
10.1198/106186005X59243
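Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract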
This article proposes a new quantity for assessing the number of groups or clusters in a dataset. The key idea is to view clustering as a supervised classification problem, in which we must also estimate the "true" class labels. The resulting "prediction strength" measure assesses how many groups can be predicted from the data, and how well. In the process, we develop novel notions of bias and variance for unlabeled data. Prediction strength performs well in simulation studies, and we apply it to clusters of breast cancer samples from a DNA microarray study. Finally, some consistency properties of the method are established.
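The abstract does not spell out the computation, so the following is only a minimal sketch of the idea under stated assumptions. It uses the commonly cited form of the measure: cluster a training half and a test half separately, assign each test point to its nearest training centroid, and for each test cluster take the fraction of point pairs that the training centroids also keep together; the prediction strength for k is the minimum of these fractions over the test clusters. The plain Lloyd's k-means, the single train/test split, and the helper names (`kmeans`, `nearest`, `two_blobs`, `prediction_strength`) are illustrative assumptions, not the authors' code.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, centroids):
    """Index of the closest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in clusterer); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = nearest(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def prediction_strength(train, test, k, seed=0):
    """Minimum, over test clusters, of the fraction of within-cluster pairs
    that the training centroids also place in a common cluster."""
    train_centroids = kmeans(train, k, seed=seed)
    test_labels = nearest(test, kmeans(test, k, seed=seed))
    predicted = nearest(test, train_centroids)
    scores = []
    for j in range(k):
        pred = [predicted[i] for i, lab in enumerate(test_labels) if lab == j]
        n = len(pred)
        if n < 2:
            continue  # no pairs to compare in this test cluster
        counts = {}
        for c in pred:
            counts[c] = counts.get(c, 0) + 1
        # ordered pairs (i != i') sharing a predicted cluster
        together = sum(m * (m - 1) for m in counts.values())
        scores.append(together / (n * (n - 1)))
    return min(scores) if scores else 0.0

def two_blobs(seed, n=50):
    """Hypothetical toy data: two well-separated 2-D Gaussian groups."""
    rng = random.Random(seed)
    return ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)] +
            [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(n)])

if __name__ == "__main__":
    train, test = two_blobs(1), two_blobs(2)
    for k in (1, 2, 3, 4):
        print(k, round(prediction_strength(train, test, k), 3))
```

With k = 1 the score is 1 by construction, since every pair shares the single cluster. In practice one would likely average the score over several random splits and pick the largest k whose score stays above a chosen threshold; the threshold and the number of splits are left to the user here.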
Pages: 511-528
Number of pages: 18