Cluster analysis for large datasets: An effective algorithm for maximizing the mixture likelihood

Cited by: 14
Authors
Coleman, DA [1 ]
Woodruff, DL [2]
Affiliations
[1] Cytokinetics, S San Francisco, CA 94080 USA
[2] Univ Calif Davis, Grad Sch Management, Davis, CA 95616 USA
Keywords
classification; EM algorithm; local search
DOI
10.2307/1391087
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline Classification Codes
020208; 070103; 0714
Abstract
The primary model for cluster analysis is the latent class model, which yields the mixture likelihood. Because the mixture likelihood has numerous local maxima, the success of the EM algorithm in maximizing it depends on the starting point of the algorithm. In this article, good starting points for the EM algorithm are obtained by applying classification methods to randomly selected subsamples of the data. The performance of the resulting two-step algorithm, classification followed by EM, is compared to, and found superior to, the baseline algorithm of EM started from a random partition of the data. Though the algorithm is not complicated, comparing it to the baseline algorithm and assessing its performance with several classification methods is nontrivial. The strategy employed for comparing the algorithms is to identify canonical forms for the easiest and most difficult datasets to cluster within a large collection of clustering datasets and then to compare the performance of the two algorithms on these datasets. This has led to the discovery that, in the case of three homogeneous clusters, the most difficult datasets to cluster are those in which the clusters are arranged on a line, and the easiest are those in which the clusters are arranged in an equilateral triangle. The two-step algorithm, assessed with several classification methods, is shown to cluster large, difficult datasets consisting of three highly overlapping clusters arranged on a line, with 10,000 observations on 8 variables.
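To make the two-step idea concrete, the following is a minimal sketch in Python. It is not the authors' implementation: scikit-learn's KMeans and GaussianMixture stand in for the paper's classification methods and EM routine, and the subsample size, cluster geometry, and all parameter values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's hard case: three overlapping
# clusters arranged on a line, 10,000 observations, 8 variables.
n, p, k = 10_000, 8, 3
centers = np.zeros((k, p))
centers[:, 0] = [-2.0, 0.0, 2.0]
true_labels = rng.integers(k, size=n)
X = centers[true_labels] + rng.normal(size=(n, p))

# Step 1: apply a classification method (k-means here, as a stand-in)
# to a small random subsample to obtain starting values for EM.
subsample = X[rng.choice(n, size=500, replace=False)]
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(subsample)

# Step 2: run EM on the full dataset, initialized at the subsample solution.
two_step = GaussianMixture(n_components=k, means_init=km.cluster_centers_,
                           random_state=0).fit(X)

# Baseline for comparison: EM started from a random partition of the data.
baseline = GaussianMixture(n_components=k, init_params="random",
                           random_state=0).fit(X)

print("two-step log-likelihood:    ", two_step.score(X) * n)
print("random-start log-likelihood:", baseline.score(X) * n)

Running EM to convergence only from the subsample-based start is what keeps the cost manageable on large datasets; the expensive full-data iterations begin near a good local maximum instead of a random one.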
Pages: 672-688
Page count: 17