Cluster analysis for large datasets: An effective algorithm for maximizing the mixture likelihood

Cited by: 14
Authors
Coleman, DA [1 ]
Woodruff, DL [2]
Affiliations
[1] Cytokinetics, S San Francisco, CA 94080 USA
[2] Univ Calif Davis, Grad Sch Management, Davis, CA 95616 USA
Keywords
classification; EM algorithm; local search
DOI
10.2307/1391087
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline Classification Codes
020208; 070103; 0714
Abstract
The primary model for cluster analysis is the latent class model, which yields the mixture likelihood. Because the mixture likelihood has numerous local maxima, the success of the EM algorithm in maximizing it depends on the starting point of the algorithm. In this article, good starting points for the EM algorithm are obtained by applying classification methods to randomly selected subsamples of the data. The performance of the resulting two-step algorithm, classification followed by EM, is compared to, and found superior to, the baseline algorithm of EM started from a random partition of the data. Though the algorithm is not complicated, comparing it to the baseline algorithm and assessing its performance with several classification methods is nontrivial. The strategy employed for comparing the algorithms is to identify canonical forms for the easiest and most difficult datasets to cluster within a large collection of clustering datasets and then to compare the performance of the two algorithms on these datasets. This has led to the discovery that, in the case of three homogeneous clusters, the most difficult datasets to cluster are those in which the clusters are arranged on a line, and the easiest are those in which the clusters are arranged in an equilateral triangle. The two-step algorithm, assessed with several classification methods, is shown to cluster large, difficult datasets consisting of three highly overlapping clusters arranged on a line, with 10,000 observations on 8 variables.
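To make the two-step idea concrete, the following is a minimal sketch in Python. It is not the authors' implementation: scikit-learn's KMeans and GaussianMixture stand in for the paper's classification methods and EM routine, and the subsample size, cluster geometry, and all parameter values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's hard case: three overlapping
# clusters arranged on a line, 10,000 observations, 8 variables.
n, p, k = 10_000, 8, 3
centers = np.zeros((k, p))
centers[:, 0] = [-2.0, 0.0, 2.0]
true_labels = rng.integers(k, size=n)
X = centers[true_labels] + rng.normal(size=(n, p))

# Step 1: apply a classification method (k-means here, as a stand-in)
# to a small random subsample to obtain starting values for EM.
subsample = X[rng.choice(n, size=500, replace=False)]
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(subsample)

# Step 2: run EM on the full dataset, initialized at the subsample solution.
two_step = GaussianMixture(n_components=k, means_init=km.cluster_centers_,
                           random_state=0).fit(X)

# Baseline for comparison: EM started from a random partition of the data.
baseline = GaussianMixture(n_components=k, init_params="random",
                           random_state=0).fit(X)

print("two-step log-likelihood:    ", two_step.score(X) * n)
print("random-start log-likelihood:", baseline.score(X) * n)

Running EM to convergence only from the subsample-based start is what keeps the cost manageable on large datasets; the expensive full-data iterations begin near a good local maximum instead of a random one.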
Pages: 672-688
Page count: 17