Some computational issues in cluster analysis with no a priori metric

被引:15
作者
Coleman, D
Dong, XP
Hardin, J
Rocke, DM
Woodruff, DL
机构
[1] Hyseq Inc, Sunnyvale, CA USA
[2] Cadence Design Syst Inc, San Jose, CA USA
[3] Univ Calif Davis, Ctr Image Proc & Integrated Computing, Davis, CA 95616 USA
[4] Univ Calif Davis, Grad Sch Management, Davis, CA 95616 USA
基金
美国国家科学基金会; 美国国家卫生研究院;
关键词
cluster analysis; data mining; EM algorithm; mixture models;
D O I
10.1016/S0167-9473(99)00009-2
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Recent interest in data mining and knowledge discovery underscores the need for methods by which patterns can be discovered in data without any prior knowledge of their existence. In this paper, we explore computational methods of finding clusters of multivariate data points when there is no metric given a priori. We are given a sample, X, of n points in R-p that come from g distinct multivariate normal populations with unknown parameters each of which contributes in excess of p points. Based on the assumption that we are given the number of groups, g, and a computational budget of T seconds of computer time, the paper reviews choices for cluster finding that have been described in the literature and introduces a new method that is a structured combination of two of them. We investigate these algorithms on some real data sets and describe simulation experiments. A principal conclusion is strong support for the contention that a two-stage algorithm based on a combinatorial search followed by the EM algorithm is the best way to find clusters. (C) 1999 Elsevier Science B.V. All rights reserved.
引用
收藏
页码:1 / 11
页数:11
相关论文
共 12 条
[1]  
AMMORIM SG, 1992, J CLASSIF, V9, P17
[2]   Mechanisms for local search [J].
Anderson, EJ .
EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 1996, 88 (01) :139-151
[3]   MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING [J].
BANFIELD, JD ;
RAFTERY, AE .
BIOMETRICS, 1993, 49 (03) :803-821
[4]  
Dorndorf U., 1994, ORSA Journal on Computing, V6, P141, DOI 10.1287/ijoc.6.2.141
[5]  
Everitt B., 1993, CLUSTER ANAL
[6]   The use of multiple measurements in taxonomic problems [J].
Fisher, RA .
ANNALS OF EUGENICS, 1936, 7 :179-188
[7]  
Gersho A., 1992, VECTOR QUANTIZATION
[8]  
McLachlan G.J., 1988, Mixture models: inference and applications to clustering, V38
[9]  
MULVEY JM, 1989, MANAGE SCI, V25, P329
[10]   Identification of outliers in multivariate data [J].
Rocke, DM ;
Woodruff, DL .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1996, 91 (435) :1047-1061