Fast and robust general purpose clustering algorithms

被引:32
作者
Estivill-Castro, V [1 ]
Yang, J
机构
[1] Griffith Univ, Sch Comp & Informat Technol, Nathan, Qld 4111, Australia
[2] Univ Western Sydney Macarthur, Sch Comp & Informat Technol, Campbelltown, NSW 2560, Australia
关键词
clustering; k-MEANS; medoids; 1-median problem; combinatorial optimization; EXPECTATION MAXIMIZATION;
D O I
10.1023/B:DAMI.0000015869.08323.b3
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-MEANS has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-MEANS has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, multidimensional but is more robust to noise and outliers. We achieve this by using medians rather than means as estimators for the centers of clusters. Comparison with k-MEANS, EXPECTATION MAXIMIZATION and GIBBS sampling demonstrates the advantages of our algorithm.
引用
收藏
页码:127 / 150
页数:24
相关论文
共 50 条
[11]  
Bradley P. S., 1998, Proceedings Fourth International Conference on Knowledge Discovery and Data Mining, P9
[12]  
Bradley PS, 1997, ADV NEUR IN, V9, P368
[13]  
BRADLEY PS, 1998, P 15 INT C MACH LEAR, P91
[14]  
Bulmer MG., 1979, Principles of Statistics
[15]  
CASELA G., 1990, STAT INFERENCE
[16]   GAUSSIAN PARSIMONIOUS CLUSTERING MODELS [J].
CELEUX, G ;
GOVAERT, G .
PATTERN RECOGNITION, 1995, 28 (05) :781-793
[17]  
Cherkassky V.S., 1998, LEARNING DATA CONCEP, V1st ed.
[18]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[19]  
Dowe DL, 1998, LECT NOTES ARTIF INT, V1394, P87
[20]  
Duda R. O., 1973, PATTERN CLASSIFICATI