An iterative initial-points refinement algorithm for categorical data clustering

Cited by: 78
Authors
Sun, Y [1 ]
Zhu, QM [1 ]
Chen, ZX [1 ]
Affiliation
[1] University of Nebraska at Omaha, Department of Computer Science, Digital Imaging & Computer Vision Laboratory, Omaha, NE 68182, USA
Keywords
data clustering; pattern classification; refinement algorithm; data mining
DOI
10.1016/S0167-8655(01)00163-5
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The original k-means clustering algorithm is designed to work primarily on numeric data sets, which prevents it from being applied directly to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, World Scientific, Singapore, 1997, pp. 21-34] extends the k-means paradigm to categorical data by updating the cluster modes with a frequency-based method rather than minimizing a numerically valued cost function as k-means does. However, as with most clustering algorithms, k-modes requires the initial points (modes) of the clusters to be preset or selected at random, and differences in these initial points often lead to markedly different clustering results. In this paper we present an experimental study on applying Bradley and Fayyad's iterative initial-point refinement algorithm to k-modes clustering to improve the accuracy and repeatability of the clustering results [cf. P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1998]. Experiments show that k-modes clustering with refined initial points yields higher-precision results far more reliably than random selection without refinement, making the refinement process applicable to many data mining applications involving categorical data. (C) 2002 Elsevier Science B.V. All rights reserved.
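The abstract describes two procedures: the frequency-based mode update that k-modes substitutes for the k-means centroid update, and the Bradley-Fayyad refinement, which clusters several small subsamples, pools the resulting candidate modes, and keeps the candidate set that gives the lowest cost over the pool. The following minimal Python sketch illustrates both ideas under stated assumptions; it is not the authors' implementation. All names (hamming, update_mode, k_modes, refine_initial_modes), the parameter choices (ten subsamples, 10% subsample fraction), and the simple-matching dissimilarity are assumptions made for the example.

import random
from collections import Counter

def hamming(a, b):
    # Simple matching dissimilarity: number of attributes on which a and b differ.
    return sum(x != y for x, y in zip(a, b))

def update_mode(cluster):
    # Frequency-based mode update: for each attribute, take the most frequent category.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(data, modes, iters=20):
    # One k-modes run from the given initial modes; returns (final modes, total cost).
    modes = list(modes)
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for row in data:
            j = min(range(len(modes)), key=lambda i: hamming(row, modes[i]))
            clusters[j].append(row)
        new_modes = [update_mode(c) if c else modes[i] for i, c in enumerate(clusters)]
        if new_modes == modes:
            break
        modes = new_modes
    cost = sum(min(hamming(row, m) for m in modes) for row in data)
    return modes, cost

def refine_initial_modes(data, k, n_subsamples=10, frac=0.1, seed=0):
    # Bradley/Fayyad-style refinement adapted to categorical data: cluster several
    # small subsamples, pool the candidate modes, then re-cluster the pool starting
    # from each candidate set and keep the set with the lowest cost over the pool.
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_subsamples):
        sample = rng.sample(data, max(k, int(frac * len(data))))
        init = rng.sample(sample, k)
        modes, _ = k_modes(sample, init)
        candidates.append(modes)
    pooled = [m for cand in candidates for m in cand]
    best, best_cost = None, float("inf")
    for cand in candidates:
        modes, cost = k_modes(pooled, cand)
        if cost < best_cost:
            best, best_cost = modes, cost
    return best

if __name__ == "__main__":
    # Tiny categorical toy data set; each row is a tuple of attribute values.
    data = [("a", "x", "1"), ("a", "x", "2"), ("b", "y", "1"),
            ("b", "y", "2"), ("a", "y", "1"), ("b", "x", "2")]
    init = refine_initial_modes(data, k=2)
    modes, cost = k_modes(data, init)
    print("refined initial modes:", init)
    print("final modes:", modes, "cost:", cost)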
Pages: 875-884
Number of pages: 10
References
18 in total
[1] Ahrens, J.H., Dieter, U. Extensions of Forsythe's method for random sampling from the normal distribution. Mathematics of Computation, 1973, 27(124): 927-937.
[2] Anderberg, M.R. Cluster Analysis for Applications. Academic Press, 1973 (Probability and Mathematical Statistics series).
[3] [Anonymous]. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining.
[4] Bradley, P., Fayyad, U. Refining initial points for k-means clustering. In: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, 1998.
[5] Bobrowski, L., Bezdek, J.C. c-means clustering with the l1 and l-infinity norms. IEEE Transactions on Systems, Man, and Cybernetics, 1991, 21(3): 545-554.
[6] Fisher, D.H. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 1987, 2: 139-172. DOI 10.1007/BF00114265.
[7] Gower, J. Pattern Recognition, 1991, 24: 567.
[8] Hartigan, J.A. Clustering Algorithms. Wiley, 1975.
[9] Huang, Z. Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, World Scientific, Singapore, 1997, pp. 21-34.
[10] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998, 2(3): 283-304.