An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data

被引:446
作者
Jing, Liping
Ng, Michael K.
Huang, Joshua Zhexue
机构
[1] Univ Hong Kong, Dept Math, Hong Kong, Hong Kong, Peoples R China
[2] Hong Kong Baptist Univ, Dept Math, Kowloon Tong, Hong Kong, Peoples R China
[3] Univ Hong Kong, E Business Technol Inst, Hong Kong, Hong Kong, Peoples R China
关键词
k-means clustering; variable weighting; subspace clustering; text clustering; high-dimensional data;
D O I
10.1109/TKDE.2007.1048
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper presents a new k-means type algorithm for clustering high-dimensional objects in subspaces. In high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. For example, in text clustering, clusters of documents of different topics are categorized by different subsets of terms or keywords. The keywords for one cluster may not occur in the documents of other clusters. This is a data sparsity problem faced in clustering high-dimensional data. In the new algorithm, we extend the k-means clustering process to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. The experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. The new algorithm is also scalable to large data sets.
引用
收藏
页码:1026 / 1041
页数:16
相关论文
共 39 条
[1]  
Aggarwal CC, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P61, DOI 10.1145/304181.304188
[2]  
AGGARWAL CC, 2000, P ACM SIGMOD INT C M, P70, DOI DOI 10.1145/335191
[3]  
[Anonymous], 2002, COMP AGGLOMERATIVE P
[4]  
[Anonymous], 2002, THESIS KOREA ADV I S
[5]  
[Anonymous], 1996, BOW TOOLKIT STAT LAN
[6]  
[Anonymous], THESIS
[7]  
Cai D., 2005, IEEE T KNOWLEDGE DAT, V17
[8]  
CHAKRABARTI K, 2000, P 26 INT C VER LARG, P89
[9]   An optimization algorithm for clustering using weighted dissimilarity measures [J].
Chan, EY ;
Ching, WK ;
Ng, MK ;
Huang, JZ .
PATTERN RECOGNITION, 2004, 37 (05) :943-952
[10]  
Cheng C.-H., 1999, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD'99), P84, DOI [10.1145/312129.312199, DOI 10.1145/312129.312199]