Information theoretic clustering of sparse co-occurrence data

被引:16
作者
Dhillon, IS [1 ]
Guan, YQ [1 ]
机构
[1] Univ Texas, Dept Comp Sci, Austin, TX 78712 USA
来源
THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2003年
关键词
D O I
10.1109/ICDM.2003.1250966
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
A novel approach to clustering co-occurrence data poses it as an optimization problem in information theory which minimizes the resulting loss in mutual information. A divisive clustering algorithm that monotonically reduces this loss function was recently proposed. In this paper we show that sparse high-dimensional data presents special challenges which can result in the algorithm getting stuck at poor local minima. We propose two solutions to this problem: (a) a "prior" to overcome infinite relative entropy values as in the supervised Naive Bayes algorithm, and (b) local search to escape local minima. Finally, we combine these solutions to get a robust algorithm that is computationally efficient. We present experimental results to show that the proposed method is effective in clustering document collections and outperforms previous information-theoretic clustering approaches.
引用
收藏
页码:517 / 520
页数:4
相关论文
共 9 条
[1]  
[Anonymous], 1995, ICML
[2]  
Cover T. M., 2005, ELEM INF THEORY, DOI 10.1002/047174882X
[3]  
Dhillon I. S., 2003, Journal of Machine Learning Research, V3, P1265, DOI 10.1162/153244303322753661
[4]  
DHILLON IS, 2002, P IEEE INT C DAT MIN
[5]  
DHILLON IS, 2003, TR0339 U TEX DEPT CO
[6]  
Duda R. O., 2000, Pattern Classification and Scene Analysis, V2nd
[7]   Deterministic annealing for clustering, compression, classification, regression, and related optimization problems [J].
Rose, K .
PROCEEDINGS OF THE IEEE, 1998, 86 (11) :2210-2239
[8]  
Slonim N., 2000, ACM SIGIR, P208
[9]  
Slonim N., 2002, ACM SIGIR