Text document clustering based on neighbors

被引:62
作者
Luo, Congnan [2 ]
Li, Yanjun [3 ]
Chung, Soon M. [1 ]
机构
[1] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
[2] Teradata Corp, San Diego, CA 92127 USA
[3] Fordham Univ, Dept Comp & Informat Sci, Bronx, NY 10458 USA
关键词
Document clustering; Text mining; k-means; Bisecting k-means; Performance analysis; K-MEANS;
D O I
10.1016/j.datak.2009.06.007
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported performing well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345-366], to document clustering. If two documents are similar enough, they are considered as neighbors of each other. And the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link involve the global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data Sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much. (C) 2009 Elsevier B.V. All rights reserved.
引用
收藏
页码:1271 / 1288
页数:18
相关论文
共 37 条
[1]  
[Anonymous], LEMUR TOOLKIT LANGUA
[2]  
Bartal Y., 2001, Proceedings of the thirty-third annual ACM symposium on Theory of computing, STOC '01, P11
[3]  
BUCKLEY C., 1985, SIGIR 85, P97
[4]  
Choi FYY, 2001, PROCEEDINGS OF THE 2001 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P109
[5]  
CUTTING DR, 1992, SIGIR 92 : PROCEEDINGS OF THE FIFTEENTH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P318
[6]  
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO
[7]  
2-9
[8]   A SIMPLE HEURISTIC FOR THE P-CENTER PROBLEM [J].
DYER, ME ;
FRIEZE, AM .
OPERATIONS RESEARCH LETTERS, 1985, 3 (06) :285-288
[9]  
Ertöz L, 2004, NETWORK THEORY APPLI, V11, P83
[10]  
Ester M., 1996, P 2 INT C KNOWLEDGE, P226, DOI DOI 10.5555/3001460.3001507