Text document clustering based on frequent word meaning sequences

Cited by: 127
Authors
Li, Yanjun [2]
Chung, Soon M. [1]
Holt, John D. [1]
Affiliations
[1] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
[2] Fordham Univ, Dept Comp & Informat Sci, Bronx, NY 10458 USA
Keywords
text documents; clustering; frequent word sequences; frequent word meaning sequences; web search; WordNet;
DOI
10.1016/j.datak.2007.08.001
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing text clustering algorithms use the vector space model, which treats documents as bags of words. As a result, word sequences in the documents are ignored, even though meaning in natural language depends strongly on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form as it appears in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544], and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms. (c) 2007 Elsevier B.V. All rights reserved.
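The frequency definition in the abstract can be illustrated with a small sketch. This is not the CFWS or CFWMS algorithm from the paper; it only counts, for each contiguous word sequence of a fixed length, the number of documents that contain it, and keeps those found in more than a given fraction of the documents. The function name `frequent_word_sequences`, the parameters `min_support` and `length`, the whitespace tokenization, the restriction to contiguous sequences, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def frequent_word_sequences(documents, min_support=0.5, length=2):
    """Return contiguous word sequences found in more than
    `min_support` (a fraction) of the documents, with their document counts."""
    doc_count = Counter()
    for text in documents:
        words = text.lower().split()  # assumed tokenization: lowercase + whitespace
        # A set ensures each sequence is counted at most once per document.
        seen = {
            tuple(words[i:i + length])
            for i in range(len(words) - length + 1)
        }
        doc_count.update(seen)
    threshold = min_support * len(documents)
    # "More than a certain percentage" is read here as a strict inequality.
    return {seq: n for seq, n in doc_count.items() if n > threshold}


# Hypothetical sample documents, not taken from the paper's data sets.
docs = [
    "young boys play basketball",
    "young boys like to play basketball",
    "boys play football",
]
result = frequent_word_sequences(docs, min_support=0.5, length=2)
```

With these three documents and a 50% threshold, only the two-word sequences that appear in at least two documents survive. Under the same assumptions, replacing each word with an identifier for its concept (for example, one shared by its synonyms) before counting would give a rough analogue of frequent word meaning sequences.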
Pages: 381-404
Page count: 24
Related papers
35 in total
[1]  
Agrawal R., 1994, Proceedings of the 20th International Conference on Very Large Data Bases. VLDB'94, P487
[2]  
Ahonen-Myka H., 2002, P ESF EXPL WORKSH PA, P16
[3]  
Ahonen-Myka H., 1999, P 16 INT C MACH LEAR, P11
[4]  
Allan J., 2003, Proceedings of the Twelfth Text Retrieval Conference (TREC-12), P24
[5]  
Beil Florian., 2002, KDD 02, P436, DOI DOI 10.1145/775047.775110
[6]  
Choudhary B., 2002, P 11 INT WORLD WID W, P1
[7]  
CUTTING DR, 1992, SIGIR 92 : PROCEEDINGS OF THE FIFTEENTH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P318
[8]  
Doucet, 2004, 2 ACL WORKSH MULT EX, P88, DOI DOI 10.3115/1613186.1613198
[9]   Overcoming the memory bottleneck in suffix tree construction [J].
Farach, M ;
Ferragina, P ;
Muthukrishnan, S .
39TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 1998, :174-183
[10]   Optimal suffix tree construction with large alphabets [J].
Farach, M .
38TH ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 1997, :137-143