Text document clustering based on frequent word meaning sequences

被引:127
作者
Li, Yanjun [2 ]
Chung, Soon M. [1 ]
Holt, John D. [1 ]
机构
[1] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
[2] Fordham Univ, Dept Comp & Informat Sci, Bronx, NY 10458 USA
关键词
text documents; clustering; frequent word sequences; frequent word meaning sequences; web search; WordNet;
D O I
10.1016/j.datak.2007.08.001
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Most of existing text clustering algorithms use the vector space model, which treats documents as bags of words. Thus, word sequences in the documents are ignored, while the meaning of natural languages strongly depends on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form showing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544] and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms. (c) 2007 Elsevier B.V. All rights reserved.
引用
收藏
页码:381 / 404
页数:24
相关论文
共 35 条
[21]  
2-O
[22]   FAST PARALLEL AND SERIAL APPROXIMATE STRING MATCHING [J].
LANDAU, GM ;
VISHKIN, U .
JOURNAL OF ALGORITHMS, 1989, 10 (02) :157-169
[23]  
Larsen B., 1999, P 5 ACM SIGKDD INT C, P16, DOI [10.1145/312129.312186, DOI 10.1145/312129.312186]
[24]   SPACE-ECONOMICAL SUFFIX TREE CONSTRUCTION ALGORITHM [J].
MCCREIGHT, EM .
JOURNAL OF THE ACM, 1976, 23 (02) :262-272
[25]  
Rijsbergen V., 1979, INFORM RETRIEVAL, VSecond Edi
[26]  
SEDDING J, 2004, P COLING 2004 WORKSH
[27]  
Steinbach M., 2000, KDD 2000 WORKSH TEXT
[28]   ONLINE CONSTRUCTION OF SUFFIX TREES [J].
UKKONEN, E .
ALGORITHMICA, 1995, 14 (03) :249-260
[29]   Clustering transactions using large items [J].
Wang, K ;
Xu, C ;
Liu, B .
PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION KNOWLEDGE MANAGEMENT, CIKM'99, 1999, :483-490
[30]  
Weiner P., 1973, 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, P1