Text document clustering based on frequent word meaning sequences

被引：127

作者：

Li, Yanjun ^{[2
]}

Chung, Soon M. ^{[1
]}

Holt, John D. ^{[1
]}

机构：

[1] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA

[2] Fordham Univ, Dept Comp & Informat Sci, Bronx, NY 10458 USA

来源：

DATA & KNOWLEDGE ENGINEERING | 2008年 / 64卷 / 01期

关键词：

text documents; clustering; frequent word sequences; frequent word meaning sequences; web search; WordNet;

D O I：

10.1016/j.datak.2007.08.001

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Most of existing text clustering algorithms use the vector space model, which treats documents as bags of words. Thus, word sequences in the documents are ignored, while the meaning of natural languages strongly depends on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form showing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544] and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms. (c) 2007 Elsevier B.V. All rights reserved.

引用

页码：381 / 404

页数：24

共 35 条

[21]

2-O

[22] FAST PARALLEL AND SERIAL APPROXIMATE STRING MATCHING [J].

LANDAU, GM ;

VISHKIN, U .

JOURNAL OF ALGORITHMS, 1989, 10 (02) :157-169

[23]

Larsen B., 1999, P 5 ACM SIGKDD INT C, P16, DOI [10.1145/312129.312186, DOI 10.1145/312129.312186]

[24] SPACE-ECONOMICAL SUFFIX TREE CONSTRUCTION ALGORITHM [J].

MCCREIGHT, EM .

JOURNAL OF THE ACM, 1976, 23 (02) :262-272

[25]

Rijsbergen V., 1979, INFORM RETRIEVAL, VSecond Edi

[26]

SEDDING J, 2004, P COLING 2004 WORKSH

[27]

Steinbach M., 2000, KDD 2000 WORKSH TEXT

[28] ONLINE CONSTRUCTION OF SUFFIX TREES [J].

UKKONEN, E .

ALGORITHMICA, 1995, 14 (03) :249-260

[29] Clustering transactions using large items [J].

Wang, K ;

Xu, C ;

Liu, B .

PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION KNOWLEDGE MANAGEMENT, CIKM'99, 1999, :483-490

[30]

Weiner P., 1973, 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, P1

← 1 2 3 4 →