Text clustering using frequent itemsets

Cited by: 140
Authors
Zhang, Wen [1]
Yoshida, Taketoshi [3]
Tang, Xijin [2]
Wang, Qing [1]
Affiliations
[1] Chinese Acad Sci, Inst Software, Lab Internet Software Technol, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Inst Syst Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
[3] Japan Adv Inst Sci & Technol, Sch Knowledge Sci, Tatsunokuchi, Ishikawa 9231292, Japan
Funding
National Natural Science Foundation of China;
Keywords
Document clustering; Frequent itemsets; Maximum capturing; Similarity measure; Competitive learning;
DOI
10.1016/j.knosys.2010.01.011
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract
Frequent itemsets originate from association rule mining. Recently, they have been applied to text mining tasks such as document categorization and clustering. In this paper, we study text clustering using frequent itemsets. The contribution of this paper is threefold. First, we review existing methods of document clustering using frequent patterns. Second, we propose a new document clustering method called Maximum Capturing. Maximum Capturing includes two procedures: constructing document clusters and assigning cluster topics. We develop three versions of Maximum Capturing based on three similarity measures. We propose a normalization process based on frequency sensitive competitive learning for Maximum Capturing to merge cluster candidates into a predefined number of clusters. Third, experiments are carried out to evaluate the proposed method in comparison with the CFWS, CMS, FTC and FIHC methods. Experimental results show that Maximum Capturing clusters documents better than these methods. In particular, Maximum Capturing with documents represented by individual words and with asymmetrical binary similarity as the similarity measure achieves the best performance. Moreover, the topics produced by Maximum Capturing distinguish clusters from each other and can be used as labels of document clusters. (C) 2010 Elsevier B.V. All rights reserved.
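As a rough illustration of the asymmetrical binary similarity mentioned in the abstract, the sketch below computes the Jaccard coefficient between documents represented as sets of items and picks the most similar pair. It is not the authors' Maximum Capturing procedure: the toy documents, the function name `jaccard`, and the choice of the Jaccard coefficient as the asymmetrical binary measure are assumptions made here for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Asymmetric binary similarity: shared items / items present in either set.

    Items absent from both sets are ignored, which is what makes the
    measure asymmetric with respect to presence and absence.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each document reduced to the items (words) it contains.
docs = {
    "d1": {"cluster", "itemset", "text"},
    "d2": {"cluster", "itemset", "mining"},
    "d3": {"neural", "network"},
}

# Pairwise similarity between all documents.
similarity = {
    (p, q): jaccard(docs[p], docs[q])
    for p, q in combinations(sorted(docs), 2)
}

# The pair with the highest similarity is a natural first cluster candidate.
best_pair = max(similarity, key=similarity.get)
print(best_pair, similarity[best_pair])
```

In this toy setting, d1 and d2 share two of four distinct items, so they would be grouped before either is grouped with d3.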
Pages: 379-388
Number of pages: 10