Performance evaluation of density-based clustering methods

Cited by: 90
Author
Aliguliyev, Ramiz M. [1]
Affiliation
[1] Azerbaijan Natl Acad Sci, Inst Informat Technol, Dept Artificial Intelligence & Comp Sci, Baku AZ1141, Azerbaijan
Keywords
Text mining; Partitional clustering; Density-based clustering methods; Validity indices; Modified DE algorithm; ALGORITHM; VALIDITY; EVOLUTION; SEARCH; COSINE;
DOI
10.1016/j.ins.2009.06.012
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the development of the World Wide Web, document clustering is receiving more and more attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can assist computers in organizing the document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, which is very valuable for complementing the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify the robustness of the proposed approach, experiments were conducted on datasets with a wide variety of numbers of clusters, documents and terms. To evaluate the criterion functions, we used the WebKB, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices was used: three internal validity indices and seven external validity indices. The internal validity indices were used for evaluating the within-cluster scatter and between-cluster separation. The external validity indices were used for comparing the clustering solutions produced by the proposed criterion functions with the "ground truth" results. Experiments showed that our approach significantly improves clustering quality. In this paper, we developed a modified differential evolution (DE) algorithm to optimize the criterion functions. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the resulting solution will be feasible. © 2009 Elsevier Inc. All rights reserved.
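As a small illustration of what an external validity index does (comparing a clustering solution with "ground truth" labels), the sketch below computes purity. The abstract does not name the seven external indices used in the paper, so the choice of purity, the function name purity, and the toy labels are all assumptions made only for this example.

```python
from collections import Counter


def purity(predicted, truth):
    """Share of documents that belong to the majority true class of their cluster.

    Illustrative only: purity is a common external validity index, but the
    abstract does not say whether it is one of the indices used in the paper.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not predicted:
        raise ValueError("at least one document is required")

    # Count, for every cluster, how many documents of each true class it holds.
    class_counts = {}
    for cluster_id, true_label in zip(predicted, truth):
        class_counts.setdefault(cluster_id, Counter())[true_label] += 1

    # Sum the size of the dominant class in each cluster, then normalise.
    majority_total = sum(max(counts.values()) for counts in class_counts.values())
    return majority_total / len(predicted)


# Hypothetical toy data: six documents, two clusters, three true classes.
predicted_clusters = [0, 0, 0, 1, 1, 1]
true_classes = ["a", "a", "b", "b", "b", "c"]
score = purity(predicted_clusters, true_classes)
```

In this toy case each cluster holds two documents of its dominant class out of three, so four of the six documents count as correctly grouped. A value of 1.0 would mean every cluster contains documents from a single true class.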
Pages: 3583-3602
Page count: 20