Category cluster discovery from distributed WWW directories

Cited by: 32
Authors
Shyu, ML
Haruechaiyasak, C
Chen, SC
Affiliations
[1] Univ Miami, Dept Elect & Comp Engn, Coral Gables, FL 33124 USA
[2] Florida Int Univ, Lab Sch Comp Sci, Distributed Multimedia Informat Syst, Miami, FL 33199 USA
Keywords
distributed information sources; information integration; cluster analysis; web mining; document classification
DOI
10.1016/S0020-0255(03)00169-5
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Due to the inherently distributed nature of many networks, including the Internet, information and knowledge are generated and organized independently by different groups of people. To discover and exploit all the knowledge from different sources, a method of knowledge integration is usually required. Considering the document category sets as information sources, we define a problem of information integration called category merging. The purpose of category merging is to automatically construct a unified category set which represents and exploits document information from several different sources. This merging process is based on the clustering concept, where categories with similar characteristics are merged into the same cluster under certain distributed constraints. To evaluate the quality of the merged category set, we measure the precision and recall values under three classification methods: Naive Bayes, Vector Space Model, and K-Nearest Neighbor. In addition, we propose a performance measure called cluster entropy, which determines how well the categories from different sources are distributed over the resulting clusters. We perform the merging process using real data sets collected from three different Web directories. The results show that our merging process improves the classification performance over the non-merged approach and also provides a better representation for all categories from distributed directories. © 2003 Elsevier Inc. All rights reserved.
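The abstract names cluster entropy as a measure of how categories from different sources are distributed over the merged clusters, but does not give its formula. The sketch below is one plausible reading, not the authors' definition: it assumes Shannon entropy (base 2) over the source labels of the categories in each cluster, averaged with cluster-size weights. The function names and the source labels "A", "B", "C" are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(cluster_sources):
    """Shannon entropy (base 2) of the source labels inside one cluster.

    ASSUMPTION: the abstract does not state the formula; this is the
    standard Shannon entropy, used here only for illustration.
    """
    counts = Counter(cluster_sources)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over the sources present in the cluster
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mean_cluster_entropy(clusters):
    """Size-weighted average of the per-cluster entropies (also an assumption)."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Hypothetical example: each cluster lists the source directory of each
# category it contains.
clusters = [
    ["A", "B", "C"],  # categories drawn from all three directories
    ["A", "A"],       # categories drawn from a single directory
]
for sources in clusters:
    print(sources, round(cluster_entropy(sources), 3))
print("weighted mean:", round(mean_cluster_entropy(clusters), 3))
```

Under this assumed formulation, a cluster whose categories all come from one directory scores 0, and a cluster that mixes the directories evenly scores highest.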
Pages: 181-197
Page count: 17