CONCEPTS AND EFFECTIVENESS OF THE COVER-COEFFICIENT-BASED CLUSTERING METHODOLOGY FOR TEXT DATABASES

Cited by: 57
Authors
CAN, F [1]
OZKARAHAN, EA [1]
Affiliation
[1] PENN STATE UNIV, SCH BUSINESS, ERIE, PA 16563
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 1990 / Vol. 15 / No. 4
Keywords
ALGORITHMS; DESIGN; PERFORMANCE; THEORY; VERIFICATION; CLUSTERING-INDEXING RELATIONSHIPS; CLUSTER VALIDITY; COVER COEFFICIENT; DECOUPLING COEFFICIENT; DOCUMENT RETRIEVAL; RETRIEVAL EFFECTIVENESS
DOI
10.1145/99935.99938
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is used also to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. In the experiments, two document databases are used: TODS214 and INSPEC. The latter is a common database with 12,684 documents.
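The abstract does not spell out how the cover coefficient yields a cluster-count estimate. The sketch below follows the formulation usually attributed to the CC method, taken here as an assumption rather than from this record: for a document-by-term matrix D, c_ij = (1 / row_sum_i) * sum_k d_ik * d_jk / col_sum_k, the decoupling coefficient of document i is c_ii, and the estimated number of clusters is the sum of the decoupling coefficients. The function names and the toy matrix are hypothetical, chosen only for illustration.

```python
def cover_coefficients(D):
    """Return the m x m cover-coefficient matrix C for a document-by-term matrix D.

    Assumed formulation (not stated in the abstract):
        c_ij = (1 / row_sum_i) * sum_k d_ik * d_jk / col_sum_k
    D is a list of m rows (documents), each with n nonnegative term weights.
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        raise ValueError("every document and every term needs a nonzero entry")
    C = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            shared = sum(D[i][k] * D[j][k] / col_sums[k] for k in range(n))
            C[i][j] = shared / row_sums[i]
    return C


def estimated_cluster_count(D):
    """Sum of the diagonal of C (the decoupling coefficients), under the same assumption."""
    C = cover_coefficients(D)
    return sum(C[i][i] for i in range(len(D)))


# Hypothetical toy data: two identical documents plus one that shares no terms with them.
toy_D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
toy_C = cover_coefficients(toy_D)
toy_count = estimated_cluster_count(toy_D)
```

On this toy matrix the two identical documents each cover themselves only halfway, while the isolated document covers itself fully, so the diagonal sums to two clusters. Seed selection and cluster formation, which the abstract also mentions, are not sketched here.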
Pages: 483-517
Page count: 35