TopCat: Data mining for topic identification in a text corpus

Cited by: 51
Authors
Clifton, C
Cooley, R
Rennie, J
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] KXEN Inc, San Francisco, CA 94103 USA
[3] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA
Keywords
topic detection; data mining; clustering
DOI
10.1109/TKDE.2004.32
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
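The first stage described in the abstract, generating frequent itemsets from articles represented as sets of items, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the toy entity data are all assumptions made for illustration, and the code counts candidate itemsets by brute force rather than with a pruned Apriori-style search. The hypergraph partitioning stage is not shown.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count itemsets of up to max_size items and keep the frequent ones.

    articles: iterable of item collections (one per article).
    min_support: minimum number of articles an itemset must appear in.
    Hypothetical helper for illustration; brute-force, no Apriori pruning.
    """
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))  # each item counted once per article
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Made-up toy data: each article reduced to a set of named entities.
articles = [
    {"Clinton", "Iraq", "UN"},
    {"Clinton", "Iraq"},
    {"Iraq", "UN"},
    {"NBA", "Jordan"},
    {"NBA", "Jordan", "Bulls"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy data the itemsets that survive the support threshold fall into two groups of co-occurring entities, which is the kind of structure a later clustering step would group into topics.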
Pages: 949-964
Page count: 16