TopCat: Data mining for topic identification in a text corpus

Cited by: 51
Authors
Clifton, C
Cooley, R
Rennie, J
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] KXEN Inc, San Francisco, CA 94103 USA
[3] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA
Keywords
topic detection; data mining; clustering
DOI
10.1109/TKDE.2004.32
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
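The frequent-itemset step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the entity names in `ARTICLES`, the function `frequent_itemsets`, and the parameters `min_support` and `max_size` are all assumptions made for illustration. It also counts itemsets by brute-force enumeration rather than with an Apriori-style algorithm, and it leaves out the hypergraph partitioning step entirely.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support=2, max_size=3):
    """Count entity sets that co-occur in at least `min_support` articles.

    Brute-force enumeration, suitable only for tiny examples; a real
    system would use Apriori-style pruning or a similar algorithm.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical data: each article reduced to the set of entities
# that a named-entity tagger might have extracted from it.
ARTICLES = [
    {"NATO", "Kosovo", "Belgrade"},
    {"NATO", "Kosovo", "Clinton"},
    {"Dow Jones", "Wall Street"},
    {"Dow Jones", "Wall Street", "Clinton"},
]

itemsets = frequent_itemsets(ARTICLES, min_support=2)

# Itemsets with more than one entity are candidate topic fragments;
# in the abstract's pipeline these would then be clustered.
candidates = sorted(sorted(s) for s in itemsets if len(s) > 1)
print(candidates)
```

In this toy corpus, "Clinton" appears in two articles but never co-occurs with the same entity twice, so it is frequent on its own without joining any multi-entity itemset.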
Pages: 949-964
Page count: 16