TopCat: Data mining for topic identification in a text corpus

Cited by: 51

Authors:

Clifton, C

Cooley, R

Rennie, J

Affiliations:

[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA

[2] KXEN Inc, San Francisco, CA 94103 USA

[3] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2004 / Vol. 16 / No. 08

Keywords:

topic detection; data mining; clustering;

DOI:

10.1109/TKDE.2004.32

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
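The pipeline described in the abstract (articles as item sets, then frequent itemsets, then clustering) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the function names `frequent_itemsets` and `group_itemsets`, the support threshold, and the toy entity sets are hypothetical, and the clustering step simply merges itemsets that share an item, as a stand-in for the hypergraph partitioning scheme the paper uses.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count every item combination up to max_size; keep those seen in
    at least min_support articles (brute force, fine for toy data)."""
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def group_itemsets(itemsets):
    """Stand-in for hypergraph partitioning (NOT the paper's method):
    merge any itemsets that share at least one item."""
    groups = []
    for itemset in itemsets:
        merged = set(itemset)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group  # overlapping group is absorbed
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups


# Hypothetical entity sets, one per article (made-up data).
articles = [
    {"mayor", "council", "budget"},
    {"mayor", "budget", "election"},
    {"mayor", "council", "budget"},
    {"storm", "coast", "evacuation"},
    {"storm", "coast"},
    {"storm", "evacuation", "coast"},
]

frequent = frequent_itemsets(articles, min_support=2)
# Only itemsets with two or more items link entities together.
topics = group_itemsets(s for s in frequent if len(s) >= 2)
print(sorted(sorted(t) for t in topics))
```

On this toy input the infrequent item ("election") is filtered out by the support threshold, and the remaining itemsets fall into two groups of co-occurring entities, which is the intuition behind treating each group as a candidate topic.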


Pages: 949-964

Page count: 16
