TopCat: Data mining for topic identification in a text corpus

Cited by: 51

Authors:

Clifton, C

Cooley, R

Rennie, J

Affiliations:

[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA

[2] KXEN Inc, San Francisco, CA 94103 USA

[3] MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2004 / Vol. 16 / No. 8

Keywords:

topic detection; data mining; clustering

DOI:

10.1109/TKDE.2004.32

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground-truth news corpus; it shows that this technique is effective in identifying topics in collections of news articles.
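
The pipeline outlined in the abstract (articles as sets of entities, then frequent itemsets, then clusters) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The entity names, the `min_support` threshold, and the helper names `frequent_itemsets` and `merge_overlapping` are all hypothetical. Itemsets are counted by brute force, and the clustering step merges itemsets that share an entity, which is a much simpler stand-in for the hypergraph partitioning scheme the abstract mentions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count every entity subset of size 2..max_size and keep those that
    appear in at least min_support articles (brute force, no pruning)."""
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))
        for size in range(2, max_size + 1):
            counts.update(combinations(items, size))
    return {frozenset(s): c for s, c in counts.items() if c >= min_support}


def merge_overlapping(itemsets):
    """Stand-in for the clustering step: merge itemsets that share any
    entity (connected components). This is NOT hypergraph partitioning."""
    clusters = []
    for itemset in itemsets:
        merged = set(itemset)
        untouched = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster
            else:
                untouched.append(cluster)
        clusters = untouched + [merged]
    return clusters


# Hypothetical entity sets, as a named-entity extractor might produce them.
articles = [
    {"Clinton", "Iraq", "UN"},
    {"Clinton", "Iraq", "Baghdad"},
    {"Iraq", "UN", "Clinton"},
    {"Microsoft", "Gates", "DOJ"},
    {"Microsoft", "Gates", "antitrust"},
]

frequent = frequent_itemsets(articles, min_support=2)
topics = merge_overlapping(frequent)
```

With this toy input, entities that co-occur in only one article (such as "Baghdad" or "DOJ") fall below the support threshold, and the remaining itemsets collapse into two groups of related entities. Merging on any shared entity tends to over-merge on real corpora, which is one reason a partitioning scheme would be preferred in practice.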


Pages: 949-964

Number of pages: 16
