Identifying document topics using the Wikipedia category network

Cited by: 38
Author
Schoenhofen, Peter [1]
Affiliation
[1] Hungarian Acad Sci, Comp & Automat Res Inst, Budapest, Hungary
Source
2006 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, (WI 2006 MAIN CONFERENCE PROCEEDINGS) | 2006
DOI
10.1109/WI.2006.92
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the last few years the size and coverage of Wikipedia, a freely available online encyclopedia, have reached the point where it can be used much like an ontology or taxonomy to identify the topics discussed in a document. In this paper we show that even a simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves from their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of their texts.
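As an illustration only, the general idea of characterizing a document by the categories of the Wikipedia articles whose titles appear in it might be sketched as follows. This is not the paper's algorithm: the abstract gives no details on matching or weighting, so the simple vote counting, the function name rank_categories, the max_title_words parameter, and the toy title-to-category table are all assumptions made for this sketch.

```python
from collections import Counter
import re

# Hypothetical toy data: article title -> categories. A real run would load
# this mapping from a Wikipedia dump; none of it comes from the paper.
TITLE_TO_CATEGORIES = {
    "neural network": ["Machine learning", "Artificial intelligence"],
    "decision tree": ["Machine learning"],
    "budapest": ["Capitals in Europe"],
}


def rank_categories(text, title_to_categories, max_title_words=3):
    """Give each category one vote per occurrence of one of its article titles.

    Assumption: plain unweighted counting. The paper may weight or filter
    matches differently; the abstract does not say.
    """
    # Lowercase and split into alphanumeric tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    votes = Counter()
    # Slide a window of 1..max_title_words tokens over the text and look up
    # each window as a candidate article title.
    for n in range(1, max_title_words + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            for category in title_to_categories.get(candidate, []):
                votes[category] += 1
    # Categories sorted by vote count, highest first.
    return votes.most_common()


ranked = rank_categories(
    "A neural network and a decision tree were trained in Budapest.",
    TITLE_TO_CATEGORIES,
)
```

Here the top-ranked category is the one shared by the most matched titles. The resulting list of categories could then stand in for the document text as the feature representation in classification or clustering, which is the kind of evaluation the abstract describes.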
Pages: 456-462
Page count: 7