Classifying web documents in a hierarchy of categories: a comprehensive study

Cited by: 60
Authors
Ceci, Michelangelo [1]
Malerba, Donato [1]
Affiliation
[1] Univ Bari, Dipartimento Informat, I-70126 Bari, Italy
Keywords
text categorization; hierarchical models; supervised learning; feature selection; performance evaluation; web content mining
DOI
10.1007/s10844-006-0003-2
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. The consideration of the hierarchical relationship among categories opens several additional issues in the development of methods for automated document classification. Questions concern the representation of documents, the learning process, the classification process and the evaluation criteria of experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework where the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naive Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
Pages: 37-78
Number of pages: 42