Boosting multi-label hierarchical text categorization

Cited by: 38
Authors
Esuli, Andrea [1]
Fagni, Tiziano [1]
Sebastiani, Fabrizio [1]
Affiliation
[1] Inst Sci & Technol Informaz, I-56124 Pisa, Italy
Source
INFORMATION RETRIEVAL | 2008, Vol. 11, No. 4
Keywords
hierarchical text classification; boosting
DOI
10.1007/s10791-008-9047-y
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically structured classification schemes. Notwithstanding the fact that most large-sized classification schemes for text have a hierarchical structure, so far the attention of text classification researchers has mostly focused on algorithms for "flat" classification, i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus be suboptimal, in terms of efficiency and/or effectiveness. In this paper we propose TREEBOOST.MH, a multi-label HTC algorithm consisting of a hierarchical variant of ADABOOST.MH, a very well-known member of the family of "boosting" learning algorithms. TREEBOOST.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection of negative training examples should be performed "locally", i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated "locally". All these intuitions are embodied within TREEBOOST.MH in an elegant and simple way, i.e. by defining TREEBOOST.MH as a recursive algorithm that uses ADABOOST.MH as its base step, and that recurs over the tree structure. We present the results of experiments with TREEBOOST.MH on three HTC benchmarks, and analytically discuss its computational cost.
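The recursive structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' TREEBOOST.MH: the flat learner below is a deliberately trivial keyword-cue stand-in rather than ADABOOST.MH, there is no boosting or weight distribution, and all names (`Node`, `train_flat`, `train_tree`, `classify`) and the toy data are hypothetical. It only shows one plausible reading of "local" training, in which each internal node trains on the documents that reach it, so that a child's negative examples come from its siblings.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list[Node] = field(default_factory=list)
    # Filled in by training: child name -> words that signal that child.
    cues: dict[str, set[str]] = field(default_factory=dict)


def subtree_names(node: Node) -> set[str]:
    """All category names in the subtree rooted at `node`."""
    names = {node.name}
    for child in node.children:
        names |= subtree_names(child)
    return names


def train_flat(docs, child_labels):
    """Toy stand-in for a flat multi-label learner (NOT AdaBoost.MH).

    For each child, keep the words seen in its positive documents
    but in none of its negative (sibling) documents.
    """
    cues = {}
    for child, labels in child_labels.items():
        pos_words, neg_words = set(), set()
        for text, cats in docs:
            (pos_words if cats & labels else neg_words).update(text.split())
        cues[child] = pos_words - neg_words
    return cues


def train_tree(node: Node, docs) -> None:
    """Recur over the tree, training one flat classifier per internal node."""
    if not node.children:
        return
    child_labels = {c.name: subtree_names(c) for c in node.children}
    node.cues = train_flat(docs, child_labels)
    for child in node.children:
        # "Local" training set: only documents under this child's subtree.
        local = [(t, cats) for t, cats in docs if cats & child_labels[child.name]]
        train_tree(child, local)


def classify(node: Node, text: str) -> set[str]:
    """Top-down multi-label classification: descend into accepted children."""
    words = set(text.split())
    found = set()
    for child in node.children:
        if words & node.cues.get(child.name, set()):
            found.add(child.name)
            found |= classify(child, text)
    return found


# Toy hierarchy and training data (illustrative only).
root = Node("root", [
    Node("sports", [Node("soccer"), Node("tennis")]),
    Node("science", [Node("physics")]),
])
training_docs = [
    ("goal match striker", {"soccer"}),
    ("racket match serve", {"tennis"}),
    ("quantum particle energy", {"physics"}),
]
train_tree(root, training_docs)
labels = classify(root, "striker scores a goal")
```

In this toy run the word "match" is dropped as a cue for "soccer" at the "sports" node because it also occurs in the sibling "tennis" document, which is the kind of locality the abstract refers to for negative examples and feature selection.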
Pages: 287-313
Page count: 27