Exploiting structural information for semi-structured document categorization

Cited by: 25
Authors
Bratko, A
Filipic, B
Affiliations
[1] Jozef Stefan Inst, Dept Intelligent Syst, SI-1000 Ljubljana, Slovenia
[2] Klika Informac Tehnol Doo, SI-1000 Ljubljana, Slovenia
Keywords
text categorization; semi-structured documents; document structure; stacked generalization; support vector machines
DOI
10.1016/j.ipm.2005.06.003
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction. (c) 2005 Elsevier Ltd. All rights reserved.
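The stacking idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' setup: the base learner is a tiny naive Bayes model (the abstract does not name the three algorithms used), the meta classifier is a logistic regression trained by plain gradient descent, the meta-level training scores come from leave-one-out predictions, and the field names and toy documents are invented. The `__flat__` view stands in for the flat text model that the abstract reports adding to the final prediction.

```python
import math
from collections import Counter


class FieldNB:
    """Tiny binary multinomial naive Bayes over whitespace tokens (illustrative base learner)."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.total = {y: sum(c.values()) for y, c in self.counts.items()}
        self.v = len(set(self.counts[0]) | set(self.counts[1])) + 1
        return self

    def score(self, text):
        """Log-odds in favour of class 1, with add-one smoothing."""
        s = math.log((self.docs[1] + 1) / (self.docs[0] + 1))
        for w in text.lower().split():
            p1 = (self.counts[1][w] + 1) / (self.total[1] + self.v)
            p0 = (self.counts[0][w] + 1) / (self.total[0] + self.v)
            s += math.log(p1 / p0)
        return s


def views_of(doc, fields):
    """One text per structural field, plus the flat concatenation of all fields."""
    views = {f: doc[f] for f in fields}
    views["__flat__"] = " ".join(doc[f] for f in fields)  # hypothetical view name
    return views


def train_stack(docs, labels, fields, epochs=200, lr=0.1):
    rows = [views_of(d, fields) for d in docs]
    names = list(rows[0])
    # Meta-level features: each document is scored by base models
    # that were trained without it (leave-one-out, an assumed choice).
    meta_x = []
    for i in range(len(rows)):
        rest = rows[:i] + rows[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        meta_x.append([
            FieldNB().fit([r[n] for r in rest], rest_y).score(rows[i][n])
            for n in names
        ])
    # Meta classifier: logistic regression fitted by stochastic gradient descent.
    w, b = [0.0] * len(names), 0.0
    for _ in range(epochs):
        for x, y in zip(meta_x, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
            b -= lr * (p - y)
    # Final base models are refitted on all training documents.
    base = {n: FieldNB().fit([r[n] for r in rows], labels) for n in names}
    return base, names, w, b


def predict(model, doc, fields):
    base, names, w, b = model
    views = views_of(doc, fields)
    z = b + sum(wi * base[n].score(views[n]) for wi, n in zip(w, names))
    return 1 if z > 0 else 0


# Invented toy data: label 1 = sports, label 0 = finance.
FIELDS = ["title", "body"]
train_docs = [
    {"title": "football match", "body": "team wins the game"},
    {"title": "tennis final", "body": "player wins the match"},
    {"title": "football cup", "body": "team scores goal in game"},
    {"title": "stock market", "body": "shares fall in trading"},
    {"title": "bank profits", "body": "shares rise on market"},
    {"title": "market report", "body": "bank trading profits rise"},
]
train_labels = [1, 1, 1, 0, 0, 0]
model = train_stack(train_docs, train_labels, FIELDS)
```

Calling `predict(model, doc, FIELDS)` on a new document with the same fields returns 0 or 1. Dropping the `__flat__` entry from `views_of` gives the plain per-field stacking variant, which makes it easy to compare the two configurations the abstract contrasts.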
Pages: 679-694
Page count: 16