Bayesian network model for semi-structured document classification

Cited by: 45
Authors
Denoyer, L [1]
Gallinari, P [1]
Affiliation
[1] Lab Informat Paris 6, F-75015 Paris, France
Keywords
statistical learning; Bayesian networks; categorization; structured documents; XML; machine learning
DOI
10.1016/j.ipm.2004.04.009
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline classification code
0812
Abstract
Recently, a new community has started to emerge around the development of new information research methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that handles both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here, text and images). The model was tested on three databases: the classical webKB corpus composed of HTML pages, the new INEX corpus, which has become a reference in the field of ad-hoc retrieval for XML documents, and a multimedia corpus of Web pages. © 2004 Elsevier Ltd. All rights reserved.
Pages: 807-827
Page count: 21