Semi-structured document categorization with a semantic kernel

Cited by: 17
Authors
Aseervatham, Sujeevan [1]
Bennani, Younes [1]
Affiliations
[1] Univ Paris 13, LIPN, UMR 7030, CNRS, F-93430 Villetaneuse, France
Keywords
Mercer kernel; Support vector machine; Text categorization; Semantic similarity; Semi-structured data
DOI
10.1016/j.patcog.2008.10.024
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Over the past decade, text categorization has become an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can decrease when the texts are too complex, i.e., ambiguous. One alternative is to use semantic-based approaches to process textual documents according to their meaning. Furthermore, research in text categorization has mainly focused on "flat texts", whereas many documents are now semi-structured, especially in XML format. In this paper, we propose a semantic kernel for semi-structured biomedical documents. The semantic meanings of words are extracted using the Unified Medical Language System (UMLS) framework. The kernel, with an SVM classifier, has been applied to a text categorization task on a medical corpus of free-text documents. The results show that the semantic kernel outperforms the linear kernel and the naive Bayes classifier. Moreover, this kernel was ranked in the top 10 among 44 classification methods at the 2007 Computational Medicine Center (CMC) Medical NLP International Challenge. © 2008 Elsevier Ltd. All rights reserved.
Pages: 2067-2076
Number of pages: 10