Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation

Cited by: 20
Authors
Dobrokhotov, Pavel B. [1]
Goutte, Cyril [2]
Veuthey, Anne-Lise [1]
Gaussier, Eric [2]
Affiliations
[1] Swiss Inst Bioinformat, CMU, CH-1211 Geneva 4, Switzerland
[2] Xerox Res Ctr Europe, F-38240 Meylan, France
DOI
10.1093/bioinformatics/btg1011
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.
Results: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. Because the PLC technique provides the relative contribution of each term to the final document score, we used the symmetric Kullback-Leibler divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should help curators understand classification results better. It is also valuable for fine-tuning the linguistic preprocessing of documents, which in turn can improve overall classifier performance.
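The abstract does not give the formula used for term selection, so the following is only a minimal sketch of how per-term contributions to a symmetric Kullback-Leibler divergence between two category-conditional term distributions could be computed and ranked. The function names, the smoothing constant `eps`, and the toy distributions are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def symmetric_kl_contributions(p, q, eps=1e-12):
    """Per-term contributions to KL(p||q) + KL(q||p).

    p, q: dicts mapping term -> probability within one category
    (hypothetical inputs; each is assumed to sum to 1).
    eps is an assumed smoothing constant that avoids log(0).
    """
    contributions = {}
    for term in set(p) | set(q):
        pt = p.get(term, 0.0) + eps
        qt = q.get(term, 0.0) + eps
        # (pt - qt) and log(pt / qt) share a sign, so each term is >= 0.
        contributions[term] = (pt - qt) * math.log(pt / qt)
    return contributions


def top_terms(p, q, k=3):
    """Return the k terms with the largest divergence contribution."""
    contributions = symmetric_kl_contributions(p, q)
    return sorted(contributions, key=contributions.get, reverse=True)[:k]


# Toy term distributions (invented for illustration only).
relevant = {"mutation": 0.4, "disease": 0.3, "protein": 0.2, "expression": 0.1}
other = {"mutation": 0.1, "disease": 0.1, "protein": 0.2, "expression": 0.6}

ranking = top_terms(relevant, other, k=3)
total_divergence = sum(symmetric_kl_contributions(relevant, other).values())
```

On these toy inputs, "protein" contributes nothing because it is equally likely in both categories, while terms whose probability differs most between categories rank highest, regardless of which category favours them.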
Pages: i91-i94
Page count: 4
Related papers
6 in total
[1] The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 [J].
Boeckmann, B; Bairoch, A; Apweiler, R; Blatter, MC; Estreicher, A; Gasteiger, E; Martin, MJ; Michoud, K; O'Donovan, C; Phan, I; Pilbout, S; Schneider, M.
NUCLEIC ACIDS RESEARCH, 2003, 31 (01): 365-370
[2] DOBROKHOTOV PB, 2003, P MIE2003 IN PRESS
[3] Gaussier E, 2002, LECT NOTES COMPUT SC, V2291, P229
[4] Hagege C., 2002, Advances in Natural Language Processing. Third International Conference, PorTAL 2002. Proceedings (Lecture Notes in Artificial Intelligence Vol.2389), P197
[5] Mining literature for protein-protein interactions [J].
Marcotte, EM; Xenarios, I; Eisenberg, D.
BIOINFORMATICS, 2001, 17 (04): 359-363
[6] WILBUR JW, 2000, P AMIA S, P918