NLProt: extracting protein names and sequences from papers

Cited by: 23
Authors
Mika, S
Rost, B
Affiliations
[1] Columbia Univ, CUBIC, New York, NY 10032 USA
[2] Columbia Univ, Dept Biochem & Mol Biophys, New York, NY 10032 USA
[3] Columbia Univ, Ctr Computat Biol & Bioinformat C2B2, New York, NY 10032 USA
[4] Univ Witten Herdecke, Inst Phys Biochem, D-58448 Witten, Germany
Funding
National Institutes of Health (NIH), USA;
DOI
10.1093/nar/gkh427
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
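The abstract reports precision and recall under a strict criterion in which partially tagged names count as errors. The sketch below illustrates that kind of exact-match scoring. It is not NLProt's evaluation code: the function name `strict_precision_recall` and the (start, end) token spans are hypothetical, chosen only to show how a partial match is penalized.

```python
def strict_precision_recall(predicted, gold):
    """Exact-span scoring: a partially tagged name counts as an error.

    Both arguments are iterables of (start, end) token spans.
    This is an illustrative assumption, not the paper's actual protocol.
    """
    predicted, gold = set(predicted), set(gold)
    # Only spans whose boundaries match exactly are true positives.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans: (9, 11) only partially covers the gold name (9, 12),
# so it is scored as one false positive and one false negative.
gold = [(0, 2), (5, 6), (9, 12), (15, 16)]
predicted = [(0, 2), (5, 6), (9, 11), (20, 21)]
precision, recall = strict_precision_recall(predicted, gold)
```

Under this criterion a tagger gets no credit for finding part of a name, which is why strict figures are typically lower than those from overlap-based scoring.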
Pages: W634-W637
Number of pages: 4