Filtering erroneous protein annotation

Cited by: 20
Authors
Wieser, D. [1]
Kretschmann, E. [1]
Apweiler, R. [1]
Affiliation
[1] European Bioinformatics Institute, Sequence Database Group, Cambridge CB10 1SD, England
DOI
10.1093/bioinformatics/bth938
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Automatically generated annotation on protein data of UniProt (Universal Protein Resource) is planned to be publicly available on the UniProt web pages in April 2004. It is expected that the data content of over 500 000 protein entries in the TrEMBL section will be enhanced by the output of an automated annotation pipeline. However, a part of the automatically added data will be erroneous, as are parts of the information coming from other sources. We present a post-processing system called Xanthippe that is based on a simple exclusion mechanism and a decision tree approach using the C4.5 data-mining algorithm.
Results: It is shown that Xanthippe detects and flags a large part of the annotation errors and considerably increases the reliability of both automatically generated data and annotation from other sources. As a cross-validation to Swiss-Prot shows, errors in protein descriptions, comments and keywords are successfully filtered out. Xanthippe is a contradictive application that can be combined seamlessly with predictive systems. It can be used either to improve the precision of automated annotation at a constant level of recall or increase the recall at a constant level of precision.
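The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of an exclusion filter applied after a predictive annotator. It is not Xanthippe's actual code. The rule table, the function names and the example annotations are all hypothetical, chosen to show how flagging contradicted annotations can raise precision while leaving recall unchanged.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical exclusion rules (illustrative only, not taken from the paper):
# each annotation maps to the taxonomic groups it is assumed to be
# incompatible with.
EXCLUSION_RULES: Dict[str, Set[str]] = {
    "Chloroplast": {"Bacteria", "Archaea"},
    "Nuclear protein": {"Bacteria", "Archaea"},
}


def flag_annotations(taxon: str, annotations: List[str]) -> Tuple[List[str], List[str]]:
    """Split annotations into (kept, flagged) using the exclusion rules."""
    kept: List[str] = []
    flagged: List[str] = []
    for annotation in annotations:
        # An annotation is flagged when the protein's taxon is listed as
        # incompatible with it; annotations without a rule are always kept.
        if taxon in EXCLUSION_RULES.get(annotation, set()):
            flagged.append(annotation)
        else:
            kept.append(annotation)
    return kept, flagged


def precision_recall(predicted: List[str], reference: List[str]) -> Tuple[float, float]:
    """Precision and recall of predicted annotations against a curated reference."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(ref) if ref else 1.0
    return precision, recall


# Made-up example: a bacterial protein with three predicted keywords,
# only one of which appears in the (hypothetical) curated reference.
predicted = ["Chloroplast", "Hydrolase", "Nuclear protein"]
reference = ["Hydrolase"]

kept, flagged = flag_annotations("Bacteria", predicted)
before = precision_recall(predicted, reference)
after = precision_recall(kept, reference)

print("flagged:", flagged)
print("precision/recall before filtering:", before)
print("precision/recall after filtering:", after)
```

In this toy case the filter removes only contradicted annotations, so the correct keyword survives and recall stays the same while precision rises. According to the abstract, the real system derives its decision rules with C4.5; here they are hand-written.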
Pages: 342-347
Page count: 6