Applying negative rule mining to improve genome annotation

Cited by: 8
Authors
Artamonova, Irena I.
Frishman, Goar
Frishman, Dmitrij
Affiliations
[1] GSF Natl Res Ctr Environm & Hlth, Inst Bioinformat, D-85764 Neuherberg, Germany
[2] RAS, Grp Bioinformat, Vavilov Inst Gen Genet, Moscow 119991, Russia
[3] Tech Univ Munich, Dept Genome Oriented Bioinformat, Wissenschaftszentrum Weihenstephan, D-85350 Freising Weihenstephan, Germany
Source
BMC BIOINFORMATICS | 2007 / Vol. 8
Keywords
PROTEIN SEQUENCES; DATABASE; MIPS;
DOI
10.1186/1471-2105-8-261
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items.

Results: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower.

Conclusion: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.
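The idea of flagging exceptions from negative association rules can be illustrated with a minimal Python sketch. It is not the authors' procedure: the lift-based criterion, the thresholds `min_support` and `max_lift`, the function names, and the toy annotation items are all assumptions chosen for illustration. A pair of annotation items is treated as a negative rule when both items are frequent but co-occur far less often than expected under independence; the few proteins that carry both items anyway are reported as suspicious.

```python
from collections import Counter
from itertools import combinations


def mine_negative_rules(records, min_support=0.2, max_lift=0.5):
    """Return item pairs that co-occur much less often than expected.

    records: dict mapping a protein id to its set of annotation items.
    min_support and max_lift are illustrative thresholds (assumptions).
    """
    n = len(records)
    item_count = Counter()
    pair_count = Counter()
    for items in records.values():
        item_count.update(items)
        # Sorted tuples give every pair one canonical key.
        pair_count.update(combinations(sorted(items), 2))

    frequent = sorted(i for i, c in item_count.items() if c / n >= min_support)
    rules = []
    for a, b in combinations(frequent, 2):
        expected = item_count[a] * item_count[b] / n  # under independence
        observed = pair_count[(a, b)]
        if observed / expected <= max_lift:
            rules.append((a, b))
    return rules


def flag_exceptions(records, rules):
    """Map each protein that violates a negative rule to the violated rules."""
    flagged = {}
    for a, b in rules:
        for pid, items in records.items():
            if a in items and b in items:
                flagged.setdefault(pid, []).append((a, b))
    return flagged


# Hypothetical toy data: two consistent groups plus one conflicting entry.
records = {}
for i in range(8):
    records[f"P{i}"] = {"localization:nuclear", "kw:transcription"}
for i in range(8, 16):
    records[f"P{i}"] = {"localization:membrane", "kw:transport"}
records["P16"] = {"localization:nuclear", "localization:membrane", "kw:transport"}

rules = mine_negative_rules(records)
flagged = flag_exceptions(records, rules)
for pid, violated in flagged.items():
    print(pid, violated)
```

On this toy data only the conflicting entry `P16` is flagged. In a real databank the thresholds would need tuning, and each flagged item would still require manual inspection, since an exception marks a feature as doubtful without proving it wrong.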
Pages: 10