Issues in bioinformatics benchmarking: the case study of multiple sequence alignment

Cited by: 48
Authors
Aniba, Mohamed Radhouene [1, 2, 3]
Poch, Olivier [1, 2, 3]
Thompson, Julie D. [1, 2, 3]
Affiliations
[1] INSERM, U596, F-75654 Paris 13, France
[2] CNRS, UMR7104, F-67400 Illkirch Graffenstaden, France
[3] Univ Strasbourg, F-67000 Strasbourg, France
Keywords
PROTEIN SEQUENCES; TERTIARY STRUCTURE; CLUSTAL-W; DATABASE; ACCURACY; CLASSIFICATION; EVOLUTION; CHALLENGES; ALGORITHM; BALIBASE
DOI
10.1093/nar/gkq625
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
The post-genomic era presents many new challenges for the field of bioinformatics. Novel computational approaches are now being developed to handle the large, complex and noisy datasets produced by high throughput technologies. Objective evaluation of these methods is essential (i) to assure high quality, (ii) to identify strong and weak points of the algorithms, (iii) to measure the improvements introduced by new methods and (iv) to enable non-specialists to choose an appropriate tool. Here, we discuss the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field. We consider several criteria for building good benchmarks and the advantages to be gained when they are used intelligently. To illustrate these principles, we present a more detailed discussion of benchmarks for multiple alignments of protein sequences. As in many other domains, significant progress has been achieved in the multiple alignment field and the datasets have become progressively more challenging as the existing algorithms have evolved. Finally, we propose directions for future developments that will ensure that the bioinformatics benchmarks correspond to the challenges posed by the high throughput data.
Pages: 7353-7363
Page count: 11