Comparing De Novo Genome Assembly: The Long and Short of It

Cited by: 73
Authors
Narzisi, Giuseppe [1]
Mishra, Bud [1, 2]
Affiliations
[1] NYU, Courant Inst Math Sci, New York, NY 10003 USA
[2] NYU, Sch Med, New York, NY USA
Source
PLOS ONE | 2011 / Vol. 6 / Issue 04
Funding
National Science Foundation (USA);
Keywords
SHORT DNA-SEQUENCES; ALGORITHM; VIRULENCE; MILLIONS; READS;
DOI
10.1371/journal.pone.0019175
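Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification code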
07 ; 0710 ; 09 ;
Abstract
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics for comparing assembled sequences emphasize only size, poorly capturing contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-off between contig quality and contig size. For this purpose, most of the publicly available major sequence assemblers, both for low-coverage long-read (Sanger) and high-coverage short-read (Illumina) technologies, are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read lengths, coverages, and accuracies, with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
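The contrast the abstract draws between a size-only metric (N50) and a quality-aware one (FRC) can be illustrated with a minimal Python sketch. This is not the AMOS tool and not the authors' implementation. It assumes the usual definition of N50 (the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half the total) and a simplified reading of a Feature-Response Curve: for a given feature threshold, contigs are taken from longest to shortest while their cumulative count of suspicious features stays within the threshold, and the fraction of the genome they cover is reported. The function names, the `(length, feature_count)` input format, and the sample numbers are all hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig at which the running sum, taken from the
    longest contig down, first reaches half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def frc_point(contigs, feature_threshold, genome_length):
    """One point of a simplified feature-response curve (assumed form).

    contigs: iterable of (length, feature_count) pairs, where
    feature_count is the number of suspicious features found in
    that contig (hypothetical input format).
    Returns the approximate fraction of the genome covered by the
    longest contigs whose cumulative feature count stays within
    feature_threshold.
    """
    covered = 0
    features = 0
    for length, feats in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + feats > feature_threshold:
            break  # the next contig would exceed the feature budget
        features += feats
        covered += length
    return covered / genome_length


# Hypothetical assembly: (contig length, number of features)
assembly = [(80, 1), (70, 0), (50, 3), (40, 0)]

size_only = n50([length for length, _ in assembly])
curve = [(t, frc_point(assembly, t, genome_length=300)) for t in (0, 1, 4)]
```

Under these assumptions, two assemblies with the same N50 can still produce different curves: the one whose large contigs carry fewer features reaches a given coverage at a lower threshold, which is the kind of difference a size-only metric cannot show.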
Pages: 14