GAGE: A critical evaluation of genome assemblies and assembly algorithms

Cited by: 457
Authors
Salzberg, Steven L. [1]
Phillippy, Adam M. [2]
Zimin, Aleksey [3]
Puiu, Daniela [1]
Magoc, Tanja [1]
Koren, Sergey [2,4]
Treangen, Todd J. [1]
Schatz, Michael C. [5]
Delcher, Arthur L. [6]
Roberts, Michael [3]
Marcais, Guillaume [3]
Pop, Mihai [4]
Yorke, James A. [3]
Affiliations
[1] Johns Hopkins Univ, Sch Med, McKusick Nathans Inst Genet Med, Baltimore, MD 21205 USA
[2] Battelle Natl Biodef Inst, Natl Biodef Anal & Countermeasures Ctr, Frederick, MD 21702 USA
[3] Univ Maryland, Inst Phys Sci & Technol, College Pk, MD 20742 USA
[4] Univ Maryland, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
[5] Cold Spring Harbor Lab, Simons Ctr Quantitat Biol, Cold Spring Harbor, NY 11724 USA
[6] Univ Maryland, Sch Med, Inst Genome Sci, Baltimore, MD 21201 USA
Keywords
SEQUENCE DATA; DNA;
DOI
10.1101/gr.131383.111
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three over-arching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
Pages: 557-567
Number of pages: 11