ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies

Cited by: 114
Authors
Clark, Scott C. [1]
Egan, Rob [2, 3]
Frazier, Peter I. [4]
Wang, Zhong [2, 3]
Affiliations
[1] Cornell Univ, Ctr Appl Math, Ithaca, NY 14853 USA
[2] Joint Genome Inst, Dept Energy, Walnut Creek, CA 94598 USA
[3] Univ Calif Berkeley, Lawrence Berkeley Natl Lab, Genom Div, Berkeley, CA 94720 USA
[4] Cornell Univ, Sch Operat Res & Informat Engn, Ithaca, NY 14853 USA
Funding
U.S. National Science Foundation
Keywords
SEQUENCE; INSIGHTS;
DOI
10.1093/bioinformatics/bts723
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.
Results: In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
Pages: 435-443
Page count: 9