The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Cited by: 19797
Authors
McKenna, Aaron [1]
Hanna, Matthew [1]
Banks, Eric [1]
Sivachenko, Andrey [1]
Cibulskis, Kristian [1]
Kernytsky, Andrew [1]
Garimella, Kiran [1]
Altshuler, David [1, 2]
Gabriel, Stacey [1]
Daly, Mark [1, 2]
DePristo, Mark A. [1]
Affiliations
[1] Broad Inst Harvard & MIT, Program Med & Populat Genet, Cambridge, MA 02142 USA
[2] Massachusetts Gen Hosp, Richard B Simches Res Ctr, Ctr Human Genet Res, Boston, MA 02114 USA
Keywords
STRUCTURAL VARIATION; QUALITY ASSESSMENT; SHORT-READ; ALIGNMENT; MHC;
DOI
10.1101/gr.107524.110
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genome pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
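The MapReduce idea named in the abstract can be illustrated with a toy coverage calculator. The following is a minimal sketch in Python, not the GATK API: the `Read` type and the functions `pileup_depths`, `map_locus`, `reduce_depth`, and `mean_coverage` are hypothetical names introduced here for illustration. It assumes reads are given as simple (start, length) intervals on a single contig. The "framework" part produces the depth at each locus, while the "analysis" part is only a map step and a reduce step.

```python
from functools import reduce
from typing import Iterable, Iterator, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical aligned read: an interval on one contig."""
    start: int   # 0-based leftmost aligned position
    length: int  # number of reference bases covered


def pileup_depths(reads: Iterable[Read], region_start: int,
                  region_end: int) -> Iterator[Tuple[int, int]]:
    """Data-access side: yield (locus, depth) for [region_start, region_end)."""
    reads = list(reads)
    for locus in range(region_start, region_end):
        depth = sum(1 for r in reads if r.start <= locus < r.start + r.length)
        yield locus, depth


def map_locus(locus: int, depth: int) -> int:
    """Map step: the per-locus value of interest is just the depth."""
    return depth


def reduce_depth(acc: Tuple[int, int], depth: int) -> Tuple[int, int]:
    """Reduce step: accumulate (sum of depths, number of loci)."""
    total, n = acc
    return total + depth, n + 1


def mean_coverage(reads: Iterable[Read], region_start: int,
                  region_end: int) -> float:
    """Run map over every locus, then fold the results with reduce."""
    mapped = (map_locus(locus, depth)
              for locus, depth in pileup_depths(reads, region_start, region_end))
    total, n = reduce(reduce_depth, mapped, (0, 0))
    return total / n if n else 0.0


if __name__ == "__main__":
    example = [Read(0, 4), Read(2, 4)]
    print(mean_coverage(example, 0, 4))
```

Because the reduce step only combines partial (sum, count) pairs, separate regions could in principle be processed independently and merged afterwards, which is the property the abstract relies on for parallelization.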
Pages: 1297-1303
Number of pages: 7