The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Cited by: 19797
Authors
McKenna, Aaron [1]
Hanna, Matthew [1]
Banks, Eric [1]
Sivachenko, Andrey [1]
Cibulskis, Kristian [1]
Kernytsky, Andrew [1]
Garimella, Kiran [1]
Altshuler, David [1, 2]
Gabriel, Stacey [1]
Daly, Mark [1, 2]
DePristo, Mark A. [1]
Affiliations
[1] Broad Inst Harvard & MIT, Program Med & Populat Genet, Cambridge, MA 02142 USA
[2] Massachusetts Gen Hosp, Richard B Simches Res Ctr, Ctr Human Genet Res, Boston, MA 02114 USA
Keywords
STRUCTURAL VARIATION; QUALITY ASSESSMENT; SHORT-READ; ALIGNMENT; MHC;
DOI
10.1101/gr.107524.110
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genomes pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
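To make the MapReduce idea in the abstract concrete, the following is a minimal Python sketch of a depth-of-coverage calculation split into a map step and a reduce step. It is an illustration only and does not use the GATK API: the read representation (contig, 1-based start, length), the toy data, and the function names `map_read`, `reduce_counts`, and `coverage` are all assumptions made for this example.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy reads: (contig, 1-based start, read length).
# Real tools would read alignments from a file; this is an assumption
# made to keep the sketch self-contained.
READS = [
    ("chr1", 1, 4),  # covers positions 1-4
    ("chr1", 3, 4),  # covers positions 3-6
    ("chr1", 6, 2),  # covers positions 6-7
]


def map_read(read):
    """Map step: emit a count of 1 for every position the read covers."""
    contig, start, length = read
    return Counter({(contig, pos): 1 for pos in range(start, start + length)})


def reduce_counts(total, partial):
    """Reduce step: add one read's counts into the running per-locus depth."""
    total.update(partial)  # Counter.update adds counts rather than replacing
    return total


def coverage(reads):
    """Apply the map step to each read and fold the results with reduce."""
    return reduce(reduce_counts, map(map_read, reads), Counter())


depth = coverage(READS)
for (contig, pos), n in sorted(depth.items()):
    print(contig, pos, n)
```

Because the reduce step only adds counts, the reads can be split into shards, each shard processed independently, and the partial results summed afterwards. That separation between the per-read calculation and the traversal of the data is what allows a framework to handle parallelization on the analyst's behalf.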
Pages: 1297-1303
Page count: 7