A framework for variation discovery and genotyping using next-generation DNA sequencing data

Cited by: 8587
Authors
DePristo, Mark A. [1]
Banks, Eric [1]
Poplin, Ryan [1]
Garimella, Kiran V. [1]
Maguire, Jared R. [1]
Hartl, Christopher [1]
Philippakis, Anthony A. [1, 2, 3]
del Angel, Guillermo [1]
Rivas, Manuel A. [1, 4]
Hanna, Matt [1]
McKenna, Aaron [1]
Fennell, Tim J. [1]
Kernytsky, Andrew M. [1]
Sivachenko, Andrey Y. [1]
Cibulskis, Kristian [1]
Gabriel, Stacey B. [1]
Altshuler, David [1, 3, 4]
Daly, Mark J. [1, 3, 4]
Affiliations
[1] Broad Inst Harvard & MIT, Program Med & Populat Genet, Cambridge, MA USA
[2] Brigham & Womens Hosp, Boston, MA 02115 USA
[3] Harvard Univ, Sch Med, Boston, MA USA
[4] Massachusetts Gen Hosp, Richard B Simches Res Ctr, Ctr Human Genet Res, Boston, MA 02114 USA
Keywords
SHORT-READ; QUALITY SCORES; SNP DISCOVERY; GENOME; ALIGNMENT;
DOI
10.1038/ng.806
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007 [Genetics];
Abstract
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (~4x) 1000 Genomes Project datasets.
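As a rough illustration of what step (iii), base quality score recalibration, is meant to achieve, the sketch below bins bases by their reported quality and recomputes an empirical Phred-scaled quality from the observed mismatch rate in each bin. This is only a minimal sketch under stated assumptions: the function names, the input format (pairs of reported quality and a mismatch flag), and the add-one smoothing are all hypothetical choices made for illustration. This record does not describe the authors' actual procedure, and the code is not the Genome Analysis Toolkit implementation.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p_error)


def empirical_qualities(observations):
    """Estimate an empirical quality for each reported quality bin.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per
    sequenced base (hypothetical input format, assumed for this sketch).
    Returns a dict mapping reported_quality -> empirical Phred quality.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total bases]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    # Add-one smoothing (an assumption) keeps bins with zero mismatches finite.
    return {q: phred((m + 1) / (n + 2)) for q, (m, n) in counts.items()}


if __name__ == "__main__":
    # Toy data: bases reported as Q30 that mismatch the reference more often
    # than a Q30 error rate (1 in 1000) would predict.
    toy = [(30, True)] * 5 + [(30, False)] * 995
    for q, q_emp in sorted(empirical_qualities(toy).items()):
        print(f"reported Q{q} -> empirical Q{q_emp:.1f}")
```

Binning on the reported quality alone is the simplest possible choice; a practical recalibration would likely condition on further covariates, which are not specified here.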
Pages: 491 / +
Page count: 11
Related papers
39 in total
[1]
Mapping Human Genetic Diversity in Asia [J].
Abdulla, Mahmood Ameen ;
Ahmed, Ikhlak ;
Assawamakin, Anunchai ;
Bhak, Jong ;
Brahmachari, Samir K. ;
Calacal, Gayvelline C. ;
Chaurasia, Amit ;
Chen, Chien-Hsiun ;
Chen, Jieming ;
Chen, Yuan-Tsong ;
Chu, Jiayou ;
Cutiongco-de la Paz, Eva Maria C. ;
De Ungria, Maria Corazon A. ;
Delfin, Frederick C. ;
Edo, Juli ;
Fuchareon, Suthat ;
Ghang, Ho ;
Gojobori, Takashi ;
Han, Junsong ;
Ho, Sheng-Feng ;
Hoh, Boon Peng ;
Huang, Wei ;
Inoko, Hidetoshi ;
Jha, Pankaj ;
Jinam, Timothy A. ;
Jin, Li ;
Jung, Jongsun ;
Kangwanpong, Daoroong ;
Kampuansai, Jatupol ;
Kennedy, Giulia C. ;
Khurana, Preeti ;
Kim, Hyung-Lae ;
Kim, Kwangjoong ;
Kim, Sangsoo ;
Kim, Woo-Yeon ;
Kimm, Kuchan ;
Kimura, Ryosuke ;
Koike, Tomohiro ;
Kulawonganunchai, Supasak ;
Kumar, Vikrant ;
Lai, Poh San ;
Lee, Jong-Young ;
Lee, Sunghoon ;
Liu, Edison T. ;
Majumder, Partha P. ;
Mandapati, Kiran Kumar ;
Marzuki, Sangkot ;
Mitchell, Wayne ;
Mukerji, Mitali ;
Naritomi, Kenji .
SCIENCE, 2009, 326 (5959) :1541-1545
[2]
A map of human genome variation from population-scale sequencing [J].
Altshuler, David ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Collins, Francis S. ;
De la Vega, Francisco M. ;
Donnelly, Peter ;
Egholm, Michael ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Knoppers, Bartha M. ;
Lander, Eric S. ;
Lehrach, Hans ;
Mardis, Elaine R. ;
McVean, Gil A. ;
Nickerson, Debbie A. ;
Peltonen, Leena ;
Schafer, Alan J. ;
Sherry, Stephen T. ;
Wang, Jun ;
Wilson, Richard K. ;
Gibbs, Richard A. ;
Deiros, David ;
Metzker, Mike ;
Muzny, Donna ;
Reid, Jeff ;
Wheeler, David ;
Wang, Jun ;
Li, Jingxiang ;
Jian, Min ;
Li, Guoqing ;
Li, Ruiqiang ;
Liang, Huiqing ;
Tian, Geng ;
Wang, Bo ;
Wang, Jian ;
Wang, Wei ;
Yang, Huanming ;
Zhang, Xiuqing ;
Zheng, Huisong ;
Lander, Eric S. ;
Altshuler, David L. ;
Ambrogio, Lauren ;
Bloom, Toby ;
Cibulskis, Kristian ;
Fennell, Tim J. ;
Gabriel, Stacey B. .
NATURE, 2010, 467 (7319) :1061-1073
[3]
Accurate whole human genome sequencing using reversible terminator chemistry [J].
Bentley, David R. ;
Balasubramanian, Shankar ;
Swerdlow, Harold P. ;
Smith, Geoffrey P. ;
Milton, John ;
Brown, Clive G. ;
Hall, Kevin P. ;
Evers, Dirk J. ;
Barnes, Colin L. ;
Bignell, Helen R. ;
Boutell, Jonathan M. ;
Bryant, Jason ;
Carter, Richard J. ;
Cheetham, R. Keira ;
Cox, Anthony J. ;
Ellis, Darren J. ;
Flatbush, Michael R. ;
Gormley, Niall A. ;
Humphray, Sean J. ;
Irving, Leslie J. ;
Karbelashvili, Mirian S. ;
Kirk, Scott M. ;
Li, Heng ;
Liu, Xiaohai ;
Maisinger, Klaus S. ;
Murray, Lisa J. ;
Obradovic, Bojan ;
Ost, Tobias ;
Parkinson, Michael L. ;
Pratt, Mark R. ;
Rasolonjatovo, Isabelle M. J. ;
Reed, Mark T. ;
Rigatti, Roberto ;
Rodighiero, Chiara ;
Ross, Mark T. ;
Sabot, Andrea ;
Sankar, Subramanian V. ;
Scally, Aylwyn ;
Schroth, Gary P. ;
Smith, Mark E. ;
Smith, Vincent P. ;
Spiridou, Anastassia ;
Torrance, Peta E. ;
Tzonev, Svilen S. ;
Vermaas, Eric H. ;
Walter, Klaudia ;
Wu, Xiaolin ;
Zhang, Lu ;
Alam, Mohammed D. ;
Anastasi, Carole .
NATURE, 2008, 456 (7218) :53-59
[4]
The landscape of somatic copy-number alteration across human cancers [J].
Beroukhim, Rameen ;
Mermel, Craig H. ;
Porter, Dale ;
Wei, Guo ;
Raychaudhuri, Soumya ;
Donovan, Jerry ;
Barretina, Jordi ;
Boehm, Jesse S. ;
Dobson, Jennifer ;
Urashima, Mitsuyoshi ;
Mc Henry, Kevin T. ;
Pinchback, Reid M. ;
Ligon, Azra H. ;
Cho, Yoon-Jae ;
Haery, Leila ;
Greulich, Heidi ;
Reich, Michael ;
Winckler, Wendy ;
Lawrence, Michael S. ;
Weir, Barbara A. ;
Tanaka, Kumiko E. ;
Chiang, Derek Y. ;
Bass, Adam J. ;
Loo, Alice ;
Hoffman, Carter ;
Prensner, John ;
Liefeld, Ted ;
Gao, Qing ;
Yecies, Derek ;
Signoretti, Sabina ;
Maher, Elizabeth ;
Kaye, Frederic J. ;
Sasaki, Hidefumi ;
Tepper, Joel E. ;
Fletcher, Jonathan A. ;
Tabernero, Josep ;
Baselga, Jose ;
Tsao, Ming-Sound ;
Demichelis, Francesca ;
Rubin, Mark A. ;
Janne, Pasi A. ;
Daly, Mark J. ;
Nucera, Carmelo ;
Levine, Ross L. ;
Ebert, Benjamin L. ;
Gabriel, Stacey ;
Rustgi, Anil K. ;
Antonescu, Cristina R. ;
Ladanyi, Marc ;
Letai, Anthony .
NATURE, 2010, 463 (7283) :899-905
[5]
Bishop C.M., 2006, Pattern recognition and machine learning, DOI 10.1007/978-0-387-45528-0
[6]
Quality scores and SNP detection in sequencing-by-synthesis systems [J].
Brockman, William ;
Alvarez, Pablo ;
Young, Sarah ;
Garber, Manuel ;
Giannoukos, Georgia ;
Lee, William L. ;
Russ, Carsten ;
Lander, Eric S. ;
Nusbaum, Chad ;
Jaffe, David B. .
GENOME RESEARCH, 2008, 18 (05) :763-770
[7]
Simultaneous Genotype Calling and Haplotype Phasing Improves Genotype Accuracy and Reduces False-Positive Associations for Genome-wide Association Studies [J].
Browning, Brian L. ;
Yu, Zhaoxia .
AMERICAN JOURNAL OF HUMAN GENETICS, 2009, 85 (06) :847-861
[8]
Substantial biases in ultra-short read data sets from high-throughput DNA sequencing [J].
Dohm, Juliane C. ;
Lottaz, Claudio ;
Borodina, Tatiana ;
Himmelbauer, Heinz .
NUCLEIC ACIDS RESEARCH, 2008, 36 (16)
[9]
Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays [J].
Drmanac, Radoje ;
Sparks, Andrew B. ;
Callow, Matthew J. ;
Halpern, Aaron L. ;
Burns, Norman L. ;
Kermani, Bahram G. ;
Carnevali, Paolo ;
Nazarenko, Igor ;
Nilsen, Geoffrey B. ;
Yeung, George ;
Dahl, Fredrik ;
Fernandez, Andres ;
Staker, Bryan ;
Pant, Krishna P. ;
Baccash, Jonathan ;
Borcherding, Adam P. ;
Brownley, Anushka ;
Cedeno, Ryan ;
Chen, Linsu ;
Chernikoff, Dan ;
Cheung, Alex ;
Chirita, Razvan ;
Curson, Benjamin ;
Ebert, Jessica C. ;
Hacker, Coleen R. ;
Hartlage, Robert ;
Hauser, Brian ;
Huang, Steve ;
Jiang, Yuan ;
Karpinchyk, Vitali ;
Koenig, Mark ;
Kong, Calvin ;
Landers, Tom ;
Le, Catherine ;
Liu, Jia ;
McBride, Celeste E. ;
Morenzoni, Matt ;
Morey, Robert E. ;
Mutch, Karl ;
Perazich, Helena ;
Perry, Kimberly ;
Peters, Brock A. ;
Peterson, Joe ;
Pethiyagoda, Charit L. ;
Pothuraju, Kaliprasad ;
Richter, Claudia ;
Rosenbaum, Abraham M. ;
Roy, Shaunak ;
Shafto, Jay ;
Sharanhovich, Uladzislau .
SCIENCE, 2010, 327 (5961) :78-81
[10]
Durbin R., 1998, Biological sequence analysis: probabilistic models of proteins and nucleic acids