Computational Techniques for Human Genome Resequencing Using Mated Gapped Reads

Cited by: 69
Authors
Carnevali, Paolo [1 ]
Baccash, Jonathan [1 ]
Halpern, Aaron L. [1 ]
Nazarenko, Igor [1 ]
Nilsen, Geoffrey B. [1 ]
Pant, Krishna P. [1 ]
Ebert, Jessica C. [1 ]
Brownley, Anushka [1 ]
Morenzoni, Matt [1 ]
Karpinchyk, Vitali [1 ]
Martin, Bruce [1 ]
Ballinger, Dennis G. [1 ]
Drmanac, Radoje [1 ]
Affiliation
[1] Complete Genomics Inc, Mountain View, CA 94043 USA
Keywords
genomics; sequence assembly; sequence analysis; statistical models; DNA sequencing data; generation; discovery; framework; alignment; accuracy; patient
DOI
10.1089/cmb.2011.0201
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.
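
The iterative optimization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a haploid sequence, ungapped and unpaired reads, a flat prior (so the posterior is proportional to the read likelihood), a uniform per-base error rate, and a seed taken directly from the reference rather than from a local de novo assembly. All function names and parameters are hypothetical.

```python
import math

ALPHABET = "ACGT"

def read_log_likelihood(read, seq, error_rate=0.01):
    """Toy read model: the read starts at a uniformly random window of seq,
    and each base is miscalled independently with probability error_rate."""
    k, n = len(read), len(seq)
    if n < k:
        return -math.inf  # the sequence is too short to have produced the read
    windows = n - k + 1
    total = 0.0
    for start in range(windows):
        mismatches = sum(a != b for a, b in zip(read, seq[start:start + k]))
        total += (error_rate / 3) ** mismatches * (1 - error_rate) ** (k - mismatches)
    return math.log(total / windows)

def log_score(seq, reads):
    """Log-posterior up to a constant, assuming a flat prior over sequences."""
    return sum(read_log_likelihood(r, seq) for r in reads)

def single_base_edits(seq):
    """All sequences reachable by one substitution, insertion, or deletion."""
    edits = set()
    for i in range(len(seq) + 1):
        for base in ALPHABET:
            edits.add(seq[:i] + base + seq[i:])            # insertion before i
            if i < len(seq) and base != seq[i]:
                edits.add(seq[:i] + base + seq[i + 1:])    # substitution at i
        if i < len(seq):
            edits.add(seq[:i] + seq[i + 1:])               # deletion of i
    edits.discard(seq)
    return sorted(edits)  # sorted for deterministic tie-breaking

def optimize(seed, reads, max_iters=100):
    """Greedy hill climbing: apply the best single-base edit until none helps."""
    current, best = seed, log_score(seed, reads)
    for _ in range(max_iters):
        candidate = max(single_base_edits(current), key=lambda s: log_score(s, reads))
        score = log_score(candidate, reads)
        if score <= best + 1e-9:
            break  # converged (possibly to a local optimum)
        current, best = candidate, score
    return current

# Usage: error-free 6-mers sampled from a sequence carrying one SNP
# relative to the seed.
truth = "GATTACAGTCCGATGC"
seed = "GATTACAATCCGATGC"   # differs from truth at index 7
reads = [truth[i:i + 6] for i in range(len(truth) - 5)]
result = optimize(seed, reads)
```

A greedy search like this can stall at a local optimum when the seed is far from the true sequence, which is the motivation the abstract gives for seeding the optimization with a local de novo assembly.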
Pages: 279-292
Page count: 14