Reducing INDEL calling errors in whole genome and exome sequencing data

Cited by: 113
Authors
Fang, Han [1, 2, 3]
Wu, Yiyang [1, 2]
Narzisi, Giuseppe [3, 4]
O'Rawe, Jason A. [1, 2]
Jimenez Barron, Laura T. [1, 5]
Rosenbaum, Julie [3]
Ronemus, Michael [3]
Iossifov, Ivan [3]
Schatz, Michael C. [3]
Lyon, Gholson J. [1, 2]
Affiliations
[1] Cold Spring Harbor Lab, Stanley Inst Cognit Genom, Cold Spring Harbor, NY 11724 USA
[2] SUNY Stony Brook, Stony Brook, NY 11794 USA
[3] Cold Spring Harbor Lab, Simons Ctr Quantitat Biol, Cold Spring Harbor, NY 11724 USA
[4] New York Genome Ctr, New York, NY USA
[5] Univ Nacl Autonoma Mexico, Ctr Ciencias Genom, Cuernavaca 62191, Morelos, Mexico
Source
GENOME MEDICINE | 2014 / Vol. 6
Funding
National Science Foundation (USA); National Institutes of Health (USA);
Keywords
DE-NOVO; SMALL INSERTIONS; VARIANTS; ACCURATE; GENE; DELETIONS; ALIGNMENT; MEDICINE; IDENTIFY; READ;
DOI
10.1186/s13073-014-0089-z
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Background: INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, there are still many errors in INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts.
Methods: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on coverage and composition to rank high-quality and low-quality INDEL calls. We performed a large-scale validation experiment on 600 loci and found that high-quality INDELs have a substantially lower error rate than low-quality INDELs (7% vs. 51%).
Results: Simulation and experimental data show that assembly-based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (53%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (84% vs. 57%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection of heterozygous INDELs with Scalpel requires 1.2-fold higher coverage than that for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data.
Conclusions: Overall, we show that the accuracy of INDEL detection with WGS is much greater than with WES, even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (for example, capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.
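The concordance figures quoted in the abstract compare two INDEL call sets made on the same sample. The abstract does not state how concordance was computed, so the following is only a minimal sketch under an assumed definition: exact-match calls shared by both sets, divided by the union of the two sets. The `Indel` record, the `concordance` and `platform_specific` helpers, and the toy calls are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Indel(NamedTuple):
    """Hypothetical minimal INDEL record, keyed VCF-style."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordance(calls_a: Iterable[Indel], calls_b: Iterable[Indel]) -> float:
    """Assumed definition: shared exact-match calls / union of both call sets."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 0.0  # no calls in either set: report 0 rather than divide by zero
    return len(a & b) / len(union)


def platform_specific(
    calls_a: Iterable[Indel], calls_b: Iterable[Indel]
) -> Tuple[Set[Indel], Set[Indel]]:
    """Return (calls only in A, calls only in B)."""
    a, b = set(calls_a), set(calls_b)
    return a - b, b - a


# Toy data, invented for illustration only.
wgs = [
    Indel("chr1", 100, "A", "AT"),
    Indel("chr1", 250, "GTC", "G"),
    Indel("chr2", 40, "C", "CAAAA"),
    Indel("chr3", 7, "TAG", "T"),
]
wes = [
    Indel("chr1", 100, "A", "AT"),
    Indel("chr1", 250, "GTC", "G"),
    Indel("chr5", 900, "G", "GA"),
]

only_wgs, only_wes = platform_specific(wgs, wes)
print(f"concordance: {concordance(wgs, wes):.0%}")  # 2 shared / 5 in union = 40%
print(f"WGS-specific: {len(only_wgs)}, WES-specific: {len(only_wes)}")
```

Exact matching on position and alleles is a strict criterion. A position-only match, or a match after left-normalizing the alleles, would generally report a higher concordance for the same data.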
Pages: 17