Characterizing Bias in Population Genetic Inferences from Low-Coverage Sequencing Data

Cited by: 68
Authors
Han, Eunjung [1]
Sinsheimer, Janet S. [1, 2]
Novembre, John [3, 4]
Affiliations
[1] Univ Calif Los Angeles, Dept Biostat, Los Angeles, CA USA
[2] Univ Calif Los Angeles, Dept Human Genet & Biomath, Los Angeles, CA USA
[3] Univ Calif Los Angeles, Dept Ecol & Evolut, Los Angeles, CA 90095 USA
[4] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
Funding
US National Institutes of Health (NIH);
Keywords
site frequency spectrum; base-calling errors; maximum likelihood; accuracy; STATISTICAL TESTS; SEGREGATING SITES; GENOME ANALYSIS; MUTATION-RATE; NEUTRALITY; GENOTYPE; HITCHHIKING; FRAMEWORK; POLYMORPHISM; DISCOVERY;
DOI
10.1093/molbev/mst229
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
The site frequency spectrum (SFS) is of primary interest in population genetic studies, because the SFS compresses variation data into a simple summary from which many population genetic inferences can proceed. However, inferring the SFS from sequencing data is challenging because genotype calls from sequencing data are often inaccurate due to high error rates and if not accounted for, this genotype uncertainty can lead to serious bias in downstream analysis based on the inferred SFS. Here, we compare two approaches to estimate the SFS from sequencing data: one approach infers individual genotypes from aligned sequencing reads and then estimates the SFS based on the inferred genotypes (call-based approach) and the other approach directly estimates the SFS from aligned sequencing reads by maximum likelihood (direct estimation approach). We find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) results in underestimation of the number of rare variants, whereas estimating genotypes in each individual and merging them later (single-sample calling) leads to overestimation of rare variants. We characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. Our work highlights that depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference with the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inferences.
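The contrast between the two approaches in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`call_based_sfs`, `site_allele_count_likelihoods`, `direct_sfs`) are hypothetical, per-individual genotype likelihoods are assumed to be already computed from the aligned reads, sites are assumed biallelic with a known ancestral allele, and the maximum likelihood step uses an EM iteration, which is one common way to maximize this kind of likelihood (the abstract does not say which optimizer was used). Only single-sample calling is shown for the call-based approach.

```python
from math import comb

# Input layout (an assumption for this sketch): site_gls[s][i] is a tuple
# (L0, L1, L2) giving P(reads | genotype g) for individual i at site s,
# where g is the number of derived alleles (0, 1 or 2).

def call_based_sfs(site_gls):
    """Call each genotype as the most likely one, then tally allele counts."""
    n = len(site_gls[0])
    counts = [0] * (2 * n + 1)
    for gls in site_gls:
        # Hard genotype call per individual; uncertainty is discarded here.
        k = sum(max(range(3), key=lambda g: gl[g]) for gl in gls)
        counts[k] += 1
    return [c / len(site_gls) for c in counts]

def site_allele_count_likelihoods(gls):
    """Return values proportional to P(reads at site | k derived alleles)."""
    n = len(gls)
    h = [1.0]
    for gl in gls:
        # Dynamic programming over individuals: add 0, 1 or 2 derived alleles.
        new = [0.0] * (len(h) + 2)
        for k, hk in enumerate(h):
            for g in range(3):
                new[k + g] += hk * comb(2, g) * gl[g]
        h = new
    # Normalize by the number of ways to place k alleles on 2n chromosomes.
    return [h[k] / comb(2 * n, k) for k in range(2 * n + 1)]

def direct_sfs(site_gls, iterations=100):
    """Estimate the SFS from genotype likelihoods by EM, without calling."""
    n = len(site_gls[0])
    size = 2 * n + 1
    like = [site_allele_count_likelihoods(gls) for gls in site_gls]
    sfs = [1.0 / size] * size  # flat starting point
    for _ in range(iterations):
        acc = [0.0] * size
        for h in like:
            # E-step: posterior over the allele count at this site.
            post = [p * l for p, l in zip(sfs, h)]
            z = sum(post)
            for k in range(size):
                acc[k] += post[k] / z
        # M-step: the new SFS is the average posterior across sites.
        sfs = [a / len(like) for a in acc]
    return sfs
```

When every genotype likelihood is unambiguous, the two functions return the same spectrum. They diverge once the likelihoods are diffuse, as they are at low coverage, because the call-based version commits to a single genotype per individual while the direct version averages over all of them.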
Pages: 723-735
Page count: 13