Estimating the number of unseen variants in the human genome

Times cited: 52
Authors
Ionita-Laza, Iuliana [1]
Lange, Christoph [1]
Laird, Nan M. [1]
Affiliations
[1] Harvard Univ, Sch Publ Hlth, Dept Biostat, Boston, MA 02115 USA
Keywords
1000 Genomes Project; beta-binomial model; CNVs; sequence data; SNP; WIDE ASSOCIATION; ALLELES; REGIONS;
DOI
10.1073/pnas.0807815106
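Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract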
The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals that must be sequenced in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist and the Asian populations the least diverse, with the European population in between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%), the number of individuals that need to be sequenced is small (approximately 350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1% and 1%).
The data reveal a rule of diminishing returns: a small number of individuals (approximately 150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.
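The diminishing-returns pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' estimation procedure; it only assumes that variant allele frequencies follow a Beta(a, b) distribution and that a variant with frequency p is missed in n independently sampled chromosomes with probability (1 - p)^n. The function names, the parameter values, and the simple midpoint integration are all illustrative assumptions.

```python
import math


def log_beta_pdf(p, a, b):
    """Log density of Beta(a, b) at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)


def detected_fraction(n, a, b, f_min, steps=20000):
    """Expected fraction of variants with frequency >= f_min that are
    observed at least once among n sampled chromosomes.

    Assumes (hypothetically) that frequencies follow Beta(a, b); the
    integral over [f_min, 1] is approximated with a midpoint rule.
    """
    h = (1.0 - f_min) / steps
    mass = 0.0    # weight of variants with frequency >= f_min
    unseen = 0.0  # weight of those variants missed by all n draws
    for i in range(steps):
        p = f_min + (i + 0.5) * h
        w = math.exp(log_beta_pdf(p, a, b))
        mass += w
        unseen += w * (1.0 - p) ** n
    return 1.0 - unseen / mass


if __name__ == "__main__":
    # Hypothetical shape parameters, chosen only for illustration.
    for n in (100, 300, 1000, 6000):
        print(n, round(detected_fraction(n, 0.5, 2.0, 0.001), 4))
```

Under these assumptions the detected fraction rises quickly for small n and then flattens, because the remaining unseen variants are concentrated near the minimum frequency.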
Pages: 5008-5013
Page count: 6