Inference of population structure under a Dirichlet process model

Cited by: 210
Authors
Huelsenbeck, John P. [1]
Andolfatto, Peter
Affiliations
[1] Univ Calif San Diego, Sect Ecol Behav & Evolut, Div Biol Sci, La Jolla, CA 92093 USA
[2] Univ Calif Berkeley, Dept Integrat Biol, Berkeley, CA 94720 USA
Keywords
IMPALA AEPYCEROS-MELAMPUS; MULTILOCUS GENOTYPE DATA; NONPARAMETRIC PROBLEMS; BAYESIAN-ANALYSIS; GENETIC-STRUCTURE; NEUTRAL MODEL; ASSOCIATION; MIXTURE; LOCI; DIFFERENTIATION;
DOI
10.1534/genetics.106.061317
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Inferring population structure from genetic data sampled from some number of individuals is a formidable statistical problem. One widely used approach considers the number of populations to be fixed and calculates the posterior probability of assigning individuals to each population. More recently, the assignment of individuals to populations and the number of populations have both been considered random variables that follow a Dirichlet process prior. We examined the statistical behavior of assignment of individuals to populations under a Dirichlet process prior. First, we examined a best-case scenario, in which all of the assumptions of the Dirichlet process prior were satisfied, by generating data under a Dirichlet process prior. Second, we examined the performance of the method when the genetic data were generated under a population genetics model with symmetric migration between populations. We examined the accuracy of population assignment using a distance on partitions. The method can be quite accurate with a moderate number of loci. As expected, inferences on the number of populations are more accurate when θ = 4N_e u is large and when the migration rate (4N_e m) is low. We also examined the sensitivity of inferences of population structure to the choice of the parameter of the Dirichlet process model. Although inferences could be sensitive to the choice of the prior on the number of populations, this sensitivity occurred when the number of loci sampled was small; inferences are more robust to the prior on the number of populations when the number of sampled loci is large. Finally, we discuss several methods for summarizing the results of a Bayesian Markov chain Monte Carlo (MCMC) analysis of population structure. We develop the notion of the mean population partition, which is the partition of individuals to populations that minimizes the squared partition distance to the partitions sampled by the MCMC algorithm.
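The partition distance and the mean population partition described in the abstract can be illustrated with a small sketch. It assumes the partition distance is the minimum number of individuals that must be removed for two partitions to agree, computed here by brute force over cluster matchings, which is practical only for a handful of clusters. The function names `partition_distance` and `mean_partition_from_samples` are hypothetical, and the search is restricted to the sampled partitions themselves as a simplification; the abstract does not specify how the authors carry out the minimization.

```python
from itertools import permutations


def partition_distance(a, b):
    """Minimum number of individuals to remove so partitions a and b agree.

    a and b are equal-length sequences of cluster labels, one per individual.
    Labels are arbitrary: [0, 0, 1] and [5, 5, 2] describe the same partition.
    """
    if len(a) != len(b):
        raise ValueError("partitions must cover the same individuals")
    if not a:
        return 0
    index_a = {label: i for i, label in enumerate(sorted(set(a)))}
    index_b = {label: j for j, label in enumerate(sorted(set(b)))}
    # overlap[i][j] = number of individuals in cluster i of a and cluster j of b
    overlap = [[0] * len(index_b) for _ in index_a]
    for x, y in zip(a, b):
        overlap[index_a[x]][index_b[y]] += 1
    # Put the smaller number of clusters on the rows so every row gets matched.
    if len(overlap) > len(overlap[0]):
        overlap = [list(col) for col in zip(*overlap)]
    n_rows, n_cols = len(overlap), len(overlap[0])
    # Brute-force the one-to-one cluster matching with the largest total overlap.
    best = max(
        sum(overlap[r][c] for r, c in enumerate(perm))
        for perm in permutations(range(n_cols), n_rows)
    )
    return len(a) - best


def mean_partition_from_samples(samples):
    """Sampled partition minimizing the summed squared distance to all samples.

    Simplifying assumption: only sampled partitions are considered as candidates.
    """
    return min(
        samples,
        key=lambda cand: sum(partition_distance(cand, s) ** 2 for s in samples),
    )


if __name__ == "__main__":
    mcmc_samples = [  # toy stand-in for partitions sampled by an MCMC run
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    print(partition_distance([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))
    print(mean_partition_from_samples(mcmc_samples))
```

For realistic numbers of populations, the brute-force matching would normally be replaced by a polynomial-time assignment solver; the exhaustive version is kept here only so the sketch has no dependencies beyond the standard library.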
Pages: 1787-1802
Page count: 16