A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase

被引:1441
作者
Scheet, P [1 ]
Stephens, M [1 ]
机构
[1] Univ Washington, Dept Stat, Seattle, WA 98195 USA
关键词
D O I
10.1086/502802
中图分类号
Q3 [遗传学];
学科分类号
071007 ; 090102 ;
摘要
We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium ( LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets ( e. g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods ( e. g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
引用
收藏
页码:629 / 644
页数:16
相关论文
共 28 条
[1]   NEW LOOK AT STATISTICAL-MODEL IDENTIFICATION [J].
AKAIKE, H .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1974, AC19 (06) :716-723
[2]   A haplotype map of the human genome [J].
Altshuler, D ;
Brooks, LD ;
Chakravarti, A ;
Collins, FS ;
Daly, MJ ;
Donnelly, P ;
Gibbs, RA ;
Belmont, JW ;
Boudreau, A ;
Leal, SM ;
Hardenbol, P ;
Pasternak, S ;
Wheeler, DA ;
Willis, TD ;
Yu, FL ;
Yang, HM ;
Zeng, CQ ;
Gao, Y ;
Hu, HR ;
Hu, WT ;
Li, CH ;
Lin, W ;
Liu, SQ ;
Pan, H ;
Tang, XL ;
Wang, J ;
Wang, W ;
Yu, J ;
Zhang, B ;
Zhang, QR ;
Zhao, HB ;
Zhao, H ;
Zhou, J ;
Gabriel, SB ;
Barry, R ;
Blumenstiel, B ;
Camargo, A ;
Defelice, M ;
Faggart, M ;
Goyette, M ;
Gupta, S ;
Moore, J ;
Nguyen, H ;
Onofrio, RC ;
Parkin, M ;
Roy, J ;
Stahl, E ;
Winchester, E ;
Ziaugra, L ;
Shen, Y .
NATURE, 2005, 437 (7063) :1299-1320
[3]   COMBINATION OF FORECASTS [J].
BATES, JM ;
GRANGER, CWJ .
OPERATIONAL RESEARCH QUARTERLY, 1969, 20 (04) :451-&
[4]   Bagging predictors [J].
Breiman, L .
MACHINE LEARNING, 1996, 24 (02) :123-140
[5]   Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power [J].
Chapman, JM ;
Cooper, JD ;
Todd, JA ;
Clayton, DG .
HUMAN HEREDITY, 2003, 56 (1-3) :18-31
[6]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[7]   Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data [J].
Fallin, D ;
Schork, NJ .
AMERICAN JOURNAL OF HUMAN GENETICS, 2000, 67 (04) :947-959
[8]  
Falush D, 2003, GENETICS, V164, P1567
[9]  
Freund Y, 1996, ICML
[10]   Model-based inference of haplotype block variation [J].
Greenspan, G ;
Geiger, D .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2004, 11 (2-3) :495-506