HAPLOFREQ - Estimating haplotype frequencies efficiently

被引:10
作者
Halperin, E
Hazan, E
机构
[1] Int Comp Sci Inst, Berkeley, CA USA
[2] Princeton Univ, Dept Comp Sci, Princeton, NJ 08540 USA
关键词
phasing; haplotypes; SNPs; expectation maximization;
D O I
10.1089/cmb.2006.13.481
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
A commonly used tool in disease association studies is the search for discrepancies between the haplotype distribution in the case and control populations. In order to find this discrepancy, the haplotypes frequency in each of the populations is estimated from the genotypes. We present a new method HAPLOFREQ to estimate haplotype frequencies over a short genomic region given the genotypes or haplotypes with missing data or sequencing errors. Our approach incorporates a maximum likelihood model based on a simple random generative model which assumes that the genotypes are independently sampled from the population. We first show that if the phased haplotypes are given, possibly with missing data, we can estimate the frequency of the haplotypes in the population by finding the global optimum of the likelihood function in polynomial time. If the haplotypes are not phased, finding the maximum value of the likelihood function is NP-hard. In this case, we define an alternative likelihood function which can be thought of as a relaxed likelihood function. We show that the maximum relaxed likelihood can be found in polynomial time and that the optimal solution of the relaxed likelihood approaches asymptotically to the haplotype frequencies in the population. In contrast to previous approaches, our algorithms are guaranteed to converge in polynomial time to a global maximum of the different likelihood functions. We compared the performance of our algorithm to the widely used program PHASE, and we found that our estimates are at least 10% more accurate than PHASE and about ten times faster than PHASE. Our techniques involve new algorithms in convex optimization. These algorithms may be of independent interest. Particularly, they may be helpful in other maximum likelihood problems arising from survey sampling.
引用
收藏
页码:481 / 500
页数:20
相关论文
共 27 条
[1]  
ALON N, 1992, PROBABILISTIC METHOD
[2]  
BELLARE M, 1992, J MATH PROGRAMMING B, V69, P429
[3]  
CLARK AG, 1990, MOL BIOL EVOL, V7, P111
[4]   High-resolution haplotype structure in the human genome [J].
Daly, MJ ;
Rioux, JD ;
Schaffner, SE ;
Hudson, TJ ;
Lander, ES .
NATURE GENETICS, 2001, 29 (02) :229-232
[5]  
EXCOFFIER L, 1995, MOL BIOL EVOL, V12, P921
[6]   Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data [J].
Fallin, D ;
Schork, NJ .
AMERICAN JOURNAL OF HUMAN GENETICS, 2000, 67 (04) :947-959
[7]   The structure of haplotype blocks in the human genome [J].
Gabriel, SB ;
Schaffner, SF ;
Nguyen, H ;
Moore, JM ;
Roy, J ;
Blumenstiel, B ;
Higgins, J ;
DeFelice, M ;
Lochner, A ;
Faggart, M ;
Liu-Cordero, SN ;
Rotimi, C ;
Adeyemo, A ;
Cooper, R ;
Ward, R ;
Lander, ES ;
Daly, MJ ;
Altshuler, D .
SCIENCE, 2002, 296 (5576) :2225-2229
[8]  
Grotschel M, 2012, Geometric algorithms and combinatorial optimization, V2
[9]   Inference of haplotypes from samples of diploid populations: Complexity and algorithms [J].
Gusfield, D .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2001, 8 (03) :305-323
[10]  
GUSFIELD D, 2000, P 8 INT C INT SYST M