HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data

被引:79
作者
Aguiar, Derek [1 ,2 ]
Istrail, Sorin [1 ,2 ]
机构
[1] Brown Univ, Dept Comp Sci, Providence, RI 02912 USA
[2] Brown Univ, Ctr Computat Mol Biol, Providence, RI 02912 USA
基金
美国国家科学基金会;
关键词
algorithms; sequence analysis; haplotypes; next generation sequencing; graph theory;
D O I
10.1089/cmb.2012.0084
中图分类号
Q5 [生物化学];
学科分类号
070307 [化学生物学];
摘要
Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods of determining haplotype phase from sequence data-known as haplotype assembly-have difficulties producing accurate results for large (1000 genomes-type) data or operate on restricted optimizations that are unrealistic considering modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut for 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better for a variety of data and metrics. HapCompass is freely available for download (www.brown.edu/Research/Istrail_Lab/).
引用
收藏
页码:577 / 590
页数:14
相关论文
共 25 条
[1]
A map of human genome variation from population-scale sequencing [J].
Altshuler, David ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Collins, Francis S. ;
De la Vega, Francisco M. ;
Donnelly, Peter ;
Egholm, Michael ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Knoppers, Bartha M. ;
Lander, Eric S. ;
Lehrach, Hans ;
Mardis, Elaine R. ;
McVean, Gil A. ;
Nickerson, DebbieA. ;
Peltonen, Leena ;
Schafer, Alan J. ;
Sherry, Stephen T. ;
Wang, Jun ;
Wilson, Richard K. ;
Gibbs, Richard A. ;
Deiros, David ;
Metzker, Mike ;
Muzny, Donna ;
Reid, Jeff ;
Wheeler, David ;
Wang, Jun ;
Li, Jingxiang ;
Jian, Min ;
Li, Guoqing ;
Li, Ruiqiang ;
Liang, Huiqing ;
Tian, Geng ;
Wang, Bo ;
Wang, Jian ;
Wang, Wei ;
Yang, Huanming ;
Zhang, Xiuqing ;
Zheng, Huisong ;
Lander, Eric S. ;
Altshuler, David L. ;
Ambrogio, Lauren ;
Bloom, Toby ;
Cibulskis, Kristian ;
Fennell, Tim J. ;
Gabriel, Stacey B. .
NATURE, 2010, 467 (7319) :1061-1073
[2]
Polynomial and APX-hard cases of the individual haplotyping problem [J].
Bafna, V ;
Istrail, S ;
Lancia, G ;
Rizzi, R .
THEORETICAL COMPUTER SCIENCE, 2005, 335 (01) :109-125
[3]
HapCUT: an efficient and accurate algorithm for the haplotype assembly problem [J].
Bansal, Vikas ;
Bafna, Vineet .
BIOINFORMATICS, 2008, 24 (16) :I153-I159
[4]
An MCMC algorithm for haplotype assembly from whole-genome sequence data [J].
Bansal, Vikas ;
Halpern, Aaron L. ;
Axelrod, Nelson ;
Bafna, Vineet .
GENOME RESEARCH, 2008, 18 (08) :1336-1346
[5]
Haplotype phasing: existing methods and new developments [J].
Browning, Sharon R. ;
Browning, Brian L. .
NATURE REVIEWS GENETICS, 2011, 12 (10) :703-714
[6]
ALGORITHMS FOR GENERATING FUNDAMENTAL CYCLES IN A GRAPH [J].
DEO, N ;
PRABHU, GM ;
KRISHNAMOORTHY, MS .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 1982, 8 (01) :26-42
[7]
A framework for variation discovery and genotyping using next-generation DNA sequencing data [J].
DePristo, Mark A. ;
Banks, Eric ;
Poplin, Ryan ;
Garimella, Kiran V. ;
Maguire, Jared R. ;
Hartl, Christopher ;
Philippakis, Anthony A. ;
del Angel, Guillermo ;
Rivas, Manuel A. ;
Hanna, Matt ;
McKenna, Aaron ;
Fennell, Tim J. ;
Kernytsky, Andrew M. ;
Sivachenko, Andrey Y. ;
Cibulskis, Kristian ;
Gabriel, Stacey B. ;
Altshuler, David ;
Daly, Mark J. .
NATURE GENETICS, 2011, 43 (05) :491-+
[8]
Halldórsson BV, 2011, BIOCOMPUT-PAC SYM, P88
[9]
Halldórsson BV, 2004, LECT N BIOINFORMAT, V2983, P26
[10]
Optimal algorithms for haplotype assembly from whole-genome sequence data [J].
He, Dan ;
Choi, Arthur ;
Pipatsrisawat, Knot ;
Darwiche, Adnan ;
Eskin, Eleazar .
BIOINFORMATICS, 2010, 26 (12) :i183-i190