AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

被引:69
作者
Abyzov, Alexej [1 ,2 ]
Gerstein, Mark [1 ,2 ,3 ]
机构
[1] Yale Univ, Program Computat Biol & Bioinformat, New Haven, CT 06520 USA
[2] Yale Univ, Dept Mol Biophys & Biochem, New Haven, CT 06520 USA
[3] Yale Univ, Dept Comp Sci, New Haven, CT 06520 USA
基金
美国国家卫生研究院;
关键词
ALGORITHM; SEQUENCE;
D O I
10.1093/bioinformatics/btq713
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is an important problem, as it is a prerequisite for classifying SVs, evaluating their functional impact and reconstructing personal genome sequences. Given approximate breakpoint locations and a bridging assembly or split read, the problem essentially reduces to finding a correct sequence alignment. Classical algorithms for alignment and their generalizations guarantee finding the optimal (in terms of scoring) global or local alignment of two sequences. However, they cannot generally be applied to finding the biologically correct alignment of genomic sequences containing SVs because of the need to simultaneously span the SV (e. g. make a large gap) and perform precise local alignments at the flanking ends. Results: Here, we formulate the computations involved in this problem and describe a dynamic-programming algorithm for its solution. Specifically, our algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5' and 3' ends of two given sequences and introducing a 'large-gap jump' between the local end alignments to maximize the total alignment score. We also describe extensions allowing the application of AGE to tandem duplications, inversions and complex events involving two large gaps. We develop a memory-efficient implementation of AGE (allowing application to long contigs) and make it available as a downloadable software package. Finally, we applied AGE for breakpoint determination and standardization in the 1000 Genomes Project by aligning locally assembled contigs to the human genome.
引用
收藏
页码:595 / 603
页数:9
相关论文
共 19 条
  • [1] ABYZOV A, 2011, GENOME RES UNPUB
  • [2] A map of human genome variation from population-scale sequencing
    Altshuler, David
    Durbin, Richard M.
    Abecasis, Goncalo R.
    Bentley, David R.
    Chakravarti, Aravinda
    Clark, Andrew G.
    Collins, Francis S.
    De la Vega, Francisco M.
    Donnelly, Peter
    Egholm, Michael
    Flicek, Paul
    Gabriel, Stacey B.
    Gibbs, Richard A.
    Knoppers, Bartha M.
    Lander, Eric S.
    Lehrach, Hans
    Mardis, Elaine R.
    McVean, Gil A.
    Nickerson, DebbieA.
    Peltonen, Leena
    Schafer, Alan J.
    Sherry, Stephen T.
    Wang, Jun
    Wilson, Richard K.
    Gibbs, Richard A.
    Deiros, David
    Metzker, Mike
    Muzny, Donna
    Reid, Jeff
    Wheeler, David
    Wang, Jun
    Li, Jingxiang
    Jian, Min
    Li, Guoqing
    Li, Ruiqiang
    Liang, Huiqing
    Tian, Geng
    Wang, Bo
    Wang, Jian
    Wang, Wei
    Yang, Huanming
    Zhang, Xiuqing
    Zheng, Huisong
    Lander, Eric S.
    Altshuler, David L.
    Ambrogio, Lauren
    Bloom, Toby
    Cibulskis, Kristian
    Fennell, Tim J.
    Gabriel, Stacey B.
    [J]. NATURE, 2010, 467 (7319) : 1061 - 1073
  • [3] Chao K M, 1994, J Comput Biol, V1, P271, DOI 10.1089/cmb.1994.1.271
  • [4] AN IMPROVED ALGORITHM FOR MATCHING BIOLOGICAL SEQUENCES
    GOTOH, O
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1982, 162 (03) : 705 - 708
  • [5] LINEAR SPACE ALGORITHM FOR COMPUTING MAXIMAL COMMON SUBSEQUENCES
    HIRSCHBERG, DS
    [J]. COMMUNICATIONS OF THE ACM, 1975, 18 (06) : 341 - 343
  • [6] A generalized global alignment algorithm
    Huang, XQ
    Chao, KM
    [J]. BIOINFORMATICS, 2003, 19 (02) : 228 - 233
  • [7] Kent WJ, 2002, GENOME RES, V12, P656, DOI [10.1101/gr.229202, 10.1101/gr.229202. Article published online before March 2002]
  • [8] Mapping and sequencing of structural variation from eight human genomes (Reprinted from Nature, vol 453, pg 56-64, 2008)
    Kidd, Jeffrey M.
    Cooper, Gregory M.
    Donahue, William F.
    Hayden, Hillary S.
    Sampas, Nick
    Graves, Tina
    Hansen, Nancy
    Teague, Brian
    Alkan, Can
    Antonacci, Francesca
    Haugen, Eric
    Zerr, Troy
    Yamada, N. Alice
    Tsang, Peter
    Newman, Tera L.
    Tuzun, Eray
    Cheng, Ze
    Ebling, Heather M.
    Tusneem, Nadeem
    David, Robert
    Gillett, Will
    Phelps, Karen A.
    Weaver, Molly
    Saranga, David
    Brand, Adrianne
    Tao, Wei
    Gustafson, Erik
    McKernan, Kevin
    Chen, Lin
    Malig, Maika
    Smith, Joshua D.
    Korn, Joshua M.
    McCarroll, Steven A.
    Altshuler, David A.
    Peiffer, Daniel A.
    Dorschner, Michael
    Stamatoyannopoulos, John
    Schwartz, David
    Nickerson, Deborah A.
    Mullikin, James C.
    Wilson, Richard K.
    Bruhn, Laurakay
    Olson, Maynard V.
    Kaul, Rajinder
    Smith, Douglas R.
    Eichler, Evan E.
    [J]. NATURE GENETICS, 2009, : S22 - S30
  • [9] PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data
    Korbel, Jan O.
    Abyzov, Alexej
    Mu, Xinmeng Jasmine
    Carriero, Nicholas
    Cayting, Philip
    Zhang, Zhengdong
    Snyder, Michael
    Gerstein, Mark B.
    [J]. GENOME BIOLOGY, 2009, 10 (02):
  • [10] Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library
    Lam, Hugo Y. K.
    Mu, Xinmeng Jasmine
    Stuetz, Adrian M.
    Tanzer, Andrea
    Cayting, Philip D.
    Snyder, Michael
    Kim, Philip M.
    Korbel, Jan O.
    Gerstein, Mark B.
    [J]. NATURE BIOTECHNOLOGY, 2010, 28 (01) : 47 - U76