Progressive Cactus is a multiple-genome aligner for the thousand-genome era

Cited by: 301
Authors
Armstrong, Joel [1]
Hickey, Glenn [1]
Diekhans, Mark [1]
Fiddes, Ian T. [1]
Novak, Adam M. [1]
Deran, Alden [1]
Fang, Qi [2, 3]
Xie, Duo [2, 4]
Feng, Shaohong [2, 5]
Stiller, Josefin [3]
Genereux, Diane [6]
Johnson, Jeremy [6]
Marinescu, Voichita Dana [7]
Alfoldi, Jessica [6]
Harris, Robert S. [8]
Lindblad-Toh, Kerstin [6, 7]
Haussler, David [9]
Karlsson, Elinor [6, 10, 11]
Jarvis, Erich D. [9, 12]
Zhang, Guojie [3, 5, 13, 14]
Paten, Benedict [1]
Affiliations
[1] UC Santa Cruz, Genom Inst, Santa Cruz, CA 95064 USA
[2] BGI Shenzhen, Beishan Ind Zone, Shenzhen, Peoples R China
[3] Univ Copenhagen, Dept Biol, Sect Ecol & Evolut, Copenhagen, Denmark
[4] Univ Chinese Acad Sci, BGI Educ Ctr, Shenzhen, Peoples R China
[5] Chinese Acad Sci, Kunming Inst Zool, State Key Lab Genet Resources & Evolut, Kunming, Yunnan, Peoples R China
[6] Broad Inst Harvard & Massachusetts Inst Technol M, Cambridge, MA USA
[7] Uppsala Univ, Dept Med Biochem & Microbiol, Sci Life Lab, Uppsala, Sweden
[8] Penn State Univ, Dept Biol, University Pk, PA 16802 USA
[9] Howard Hughes Med Inst, Chevy Chase, MD USA
[10] Univ Massachusetts, Sch Med, Program Mol Med, Worcester, MA USA
[11] Univ Massachusetts, Sch Med, Bioinformat & Integrat Biol, Worcester, MA USA
[12] Rockefeller Univ, Lab Neurogenet Language, 1230 York Ave, New York, NY 10021 USA
[13] Chinese Acad Sci, Ctr Excellence Anim Evolut & Genet, Kunming, Yunnan, Peoples R China
[14] BGI Shenzhen, China Natl GeneBank, Shenzhen, Peoples R China
Keywords
MAXIMUM-LIKELIHOOD; ALIGNMENT; TREES; EVOLUTION; RESOURCE; MOUSE
DOI
10.1038/s41586-020-2871-y
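Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
070301 [Inorganic chemistry]; 070403 [Astrophysics]; 070507 [Natural resources and territorial spatial planning]; 090105 [Crop production systems and ecological engineering]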
Abstract
New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies (1-3). For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database (4) increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies (5) are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus (6), a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.
Pages: 246 / +
Page count: 18
Related papers
64 entries in total
[1]
Armstrong J, 2019, THESIS
[2]
Whole-Genome Alignment and Comparative Annotation [J].
Armstrong, Joel ;
Fiddes, Ian T. ;
Diekhans, Mark ;
Paten, Benedict .
ANNUAL REVIEW OF ANIMAL BIOSCIENCES, VOL 7, 2019, 7 :41-64
[3]
Bao W, 2015, MOB DNA, V6
[4]
Aligning multiple genomic sequences with the threaded blockset aligner [J].
Blanchette, M ;
Kent, WJ ;
Riemer, C ;
Elnitski, L ;
Smit, AFA ;
Roskin, KM ;
Baertsch, R ;
Rosenbloom, K ;
Clawson, H ;
Green, ED ;
Haussler, D ;
Miller, W .
GENOME RESEARCH, 2004, 14 (04) :708-715
[5]
MAVID: Constrained ancestral alignment of multiple sequences [J].
Bray, N ;
Pachter, L .
GENOME RESEARCH, 2004, 14 (04) :693-699
[6]
BLAST+: architecture and applications [J].
Camacho, Christiam ;
Coulouris, George ;
Avagyan, Vahram ;
Ma, Ning ;
Papadopoulos, Jason ;
Bealer, Kevin ;
Madden, Thomas L. .
BMC BIOINFORMATICS, 2009, 10
[7]
Variation in the Ratio of Nucleotide Substitution and Indel Rates across Genomes in Mammals and Bacteria [J].
Chen, Jian-Qun ;
Wu, Ying ;
Yang, Haiwang ;
Bergelson, Joy ;
Kreitman, Martin ;
Tian, Dacheng .
MOLECULAR BIOLOGY AND EVOLUTION, 2009, 26 (07) :1523-1531
[8]
Chiaromonte F, 2002, Pac Symp Biocomput, P115
[9]
progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement [J].
Darling, Aaron E. ;
Mau, Bob ;
Perna, Nicole T. .
PLOS ONE, 2010, 5 (06)
[10]
Genomic legacy of the African cheetah, Acinonyx jubatus [J].
Dobrynin, Pavel ;
Liu, Shiping ;
Tamazian, Gaik ;
Xiong, Zijun ;
Yurchenko, Andrey A. ;
Krasheninnikova, Ksenia ;
Kliver, Sergey ;
Schmidt-Kuentzel, Anne ;
Koepfli, Klaus-Peter ;
Johnson, Warren ;
Kuderna, Lukas F. K. ;
Garcia-Perez, Raquel ;
de Manuel, Marc ;
Godinez, Ricardo ;
Komissarov, Aleksey ;
Makunin, Alexey ;
Brukhin, Vladimir ;
Qiu, Weilin ;
Zhou, Long ;
Li, Fang ;
Yi, Jian ;
Driscoll, Carlos ;
Antunes, Agostinho ;
Oleksyk, Taras K. ;
Eizirik, Eduardo ;
Perelman, Polina ;
Roelke, Melody ;
Wildt, David ;
Diekhans, Mark ;
Marques-Bonet, Tomas ;
Marker, Laurie ;
Bhak, Jong ;
Wang, Jun ;
Zhang, Guojie ;
O'Brien, Stephen J. .
GENOME BIOLOGY, 2015, 16