Assembling genomes on large-scale parallel computers

Cited by: 18
Authors
Kalyanaraman, A. [2]
Emrich, S. J. [1, 3]
Schnable, P. S. [3, 4, 5]
Aluru, S. [1, 3]
Affiliations
[1] Iowa State Univ Sci & Technol, Dept Elect & Comp Engn, Ames, IA 50011 USA
[2] Washington State Univ, Sch Elect Engn & Comp Sci, Pullman, WA 99164 USA
[3] Iowa State Univ Sci & Technol, Bioinformat & Computat Biol Grad Program, Ames, IA 50011 USA
[4] Iowa State Univ Sci & Technol, Dept Agron, Ames, IA 50011 USA
[5] Iowa State Univ Sci & Technol, Dept Genet Dev & Cell Biol, Ames, IA 50011 USA
Funding
National Science Foundation (USA);
Keywords
computational biology; genome assembly; genome sequencing; parallel algorithms; suffix trees;
DOI
10.1016/j.jpdc.2007.05.014
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbate this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. Developed as part of the ongoing efforts in maize genome sequencing, we applied our assembly framework to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments of over 1.25 billion nucleotides total size into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole genome shotgun sequencing and assembly of environmental sequences. (c) 2007 Elsevier Inc. All rights reserved.
Pages: 1240-1255
Page count: 16