ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis

Cited by: 83
Authors
Oehmen, Christopher [1]
Nieplocha, Jarek [1]
Affiliations
[1] Pacific NW Natl Lab, Computat Sci & Math Div, Richland, WA 99352 USA
Keywords
high-performance sequence alignment; BLAST; global arrays
DOI
10.1109/TPDS.2006.112
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Genes in an organism's DNA (genome) carry embedded information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify the protein components (proteome) being actively used under a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. However, the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures. This represents a substantial improvement in scaling and portability over the current state of the art in high-performance sequence alignment. ScalaBLAST relies on a collection of techniques to achieve high performance and scalability: distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.
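Two of the techniques named in the abstract, distributing the target database over available memory and hiding latency through prefetching, can be illustrated with a toy sketch. This is not ScalaBLAST's code: the function names (`partition`, `fetch_shard`, `score`, `search`) are hypothetical, a background thread stands in for a remote memory read, and the longest-common-substring score is only a placeholder for BLAST scoring.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, num_nodes):
    """Split the target database into contiguous, near-equal shards, one per node."""
    base, extra = divmod(len(database), num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        shards.append(database[start:start + size])
        start += size
    return shards


def fetch_shard(shards, index):
    """Hypothetical stand-in for reading one shard from another node's memory."""
    return list(shards[index])


def score(query, target):
    """Toy similarity: length of the longest common substring (not BLAST scoring)."""
    best = 0
    for i in range(len(query)):
        for j in range(len(target)):
            k = 0
            while (i + k < len(query) and j + k < len(target)
                   and query[i + k] == target[j + k]):
                k += 1
            best = max(best, k)
    return best


def search(query, shards):
    """Scan every shard, requesting shard i+1 while shard i is being scored."""
    hits = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_shard, shards, 0)
        for index in range(len(shards)):
            current = pending.result()  # wait for the shard requested earlier
            if index + 1 < len(shards):
                # Prefetch the next shard so the fetch overlaps with scoring.
                pending = pool.submit(fetch_shard, shards, index + 1)
            hits.extend((score(query, target), target) for target in current)
    return sorted(hits, reverse=True)


database = ["ACGT", "TTTT", "ACGA", "GGGG", "CCAC"]
shards = partition(database, 2)
hits = search("ACG", shards)
```

In this sketch the fetch of the next shard is issued before scoring begins on the current one, so the two overlap; that overlap is the sense in which prefetching hides latency.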
Pages: 740-749
Page count: 10