ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis

Cited by: 83

Authors:

Oehmen, Christopher [1]

Nieplocha, Jarek [1]

Affiliations:

[1] Pacific NW Natl Lab, Computat Sci & Math Div, Richland, WA 99352 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2006 / Vol. 17 / No. 08

Keywords:

high-performance sequence alignment; BLAST; global arrays

DOI:

10.1109/TPDS.2006.112

Chinese Library Classification number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

Genes in an organism's DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify protein components (proteome) being actively used for a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. However, the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state of the art in high-performance sequence alignment with scaling and portability. ScalaBLAST relies on a collection of techniques (distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching) to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.

Pages: 740-749

Page count: 10
