Space and time efficient parallel algorithms and software for EST clustering

被引:21
作者
Kalyanaraman, A
Aluru, S
Brendel, V
Kothari, S
机构
[1] Iowa State Univ Sci & Technol, Dept Comp Sci, Ames, IA 50011 USA
[2] Iowa State Univ Sci & Technol, Dept Elect & Comp Engn, Ames, IA 50011 USA
[3] Iowa State Univ Sci & Technol, Dept Zool & Genet, Ames, IA 50011 USA
[4] Iowa State Univ Sci & Technol, Dept Stat, Ames, IA 50011 USA
基金
美国国家科学基金会;
关键词
computational biology; EST clustering; maximal common substring; parallel algorithms; suffix tree applications; SUFFIX TREE CONSTRUCTION; TIGR GENE INDEXES; SEQUENCES; GENERATION; D2-CLUSTER; TOOLS;
D O I
10.1109/TPDS.2003.1255634
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Expressed sequence tags, abbreviated as ESTs, are DNA molecules experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and for understanding important genetic variations such as those resulting in diseases. In this paper, we present the algorithmic foundations and implementation of PaCE, a parallel software system we developed for large-scale EST clustering. The novel features of our approach include 1) design of space-efficient algorithms to limit the space required to linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce runtime and facilitate clustering of large data sets. Using a combination of these techniques, we report the clustering of 327,632 rat ESTs in 47 minutes, and 420,694 Triticum aestivum ESTs in 3 hours and 15 minutes, using a 60-processor IBM xSeries cluster. These problems are well beyond the capabilities of state-of-the-art sequential software. We also present thorough experimental evaluation of our software including quality assessment using benchmark Arabidopsis EST data.
引用
收藏
页码:1209 / 1221
页数:13
相关论文
共 30 条
[1]  
[Anonymous], 1995, Genome Science and Technology, DOI [DOI 10.1089/GST.1995.1.9, 10.1089/gst.1995.1.9]
[2]  
[Anonymous], 1997, ALGORITHMS STRINGS T
[3]   PARALLEL CONSTRUCTION OF A SUFFIX TREE WITH APPLICATIONS [J].
APOSTOLICO, A ;
ILIOPOULOS, C ;
LANDAU, GM ;
SCHIEBER, B ;
VISHKIN, U .
ALGORITHMICA, 1988, 3 (03) :347-365
[4]   d2_cluster: A validated method for clustering EST and full-length cDNA sequences [J].
Burke, J ;
Davison, D ;
Hide, W .
GENOME RESEARCH, 1999, 9 (11) :1135-1142
[5]   Assessment of the parallelization approach of d2_cluster for high-performance sequence clustering [J].
Carpenter, JE ;
Christoffels, A ;
Weinbach, Y ;
Hide, WA .
JOURNAL OF COMPUTATIONAL CHEMISTRY, 2002, 23 (07) :755-757
[6]   SpliceNest: visualizing gene structure and alternative splicing based on EST clusters [J].
Coward, E ;
Haas, SA ;
Vingron, M .
TRENDS IN GENETICS, 2002, 18 (01) :53-55
[7]   GeneNest: automated generation and visualization of gene indices [J].
Haas, SA ;
Beissbarth, T ;
Rivals, E ;
Krause, A ;
Vingron, M .
TRENDS IN GENETICS, 2000, 16 (11) :521-523
[8]   Optimal parallel suffix tree construction [J].
Hariharan, R .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1997, 55 (01) :44-69
[9]   CAP3: A DNA sequence assembly program [J].
Huang, XQ ;
Madan, A .
GENOME RESEARCH, 1999, 9 (09) :868-877
[10]  
Jain AK., 1988, ALGORITHMS CLUSTERIN