Efficient clustering of large EST data sets on parallel computers

Cited by: 50
Authors
Kalyanaraman, A
Aluru, S [1]
Kothari, S
Brendel, V
Affiliations
[1] Iowa State Univ Sci & Technol, Dept Comp Sci, Ames, IA 50011 USA
[2] Iowa State Univ Sci & Technol, Dept Elect & Comp Engn, Ames, IA 50011 USA
[3] Iowa State Univ Sci & Technol, Dept Zool & Genet, Ames, IA 50011 USA
[4] Iowa State Univ Sci & Technol, Dept Stat, Ames, IA 50011 USA
Funding
U.S. National Science Foundation;
DOI
10.1093/nar/gkg379
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory-efficient algorithms that reduce the memory requirement to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software program widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists with a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
Pages: 2963-2974
Page count: 12