ParPEST: a pipeline for EST data analysis based on parallel computing

Cited by: 26
Authors
D'Agostino, N [1]
Aversano, M [1]
Chiusano, ML [1]
Affiliation
[1] Univ Naples Federico II, Dept Struct & Funct Biol, I-80134 Naples, Italy
DOI
10.1186/1471-2105-6-S4-S9
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Expressed Sequence Tags (ESTs) are short, error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They are an important resource for comparative and functional genomic studies and provide reliable information for the annotation of genomic sequences. Advances in biotechnology mean that large EST datasets are now produced daily, so suitable and efficient bioinformatic approaches are needed to organize the data and its information content for further investigation.
Results: We implemented ParPEST (Parallel Processing of ESTs), a pipeline for EST analysis based on parallel computing. The results are organized in a data warehouse that provides a starting point for mining expressed sequence datasets. The collected information supports investigations of data quality and information content, and is enriched by a preliminary functional annotation.
Conclusion: The pipeline was developed to perform an exhaustive and reliable analysis of EST data and to provide a curated set of information based on a relational database. It is also designed to reduce the execution time of the individual steps required for a complete analysis by using distributed processes and parallelized software. It is conceived to run on hardware with modest requirements, so that it can meet the growing demand typical of this kind of data and scale at affordable cost.
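The idea of spreading per-sequence work over distributed processes can be illustrated with a minimal sketch. It is not the ParPEST implementation, which the abstract does not describe in detail. All function names here are hypothetical, and the per-sequence step (stripping flanking ambiguous N bases and reporting the remaining length) is only a stand-in for the real analysis steps.

```python
from concurrent.futures import Executor
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]        # (identifier, sequence)
Result = Tuple[str, str, int]   # (identifier, cleaned sequence, length)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def clean_est(record: Record) -> Result:
    """Hypothetical per-sequence step: strip flanking N bases only."""
    name, seq = record
    trimmed = seq.strip("N")
    return name, trimmed, len(trimmed)


def split_into_batches(records: List[Record], batch_size: int) -> List[List[Record]]:
    """Cut the dataset into independent batches that workers can process."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def process_batch(batch: List[Record]) -> List[Result]:
    """Work done by a single worker on one batch."""
    return [clean_est(record) for record in batch]


def run_pipeline(records: List[Record], executor: Executor,
                 batch_size: int = 100) -> List[Result]:
    """Distribute batches over the executor's workers, keeping input order."""
    results: List[Result] = []
    for batch_result in executor.map(process_batch, split_into_batches(records, batch_size)):
        results.extend(batch_result)
    return results
```

Because `run_pipeline` accepts any `concurrent.futures` executor, passing a `ProcessPoolExecutor` would spread the batches over the CPU cores of one machine. A cluster scheduler would be needed to distribute them across several nodes, which this sketch does not attempt.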
Pages: 9