RAPSearch: a fast protein similarity search tool for short reads

Cited by: 103
Authors
Ye, Yuzhen [1]
Choi, Jeong-Hyeon [2]
Tang, Haixu [1, 2]
Affiliations
[1] Indiana Univ, Sch Informat & Comp, Bloomington, IN 47408 USA
[2] Indiana Univ, Ctr Genom & Bioinformat, Bloomington, IN 47405 USA
Source
BMC Bioinformatics | 2011 / Vol. 12
Funding
National Science Foundation (USA)
Keywords
short reads similarity search; suffix array; reduced amino acid alphabet; metagenomics; AMINO-ACID ALPHABETS; ALIGNMENT; SEQUENCES; METAGENOMICS; MICROBIOME; DATABASES; PROGRAMS; GENOMES; COMMON; BLAST;
DOI
10.1186/1471-2105-12-159
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Next Generation Sequencing (NGS) is producing enormous corpora of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search, a key step in annotating the protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of the short read datasets. Results: We developed a fast protein similarity search tool, RAPSearch, that utilizes a reduced amino acid alphabet and a suffix array to detect seeds of flexible length. For the short reads we tested (translated in 6 frames), RAPSearch achieved a ~20-90 times speedup compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had a significant loss of sensitivity compared to RAPSearch and BLAST. Conclusions: RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.
Pages: 10
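Note
The abstract names two ingredients, a reduced amino acid alphabet and a suffix array, without giving details. The Python sketch below illustrates the general idea only and is not RAPSearch's implementation: the residue grouping in `GROUPS` is a hypothetical example rather than the alphabet used in the paper, the suffix array is built naively, and the lookup finds exact seeds of a given length instead of the flexible-length seeds the abstract describes. All function names are invented for illustration.

```python
# Hypothetical reduced alphabet: these groups are illustrative only and are
# NOT the alphabet used by RAPSearch. Each residue maps to its group's
# first letter.
GROUPS = ["AST", "C", "DE", "FWY", "G", "H", "ILMV", "KR", "NQ", "P"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(seq):
    """Rewrite a protein sequence in the reduced alphabet ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def build_suffix_array(text):
    """Naive suffix array construction; adequate for a toy example only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seed(text, sa, seed):
    """Return sorted start positions of exact occurrences of seed in text."""
    k = len(seed)
    # Lower bound: first suffix whose k-length prefix is >= seed.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < seed:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose k-length prefix is > seed.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= seed:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Index a (made-up) database protein, then look up a seed taken from a query.
database = reduce_seq("MKVLAAGIVR")
suffix_array = build_suffix_array(database)
hits = find_seed(database, suffix_array, reduce_seq("LRIMST"))
```

In this sketch the query seed `LRIMST` differs from the database prefix `MKVLAA` at every position, yet both reduce to the same string, so the lookup still reports a hit. That is the intuition behind seeding in a reduced alphabet: similar residues collapse to one symbol, so exact matching in the reduced space tolerates conservative substitutions.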