Database indexing for production MegaBLAST searches

Cited by: 944
Authors
Morgulis, Aleksandr [1]
Coulouris, George [1]
Raytselis, Yan [1]
Madden, Thomas L. [1]
Agarwala, Richa [1]
Schaeffer, Alejandro A. [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Dept Hlth & Human Serv, Bethesda, MD 20894 USA
DOI
10.1093/bioinformatics/btn322
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.

Results: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program, makembindex, that preprocesses the database into a data structure for rapid seed searching. We show that the new indexed MegaBLAST is faster than the non-indexed version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
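
The idea of finding short seeds by searching a database index can be illustrated with a minimal sketch. This is not the makembindex data structure or the MegaBLAST seed search, whose details the abstract does not give; it is a toy exact k-mer index, and the names `build_kmer_index` and `find_seeds`, the seed length, and the sample sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Preprocess the database: map each k-mer to its (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def find_seeds(index, query, k):
    """Return (query offset, sequence id, database offset) for every exact k-mer hit."""
    seeds = []
    for q_off in range(len(query) - k + 1):
        kmer = query[q_off:q_off + k]
        # .get avoids inserting empty entries into the defaultdict on a miss.
        for seq_id, db_off in index.get(kmer, ()):
            seeds.append((q_off, seq_id, db_off))
    return seeds


# Hypothetical toy data; a real nucleotide database is far larger.
database = {"seq1": "ACGTACGTGA", "seq2": "TTGACGTACC"}
index = build_kmer_index(database, k=4)   # built once, reused for every query
seeds = find_seeds(index, "GACGTA", k=4)  # each seed would then be extended
```

In this sketch the index is built once per database and reused across queries, whereas the query-side lookup table described under Motivation has to be rebuilt for each query.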
Pages: 1757-1764
Number of pages: 8
Related papers
16 in total
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]  
Cao X, 2004, SIGMOD REC, V33, P39, DOI 10.1145/1024694.1024701
[3]   Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST [J].
Gertz, E. Michael ;
Yu, Yi-Kuo ;
Agarwala, Richa ;
Schaffer, Alejandro A. ;
Altschul, Stephen F. .
BMC BIOLOGY, 2006, 4 (1)
[4]   SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size [J].
Giladi, E ;
Walker, MG ;
Wang, JZ ;
Volkmuth, W .
BIOINFORMATICS, 2002, 18 (06) :873-879
[5]   Survey on index based homology search algorithms [J].
Jiang, Xianyang ;
Zhang, Peiheng ;
Liu, Xinchun ;
Yau, Stephen S-T. .
JOURNAL OF SUPERCOMPUTING, 2007, 40 (02) :185-212
[6]  
Kent WJ, 2002, GENOME RES, V12, P656, DOI 10.1101/gr.229202
[7]   miBLAST: scalable evaluation of a batch of nucleotide sequence queries with BLAST [J].
Kim, YJ ;
Boyd, A ;
Athey, BD ;
Patel, JM .
NUCLEIC ACIDS RESEARCH, 2005, 33 (13) :4335-4344
[8]   A novel filtration method in biological sequence databases [J].
Lee, Anthony J. T. ;
Lin, Chao-Wen ;
Lo, Wen-Hsing ;
Chen, Chieh-Chun ;
Chen, Jia-Xin .
PATTERN RECOGNITION LETTERS, 2007, 28 (04) :447-458
[9]   WindowMasker: window-based masker for sequenced genomes [J].
Morgulis, A ;
Gertz, EM ;
Schäffer, AA ;
Agarwala, R .
BIOINFORMATICS, 2006, 22 (02) :134-141
[10]   A fast and symmetric DUST implementation to mask low-complexity DNA sequences [J].
Morgulis, Aleksandr ;
Gertz, E. Michael ;
Schaffer, Alejandro A. ;
Agarwala, Richa .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2006, 13 (05) :1028-1040