Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

Cited by: 403
Authors
Gertz, E. Michael [1]
Yu, Yi-Kuo [1]
Agarwala, Richa [1]
Schaffer, Alejandro A. [1]
Altschul, Stephen F. [1]
Affiliations
[1] Natl Lib Med, Natl Ctr Biotechnol Informat, US Dept HHS, NIH, Bethesda, MD 20894 USA
DOI
10.1186/1741-7007-4-41
Chinese Library Classification number
Q [Biological sciences]
Subject classification codes
07; 0710; 09
Abstract
Background: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.

Results: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.

Conclusion: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
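
To make "translated in all six frames" concrete, the following is a minimal Python sketch of six-frame translation. It is an illustration only, not the TBLASTN implementation: the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT character is assumed to translate to `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the database uses the standard code; other codes need a different string).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict[int, str]:
    """Map frame labels +1..+3 (forward) and -1..-3 (reverse) to proteins."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(f"{frame:+d}: {protein}")
```

A protein query would then be compared against each of the six translated sequences. Stop codons appear as `*` in the output, so a search tool working on such translations has to decide how to treat them.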
Pages: 14