Mastering seeds for genomic size nucleotide BLAST searches

Cited by: 25
Authors
Gotea, V
Veeramachaneni, V
Makalowski, W [1]
Affiliations
[1] Penn State Univ, Inst Mol Evolut Genet, Mueller Lab 514, University Pk, PA 16802 USA
[2] Penn State Univ, Dept Biol, Mueller Lab 514, University Pk, PA 16802 USA
[3] Penn State Univ, Dept Comp Sci & Engn, Mueller Lab 514, University Pk, PA 16802 USA
DOI
10.1093/nar/gkg886
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
One of the most common activities in bioinformatics is the search for similar sequences. These searches are usually carried out with the help of programs from the NCBI BLAST family. As the majority of searches are routinely performed with default parameters, a question that should be addressed is how reliable the results obtained using the default parameter values are, i.e. what fraction of potential matches have been retrieved by these searches. Our primary focus is on the initial hit parameter, also known as the seed or word, used by the NCBI BLASTn, MegaBLAST and other similar programs in searches for similar nucleotide sequences. We show that the use of default values for the initial hit parameter can have a big negative impact on the proportion of potentially similar sequences that are retrieved. We also show how the hit probability of different seeds varies with the minimum length and similarity of sequences desired to be retrieved and describe methods that help in determining appropriate seeds. The experimental results described in this paper illustrate situations in which these methods are most applicable and also show the relationship between the various BLAST parameters.
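As a rough illustration of how the hit probability of a seed depends on alignment length and similarity, the sketch below computes the chance that an ungapped alignment contains at least one run of exact matches as long as the word size. It assumes each position matches independently with probability equal to the identity, which is a simplification and not necessarily the model used in the paper; the function name `hit_probability` and its parameters are hypothetical.

```python
def hit_probability(word_size, length, identity):
    """Probability that an ungapped alignment of `length` positions,
    each matching independently with probability `identity`, contains
    at least one run of `word_size` consecutive matches (a seed hit)."""
    if word_size > length:
        return 0.0
    # state[k]: probability that no hit has occurred yet and the
    # alignment currently ends in exactly k consecutive matches.
    state = [0.0] * word_size
    state[0] = 1.0
    hit = 0.0
    for _ in range(length):
        new_state = [0.0] * word_size
        # A mismatch resets the run length to zero.
        new_state[0] = sum(state) * (1.0 - identity)
        # A match extends the current run by one position.
        for k in range(word_size - 1):
            new_state[k + 1] = state[k] * identity
        # A match after word_size - 1 matches completes a seed hit.
        hit += state[word_size - 1] * identity
        state = new_state
    return hit


if __name__ == "__main__":
    # Compare word size 11 (BLASTn default) with 28 (MegaBLAST default)
    # for a 100-position alignment at several identity levels.
    for identity in (0.70, 0.80, 0.90, 0.95):
        p11 = hit_probability(11, 100, identity)
        p28 = hit_probability(28, 100, identity)
        print(f"identity={identity:.2f}  w=11: {p11:.3f}  w=28: {p28:.3f}")
```

Under this simplified model, longer seeds are never more likely to hit than shorter ones at a given length and identity, which is the trade-off between sensitivity and speed that the abstract refers to.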
Pages: 6935-6941
Number of pages: 7