Fast index based algorithms and software for matching position specific scoring matrices

被引:111
作者
Beckstette, Michael [1 ]
Homann, Robert
Giegerich, Robert
Kurtz, Stefan
机构
[1] Univ Bielefeld, Int NRW Grad Sch Bioinformat & Genome Res, Ctr Biotechnol, CeBITec, D-33594 Bielefeld, Germany
[2] Univ Bielefeld, Tech Fak, D-33501 Bielefeld, Germany
[3] Univ Hamburg, Zentrum Bioinformat, D-20146 Hamburg, Germany
关键词
D O I
10.1186/1471-2105-7-389
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e. g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than vertical bar A vertical bar(m) + m - 1, where m is the length of the PSSM and. a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
引用
收藏
页数:25
相关论文
共 37 条
  • [1] Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
  • [2] PRINTS and its automatic supplement, prePRINTS
    Attwood, TK
    Bradley, P
    Flower, DR
    Gaulton, A
    Maudling, N
    Mitchell, AL
    Moulton, G
    Nordle, A
    Paine, K
    Taylor, P
    Uddin, A
    Zygouri, C
    [J]. NUCLEIC ACIDS RESEARCH, 2003, 31 (01) : 400 - 402
  • [3] BECKSTETTE M, 2006, POSSUM SOFTWARE DIST
  • [4] BECKSTETTE M, 2004, LECT NOTES INFORM, P53
  • [5] CASTILLO G, 1988, EXTREME VALUE THEORY
  • [6] de Bruijn NG, 1946, KONINKLIJKE NEDERLAN, V49, P758
  • [7] Dorohonceanu B, 2000, Proc Int Conf Intell Syst Mol Biol, V8, P128
  • [8] Embrechts P., 1997, MODELLING EXTREMAL E, DOI 10.1007/978-3-642-33483-2
  • [9] Using sequence compression to speedup probabilistic profile matching
    Freschi, V
    Bogliolo, A
    [J]. BIOINFORMATICS, 2005, 21 (10) : 2225 - 2229
  • [10] A comparison of imperative and purely functional suffix tree constructions
    Giegerich, R
    Kurtz, S
    [J]. SCIENCE OF COMPUTER PROGRAMMING, 1995, 25 (2-3) : 187 - 218