SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS

被引:97
作者
CHANG, WI [1 ]
LAWLER, EL [1 ]
机构
[1] UNIV CALIF BERKELEY, DIV COMP SCI, BERKELEY, CA 94720 USA
关键词
PATTERN MATCHING; EDIT DISTANCE; SUFFIX TREE; LOWEST COMMON ANCESTOR; CHERNOFF BOUND; COMPUTATIONAL BIOLOGY;
D O I
10.1007/BF01185431
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily constant-fraction). Previous algorithms require at least 0(kn) time (or exponential space). We give an algorithm that is sublinear time O((n/m)k log(b) m) when the text is random and k is bounded by the threshold m/(log(b) m + 0(1)). In particular, when k = o(m/log(b) m) the expected running time is o(n). In the worst case our algorithm is 0(kn), but is still an improvement in that it is practical and uses 0(m) space compared with 0(n) or 0(m2). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein matching.
引用
收藏
页码:327 / 344
页数:18
相关论文
共 51 条
  • [1] EFFICIENT STRING MATCHING - AID TO BIBLIOGRAPHIC SEARCH
    AHO, AV
    CORASICK, MJ
    [J]. COMMUNICATIONS OF THE ACM, 1975, 18 (06) : 333 - 340
  • [2] BASIC LOCAL ALIGNMENT SEARCH TOOL
    ALTSCHUL, SF
    GISH, W
    MILLER, W
    MYERS, EW
    LIPMAN, DJ
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) : 403 - 410
  • [3] Apostolico A., 1985, COMBINATORIAL ALGORI, P85, DOI DOI 10.1007/978-3-642-82456-2_6
  • [4] CHANG WI, 1990, ANN IEEE SYMP FOUND, P116
  • [5] CHANG WI, 1992, LECT NOTES COMPUT SC, V644, P175
  • [6] CHANG WI, 1994, IN PRESS LECTURE NOT
  • [7] CHANG WI, 1991, THESIS U CALIFORNIA
  • [8] CHANG WI, 1990, FAST IMPLEMENTATION
  • [9] CHANG WI, UCBCSD91653654 COMP
  • [10] CHANG WI, 1990, HUMAN GENOME 2 OFFIC, P24