MULTIPLE FILTRATION AND APPROXIMATE PATTERN-MATCHING

被引:50
作者
PEVZNER, PA
WATERMAN, MS
机构
[1] PENN STATE UNIV, DEPT COMP SCI, UNIVERSITY PK, PA 16802 USA
[2] UNIV SO CALIF, DEPT MOLEC BIOL, LOS ANGELES, CA 90089 USA
关键词
STRING MATCHING; COMPUTATIONAL MOLECULAR BIOLOGY;
D O I
10.1007/BF01188584
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these In-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
引用
收藏
页码:135 / 154
页数:20
相关论文
共 39 条
[1]  
BAEZAYATES RA, 1992, LECT NOTES COMPUT SC, V644, P185
[2]  
BAEZAYATES RA, 1989, 12TH P ANN ACM SIG C, P168
[4]   FAST STRING SEARCHING ALGORITHM [J].
BOYER, RS ;
MOORE, JS .
COMMUNICATIONS OF THE ACM, 1977, 20 (10) :762-772
[5]  
CHANG WI, 1990, ANN IEEE SYMP FOUND, P116
[6]  
DANCKAERT A, 1987, COMPUT APPL BIOSCI, V3, P303
[7]   EFFICIENT ALGORITHMS FOR FOLDING AND COMPARING NUCLEIC-ACID SEQUENCES [J].
DUMAS, JP ;
NINIO, J .
NUCLEIC ACIDS RESEARCH, 1982, 10 (01) :197-206
[8]   A NEW DISTANCE METRIC ON STRINGS COMPUTABLE IN LINEAR TIME [J].
EHRENFEUCHT, A ;
HAUSSLER, D .
DISCRETE APPLIED MATHEMATICS, 1988, 20 (03) :191-203
[9]  
Feller W., 1970, INTRO PROBABILITY TH
[10]  
GAIL Z, 1986, APR SIGACT NEWS, P52