An algorithm for approximate tandem repeats

被引:104
作者
Landau, GM [1 ]
Schmidt, JP
Sokol, D
机构
[1] Univ Haifa, Dept Comp Sci, IL-31905 Haifa, Israel
[2] Polytech Univ, Dept Comp & Informat Sci, Brooklyn, NY 11201 USA
[3] Incyte Pharmaceut, Palo Alto, CA 94304 USA
[4] Bar Ilan Univ, Dept Comp Sci, IL-52900 Ramat Gan, Israel
关键词
D O I
10.1089/106652701300099038
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e,g,, abcabc, An approximate single tandem repeat is one in which the substrings are similar, but not identical, e,g,, abcdaacd, In this paper we consider two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = (u) over bar(u) over cap, for which the Hamming distance of (u) over bar and (u) over cap is at most k, in O(nk log(n / k)) time, or all those for which the edit distance of (u) over bar and (u) over cap is at most k, in O(nk log k log(n / k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u(a)u ', where u is a prefix of r and u ' is a prefix of u, An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical, We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length a, is O(nka log(n / k)) where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm, The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in the human.
引用
收藏
页码:1 / 18
页数:18
相关论文
共 26 条
[1]  
Aho A.V., 1974, The Design and Analysis of Computer Algorithms
[2]   OPTIMAL PARALLEL DETECTION OF SQUARES IN STRINGS [J].
APOSTOLICO, A .
ALGORITHMICA, 1992, 8 (04) :285-319
[3]   Tandem repeats finder: a program to analyze DNA sequences [J].
Benson, G .
NUCLEIC ACIDS RESEARCH, 1999, 27 (02) :573-580
[4]   On the distribution of K-tuple matches for sequence homology: A constant time exact calculation of the variance [J].
Benson, G ;
Su, XP .
JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (01) :87-100
[5]   A SPACE EFFICIENT ALGORITHM FOR FINDING THE BEST NONOVERLAPPING ALIGNMENT SCORE [J].
BENSON, G .
THEORETICAL COMPUTER SCIENCE, 1995, 145 (1-2) :357-369
[6]   Sequence alignment with tandem duplication [J].
Benson, G .
JOURNAL OF COMPUTATIONAL BIOLOGY, 1997, 4 (03) :351-367
[7]  
CASKEY CT, 1992, SCIENCE, V255, P1256
[8]  
DOOLITTLE RF, 1990, METHOD ENZYMOL, V183, P99
[9]  
GUSFIELD G, 1997, ALGORITHMS STRINGS T
[10]  
Jeong Seop Sim, 1999, Combinatorial Pattern Matching. 10th Annual Symposium, CPM 99. Proceedings (Lecture Notes in Computer Science Vol.1645), P123