A subquadratic sequence alignment algorithm for unrestricted scoring matrices

被引:75
作者
Crochemore, M
Landau, GM
Ziv-Ukelson, M
机构
[1] Univ Marne la Vallee, Inst Gaspard Monge, F-77454 Marne La Vallee 2, France
[2] Kings Coll London, Dept Comp Sci, London WC2R 2LS, England
[3] Univ Haifa, Dept Comp Sci, IL-31905 Haifa, Israel
[4] Polytech Univ, Dept Comp & Informat Sci, Brooklyn, NY 11201 USA
基金
美国国家科学基金会;
关键词
alignment; dynamic programming; text compression; run length;
D O I
10.1137/S0097539702402007
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Given two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison - Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195 - 197] uses a dynamic programming matrix and compares the two strings in O(n(2)) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n(2)/log n), algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn(2)/log n), where h less than or equal to 1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m' and n' runs, respectively, in O(m' n + n' m) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty.
引用
收藏
页码:1654 / 1673
页数:20
相关论文
共 54 条
  • [1] GEOMETRIC APPLICATIONS OF A MATRIX-SEARCHING ALGORITHM
    AGGARWAL, A
    KLAWE, MM
    MORAN, S
    SHOR, P
    WILBER, R
    [J]. ALGORITHMICA, 1987, 2 (02) : 195 - 208
  • [2] AGGARWAL A, 1988, P 29 ANN IEEE S FDN, P497
  • [3] Let sleeping files lie: Pattern matching in Z-compressed files
    Amir, A
    Benson, G
    Farach, M
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1996, 52 (02) : 299 - 307
  • [4] Matching for run-length encoded strings
    Apostolico, A
    Landau, GM
    Skiena, S
    [J]. JOURNAL OF COMPLEXITY, 1999, 15 (01) : 4 - 16
  • [5] EFFICIENT PARALLEL ALGORITHMS FOR STRING EDITING AND RELATED PROBLEMS
    APOSTOLICO, A
    ATALLAH, MJ
    LARMORE, LL
    MCFADDIN, S
    [J]. SIAM JOURNAL ON COMPUTING, 1990, 19 (05) : 968 - 988
  • [6] ARBELL O, IN PRESS INFORM PROC
  • [7] Bell T. C., 1990, TEXT COMPRESSION
  • [8] A SPACE EFFICIENT ALGORITHM FOR FINDING THE BEST NONOVERLAPPING ALIGNMENT SCORE
    BENSON, G
    [J]. THEORETICAL COMPUTER SCIENCE, 1995, 145 (1-2) : 357 - 369
  • [9] AN IMPROVED ALGORITHM FOR COMPUTING THE EDIT DISTANCE OF RUN-LENGTH CODED STRINGS
    BUNKE, H
    CSIRIK, J
    [J]. INFORMATION PROCESSING LETTERS, 1995, 54 (02) : 93 - 96
  • [10] Chao K M, 1994, J Comput Biol, V1, P271, DOI 10.1089/cmb.1994.1.271