Hardness of optimal spaced seed design

被引:15
作者
Nicolas, Francois [1 ]
Rivals, Eric [1 ]
机构
[1] Univ Montpellier 2, CNRS, LIRMM, UMR 5506, F-34392 Montpellier 5, France
基金
芬兰科学院;
关键词
sequence comparison; alignment; string matching; filtration; spaced seed; gapped seed; maximum independent set; golomb ruler; tiling; maximum coverage; approximiability;
D O I
10.1016/j.jcss.2007.10.001
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Speeding up approximate pattern matching is a line of research in stringology since the 80s. Practically fast approaches belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are first excluded, and the remaining regions are then compared to the pattern by dynamic programming. Among the conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, Counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called spaced seeds (or gapped seeds), for the subwords, depending oil the search parameters. Two distinct lines of research appear the literature: one with probabilistic formulations of seed design problems, in which one wishes for instance to compute a seed with the highest probability to detect the desired similarities (lossy filtration), a second line with combinatorial formulations, where the goal is to find a seed that detects all or a maximum number of similarities (both lossless and lossy filtration). We concentrate on combinatorial seed design problems and consider formulations in which the set of sought similarities is either listed explicitly (RSOS), or characterised by their length and maximal number of mismatches (NON-DETECTION). Several articles exhibit exponential algorithms for these problems. In this work, we provide hardness and inapproximability results for several seed design problems, thereby justifying the complexity of known algorithms. Moreover, we introduce a new formulation of seed design (MWLS), in which the weight of the seed has to be maximised, and show it is as difficult to approximate as MAXIMUM INDEPENDENT SET. (c) 2007 Elsevier Inc. All rights reserved.
引用
收藏
页码:831 / 849
页数:19
相关论文
共 30 条
[1]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[2]  
[Anonymous], [No title captured], DOI DOI 10.1145/299432.299460
[3]   INTERMODULATION INTERFERENCE IN RADIO SYSTEMS - FREQUENCY OF OCCURRENCE AND CONTROL BY CHANNEL SELECTION [J].
BABCOCK, WC .
BELL SYSTEM TECHNICAL JOURNAL, 1953, 32 (01) :63-73
[4]   TILING FIGURES OF THE PLANE WITH 2 BARS [J].
BEAUQUIER, D ;
NIVAT, M ;
REMILA, E ;
ROBSON, M .
COMPUTATIONAL GEOMETRY-THEORY AND APPLICATIONS, 1995, 5 (01) :1-25
[5]  
Brejova Brona, 2004, J Bioinform Comput Biol, V1, P595, DOI 10.1142/S0219720004000326
[6]   Designing seeds for similarity search in genomic DNA [J].
Buhler, J ;
Keich, U ;
Sun, YN .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2005, 70 (03) :342-363
[7]  
Burkhardt S, 2003, FUND INFORM, V56, P51
[8]  
Califano A, 1993, Proc Int Conf Intell Syst Mol Biol, V1, P56
[9]   SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS [J].
CHANG, WI ;
LAWLER, EL .
ALGORITHMICA, 1994, 12 (4-5) :327-344
[10]   Sensitivity analysis and efficient method for identifying optimal spaced seeds [J].
Choi, KP ;
Zhang, LX .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2004, 68 (01) :22-40