Monotony of surprise and large-scale quest for unusual words

被引:34
作者
Apostolico, A
Bock, ME
Lonardi, S
机构
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] Univ Padua, Dipartimento Ingn Informaz, Padua, Italy
[3] Purdue Univ, Dept Stat, W Lafayette, IN 47907 USA
[4] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
关键词
design and analysis of algorithms; combinatoric on words; statistical analysis of sequences; annotated suffix trees; over- and under-represented words; pattern discovery;
D O I
10.1089/10665270360688020
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subword of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
引用
收藏
页码:283 / 311
页数:29
相关论文
共 22 条
[1]  
Aho A.V., 1974, The Design and Analysis of Computer Algorithms
[2]   Annotated statistical indices for sequence analysis [J].
Apostolico, A ;
Bock, ME ;
Xu, XY .
COMPRESSION AND COMPLEXITY OF SEQUENCES 1997 - PROCEEDINGS, 1998, :215-229
[3]   A speed-up for the commute between subword trees and DAWGs [J].
Apostolico, A ;
Lonardi, S .
INFORMATION PROCESSING LETTERS, 2002, 83 (03) :159-161
[4]   Efficient detection of unusual words [J].
Apostolico, A ;
Bock, ME ;
Lonardi, S ;
Xu, XY .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2000, 7 (1-2) :71-94
[5]  
APOSTOLICO A, 2001, VERBUMCULUS
[6]  
Apostolico A, 1997, Pattern Matching Algorithms
[7]  
APOSTOLICO A, 2001, P 8 INT C STRING PRO
[8]   COMPLETE INVERTED FILES FOR EFFICIENT TEXT RETRIEVAL AND ANALYSIS [J].
BLUMER, A ;
BLUMER, J ;
HAUSSLER, D ;
MCCONNELL, R ;
EHRENFEUCHT, A .
JOURNAL OF THE ACM, 1987, 34 (03) :578-595
[9]  
Borges JorgeLuis., 1975, UNIVERSAL HIST INFAM
[10]   SEQUENCE LANDSCAPES [J].
CLIFT, B ;
HAUSSLER, D ;
MCCONNELL, R ;
SCHNEIDER, TD ;
STORMO, GD .
NUCLEIC ACIDS RESEARCH, 1986, 14 (01) :141-158