String matching and 1d lattice gases

被引:2
作者
Mungan, Muhittin [1 ]
机构
[1] Bogazici Univ, Dept Phys, TR-34342 Istanbul, Turkey
[2] Gursey Inst, TR-34680 Istanbul, Turkey
关键词
pattern occurrences; combinatorics; lattice gases; theory of liquids;
D O I
10.1007/s10955-006-9247-z
中图分类号
O4 [物理学];
学科分类号
0702 ;
摘要
We calculate the probability distribution for the number of occurrences n of a given l letter word x inside a random string of k letters, whose letters have been generated by a known stationary stochastic process. Denoting by p( x) the probability of occurrence of the word, it is well-known that the distribution of occurrences in the asymptotic regime k-->infinity such that kp( x) >> 1 is Gaussian, while in the limit k --> infinity, and p( x) --> 0, such that kp( x) is finite, the distribution is Compound Poisson. It is also known that these limiting forms do not work well in the intermediate regime when kp( x) greater than or similar to 1 and k is finite. We show that the problem of calculating the probability of occurrences is equivalent to determining the configurational partition function of a 1d lattice gas of interacting particles, with the probability distribution given by the n-particle terms of the grand-partition function and the number of particles corresponding to the number of occurrences on the string. Utilizing this equivalence, we obtain the probability distribution from the equation of state of the lattice gas. Our result reproduces rather well the behavior of the distribution in the asymptotic as well as the intermediate regimes. Within the lattice gas description, the asymptotic forms of the distribution naturally emerge as certain low density approximations. Thus our approach which is based on statistical mechanics, also provides an alternative to the usual statistics based treatments employing the central limit and Chen - Stein theorems.
引用
收藏
页码:207 / 242
页数:36
相关论文
共 39 条
[21]   APPLICATIONS AND STATISTICS FOR MULTIPLE HIGH-SCORING SEGMENTS IN MOLECULAR SEQUENCES [J].
KARLIN, S ;
ALTSCHUL, SF .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1993, 90 (12) :5873-5877
[22]   CRITICAL-BEHAVIOR IN THE SATISFIABILITY OF RANDOM BOOLEAN EXPRESSIONS [J].
KIRKPATRICK, S ;
SELMAN, B .
SCIENCE, 1994, 264 (5163) :1297-1301
[23]  
KLEFFE J, 1992, COMPUT APPL BIOSCI, V8, P433
[24]  
Knuth D. E., 1977, SIAM Journal on Computing, V6, P323, DOI 10.1137/0206024
[25]  
MERTENS S, CSCC0309020
[26]   Analytic and algorithmic solution of random satisfiability problems [J].
Mézard, M ;
Parisi, G ;
Zecchina, R .
SCIENCE, 2002, 297 (5582) :812-815
[27]   Determining computational complexity from characteristic 'phase transitions' [J].
Monasson, R ;
Zecchina, R ;
Kirkpatrick, S ;
Selman, B ;
Troyansky, L .
NATURE, 1999, 400 (6740) :133-137
[28]   LINGUISTICS OF NUCLEOTIDE-SEQUENCES .1. THE SIGNIFICANCE OF DEVIATIONS FROM MEAN STATISTICAL CHARACTERISTICS AND PREDICTION OF THE FREQUENCIES OF OCCURRENCE OF WORDS [J].
PEVZNER, PA ;
BORODOVSKY, MY ;
MIRONOV, AA .
JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 1989, 6 (05) :1013-1026
[29]  
PRUM B, 1995, J ROY STAT SOC B MET, V57, P205
[30]   On pattern frequency occurrences in a Markovian sequence [J].
Regnier, M ;
Szpankowski, W .
ALGORITHMICA, 1998, 22 (04) :631-649