On the distribution of K-tuple matches for sequence homology: A constant time exact calculation of the variance

Cited by: 9
Authors
Benson, G [1]
Su, XP [1]
Affiliations
[1] CUNY Mt Sinai Sch Med, Dept Biomath Sci, New York, NY 10029 USA
Keywords
DOI
10.1089/cmb.1998.5.87
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704
Abstract
We study the distribution of a statistic useful in calculating the significance of the number of k-tuple matches detected in biological sequence homology algorithms. The statistic is R_{n,k}, the total number of heads in head runs of length k or more in a sequence of n i.i.d. Bernoulli trials. Calculation of the mean is straightforward. Poisson approximation formulas have been used for the variance because they are simple and powerful. Unfortunately, when p = P(Head) is large, the Poisson approximation no longer works well. In our application, p is large, say 0.75, and we have turned instead to direct calculation of the variance. Surprisingly, we are able to show that the variance, which is based on the interactions of O(n^2) random variables, can be computed in constant time, independent of the length of the sequence and the probability p. This result can be used to calculate the mean and variance of a number of other head run statistics in constant time. Additionally, we show how to extend the result to sequences generated by a stationary Markov process, where the variance can be calculated in O(n) time.
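To make the definition of R_{n,k} concrete, the following is a minimal Python sketch, not the authors' constant-time method. It counts the heads lying in head runs of length k or more, and obtains the exact mean and variance for small n by brute-force enumeration of all 2^n sequences. The function names are hypothetical. The closed-form mean in `mean_closed_form` is not taken from the paper; it is an assumption derived here by linearity of expectation (each of the n-k+1 all-heads windows contributes p^k, and each run of length at least k adds k-1 further heads), and it assumes k >= 1.

```python
from itertools import product


def heads_in_long_runs(seq, k):
    """Total number of heads (1s) lying in maximal head runs of length >= k."""
    total = 0
    run = 0
    for x in seq:
        if x:
            run += 1
        else:
            if run >= k:
                total += run
            run = 0
    if run >= k:  # account for a run that reaches the end of the sequence
        total += run
    return total


def exact_moments_by_enumeration(n, k, p):
    """Exact mean and variance of R_{n,k}, summing over all 2**n sequences.

    Exponential in n, so only usable as a reference check for small n.
    """
    mean = 0.0
    second_moment = 0.0
    for seq in product((0, 1), repeat=n):
        heads = sum(seq)
        prob = p ** heads * (1 - p) ** (n - heads)
        r = heads_in_long_runs(seq, k)
        mean += prob * r
        second_moment += prob * r * r
    return mean, second_moment - mean ** 2


def mean_closed_form(n, k, p):
    """Mean of R_{n,k} by linearity of expectation (assumed derivation, k >= 1)."""
    if n < k:
        return 0.0
    q = 1 - p
    windows = (n - k + 1) * p ** k                # all-heads windows of length k
    run_starts = p ** k * (1 + (n - k) * q)       # expected number of runs >= k
    return windows + (k - 1) * run_starts
```

For k = 1 the statistic reduces to the total number of heads, so the enumerated variance should equal n p (1 - p), which gives a quick sanity check on the enumeration.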
Pages: 87-100
Page count: 14