Euler circuits and DNA sequencing by hybridization

被引:27
作者
Arratia, R
Bollobás, B
Coppersmith, D
Sorkin, GB [1 ]
机构
[1] IBM Corp, Thomas J Watson Res Ctr, Dept Math Sci, Yorktown Heights, NY 10598 USA
[2] Univ So Calif, Dept Math, Los Angeles, CA 90089 USA
[3] Univ Memphis, Dept Math Sci, Memphis, TN 38152 USA
基金
美国国家科学基金会;
关键词
combinatorial enumeration; pairing; circle graph; interlace graph; Euler path; Euler circuit; matrix-tree theorem; BEST theorem; DNA sequencing; hybridization; Catalan number;
D O I
10.1016/S0166-218X(00)00190-6
中图分类号
O29 [应用数学];
学科分类号
070104 ;
摘要
Sequencing by hybridization is a method of reconstructing a long DNA string - that is, figuring out its nucleotide sequence - from knowledge of its short substrings. Unique reconstruction is not always possible, and the goal of this paper is to study the number of reconstructions of a random string. For a given string, the number of reconstructions is determined by the pattern of repeated substrings; in an appropriate limit substrings will occur at most twice, so the pattern of repeats is given by a pairing: a string of length 2n in which each symbol occurs twice. A pairing induces a 2-in, 2-out graph, whose directed edges are defined by successive symbols of the pairing - for example the pairing ABBCAC induces the graph with edges AB, BE, BC, and so forth - and the number of reconstructions is simply the number of Euler circuits in this 2-in, 2-out graph. The original problem is thus transformed into one about pairings: to find the number f(k)(n) of n-symbol pairings having k Euler circuits. We show how to compute this function, in closed form, for any fixed k, and we present the functions explicitly for k = 1,...,9. The key is a decomposition theorem: the Euler "circuit number" of a pairing is the product of the circuit numbers of "component" sub-pairings. These components come from connected components of the "interlace graph", which has the: pairing's symbols as vertices, and edges when symbols are "interlaced". (A and B are interlaced if the pairing has the form A B A B or B A B A.) We carry these results back to the original question about DNA strings, and provide a total variation distance upper bound for the approximation error. We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that, for a typical random n-pairing, the number of Euler circuits is of order no smaller than 2(n)/n, and the expected number is asymptotically at least e(-1/2)2(n-1)/n. Since any n-pairing has at most 2(n-1) Euler circuits, this pinpoints the exponential growth rate. (C) 2000 Elsevier Science B.V. All rights reserved.
引用
收藏
页码:63 / 96
页数:34
相关论文
共 31 条
[1]  
[Anonymous], GRADUATE TEXTS MATH
[2]  
[Anonymous], OXFORD STUDIES PROBA
[3]   Poisson process approximation for sequence repeats, and sequencing by hybridization [J].
Arratia, R ;
Martin, D ;
Reinert, G ;
Waterman, MS .
JOURNAL OF COMPUTATIONAL BIOLOGY, 1996, 3 (03) :425-463
[4]   2 MOMENTS SUFFICE FOR POISSON APPROXIMATIONS - THE CHEN-STEIN METHOD [J].
ARRATIA, R ;
GOLDSTEIN, L ;
GORDON, L .
ANNALS OF PROBABILITY, 1989, 17 (01) :9-25
[5]  
ARRATIA R, 1996, SPRINGER LECT NOTES, V1075, P209
[6]  
ARRATIA R, 2000, P 11 ANN ACM SIAM S
[7]   NON-SEXIST SOLUTION OF THE MENAGE PROBLEM [J].
BOGART, KP ;
DOYLE, PG .
AMERICAN MATHEMATICAL MONTHLY, 1986, 93 (07) :514-519
[8]  
Bollobas B, 1980, EUR J COMBINAT, V1, P311
[9]  
Bollobas B, 1985, RANDOM GRAPHS
[10]  
de Fraysseix H., 1984, EUROPEAN J COMBIN, V5, P223