Efficient enumeration of phylogenetically informative substrings

被引:1
作者
Angelov, Stanislav
Harb, Boulos
Kannan, Sampath
Khanna, Sanjeev
Kim, Junhyong
机构
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] Univ Penn, Dept Biol, Philadelphia, PA 19104 USA
关键词
design and analysis of algorithms; phylogenetically informative substring; phylogeny; stochastic analysis; suffix tree;
D O I
10.1089/cmb.2007.R011
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
We study the problem of enumerating substrings that are common amongst genomes that share evolutionary descent. For example, one might want to enumerate all identical (therefore conserved) substrings that are shared between all mammals and not found in nonmammals. Such collection of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogmetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a tag if it is found in all of the genomes of one set and none of the genomes of the other set. We present a near-linear time algorithm that finds all tags in a given phylogeny; and a sublinear space algorithm (at the expense of running time) that is more suited for very large data sets. Under a stochastic model of evolution, we show that a simple process of tag-generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, since tags are not guaranteed to exist in a given data set, we generalize the notion of a tag from a single substring to a set of substrings. We present a linear programming-based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root for which we find generalized tag sets.
引用
收藏
页码:701 / 723
页数:23
相关论文
共 25 条
[1]   Ribosomal RNA-targeted nucleic acid probes for studies in microbial ecology [J].
Amann, R ;
Ludwig, W .
FEMS MICROBIOLOGY REVIEWS, 2000, 24 (05) :555-565
[2]  
Angelov S, 2004, LECT NOTES COMPUT SC, V3240, P400
[3]  
Angelov S, 2006, LECT NOTES COMPUT SC, V3909, P248
[4]  
[Anonymous], ENUMERATIVE COMBINAT
[5]  
[Anonymous], CSTR22861 U MAR I AD
[6]   Computational screening of conserved genomic DNA in search of functional noncoding elements [J].
Bejerano, G ;
Siepel, AC ;
Kent, WJ ;
Haussler, D .
NATURE METHODS, 2005, 2 (07) :535-545
[7]   Ultraconserved elements in the human genome [J].
Bejerano, G ;
Pheasant, M ;
Makunin, I ;
Stephen, S ;
Kent, WJ ;
Mattick, JS ;
Haussler, D .
SCIENCE, 2004, 304 (5675) :1321-1325
[8]   DESIGN AND ANALYSIS OF A DATA STRUCTURE FOR REPRESENTING SORTED LISTS [J].
BROWN, MR ;
TARJAN, RE .
SIAM JOURNAL ON COMPUTING, 1980, 9 (03) :594-614
[9]   FAST MERGING ALGORITHM [J].
BROWN, MR ;
TARJAN, RE .
JOURNAL OF THE ACM, 1979, 26 (02) :211-226
[10]   SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS [J].
CHANG, WI ;
LAWLER, EL .
ALGORITHMICA, 1994, 12 (4-5) :327-344