Efficient enumeration of phylogenetically informative substrings

被引:1
作者
Angelov, Stanislav
Harb, Boulos
Kannan, Sampath
Khanna, Sanjeev
Kim, Junhyong
机构
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] Univ Penn, Dept Biol, Philadelphia, PA 19104 USA
关键词
design and analysis of algorithms; phylogenetically informative substring; phylogeny; stochastic analysis; suffix tree;
D O I
10.1089/cmb.2007.R011
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
We study the problem of enumerating substrings that are common amongst genomes that share evolutionary descent. For example, one might want to enumerate all identical (therefore conserved) substrings that are shared between all mammals and not found in nonmammals. Such collection of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogmetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a tag if it is found in all of the genomes of one set and none of the genomes of the other set. We present a near-linear time algorithm that finds all tags in a given phylogeny; and a sublinear space algorithm (at the expense of running time) that is more suited for very large data sets. Under a stochastic model of evolution, we show that a simple process of tag-generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, since tags are not guaranteed to exist in a given data set, we generalize the notion of a tag from a single substring to a set of substrings. We present a linear programming-based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root for which we find generalized tag sets.
引用
收藏
页码:701 / 723
页数:23
相关论文
共 25 条
[21]   Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes [J].
Siepel, A ;
Bejerano, G ;
Pedersen, JS ;
Hinrichs, AS ;
Hou, MM ;
Rosenbloom, K ;
Clawson, H ;
Spieth, J ;
Hillier, LW ;
Richards, S ;
Weinstock, GM ;
Wilson, RK ;
Gibbs, RA ;
Kent, WJ ;
Miller, W ;
Haussler, D .
GENOME RESEARCH, 2005, 15 (08) :1034-1050
[22]   Comparative analyses of multi-species sequences from targeted genomic regions [J].
Thomas, JW ;
Touchman, JW ;
Blakesley, RW ;
Bouffard, GG ;
Beckstrom-Sternberg, SM ;
Margulies, EH ;
Blanchette, M ;
Siepel, AC ;
Thomas, PJ ;
McDowell, JC ;
Maskeri, B ;
Hansen, NF ;
Schwartz, MS ;
Weber, RJ ;
Kent, WJ ;
Karolchik, D ;
Bruen, TC ;
Bevan, R ;
Cutler, DJ ;
Schwartz, S ;
Elnitski, L ;
Idol, JR ;
Prasad, AB ;
Lee-Lin, SQ ;
Maduro, VVB ;
Summers, TJ ;
Portnoy, ME ;
Dietrich, NL ;
Akhter, N ;
Ayele, K ;
Benjamin, B ;
Cariaga, K ;
Brinkley, CP ;
Brooks, SY ;
Granite, S ;
Guan, X ;
Gupta, J ;
Haghighi, P ;
Ho, SL ;
Huang, MC ;
Karlins, E ;
Laric, PL ;
Legaspi, R ;
Lim, MJ ;
Maduro, QL ;
Masiello, CA ;
Mastrian, SD ;
McCloskey, JC ;
Pearson, R ;
Stantripop, S .
NATURE, 2003, 424 (6950) :788-793
[23]   ONLINE CONSTRUCTION OF SUFFIX TREES [J].
UKKONEN, E .
ALGORITHMICA, 1995, 14 (03) :249-260
[24]  
Weiner P., 1973, 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, P1
[25]   Genome trees and the Tree of Life [J].
Wolf, YI ;
Rogozin, IB ;
Grishin, NV ;
Koonin, EV .
TRENDS IN GENETICS, 2002, 18 (09) :472-479