Efficient enumeration of phylogenetically informative substrings

被引：1

作者：

Angelov, Stanislav

Harb, Boulos

Kannan, Sampath

Khanna, Sanjeev

Kim, Junhyong

机构：

[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA

[2] Univ Penn, Dept Biol, Philadelphia, PA 19104 USA

来源：

JOURNAL OF COMPUTATIONAL BIOLOGY | 2007年 / 14卷 / 06期

关键词：

design and analysis of algorithms; phylogenetically informative substring; phylogeny; stochastic analysis; suffix tree;

D O I：

10.1089/cmb.2007.R011

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

We study the problem of enumerating substrings that are common amongst genomes that share evolutionary descent. For example, one might want to enumerate all identical (therefore conserved) substrings that are shared between all mammals and not found in nonmammals. Such collection of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogmetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a tag if it is found in all of the genomes of one set and none of the genomes of the other set. We present a near-linear time algorithm that finds all tags in a given phylogeny; and a sublinear space algorithm (at the expense of running time) that is more suited for very large data sets. Under a stochastic model of evolution, we show that a simple process of tag-generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, since tags are not guaranteed to exist in a given data set, we generalize the notion of a tag from a single substring to a set of substrings. We present a linear programming-based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root for which we find generalized tag sets.

引用

页码：701 / 723

页数：23

共 25 条

[1] Ribosomal RNA-targeted nucleic acid probes for studies in microbial ecology [J].