Assessing phylogenetic motif models for predicting transcription factor binding sites

Cited by: 14
Authors
Hawkins, John [1]
Grant, Charles [2]
Noble, William Stafford [2, 3]
Bailey, Timothy L. [1]
Affiliations
[1] Univ Queensland, Inst Mol Biosci, Brisbane, Qld 4072, Australia
[2] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
[3] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Keywords
MULTIPLE SEQUENCE ALIGNMENT; REGULATORY ELEMENTS; HUMAN GENOME; DNA; IDENTIFICATION; REVEALS; PROTEIN; GENES;
DOI
10.1093/bioinformatics/btp201
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: A variety of algorithms have been developed to predict transcription factor binding sites (TFBSs) within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models (PMMs) have never been rigorously benchmarked to determine whether they lead to better prediction of TFBSs than that obtained using simple position weight matrix scanning.

Results: We evaluate three PMM-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning when accuracy is measured using a gold standard of validated yeast TFBSs. However, the PMM scanners perform much better than simple motif scanning when we abandon the gold standard and consider the number of statistically significant sites predicted, using column-shuffled 'random' motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading, since such collections may be missing 'weak' sites, which are exactly the type of sites needed to discriminate among predictors. We then extend our previous theoretical model of the statistical power of PMM-based prediction algorithms to allow for loss of binding sites during evolution, and show that it gives a more accurate upper bound on scanner accuracy. Finally, utilizing our theoretical model, we introduce a new method for predicting the number of real binding sites in a genome. The results suggest that the number of true sites for a yeast TF is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae Promoter Database (SCPD). Among the three scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast TFBSs.
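The abstract contrasts PMM scanners with "simple position weight matrix scanning" and uses column-shuffled 'random' motifs as a significance control. The following is only a minimal sketch of those two baseline ideas, not the authors' implementation: the function names, the toy count matrix, the uniform background, the pseudocount, and the score threshold are all illustrative assumptions, and none of the phylogenetic scanners (such as MONKEY) is reproduced here.

```python
import math
import random

BASES = "ACGT"

def log_odds_pwm(counts, background=None, pseudocount=1.0):
    """Turn per-position base counts into a log-odds matrix (list of dicts)."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform background
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        hits.append((start, score))
    return hits

def shuffle_columns(pwm, rng):
    """Permute motif columns to build a 'random' control motif."""
    shuffled = list(pwm)
    rng.shuffle(shuffled)
    return shuffled

def count_hits(sequence, pwm, threshold):
    """Count windows whose log-odds score reaches the threshold."""
    return sum(1 for _, score in scan(sequence, pwm) if score >= threshold)

# Hypothetical toy motif with consensus TGACTC (8 observations per column).
toy_counts = [
    {b: (8 if b == consensus else 0) for b in BASES} for consensus in "TGACTC"
]
toy_pwm = log_odds_pwm(toy_counts)
toy_sequence = "AAATGACTCAAA"

best_start, best_score = max(scan(toy_sequence, toy_pwm), key=lambda hit: hit[1])
control_pwm = shuffle_columns(toy_pwm, random.Random(0))
real_hits = count_hits(toy_sequence, toy_pwm, threshold=9.0)
control_hits = count_hits(toy_sequence, control_pwm, threshold=9.0)
```

In this sketch, comparing `real_hits` with `control_hits` at the same threshold mirrors the idea of judging a motif's predictions against those of a shuffled version of itself; a real analysis would use many shuffles and genome-scale sequence.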
Pages: i339 - i347
Page count: 9