Context-specific independence mixture modeling for positional weight matrices

Cited by: 23
Authors
Georgi, Benjamin [1]
Schliep, Alexander [1]
Affiliations
[1] Max Planck Inst Mol Genet, Dept Computat Mol Biol, D-14195 Berlin, Germany
DOI
10.1093/bioinformatics/btl249
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: A positional weight matrix (PWM) is a statistical representation of the binding pattern of a transcription factor, estimated from known binding site sequences. Previous studies showed that for factors that bind to divergent binding sites, mixtures of multiple PWMs increase performance. However, estimating a conventional mixture distribution for each position will in many cases cause overfitting.
Results: We propose a context-specific independence (CSI) mixture model and a learning algorithm based on a Bayesian approach. The CSI model adjusts its complexity to fit the amount of variation observed at the sequence level in each position of a site. This not only yields a more parsimonious description of binding patterns, which improves parameter estimates, but also increases robustness, because the model automatically adapts the number of components to fit the data. Evaluation of the CSI model on simulated data showed favorable results compared to conventional mixtures. We demonstrate its adaptive properties in a classical model selection setup. The increased parsimony of the CSI model was shown for the transcription factor Leu3, where two binding-energy subgroups were distinguished as well as with a conventional mixture while requiring 30% fewer parameters. Analysis of the human-mouse conservation of predicted binding sites of 64 JASPAR TFs showed that CSI was as good as or better than a conventional mixture for 89% of the TFs, and as good as or better than a single PWM model for 70% of the TFs.
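As an illustration of the quantities the abstract refers to, the following minimal Python sketch estimates a PWM from aligned sites, scores a site under a mixture of PWMs, and counts free parameters when mixture components share a distribution at some positions. It is not the authors' implementation: the function names, the pseudocount smoothing, the toy sequences, and the representation of parameter sharing as a per-position partition of component indices are all assumptions made for illustration only.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sites.

    The additive pseudocount is an assumption of this sketch, not a detail
    taken from the abstract.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for j in range(length):
        counts = Counter(site[j] for site in sites)
        pwm.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return pwm


def site_likelihood(pwm, site):
    """Probability of a site under one PWM (positions treated as independent)."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p


def mixture_likelihood(weights, pwms, site):
    """Probability of a site under a weighted mixture of PWMs."""
    return sum(w * site_likelihood(m, site) for w, m in zip(weights, pwms))


def free_parameters(partitions):
    """Count free parameters of a mixture with per-position parameter sharing.

    partitions[j] is a list of groups of component indices; components in the
    same group share one base distribution at position j (hypothetical
    encoding). Each distribution has len(ALPHABET) - 1 free parameters, and
    the mixture weights add (number of components - 1).
    """
    n_components = sum(len(group) for group in partitions[0])
    per_distribution = len(ALPHABET) - 1
    shared = sum(len(groups) * per_distribution for groups in partitions)
    return (n_components - 1) + shared


# Toy data: two subgroups of sites that differ only in the first position.
group_a = ["ACGT", "ACGT", "ACGA"]
group_b = ["TCGT", "TCGA"]
pwms = [estimate_pwm(group_a), estimate_pwm(group_b)]
weights = [0.6, 0.4]
score = mixture_likelihood(weights, pwms, "ACGT")

# Conventional mixture: every position has one distribution per component.
conventional = [[[0], [1]] for _ in range(4)]
# Sharing sketch: components differ at position 0 and share elsewhere.
shared = [[[0], [1]], [[0, 1]], [[0, 1]], [[0, 1]]]
n_conventional = free_parameters(conventional)
n_shared = free_parameters(shared)
```

In this toy setting the shared structure needs fewer free parameters than the conventional mixture because three of the four positions carry a single distribution; the numbers are specific to the made-up example and say nothing about the Leu3 result reported above.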
Pages: E166-E173
Number of pages: 8