Identification of transcription factor binding sites with variable-order Bayesian networks

Cited by: 108
Authors
Ben-Gal, I [1]
Shani, A
Gohr, A
Grau, J
Arviv, S
Shmilovici, A
Posch, S
Grosse, I
Affiliations
[1] Tel Aviv Univ, Dept Ind Engn, IL-69978 Tel Aviv, Israel
[2] Inst Plant Genet & Crop Plant Res IPK, D-06466 Gatersleben, Germany
[3] Univ Halle Wittenberg, Inst Comp Sci, D-06099 Halle An Der Saale, Germany
[4] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
DOI
10.1093/bioinformatics/bti410
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: We propose a new class of variable-order Bayesian network (VOBN) models for the identification of transcription factor binding sites (TFBSs). The proposed models generalize the widely used position weight matrix (PWM) models, Markov models and Bayesian network models. In those models, each position depends on a fixed subset of the remaining positions. In VOBN models, by contrast, these subsets may vary with the specific nucleotides observed, which are called the context. This flexibility proves advantageous for the classification and analysis of TFBSs, because statistical dependencies between nucleotides in different TFBS positions (not necessarily adjacent) can be taken into account efficiently, in a position-specific and context-specific manner.

Results: We apply the VOBN model to a set of 238 experimentally verified sigma-70 binding sites in Escherichia coli. We find that the VOBN model can distinguish these 238 sites from a set of 472 intergenic 'non-promoter' sequences with higher accuracy than fixed-order Markov models or Bayesian trees. We use a replicated stratified-holdout experiment with a fixed true-negative rate of 99.9%. For a foreground inhomogeneous VOBN model of order 1 and a background homogeneous variable-order Markov (VOM) model of order 5, the mean true-positive (TP) rate is 47.56%. In comparison, the best TP rate for the conventional models is 44.39%, obtained from a foreground PWM model and a background 2nd-order Markov model. As the standard deviation of the estimated TP rate is approximately 0.01%, this improvement is highly significant.
Pages: 2657-2666
Page count: 10