Pseudocounts for transcription factor binding sites

Cited by: 39
Authors
Nishida, Keishin [1]
Frith, Martin C. [2]
Nakai, Kenta [1, 3, 4]
Affiliations
[1] Univ Tokyo, Grad Sch Frontier Sci, Dept Med Genome Sci, Chiba 2778562, Japan
[2] Inst Adv Ind Sci & Technol, Computat Biol Res Ctr, Tokyo 1350064, Japan
[3] Univ Tokyo, Inst Med Sci, Ctr Human Genome, Minato Ku, Tokyo 1088639, Japan
[4] Japan Sci & Technol Agcy, Inst Bioinformat Res & Dev BIRD, Chiyoda Ku, Tokyo 10020081, Japan
Funding
Japan Science and Technology Agency;
Keywords
REGULATORY ELEMENTS; MATRICES; IDENTIFICATION; SIMILARITY; SEQUENCES; ALIGNMENT; PROTEINS; DATABASE; MOTIFS;
DOI
10.1093/nar/gkn1019
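Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
070307 [Chemical Biology]; 071010 [Biochemistry and Molecular Biology];
Abstract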
To represent the sequence specificity of transcription factors, the position weight matrix (PWM) is widely used. In most cases, each element is defined as a log likelihood ratio of a base appearing at a certain position, which is estimated from a finite number of known binding sites. To avoid bias due to this small sample size, a certain numeric value, called a pseudocount, is usually allocated for each position, and its fraction according to the background base composition is added to each element. So far, there has been no consensus on the optimal pseudocount value. In this study, we simulated the sampling process by artificially generating binding sites based on observed nucleotide frequencies in a public PWM database, and then the generated matrix with an added pseudocount value was compared to the original frequency matrix using various measures. Although the results were somewhat different between measures, in many cases, we could find an optimal pseudocount value for each matrix. These optimal values are independent of the sample size and are clearly correlated with the entropy of the original matrices, meaning that larger pseudocount values are preferable for less conserved binding sites. As a simple representative, we suggest the value of 0.8 for practical uses.
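The pseudocount scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: it assumes the common convention in which the pseudocount is split across the four bases in proportion to the background composition, so that the corrected frequency at a position is (count + pseudocount × background) / (total + pseudocount), and each PWM element is the log2 ratio of that frequency to the background. The function names `pwm_with_pseudocount` and `score_site`, the uniform default background, and the toy counts are assumptions made for the example; only the value 0.8 comes from the abstract.

```python
import math

BASES = "ACGT"


def pwm_with_pseudocount(counts, background=None, pseudocount=0.8):
    """Build a log2 likelihood-ratio PWM from per-position base counts.

    counts: list of dicts, one per position, mapping base -> observed count.
    background: dict mapping base -> background probability (assumed uniform
        if omitted; this default is an assumption of the sketch).
    pseudocount: total pseudocount per position, shared among the bases in
        proportion to the background composition.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES)
        row = {}
        for b in BASES:
            # Corrected frequency: observed count plus this base's share
            # of the pseudocount, normalised by the enlarged total.
            freq = (column.get(b, 0) + pseudocount * background[b]) / (
                total + pseudocount
            )
            row[b] = math.log2(freq / background[b])
        pwm.append(row)
    return pwm


def score_site(pwm, site):
    """Sum the PWM elements selected by each base of a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))


# Toy counts from eight hypothetical binding sites, two positions long.
toy_counts = [{"A": 8}, {"C": 4, "G": 4}]
toy_pwm = pwm_with_pseudocount(toy_counts)
```

Without the pseudocount, the bases never observed at a position (for example C, G and T in the first toy column) would have a frequency of zero and an undefined log ratio; with it, every element stays finite, and the corrected frequencies at each position still sum to one.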
Pages: 939-944
Number of pages: 6