Genomewide motif identification using a dictionary model

Cited by: 15
Authors
Sabatti, C [1]
Lange, K
Affiliations
[1] Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA 90095 USA
[2] Univ Calif Los Angeles, Biomath Dept, Los Angeles, CA 90095 USA
[3] Univ Calif Los Angeles, Dept Stat, Los Angeles, CA 90095 USA
Keywords
expectation-maximization algorithm; genomic sequence; maximum a posteriori; text segmentation
DOI
10.1109/JPROC.2002.804689
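Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification code
0808; 0809
Abstract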
This paper surveys and extends models and algorithms for identifying binding sites in noncoding regions of DNA. Binding sites control the transcription of genes into messenger RNA in preparation for translation into proteins. The base sequence of most binding sites is not entirely fixed, with the different permitted spellings collectively constituting a "motif." After summarizing the underlying biological issues, we review three different models for binding site identification. Each model was developed with a different type of dataset as reference. We then present a unified model that borrows from the previous ones and integrates their main features. In our unified model, one can identify motifs and their unknown positions along a sequence. One can also fit the model to data using maximum likelihood and maximum a posteriori algorithms. These algorithms rely on recursive formulas and the maximization/minorization principle. Finally, we conclude with a prospectus of future data analyses and theoretical research.
Pages: 1803-1810
Page count: 8