Redefining CpG islands using hidden Markov models

Cited by: 123
Authors
Wu, Hao [1]
Caffo, Brian [1]
Jaffee, Harris A. [1]
Irizarry, Rafael A. [1]
Feinberg, Andrew P. [2, 3]
Affiliations
[1] Johns Hopkins Univ, Dept Biostat, Baltimore, MD 21205 USA
[2] Johns Hopkins Univ, Sch Med, Dept Med, Baltimore, MD 21205 USA
[3] Johns Hopkins Univ, Sch Med, Ctr Epigenet, Baltimore, MD 21205 USA
Funding
National Institutes of Health (USA);
Keywords
CpG island; Epigenetics; Hidden Markov model; Sequence analysis; DNA methylation;
DOI
10.1093/biostatistics/kxq005
Chinese Library Classification number
Q [Biological sciences];
Subject classification codes
07; 0710; 09;
Abstract
The DNA of most vertebrates is depleted in CpG dinucleotides: a C followed by a G in the 5' to 3' direction. CpGs are the target for DNA methylation, a chemical modification of cytosine (C) that is heritable during cell division and is the best-characterized epigenetic mechanism. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). Knowing CGI locations is important because they mark functionally relevant epigenetic loci in development and disease. For various mammals, including human, a widely used list of CGI is readily available from the UCSC Genome Browser. This list was derived using algorithms that search for regions satisfying a definition of CGI proposed by Gardiner-Garden and Frommer more than 20 years ago. Recent findings, enabled by advances in technology that permit direct measurement of epigenetic endpoints at a whole-genome scale, motivate the need to adapt the current CGI definition. In this paper, we propose a procedure, guided by hidden Markov models, that permits an extensible approach to detecting CGI. The main advantage of our approach over others is that it summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for other species. The utility of this approach is demonstrated by generating the first CGI lists for invertebrates and by creating CGI lists that substantially increase overlap with recently discovered epigenetic marks. A CGI list and the probability scores, as a function of genome location, for each species are available at http://www.rafalab.org.
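As an illustration of the threshold-based definition that the abstract contrasts with its hidden Markov model approach, the following is a minimal Python sketch. It is not the authors' method. The function names (`cpg_stats`, `looks_like_cgi`) are hypothetical, and the thresholds (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6) are the values commonly attributed to the Gardiner-Garden and Frommer definition; they are not stated in the abstract itself.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    # A CpG is a C immediately followed by a G, read 5' to 3'.
    # "CG" cannot overlap itself, so str.count gives the exact number.
    cpg = seq.count("CG")
    gc_content = (c + g) / n
    # Expected CpG count if C and G occurred independently: (#C * #G) / n.
    expected = c * g / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_content, obs_exp


def looks_like_cgi(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content, and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    window = "CG" * 100  # toy 200 bp sequence, not real genomic data
    print(cpg_stats(window), looks_like_cgi(window))
```

A hard cut-off of this kind returns only a yes/no answer per window, whereas the approach described in the abstract reports probability scores along the genome, so the definition can be adjusted afterwards.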
Pages: 499-514
Number of pages: 16
Related papers
23 in total
[1]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[2]   MEME: discovering and analyzing DNA and protein sequence motifs [J].
Bailey, Timothy L. ;
Williams, Nadya ;
Misleh, Chris ;
Li, Wilfred W. .
NUCLEIC ACIDS RESEARCH, 2006, 34 :W369-W373
[3]   Combining evidence using p-values: application to sequence homology searches [J].
Bailey, TL ;
Gribskov, M .
BIOINFORMATICS, 1998, 14 (01) :48-54
[4]   CPG-RICH ISLANDS AND THE FUNCTION OF DNA METHYLATION [J].
BIRD, AP .
NATURE, 1986, 321 (6067) :209-213
[5]   A Bayesian approach to DNA sequence segmentation - Discussion - Reply [J].
Boys, RJ ;
Henderson, DA .
BIOMETRICS, 2004, 60 (03) :585-588
[6]   CHURCHILL GA, 1989, B MATH BIOL, V51, P79
[7]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[8]   Durbin R., 1998, Biological sequence analysis: probabilistic models of proteins and nucleic acids
[9]   Phenotypic plasticity and the epigenetics of human disease [J].
Feinberg, Andrew P. .
NATURE, 2007, 447 (7143) :433-440
[10]   CPG ISLANDS IN VERTEBRATE GENOMES [J].
GARDINERGARDEN, M ;
FROMMER, M .
JOURNAL OF MOLECULAR BIOLOGY, 1987, 196 (02) :261-282