Generalizations of Markov model to characterize biological sequences

Cited by: 15
Authors
Wang, J [1]
Hannenhalli, S [1]
Affiliations
[1] Univ Penn, Penn Ctr Bioinformat, Dept Genet, Philadelphia, PA 19104 USA
Keywords
DOI
10.1186/1471-2105-6-219
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The currently used kth-order Markov models estimate the probability of generating a single nucleotide conditional on the k immediately preceding units (gap = 0). However, this approach neither takes into account the joint dependency of multiple neighboring nucleotides nor considers long-range dependency with gap > 0. Results: We describe a configurable tool to explore generalizations of the standard Markov model. We evaluated whether sequence classification accuracy can be improved by using an alternative set of model parameters. The evaluation was done on four classes of biological sequences: CpG-poor promoters, all promoters, exons, and nucleosome positioning sequences. Using di- and trinucleotides as the model unit significantly improved sequence classification accuracy relative to the standard single-nucleotide model. In the case of nucleosome positioning sequences, optimal accuracy was achieved at a gap length of 4. Furthermore, in the plot of classification accuracy versus gap, a periodicity of 10-11 bp was observed, which might indicate structural preferences in the nucleosome positioning sequence. The tool is implemented in Java and is available for download at ftp://ftp.pcbi.upenn.edu/GMM/. Conclusion: Markov modeling is an important component of many sequence analysis tools. We have extended the standard Markov model to incorporate joint and long-range dependencies between sequence elements. The proposed generalizations of the Markov model are likely to improve the overall accuracy of sequence analysis tools.
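The three parameters named in the abstract (model order k, model unit size, and gap) can be illustrated with a minimal Python sketch. This is not the authors' Java tool: the class name `GappedMarkovModel`, the sliding-window counting scheme, the add-one pseudocount, and the log-likelihood comparison used for classification are all assumptions made here for illustration. The context is taken to be the k units that end `gap` positions before the unit being predicted.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class GappedMarkovModel:
    """Hypothetical sketch of a Markov model with configurable order, unit, and gap."""

    def __init__(self, order=1, unit=1, gap=0, pseudocount=1.0):
        self.order = order              # k: number of preceding units in the context
        self.unit = unit                # unit length (1 = nucleotide, 2 = dinucleotide, ...)
        self.gap = gap                  # positions skipped between context and target
        self.pseudocount = pseudocount  # assumed smoothing, not taken from the paper
        self.counts = defaultdict(lambda: defaultdict(float))

    def _pairs(self, seq):
        """Yield (context, target) pairs by sliding one nucleotide at a time."""
        ctx_len = self.order * self.unit
        start = ctx_len + self.gap
        for i in range(start, len(seq) - self.unit + 1):
            context = seq[i - self.gap - ctx_len : i - self.gap]
            target = seq[i : i + self.unit]
            yield context, target

    def fit(self, sequences):
        """Count how often each target unit follows each context."""
        for seq in sequences:
            for context, target in self._pairs(seq.upper()):
                self.counts[context][target] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of smoothed log conditional probabilities over the sequence."""
        n_targets = len(ALPHABET) ** self.unit
        total = 0.0
        for context, target in self._pairs(seq.upper()):
            row = self.counts.get(context, {})
            numerator = row.get(target, 0.0) + self.pseudocount
            denominator = sum(row.values()) + self.pseudocount * n_targets
            total += math.log(numerator / denominator)
        return total


def classify(seq, pos_model, neg_model):
    """Return True if seq scores higher under the positive-class model."""
    return pos_model.log_likelihood(seq) > neg_model.log_likelihood(seq)


# Toy training data, invented for illustration only.
pos = GappedMarkovModel(order=1, unit=2, gap=0).fit(["CGCGCGCGCG"])
neg = GappedMarkovModel(order=1, unit=2, gap=0).fit(["ATATATATAT"])
is_positive = classify("CGCGCG", pos, neg)
```

Setting `unit=1, gap=0` gives the standard kth-order model; larger `unit` values model joint dependency of neighboring nucleotides, and `gap > 0` conditions on a context further upstream.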
Pages: 8