Comparative ab initio prediction of gene structures using pair HMMs

Cited by: 66
Authors
Meyer, IM [1]
Durbin, R [1]
Affiliation
[1] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
Funding
Wellcome Trust (UK)
DOI
10.1093/bioinformatics/18.10.1309
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification code
071010; 081704
Abstract
We present a novel comparative method for the ab initio prediction of protein coding genes in eukaryotic genomes. The method simultaneously predicts the gene structures of two un-annotated input DNA sequences which are homologous to each other and retrieves the subsequences which are conserved between the two DNA sequences. It is capable of predicting partial, complete and multiple genes and can align pairs of genes which differ by events of exon-fusion or exon-splitting. The method employs a probabilistic pair hidden Markov model. We generate annotations using our model with two different algorithms: the Viterbi algorithm in its linear memory implementation and a new heuristic algorithm, called the stepping stone, for which both memory and time requirements scale linearly with the sequence length. We have implemented the model in a computer program called Doublescan. In this article, we introduce the method and confirm the validity of the approach on a test set of 80 pairs of orthologous DNA sequences from mouse and human. More information can be found at: http://www.sanger.ac.uk/Software/analysis/doublescan.
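As an illustration of the Viterbi algorithm on a pair hidden Markov model, here is a minimal Python sketch. It is not the Doublescan model: it assumes a textbook three-state pair HMM (M emits an aligned pair, X emits a symbol from the first sequence only, Y from the second only), and the parameters `delta`, `epsilon` and `p_match` are hypothetical values chosen for the example. It also keeps the full dynamic-programming table in memory, so it shows neither the linear-memory implementation nor the stepping stone heuristic described in the abstract.

```python
import math

NEG_INF = float("-inf")
STATES = ("M", "X", "Y")
# Number of symbols each state consumes from (x, y).
STEP = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}


def pair_hmm_viterbi(x, y, delta=0.2, epsilon=0.4, p_match=0.8):
    """Most probable state path of a toy 3-state pair HMM, in log space.

    All parameter values are illustrative assumptions, not fitted values.
    Returns (log_probability, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)

    # Allowed transitions; X <-> Y is not permitted in this toy model.
    log_t = {
        ("M", "M"): math.log(1 - 2 * delta),
        ("M", "X"): math.log(delta),
        ("M", "Y"): math.log(delta),
        ("X", "X"): math.log(epsilon),
        ("X", "M"): math.log(1 - epsilon),
        ("Y", "Y"): math.log(epsilon),
        ("Y", "M"): math.log(1 - epsilon),
    }
    log_gap = math.log(0.25)  # uniform emission over A, C, G, T

    def log_pair(a, b):
        # 4 matching pairs share p_match; 12 mismatching pairs share the rest.
        return math.log(p_match / 4) if a == b else math.log((1 - p_match) / 12)

    v = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STATES}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STATES}
    v["M"][0][0] = 0.0  # the start is treated as state M

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for s in STATES:
                di, dj = STEP[s]
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                emit = log_pair(x[i - 1], y[j - 1]) if s == "M" else log_gap
                best, best_prev = NEG_INF, None
                for p in STATES:
                    if (p, s) in log_t:
                        score = v[p][pi][pj] + log_t[(p, s)]
                        if score > best:
                            best, best_prev = score, p
                if best_prev is not None:
                    v[s][i][j] = best + emit
                    ptr[s][i][j] = best_prev

    # Termination: pick the best state at the end of both sequences.
    state = max(STATES, key=lambda s: v[s][n][m])
    log_prob = v[state][n][m]

    # Traceback from (n, m) to (0, 0), building the alignment in reverse.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = STEP[state]
        ax.append(x[i - 1] if di else "-")
        ay.append(y[j - 1] if dj else "-")
        prev = ptr[state][i][j]
        i, j = i - di, j - dj
        state = prev
    return log_prob, "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `pair_hmm_viterbi("ACGT", "ACT")` returns the log probability of the best path together with the two gapped strings. The table here grows with the product of the two sequence lengths, which is the cost the linear-memory Viterbi implementation and the stepping stone algorithm mentioned in the abstract are meant to avoid.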
Pages: 1309-1318
Page count: 10