Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources

Cited by: 906
Authors
Stanke, M [1]
Schöffmann, O
Morgenstern, B
Waack, S
Affiliations
[1] Univ Gottingen, Inst Mikrobiol & Genet, D-3400 Gottingen, Germany
[2] Univ Gottingen, Inst Numer & Angew Math, D-3400 Gottingen, Germany
DOI
10.1186/1471-2105-7-62
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: To improve gene prediction, extrinsic evidence on the gene structure can be collected from various sources of information such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and usually uncertain. The extrinsic evidence is usually not sufficient to recover the complete structure of all genes, and the available evidence is often unreliable. Therefore, extrinsic evidence is most valuable when it is balanced with sequence-intrinsic evidence. Results: We present a fairly general method for integration of external information. Our method is based on the evaluation of hints to potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used this method to extend the ab initio gene prediction program AUGUSTUS to a versatile tool that we call AUGUSTUS+. In this study, we focus on hints derived from matches to an EST or protein database, but our approach can be used to include arbitrary user-defined hints. Our method is only moderately affected by the length of a database match. Further, it exploits the information that can be derived from the absence of such matches. As a special case, AUGUSTUS+ can predict genes under user-defined constraints, e.g., if the positions of certain exons are known. With hints from EST and protein databases, our new approach was able to predict 89% of the exons in human chromosome 22 correctly. Conclusion: Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy. When a match of a sequence interval to an EST or protein sequence is used, it should be treated as compound information rather than as information about individual positions.
Pages: 11