Gene recognition via spliced sequence alignment

Cited by: 203
Authors
Gelfand, MS
Mironov, AA
Pevzner, PA
Affiliations
[1] UNIV SO CALIF,DEPT MATH,LOS ANGELES,CA 90089
[2] UNIV SO CALIF,DEPT COMP SCI,LOS ANGELES,CA 90089
[3] RUSSIAN ACAD SCI,INST PROT RES,MOSCOW 142292,RUSSIA
[4] NIIGENETIKA,NATL BIOTECHNOL CTR,LAB MATH METHODS,MOSCOW 113545,RUSSIA
DOI
10.1073/pnas.93.17.9061
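Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification code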
07 ; 0710 ; 09 ;
Abstract
Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes, and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
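The idea of exploring all exon assemblies in polynomial time can be illustrated with a small dynamic program. The sketch below is not the authors' implementation: it assumes the candidate exons are already given as half-open, non-empty intervals on the genomic sequence, it aligns nucleotides against a nucleotide target rather than against a related protein, and it uses a simple match/mismatch/gap score. The function name `spliced_alignment` and all of its parameters are hypothetical.

```python
def spliced_alignment(genome, blocks, target, match=1, mismatch=-1, gap=-1):
    """Return (score, chain) for the best chain of non-overlapping blocks.

    Illustrative assumptions: blocks are non-empty half-open (start, end)
    intervals on genome; the chain's concatenation is globally aligned
    to target with linear gap penalties.
    """
    m = len(target)
    # Score of aligning target[:j] against an empty chain (all gaps).
    empty = [(j * gap, ()) for j in range(m + 1)]
    finished = []  # (block_end, final DP row) for each processed block
    best = empty[m]
    # Processing blocks by end position guarantees that every possible
    # predecessor of a block has been finished before the block starts.
    for start, end in sorted(blocks, key=lambda b: b[1]):
        row = []
        for j in range(m + 1):
            # Either start a new chain or extend a chain whose last block
            # ends at or before this block's start.
            cands = [empty[j]] + [r[j] for e, r in finished if e <= start]
            score, chain = max(cands)
            row.append((score, chain + ((start, end),)))
        # Standard global-alignment row update for each base in the block.
        for base in genome[start:end]:
            new = [(row[0][0] + gap, row[0][1])]
            for j in range(1, m + 1):
                sub = match if base == target[j - 1] else mismatch
                new.append(max(
                    (row[j - 1][0] + sub, row[j - 1][1]),  # align base to target[j-1]
                    (row[j][0] + gap, row[j][1]),          # base against a gap
                    (new[j - 1][0] + gap, new[j - 1][1]),  # target[j-1] against a gap
                ))
            row = new
        finished.append((end, row))
        best = max(best, row[m])
    return best


genome = "TTGATTACAGGGGCATTT"
blocks = [(2, 9), (5, 9), (9, 13), (13, 16)]
print(spliced_alignment(genome, blocks, "GATTACACAT"))
# → (10, ((2, 9), (13, 16)))
```

Carrying the chain alongside each score keeps the sketch short at the cost of extra tuple copying; a production version would store back-pointers instead. The number of score cells is the total block length times the target length, so the search stays polynomial even though the number of possible chains grows exponentially with the number of blocks.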
Pages: 9061-9066
Page count: 6