Analysis of EST-driven gene annotation in human genomic sequence

Cited by: 49
Authors
Bailey, LC [1]
Searls, DB
Overton, GC
Affiliations
[1] Univ Penn, Sch Med, Dept Genet, Computat Biol & Informat Lab, Philadelphia, PA 19104 USA
[2] SmithKline Beecham Pharmaceut, Bioinformat Grp, King Of Prussia, PA 19406 USA
Source
GENOME RESEARCH | 1998 / Volume 8 / Issue 04
DOI
10.1101/gr.8.4.362
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.
Pages: 362-376
Page count: 15