Analysis of EST-driven gene annotation in human genomic sequence

Cited by: 49
Authors
Bailey, LC [1]
Searls, DB
Overton, GC
Affiliations
[1] Univ Penn, Sch Med, Dept Genet, Computat Biol & Informat Lab, Philadelphia, PA 19104 USA
[2] SmithKline Beecham Pharmaceut, Bioinformat Grp, King Of Prussia, PA 19406 USA
Source
GENOME RESEARCH | 1998 / Vol. 8 / Issue 04
DOI
10.1101/gr.8.4.362
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.
Pages: 362-376
Page count: 15