Toward the development of a gene index to the human genome: An assessment of the nature of high-throughput EST sequence data

Cited by: 89
Authors
Aaronson, JS [1]
Eckman, B [1]
Blevins, RA [1]
Borkowski, JA [1]
Myerson, J [1]
Imran, S [1]
Elliston, KO [1]
Affiliations
[1] MERCK SHARP & DOHME RES LABS,DEPT BIOINFORMAT,W POINT,PA 19486
Source
GENOME RESEARCH | 1996 / Vol. 6 / No. 09
Keywords
DOI
10.1101/gr.6.9.829
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
A rigorous analysis of the Merck-sponsored EST data with respect to known gene sequences increases the utility of the data set and helps refine methods for building a gene index. A highly curated human transcript data base was used as a reference data set of known genes. A detailed analysis of EST sequences derived from known genes was performed to assess the accuracy of EST sequence annotation. The EST data were screened to remove low-quality and low-complexity sequences. A set of high-quality ESTs similar to the transcript data base was identified using BLAST; this subset of ESTs was compared with the set of known genes using the Smith-Waterman algorithm. Error rates of several types were assessed based on a flexible match criterion defining sequence identity. The rate of lane-tracking errors is very low, approximately 0.5%. Insert size data are accurate to within approximately 20%. Reversed clone and internal priming error rates are approximately 5% and 2.5%, respectively, contributing to the incorrect identification of reads as 3' ends of genes. Follow-up investigation reveals that a significant number of clones miscategorized as reversed represent overlapping genes on the opposite strand of entries in the transcript data base. The relevance of these results to the creation of a high-quality index to the human genome capable of supporting diverse genomic investigations is discussed.
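For readers unfamiliar with the comparison step named in the abstract, the following is a minimal sketch of Smith-Waterman local alignment between an EST read and a reference transcript. It is illustrative only: the function names, the scoring values (match 2, mismatch -1, linear gap -2), and the percent-identity helper are assumptions made for this example, and the record does not state the parameters or the match criterion the authors actually used.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Hypothetical helper: share of aligned columns with identical bases."""
    if not aligned_a:
        return 0.0
    same = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q and p != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    score, x, y = smith_waterman("TTACGTAA", "GGACGTCC")
    print(score, x, y, percent_identity(x, y))
```

A threshold on a value such as `percent_identity` over the aligned region is one way a match criterion could be expressed; whether the authors' flexible criterion took this form is not stated in the abstract.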
Pages: 829-845
Page count: 17