A novel algorithm for computational identification of contaminated EST libraries

Cited by: 71
Authors
Sorek, R
Safer, HM
Affiliations
[1] Compugen Ltd, IL-69512 Tel Aviv, Israel
[2] Tel Aviv Univ, Sackler Sch Med, Dept Human Genet & Mol Med, IL-69978 Tel Aviv, Israel
DOI
10.1093/nar/gkg170
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification code
071010 ; 081704 ;
Abstract
A key goal of the Human Genome Project was to understand the complete set of human proteins, the proteome. Since the genome sequence by itself is not sufficient for predicting new genes and alternative splicing events that lead to new proteins, expressed sequence tags (ESTs) are used as the primary tool for these purposes. The high prevalence of artifacts in dbEST, however, often leads to invalid predictions. Here we describe a novel method for recognizing genomic DNA contamination and other artifacts that cannot be identified using current EST cleaning techniques. Our method uses the alignment of the entire set of ESTs to the human genome to identify highly contaminated EST libraries. We discovered 53 highly contaminated libraries and a subset of 24 766 ESTs from these libraries that probably represent contamination with genomic DNA, pre-mRNA, and ESTs that span non-canonical introns. Although this is only a small fraction of the entire EST dataset, each contaminating sequence could create a spurious transcript prediction. Indeed, in the clustering and assembly tool that we used, these sequences would have caused incorrect inference of 9575 new splice variants and 6370 new genes. Conclusions based on EST analysis, including prediction of alternative splicing, should be re-evaluated in light of these results. Our method, along with the identified set of contaminated sequences, will be essential for applications that depend on large EST datasets.
Pages: 1067-1074
Page count: 8