A QUALITY-CONTROL ALGORITHM FOR DNA-SEQUENCING PROJECTS

Cited by: 27
Authors
WHITE, O
DUNNING, T
SUTTON, G
ADAMS, M
VENTER, JC
FIELDS, C
Affiliations
[1] INST GENOM RES, 932 CLOPPER RD, GAITHERSBURG, MD 20878 USA
[2] NEW MEXICO STATE UNIV, COMP RES LAB, LAS CRUCES, NM 88003 USA
DOI
10.1093/nar/21.16.3829
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Heterologous DNA sequences from rearrangements with the genomes of host cells, genomic fragments from hybrid cells, or impure tissue sources can threaten the purity of libraries that are derived from RNA or DNA. Hybridization methods can only detect contaminants from known or suspected heterologous sources, and whole library screening is technically very difficult. Detection of contaminating heterologous clones by sequence alignment is only possible when related sequences are present in a known database. We have developed a statistical test to identify heterologous sequences that is based on the differences in hexamer composition of DNA from different organisms. This test does not require that sequences similar to potential heterologous contaminants are present in the database, and can in principle detect contamination by previously unknown organisms. We have applied this test to the major public expressed sequence tag (EST) data sets to evaluate its utility as a quality control measure and a peer evaluation tool. There is detectable heterogeneity in most human and C. elegans EST data sets but it is not apparently associated with cross-species contamination. However, there is direct evidence for both yeast and bacterial sequence contamination in some public database sequences annotated as human. Results obtained with the hexamer test have been confirmed with similarity searches using sequences from the relevant data sets.
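The abstract names hexamer composition as the basis of the test but does not give the statistic itself. The following is a minimal illustrative sketch, not the authors' algorithm: it assumes a simple log-odds comparison between two smoothed hexamer frequency tables. The function names (`hexamer_counts`, `hexamer_frequencies`, `log_odds_score`), the add-one pseudocount, and the choice of a mean log-odds score are all assumptions introduced here for illustration.

```python
from collections import Counter
from itertools import product
from math import log

K = 6  # hexamer length, as named in the abstract
ALL_HEXAMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def hexamer_counts(seq):
    """Count overlapping hexamers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - K + 1):
        word = seq[i:i + K]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def hexamer_frequencies(seqs, pseudocount=1.0):
    """Estimate smoothed hexamer frequencies from a collection of sequences.

    The pseudocount (an assumption of this sketch) keeps unseen hexamers
    from producing zero probabilities.
    """
    total = Counter()
    for s in seqs:
        total.update(hexamer_counts(s))
    denom = sum(total.values()) + pseudocount * len(ALL_HEXAMERS)
    return {h: (total[h] + pseudocount) / denom for h in ALL_HEXAMERS}


def log_odds_score(seq, host_freq, other_freq):
    """Mean per-hexamer log-odds; positive values favour the host model."""
    counts = hexamer_counts(seq)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # sequence too short or fully ambiguous: no evidence
    return sum(c * log(host_freq[h] / other_freq[h])
               for h, c in counts.items()) / n
```

In use, one would build `host_freq` from sequences of the expected organism and `other_freq` from a reference for a different organism, then flag clones whose score falls well below the bulk of the library. Any cutoff would have to be calibrated on real data; none is implied by the abstract.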
Pages: 3829-3838
Page count: 10