Database verification studies of SWISS-PROT and GenBank

Cited by: 22
Authors
Karp, PD
Paley, S
Zhu, JC
Affiliations
[1] SRI Int, Bioinformat Res Grp, Menlo Park, CA 94025 USA
[2] Univ Calif San Francisco, Dept Med Informat Sci, San Francisco, CA 94143 USA
DOI
10.1093/bioinformatics/17.6.526
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Problem statement: We have studied the relationships among SWISS-PROT, TrEMBL, and GenBank with two goals. The first is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information were present in reasonable quantities, it would allow researchers to reduce the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms. The second is to assess the consistency between translated GenBank sequences and sequences in SWISS-PROT and TrEMBL.
Results: (1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins. (2) SWISS-PROT is more incomplete than we expected, in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier. (3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset. (4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using BLAST search; the remaining 2.0% of E. coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL.
Although many of these differences can be explained by the complexity of the database, and by the curation processes used to create it, the scale of the differences is notable.
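The tiered comparison reported in result (4) (identical match, then exact substring match, then a similarity search) can be illustrated with a minimal Python sketch. This is not the authors' code, since the abstract does not describe their implementation. The function name `classify_match`, the category labels, and the toy sequences are all hypothetical, and the BLAST tier is only represented by a placeholder label because running BLAST is outside the scope of a short example.

```python
from collections import Counter


def classify_match(genbank_protein, reference_proteins):
    """Classify one translated GenBank protein against reference sequences.

    Hypothetical helper, not the paper's implementation. Returns:
      "identical"               - the sequence equals a reference sequence
      "substring"               - one sequence is an exact substring of the other
                                  (e.g. a removed initial methionine or a
                                  different translation start position)
      "needs_similarity_search" - no exact relationship; a tool such as BLAST
                                  would be the next step
    """
    query = genbank_protein.strip().upper()
    if not query:
        raise ValueError("empty query sequence")

    # Normalize references and skip empty entries, which would otherwise
    # count as a substring of every query.
    refs = [r.strip().upper() for r in reference_proteins if r.strip()]

    if query in set(refs):
        return "identical"
    for ref in refs:
        if query in ref or ref in query:
            return "substring"
    return "needs_similarity_search"


def tally(genbank_proteins, reference_proteins):
    """Count how many query proteins fall into each category."""
    return Counter(classify_match(p, reference_proteins) for p in genbank_proteins)


if __name__ == "__main__":
    # Toy sequences invented for illustration only.
    references = ["MKTAYIAKQR", "MSLNDELLKG"]
    queries = ["MKTAYIAKQR", "KTAYIAKQR", "MSLNDQLLKG"]
    for category, count in tally(queries, references).items():
        print(f"{category}: {count / len(queries):.1%}")
```

Checking substring containment in both directions is a design choice made here so that the sketch covers both a GenBank translation that is shorter than the reference and one that is longer; the abstract itself says only that "exact substring matches" were counted.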
Pages: 526-532
Number of pages: 7