CONSTRUCTION OF VALIDATED, NONREDUNDANT COMPOSITE PROTEIN-SEQUENCE DATABASES

Cited by: 192
Authors
BLEASBY, AJ
WOOTTON, JC
Affiliations
[1] UNIV LEEDS,DEPT GENET,LEEDS LS2 9JT,W YORKSHIRE,ENGLAND
[2] UNIV LEEDS,DEPT BIOPHYS,LEEDS LS2 9JT,W YORKSHIRE,ENGLAND
Source
PROTEIN ENGINEERING | 1990 / Vol. 3 / No. 03
Keywords
Amino acid sequence; Composite database; Information retrieval; Protein homology; Sequence similarity;
DOI
10.1093/protein/3.3.153
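CLC classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry];
Abstract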
A strategy has been developed for the construction of a validated, comprehensive composite protein sequence database. Entries are amalgamated from primary source databases by a largely automated set of processes in which redundant and trivially different entries are eliminated. A modular approach has been adopted to allow scientific judgement to be used at each stage of database processing and amalgamation. Source databases are assigned a priority depending on the quality of sequence validation and commenting. Rejection of entries from the lower-priority database, in each pairwise comparison of databases, is carried out according to optionally defined redundancy criteria based on sequence segment mismatches. Efficient algorithms for this methodology are embodied in the COMPO software system. COMPO has been applied for over 2 years in construction and regular updating of the OWL composite protein sequence database from the source databases NBRF-PIR, SWISS-PROT, a GenBank translation retrieved from the feature tables, NBRF-NEW, NEWAT86, PSD-KYOTO and the sequences contained in the Brookhaven protein structure databank. OWL is part of the ISIS integrated data resource of protein sequence and structure [Akrigg et al. (1988) Nature, 335, 745-746]. The modular nature of the integration process greatly facilitates the frequent updating of OWL following releases of the source databases. The extent of redundancy in these sources is revealed by the comparison process. The advantages of a robust composite database for sequence similarity searching and information retrieval are discussed. © 1990 Oxford University Press.
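The priority-based amalgamation described in the abstract can be illustrated with a minimal sketch. It is not the COMPO implementation: the abstract does not specify the redundancy criteria, so the rule used here (equal length and at most `max_mismatches` differing residues) is an assumption, as are the function names, the source names `DB_A` and `DB_B`, and the toy sequences.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str]  # (source name, entry id, sequence)


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_redundant(candidate: str, kept: str, max_mismatches: int) -> bool:
    # Assumed criterion (not taken from the paper): same length and
    # no more than `max_mismatches` differing residues.
    if len(candidate) != len(kept):
        return False
    return mismatches(candidate, kept) <= max_mismatches


def merge_by_priority(
    sources: List[Tuple[str, Dict[str, str]]], max_mismatches: int = 1
) -> List[Entry]:
    """Merge sources listed from highest to lowest priority.

    An entry from a lower-priority source is rejected when it is
    redundant with an entry already accepted from a higher-priority one.
    """
    composite: List[Entry] = []
    for name, entries in sources:
        # Snapshot: compare only against higher-priority sources,
        # not against entries of the source being processed.
        higher = list(composite)
        for entry_id, seq in entries.items():
            if any(is_redundant(seq, k, max_mismatches) for _, _, k in higher):
                continue
            composite.append((name, entry_id, seq))
    return composite


# Toy input: DB_A has priority over DB_B.
sources = [
    ("DB_A", {"a1": "MKTAYIAK"}),
    ("DB_B", {"b1": "MKTAYIAK", "b2": "MKTAYIAR", "b3": "GGGGGGGG"}),
]
merged = merge_by_priority(sources, max_mismatches=1)
```

With `max_mismatches=1`, the identical entry `b1` and the single-residue variant `b2` are rejected in favour of `a1`, while `b3` is retained. Setting `max_mismatches=0` would reject only exact duplicates.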
Pages: 153-159
Number of pages: 7
Related papers
13 in total
[1] Akrigg D., 1988, Nature, UK, V335, P745, DOI 10.1038/335745a0
[2] BERNSTEIN, FC; KOETZLE, TF; WILLIAMS, GJB; MEYER, EF; BRICE, MD; RODGERS, JR; KENNARD, O; SHIMANOUCHI, T; TASUMI, M. PROTEIN DATA BANK - COMPUTER-BASED ARCHIVAL FILE FOR MACROMOLECULAR STRUCTURES [J]. JOURNAL OF MOLECULAR BIOLOGY, 1977, 112 (03): 535-542
[3] BISHOP MJ, 1987, NUCLEIC ACID PROTEIN, P83
[4] BURKS C, 1985, COMPUT APPL BIOSCI, V1, P225
[5] CLAVERIE J M, 1986, Proteins Structure Function and Genetics, V1, P60, DOI 10.1002/prot.340010110
[6] DOOLITTLE, RF. SIMILAR AMINO-ACID-SEQUENCES - CHANCE OR COMMON ANCESTRY [J]. SCIENCE, 1981, 214 (4517): 149-159
[7]
[8] FICKETT JW, 1986, TRENDS BIOCHEM SCI, V11, P190, DOI 10.1016/0968-0004(86)90142-8
[9] GEORGE, DG; BARKER, WC; HUNT, LT. THE PROTEIN IDENTIFICATION RESOURCE (PIR) [J]. NUCLEIC ACIDS RESEARCH, 1986, 14 (01): 11-15
[10] HAMM, GH; CAMERON, GN. THE EMBL DATA LIBRARY [J]. NUCLEIC ACIDS RESEARCH, 1986, 14 (01): 5-9