Distinguishing enzyme structures from non-enzymes without alignments

Cited by: 300
Authors
Dobson, PD [1]
Doig, AJ [1]
Affiliation
[1] Univ Manchester, Dept Biomol Sci, Manchester M60 1QD, Lancs, England
Funding
UK Medical Research Council
Keywords
protein function prediction; structure; enzyme; machine learning; structural genomics
DOI
10.1016/S0022-2836(03)00628-4
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
The ability to predict protein function from structure is becoming increasingly important as the number of structures resolved is growing more rapidly than our capacity to study function. Current methods for predicting protein function are mostly reliant on identifying a similar protein of known function. For proteins that are highly dissimilar or are only similar to proteins also lacking functional annotations, these methods fail. Here, we show that protein function can be predicted as enzymatic or not without resorting to alignments. We describe 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank using simple features such as secondary-structure content, amino acid propensities, surface properties and ligands. The subset is split into two functional groupings, enzymes and non-enzymes. We use the support vector machine-learning algorithm to develop models that are capable of assigning the protein class. Validation of the method shows that the function can be predicted to an accuracy of 77% using 52 features to describe each protein. An adaptive search of possible subsets of features produces a simplified model based on 36 features that predicts at an accuracy of 80%. We compare the method to sequence-based methods that also avoid calculating alignments and predict a recently released set of unrelated proteins. The most useful features for distinguishing enzymes from non-enzymes are secondary-structure content, amino acid frequencies, number of disulphide bonds and size of the largest cleft. This method is applicable to any structure as it does not require the identification of sequence or structural similarity to a protein of known function. © 2003 Elsevier Science Ltd. All rights reserved.
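As a rough illustration of the workflow the abstract describes (a feature vector per protein, a support vector machine assigning enzyme or non-enzyme, and a validation step), the following is a minimal sketch rather than the authors' implementation. Everything specific in it is an assumption: the data are synthetic with two made-up features instead of the 52 structural features, the classifier is a linear SVM trained with a Pegasos-style subgradient method (the abstract does not state the kernel or solver), and the 5-fold split is only one possible validation scheme. The function names are hypothetical.

```python
import random


def make_synthetic_proteins(n_per_class, rng):
    """Synthetic stand-in data (assumption): two invented features per 'protein'.

    Label +1 plays the role of 'enzyme', -1 the role of 'non-enzyme'.
    """
    data = []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            features = [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
            data.append((features, label))
    rng.shuffle(data)
    return data


def train_linear_svm(data, lam=0.01, n_steps=2000, rng=None):
    """Linear SVM via Pegasos-style stochastic subgradient descent on hinge loss.

    A constant 1.0 is appended to each feature vector so the last weight acts
    as a bias term (it is regularised like the other weights).
    """
    rng = rng or random.Random(0)
    dim = len(data[0][0]) + 1
    w = [0.0] * dim
    for t in range(1, n_steps + 1):
        x, y = rng.choice(data)
        x = x + [1.0]                      # new list; the input is not mutated
        eta = 1.0 / (lam * t)              # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]   # shrink (regularisation)
        if margin < 1.0:                   # hinge loss is active for this sample
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, features):
    """Return +1 or -1 according to the sign of the linear score."""
    score = sum(wi * xi for wi, xi in zip(w, features + [1.0]))
    return 1 if score >= 0 else -1


def cross_validated_accuracy(data, k=5, seed=0):
    """k-fold cross-validation: train on k-1 folds, score on the held-out fold."""
    correct = 0
    for fold in range(k):
        test = data[fold::k]
        train = [d for i, d in enumerate(data) if i % k != fold]
        w = train_linear_svm(train, rng=random.Random(seed + fold))
        correct += sum(predict(w, x) == y for x, y in test)
    return correct / len(data)


rng = random.Random(42)
proteins = make_synthetic_proteins(100, rng)
accuracy = cross_validated_accuracy(proteins)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

The synthetic classes are deliberately well separated, so the accuracy printed here says nothing about the 77% and 80% figures reported in the abstract; those depend on the real structural features and the authors' model.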
Pages: 771-783
Page count: 13