Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs)

Cited by: 336
Authors
Rausch, C [1]
Weber, T
Kohlbacher, O
Wohlleben, W
Huson, DH
Affiliations
[1] Univ Tubingen, ZBIT, Tubingen, Germany
[2] Univ Tubingen, Dept Microbiol & Biotechnol, Tubingen, Germany
DOI
10.1093/nar/gki885
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
We present a new support vector machine (SVM)-based approach to predict the substrate specificity of subtypes of a given protein sequence family. We demonstrate the usefulness of this method on the example of aryl acid-activating and amino acid-activating adenylation domains (A domains) of nonribosomal peptide synthetases (NRPS). The residues of gramicidin synthetase A that are 8 Å around the substrate amino acid and corresponding positions of other adenylation domain sequences with 397 known and unknown specificities were extracted and used to encode this physico-chemical fingerprint into normalized real-valued feature vectors based on the physico-chemical properties of the amino acids. The SVM software package SVMlight was used for training and classification, with transductive SVMs to take advantage of the information inherent in unlabeled data. Specificities for very similar substrates that frequently show cross-specificities were pooled to the so-called composite specificities and predictive models were built for them. The reliability of the models was confirmed in cross-validations and in comparison with a currently used sequence-comparison-based method. When comparing the predictions for 1230 NRPS A domains that are currently detectable in UniProt, the new method was able to give a specificity prediction in an additional 18% of the cases compared with the old method. For 70% of the sequences both methods agreed, for <6% they did not, mainly on low-confidence predictions by the existing method. None of the predictive methods could infer any specificity for 2.4% of the sequences, suggesting completely new types of specificity.
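The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two property scales (Kyte-Doolittle hydropathy and a crude formal charge at neutral pH), the z-score normalization, the helper names, and the example residue string are all assumptions chosen for illustration, and the abstract does not state which properties or normalization were actually used. The output line follows the SVMlight input format, in which a target of 0 marks an unlabeled example for transductive learning.

```python
# Illustrative sketch only: property scales, normalization and names are assumptions.

# Kyte-Doolittle hydropathy index per amino acid.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude formal side-chain charge at neutral pH (assumption: His treated as neutral).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def zscore_table(table):
    """Normalize a property scale to zero mean and unit variance over the 20 residues."""
    values = list(table.values())
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / sd for aa, v in table.items()}


PROPERTIES = [zscore_table(HYDROPATHY), zscore_table(CHARGE)]


def encode(residues):
    """Map a string of binding-pocket residues to a flat real-valued feature vector.

    Each position contributes one value per property. Gaps or unknown
    characters are encoded as 0.0, i.e. the mean of the normalized scale.
    """
    vector = []
    for aa in residues:
        for prop in PROPERTIES:
            vector.append(prop.get(aa, 0.0))
    return vector


def svmlight_line(label, vector):
    """Format one example in SVMlight syntax: '<target> 1:<v1> 2:<v2> ...'.

    Targets are +1 / -1 for labeled examples and 0 for unlabeled ones.
    """
    features = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(vector, start=1))
    return f"{label} {features}"


if __name__ == "__main__":
    pocket = "DAWTIAAICK"  # placeholder residue string, for illustration only
    print(svmlight_line(0, encode(pocket)))
```

Writing labeled and unlabeled examples into one training file in this way is what lets a transductive SVM use the sequences of unknown specificity alongside the annotated ones.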
Pages: 5799-5808
Number of pages: 10