Predicting Tryptic Cleavage from Proteomics Data Using Decision Tree Ensembles

Cited by: 39
Authors
Fannes, Thomas [3]
Vandermarliere, Elien [1,2]
Schietgat, Leander [3]
Degroeve, Sven [1,2]
Martens, Lennart [1,2]
Ramon, Jan [3]
Affiliations
[1] VIB, Dept Med Prot Res, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Biochem, B-9000 Ghent, Belgium
[3] Katholieke Univ Leuven, Dept Comp Sci, B-3000 Louvain, Belgium
Keywords
mass spectrometry; trypsin; PRIDE; machine learning; decision tree; TANDEM MASS-SPECTROMETRY; PROTEIN IDENTIFICATION; TRYPSIN; PERFORMANCE; COMPLEXITY; PEPTIDES; SITES
DOI
10.1021/pr4001114
Chinese Library Classification
Q5 [Biochemistry]
Subject classification code
070307 [Chemical Biology]
Abstract
Trypsin is the workhorse protease in mass spectrometry-based proteomics experiments and is used to digest proteins into more readily analyzable peptides. To identify these peptides after mass spectrometric analysis, the actual digestion has to be mimicked as faithfully as possible in silico. In this paper we introduce CP-DT (Cleavage Prediction with Decision Trees), an algorithm based on a decision tree ensemble that was learned on publicly available peptide identification data from the PRIDE repository. We demonstrate that CP-DT is able to accurately predict tryptic cleavage: tests on three independent data sets show that CP-DT significantly outperforms the Keil rules that are currently used to predict tryptic cleavage. Moreover, the trees generated by CP-DT can make predictions efficiently and are interpretable by domain experts.
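For orientation, rule-based in silico digestion of the kind the abstract compares against can be sketched in a few lines. The sketch below is not CP-DT and not the full Keil rule set; it assumes only the widely used simplified trypsin rule (cleave C-terminal to K or R unless the next residue is P). The function names `cleavage_sites` and `digest` are hypothetical and chosen for illustration.

```python
from typing import List


def cleavage_sites(protein: str) -> List[int]:
    """Return each index i where the bond after protein[i] is predicted to be cleaved.

    Assumption: simplified trypsin rule only (K or R, not followed by P).
    The full Keil rules add further exceptions that are not modeled here.
    """
    sites = []
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i)
    return sites


def digest(protein: str) -> List[str]:
    """Split a protein sequence into peptides at every predicted cleavage site."""
    peptides = []
    start = 0
    for i in cleavage_sites(protein):
        peptides.append(protein[start:i + 1])
        start = i + 1
    peptides.append(protein[start:])  # C-terminal peptide
    return peptides


if __name__ == "__main__":
    # Toy sequence: K at index 1 and K at index 7 are cleaved; R at index 4 precedes P.
    print(digest("MKAARPLKGG"))
```

A learned predictor such as the one described in the abstract would replace the fixed condition in `cleavage_sites` with a model score computed from the residues surrounding each K or R.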
Pages: 2253-2259
Page count: 7