Predicting liquid chromatographic retention times of peptides from the Drosophila melanogaster proteome by machine learning approaches

Cited by: 36
Authors
Tian, Feifei [1]
Yang, Li [1]
Lv, Fenglin [1]
Zhou, Peng [2]
Affiliations
[1] Chongqing Univ, Coll Bioengn, Chongqing 400044, Peoples R China
[2] Zhejiang Univ, Dept Chem, Hangzhou 310027, Peoples R China
Keywords
Least-squares support vector machine; Random forest; Gaussian process; Peptide; Liquid chromatography; Quantitative structure-retention relationship; PARTIAL LEAST-SQUARES; ARTIFICIAL NEURAL-NETWORKS; ESCHERICHIA-COLI PROTEOME; SUPPORT VECTOR MACHINE; CARLO CROSS-VALIDATION; QUANTITATIVE PREDICTION; PROTEASE DIGESTION; GAUSSIAN-PROCESSES; REGRESSION-MODELS; MS;
DOI
10.1016/j.aca.2009.04.010
Chinese Library Classification number
O65 [Analytical chemistry];
Discipline classification codes
070302 ; 081704 ;
Abstract
Three machine learning algorithms, least-squares support vector machine (LSSVM), random forest (RF) and Gaussian process (GP), were used to model the quantitative structure-retention relationship (QSRR) for predicting and explaining the retention behavior of proteome-wide peptides in reversed-phase liquid chromatography. Peptides were parameterized using the CODESSA approach, and 145 descriptors were obtained for each peptide, covering diverse structural information such as constitutional, topological, geometrical and physicochemical properties. On this basis, the nonlinear LSSVM, RF and GP methods, as well as a sophisticated linear method, partial least-squares (PLS) regression, were employed in QSRR model development. The stability and predictive power of the constructed models were confirmed by a series of systematic validations: internal cross-validation, an external test, and Monte Carlo cross-validation. The results show that regression models developed using the nonlinear approaches (LSSVM, RF and GP) predict better than the linear PLS models. Given that the retention times used in this work were measured on different columns and thus carry a relatively large uncertainty (reproducibility within 7%), the best statistics, obtained from GP modeling, are satisfactory: the coefficients of determination (R²) for the training set and test set are 0.894 and 0.866, respectively. © 2009 Elsevier B.V. All rights reserved.
Pages: 10-16
Page count: 7
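The abstract names Monte Carlo cross-validation and the coefficient of determination as validation tools but does not spell out either. The sketch below illustrates the general idea only: repeated random train/test splits, with R² computed on each held-out part. It is not the authors' procedure. The data are synthetic, the single-descriptor least-squares line is a stand-in for the LSSVM, RF, GP and PLS models, and all function names and parameter values (`monte_carlo_cv`, `n_rounds`, `test_fraction`) are assumptions made for illustration.

```python
import random
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for one descriptor; returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def monte_carlo_cv(x, y, n_rounds=100, test_fraction=0.3, seed=0):
    """Repeat random splits; fit on the training part, score on the rest."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    n_test = max(2, int(round(test_fraction * len(x))))
    scores = []
    for _ in range(n_rounds):
        rng.shuffle(indices)  # a fresh random partition each round
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line([x[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        y_pred = [slope * x[i] + intercept for i in test_idx]
        scores.append(r_squared([y[i] for i in test_idx], y_pred))
    return scores


# Synthetic stand-in data (hypothetical): one descriptor, noisy linear response.
data_rng = random.Random(42)
x = [data_rng.uniform(0, 10) for _ in range(200)]
y = [3.0 * a + 5.0 + data_rng.gauss(0, 1.0) for a in x]

scores = monte_carlo_cv(x, y)
print(f"mean held-out R^2 over {len(scores)} splits: {mean(scores):.3f}")
```

The spread of `scores` across splits, not just their mean, is what gives a rough picture of model stability: a model whose held-out R² varies widely between random partitions is sensitive to which samples it was trained on.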