Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression

Cited by: 200
Authors
Teofilo, Reinaldo F. [1]
Martins, Joao Paulo A. [1]
Ferreira, Marcia M. C. [1]
Affiliations
[1] Univ Estadual Campinas, Inst Quim, BR-13084971 Sao Paulo, Brazil
Keywords
variable selection; informative vectors; OPS; partial least squares; chemometrics; LEAST-SQUARES REGRESSION; INFRARED SPECTROSCOPIC DATA; HIV-1 PROTEASE INHIBITORS; WAVELENGTH SELECTION; CALIBRATION; PLS; PREDICTION; SPECTRA; MODELS; ERROR;
DOI
10.1002/cem.1192
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
A new procedure is presented that enhances the prediction ability of multivariate calibration models while using a small number of interpretable variables. The core of the methodology is to sort the variables according to an informative vector and then to investigate partial least squares (PLS) regression models systematically, with the aim of finding the most relevant set of variables by comparing the cross-validation parameters of the models obtained. In this work, seven main informative vectors, i.e. the regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc) and signal-to-noise ratios vector (StN), together with their combinations, were automated and tested for the main purpose of feature selection. Six data sets from different sources were employed to validate this methodology. They originated from near-infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure-activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, in most of the tested data sets, the regression and NAS informative vectors from PLS regression, both built using more latent variables than were used to build the model, were the best informative vectors for variable selection. In all the applications, the selected variables were quite effective and useful for interpretation. Copyright (C) 2008 John Wiley & Sons, Ltd.
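The sort-then-search idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the regression vector as the informative vector, a plain NIPALS PLS1 routine written with NumPy, interleaved k-fold cross-validation, and simulated data invented for the example. The names `pls1_fit`, `rmsecv` and `ops_select`, and the parameters `lv_vector`, `lv_model`, `window` and `increment`, are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """NIPALS PLS1 on mean-centred data; returns (b, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        norm = np.linalg.norm(w)
        if norm < 1e-12:          # nothing left to model
            break
        w = w / norm
        t = E @ w                 # scores
        tt = t @ t
        p = E.T @ t / tt          # X loadings
        c = f @ t / tt            # y loading
        E = E - np.outer(t, p)    # deflate X
        f = f - c * t             # deflate y
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # regression vector
    return b, x_mean, y_mean

def rmsecv(X, y, n_comp, k=5):
    """Root mean square error of k-fold cross-validation (interleaved folds)."""
    n, press = len(y), 0.0
    for fold in range(k):
        test = np.arange(n) % k == fold
        b, xm, ym = pls1_fit(X[~test], y[~test], n_comp)
        press += np.sum((y[test] - ((X[test] - xm) @ b + ym)) ** 2)
    return np.sqrt(press / n)

def ops_select(X, y, lv_vector=5, lv_model=2, window=5, increment=5):
    """Sort variables by |regression vector|, then search growing subsets."""
    b, _, _ = pls1_fit(X, y, lv_vector)   # informative vector (assumption:
    order = np.argsort(-np.abs(b))        # variables share a common scale)
    best_err, best_vars = np.inf, None
    for m in range(window, X.shape[1] + 1, increment):
        cols = order[:m]
        err = rmsecv(X[:, cols], y, min(lv_model, m))
        if err < best_err:
            best_err, best_vars = err, cols
    return np.sort(best_vars), best_err

# Simulated data (invented for illustration): only variables 3, 7 and 12 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 3] - 3 * X[:, 7] + 3 * X[:, 12] + 0.1 * rng.normal(size=100)

selected, best_err = ops_select(X, y)
full_err = rmsecv(X, y, 2)
print("selected variables:", selected)
print("RMSECV selected vs. full:", best_err, full_err)
```

Here the informative vector is built with more latent variables (`lv_vector`) than the evaluated models use (`lv_model`), mirroring the observation in the abstract. Other informative vectors, such as the correlation vector or VIP, could replace `np.abs(b)` in the sorting step.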
Pages: 32-48
Page count: 17