Simultaneous variable selection and outlier detection using a robust genetic algorithm

Citations: 40
Authors
Wiegand, Patrick [1]
Pell, Randy [2]
Comas, Enric [3]
Affiliations
[1] Kaiser Opt Syst Inc, Charleston, WV 25309 USA
[2] Dow Chem Co USA, Midland, MI 48667 USA
[3] Dow Chem Co USA, NL-4533 Terneuzen, Netherlands
Keywords
Variable selection; Inverse model; Genetic algorithm; Robust statistics; Outlier detection; Sample selection; NEAR-INFRARED SPECTROSCOPY; WAVELENGTH SELECTION; CHEMOMETRICS; OPTIMIZATION; PREDICTORS; ACCURACY; PLS
DOI
10.1016/j.chemolab.2009.05.001
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given a dataset in which all spectra are known to be representative and error-free, with matching accurate reference values, many tools exist to determine the best set of variables for constructing an inverse model, such as partial least squares (PLS). Likewise, given that the best variables are known a priori, many tools can be used to determine whether any samples are outliers, due either to inaccurate reference values or to invalid spectra. However, in many real-world situations, the reference values contain error and the spectra are imperfect. In this situation, it is not always possible to determine either the best subset of samples or the best subset of variables. This paper presents a new technique for combining a robust outlier determination method with a genetic algorithm optimized for spectral variable selection. No assumptions are made as to the optimum set of variables or as to the amount and structure of the errors present in either the predictor (X) or predictand (Y) variables. The technique is best suited for datasets which contain redundant information, i.e., datasets from designed experiments with no replicates may not produce optimum results, as the experimental design implicitly assumes there are no outlier data. (c) 2009 Elsevier B.V. All rights reserved.
Pages: 108-114
Page count: 7
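
The abstract's central idea, scoring candidate variable subsets with a criterion that is not distorted by outlying samples, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes ordinary least squares in place of PLS, least trimmed squares (via concentration steps) as the robust step, a simple per-variable penalty to discourage large subsets, and a median-based residual cutoff for flagging outliers. All function names, parameters, and the synthetic data below are hypothetical.

```python
import numpy as np

def lts_residuals(X, y, mask, trim=0.8, n_steps=5):
    """Least trimmed squares via concentration steps on the selected columns.

    Assumption: plain least squares stands in for the PLS model of the paper.
    Returns residuals for all samples and the trimmed mean squared residual.
    """
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    h = int(trim * len(y))
    keep = np.arange(len(y))                     # start from all samples
    for _ in range(n_steps):
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        resid = y - A @ coef
        keep = np.argsort(resid ** 2)[:h]        # retain the h best-fitting samples
    return resid, float(np.mean(resid[keep] ** 2))

def fitness(X, y, mask, penalty=0.01):
    """Robust fit error plus a hypothetical per-variable penalty (lower is better)."""
    if not mask.any():
        return np.inf
    _, trimmed_mse = lts_residuals(X, y, mask)
    return trimmed_mse + penalty * mask.sum()

def robust_ga(X, y, pop_size=40, n_gen=30, p_mut=0.1, seed=0):
    """Genetic search over boolean variable masks using the robust fitness."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5        # random initial masks
    for _ in range(n_gen):
        scores = np.array([fitness(X, y, m) for m in pop])
        children = [pop[np.argmin(scores)].copy()]   # elitism: keep the best mask
        while len(children) < pop_size:
            parents = []
            for _ in range(2):                   # binary tournament selection
                i, j = rng.integers(pop_size, size=2)
                parents.append(pop[i] if scores[i] < scores[j] else pop[j])
            child = np.where(rng.random(p) < 0.5, parents[0], parents[1])  # uniform crossover
            child ^= rng.random(p) < p_mut       # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(X, y, m) for m in pop])
    best = pop[np.argmin(scores)]
    resid, _ = lts_residuals(X, y, best)
    scale = 1.4826 * np.median(np.abs(resid))    # robust residual scale
    outliers = np.flatnonzero(np.abs(resid) > 3 * scale)
    return best, outliers

def make_demo_data(n=80, p=8, n_out=6, seed=1):
    """Synthetic data: only columns 0 and 1 are informative; the first n_out
    samples carry gross reference-value errors."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=n)
    y[:n_out] += 10.0
    return X, y
```

Calling `robust_ga(*make_demo_data())` returns a boolean mask of selected variables and the indices of samples flagged as outliers. Because the fitness is computed on the best-fitting 80% of samples, a few corrupted reference values do not steer the variable search; the trimming fraction, penalty, and cutoff are illustrative choices, not values from the paper.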