A new strategy to prevent over-fitting in partial least squares models based on model population analysis

Cited by: 68
Authors
Deng, Bai-Chuan [1 ,2 ]
Yun, Yong-Huan [2 ]
Liang, Yi-Zeng [2 ]
Cao, Dong-Sheng [3 ]
Xu, Qing-Song [4 ]
Yi, Lun-Zhao [5 ]
Huang, Xin [2 ]
Affiliations
[1] Univ Bergen, Dept Chem, N-5007 Bergen, Norway
[2] Cent South Univ, Sch Chem & Chem Engn, Changsha 410083, Hunan, Peoples R China
[3] Cent South Univ, Sch Pharmaceut Sci, Changsha 410083, Hunan, Peoples R China
[4] Cent South Univ, Sch Math & Stat, Changsha 410083, Hunan, Peoples R China
[5] Kunming Univ Sci & Technol, Yunnan Food Safety Res Inst, Kunming 650500, Peoples R China
Keywords
Partial least squares; Over-fitting; Model population analysis; Model selection; Model stability; Cross-validation; MULTIVARIATE CALIBRATION; VARIABLE SELECTION; OUTLIER DETECTION; CROSS-VALIDATION; PROPAGATION; PERSPECTIVE; REGRESSION; OPTIMIZES; DIMENSION; VARIANCE;
DOI
10.1016/j.aca.2015.04.045
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification code
070302 [Analytical Chemistry]
Abstract
Partial least squares (PLS) is one of the most widely used methods for chemical modeling. However, like many other methods with tunable parameters, it has a strong tendency to over-fit. A crucial step in PLS model building is therefore to select the optimal number of latent variables (nLVs). Cross-validation (CV) is the most popular method for PLS model selection because it selects a model from the perspective of prediction ability. However, CV may not yield a clear minimum of the prediction error, which makes model selection difficult. To solve this problem, we propose a new strategy for PLS model selection that combines the cross-validated coefficient of determination ($Q_{cv}^2$) and model stability (S). S is defined as the stability of the PLS regression vectors, which is obtained using model population analysis (MPA). The results show that, when a clear maximum of $Q_{cv}^2$ is not obtained, S provides additional information about over-fitting and helps in finding the optimal nLVs. Compared with other regression-vector-based indicators such as the Euclidean 2-norm (B2), the Durbin-Watson statistic (DW) and the jaggedness (J), S is more sensitive to over-fitting. The model selected by our method has both good prediction ability and stability. © 2015 Elsevier B.V. All rights reserved.
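The two quantities named in the abstract can be illustrated with a minimal sketch. The cross-validated coefficient of determination is written in its usual form, 1 - PRESS/TSS. The abstract does not give the formula for S, so the `stability` function below is an assumption made only for illustration: it scores a population of regression vectors (for example, one per random sub-sample, as in a model population analysis) by the average ratio of absolute mean to standard deviation of each coefficient. The function names and the input layout are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def q2_cv(y_true, y_cv_pred):
    """Cross-validated coefficient of determination: 1 - PRESS / TSS.

    y_cv_pred holds the prediction for each sample made by a model
    that did not see that sample during fitting.
    """
    y_bar = mean(y_true)
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv_pred))
    tss = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - press / tss


def stability(reg_vectors):
    """ASSUMED stability score; not the paper's definition of S.

    reg_vectors: list of regression vectors, one per sub-model,
    all fitted with the same number of latent variables.
    Each coefficient contributes |mean| / std across sub-models;
    the score is the average of these ratios (higher = more stable).
    """
    ratios = []
    for coefs in zip(*reg_vectors):  # iterate coefficient by coefficient
        sd = pstdev(coefs)
        ratios.append(abs(mean(coefs)) / sd if sd > 0 else float("inf"))
    return mean(ratios)


if __name__ == "__main__":
    # Toy numbers only, chosen to exercise the functions.
    print(q2_cv([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
    print(stability([[1.0, -4.0], [3.0, -2.0]]))
```

Under these assumptions, one would compute both values for each candidate nLVs and inspect them side by side; a drop in the stability score while $Q_{cv}^2$ stays flat would hint at over-fitting.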
Pages: 32-41
Page count: 10