Consequences of sample size, variable selection, and model validation and optimisation, for predicting classification ability from analytical data

Cited by: 127
Authors
Brereton, Richard G. [1]
Affiliations
[1] Univ Bristol, Sch Chem, Ctr Chemometr, Bristol BS2 8DF, Avon, England
Keywords
classification model; discriminant partial least squares; D-PLS; optimisation; percentage correctly classified; validation; %CC
DOI
10.1016/j.trac.2006.10.005
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification code
070302; 081704
Abstract
This article discusses the problems of validating classification models, especially for datasets in which the sample size is small and the number of variables is large. It describes the use of percentage correctly classified (%CC) as an indicator of the success of a classification model. For small datasets, %CC should not be used uncritically, and its interpretation depends on sample size. The article illustrates the use of a common classification method, discriminant partial least squares (D-PLS), on a randomly generated dataset of 200 samples and 200 variables. An aim of the classifier is to determine whether the null hypothesis (that there is no distinction between two classes) can be rejected. Autoprediction gives 84.5% CC. It is shown that, if there is variable selection, it must be performed independently on the training set to obtain a %CC close to 50% on the test set; otherwise, over-optimistic and false conclusions can be reached about the ability to classify samples into groups. Finally, two aims of determining the quality of a model are frequently confused, namely optimisation (often used to determine the most appropriate number of components in a model) and independent validation; to overcome this, the data should be split into three groups. There are often difficulties with model building if validation and optimisation have been done on different groups of samples, especially using iterative methods, with each group being modelled using different properties, such as a different number of components or different variables. (c) 2006 Elsevier Ltd. All rights reserved.
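The variable-selection pitfall described in the abstract can be illustrated with a minimal sketch. It is not the author's code: the NIPALS-style PLS1 routine, the choice of 2 components, the 20 variables retained, the mean-difference selection criterion, the 50/50 stratified split, and the 20 repeated splits are all assumptions made here for illustration. Only the 200 x 200 random dataset and the two-class null setting come from the abstract.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by NIPALS; returns (x_mean, y_mean, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # unit-length weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        q_k = yk @ t / tt               # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - q_k * t               # deflate y
        W.append(w); P.append(p); q.append(q_k)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression vector in original X space
    return x_mean, y_mean, b

def percent_cc(model, X, y):
    """Percentage correctly classified, with classes coded +1 / -1."""
    x_mean, y_mean, b = model
    y_hat = (X - x_mean) @ b + y_mean
    return 100.0 * np.mean(np.where(y_hat >= 0, 1.0, -1.0) == y)

def select_variables(X, y, k):
    """Hypothetical selection rule: keep the k largest class-mean differences."""
    diff = np.abs(X[y > 0].mean(axis=0) - X[y < 0].mean(axis=0))
    return np.argsort(diff)[-k:]

rng = np.random.default_rng(0)
n, p, k, a = 200, 200, 20, 2            # k and a are assumed values
X = rng.standard_normal((n, p))         # pure noise: no real class structure
y = np.repeat([1.0, -1.0], n // 2)      # arbitrary labels, 100 per class

# Autoprediction: the model is tested on the same samples used to build it.
auto_cc = percent_cc(pls1_fit(X, y, a), X, y)

leaky = select_variables(X, y, k)       # selection has seen the test samples
biased, unbiased = [], []
for _ in range(20):                     # repeated stratified 50/50 splits
    pos = rng.permutation(n // 2)
    neg = rng.permutation(n // 2) + n // 2
    train = np.concatenate([pos[:50], neg[:50]])
    test = np.concatenate([pos[50:], neg[50:]])
    model = pls1_fit(X[train][:, leaky], y[train], a)
    biased.append(percent_cc(model, X[test][:, leaky], y[test]))
    honest = select_variables(X[train], y[train], k)   # training set only
    model = pls1_fit(X[train][:, honest], y[train], a)
    unbiased.append(percent_cc(model, X[test][:, honest], y[test]))

biased_cc, unbiased_cc = float(np.mean(biased)), float(np.mean(unbiased))
print(f"autoprediction %CC:                   {auto_cc:.1f}")
print(f"test %CC, selection on all samples:   {biased_cc:.1f}")
print(f"test %CC, selection on training only: {unbiased_cc:.1f}")
```

Because the labels are arbitrary, a test %CC well above 50% here can only reflect information leaking from the test samples into the model. The sketch fixes the number of components in advance; choosing it from the data would require the third group of samples that the abstract recommends.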
Pages: 1103-1111
Page count: 9