Comparison of performance of five common classifiers represented as boundary methods: Euclidean Distance to Centroids, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Learning Vector Quantization and Support Vector Machines, as dependent on data structure

Cited by: 177
Authors
Dixon, Sarah J. [1 ]
Brereton, Richard G. [1 ]
Affiliation
[1] Univ Bristol, Sch Chem, Ctr Chemometr, Bristol BS8 1TS, Avon, England
Keywords
Classification; Boundary methods; Euclidean distance; Mahalanobis distance; Linear Discriminant Analysis; Quadratic Discriminant Analysis; Learning Vector Quantization; Support Vector Machines; PATTERN-RECOGNITION; CLASSIFICATION; SELECTION;
DOI
10.1016/j.chemolab.2008.07.010
Chinese Library Classification
TP [Automation and Computer Technology];
Discipline code
080201 [Mechanical Manufacturing and Automation];
Abstract
Five methods for discrimination are described, namely Euclidean Distance to Centroids (EDC), Linear Discriminant Analysis (LDA) (based on the Mahalanobis distance and pooled variance-covariance matrix), Quadratic Discriminant Analysis (QDA) (based on the Mahalanobis distance and individual class variance-covariance matrices - non-Bayesian form), Learning Vector Quantization (LVQ) and Support Vector Machines (SVMs) (using soft boundaries and Radial Basis Functions), and illustrated graphically as boundary methods. The performance of each method was determined using four synthetic datasets, each consisting of 200 samples split equally between two classes, and a further two synthetic datasets containing 400 samples, again equally split between the two classes. In datasets 1 to 3, five variables were distributed multinormally: in dataset 1 the classes are distributed roughly circularly but with a significant degree of overlap, in dataset 2 the distribution is in elongated hyperellipsoids with small overlap, and in dataset 3 there is a region of complete overlap between classes. In dataset 4, two variables are distributed in a crescent shape. In datasets 5 and 6, 100 variables were generated from multinormal populations, some of which were potential discriminators; however, a large proportion of the variables were designed to be uninformative. The methods were optimised using a training set and their performance evaluated using a test set; this was repeated 100 times for different test and training set splits. The average % correctly classified was computed for each class and model, as well as the model stability for each sample (the proportion of times the sample is classified into the same group over all 100 iterations). The conclusion is that the performance of the classifiers depends very much on the distribution of the data.
Approaches such as LVQ and SVMs that try to determine complex boundaries perform best when the data are not normally distributed, as in dataset 4, but can be prone to overfitting otherwise. QDA tends to perform best on multinormal data, although it can be influenced by non-discriminative variables that show a difference in variance. It is recommended to examine the data structure prior to model building to determine the optimal type of model. (c) 2008 Elsevier B.V. All rights reserved.
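The evaluation protocol summarised in the abstract (100 random train/test splits, average % correctly classified per class, and per-sample stability) can be sketched in a few lines. This is not the authors' code: it uses only the simplest of the five classifiers, Euclidean Distance to Centroids (EDC), and the class means, sample counts, and split sizes are illustrative assumptions.

```python
# Sketch of the repeated-split evaluation described in the abstract, with an
# EDC classifier on two multinormal classes (cf. datasets 1-3). All data
# parameters here are assumptions, not the paper's actual settings.
import numpy as np

rng = np.random.default_rng(0)

# Two classes, 100 samples each, five variables.
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(1.5, 1.0, (100, 5))])
y = np.repeat([0, 1], 100)

def edc_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the nearest centroid."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

n_iter = 100
votes = np.zeros((len(y), 2))   # class assignments per sample over all splits
correct = np.zeros(len(y))
counted = np.zeros(len(y))      # times each sample landed in the test set

for _ in range(n_iter):
    test = rng.choice(len(y), size=len(y) // 2, replace=False)
    train = np.setdiff1d(np.arange(len(y)), test)
    pred = edc_predict(X[train], y[train], X[test])
    votes[test, pred] += 1
    correct[test] += (pred == y[test])
    counted[test] += 1

# Stability: proportion of iterations a sample is assigned to its modal class.
stability = votes.max(axis=1) / np.maximum(counted, 1)
for c in (0, 1):
    mask = y == c
    acc = 100 * correct[mask].sum() / counted[mask].sum()
    print(f"class {c}: {acc:.1f}% correct, mean stability {stability[mask].mean():.2f}")
```

Swapping `edc_predict` for LDA, QDA, LVQ, or an RBF SVM (with the latter tuned on the training split) reproduces the paper's comparison structure.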
Pages: 1-17