Data complexity assessment in undersampled classification of high-dimensional biomedical data

Cited by: 27
Authors
Baumgartner, R [1]
Somorjai, RL [1]
Affiliations
[1] Natl Res Council Canada, Inst Biodiagnost, Winnipeg, MB R3B 1Y6, Canada
Keywords
classification; data complexity; regularization; undersampled biomedical problems
DOI
10.1016/j.patrec.2006.01.006
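Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligent science and technology]
Abstract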
Regularized linear classifiers have been successfully applied in undersampled, i.e., small sample size/high dimensionality, biomedical classification problems. Additionally, a design of data complexity measures was proposed in order to assess the competence of a classifier in a particular context. Our work was motivated by the analysis of ill-posed regression problems by Eldén and the interpretation of linear discriminant analysis as a mean square error classifier. Using Singular Value Decomposition analysis, we define a discriminatory power spectrum and show that it provides useful means of data complexity assessment for undersampled classification problems. In five real-life biomedical data sets of increasing difficulty we demonstrate how the data complexity of a classification problem can be related to the performance of regularized linear classifiers. We show that the concentration of the discriminatory power manifested in the discriminatory power spectrum is a deciding factor for the success of the regularized linear classifiers in undersampled classification problems. As a practical outcome of our work, the proposed data complexity assessment may facilitate the choice of a classifier for a given undersampled problem. © 2006 Elsevier B.V. All rights reserved.
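The abstract does not state how the discriminatory power spectrum is defined, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that the class labels are coded as a numeric vector, that both data and labels are mean-centered, and that the "power" of each singular direction is the fraction of the label vector's energy captured by the corresponding left singular vector. The function name `discriminatory_power_spectrum` and the tolerance `1e-10` are hypothetical choices made for illustration.

```python
import numpy as np


def discriminatory_power_spectrum(X, y):
    """Assumed definition (not taken from the paper): for each singular
    direction of the centered data matrix, return the fraction of the
    centered label vector's energy that lies along it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center features and labels, as in a least-squares view of LDA.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)

    # Drop numerically null directions (centering removes one rank).
    keep = s > s.max() * 1e-10
    U, s = U[:, keep], s[keep]

    # Squared projection of the labels onto each left singular vector,
    # normalized by the total label energy.
    power = (U.T @ yc) ** 2 / (yc @ yc)
    return s, power


if __name__ == "__main__":
    # Synthetic undersampled problem: 20 samples, 200 features.
    rng = np.random.default_rng(0)
    y = np.repeat([-1.0, 1.0], 10)
    X = rng.normal(size=(20, 200))
    X[:, :50] += 3.0 * y[:, None]  # class signal in the first 50 features

    s, power = discriminatory_power_spectrum(X, y)
    for i in range(3):
        print(f"component {i}: singular value {s[i]:.2f}, power {power[i]:.3f}")
```

Under this reading, a spectrum concentrated in the leading components would correspond to the situation the abstract describes as favorable for regularized linear classifiers, while a flat spectrum spread over many small singular values would indicate a harder problem.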
Pages: 1383-1389
Page count: 7
Related papers
25 in total
[1]
Selection bias in gene extraction on the basis of microarray gene-expression data [J].
Ambroise, C ;
McLachlan, GJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (10) :6562-6566
[2]
[Anonymous], 1999, The Nature Statist. Learn. Theory
[3]
BAIR E, 2005, PREDICTION SUPERVIS
[4]
BENNETT K, 2003, NATO SCI SERIES, V2, P227
[5]
BJORCK A, 2004, ACTA NUMER, V13, P1
[6]
Duda R. O., 2000, PATTERN CLASSIFICATI
[7]
*EIG RES INC, 1998, PLS TOOLB
[8]
Partial least-squares vs. Lanczos bidiagonalization - I: analysis of a projection method for multiple regression [J].
Eldén, L .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2004, 46 (01) :11-31
[9]
Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring [J].
Golub, TR ;
Slonim, DK ;
Tamayo, P ;
Huard, C ;
Gaasenbeek, M ;
Mesirov, JP ;
Coller, H ;
Loh, ML ;
Downing, JR ;
Caligiuri, MA ;
Bloomfield, CD ;
Lander, ES .
SCIENCE, 1999, 286 (5439) :531-537
[10]
Gene selection for cancer classification using support vector machines [J].
Guyon, I ;
Weston, J ;
Barnhill, S ;
Vapnik, V .
MACHINE LEARNING, 2002, 46 (1-3) :389-422