Measurement of Data Complexity for Classification Problems with Unbalanced Data

Cited by: 46
Authors
Anwar, Nafees [1]
Jones, Geoff [1]
Ganesh, Siva [1]
Affiliations
[1] Massey Univ, Inst Fundamental Sci Stat, Palmerston North 4412, Manawatu, New Zealand
Keywords
class imbalance; complexity measurement; nearest neighbors; Bayes error; PROBABILITY; ERROR;
DOI
10.1002/sam.11228
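Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract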
We introduce a complexity measure for classification problems that takes account of deterioration in classifier performance as a result of class imbalance. The measure is based on k-nearest neighbors. We explore the choices of k and the distance metric through a simulation study, and illustrate the use of our measure, and related data visualization techniques, with real datasets from the literature. © 2014 Wiley Periodicals, Inc.
Pages: 194-211
Number of pages: 18