An analysis of four missing data treatment methods for supervised learning

Cited by: 137
Authors
Batista, GEAPA [1]
Monard, MC [1]
Affiliation
[1] Univ Sao Paulo, Sao Carlos, SP, Brazil
DOI
10.1080/713827181
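Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 [Pattern Recognition and Intelligent Systems]; 0812 [Computer Science and Technology]; 0835 [Software Engineering]; 1405 [Intelligent Science and Technology];
Abstract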
One relevant problem in data quality is missing data. Despite the frequent occurrence and the relevance of the missing data problem, many machine learning algorithms handle missing data in a rather naive way. However, missing data must be treated carefully; otherwise, bias might be introduced into the induced knowledge. In this work, we analyze the use of the k-nearest neighbor algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform mean or mode imputation, a method widely used to treat missing values.
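The contrast between the two imputation strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_impute` and `knn_impute`, the use of `None` to mark a missing value, the Euclidean distance over shared observed columns, and the restriction to numeric attributes are all assumptions made for illustration.

```python
import math


def mean_impute(rows):
    """Replace each missing value (None) with the mean of its column's observed values.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Euclidean distance over columns observed in both rows (inf if none are shared)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def knn_impute(rows, k=2):
    """Replace each missing value with the mean of that column over the k nearest rows.

    Only rows where the target column is observed can act as donors.
    Assumes every column has at least one such donor row.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
            donors.sort(key=lambda r: _distance(row, r))
            nearest = donors[:k]
            imputed[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: two clusters, with the last row missing its third value.
rows = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 11.0],
    [5.0, 5.0, 50.0],
    [5.2, 4.8, 52.0],
    [1.05, 1.0, None],
]
print(mean_impute(rows)[4][2])       # column mean over all four observed values
print(knn_impute(rows, k=2)[4][2])   # mean over the two closest rows only
```

In this toy data the incomplete row sits in the first cluster, so the neighbor-based estimate stays close to that cluster's values, while the column mean is pulled toward the second cluster. Because both functions return a completed copy of the data, either result can be passed to any learning algorithm, which is the independence property the abstract describes.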
Pages: 519-533
Page count: 15