Incorporating domain knowledge into data mining classifiers: An application in indirect lending

Cited by: 76
Authors
Sinha, Atish R. [1]
Zhao, Huimin [1]
Affiliations
[1] Univ Wisconsin, Sheldon B Lubar Sch Business, Milwaukee, WI 53201 USA
Keywords
Data mining; Classification; Supervised learning; Domain knowledge; Expert system
DOI
10.1016/j.dss.2008.06.013
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data mining techniques have been applied to solve classification problems for a variety of applications such as credit scoring, bankruptcy prediction, insurance underwriting, and management fraud detection. In many of those application domains, there exist human experts whose knowledge could have a bearing on the effectiveness of the classification decision. The lack of research in combining data mining techniques with domain knowledge has prompted researchers to identify the fusion of data mining and knowledge-based expert systems as an important future direction. In this paper, we compare the performance of seven data mining classification methods (naive Bayes, logistic regression, decision tree, decision table, neural network, k-nearest neighbor, and support vector machine) with and without incorporating domain knowledge. The application we focus on is in the domain of indirect bank lending. An expert system capturing a lending expert's knowledge of rating a borrower's credit is used in combination with data mining to study whether the incorporation of domain knowledge improves classification performance. We use two performance measures: misclassification cost and AUC (area under the curve). A 2 × 7 factorial, repeated-measures ANOVA, with the two factors being domain knowledge (present or absent) and data mining method (seven methods), as well as a special statistical test for comparing AUCs, is used for analyzing the results. Analysis of the results reveals that incorporation of domain knowledge significantly improves classification performance with respect to both misclassification cost and AUC. There is an interaction between classification method and domain knowledge: incorporation of domain knowledge has a higher influence on performance for some methods than for others. Both measures, misclassification cost and AUC, yield similar results, indicating that the findings of the study are robust. © 2008 Elsevier B.V. All rights reserved.
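The two performance measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the cost values (a false negative assumed to cost five times a false positive), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the paper.

```python
def misclassification_cost(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Average cost per case. The cost values are illustrative assumptions."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 1:
            total += cost_fp  # false positive
        elif truth == 1 and pred == 0:
            total += cost_fn  # false negative
    return total / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive case receives
    the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 marks the positive class.
y_true = [1, 0, 1, 0, 1, 0]
scores_a = [0.6, 0.7, 0.4, 0.3, 0.8, 0.2]  # scores from a hypothetical classifier A
scores_b = [0.7, 0.4, 0.6, 0.3, 0.9, 0.2]  # scores from a hypothetical classifier B

# Turn scores into class predictions with an assumed threshold of 0.5.
preds_a = [1 if s >= 0.5 else 0 for s in scores_a]
preds_b = [1 if s >= 0.5 else 0 for s in scores_b]

print(misclassification_cost(y_true, preds_a), auc(y_true, scores_a))
print(misclassification_cost(y_true, preds_b), auc(y_true, scores_b))
```

Lower misclassification cost and higher AUC both indicate better classification. The cost measure depends on the chosen threshold and cost ratio, while AUC depends only on how the scores rank the cases, which is one reason a study may report both.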
Pages: 287-299
Number of pages: 13