Using sensitivity analysis and visualization techniques to open black box data mining models

Cited by: 302
Authors
Cortez, Paulo [1 ]
Embrechts, Mark J. [2 ]
Affiliations
[1] Univ Minho, Dept Sistemas Informacao, Ctr Algoritmi, P-4800058 Guimaraes, Portugal
[2] Rensselaer Polytech Inst, Dept Ind & Syst Engn, Troy, NY 12180 USA
Keywords
Sensitivity analysis; Visualization; Input importance; Supervised data mining; Regression; Classification;
DOI
10.1016/j.ins.2012.10.039
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we propose a new visualization approach based on a Sensitivity Analysis (SA) to extract human understandable knowledge from supervised learning black box data mining models, such as Neural Networks (NNs), Support Vector Machines (SVMs) and ensembles, including Random Forests (RFs). Five SA methods (three of which are purely new) and four measures of input importance (one novel) are presented. Also, the SA approach is adapted to handle discrete variables and to aggregate multiple sensitivity responses. Moreover, several visualizations for the SA results are introduced, such as input pair importance color matrix and variable effect characteristic surface. A wide range of experiments was performed in order to test the SA methods and measures by fitting four well-known models (NN, SVM, RF and decision trees) to synthetic datasets (five regression and five classification tasks). In addition, the visualization capabilities of the SA are demonstrated using four real-world datasets (e.g., bank direct marketing and white wine quality). (C) 2012 Elsevier Inc. All rights reserved.
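As a rough illustration of the general idea behind a sensitivity analysis of a fitted model, the sketch below varies one input at a time over a few evenly spaced levels, holds the remaining inputs at a baseline, and scores each input by the range of the model's responses. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sensitivity_1d`, the mean baseline, the default of seven levels, the range-based score, and the toy model and data are all hypothetical choices made here for illustration.

```python
from statistics import mean


def sensitivity_1d(predict, rows, levels=7):
    """Illustrative one-at-a-time sensitivity analysis (hypothetical helper).

    predict: callable taking one input vector (a list) and returning a number.
    rows:    list of input vectors used to derive per-input ranges and the baseline.
    levels:  number of evenly spaced values tried per input (assumed default).
    Returns relative importances that sum to 1 (or all zeros if nothing varies).
    """
    n_inputs = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_inputs)]
    baseline = [mean(col) for col in columns]  # assumption: mean as the baseline

    ranges = []
    for j, col in enumerate(columns):
        lo, hi = min(col), max(col)
        step = (hi - lo) / (levels - 1)
        responses = []
        for k in range(levels):
            probe = list(baseline)      # all other inputs stay at the baseline
            probe[j] = lo + k * step    # only input j changes
            responses.append(predict(probe))
        # Range of the responses as a simple sensitivity measure
        ranges.append(max(responses) - min(responses))

    total = sum(ranges)
    return [r / total for r in ranges] if total > 0 else ranges


def toy_model(x):
    """Stand-in for a black box model; it ignores the third input."""
    return 3.0 * x[0] + 0.5 * x[1] ** 2


# Deterministic toy data: inputs 0 and 1 span [0, 1], input 2 takes values 0, 1, 2.
rows = [[i / 10, j / 10, (i + j) % 3] for i in range(11) for j in range(11)]
importance = sensitivity_1d(toy_model, rows)
```

In this toy setup the third input receives an importance of zero because the stand-in model never reads it, while the first input dominates. Any fitted model that exposes a prediction function could be passed as `predict` in the same way.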
Pages: 1-17
Number of pages: 17