Benchmarking classification models for software defect prediction: A proposed framework and novel findings

Cited by: 767
Authors
Lessmann, Stefan [1 ]
Baesens, Bart [2 ]
Mues, Christophe [3 ]
Pietsch, Swantje [1 ]
Affiliations
[1] Univ Hamburg, Inst Informat Syst, D-20146 Hamburg, Germany
[2] Katholieke Univ Leuven, Dept Appl Econ Sci, B-3000 Louvain, Belgium
[3] Univ Southampton, Sch Management, Southampton SO17 1BJ, Hants, England
Keywords
complexity measures; data mining; formal methods; statistical methods; software defect prediction;
DOI
10.1109/TSE.2008.35
Chinese Library Classification (CLC)
TP31 [Computer software];
Subject classification codes
081202; 0835;
Abstract
Software defect prediction strives to improve software quality and testing efficiency by constructing predictive classification models from code attributes to enable a timely identification of fault-prone modules. Several classification models have been evaluated for this task. However, due to inconsistent findings regarding the superiority of one classifier over another and the usefulness of metric-based classification in general, more research is needed to improve convergence across studies and further advance confidence in experimental results. We consider three potential sources for bias: comparing classifiers over one or a small number of proprietary data sets, relying on accuracy indicators that are conceptually inappropriate for software defect prediction and cross-study comparisons, and, finally, limited use of statistical testing procedures to secure empirical findings. To remedy these problems, a framework for comparative software defect prediction experiments is proposed and applied in a large-scale empirical comparison of 22 classifiers over 10 public domain data sets from the NASA Metrics Data repository. Overall, an appealing degree of predictive accuracy is observed, which supports the view that metric-based classification is useful. However, our results indicate that the importance of the particular classification algorithm may be less than previously assumed since no significant performance differences could be detected among the top 17 classifiers.
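The comparison procedure the abstract outlines can be sketched in code: evaluate several classifiers across multiple data sets with a threshold-independent accuracy indicator (AUC) and then apply a statistical test over the per-data-set results. This is a minimal illustration, not the paper's exact setup; the synthetic data sets stand in for the NASA Metrics Data repository, and the three scikit-learn classifiers are arbitrary placeholders for the 22 learners benchmarked in the study.

```python
# Hedged sketch: benchmark classifiers over several data sets by AUC,
# then run a Friedman test for significant performance differences.
# Data sets and classifiers are illustrative stand-ins only.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Ten imbalanced synthetic "projects" standing in for public data sets.
datasets = [
    make_classification(n_samples=300, n_features=10,
                        weights=[0.85, 0.15], random_state=s)
    for s in range(10)
]

# Mean cross-validated AUC of each classifier on each data set
# (rows: data sets, columns: classifiers).
auc = np.array([
    [cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
     for clf in classifiers.values()]
    for X, y in datasets
])

# Friedman test over the per-data-set AUC values: a small p-value
# would indicate that at least one classifier performs differently.
stat, p = friedmanchisquare(*auc.T)
print(auc.shape, round(p, 3))
```

The AUC is used here because, as the abstract argues, threshold-dependent indicators such as classification accuracy are conceptually inappropriate for the imbalanced class distributions typical of defect data.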
Pages: 485-496
Number of pages: 12