Classifying software changes: Clean or buggy?

Cited by: 425
Authors
Kim, Sunghun [1]
Whitehead, E. James, Jr. [2]
Zhang, Yi [2]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
Funding
National Science Foundation (USA)
Keywords
maintenance; software metrics; software fault diagnosis; configuration management; classification; association rules; data mining; machine learning
DOI
10.1109/TSE.2007.70773
Chinese Library Classification number
TP31 [Computer software]
Subject classification code
081202; 0835
Abstract
This paper introduces a new technique for predicting latent software bugs, called change classification. Change classification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes or clean changes. In this manner, change classification predicts the existence of bugs in software changes. The classifier is trained using features (in the machine learning sense) extracted from the revision history of a software project stored in its software configuration management repository. The trained classifier can classify changes as buggy or clean, with a 78 percent accuracy and a 60 percent buggy change recall on average. Change classification has several desirable qualities: 1) the prediction granularity is small (a change to a single file), 2) predictions do not require semantic information about the source code, 3) the technique works for a broad array of project types and programming languages, and 4) predictions can be made immediately upon the completion of a change. Contributions of this paper include a description of the change classification approach, techniques for extracting features from the source code and change histories, a characterization of the performance of change classification across 12 open source projects, and an evaluation of the predictive power of different groups of features.
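The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a classifier or a feature set, so this example assumes a bag-of-words representation of each change and uses a small naive Bayes classifier as a stand-in. The names `tokenize`, `ChangeClassifier`, `evaluate`, and the toy data in `HISTORY` and `HELD_OUT` are all hypothetical. The two reported measures, accuracy and buggy change recall, are computed in `evaluate`.

```python
import math
import re
from collections import Counter

LABELS = ("buggy", "clean")


def tokenize(change_text):
    """Split change text into lowercase word tokens (no semantic analysis)."""
    return re.findall(r"[a-z_]\w*", change_text.lower())


class ChangeClassifier:
    """Stand-in classifier (assumption): naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = {label: Counter() for label in LABELS}
        self.change_counts = Counter()
        self.vocabulary = set()

    def train(self, labelled_changes):
        """Learn token frequencies from (change_text, label) pairs."""
        for text, label in labelled_changes:
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.change_counts[label] += 1
            self.vocabulary.update(tokens)

    def classify(self, change_text):
        """Return the label whose past changes the new change most resembles."""
        tokens = tokenize(change_text)
        total_changes = sum(self.change_counts.values())
        scores = {}
        for label in LABELS:
            # Smoothed log prior for the label.
            score = math.log(
                (self.change_counts[label] + 1) / (total_changes + len(LABELS))
            )
            # The extra +1 reserves probability mass for unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary) + 1
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            scores[label] = score
        return max(LABELS, key=lambda label: scores[label])


def evaluate(classifier, labelled_changes):
    """Return (accuracy, buggy change recall) over held-out labelled changes."""
    correct = buggy_total = buggy_found = 0
    for text, label in labelled_changes:
        predicted = classifier.classify(text)
        correct += predicted == label
        if label == "buggy":
            buggy_total += 1
            buggy_found += predicted == "buggy"
    accuracy = correct / len(labelled_changes)
    recall = buggy_found / buggy_total if buggy_total else 0.0
    return accuracy, recall


# Toy revision history (hypothetical data, one entry per single-file change).
HISTORY = [
    ("fix null pointer in parser strcpy buf", "buggy"),
    ("strcpy buf overflow parser hack", "buggy"),
    ("add unit test for parser", "clean"),
    ("update docs and unit test", "clean"),
]
HELD_OUT = [
    ("strcpy buf in lexer", "buggy"),
    ("add unit test", "clean"),
]

classifier = ChangeClassifier()
classifier.train(HISTORY)
accuracy, buggy_recall = evaluate(classifier, HELD_OUT)
print(f"accuracy={accuracy:.2f} buggy_recall={buggy_recall:.2f}")
```

Because the sketch only looks at tokens, it needs no semantic information about the source code, and a new change can be classified as soon as it is complete. A real setup would draw its features from the project's configuration management repository rather than from hand-written strings.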
Pages: 181-196
Page count: 16