Fair and Balanced? Bias in Bug-Fix Datasets

Cited by: 208
Authors
Bird, Christian [1]
Bachmann, Adrian
Aune, Eirik [1]
Duffy, John [1]
Bernstein, Abraham
Filkov, Vladimir [1]
Devanbu, Premkumar [1]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
Source
7TH JOINT MEETING OF THE EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND THE ACM SIGSOFT SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING | 2009
Keywords
SAMPLE SELECTION BIAS; METHODOLOGY
DOI
10.1145/1595696.1595716
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems and code version histories record when, how, and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
Pages: 121-130
Page count: 10