STATLOG - COMPARISON OF CLASSIFICATION ALGORITHMS ON LARGE REAL-WORLD PROBLEMS

Cited by: 163
Authors
KING, RD
FENG, C
SUTHERLAND, A
Affiliations
[1] UNIV STRATHCLYDE, DEPT STAT, GLASGOW G1 1XW, LANARK, SCOTLAND
[2] TURING INST LTD, GLASGOW, LANARK, SCOTLAND
DOI
10.1080/08839519508945477
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper describes work in the StatLog project comparing classification algorithms on large real-world problems. The algorithms compared were from symbolic learning (CART, C4.5, NewID, AC2, ITrule, Cal5, CN2), statistics (Naive Bayes, k-nearest neighbor, kernel density, linear discriminant, quadratic discriminant, logistic regression, projection pursuit, Bayesian networks), and neural networks (backpropagation, radial basis functions). Twelve data sets were used: five from image analysis, three from medicine, and two each from engineering and finance. We found that which algorithm performed best depended critically on the data set investigated. We therefore developed a set of data set descriptors to help decide which algorithms are suited to particular data sets. For example, data sets with extreme distributions (skew > 1 and kurtosis > 7) and with many binary/categorical attributes (> 38%) tend to favor symbolic learning algorithms. We suggest how classification algorithms can be extended in a number of directions.
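The descriptor rule quoted in the abstract can be illustrated with a minimal Python sketch. It is not the StatLog implementation. The function names (`skewness`, `kurtosis`, `describe`, `favors_symbolic`) are hypothetical, and the sketch assumes that skew and kurtosis are averaged over the numeric attributes, that skew is taken in absolute value, and that kurtosis is the non-excess (Pearson) form for which a normal distribution scores 3. The abstract does not state any of these details.

```python
from statistics import fmean


def skewness(xs):
    """Third standardized moment (population form). Returns 0.0 for a constant column (assumption)."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def kurtosis(xs):
    """Fourth standardized moment, non-excess form (assumption). Returns 0.0 for a constant column."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m4 = fmean([(x - m) ** 4 for x in xs])
    return m4 / m2 ** 2 if m2 > 0 else 0.0


def describe(numeric_columns, n_categorical):
    """Summarize a data set; expects at least one numeric column (hypothetical helper)."""
    n_total = len(numeric_columns) + n_categorical
    return {
        "mean_abs_skew": fmean([abs(skewness(c)) for c in numeric_columns]),
        "mean_kurtosis": fmean([kurtosis(c) for c in numeric_columns]),
        "categorical_fraction": n_categorical / n_total,
    }


def favors_symbolic(d):
    """Apply the thresholds quoted in the abstract: skew > 1, kurtosis > 7, > 38% categorical."""
    return (
        d["mean_abs_skew"] > 1
        and d["mean_kurtosis"] > 7
        and d["categorical_fraction"] > 0.38
    )


# One heavy-tailed numeric attribute plus one categorical attribute.
heavy_tailed = [0] * 19 + [100]
print(favors_symbolic(describe([heavy_tailed], n_categorical=1)))
```

The abstract says such data sets "tend to favor" symbolic learners, so a result like this is at most a screening hint and does not replace comparing the algorithms on the data.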
Pages: 289-333
Page count: 45