Optimal techniques for class-dependent attribute discretization

Cited by: 2
Authors
Bryson, N
Joseph, A
Affiliations
[1] Virginia Commonwealth Univ, Sch Business, Richmond, VA 23284 USA
[2] Univ Miami, Coral Gables, FL 33124 USA
Keywords
data mining; attribute discretization; decision trees; machine learning; entropy; parametric linear programming
DOI
10.1057/palgrave.jors.2601174
Chinese Library Classification (CLC) number
C93 [Management Science]
Subject classification codes
12; 1201; 1202; 120202
Abstract
Preprocessing of raw data has been shown to improve performance of knowledge discovery processes. Discretization of quantitative attributes is a key component of preprocessing and has the potential to greatly impact the efficiency of the process and the quality of its outcomes. In attribute discretization, the value domain of an attribute is partitioned into a finite set of intervals so that the attribute can be described using a small number of discrete representations. Discretization therefore involves two decisions, on the number of intervals and the placement of interval boundaries. Previous approaches for quantitative attribute discretization have used heuristic algorithms to identify partitions of the attribute value domain. Therefore, these approaches cannot be guaranteed to provide the optimal solution for the given discretization criterion and number of intervals. In this paper, we use linear programming (LP) methods to formulate the attribute discretization problem. The LP formulation allows the discretization criterion and the number of intervals to be integral considerations of the problem. We conduct experiments and identify optimal solutions for various discretization criteria and numbers of intervals.
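As an illustration of the two decisions named in the abstract (the number of intervals and the placement of their boundaries), the following is a minimal sketch of class-dependent discretization that is exact for a fixed number of intervals. It is not the paper's parametric LP formulation: it substitutes a simple dynamic program, and it assumes size-weighted class entropy as the discretization criterion. The names `interval_cost` and `optimal_cuts` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def interval_cost(labels):
    """Size-weighted class entropy of one interval: n * H(class | interval)."""
    n = len(labels)
    return sum(c * log2(n / c) for c in Counter(labels).values())


def optimal_cuts(values, labels, k):
    """Split the value domain into exactly k intervals with minimum total
    size-weighted class entropy. Returns (sorted cut points, total cost).

    Assumption: exact search by dynamic programming, not the LP method
    described in the abstract.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    n = len(xs)
    # An interval may only end where the attribute value changes (or at n).
    ends = [i for i in range(1, n) if xs[i - 1] != xs[i]] + [n]
    if k < 1 or k > len(ends):
        raise ValueError("k must be between 1 and the number of distinct values")

    inf = float("inf")
    # best[j][e] = minimum cost of covering ys[:e] with exactly j intervals
    best = [{e: inf for e in ends} for _ in range(k + 1)]
    back = [{} for _ in range(k + 1)]
    for e in ends:
        best[1][e] = interval_cost(ys[:e])
    for j in range(2, k + 1):
        for e in ends:
            for s in ends:
                if s >= e:
                    break
                cost = best[j - 1][s] + interval_cost(ys[s:e])
                if cost < best[j][e]:
                    best[j][e] = cost
                    back[j][e] = s

    # Walk back from the full range to recover the chosen boundaries,
    # placing each cut midway between the two neighbouring values.
    cuts, e = [], n
    for j in range(k, 1, -1):
        s = back[j][e]
        cuts.append((xs[s - 1] + xs[s]) / 2)
        e = s
    return sorted(cuts), best[k][n]
```

For example, `optimal_cuts([1, 2, 3, 10, 11, 12], list("aaabbb"), 2)` places a single cut between 3 and 10, since that split leaves both intervals class-pure. A heuristic that fixes boundaries greedily offers no such guarantee for a given criterion and interval count, which is the gap the abstract says the LP formulation addresses.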
Pages: 1130-1143
Page count: 14