PUBLIC: A decision tree classifier that integrates building and pruning

被引:160
作者
Rastogi, R
Shim, K
机构
[1] Bell Labs, Murray Hill, NJ 07974 USA
[2] Korea Adv Inst Sci & Technol, Taejon 305701, South Korea
[3] Adv Informat Technol Res Ctr, Taejon 305701, South Korea
关键词
data mining; classification; decision tree;
D O I
10.1023/A:1009887311454
中图分类号
TP18 [人工智能理论];
学科分类号
081104 [模式识别与智能系统]; 0812 [计算机科学与技术]; 0835 [软件工程]; 1405 [智能科学与技术];
摘要
Classification is an important problem in data mining. Given a database of records, each with a class label, a classifier generates a concise and meaningful description for each class that can be used to classify subsequent records. A number of popular classifiers construct decision trees to generate class models. These classifiers first build a decision tree and then prune subtrees from the decision tree in a subsequent pruning phase to improve accuracy and prevent "overfitting". Generating the decision tree in two distinct phases could result in a substantial amount of wasted effort since an entire subtree constructed in the first phase may later be pruned in the next phase. In this paper, we propose PUBLIC, an improved decision tree classifier that integrates the second "pruning" phase with the initial "building" phase. In PUBLIC, a node is not expanded during the building phase, if it is determined that it will be pruned during the subsequent pruning phase. In order to make this determination for a node, before it is expanded, PUBLIC computes a lower bound on the minimum cost subtree rooted at the node. This estimate is then used by PUBLIC to identify the nodes that are certain to be pruned, and for such nodes, not expend effort on splitting them. Experimental results with real-life as well as synthetic data sets demonstrate the effectiveness of PUBLIC's integrated approach which has the ability to deliver substantial performance improvements.
引用
收藏
页码:315 / 344
页数:30
相关论文
共 25 条
[1]
AGRAWAL R, 1992, PROC INT CONF VERY L, P560
[2]
DATABASE MINING - A PERFORMANCE PERSPECTIVE [J].
AGRAWAL, R ;
IMIELINSKI, T ;
SWAMI, A .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1993, 5 (06) :914-925
[3]
[Anonymous], 1989, GENETIC ALGORITHM SE
[4]
[Anonymous], 1993, C4 5 PROGRAMS MACH L
[5]
[Anonymous], 1993, P 13 INT JOINT C ART
[6]
Bishop C. M., 1995, NEURAL NETWORKS PATT
[7]
Breiman L., 1984, BIOMETRICS, DOI DOI 10.2307/2530946
[8]
CHEESEMAN P, 1988, 5 INT C MACH LEARN J
[9]
FAYYAD U, 1991, THESIS U MICHIGAN
[10]
FUKUDA T, 1996, P INT C VER LARG DAT