Stability and scalability in decision trees

被引:17
作者
Aluja-Banet, T
Nafria, E
机构
[1] Univ Politecn Catalunya, Dept Stat & Operat Res, ES-08034 Barcelona, Spain
[2] Taylor Nelson Sofres Audiencia Medios, Barcelona 08190, Spain
关键词
segmentation tree; CART; data mining; stability; scalability;
D O I
10.1007/BF03354613
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 [统计学]; 070103 [概率论与数理统计]; 0714 [统计学];
摘要
Tree-based methods are statistical procedures for automatic learning from data, whose main applications are integrated into a data-mining environment for decision support systems. Here, we focus on two problems of decision trees: the stability of the rules obtained and their applicability to huge data sets. Since the tree-growing process is highly dependent on data, i.e. small fluctuations in data can cause big changes in the tree-growing process, we focused instead on the stability of the trees themselves. To this end we propose a series of data diagnostics to prevent internal instability in the tree-growing process before a particular split is made. Indeed, to be effective in actual managerial problems they must be applicable to massive amounts of stored data with maximum efficiency. For this reason we studied the theoretical complexity of such an algorithm. Finally, we present an algorithm that can cope with such problems, with linear cost upon the individuals, which can use a robust impurity measure as a splitting criterion.
引用
收藏
页码:505 / 520
页数:16
相关论文
共 22 条
[1]
ALUJA T, 1996, P COMP STAT COMPSTAT
[2]
ALUJA T, 1998, VISUALISING CATEGORI
[3]
ALUJA T, 1998, DATA SCI CLASSIFICAT
[4]
[Anonymous], 1964, DETECTION INTERACTIO
[5]
Arminger G, 1997, COMPUTATION STAT, V12, P293
[6]
Symbolic object description of strata by segmentation trees [J].
Bravo, MC ;
García-Santesmases, JM .
COMPUTATIONAL STATISTICS, 2000, 15 (01) :13-24
[7]
Technical note: Some properties of splitting criteria [J].
Breiman, L .
MACHINE LEARNING, 1996, 24 (01) :41-47
[8]
Bagging predictors [J].
Breiman, L .
MACHINE LEARNING, 1996, 24 (02) :123-140
[9]
Breiman L., 1984, BIOMETRICS, DOI DOI 10.2307/2530946
[10]
CELEUX G, 1982, REV STAT APPL, V30, P39