LOTUS: An algorithm for building accurate and comprehensible logistic regression trees

Cited by: 63
Authors
Chan, KY
Loh, WY
Affiliations
[1] Natl Univ Singapore, Dept Stat & Appl Probabil, Singapore 117546, Singapore
[2] Univ Wisconsin, Dept Stat, Madison, WI 53706 USA
Keywords
piecewise linear logistic regression; recursive partitioning; trend-adjusted chi-square test; unbiased variable selection;
DOI
10.1198/106186004X13064
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
Logistic regression is a powerful technique for fitting models to data with a binary response variable, but the models are difficult to interpret if collinearity, nonlinearity, or interactions are present. Besides, it is hard to judge model adequacy because there are few diagnostics for choosing variable transformations and no true goodness-of-fit test. To overcome these problems, this article proposes to fit a piecewise (multiple or simple) linear logistic regression model by recursively partitioning the data and fitting a different logistic regression in each partition. This allows nonlinear features of the data to be modeled without requiring variable transformations. The binary tree that results from the partitioning process is pruned to minimize a cross-validation estimate of the predicted deviance. This obviates the need for a formal goodness-of-fit test. The resulting model is especially easy to interpret if a simple linear logistic regression is fitted to each partition, because the tree structure and the set of graphs of the fitted functions in the partitions comprise a complete visual description of the model. Trend-adjusted chi-square tests are used to control bias in variable selection at the intermediate nodes. This protects the integrity of inferences drawn from the tree structure. The method is compared with standard stepwise logistic regression on 30 real datasets, with several containing tens to hundreds of thousands of observations. Averaged across the datasets, the results show that the method reduces predicted mean deviance by 9% to 16%. We use an example from the Dutch insurance industry to demonstrate how the method can identify and produce an intelligible profile of prospective customers.
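The core idea in the abstract, partitioning the data and fitting a separate simple linear logistic regression in each partition, can be illustrated with a minimal sketch. This is not the LOTUS algorithm: it makes only one split on a single predictor, chooses the cut by a greedy search over decile cutpoints that minimizes the summed deviance, and omits both the trend-adjusted chi-square variable selection and the cross-validation pruning described above. The function names, the small ridge term used for numerical stability, and the synthetic V-shaped data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Clamp the linear predictor so exp() cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_simple_logistic(xs, ys, iters=30, ridge=1e-3):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x.

    The ridge term is an assumption added for numerical stability;
    it is not part of the method described in the abstract.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0, g1 = -ridge * b0, -ridge * b1      # penalized gradient
        h00, h01, h11 = ridge, 0.0, ridge      # penalized negative Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def deviance(xs, ys, b0, b1):
    """Binomial deviance: -2 times the log-likelihood."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total


def best_single_split(xs, ys):
    """Greedy search over decile cutpoints (hypothetical helper).

    Fits one simple logistic regression on each side of every candidate
    cut and keeps the cut with the smallest summed deviance.
    """
    order = sorted(xs)
    candidates = [order[len(order) * k // 10] for k in range(1, 10)]
    best = None
    for cut in candidates:
        left = [(x, y) for x, y in zip(xs, ys) if x <= cut]
        right = [(x, y) for x, y in zip(xs, ys) if x > cut]
        lx, ly = zip(*left)
        rx, ry = zip(*right)
        lfit = fit_simple_logistic(lx, ly)
        rfit = fit_simple_logistic(rx, ry)
        dev = deviance(lx, ly, *lfit) + deviance(rx, ry, *rfit)
        if best is None or dev < best["deviance"]:
            best = {"cut": cut, "left": lfit, "right": rfit, "deviance": dev}
    return best


# Synthetic data with a V-shaped logit, which one linear logit cannot capture.
random.seed(0)
xs = [random.uniform(-3.0, 3.0) for _ in range(400)]
ys = [1 if random.random() < sigmoid(2.0 * abs(x) - 2.0) else 0 for x in xs]

single_dev = deviance(xs, ys, *fit_simple_logistic(xs, ys))
split = best_single_split(xs, ys)
piecewise_dev = split["deviance"]
print("single-model deviance:", round(single_dev, 1))
print("piecewise deviance:   ", round(piecewise_dev, 1), "cut at", round(split["cut"], 2))
```

On data like these, the two-piece fit should give a visibly lower deviance than the single model, which is the sense in which partitioning captures nonlinearity without variable transformations. A full tree would apply the split recursively and then prune.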
Pages: 826-852
Page count: 27