Cautionary remarks on the use of clusterwise regression

Cited by: 39
Authors
Brusco, Michael J. [1]
Cradit, J. Dennis [2]
Steinley, Douglas [3]
Fox, Gavin L.
Affiliations
[1] Florida State Univ, Coll Business, Dept Marketing, Tallahassee, FL 32306 USA
[2] So Illinois Univ, Carbondale, IL 62901 USA
[3] Univ Missouri, Columbia, MO 65211 USA
DOI
10.1080/00273170701836653
Chinese Library Classification (CLC) number
O1 [Mathematics];
Discipline classification code
0701; 070101;
Abstract
Clusterwise linear regression is a multivariate statistical procedure that attempts to cluster objects with the objective of minimizing the sum of the error sums of squares for the within-cluster regression models. In this article, we show that the minimization of this criterion makes no effort to distinguish the error explained by the within-cluster regression models from the error explained by the clustering process. In some cases, most of the variation in the response variable is explained by clustering the objects, with little additional benefit provided by the within-cluster regression models. Accordingly, there is tremendous potential for overfitting with clusterwise regression, which is demonstrated with numerical examples and simulation experiments. To guard against the misuse of clusterwise regression, we recommend a benchmarking procedure that compares the results for the observed empirical data with those obtained across a set of random permutations of the response measures. We also demonstrate the potential for overfitting via an empirical application related to the prediction of reflective judgment using high school and college performance measures.
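The permutation benchmark described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a single predictor, uses a simple reassign-and-refit heuristic with random restarts to approximate the clusterwise criterion, and all function names (`fit_line`, `clusterwise_sse`, `permutation_benchmark`) are hypothetical. The demo data are pure noise, chosen to show how much the criterion can drop from clustering alone.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # one point, or identical x values: fall back to the mean
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def refit(x, y, labels, lines):
    """Refit each cluster's line; an empty cluster keeps its previous line."""
    for c in range(len(lines)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if idx:
            lines[c] = fit_line([x[i] for i in idx], [y[i] for i in idx])


def sq_resid(xi, yi, line):
    """Squared residual of one observation from one cluster's line."""
    a, b = line
    return (yi - a - b * xi) ** 2


def clusterwise_sse(x, y, k, rng, restarts=5, max_iter=50):
    """Smallest total within-cluster SSE found over several random starts."""
    n = len(x)
    best = float("inf")
    for _ in range(restarts):
        labels = [rng.randrange(k) for _ in range(n)]
        lines = [(0.0, 0.0)] * k
        for _ in range(max_iter):
            refit(x, y, labels, lines)
            # Move each object to the cluster whose line fits it best.
            new = [min(range(k), key=lambda c: sq_resid(x[i], y[i], lines[c]))
                   for i in range(n)]
            if new == labels:
                break
            labels = new
        refit(x, y, labels, lines)
        sse = sum(sq_resid(x[i], y[i], lines[labels[i]]) for i in range(n))
        best = min(best, sse)
    return best


def permutation_benchmark(x, y, k, n_perm, rng):
    """Compare the observed criterion with values from permuted responses."""
    observed = clusterwise_sse(x, y, k, rng)
    permuted = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # breaks any real link between x and y
        permuted.append(clusterwise_sse(x, y_perm, k, rng))
    # Share of permuted data sets that fit at least as well as the real one.
    share = sum(s <= observed for s in permuted) / n_perm
    return observed, permuted, share


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0, 1) for _ in range(40)]
    y = [rng.gauss(0, 1) for _ in range(40)]  # noise: no true structure
    print("one regression SSE:", clusterwise_sse(x, y, 1, rng))
    print("two-cluster SSE:   ", clusterwise_sse(x, y, 2, rng))
    print("share of permutations fitting as well:",
          permutation_benchmark(x, y, 2, 20, rng)[2])
```

Under these assumptions, an observed criterion value that is not clearly smaller than the values from the permuted responses suggests that the apparent fit comes mostly from the clustering step rather than from the within-cluster regressions.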
Pages: 29-49
Page count: 21