Torturing data for the sake of generality: How valid are our regression models?

Cited by: 119
Authors
Olden, JD [1]
Jackson, DA [1]
Affiliations
[1] Univ Toronto, Dept Zool, Toronto, ON M5S 3G5, Canada
Source
ECOSCIENCE | 2000 / Vol. 7 / No. 4
Keywords
multiple regression analysis; variable selection; explanatory; predictive power; bias; bootstrap; jackknife; cross validation; Monte Carlo simulations
DOI
10.1080/11956860.2000.11682622
Chinese Library Classification number
Q14 [Ecology (Bioecology)]
Discipline classification code
071012; 0713
Abstract
Multiple regression analysis continues to be a quantitative tool used extensively in the ecological literature. Consequently, methods for model selection and validation are important considerations, yet ecologists appear to pay little attention to how the choice of method can potentially influence the outcome and interpretation of their results. In this study we review commonly employed model selection and validation methods and use a Monte Carlo simulation approach to evaluate their ability to accurately estimate variable inclusion in the final regression model and model prediction error. We found that all methods of model selection erroneously excluded or included variables in the final model and the error rate depended on sample size and the number of predictor variables. In general, forward selection, backward elimination and stepwise selection showed better performance with small sample sizes, whereas a modified bootstrap approach outperformed other methods with larger sample sizes. Model selection using all-subsets or exhaustive search was highly biased, at times never selecting the correct predictor variables. Methods for model validation were also highly biased, with resubstitution and data-splitting (i.e., dividing the data into training and test samples) techniques producing biased and variable estimates of model prediction error. In contrast, jackknife validation was generally unbiased. Using an empirical example we show that the interpretation of the ecological relationships between fish species richness and lake habitat is highly dependent on the type of model selection and validation method employed. The fact that model selection is frequently unsuited to determine correct ecological relationships, and that traditional approaches for model validation over-estimate the strength and value of our empirical models, is a major concern.
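The contrast the abstract draws between resubstitution and jackknife validation can be illustrated with a minimal sketch. It is not the authors' simulation design: the synthetic data, the sample size, the number of predictors, and the function names (`ols_fit`, `ols_predict`, `resubstitution_mse`, `jackknife_mse`) are all assumptions chosen for illustration. Resubstitution scores the model on the same observations used to fit it, while the jackknife (leave-one-out) refits the model n times and predicts each held-out observation.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares coefficients; an intercept column is prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def ols_predict(beta, X):
    """Predictions from coefficients returned by ols_fit."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta


def resubstitution_mse(X, y):
    """Mean squared error on the same data used to fit the model."""
    beta = ols_fit(X, y)
    return float(np.mean((y - ols_predict(beta, X)) ** 2))


def jackknife_mse(X, y):
    """Leave-one-out mean squared prediction error."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # drop observation i
        beta = ols_fit(X[keep], y[keep])      # refit without it
        errors[i] = y[i] - ols_predict(beta, X[i:i + 1])[0]
    return float(np.mean(errors ** 2))


# Synthetic data (an assumption): 30 observations, 5 candidate predictors,
# of which only the first actually influences the response.
rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

resub = resubstitution_mse(X, y)
jack = jackknife_mse(X, y)
print(f"resubstitution MSE: {resub:.3f}")
print(f"jackknife MSE:      {jack:.3f}")
```

For ordinary least squares the leave-one-out residual equals the fitted residual divided by one minus the observation's leverage, so the jackknife estimate can never fall below the resubstitution estimate. The gap between the two is the optimism that resubstitution hides.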
Pages: 501-510
Number of pages: 10