On measuring and correcting the effects of data mining and model selection

Cited by: 326
Authors
Ye, JM [1]
Affiliation
[1] Univ Chicago, Grad Sch Business, Chicago, IL 60637 USA
Keywords
data mining; degrees of freedom; effect of model selection; goodness-of-fit; half-normal plot; nonparametric regression; sensitivity
DOI
10.2307/2669609
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
In the theory of linear models, the concept of degrees of freedom plays an important role. This concept is often used for measurement of model complexity, for obtaining an unbiased estimate of the error variance, and for comparison of different models. I have developed a concept of generalized degrees of freedom (GDF) that is applicable to complex modeling procedures. The definition is based on the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value. The concept is nonasymptotic in nature and does not require analytic knowledge of the modeling procedures. The concept of GDF offers a unified framework under which complex and highly irregular modeling procedures can be analyzed in the same way as classical linear models. By using this framework, many difficult problems can be solved easily. For example, one can now measure the number of observations used in a variable selection process. Different modeling procedures, such as a tree-based regression and a projection pursuit regression, can be compared on the basis of their residual sums of squares and the GDF that they cost. I apply the proposed framework to measure the effect of variable selection in linear models, leading to corrections of selection bias in various goodness-of-fit statistics. The theory also has interesting implications for the effect of general model searching by a human modeler.
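The abstract defines GDF as the sum of the sensitivities of each fitted value to a perturbation in the corresponding observed value. A minimal sketch of how such a quantity might be estimated numerically is given below. The Monte Carlo perturbation scheme, the function names `estimate_gdf` and `ols_fit_predict`, and the parameters `tau` and `n_reps` are illustrative assumptions, not details taken from this record. The modeling procedure is treated as a black box, which matches the abstract's remark that no analytic knowledge of the procedure is required.

```python
import numpy as np

def estimate_gdf(fit_predict, y, tau=0.5, n_reps=200, seed=0):
    """Monte Carlo sketch of generalized degrees of freedom.

    fit_predict: callable mapping a response vector to fitted values
                 (the modeling procedure, treated as a black box).
    tau, n_reps: perturbation scale and repetition count (assumed values).
    """
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    # Small Gaussian perturbations of the observed responses.
    deltas = rng.normal(0.0, tau, size=(n_reps, n))
    fits = np.empty((n_reps, n))
    for t in range(n_reps):
        fits[t] = fit_predict(y + deltas[t])
    # Sensitivity of fitted value i to observation i: the least-squares
    # slope of fits[:, i] regressed on deltas[:, i] across repetitions.
    d_centered = deltas - deltas.mean(axis=0)
    f_centered = fits - fits.mean(axis=0)
    slopes = (d_centered * f_centered).sum(axis=0) / (d_centered ** 2).sum(axis=0)
    # GDF estimate: sum of the per-observation sensitivities.
    return slopes.sum()

def ols_fit_predict(X):
    """Ordinary least squares as an example modeling procedure."""
    def fit_predict(y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return X @ beta
    return fit_predict

# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

gdf_ols = estimate_gdf(ols_fit_predict(X), y)
print(f"Estimated GDF for OLS with {p} columns: {gdf_ols:.2f}")
```

For ordinary least squares with a full-rank design, the sensitivities sum to the trace of the hat matrix, so the estimate should land near the number of columns of `X`, up to Monte Carlo noise. The same `estimate_gdf` call could wrap a procedure that includes variable selection or a tree fit by passing a different `fit_predict` callable.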
Pages: 120-131
Number of pages: 12