Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure

Cited by: 220
Authors
Courvoisier, Delphine S. [1, 2]
Combescure, Christophe [1, 2]
Agoritsas, Thomas [1, 2]
Gayet-Ageron, Angele [1, 2]
Perneger, Thomas V. [1, 2]
Affiliations
[1] Univ Hosp Geneva, Div Clin Epidemiol, CH-1205 Geneva, Switzerland
[2] Univ Geneva, Fac Med, CH-1211 Geneva 4, Switzerland
Keywords
Model adequacy; Model building; Type I error; Power; Event per variable; Logistic regression; Assumptions; Simulation
DOI
10.1016/j.jclinepi.2010.11.012
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)]
Abstract
Objective: Logistic regression is commonly used in health research, and it is important to be sure that the parameter estimates can be trusted. A common problem occurs when the outcome has few events; in such a case, parameter estimates may be biased or unreliable. This study examined the relation between correctness of estimation and several data characteristics: number of events per variable (EPV), number of predictors, percentage of predictors that are highly correlated, percentage of predictors that were non-null, size of regression coefficients, and size of correlations.
Study Design: Simulation studies.
Results: In many situations, logistic regression modeling may pose substantial problems even if the number of EPV exceeds 10. Moreover, the number of EPV is not the only element that impacts on the correctness of parameter estimation. High regression coefficients and high correlations between the predictors may cause large problems in the estimation process. Finally, power is generally very low, even at 20 EPV.
Conclusion: There is no single rule based on EPV that would guarantee an accurate estimation of logistic regression parameters. Instead, the number of predictors, probable size of the regression coefficients based on previous literature, and correlations among the predictors must be taken into account as guidelines to determine the necessary sample size.
© 2011 Elsevier Inc. All rights reserved.
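For readers unfamiliar with the EPV quantity discussed in the abstract, the arithmetic can be sketched as follows. This is an illustrative sketch, not the authors' simulation code: the function names are hypothetical, and the convention of counting the less frequent outcome category as the "events" is an assumption that the abstract does not state.

```python
import math


def events_per_variable(n_events: int, n_nonevents: int, n_predictors: int) -> float:
    """Return the EPV for a binary outcome.

    Assumption: "events" means the less frequent of the two outcome
    categories; the abstract does not spell out this convention.
    """
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return min(n_events, n_nonevents) / n_predictors


def sample_size_for_epv(target_epv: float, n_predictors: int, event_rate: float) -> int:
    """Return the smallest sample size expected to reach target_epv.

    event_rate is the anticipated proportion of subjects with the event.
    """
    if not 0.0 < event_rate < 1.0:
        raise ValueError("event_rate must be strictly between 0 and 1")
    # The rarer outcome category limits the number of usable events.
    limiting_rate = min(event_rate, 1.0 - event_rate)
    return math.ceil(target_epv * n_predictors / limiting_rate)


# 40 events and 160 non-events with 5 candidate predictors.
print(events_per_variable(40, 160, 5))
# Subjects needed for 10 EPV with 5 predictors at a 25% event rate.
print(sample_size_for_epv(10, 5, 0.25))
```

As the abstract stresses, a figure obtained this way is at most a starting point: the number of predictors, the likely size of the regression coefficients, and the correlations among predictors also affect whether the estimates can be trusted.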
Pages: 993-1000
Number of pages: 8