Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR

Cited by: 287
Authors
Bondell, Howard D. [1]
Reich, Brian J. [1]
Affiliation
[1] North Carolina State University, Department of Statistics, Raleigh, NC 27695, USA
Keywords
correlation; penalization; predictive group; regression; shrinkage; supervised clustering; variable selection
DOI
10.1111/j.1541-0420.2007.00843.x
Chinese Library Classification
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters. In addition to improving prediction accuracy and interpretation, these resulting groups can then be investigated further to discover what contributes to the group having a similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to the existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
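The abstract does not write out the penalty itself. As a minimal sketch, assuming the form usually given for OSCAR (the L1 norm of the coefficients plus a weight c times the sum of pairwise maxima of their absolute values), the penalty and a Lagrangian-style penalized least-squares objective could be evaluated as follows. The function names `oscar_penalty` and `oscar_objective`, the tuning parameters `lam` and `c`, and the 1/2 scaling of the squared error are illustrative assumptions, not taken from this record.

```python
import numpy as np


def oscar_penalty(beta, c):
    """Assumed OSCAR penalty: sum_j |b_j| + c * sum_{j<k} max(|b_j|, |b_k|)."""
    a = np.sort(np.abs(np.asarray(beta, dtype=float)))  # ascending order
    # In sorted order, a[i] is the larger element in exactly i pairs,
    # so the pairwise-max sum collapses to a weighted sum with weights c*i + 1.
    weights = c * np.arange(a.size) + 1.0
    return float(weights @ a)


def oscar_objective(X, y, beta, lam, c):
    """Squared-error loss plus lam times the assumed OSCAR penalty."""
    beta = np.asarray(beta, dtype=float)
    resid = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ beta
    return 0.5 * float(resid @ resid) + lam * oscar_penalty(beta, c)
```

With c = 0 the penalty reduces to the lasso's L1 norm. A positive c adds the pairwise-maximum term, which is what favors exactly equal coefficient magnitudes and hence the predictive clusters described in the abstract.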
Pages: 115-123
Page count: 9
Related papers
14 in total
[1]   Finding predictive gene groups from microarray data [J].
Dettling, M ;
Bühlmann, P .
JOURNAL OF MULTIVARIATE ANALYSIS, 2004, 90 (01) :106-131
[2]   Least angle regression - Rejoinder [J].
Efron, B ;
Hastie, T ;
Johnstone, I ;
Tibshirani, R .
ANNALS OF STATISTICS, 2004, 32 (02) :494-499
[3]  
GILL PE, 2005, 051 NA U CAL DEP MAT
[4]  
HASTIE T, 2002, GENOME BIOL, V2
[5]   Simultaneous gene clustering and subset selection for sample classification via MDL [J].
Jörnsten, R ;
Yu, B .
BIOINFORMATICS, 2003, 19 (09) :1100-1109
[6]   A MULTIVARIATE EXPONENTIAL DISTRIBUTION [J].
MARSHALL, AW ;
OLKIN, I .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1967, 62 (317) :30-&
[7]   Averaged gene expressions for regression [J].
Park, Mee Young ;
Hastie, Trevor ;
Tibshirani, Robert .
BIOSTATISTICS, 2007, 8 (02) :212-227
[8]  
ROSSET S, 2007, IN PRESS ANN STAT, V35
[9]   Sparsity and smoothness via the fused lasso [J].
Tibshirani, R ;
Saunders, M ;
Rosset, S ;
Zhu, J ;
Knight, K .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2005, 67 :91-108