FEATURE SELECTION IN OMICS PREDICTION PROBLEMS USING CAT SCORES AND FALSE NONDISCOVERY RATE CONTROL

Cited by: 92
Authors
Ahdesmaeki, Miika [1, 2]
Strimmer, Korbinian [1]
Affiliations
[1] Univ Leipzig, IMISE, D-04107 Leipzig, Germany
[2] Tampere Univ Technol, Dept Signal Proc, FI-33101 Tampere, Finland
Keywords
Feature selection; linear discriminant analysis; correlation; James-Stein estimator; "small n, large p" setting; correlation-adjusted t-score; false discovery rates; higher criticism; LINEAR DISCRIMINANT-ANALYSIS; SHRUNKEN CENTROIDS; CLASSIFICATION; REGRESSION; DISCOVERY; RANKING; BAYES
DOI
10.1214/09-AOAS277
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James-Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda" available from the R repository CRAN.
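The idea of a correlation-adjusted t-score can be illustrated with a minimal Python sketch. It is not the "sda" implementation: it covers only the two-class case (the ordinary two-sample t-score vector premultiplied by the inverse square root of the within-class correlation matrix), and it shrinks the correlation matrix toward the identity with a fixed, user-supplied intensity `lam` rather than the analytically chosen James-Stein intensity described in the abstract. The function name `cat_scores` and the parameter `lam` are assumptions made for illustration.

```python
import numpy as np

def cat_scores(X, y, lam=0.1):
    """Two-class cat scores (illustrative sketch, not the sda package).

    X   : (n, p) data matrix
    y   : length-n vector with exactly two distinct labels
    lam : fixed shrinkage intensity in (0, 1]; an assumption here,
          the paper chooses the intensity analytically
    Returns (cat, t): correlation-adjusted and ordinary t-scores.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")

    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    n1, n2 = len(X1), len(X2)

    # Residuals around each class mean, stacked for pooled estimates
    R = np.vstack([X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)])
    var = (R ** 2).sum(axis=0) / (n1 + n2 - 2)   # pooled variances

    # Ordinary equal-variance two-sample t-scores, one per feature
    t = (X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt(var * (1 / n1 + 1 / n2))

    # Pooled within-class correlation matrix, shrunk toward the identity
    Z = R / np.sqrt(var)
    P = Z.T @ Z / (n1 + n2 - 2)
    P = (1 - lam) * P + lam * np.eye(P.shape[0])

    # Inverse matrix square root via the eigendecomposition of P
    w, U = np.linalg.eigh(P)
    P_inv_sqrt = (U / np.sqrt(w)) @ U.T

    return P_inv_sqrt @ t, t
```

With `lam=1.0` the shrunken correlation matrix is the identity, so the cat scores reduce to the ordinary t-scores; smaller values let the estimated correlations decorrelate the scores. Features would then be ranked by the magnitude of their cat scores before any thresholding step.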
Pages: 503-519
Page count: 17