Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems

Cited by: 685
Authors
Cao, Kim-Anh Le [1]
Boitard, Simon [2]
Besse, Philippe [3, 4]
Affiliations
[1] Univ Queensland, Queensland Facil Adv Bioinformat, St Lucia, Qld 4072, Australia
[2] INRA, Lab Genet Cellulaire UMR444, F-31326 Castanet Tolosan, France
[3] Univ Toulouse, Inst Math Toulouse, F-31062 Toulouse, France
[4] CNRS, UMR 5219, F-31062 Toulouse, France
Keywords
PARTIAL LEAST-SQUARES; TUMOR CLASSIFICATION; GENE SELECTION; CANCER CLASSIFICATION; DIMENSION REDUCTION; R PACKAGE; PREDICTION; REGRESSION; DIAGNOSIS; TOOL;
DOI
10.1186/1471-2105-12-253
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Background: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.
Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.
Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
Pages: 16