A Sparse PLS for Variable Selection when Integrating Omics Data

Cited by: 381
Authors
Le Cao, Kim-Anh [2]
Rossouw, Debra [1]
Robert-Granie, Christele
Besse, Philippe
Affiliations
[1] Univ Stellenbosch, ZA-7600 Stellenbosch, South Africa
[2] Univ Toulouse, Toulouse, France
Keywords
joint analysis; two-block data set; multivariate regression; dimension reduction
DOI
10.2202/1544-6115.1390
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Recent biotechnology advances allow multiple types of omics data, such as transcriptomic, proteomic or metabolomic data sets, to be integrated. The problem of feature selection has been addressed several times in the context of classification, but it needs to be handled in a specific manner when integrating data. In this study, we focus on the integration of two-block data measured on the same samples. Our goal is to combine integration and simultaneous variable selection of the two data sets in a one-step procedure, using a variant of Partial Least Squares (PLS) regression, to facilitate the biologists' interpretation. A novel computational methodology called "sparse PLS" is introduced for predictive analysis to deal with these newly arisen problems. The sparsity of our approach is achieved with a Lasso penalization of the PLS loading vectors when computing the Singular Value Decomposition. Sparse PLS is shown to be effective and biologically meaningful. Comparisons with classical PLS are performed on a simulated data set and on real data sets. For one data set, a thorough biological interpretation of the results is provided. We show that sparse PLS provides a valuable variable selection tool for high-dimensional data sets.
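The penalization idea in the abstract (a Lasso penalty on the PLS loading vectors during the Singular Value Decomposition) can be illustrated with a minimal sketch. It assumes that the penalty is applied as a soft-thresholding step alternating between the two loading vectors of the cross-product matrix of the centered blocks X and Y. The function names, the penalty parameters `lam_x` and `lam_y`, and the stopping rule are illustrative assumptions, not the authors' implementation, and only the first component is computed (no deflation).

```python
import numpy as np


def soft_threshold(a, lam):
    """Lasso-type shrinkage: move entries toward zero by lam, zeroing small ones."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def _normalize(w):
    """Scale to unit length; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w


def sparse_pls_component(X, Y, lam_x, lam_y, n_iter=100, tol=1e-8):
    """First pair of sparse loading vectors (u for X, v for Y).

    Hypothetical sketch: alternating soft-thresholded power iterations
    on M = X^T Y, started from the leading singular vectors of M.
    """
    X = X - X.mean(axis=0)  # center each block column-wise
    Y = Y - Y.mean(axis=0)
    M = X.T @ Y             # p x q cross-product matrix

    # Start from the ordinary (non-sparse) leading singular vectors.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        # Update each loading vector with the other held fixed, then shrink.
        u_new = _normalize(soft_threshold(M @ v, lam_x))
        v_new = _normalize(soft_threshold(M.T @ u_new, lam_y))
        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    return u, v
```

With `lam_x = lam_y = 0` the iteration reduces to an ordinary power method on the cross-product matrix, so no variables are dropped. Increasing the penalties sets more loadings exactly to zero, which is what turns the decomposition into a variable selection tool for both blocks at once.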
Pages: 32