Statistical significance of variables driving systematic variation in high-dimensional data

Cited by: 149
Authors
Chung, Neo Christopher [1]
Storey, John D. [1, 2]
Affiliations
[1] Princeton Univ, Lewis Sigler Inst Integrat Genom, Princeton, NJ 08544 USA
[2] Princeton Univ, Dept Mol Biol, Princeton, NJ 08544 USA
Keywords
PRINCIPAL-COMPONENTS-ANALYSIS; CELL-CYCLE; GENE-EXPRESSION; IDENTIFICATION; DECOMPOSITION; ASSOCIATION; MICROARRAY; CONFIDENCE; PATTERNS;
DOI
10.1093/bioinformatics/btu674
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification number
071010 ; 081704 ;
Abstract
Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting.
Results: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
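The over-fitting problem described in the abstract, and the resampling idea behind the jackstraw, can be illustrated with a minimal sketch. It assumes the commonly described scheme: permute a small number of rows to create synthetic null variables, recompute the PCs, and collect the association statistics of those permuted rows as an empirical null. This is not the authors' implementation; the function names (`top_pcs`, `f_stats`, `jackstraw_pvalues`), the parameters `r`, `s`, `B`, and the add-one p-value convention are assumptions made for illustration only.

```python
import numpy as np

def top_pcs(Y, r):
    """Row-center Y (variables x samples) and return it with the top-r PCs over samples."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc, Vt[:r].T  # shapes: (m, n) and (n, r)

def f_stats(Yc, V):
    """F-statistic of each centered row regressed on the orthonormal PCs in V."""
    n, r = V.shape
    fitted = Yc @ V @ V.T                      # projection onto the PC span
    rss_full = ((Yc - fitted) ** 2).sum(axis=1)
    rss_null = (Yc ** 2).sum(axis=1)           # intercept-only model after centering
    return ((rss_null - rss_full) / r) / (rss_full / (n - 1 - r))

def jackstraw_pvalues(Y, r=1, s=10, B=100, seed=0):
    """Empirical p-values for association of each row of Y with the top-r PCs."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    m, _ = Y.shape
    Yc, V = top_pcs(Y, r)
    observed = f_stats(Yc, V)

    null = np.empty((B, s))
    for b in range(B):
        rows = rng.choice(m, size=s, replace=False)
        Yb = Y.copy()
        for i in rows:
            Yb[i] = rng.permutation(Y[i])      # break this row's link to the latent structure
        Ybc, Vb = top_pcs(Yb, r)               # PCs are re-estimated with the null rows included
        null[b] = f_stats(Ybc[rows], Vb)       # keep statistics of the permuted rows only

    null = np.sort(null.ravel())
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)        # add-one convention (an assumption)
```

Because the PCs are recomputed with the permuted rows still in the data, the null statistics carry the same over-fitting inflation as the observed ones, which is what makes the comparison fair. Keeping `s` small relative to the number of rows leaves the overall systematic variation largely intact in each resample.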
Pages: 545-554
Page count: 10