clusterExperiment and RSEC: A Bioconductor package and framework for clustering of single-cell and other large gene expression datasets

Cited by: 34
Authors
Risso, Davide [1]
Purvis, Liam [2]
Fletcher, Russell B. [3]
Das, Diya [3, 4]
Ngai, John [3]
Dudoit, Sandrine [2, 4, 5]
Purdom, Elizabeth [2]
Affiliations
[1] Weill Cornell Med, Div Biostat & Epidemiol, New York, NY USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Dept Mol & Cell Biol, 229 Stanley Hall, Berkeley, CA 94720 USA
[4] Univ Calif Berkeley, Berkeley Inst Data Sci, Berkeley, CA USA
[5] Univ Calif Berkeley, Div Epidemiol & Biostat, Berkeley, CA 94720 USA
Funding
National Institutes of Health (USA);
Keywords
FALSE DISCOVERY RATE; RNA-SEQ; NULL;
DOI
10.1371/journal.pcbi.1006378
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010; 081704;
Abstract
Clustering of genes and/or samples is a common task in gene expression analysis. The goals in clustering can vary, but an important scenario is that of finding biologically meaningful subtypes within the samples. This application is particularly appropriate when there are large numbers of samples, as in many human disease studies. With the increasing popularity of single-cell transcriptome sequencing (RNA-Seq), many more controlled experiments on model organisms are similarly creating large gene expression datasets with the goal of detecting previously unknown heterogeneity within cells. When detecting novel subtypes, it is common to run many clustering algorithms and to rely on subsampling and ensemble methods to improve robustness. We introduce a Bioconductor R package, clusterExperiment, that implements a general and flexible strategy we call Resampling-based Sequential Ensemble Clustering (RSEC). RSEC enables the user to easily create multiple, competing clusterings of the data based on different techniques and associated tuning parameters, including easy integration of resampling and sequential clustering, and then provides methods for consolidating the multiple clusterings into a final consensus clustering. The package is modular and allows the user to separately apply the individual components of the RSEC procedure, i.e., apply multiple clustering algorithms, create a consensus clustering or choose tuning parameters, and merge clusters. Additionally, clusterExperiment provides a variety of visualization tools for the clustering process, as well as methods for the identification of possible cluster signatures or biomarkers. The R package clusterExperiment is publicly available through the Bioconductor Project, with a detailed manual (vignette) and well-documented help pages for each function.
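The resampling-and-consensus idea described in the abstract can be illustrated with a minimal sketch. This is not the clusterExperiment API (which is an R package) and not the authors' implementation: it is a standard-library Python toy in which the function names `kmeans`, `co_clustering`, and `consensus` are hypothetical, k-means stands in for "different techniques and tuning parameters", and a simple connected-components rule stands in for the consensus step. Sequential clustering and cluster merging are omitted.

```python
import random
from itertools import combinations

def kmeans(points, k, rng, iters=20):
    """Lloyd's k-means on a list of coordinate tuples; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def co_clustering(points, ks, n_resamples, frac, rng):
    """Per pair: share of runs (over all k and subsamples) in which both were co-clustered."""
    n = len(points)
    together = {pair: 0 for pair in combinations(range(n), 2)}
    sampled = dict(together)
    size = max(2, round(frac * n))
    for k in ks:  # competing clusterings: one tuning parameter value per pass
        for _ in range(n_resamples):
            idx = sorted(rng.sample(range(n), size))  # subsample without replacement
            labels = kmeans([points[i] for i in idx], k, rng)
            for (i, li), (j, lj) in combinations(zip(idx, labels), 2):
                sampled[(i, j)] += 1
                together[(i, j)] += li == lj
    # Pairs never drawn together are left out rather than guessed.
    return {pair: together[pair] / sampled[pair] for pair in together if sampled[pair]}

def consensus(n, proportions, threshold):
    """Join samples whose co-clustering proportion reaches the threshold (connected components)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), prop in proportions.items():
        if prop >= threshold:
            parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Toy data (made up for illustration): two well-separated groups of 10 "samples" each.
rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(10)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(10)]
points = blob_a + blob_b

props = co_clustering(points, ks=(2, 3), n_resamples=20, frac=0.7, rng=rng)
labels = consensus(len(points), props, threshold=0.5)
print(labels)
```

Pooling the co-clustering counts over several values of `k` means that a pair of samples is joined only if it is grouped together under most parameter choices and subsamples, which is the robustness argument the abstract makes for ensemble methods. The threshold of 0.5 is an arbitrary choice for this toy example.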
Pages: 16