Power analysis and sample size estimation for RNA-Seq differential expression

Cited by: 161
Authors
Ching, Travers [1 ,2 ]
Huang, Sijia [1 ,2 ]
Garmire, Lana X. [1 ,2 ]
Affiliations
[1] Univ Hawaii, Ctr Canc, Honolulu, HI 96813 USA
[2] Univ Hawaii Manoa, Grad Program Mol Biosci & Bioengn, Honolulu, HI 96822 USA
Keywords
RNA-Seq; sample size; power analysis; simulation; bioinformatics; GENE-EXPRESSION; STATISTICAL-METHODS; MODEL; NORMALIZATION; STRATEGIES; DISPERSION; SEQUENCE; PACKAGE; TOOL;
DOI
10.1261/rna.046011.114
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
It is crucial for researchers to optimize RNA-Seq experimental designs for differential expression detection. Currently, the field lacks general methods to estimate power and sample size for RNA-Seq in complex experimental designs, under the assumption of the negative binomial distribution. We simulate RNA-Seq count data based on parameters estimated from six widely different public data sets (including cell line comparison, tissue comparison, and cancer data sets) and calculate the statistical power in paired and unpaired sample experiments. We comprehensively compare five differential expression analysis packages (DESeq, edgeR, DESeq2, sSeq, and EBSeq) and evaluate their performance by power, receiver operating characteristic (ROC) curves, and other metrics including areas under the curve (AUC), Matthews correlation coefficient (MCC), and F-measures. DESeq2 and edgeR tend to give the best performance in general. Increasing sample size or sequencing depth increases power; however, increasing sample size raises power more than increasing sequencing depth does, especially once the sequencing depth reaches 20 million reads. Long intergenic noncoding RNAs (lincRNAs) yield lower power than protein-coding mRNAs, given their lower expression level in the same RNA-Seq experiment. On the other hand, paired-sample RNA-Seq significantly enhances the statistical power, confirming the importance of considering the multifactor experimental design. Finally, a local optimal power is achievable for a given budget constraint, and the dominant contributing factor is sample size rather than sequencing depth. In conclusion, we provide a power analysis tool (http://www2.hawaii.edu/~lgarmire/RNASeqPowerCalculator.htm) that captures the dispersion in the data and can serve as a practical reference under the budget constraint of RNA-Seq experiments.
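The simulation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tool and uses none of the five packages: it draws negative binomial counts for a single gene as a gamma-Poisson mixture, uses a Welch t statistic on log2(count + 1) as a stand-in test, and calibrates the rejection threshold from simulations under no differential expression. The function names, the mean of 50 reads, the dispersion of 0.1, and the fold change of 2 are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for modest means)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mean, dispersion, rng):
    """Negative binomial draw as a gamma-Poisson mixture.

    The resulting variance is mean + dispersion * mean**2.
    """
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(rate, rng)


def welch_t(a, b):
    """Welch t statistic for two samples (0.0 if both have zero variance)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se


def abs_t_stat(mean, fold_change, dispersion, n, rng):
    """Simulate one unpaired experiment with n samples per group; return |t|."""
    ctrl = [math.log2(sample_nb(mean, dispersion, rng) + 1) for _ in range(n)]
    trt = [math.log2(sample_nb(mean * fold_change, dispersion, rng) + 1) for _ in range(n)]
    return abs(welch_t(trt, ctrl))


def estimate_power(mean, fold_change, dispersion, n, alpha=0.05, n_sim=500, seed=0):
    """Estimate power for one gene by simulation.

    Assumption: the critical value is the empirical (1 - alpha) quantile of |t|
    under fold change 1, rather than a p-value from a differential expression package.
    """
    rng = random.Random(seed)
    null = sorted(abs_t_stat(mean, 1.0, dispersion, n, rng) for _ in range(n_sim))
    critical = null[min(n_sim - 1, int((1 - alpha) * n_sim))]
    hits = sum(
        abs_t_stat(mean, fold_change, dispersion, n, rng) > critical
        for _ in range(n_sim)
    )
    return hits / n_sim


if __name__ == "__main__":
    # Hypothetical gene: mean 50 reads, dispersion 0.1, two-fold change.
    for n in (3, 5, 10):
        print(n, estimate_power(mean=50, fold_change=2.0, dispersion=0.1, n=n))
```

Running the script prints the estimated power for 3, 5, and 10 samples per group; under these assumed parameters the estimate should rise with sample size, in line with the trend the abstract reports. A lower `mean` can be used to mimic a weakly expressed transcript such as a lincRNA.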
Pages: 1684-1696
Page count: 13
Related papers
45 in total
[1]   Inferences and power analysis concerning two negative binomial distributions with an application to MRI lesion counts data [J].
Aban, Inmaculada B. ;
Cutter, Gary R. ;
Mavinga, Nsoki .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2009, 53 (03) :820-833
[2]   Differential expression analysis for sequence count data [J].
Anders, Simon ;
Huber, Wolfgang .
GENOME BIOLOGY, 2010, 11 (10)
[3]   Evaluating Gene Expression in C57BL/6J and DBA/2J Mouse Striatum Using RNA-Seq and Microarrays [J].
Bottomly, Daniel ;
Walter, Nicole A. R. ;
Hunter, Jessica Ezzell ;
Darakjian, Priscila ;
Kawane, Sunita ;
Buck, Kari J. ;
Searles, Robert P. ;
Mooney, Michael ;
McWeeney, Shannon K. ;
Hitzemann, Robert .
PLOS ONE, 2011, 6 (03)
[4]   Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments [J].
Bullard, James H. ;
Purdom, Elizabeth ;
Hansen, Kasper D. ;
Dudoit, Sandrine .
BMC BIOINFORMATICS, 2010, 11
[5]   Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression [J].
Busby, Michele A. ;
Stewart, Chip ;
Miller, Chase A. ;
Grzeda, Krzysztof R. ;
Marth, Gabor T. .
BIOINFORMATICS, 2013, 29 (05) :656-657
[6]   Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses [J].
Cabili, Moran N. ;
Trapnell, Cole ;
Goff, Loyal ;
Koziol, Magdalena ;
Tazon-Vega, Barbara ;
Regev, Aviv ;
Rinn, John L. .
GENES & DEVELOPMENT, 2011, 25 (18) :1915-1927
[7]   Statistical methods on detecting differentially expressed genes for RNA-seq data [J].
Chen, Zhongxue ;
Liu, Jianzhong ;
Ng, Hon Keung Tony ;
Nadarajah, Saralees ;
Kaufman, Howard L. ;
Yang, Jack Y. ;
Deng, Youping .
BMC SYSTEMS BIOLOGY, 2011, 5
[8]   ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets [J].
Frazee, Alyssa C. ;
Langmead, Ben ;
Leek, Jeffrey T. .
BMC BIOINFORMATICS, 2011, 12
[9]   A Global Clustering Algorithm to Identify Long Intergenic Non-Coding RNA - with Applications in Mouse Macrophages [J].
Garmire, Lana X. ;
Garmire, David G. ;
Huang, Wendy ;
Yao, Joyee ;
Glass, Christopher K. ;
Subramaniam, Shankar .
PLOS ONE, 2011, 6 (09)
[10]   Calculating Sample Size Estimates for RNA Sequencing Data [J].
Hart, Steven N. ;
Therneau, Terry M. ;
Zhang, Yuji ;
Poland, Gregory A. ;
Kocher, Jean-Pierre .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2013, 20 (12) :970-978