Error estimates for the analysis of differential expression from RNA-seq count data

Cited by: 35
Authors
Burden, Conrad J. [1]
Qureshi, Sumaira E. [1]
Wilson, Susan R. [1,2]
Affiliations
[1] Australian Natl Univ, Inst Math Sci, Canberra, ACT, Australia
[2] Univ New S Wales, Sch Math & Stat, Sydney, NSW, Australia
Funding
UK Medical Research Council; Australian Research Council
Keywords
RNA-seq; Differential expression analysis; False discovery rates; FALSE DISCOVERY RATE; NULL;
DOI
10.7717/peerj.576
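Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes
070301 [Inorganic chemistry]; 070403 [Astrophysics]; 070507 [Natural resources and territorial spatial planning]; 090105 [Crop production systems and ecological engineering];
Abstract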
Background. A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression.
Results. We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is Polyfit, an add-on package to edgeR and DESeq which we introduce. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure.
Conclusions. We find that the best-performing package is the QLSpline implementation of QuasiSeq, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, albeit with a very slow run time. This finding holds provided the number of biological replicates in each condition is at least 4. The next best-performing packages are edgeR and DESeq2. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of DESeq (for approximately 6 or more replicates per condition), making its performance comparable with that of edgeR and DESeq2 in our tests with synthetic data.
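The abstract refers to estimating FDRs from a set of p-values and to the Storey-Tibshirani procedure. As a rough illustration only, the Python sketch below estimates the proportion of true nulls (pi0) from the p-values above a fixed threshold and converts p-values to q-values. It is not the Polyfit method and not the code of any package named above: the function names `estimate_pi0` and `qvalues` and the fixed threshold `lam=0.5` are assumptions made for this example, and the published Storey-Tibshirani procedure smooths the estimate over a range of thresholds rather than using a single one.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the fraction of true null hypotheses (pi0).

    Assumption for this sketch: a single fixed threshold `lam`.
    If null p-values are uniform, a fraction (1 - lam) of them
    falls above `lam`, so the count above `lam` is rescaled.
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("need at least one p-value")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    n_above = sum(1 for p in pvalues if p > lam)
    # Cap at 1 because a proportion cannot exceed 1.
    return min(1.0, n_above / (m * (1.0 - lam)))


def qvalues(pvalues, pi0):
    """Convert p-values to q-values, returned in the input order.

    For the p-value of rank i (ascending), the raw estimate is
    pi0 * m * p / i; a running minimum taken from the largest
    p-value downwards makes the q-values monotone in p.
    With pi0 = 1 this gives Benjamini-Hochberg adjusted p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    # Made-up p-values, purely for illustration.
    p = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.6, 0.9]
    pi0 = estimate_pi0(p)
    for pv, qv in zip(p, qvalues(p, pi0)):
        print(f"p = {pv:.3f}  q = {qv:.4f}")
```

The estimate of pi0 relies on null p-values being uniformly distributed, which is the assumption the abstract says is frequently violated for overdispersed count data; when it fails, q-values computed this way can misstate the FDR.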
Pages: 26
Related papers
29 in total
[1]
Differential expression analysis for sequence count data [J].
Anders, Simon ;
Huber, Wolfgang .
GENOME BIOLOGY, 2010, 11 (10)
[2]
CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING [J].
BENJAMINI, Y ;
HOCHBERG, Y .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01) :289-300
[3]
Evaluating Gene Expression in C57BL/6J and DBA/2J Mouse Striatum Using RNA-Seq and Microarrays [J].
Bottomly, Daniel ;
Walter, Nicole A. R. ;
Hunter, Jessica Ezzell ;
Darakjian, Priscila ;
Kawane, Sunita ;
Buck, Kari J. ;
Searles, Robert P. ;
Mooney, Michael ;
McWeeney, Shannon K. ;
Hitzemann, Robert .
PLOS ONE, 2011, 6 (03)
[4]
A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis [J].
Dillies, Marie-Agnes ;
Rau, Andrea ;
Aubert, Julie ;
Hennequet-Antier, Christelle ;
Jeanmougin, Marine ;
Servant, Nicolas ;
Keime, Celine ;
Marot, Guillemette ;
Castel, David ;
Estelle, Jordi ;
Guernec, Gregory ;
Jagla, Bernd ;
Jouneau, Luc ;
Laloe, Denis ;
Le Gall, Caroline ;
Schaeffer, Brigitte ;
Le Crom, Stephane ;
Guedj, Mickael ;
Jaffrezic, Florence .
BRIEFINGS IN BIOINFORMATICS, 2013, 14 (06) :671-683
[5]
Dunne A., 1996, J ROY STAT SOC D-STA, V45, P397
[6]
Large-scale simultaneous hypothesis testing: The choice of a null hypothesis [J].
Efron, B .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2004, 99 (465) :96-104
[7]
A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments [J].
Esnaola, Mikel ;
Puig, Pedro ;
Gonzalez, David ;
Castelo, Robert ;
Gonzalez, Juan R. .
BMC BIOINFORMATICS, 2013, 14
[8]
Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons [J].
Jin, Jiashun ;
Cai, T. Tony .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2007, 102 (478) :495-506
[9]
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome [J].
Langmead, Ben ;
Trapnell, Cole ;
Pop, Mihai ;
Salzberg, Steven L. .
GENOME BIOLOGY, 2009, 10 (03)
[10]
Normalization, testing, and false discovery rate estimation for RNA-sequencing data [J].
Li, Jun ;
Witten, Daniela M. ;
Johnstone, Iain M. ;
Tibshirani, Robert .
BIOSTATISTICS, 2012, 13 (03) :523-538