Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Cited by: 246
Authors
Li, Jun [1]
Witten, Daniela M. [2]
Johnstone, Iain M. [1]
Tibshirani, Robert [3]
Affiliations
[1] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[2] Univ Washington, Dept Biostat, Seattle, WA 98195 USA
[3] Stanford Univ, Dept Hlth Res & Policy & Stat, Stanford, CA 94305 USA
Funding
U.S. National Institutes of Health; U.S. National Science Foundation
Keywords
Differential expression; FDR; Overdispersion; Poisson log-linear model; RNA-Seq; Score statistic; DIFFERENTIAL EXPRESSION ANALYSIS; GENE-EXPRESSION; STATISTICAL-METHODS; SEQ; SAGE; PACKAGE; MODEL;
DOI
10.1093/biostatistics/kxr031
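Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
090105 [Crop Production Systems and Ecological Engineering]
Abstract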
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
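To make the abstract's "Poisson log-linear model" and "score statistic" concrete, here is a minimal sketch of a textbook score test under the model log mu_ij = log d_i + log beta_j + gamma_j * y_i, testing gamma_j = 0 for each gene j. It is an illustration under stated assumptions, not the authors' procedure: the depth estimate d_i below is plain total-count scaling, a stand-in for the paper's new normalization approach (which the abstract does not specify), and the function name `poisson_score_statistic` is hypothetical.

```python
import numpy as np


def poisson_score_statistic(counts, outcome):
    """Per-gene score statistic for gamma_j = 0 (illustrative sketch only)."""
    counts = np.asarray(counts, dtype=float)   # shape: genes x samples
    y = np.asarray(outcome, dtype=float)       # one outcome value per sample

    # Assumption: depth d_i is sample i's share of all reads
    # (total-count scaling, NOT the paper's normalization).
    depth = counts.sum(axis=0) / counts.sum()

    # Null fit (gamma_j = 0): mu_ij = d_i * (total count of gene j).
    mu = np.outer(counts.sum(axis=1), depth)

    # Score for gamma_j, and its variance after accounting for the
    # estimated nuisance parameter beta_j.
    score = ((counts - mu) * y).sum(axis=1)
    gene_total = mu.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        variance = (mu * y**2).sum(axis=1) - (mu * y).sum(axis=1) ** 2 / gene_total
        # Genes with no reads (or no variation in the outcome) yield NaN.
        return score**2 / variance


# Two genes, four samples, two classes (0 and 1).
counts = np.array([[10, 10, 10, 10],
                   [5, 5, 20, 20]])
stat = poisson_score_statistic(counts, [0, 0, 1, 1])
```

Under standard score-test theory, each statistic is approximately chi-squared with one degree of freedom when the Poisson model holds and counts are large; overdispersed data would violate that assumption, which is one of the concerns the keywords point to.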
Pages: 523-538
Number of pages: 16