Removing technical variability in RNA-seq data using conditional quantile normalization

Cited by: 416
Authors
Hansen, Kasper D. [2]
Irizarry, Rafael A. [2]
Wu, Zhijin [1]
Affiliations
[1] Brown Univ, Dept Biostat, Providence, RI 02912 USA
[2] Johns Hopkins Bloomberg Sch Publ Hlth, Dept Biostat, Baltimore, MD USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Gene expression; Normalization; RNA sequencing; Differential expression analysis; Model
DOI
10.1093/biostatistics/kxr054
Chinese Library Classification number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
The ability to measure gene expression on a genome-wide scale is one of the most promising accomplishments in molecular biology. Microarrays, the technology that first permitted this, were riddled with problems due to unwanted sources of variability. Many of these problems are now mitigated, after a decade's worth of statistical methodology development. The recently developed RNA sequencing (RNA-seq) technology has generated much excitement in part due to claims of reduced variability in comparison to microarrays. However, we show that RNA-seq data demonstrate unwanted and obscuring variability similar to what was first observed in microarrays. In particular, we find guanine-cytosine content (GC-content) has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. We also report on commonly observed data distortions that demonstrate the need for data normalization. Here, we describe a statistical methodology that improves precision by 42% without loss of accuracy. Our resulting conditional quantile normalization algorithm combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.
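The quantile normalization step named in the abstract can be illustrated with a minimal sketch. This is plain quantile normalization only, not the authors' conditional algorithm: the GC-content regression step is not shown, and the function name `quantile_normalize` and the toy values are hypothetical, chosen for illustration.

```python
def quantile_normalize(samples):
    """Give every sample the same empirical distribution.

    `samples` is a list of equal-length lists, one list per sample
    (hypothetical layout, assumed for this sketch).
    """
    n = len(samples[0])
    # Reference distribution: the mean across samples of the i-th smallest value.
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)]

    normalized = []
    for s in samples:
        # Positions of this sample's values in ascending order (stable for ties).
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value by the reference value at the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples measured on four genes.
toy = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
result = quantile_normalize(toy)
```

After this transformation every sample contains the same set of values, so differences in overall distribution between samples are removed while the within-sample ranking of genes is preserved.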
Pages: 204-216
Number of pages: 13
Related papers
38 in total
[1]   Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries [J].
Aird, Daniel ;
Ross, Michael G. ;
Chen, Wei-Sheng ;
Danielsson, Maxwell ;
Fennell, Timothy ;
Russ, Carsten ;
Jaffe, David B. ;
Nusbaum, Chad ;
Gnirke, Andreas .
GENOME BIOLOGY, 2011, 12 (02)
[2]   Differential expression analysis for sequence count data [J].
Anders, Simon ;
Huber, Wolfgang .
GENOME BIOLOGY, 2010, 11 (10)
[3]   A comparison of normalization methods for high density oligonucleotide array data based on variance and bias [J].
Bolstad, BM ;
Irizarry, RA ;
Åstrand, M ;
Speed, TP .
BIOINFORMATICS, 2003, 19 (02) :185-193
[4]   Evaluating Gene Expression in C57BL/6J and DBA/2J Mouse Striatum Using RNA-Seq and Microarrays [J].
Bottomly, Daniel ;
Walter, Nicole A. R. ;
Hunter, Jessica Ezzell ;
Darakjian, Priscila ;
Kawane, Sunita ;
Buck, Kari J. ;
Searles, Robert P. ;
Mooney, Michael ;
McWeeney, Shannon K. ;
Hitzemann, Robert .
PLOS ONE, 2011, 6 (03)
[5]   Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments [J].
Bullard, James H. ;
Purdom, Elizabeth ;
Hansen, Kasper D. ;
Dudoit, Sandrine .
BMC BIOINFORMATICS, 2010, 11
[6]   Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data [J].
Carvalho, Benilton ;
Bengtsson, Henrik ;
Speed, Terence P. ;
Irizarry, Rafael A. .
BIOSTATISTICS, 2007, 8 (02) :485-499
[7]   Polymorphic Cis- and Trans-Regulation of Human Gene Expression [J].
Cheung, Vivian G. ;
Nayak, Renuka R. ;
Wang, Isabel Xiaorong ;
Elwyn, Susannah ;
Cousins, Sarah M. ;
Morley, Michael ;
Spielman, Richard S. .
PLOS BIOLOGY, 2010, 8 (09)
[8]   Substantial biases in ultra-short read data sets from high-throughput DNA sequencing [J].
Dohm, Juliane C. ;
Lottaz, Claudio ;
Borodina, Tatiana ;
Himmelbauer, Heinz .
NUCLEIC ACIDS RESEARCH, 2008, 36 (16)
[9]   Digital Gene Expression Signatures for Maize Development [J].
Eveland, Andrea L. ;
Satoh-Nagasawa, Namiko ;
Goldshmidt, Alexander ;
Meyer, Sandra ;
Beatty, Mary ;
Sakai, Hajime ;
Ware, Doreen ;
Jackson, David .
PLANT PHYSIOLOGY, 2010, 154 (03) :1024-1039
[10]   Ensembl 2011 [J].
Flicek, Paul ;
Amode, M. Ridwan ;
Barrell, Daniel ;
Beal, Kathryn ;
Brent, Simon ;
Chen, Yuan ;
Clapham, Peter ;
Coates, Guy ;
Fairley, Susan ;
Fitzgerald, Stephen ;
Gordon, Leo ;
Hendrix, Maurice ;
Hourlier, Thibaut ;
Johnson, Nathan ;
Kaehaeri, Andreas ;
Keefe, Damian ;
Keenan, Stephen ;
Kinsella, Rhoda ;
Kokocinski, Felix ;
Kulesha, Eugene ;
Larsson, Pontus ;
Longden, Ian ;
McLaren, William ;
Overduin, Bert ;
Pritchard, Bethan ;
Riat, Harpreet Singh ;
Rios, Daniel ;
Ritchie, Graham R. S. ;
Ruffier, Magali ;
Schuster, Michael ;
Sobral, Daniel ;
Spudich, Giulietta ;
Tang, Y. Amy ;
Trevanion, Stephen ;
Vandrovcova, Jana ;
Vilella, Albert J. ;
White, Simon ;
Wilder, Steven P. ;
Zadissa, Amonida ;
Zamora, Jorge ;
Aken, Bronwen L. ;
Birney, Ewan ;
Cunningham, Fiona ;
Dunham, Ian ;
Durbin, Richard ;
Fernandez-Suarez, Xose M. ;
Herrero, Javier ;
Hubbard, Tim J. P. ;
Parker, Anne ;
Proctor, Glenn .
NUCLEIC ACIDS RESEARCH, 2011, 39 :D800-D806