Quality control and robust estimation for cDNA microarrays with replicates

Cited by: 6
Authors
Gottardo, R [1]
Raftery, AE
Yeung, KY
Bumgarner, RE
Affiliations
[1] Univ British Columbia, Dept Stat, Vancouver, BC V6T 1Z2, Canada
[2] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[3] Univ Washington, Dept Microbiol, Seattle, WA 98195 USA
Funding
US National Institutes of Health
Keywords
Bayesian hierarchical model; gene filtering; heteroscedasticity; Markov chain Monte Carlo; outlier; quality control; t distribution
DOI
10.1198/016214505000001096
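Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract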
We consider robust estimation of gene intensities from cDNA microarray data with replicates. Several statistical methods for estimating gene intensities from microarrays have been proposed, but little work has been done on robust estimation. This is particularly relevant for experiments with replicates, because even one outlying replicate can have a disastrous effect on the estimated intensity for the gene concerned. Because of the many steps involved in the experimental process from hybridization to image analysis, cDNA microarray data often contain outliers. For example, an outlying data value could occur because of scratches or dust on the surface, imperfections in the glass, or imperfections in the array production. We develop a Bayesian hierarchical model for robust estimation of cDNA microarray intensities. Outliers are modeled explicitly using a t-distribution, and our model also addresses such classical issues as design effects, normalization, transformation, and nonconstant variance. Parameter estimation is carried out using Markov chain Monte Carlo. By identifying potential outliers, the method provides automatic quality control of replicate, array, and gene measurements. The method is applied to three publicly available gene expression datasets and compared with three other methods: ANOVA-normalized log ratios, the median log ratio, and estimation after the removal of outliers based on Dixon's test. We find that the between-replicate variability of the intensity estimates is lower for our method than for any of the others. We also address the issue of whether the background should be subtracted when estimating intensities. It has been argued that this should not be done because it increases variability, whereas the arguments for doing so are that there is a physical basis for the image background, and that not doing so will bias downward the estimated log ratios of differentially expressed genes. 
We show that the arguments on both sides of this debate are correct for our data, but that by using our model one can have the best of both worlds: One can subtract the background without increasing variability by much.
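To illustrate why modeling outliers with a t-distribution protects replicate-based estimates, here is a minimal sketch. It is not the Bayesian hierarchical model or the MCMC sampler described in the abstract; it is a much simpler EM point estimate of the location of a t-distribution with fixed degrees of freedom, applied to the replicates of a single hypothetical gene. The function name `t_location_em`, the choice `nu=4`, and the example log ratios are all assumptions made for illustration.

```python
import numpy as np


def t_location_em(x, nu=4.0, n_iter=500, tol=1e-10):
    """EM estimate of location and scale under a t model (illustrative only).

    The t-distribution is treated as a scale mixture of normals, so each
    observation gets a weight that shrinks as it moves away from the center.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                      # robust starting value
    sigma2 = np.var(x) if np.var(x) > 0 else 1.0
    for _ in range(n_iter):
        # E-step: expected latent precision for each replicate
        w = (nu + 1.0) / (nu + (x - mu) ** 2 / sigma2)
        # M-step: weighted mean and weighted scale
        mu_new = np.sum(w * x) / np.sum(w)
        sigma2_new = max(np.sum(w * (x - mu_new) ** 2) / x.size, 1e-12)
        converged = abs(mu_new - mu) < tol
        mu, sigma2 = mu_new, sigma2_new
        if converged:
            break
    # Weights evaluated at the final estimates; small values flag outliers
    w = (nu + 1.0) / (nu + (x - mu) ** 2 / sigma2)
    return mu, sigma2, w


# Hypothetical replicate log ratios for one gene; the last value is an outlier.
log_ratios = np.array([1.02, 0.95, 1.10, 0.98, 4.50])
mu_hat, sigma2_hat, weights = t_location_em(log_ratios)

print("mean:      ", log_ratios.mean())
print("median:    ", np.median(log_ratios))
print("t estimate:", mu_hat)
print("weights:   ", weights)
```

In this sketch the outlying replicate receives the smallest weight, so it pulls the estimate far less than it pulls the plain mean. The same weights could serve as a rough quality-control flag, loosely in the spirit of the automatic outlier identification the abstract describes.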
Pages: 30-40
Page count: 11