Exploration, normalization, and summaries of high density oligonucleotide array probe level data

Cited by: 8492
Authors
Irizarry, RA
Hobbs, B
Collin, F
Beazer-Barclay, YD
Antonellis, KJ
Scherf, U
Speed, TP
Affiliations
[1] Johns Hopkins Univ, Dept Biostat, Baltimore, MD 21205 USA
[2] WEHI, Div Genet & Bioinformat, Melbourne, Vic, Australia
[3] Gene Log Inc, Berkeley, CA USA
[4] Gene Log Inc, Gaithersburg, MD USA
[5] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
Keywords
EXPRESSION;
DOI
10.1093/biostatistics/4.2.249
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip(R) system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip(R) arrays; part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth's Genetics Institute involving 95 HG-U95A human GeneChip(R) arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip(R) arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance-mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix's (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.
An R package with the functions used for the analyses in this paper is part of the Bioconductor project and can be downloaded (http://www.bioconductor.org). Supplemental material, such as color versions of the figures, is available on the web (http://www.biostat.jhsph.edu/~ririzarr/affy).
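The summarization idea in the abstract (a robust average of log-transformed PM values under a linear model with probe-specific affinities) can be illustrated with a small sketch. This is not the paper's R code: it assumes the PM values have already been background-adjusted and normalized, and it assumes Tukey's median polish as the robust fitting procedure. The function names `median_polish` and `rma_summarize` are hypothetical and chosen only for this illustration.

```python
import math
from statistics import median


def median_polish(table, n_iter=10, tol=1e-6):
    """Tukey median polish of a rows x columns table.

    Returns (overall, row_effects, col_effects, residuals) such that
    table[i][j] == overall + row_effects[i] + col_effects[j] + residuals[i][j].
    """
    resid = [row[:] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        # Re-centre the column effects; their median moves to the overall term.
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Re-centre the row effects in the same way.
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        # Stop once a full sweep no longer changes anything noticeably.
        if max(abs(x) for x in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def rma_summarize(pm):
    """Summarize one probe set: rows are probes, columns are arrays.

    `pm` holds positive PM intensities, assumed here to be already
    background-adjusted and normalized. Returns one log2-scale
    expression value per array.
    """
    log_pm = [[math.log2(v) for v in row] for row in pm]
    overall, _probe_affinity, array_eff, _resid = median_polish(log_pm)
    return [overall + a for a in array_eff]


if __name__ == "__main__":
    # Toy probe set: 3 probes (rows) measured on 3 arrays (columns).
    toy_pm = [[2 ** (a + b) for b in (8, 9, 10)] for a in (0, 1, 2)]
    print(rma_summarize(toy_pm))
```

In this sketch the row effects play the role of the probe-specific affinities mentioned in the abstract, and the per-array value is the overall level plus the array effect. Because medians are used in place of means, a single aberrant probe has limited influence on the summary.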
Pages: 249-264
Page count: 16
Related papers
12 in total
  • [1] *AFF, 1999, AFF MICR SUIT US GUI
  • [2] Quantitative analysis of mRNA amplification by in vitro transcription
    Baugh, L. R.
    Hill, A. A.
    Brown, E. L.
    Hunter, Craig P.
    [J]. NUCLEIC ACIDS RESEARCH, 2001, 29 (05)
  • [3] BOLSTAD BM, 2002, IN PRESS BIOINFORMAT
  • [4] Dudoit S, 2002, STAT SINICA, V12, P111
  • [5] Hartemink AJ., 2001, SPIE BIOS
  • [6] Genomic analysis of gene expression in C. elegans
    Hill, AA
    Hunter, CP
    Tsung, BT
    Tucker-Kellogg, G
    Brown, EL
    [J]. SCIENCE, 2000, 290 (5492) : 809 - 812
  • [7] Hill AA, 2001, GENOME BIOL, V2
  • [8] Holder D, 2001, P ASA ANN M 2001 ATL
  • [9] HUBBELL E, 2001, GEN LOG WORKSH LOW L
  • [10] Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection
    Li, C
    Wong, WH
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2001, 98 (01) : 31 - 36