Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data

Cited by: 237
Authors
Jeffery, Ian B. [1]
Higgins, Desmond G.
Culhane, Aedin C.
Affiliations
[1] Natl Univ Ireland Univ Coll Dublin, Conway Inst, Dublin 4, Ireland
[2] Dana Farber Canc Inst, Dept Biostat & Computat Biol, Boston, MA 02115 USA
DOI
10.1186/1471-2105-7-359
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, the classical t-statistic and moderated t-statistics. Even though these methods often return dissimilar gene lists, few direct comparisons of them exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets and compare both the gene lists produced and how these perform in class prediction of test datasets.
Results: In this study, we compared the efficiency of the following feature selection methods: significance analysis of microarrays (SAM), analysis of variance (ANOVA), the empirical Bayes t-statistic, template matching, maxT, between group analysis (BGA), area under the receiver operating characteristic (ROC) curve, the Welch t-statistic, fold change, rank products, and sets of randomly selected genes. In each case these methods were applied to 9 different binary (two-class) microarray datasets. Firstly, we found little agreement in the gene lists produced by the different methods. Only 8 to 21% of genes were in common across all 10 feature selection methods. Secondly, we evaluated the class prediction efficiency of each gene list in training and test cross-validation using four supervised classifiers.
Conclusion: We report that the choice of feature selection method, the number of genes in the gene list, the number of cases (samples) and the noise in the dataset substantially influence classification success. Recommendations are made for the choice of feature selection method. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performed well when datasets had low numbers of samples or high levels of noise. The empirical Bayes t-statistic performed well across a range of sample sizes.
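The kind of gene-list comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it ranks genes in a simulated two-class dataset by two of the named methods (fold change and the Welch t-statistic) and reports how much the two top-ranked lists overlap. The simulated data, the helper names (`simulate`, `top_k`, `overlap_fraction`) and the list size of 20 are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def fold_change(a, b):
    """Absolute difference of group means (values assumed to be on a log2 scale)."""
    return abs(mean(a) - mean(b))


def welch_t(a, b):
    """Absolute Welch t-statistic, which does not assume equal group variances."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se


def top_k(scores, k):
    """Indices of the k highest-scoring genes, as a set."""
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return set(order[:k])


def overlap_fraction(list_a, list_b):
    """Fraction of genes in list_a that also appear in list_b."""
    return len(list_a & list_b) / len(list_a)


def simulate(n_genes=200, n_per_class=6, n_de=20, seed=0):
    """Hypothetical data: the first n_de genes are shifted by 1 in class B.

    Each gene gets its own noise level, so the two scores can disagree.
    """
    rng = random.Random(seed)
    data = []
    for g in range(n_genes):
        sd = rng.uniform(0.2, 1.5)
        shift = 1.0 if g < n_de else 0.0
        a = [rng.gauss(0.0, sd) for _ in range(n_per_class)]
        b = [rng.gauss(shift, sd) for _ in range(n_per_class)]
        data.append((a, b))
    return data


data = simulate()
fc_list = top_k([fold_change(a, b) for a, b in data], 20)
t_list = top_k([welch_t(a, b) for a, b in data], 20)
shared = overlap_fraction(fc_list, t_list)
print(f"Fraction of genes shared by the two top-20 lists: {shared:.2f}")
```

Fold change ignores within-group variance while the Welch t-statistic divides by it, so noisy genes with large mean differences tend to rank high on the former and lower on the latter. That difference is one plausible reason two methods can return dissimilar lists from the same data.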
Pages: 16