Identifying genes that contribute most to good classification in microarrays

Cited by: 49
Authors
Baker, Stuart G. [1]
Kramer, Barnett S.
Affiliations
[1] NCI, Div Canc Prevent, Biometry Res Grp, Bethesda, MD 20892 USA
[2] NIH, Off Dis Prevent, Bethesda, MD 20892 USA
DOI
10.1186/1471-2105-7-407
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The goal of most microarray studies is either the identification of genes that are most differentially expressed or the creation of a good classification rule. The disadvantage of the former is that it ignores the importance of gene interactions; the disadvantage of the latter is that it often does not provide a sufficient focus for further investigation because many genes may be included by chance. Our strategy is to search for classification rules that perform well with few genes and, if they are found, identify genes that occur relatively frequently under multiple random validation (random splits into training and test samples).
Results: We analyzed data from four published studies related to cancer. For classification we used a filter with a nearest centroid rule that is easy to implement and has been previously shown to perform well. To comprehensively measure classification performance we used receiver operating characteristic curves. In the three data sets with good classification performance, the classification rules for 5 genes were only slightly worse than for 20 or 50 genes and somewhat better than for 1 gene. In two of these data sets, one or two genes had relatively high frequencies not noticeable with rules involving 20 or 50 genes: desmin for classifying colon cancer versus normal tissue; and zyxin and secretory granule proteoglycan genes for classifying two types of leukemia.
Conclusion: Using multiple random validation, investigators should look for classification rules that perform well with few genes and select, for further study, genes with relatively high frequencies of occurrence in these classification rules.
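The strategy described in the abstract (filter, nearest centroid rule, repeated random training/test splits, gene frequencies, ROC-based performance) can be illustrated with a minimal sketch. This is not the authors' implementation: the filter used here (largest absolute gap in class means), the squared Euclidean distance, the stratified split, the use of the area under the ROC curve as a one-number summary, and all function names and the synthetic data are assumptions made for illustration only.

```python
import random
from collections import Counter


def rank_genes(rows0, rows1, k):
    """Filter step (assumed): keep the k genes with the largest gap in class means."""
    n_genes = len(rows0[0])
    gaps = []
    for g in range(n_genes):
        m0 = sum(r[g] for r in rows0) / len(rows0)
        m1 = sum(r[g] for r in rows1) / len(rows1)
        gaps.append((abs(m1 - m0), g))
    gaps.sort(reverse=True)
    return [g for _, g in gaps[:k]]


def centroid(rows, genes):
    """Mean expression of the selected genes over the given samples."""
    return [sum(r[g] for r in rows) / len(rows) for g in genes]


def centroid_score(row, genes, c0, c1):
    """Higher score means the sample lies closer to the class-1 centroid."""
    d0 = sum((row[g] - m) ** 2 for g, m in zip(genes, c0))
    d1 = sum((row[g] - m) ** 2 for g, m in zip(genes, c1))
    return d0 - d1


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def multiple_random_validation(X, y, k=5, n_splits=100, test_frac=0.3, seed=0):
    """Repeat random splits; return per-gene selection frequency and mean test AUC.

    Assumes binary labels 0/1 and at least two samples per class.
    """
    rng = random.Random(seed)
    by_class = {c: [i for i, l in enumerate(y) if l == c] for c in (0, 1)}
    counts, aucs = Counter(), []
    for _ in range(n_splits):
        test = []
        for idx in by_class.values():
            # Stratified split that always leaves at least one training sample per class.
            n_test = min(len(idx) - 1, max(1, round(test_frac * len(idx))))
            test += rng.sample(idx, n_test)
        held_out = set(test)
        train0 = [X[i] for i in by_class[0] if i not in held_out]
        train1 = [X[i] for i in by_class[1] if i not in held_out]
        # Genes are selected on the training sample only, then evaluated on the test sample.
        genes = rank_genes(train0, train1, k)
        c0, c1 = centroid(train0, genes), centroid(train1, genes)
        scores = [centroid_score(X[i], genes, c0, c1) for i in test]
        aucs.append(auc(scores, [y[i] for i in test]))
        counts.update(genes)
    freq = {g: n / n_splits for g, n in counts.items()}
    return freq, sum(aucs) / len(aucs)


# Synthetic illustration (not data from the paper): 40 samples, 30 genes,
# with only gene index 3 shifted in class 1.
demo_rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[demo_rng.gauss(0, 1) for _ in range(30)] for _ in y]
for row, label in zip(X, y):
    if label == 1:
        row[3] += 4.0
freq, mean_auc = multiple_random_validation(X, y, k=1, n_splits=50)
```

Selecting genes inside each training split, rather than once on the full data, is what lets the selection frequencies reflect how stable a gene's inclusion is across splits; genes with high values in `freq` under small `k` would be the candidates for further study in the sense of the abstract.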
Pages: 7