Strong feature sets from small samples

Cited by: 69
Authors
Kim, S
Dougherty, ER
Barrera, J
Chen, YD
Bittner, ML
Trent, JM
Affiliations
[1] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77843 USA
[2] Univ Sao Paulo, Dept Ciencia Comp, Sao Paulo, Brazil
[3] NIH, Natl Human Genome Res Inst, Canc Genet Branch, Bethesda, MD 20892 USA
Keywords
perceptron; gene expression; classification; cancer
DOI
10.1089/10665270252833226
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to its large variance, often gives very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution obtained by spreading the mass of the sample points, which makes classification more difficult while maintaining the sample geometry. The algorithm is parameterized by the variance of the spreading distribution. As the spread increases, the algorithm finds gene sets whose classification accuracy remains strong under greater spreading of the sample. The error, as a function of the spread, gives a measure of the strength of the feature set. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but also for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing a substantial saving in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expression levels can be used to classify BRCA1 and BRCA2 tumors.
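The idea of measuring error as a function of the spread can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes an isotropic Gaussian spreading kernel of standard deviation sigma centered on each sample point, and it only evaluates a given linear classifier rather than designing one. The function name spread_error and the toy data are hypothetical. Under that assumption, the probability that a spread point falls on the wrong side of the hyperplane w.x + b = 0 has the closed form Phi(-m / sigma), where m is the signed distance of the sample point from the hyperplane, so no sampling is needed.

```python
import math


def spread_error(points, labels, w, b, sigma):
    """Error of the linear classifier sign(w.x + b) on the spread sample.

    Assumption (not from the paper): each sample point is replaced by an
    isotropic Gaussian with standard deviation sigma. Labels are +1 or -1.
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    total = 0.0
    for x, y in zip(points, labels):
        # Signed distance to the hyperplane; positive means correctly classified.
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
        if sigma == 0:
            # No spreading: plain resubstitution error on the sample points.
            total += 1.0 if margin < 0 else 0.0
        else:
            # P(N(margin, sigma^2) < 0) = Phi(-margin / sigma)
            total += 0.5 * math.erfc(margin / (sigma * math.sqrt(2)))
    return total / len(points)


# Hypothetical two-feature toy sample, two points per class.
points = [(-1.0, 0.0), (-2.0, 0.5), (1.0, 0.0), (2.0, -0.5)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 0.0), 0.0

for sigma in (0.0, 0.5, 1.0, 2.0):
    print(sigma, round(spread_error(points, labels, w, b, sigma), 4))
```

In this toy sample the error is zero without spreading and grows as sigma increases. A feature set whose error stays low at larger sigma would count as stronger in the sense the abstract describes.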
Pages: 127-146
Number of pages: 20