Classification of microarray data with penalized logistic regression

Cited by: 37
Authors
Eilers, PHC [1]
Boer, JM [1]
van Ommen, GJ [1]
van Houwelingen, HC [1]
Affiliations
[1] Leiden Univ, Med Ctr, Dept Med Stat, Leiden, Netherlands
Source
MICROARRAYS: OPTICAL TECHNOLOGIES AND INFORMATICS | 2001 / Vol. 4266
Keywords
AIC; genetic expression; cross-validation; generalized linear models; multicollinearity; multivariate calibration; ridge regression; singular value decomposition;
DOI
10.1117/12.427987
Chinese Library Classification number
R318 [Biomedical Engineering];
Discipline classification code
0831 ;
Abstract
Classification of microarray data needs a firm statistical basis. In principle, logistic regression can provide it, modeling the probability of class membership with (transforms of) linear combinations of explanatory variables. However, classical logistic regression does not work for microarrays, because there will generally be far more variables than observations. One problem is multicollinearity: the estimating equations become singular and have no unique, stable solution. A second problem is over-fitting: a model may fit a data set well but perform badly when used to classify new data. We propose penalized likelihood as a solution to both problems. The values of the regression coefficients are constrained in a way similar to ridge regression. All variables play an equal role; there is no ad hoc selection of "most relevant" or "most expressed" genes. The dimension of the resulting systems of equations is equal to the number of variables and will generally be too large for most computers, but it can be reduced dramatically with the singular value decomposition of some matrices. The penalty is optimized with AIC (Akaike's Information Criterion), which is essentially a measure of prediction performance. We find that penalized logistic regression performs well on a public data set (the MIT ALL/AML data).
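The approach summarized in the abstract (a ridge-type penalty, a reduction of the problem size via the singular value decomposition, and AIC for tuning the penalty) can be illustrated with a minimal sketch. This is not the authors' code. The function name `fit_penalized_logistic`, the Newton iteration, the unpenalized intercept, and the use of the trace of the hat-type matrix as the effective dimension in AIC are all assumptions made for illustration.

```python
import numpy as np


def fit_penalized_logistic(X, y, lam, n_iter=100, tol=1e-8):
    """Ridge-penalized logistic regression solved in SVD-reduced coordinates.

    Hypothetical helper for illustration only (assumes lam > 0).
    X: (n, p) expression matrix with p >> n; y: 0/1 labels; lam: penalty weight.
    """
    n, p = X.shape
    # Thin SVD: X = U diag(d) Vt. Working with R = U diag(d) replaces the
    # p-dimensional problem by one of size min(n, p).
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    R = U * d
    Z = np.column_stack([np.ones(n), R])   # intercept column plus reduced variables
    P = np.eye(Z.shape[1])
    P[0, 0] = 0.0                          # assumption: intercept is not penalized
    theta = np.zeros(Z.shape[1])

    def probabilities(t):
        return 1.0 / (1.0 + np.exp(-(Z @ t)))

    # Newton iterations on the penalized log-likelihood.
    for _ in range(n_iter):
        prob = probabilities(theta)
        w = prob * (1.0 - prob)
        H = Z.T @ (Z * w[:, None]) + lam * P
        grad = Z.T @ (y - prob) - lam * (P @ theta)
        step = np.linalg.solve(H, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break

    prob = probabilities(theta)
    w = prob * (1.0 - prob)
    G = Z.T @ (Z * w[:, None])
    # Assumed definition of effective dimension: trace((G + lam P)^-1 G).
    edf = np.trace(np.linalg.solve(G + lam * P, G))
    eps = 1e-12                            # guards log(0)
    deviance = -2.0 * np.sum(y * np.log(prob + eps) + (1 - y) * np.log(1 - prob + eps))
    aic = deviance + 2.0 * edf

    alpha = theta[0]
    beta = Vt.T @ theta[1:]                # map back to one coefficient per variable
    return alpha, beta, aic, edf
```

Under these assumptions, the penalty weight could be chosen by evaluating the function over a grid of `lam` values and keeping the one with the smallest returned AIC. Because the right singular vectors are orthonormal, the ridge penalty on the reduced coefficients equals the penalty on the full coefficient vector, so the reduced problem yields the same fit.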
Pages: 187-198
Page count: 12
Related papers
22 in total
  • [1] Singular value decomposition for genome-wide expression data processing and modeling
    Alter, O
    Brown, PO
    Botstein, D
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2000, 97 (18) : 10101 - 10106
  • [2] [Anonymous], 1989, MULTIVARIATE CALIBRA
  • [3] Burnham K. P., 1998, MODEL SELECTION INFE
  • [4] DUDOIT S, 2000, 576 U CAL DEP STAT B
  • [5] FEARN T, 1983, J R STAT SOC C-APPL, V32, P73
  • [6] A STATISTICAL VIEW OF SOME CHEMOMETRICS REGRESSION TOOLS
    FRANK, IE
    FRIEDMAN, JH
    [J]. TECHNOMETRICS, 1993, 35 (02) : 109 - 135
  • [7] Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring
    Golub, TR
    Slonim, DK
    Tamayo, P
    Huard, C
    Gaasenbeek, M
    Mesirov, JP
    Coller, H
    Loh, ML
    Downing, JR
    Caligiuri, MA
    Bloomfield, CD
    Lander, ES
    [J]. SCIENCE, 1999, 286 (5439) : 531 - 537
  • [8] Hastie T., 1990, Generalized additive model
  • [9] Hastie T., 2000, Genome Biology, V1, pr
  • [10] HOERL AE, 1985, J R STAT SOC C-APPL, V34, P114