Relation between permutation-test P values and classifier error estimates

Cited by: 18

Authors:

Hsing, T [1]

Attoor, S

Dougherty, E

Affiliations:

[1] Texas A&M Univ, Dept Stat, College Stn, TX 77843 USA

[2] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77843 USA

[3] Texas A&M Univ, College Stn, TX 77843 USA

[4] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA

Source:

MACHINE LEARNING | 2003 / Vol. 52 / Issue 1-2

Keywords:

classification; error estimation; genomics; microarrays; p value; pattern recognition

DOI:

10.1023/A:1023985022691

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Gene-expression-based classifiers suffer from the small number of microarrays usually available for classifier design. Hence, one is confronted with the dual problem of designing a classifier and estimating its error with only a small sample. Permutation testing has been recommended to assess the dependency of a designed classifier on the specific data set. This involves randomly permuting the labels of the data points, estimating the error of the designed classifiers for each permutation, and then finding the p value of the error for the actual labeling relative to the population of errors for the random labelings. This paper addresses the issue of whether or not this p value is informative. It provides both analytic and simulation results to show that the permutation p value is, up to very small deviation, a function of the error estimate. Moreover, even though the p value is a monotonically increasing function of the error estimate, in the range of the error where the majority of the p values lie, the function is very slowly increasing, so that inversion is problematic. Hence, the conclusion is that the p value is less informative than the error estimate. This result demonstrates that random labeling does not provide any further insight into the accuracy of the classifier or the precision of the error estimate. We have no knowledge beyond the error estimate itself and the various distribution-free, classifier-specific bounds developed for this estimate.
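The permutation procedure described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the one-dimensional nearest-class-mean classifier, the leave-one-out error estimator, the add-one correction in the p value, the toy data, and the names `loo_error` and `permutation_p_value` are all hypothetical choices made for the example.

```python
import random


def loo_error(xs, labels):
    """Leave-one-out error estimate of a 1-D nearest-class-mean classifier.

    Assumption: this classifier and estimator are illustrative stand-ins;
    the record does not say which ones the paper uses.
    """
    n = len(xs)
    errors = 0
    for i in range(n):
        sums = {0: 0.0, 1: 0.0}
        counts = {0: 0, 1: 0}
        for j in range(n):
            if j != i:
                sums[labels[j]] += xs[j]
                counts[labels[j]] += 1
        if counts[0] == 0:
            pred = 1  # only class 1 remains in the training fold
        elif counts[1] == 0:
            pred = 0  # only class 0 remains in the training fold
        else:
            m0 = sums[0] / counts[0]
            m1 = sums[1] / counts[1]
            pred = 0 if abs(xs[i] - m0) <= abs(xs[i] - m1) else 1
        if pred != labels[i]:
            errors += 1
    return errors / n


def permutation_p_value(xs, labels, n_perm=200, seed=0):
    """Return (observed error, permutation p value).

    The p value is the fraction of random labelings whose error estimate
    is at most the error estimate for the actual labeling.
    """
    rng = random.Random(seed)
    observed = loo_error(xs, labels)
    shuffled = list(labels)
    at_most = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomly permute the class labels
        if loo_error(xs, shuffled) <= observed:
            at_most += 1
    # Add-one correction (an assumption): count the actual labeling too.
    return observed, (at_most + 1) / (n_perm + 1)


# Toy sample: two well-separated classes, six points each.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5]
labels = [0] * 6 + [1] * 6
err, p = permutation_p_value(xs, labels)
print("error estimate:", err, "permutation p value:", p)
```

In this sketch the p value depends on the data only through comparisons with the observed error estimate, which is in line with the abstract's point that the p value adds little beyond the error estimate itself.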

Citation

Pages: 11-30

Number of pages: 20
