Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data

Cited by: 194
Authors
Liu, Yufeng [1]
Hayes, David Neil [2]
Nobel, Andrew
Marron, J. S. [2]
Affiliations
[1] Univ N Carolina, Carolina Ctr Genome Sci, Dept Stat & Operat Res, Chapel Hill, NC 27599 USA
[2] Univ N Carolina, Lineberger Comprehens Canc Ctr, Chapel Hill, NC 27599 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Clustering; High-dimension low-sample data; k-means; Microarray gene expression data; p value; Statistical significance
DOI
10.1198/016214508000000454
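Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract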
Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low-sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there", as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.
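The testing idea in the abstract (a single Gaussian as the null, with a p value measuring how unusual the observed clustering is under that null) can be illustrated with a small Monte Carlo sketch. This is not the authors' implementation. The abstract does not specify the test statistic or the covariance estimator, so the sketch assumes a 2-means cluster index (within-cluster sum of squares over total sum of squares) as the statistic, and uses sample-covariance eigenvalues floored at a MAD-based noise level as a simplified stand-in for the factor-model covariance estimation. The function names `two_means_cluster_index` and `sigclust_like_pvalue` are hypothetical.

```python
import numpy as np


def two_means_cluster_index(X, n_init=5, n_iter=50, rng=None):
    """Best 2-means within-cluster sum of squares divided by the total sum of squares."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = 1.0
    for _ in range(n_init):
        # Start from two distinct data points chosen at random.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster is empty, give up on this start
                break
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum() / total_ss)
    return best


def sigclust_like_pvalue(X, n_sim=100, seed=0):
    """Monte Carlo p value for 'X comes from one Gaussian' (simplified, assumed setup)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the sample covariance, obtained from singular values.
    s = np.linalg.svd(Xc, compute_uv=False)
    eig = np.zeros(d)
    eig[: len(s)] = s ** 2 / (n - 1)
    # Assumption: background noise variance from the MAD of all centered entries.
    mad = np.median(np.abs(Xc - np.median(Xc)))
    noise_var = (mad / 0.6745) ** 2
    eig = np.maximum(eig, noise_var)  # floor small eigenvalues at the noise level
    observed = two_means_cluster_index(X, rng=rng)
    # A Gaussian with diagonal covariance suffices if the statistic is rotation invariant.
    null = np.array([
        two_means_cluster_index(rng.standard_normal((n, d)) * np.sqrt(eig), rng=rng)
        for _ in range(n_sim)
    ])
    p_value = (1 + np.sum(null <= observed)) / (1 + n_sim)
    return observed, p_value


# Toy HDLSS data: 20 samples, 50 features, two groups differing in 10 features.
data_rng = np.random.default_rng(1)
X_two = data_rng.standard_normal((20, 50))
X_two[:10, :10] += 8.0
ci_two, p_two = sigclust_like_pvalue(X_two)

# Toy data drawn from a single Gaussian.
X_one = data_rng.standard_normal((20, 50))
ci_one, p_one = sigclust_like_pvalue(X_one)
print(ci_two, p_two, ci_one, p_one)
```

A small cluster index means the two-group split explains most of the variation, so the p value is the fraction of null simulations whose index is at least as small as the observed one. The floor on the eigenvalues matters in the HDLSS setting because the sample covariance has at most n - 1 nonzero eigenvalues.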
Pages: 1281-1293
Page count: 13