Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data

Citations: 194

Authors:

Liu, Yufeng [1]

Hayes, David Neil [2]

Nobel, Andrew

Marron, J. S. [2]

Affiliations:

[1] Univ N Carolina, Carolina Ctr Genome Sci, Dept Stat & Operat Res, Chapel Hill, NC 27599 USA

[2] Univ N Carolina, Lineberger Comprehens Canc Ctr, Chapel Hill, NC 27599 USA

Source:

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION | 2008 / Vol. 103 / No. 483

Funding:

National Science Foundation (USA); National Institutes of Health (USA);

Keywords:

Clustering; High-dimension low-sample data; k-means; Microarray gene expression data; p value; Statistical significance;

DOI:

10.1198/016214508000000454

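Chinese Library Classification (CLC) number:

O21 [Probability theory and mathematical statistics]; C8 [Statistics];

Subject classification codes:

020208 ; 070103 ; 0714 ;

Abstract: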

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low-sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there", as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.
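
The testing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the test statistic is taken to be a 2-means cluster index (within-cluster sum of squares divided by total sum of squares), the null Gaussian uses the raw sample covariance eigenvalues instead of the factor-analysis estimate the abstract mentions, and the function names `two_means_cluster_index` and `sigclust_sketch` are hypothetical.

```python
import numpy as np


def two_means_cluster_index(X, n_init=5, n_iter=50, rng=None):
    """Smallest 2-means cluster index found over a few random starts.

    Cluster index = within-cluster sum of squares / total sum of squares.
    Smaller values indicate a stronger two-cluster split.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = 1.0
    for _ in range(n_init):
        # Start from two distinct data points as centers.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster is empty
                break
            new_centers = np.array(
                [X[labels == k].mean(axis=0) for k in (0, 1)]
            )
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Sum of squared distances to the nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum() / total_ss)
    return best


def sigclust_sketch(X, n_sim=200, seed=0):
    """Monte Carlo p value for the null 'data come from one Gaussian'.

    Assumption: the null covariance is diagonal with the sample
    eigenvalues, a simplification of the estimate used in the paper.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    observed = two_means_cluster_index(X, rng=rng)

    # Nonzero eigenvalues of the sample covariance via singular values.
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    eigvals = s ** 2 / (n - 1)

    null = np.empty(n_sim)
    for b in range(n_sim):
        # The cluster index is unchanged by rotation and shift, so
        # independent coordinates with these variances are enough.
        Z = rng.standard_normal((n, eigvals.size)) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(Z, rng=rng)

    # Fraction of null indices at least as extreme (small) as observed.
    p_value = (1 + np.sum(null <= observed)) / (1 + n_sim)
    return observed, p_value
```

Calling `sigclust_sketch(X)` on an `n x d` array returns the observed cluster index and an empirical p value; a small p value suggests that the two-cluster split is stronger than a single Gaussian with the same eigenvalues would typically produce.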


Pages: 1281-1293

Page count: 13

