Critical limitations of consensus clustering in class discovery

Cited by: 250
Authors
Senbabaoglu, Yasin [1]
Michailidis, George [2]
Li, Jun Z. [3]
Affiliations
[1] Univ Michigan, Dept Computat Med & Bioinformat, Ann Arbor, MI 48109 USA
[2] Univ Michigan, Dept Stat & EECS, Ann Arbor, MI 48109 USA
[3] Univ Michigan, Dept Human Genet, Ann Arbor, MI 48109 USA
Keywords
GENE-EXPRESSION DATA; CLINICALLY RELEVANT SUBTYPES; CANCER; VALIDATION
DOI
10.1038/srep06207
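Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes
070301 [Inorganic chemistry]; 070403 [Astrophysics]; 070507 [Natural resources and territorial spatial planning]; 090105 [Crop production systems and ecological engineering];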
Abstract
Consensus clustering (CC) has been adopted for unsupervised class discovery in many genomic studies. It calculates how frequently two samples are grouped together in repeated clustering runs, and uses the resulting pairwise "consensus rates" to show visually that clusters exist, to compare cluster stability, and to estimate the optimal cluster number (K). However, the sensitivity and specificity of CC have not been systematically assessed. Through simulations we find that CC is able to divide randomly generated unimodal data into apparently stable clusters for a range of K, essentially reporting chance partitions of cluster-less data. For data with known structure, the common implementations of CC perform poorly in identifying the true K. These results suggest that CC should be applied and interpreted with caution. We find that a new metric based on CC, the proportion of ambiguously clustered pairs (PAC), infers K equally or more reliably than similar methods in simulated data with known K. Our overall approach involves the use of realistic null distributions based on the observed gene-gene correlation structure in a given study, and the implementation of PAC to more accurately estimate K. We discuss the strength of our approach in the context of other ensemble-based methods.
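The pairwise consensus rate and the PAC metric described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify a base clustering algorithm, a subsampling fraction, or the bounds of the "ambiguous" interval, so the plain k-means routine, the 80% subsampling, and the 0.1 and 0.9 thresholds below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means (assumed base clusterer; any algorithm could be used)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def consensus_matrix(X, k, n_runs=30, frac=0.8, seed=0):
    """Fraction of runs in which two samples co-cluster, among runs where both were drawn."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))  # times a pair landed in the same cluster
    sampled = np.zeros((n, n))   # times a pair was drawn in the same subsample
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # Pairs never drawn together get a consensus rate of 0.
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


def pac(consensus, lower=0.1, upper=0.9):
    """Proportion of sample pairs whose consensus rate lies in (lower, upper]."""
    vals = consensus[np.triu_indices_from(consensus, k=1)]
    return float(np.mean((vals > lower) & (vals <= upper)))


# Toy data (assumed): two well-separated Gaussian groups of 20 samples each.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 1.0, size=(20, 2)),
               data_rng.normal(20.0, 1.0, size=(20, 2))])

for k in (2, 3, 4):
    print(k, round(pac(consensus_matrix(X, k)), 3))
```

Under these assumptions, a lower PAC at a given K means fewer pairs with intermediate consensus rates, which is the sense in which the abstract uses PAC to estimate K. The sketch does not cover the null distributions based on gene-gene correlation structure that the abstract also mentions.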
Pages: 13