Unsupervised learning with random forest predictors

Cited by: 376
Authors
Shi, T [1 ]
Horvath, S
Affiliations
[1] Univ Calif Los Angeles, Dept Human Genet, Gonda Ctr, Los Angeles, CA 90095 USA
[2] Univ Calif Los Angeles, Dept Biostat, Gonda Ctr, Los Angeles, CA 90095 USA
[3] Johnson & Johnson Co, Ortho Clin Diagnost, San Diego, CA 92121 USA
Keywords
biomarkers; cluster analysis; dissimilarity; ensemble predictors; tumor markers
DOI
10.1198/106186006X94072
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline codes
020208; 070103; 0714
Abstract
A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the "observed" data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; for example, the Addcl1 RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
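The construction sketched in the abstract can be illustrated compactly in code. The following is a minimal sketch, not the authors' reference implementation: it assumes a NumPy feature matrix X, uses scikit-learn's RandomForestClassifier, and generates Addcl1-style synthetic data by independently permuting each column, i.e., sampling from the product of the empirical marginal distributions. The function name rf_dissimilarity and its parameter defaults are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_dissimilarity(X, n_trees=500, random_state=0):
    """Addcl1-style RF dissimilarity between the rows of X (illustrative sketch)."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape

    # Addcl1 reference distribution: sample each variable independently from its
    # empirical marginal, here by permuting every column of the observed data.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])

    # Label observed rows 1 and synthetic rows 0, then train an RF to tell them apart.
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(n), np.zeros(n)])
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    rf.fit(X_all, y_all)

    # Proximity of two observed points = fraction of trees in which they land
    # in the same terminal node.
    leaves = rf.apply(X)                          # shape (n, n_trees)
    proximity = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        proximity += same_leaf
    proximity /= leaves.shape[1]

    # Turn proximities into a dissimilarity; sqrt(1 - proximity) behaves like a distance.
    return np.sqrt(np.clip(1.0 - proximity, 0.0, None))
```

The resulting matrix can be passed to any dissimilarity-based clustering procedure, such as partitioning around medoids or hierarchical clustering with a precomputed distance matrix; to reduce dependence on a single synthetic sample, proximities can be averaged over several forests and synthetic realizations.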
Pages: 118-138
Page count: 21
References
19 cited references in total (first 10 listed)
[1] Allen E, Horvath S, Tong F, Kraft P, Spiteri E, Riggs AD, Marahrens Y. High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes. Proceedings of the National Academy of Sciences of the United States of America, 2003, 100(17): 9940-9945.
[2] [Anonymous]. Random Forests Manual. 2003.
[3] Breiman L. Random forests. Machine Learning, 2001, 45(1): 5-32.
[4] Breiman L. Classification and Regression Trees. 2017. DOI: 10.1201/9781315139470.
[5] Cox DR. Analysis of Survival Data. 1990.
[6] Cox T. Multidimensional Scaling. 2001.
[7] Hastie T. The Elements of Statistical Learning. 2009.
[8] Hubert L, Arabie P. Comparing partitions. Journal of Classification, 1985, 2(2-3): 193-218.
[9] Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 1958, 53(282): 457-481.
[10] Kaufman L. Finding Groups in Data: An Introduction to Cluster Analysis. 1990. DOI: 10.1002/9780470316801.