Identifying SNPs predictive of phenotype using random forests

Cited by: 253
Authors
Bureau, A
Dupuis, J
Falls, K
Lunetta, KL
Hayward, B
Keith, TP
Van Eerdewegh, P
Affiliations
[1] Oscient Pharmaceut, Dept Human Genet, Waltham, MA USA
[2] Univ Lethbridge, Sch Hlth Sci, Lethbridge, AB T1K 3M4, Canada
[3] Boston Univ, Sch Publ Hlth, Dept Biostat, Boston, MA USA
[4] Boston Univ, Sch Med, Dept Neurol, Boston, MA 02118 USA
[5] Galileo Genom Inc, St Laurent, PQ, Canada
[6] Harvard Univ, Sch Med, Dept Psychiat, Boston, MA 02115 USA
Keywords
genotype-phenotype association; predictive importance; classification trees; case-control study;
DOI
10.1002/gepi.20041
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
There has been a great interest and a few successes in the identification of complex disease susceptibility genes in recent years. Association studies, where a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneously without requiring prohibitively large sample sizes. By contrast, high-dimensional nonparametric methods thrive on large numbers of predictors. This work explores the application of one such method, random forests, to the problem of identifying SNPs predictive of the phenotype in the case-control study design. A random forest is a collection of classification trees grown on bootstrap samples of observations, using a random subset of predictors to define the best split at each node. The observations left out of the bootstrap samples are used to estimate prediction error. The importance of a predictor is quantified by the increase in misclassification occurring when the values of the predictor are randomly permuted. We extend the concept of importance to pairs of predictors, to capture joint effects, and we explore the behavior of importance measures over a range of two-locus disease models in the presence of a varying number of SNPs unassociated with the phenotype. We illustrate the application of random forests with a data set of asthma cases and unaffected controls genotyped at 42 SNPs in ADAM33, a previously identified asthma susceptibility gene. SNPs and SNP pairs highly associated with asthma tend to have the highest importance index value, but predictive importance and association do not always coincide. (C) 2004 Wiley-Liss, Inc.
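The procedure described in the abstract (trees grown on bootstrap samples, a random subset of predictors tried at each node, out-of-bag observations used to measure the rise in misclassification after permuting one predictor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`grow_tree`, `predict`, `forest_importance`), the depth limit, the Gini split criterion, the 0/1/2 genotype coding, and the simulated single-SNP disease model are all hypothetical choices made for the example.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow_tree(X, y, rows, mtry, rng, depth=0, max_depth=4):
    """Grow a classification tree on `rows`, trying `mtry` random predictors per node."""
    labels = [y[i] for i in rows]
    majority = int(2 * sum(labels) >= len(labels))
    if depth == max_depth or len(set(labels)) < 2:
        return majority  # leaf: predicted class
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        for cut in (0, 1):  # genotypes coded 0/1/2 (assumed), so two possible cuts
            left = [i for i in rows if X[i][j] <= cut]
            right = [i for i in rows if X[i][j] > cut]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:
        return majority
    _, j, cut, left, right = best
    return (j, cut,
            grow_tree(X, y, left, mtry, rng, depth + 1, max_depth),
            grow_tree(X, y, right, mtry, rng, depth + 1, max_depth))

def predict(tree, x):
    """Follow splits until a leaf (an int class label) is reached."""
    while isinstance(tree, tuple):
        j, cut, low, high = tree
        tree = low if x[j] <= cut else high
    return tree

def forest_importance(X, y, n_trees=200, mtry=2, seed=0):
    """Mean increase in out-of-bag misclassification rate after permuting each predictor."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]  # left-out observations
        if not oob:
            continue
        tree = grow_tree(X, y, boot, mtry, rng)
        base_err = sum(predict(tree, X[i]) != y[i] for i in oob)
        for j in range(p):
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)  # permute predictor j among OOB observations
            perm_err = 0
            for i, value in zip(oob, shuffled):
                x = list(X[i])
                x[j] = value
                perm_err += predict(tree, x) != y[i]
            importance[j] += (perm_err - base_err) / len(oob) / n_trees
    return importance

# Simulated data (assumed model): 6 SNPs, only SNP 0 affects case status.
sim = random.Random(1)
n, p = 300, 6
X = [[(sim.random() < 0.5) + (sim.random() < 0.5) for _ in range(p)] for _ in range(n)]
y = [int(sim.random() < 0.2 + 0.3 * row[0]) for row in X]
imp = forest_importance(X, y)
print(sorted(range(p), key=lambda j: -imp[j]))  # SNP indices ranked by importance
```

In this sketch, an importance for a pair of SNPs could plausibly be obtained by permuting two columns at once among the out-of-bag observations; that is an assumption about how such an extension might look, since the abstract does not give the details.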
Pages: 171 - 182
Number of pages: 12
Related papers
20 in total
  • [1] SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation
    Blewitt, Marnie E.
    Gendrel, Anne-Valerie
    Pang, Zhenyi
    Sparrow, Duncan B.
    Whitelaw, Nadia
    Craig, Jeffrey M.
    Apedaile, Anwyn
    Hilton, Douglas J.
    Dunwoodie, Sally L.
    Brockdorff, Neil
    Kay, Graham F.
    Whitelaw, Emma
    [J]. NATURE GENETICS, 2008, 40 (05) : 663 - 669
  • [2] Random forests
    Breiman, L
    [J]. MACHINE LEARNING, 2001, 45 (01) : 5 - 32
  • [3] Breiman L., 2003, RANDOM FORESTS VERSI
  • [4] Mapping complex traits using Random Forests
    Bureau, A
    Dupuis, J
    Hayward, B
    Falls, K
    Van Eerdewegh, P
    [J]. BMC GENETICS, 2003, 4 (Suppl 1)
  • [5] Tree and spline based association analysis of gene-gene interaction models for ischemic stroke
    Cook, NR
    Zee, RYL
    Ridker, PM
    [J]. STATISTICS IN MEDICINE, 2004, 23 (09) : 1439 - 1453
  • [6] Fine genetic mapping using haplotype analysis and the missing data problem
    Chiano, MN
    Clayton, DG
    [J]. ANNALS OF HUMAN GENETICS, 1998, 62 : 55 - 60
  • [7] EXCOFFIER L, 1995, MOL BIOL EVOL, V12, P921
  • [8] Finding genes that underlie complex traits
    Glazier, AM
    Nadeau, JH
    Aitman, TJ
    [J]. SCIENCE, 2002, 298 (5602) : 2345 - 2349
  • [9] *GOLD HEL INC, 2002, HEL TREE MAN VERS 2
  • [10] INVESTIGATION OF LINKAGE BETWEEN A QUANTITATIVE TRAIT AND A MARKER LOCUS
    HASEMAN, JK
    ELSTON, RC
    [J]. BEHAVIOR GENETICS, 1972, 2 (01) : 3 - 19