Mining data to find subsets of high activity

Cited by: 11
Authors
Amaratunga, D
Cabrera, J [1]
Affiliations
[1] Johnson & Johnson Pharmaceut Res & Dev, Raritan, NJ 08807 USA
[2] Rutgers State Univ, Dept Stat, Piscataway, NJ 08855 USA
Keywords
ARF; data mining; recursive partitioning; classification tree
DOI
10.1016/j.jspi.2003.06.014
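Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract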
Many data mining problems in biometrics research are concerned with trying to identify the characteristics of a subset of cases that responds substantially differently from the rest of the cases. For example, when studying the relationship between a response variable Y and a set of predictor variables, it is often of interest to determine what ranges of values of the predictor variables are associated with a high likelihood of Y = 1 (if Y is a Bernoulli variable) or with high values of Y (if Y is a continuous variable). We describe a criterion (H) and a recursive partitioning method (ARF) that directly address this question. A computational algorithm that makes ARF feasible for use even with very large datasets is presented. The basic version of ARF can be generalized to the case of multiple response variables, Y_1, ..., Y_t, and to other settings. We illustrate the effectiveness of ARF by mining a structure-activity database, a hospital database, and some other real and simulated datasets. We conclude by proposing a basic paradigm for data mining. © 2003 Published by Elsevier B.V.
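The question posed in the abstract (which range of a predictor is associated with high values of Y) can be illustrated with a minimal Python sketch. This is not the H criterion or the ARF algorithm, neither of which is defined in this record. It is a brute-force search over ranges of a single predictor, scored by the mean response, and the function name `best_interval` and the `min_size` parameter are hypothetical.

```python
from itertools import accumulate


def best_interval(x, y, min_size):
    """Return (lo, hi, mean_y) for the range lo <= x <= hi with the highest mean y.

    Only ranges containing at least `min_size` cases are considered.
    Illustrative brute force only; NOT the H criterion or ARF from the paper.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    if not 1 <= min_size <= n:
        raise ValueError("min_size must be between 1 and the number of cases")

    # prefix[k] holds the sum of the first k responses in sorted-x order,
    # so the sum over cases i..j-1 is prefix[j] - prefix[i].
    prefix = [0.0] + list(accumulate(v for _, v in pairs))

    best = None
    for i in range(n):
        # A range may not start in the middle of a run of tied x values.
        if i > 0 and pairs[i - 1][0] == pairs[i][0]:
            continue
        for j in range(i + min_size, n + 1):
            # Likewise, a range may not end in the middle of a run of ties.
            if j < n and pairs[j][0] == pairs[j - 1][0]:
                continue
            mean = (prefix[j] - prefix[i]) / (j - i)
            if best is None or mean > best[2]:
                best = (pairs[i][0], pairs[j - 1][0], mean)
    return best
```

For a Bernoulli response, the returned mean is the observed proportion of Y = 1 in the range. The `min_size` floor guards against picking a tiny range that scores well by chance. The search is quadratic in the number of cases, so it is only suitable for small examples.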
Pages: 23-41
Page count: 19