Mining optimized association rules for numeric attributes

被引:37
作者
Fukuda, T
Morimoto, Y
Morishita, S
Tokuyama, T
机构
[1] IBM Corp, Res, Tokyo Res Lab, Yamato, Kanagawa 242, Japan
[2] Univ Tokyo, Inst Med Sci, Tokyo 108, Japan
关键词
D O I
10.1006/jcss.1998.1595
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Given a huge database, we address the problem of finding association rules for numeric attributes, such as (Balance is an element of l) double right arrow (CardLoan = yes), which implies that bank customers whose balances fall in a range l are likely to use card loan with a probability greater than p. The above rule is interesting only if the range l has some special feature with respect to the interrelation between Balance and CardLoan. It is required that the number of customers whose balances are contained in l (called the support of l) is sufficient and also that the probability p of the condition CardLoan = yes being met (called the confidence ratio) be much higher than the average probability of the condition over all the data. Our goal is to realize a system that finds such appropriate ranges automatically. We mainly focus on computing two optimized ranges: one that maximizes the support on the condition that the confidence ratio is at least a given threshold value, and another that maximizes the confidence ratio on the condition that the support is at least a given threshold number. Using techniques from computational geometry, we present novel algorithms that compute the optimized ranges in linear time if the data are sorted. Since sorting data with respect to each numeric attribute is expensive in the case of huge databases that occupy much more space than the main memory, we instead apply randomized bucketing as the preprocessing method and thus obtain an efficient rule-finding system. Tests show that our implementation is fast not only in theory but also in practice. The efficiency of our algorithm enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time. (C) 1999 Academic Press.
引用
收藏
页码:1 / 12
页数:12
相关论文
共 20 条
[1]  
AGRAWAL R, 1992, PROC INT CONF VERY L, P560
[2]  
Agrawal R., 1993, SIGMOD Record, V22, P207, DOI 10.1145/170036.170072
[3]   DATABASE MINING - A PERFORMANCE PERSPECTIVE [J].
AGRAWAL, R ;
IMIELINSKI, T ;
SWAMI, A .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1993, 5 (06) :914-925
[4]  
Agrawal R, 1994, P 20 INT C VER LARG, V1215, P487
[5]  
[Anonymous], P 1996 ACM SIGMOD IN
[6]  
[Anonymous], P PYOC ACM SIGMOD IN
[7]  
Bentley J., 1984, Communications of the ACM, V27, P865, DOI 10.1145/358234.381162
[8]  
Friedman JH., 1984, BIOMETRICS, V40, P874, DOI [DOI 10.2307/2530946, 10.2307/2530946]
[9]  
HAN JW, 1992, PROC INT CONF VERY L, P547
[10]  
Mehta M., 1996, P 5 INT C EXT DAT TE