Scoring the data using association rules

Cited by: 39
Authors
Liu, B
Ma, YM
Wong, CK
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 117543, Singapore
[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
Keywords
data mining; scoring; target selection; association rules; classifications
DOI
10.1023/A:1021931008240
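Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligence science and technology]
Abstract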
In many data mining applications, the objective is to select data cases of a target class. For example, in direct marketing, marketers want to select likely buyers of a particular product for promotion. In such applications, it is often too difficult to predict who will definitely be in the target class (e.g., the buyer class) because the data used for modeling is often very noisy and has a highly imbalanced class distribution. Traditionally, classification systems are used to solve this problem. Instead of classifying each data case to a definite class (e.g., buyer or non-buyer), a classification system is modified to produce a class probability estimate (or a score) for the data case to indicate the likelihood that the data case belongs to the target class (e.g., the buyer class). However, existing classification systems only aim to find a subset of the regularities or rules that exist in the data, and this subset gives only a partial picture of the domain. In this paper, we show that the target selection problem can be mapped to association rule mining to provide a more powerful solution. Because association rule mining aims to find all rules in the data, it can give a complete picture of the underlying relationships in the domain. The complete set of rules enables us to assign a more accurate class probability estimate to each data case. This paper proposes an effective and efficient technique to compute class probability estimates using association rules. Experimental results using public domain data and real-life application data show that, in general, the new technique performs markedly better than the state-of-the-art classification system C4.5, boosted C4.5, and the Naive Bayesian system.
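To make the idea of scoring with association rules concrete, the following is a minimal sketch only. It is not the formula proposed in the paper, which the abstract does not give. It assumes a two-class problem, represents each rule with hypothetical fields (`conditions`, `label`, `support`, `confidence`), and scores a case by the support-weighted average of the target-class confidence over all rules that cover it. The rule values, the `score` function, and the `default` fallback are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical rule representation (an assumption, not taken from the paper):
# if every (attribute, value) pair in `conditions` holds, predict `label`.
Rule = namedtuple("Rule", ["conditions", "label", "support", "confidence"])


def score(case, rules, target="buyer", default=0.0):
    """Estimate P(target | case) as a support-weighted average over covering rules.

    Assumes exactly two classes, so a rule that predicts the other class with
    confidence c contributes 1 - c toward the target class.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for rule in rules:
        if rule.conditions <= case:  # the rule covers the case
            p = rule.confidence if rule.label == target else 1.0 - rule.confidence
            weighted_sum += rule.support * p
            total_weight += rule.support
    # Fall back to `default` (e.g., the base rate) when no rule covers the case.
    return weighted_sum / total_weight if total_weight else default


# Illustrative, made-up rules and data cases.
rules = [
    Rule(frozenset({("age", "young")}), "buyer", 0.10, 0.30),
    Rule(frozenset({("income", "high")}), "buyer", 0.05, 0.60),
    Rule(frozenset({("region", "north")}), "non-buyer", 0.20, 0.90),
]
cases = {
    "a": {("age", "young"), ("income", "high")},
    "b": {("age", "young"), ("region", "north")},
    "c": {("age", "old")},
}

# Rank cases from most to least likely buyer; the top of the list is promoted first.
scores = {name: score(case, rules, default=0.05) for name, case in cases.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Ranking by score rather than assigning a hard class is what lets a marketer pick the top fraction of cases even when no case can be confidently labeled a buyer.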
Pages: 119-135
Number of pages: 17