Free-sets: A condensed representation of Boolean data for the approximation of frequency queries

被引:144
作者
Boulicaut, JF [1 ]
Bykowski, A [1 ]
Rigotti, C [1 ]
机构
[1] Inst Natl Sci Appl, Lab Ingn Syst Informat, F-69621 Villeurbanne, France
关键词
condensed representations; frequent pattern discovery; association rules;
D O I
10.1023/A:1021571501451
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Given a large collection of transactions containing items, a basic common data mining problem is to extract the so-called frequent itemsets (i.e., sets of items appearing in at least a given number of transactions). In this paper, we propose a structure called free-sets, from which we can approximate any itemset support (i.e., the number of transactions containing the itemset) and we formalize this notion in the framework of epsilon-adequate representations (H. Mannila and H. Toivonen, 1996. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 189-194). We show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset. Experiments on real dense data sets show a significant reduction of the size of the output when compared with standard frequent itemset extraction. Furthermore, the experiments show that the extraction of frequent free-sets is still possible when the extraction of frequent itemsets becomes intractable, and that the supports of the frequent free-sets can be used to approximate very closely the supports of the frequent itemsets. Finally, we consider the effect of this approximation on association rules ( a popular kind of patterns that can be derived from frequent itemsets) and show that the corresponding errors remain very low in practice.
引用
收藏
页码:5 / 22
页数:18
相关论文
共 15 条
[1]  
Agrawal R., 1993, SIGMOD Record, V22, P207, DOI 10.1145/170036.170072
[2]  
Agrawal R., 1996, Advances in Knowledge Discovery and Data Mining, P307
[3]  
Agrawal R., 1994, P 20 INT C VER LARG, V1215, P487
[4]  
[Anonymous], P 1998 ACM SIGMOD IN
[5]  
[Anonymous], P ACM SIGMOD 98
[6]  
Bayardo R. J. Jr., 1997, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, P123
[7]  
Boulicaut JF, 2000, LECT NOTES ARTIF INT, V1805, P62
[8]  
BOULICAUT JF, IN PRESS P 4 EUR C P
[9]  
BYKOWSKI A, 2000, P 2000 INT WORKSH WE, P27
[10]  
Fujiwara S., 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073), P501, DOI 10.1109/ICDE.2000.839449