Mining top-K frequent itemsets from data streams

被引:66
作者
Wong, Raymond Chi-Wing [1 ]
Fu, Ada Wai-Chee [1 ]
机构
[1] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong, Peoples R China
关键词
data mining algorithm; data stream; top K frequent itemset mining; sliding window; Chernoff bound; probabilistic algorithm;
D O I
10.1007/s10618-006-0042-x
中图分类号
TP18 [人工智能理论];
学科分类号
081104 [模式识别与智能系统]; 0812 [计算机科学与技术]; 0835 [软件工程]; 1405 [智能科学与技术];
摘要
Frequent pattern mining on data streams is of interest recently. However, it is not easy for users to determine a proper frequency threshold. It is more reasonable to ask users to set a bound on the result size. We study the problem of mining top K frequent itemsets in data streams. We introduce a method based on the Chernoff bound with a guarantee of the output quality and also a bound on the memory usage. We also propose an algorithm based on the Lossy Counting Algorithm. In most of the experiments of the two proposed algorithms, we obtain perfect solutions and the memory space occupied by our algorithms is very small. Besides, we also propose the adapted approach of these two algorithms in order to handle the case when we are interested in mining the data in a sliding window. The experiments show that the results are accurate.
引用
收藏
页码:193 / 217
页数:25
相关论文
共 26 条
[1]
Agrawal R., IBM SYNTHETIC DATA G
[2]
Babcock B., 2003, SIGMOD
[3]
CHANG JH, 2003, SIGKDD
[4]
CHARIKAR M, 2002, 29 INT C AUT LANG PR
[5]
CHEUNG Y, 2004, IEEE T KNOWLEDGE DAT
[6]
Cheung Y.-L., 2002, SPIE C DAT MIN
[7]
Datar Mayur, 2002, SIAM J COMPUTING
[8]
Demaine E. D., 2002, P 10 ANN EUR S ALG
[9]
FU AWC, 2000, ISMIS
[10]
Giannella C., 2003, NEXT GENERATION DATA