Turbo-charging vertical mining of large databases

Cited by: 23
Authors
Shenoy, P
Haritsa, JR
Sudarshan, S
Bhalotia, G
Bawa, M
Shah, D
Affiliations
[1] Lucent Bell Labs, Murray Hill, NJ 07974 USA
[2] Indian Inst Sci, SERC, Database Syst Lab, Bangalore 560012, Karnataka, India
[3] Indian Inst Technol, Bombay 400076, Maharashtra, India
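Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 [Computer Science and Technology];
Abstract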
In a vertical representation of a market-basket database, each item is associated with a column of values representing the transactions in which it is present. The association-rule mining algorithms that have been recently proposed for this representation show performance improvements over their classical horizontal counterparts, but are either efficient only for certain database sizes, or assume particular characteristics of the database contents, or are applicable only to specific kinds of database schemas. We present here a new vertical mining algorithm called VIPER, which is general-purpose, making no special requirements of the underlying database. VIPER stores data in compressed bit-vectors called "snakes" and integrates a number of novel optimizations for efficient snake generation, intersection, counting and storage. We analyze the performance of VIPER for a range of synthetic database workloads. Our experimental results indicate significant performance gains, especially for large databases, over previously proposed vertical and horizontal mining algorithms. In fact, there are even workload regions where VIPER outperforms an optimal, but practically infeasible, horizontal mining algorithm.
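The vertical representation described in the abstract can be illustrated with a minimal sketch. This is not VIPER itself: the bit-vectors here are plain, uncompressed Python integers, and none of the paper's snake compression or optimizations are reproduced. The function names (`to_vertical`, `support`, `frequent_pairs`) and the sample baskets are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def to_vertical(transactions):
    """Build one bit-vector per item: bit t is set if transaction t contains the item."""
    columns = {}
    for tid, basket in enumerate(transactions):
        for item in basket:
            columns[item] = columns.get(item, 0) | (1 << tid)
    return columns


def support(columns, itemset):
    """Count the transactions containing every item by ANDing the item columns."""
    items = iter(itemset)
    acc = columns.get(next(items), 0)
    for item in items:
        acc &= columns.get(item, 0)
    # The number of set bits is the number of supporting transactions.
    return bin(acc).count("1")


def frequent_pairs(columns, min_support):
    """Return every item pair whose support meets the threshold."""
    result = {}
    for a, b in combinations(sorted(columns), 2):
        s = support(columns, (a, b))
        if s >= min_support:
            result[(a, b)] = s
    return result


# Hypothetical sample data (assumption, not from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "bread"},
    {"milk", "eggs"},
]
columns = to_vertical(transactions)
pairs = frequent_pairs(columns, min_support=2)
```

In this layout, the support of an itemset is obtained by intersecting the columns of its items and counting the surviving bits, which is the basic operation that vertical mining algorithms build on.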
Pages: 22-33
Number of pages: 12