Approximate query processing using wavelets

Cited by: 125
Authors
Chakrabarti K. [1]
Garofalakis M. [2]
Rastogi R. [2]
Shim K. [3]
Affiliations
[1] University of Illinois, Urbana, IL 61801
[2] Bell Laboratories, Murray Hill, NJ 07974
[3] KAIST and AITrc, 373-1 Kusong-Dong, Yusong-Gu, Taejon 305-701
Keywords
Approximate query answers; Data synopses; Query processing; Wavelet decomposition
DOI
10.1007/s007780100049
Abstract
Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in its query processing scope, typically focusing on specific forms of aggregate queries. Furthermore, conventional approaches based on sampling or histograms appear to be inherently limited when it comes to approximating the results of complex queries over high-dimensional DSS data sets. In this paper, we propose the use of multi-dimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications. Our approach is based on building wavelet-coefficient synopses of the data and using these synopses to provide approximate answers to queries. We develop novel query processing algorithms that operate directly on the wavelet-coefficient synopses of relational tables, allowing us to process arbitrarily complex queries entirely in the wavelet-coefficient domain. This guarantees extremely fast response times since our approximate query execution engine can do the bulk of its processing over compact sets of wavelet coefficients, essentially postponing the expansion into relational tuples until the end-result of the query. We also propose a novel wavelet decomposition algorithm that can build these synopses in an I/O-efficient manner. Finally, we conduct an extensive experimental study with synthetic as well as real-life data sets to determine the effectiveness of our wavelet-based approach compared to sampling and histograms. Our results demonstrate that our techniques: (1) provide approximate answers of better quality than either sampling or histograms; (2) offer query execution-time speedups of more than two orders of magnitude; and (3) guarantee extremely fast synopsis construction times that scale linearly with the size of the data.
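To make the idea of a wavelet-coefficient synopsis concrete, the following is a minimal one-dimensional sketch in Python. It is not the paper's method: the abstract describes multi-dimensional wavelets, query operators that work directly on coefficients, and an I/O-efficient decomposition algorithm, none of which are reproduced here. The sketch assumes a Haar wavelet with pairwise averaging and differencing, a power-of-two input length, and retention of the k coefficients with the largest support-weighted magnitude. The function names (`haar_decompose`, `haar_reconstruct`, `build_synopsis`, `expand_synopsis`) are hypothetical and chosen only for illustration.

```python
import math


def haar_decompose(values):
    """Unnormalized Haar transform: [overall average, detail coefficients...].

    Assumes len(values) is a power of two (an assumption of this sketch).
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    current = list(values)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser levels are placed in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose, expanding one resolution level at a time."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level_details = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, d in zip(current, level_details):
            expanded.extend([avg + d, avg - d])
        pos += len(current)
        current = expanded
    return current


def build_synopsis(values, k):
    """Keep the k coefficients with the largest support-weighted magnitude."""
    coeffs = haar_decompose(values)
    n = len(coeffs)

    def weight(i):
        # Coefficient i covers n / 2**level cells; weight by sqrt(support).
        level = 0 if i == 0 else i.bit_length() - 1
        return math.sqrt(n / (2 ** level))

    ranked = sorted(range(n), key=lambda i: abs(coeffs[i]) * weight(i),
                    reverse=True)
    return {i: coeffs[i] for i in sorted(ranked[:k])}


def expand_synopsis(synopsis, n):
    """Approximate the original data by treating dropped coefficients as 0."""
    return haar_reconstruct([synopsis.get(i, 0.0) for i in range(n)])


data = [2, 2, 0, 2, 3, 5, 4, 4]
synopsis = build_synopsis(data, k=4)
approx = expand_synopsis(synopsis, len(data))
```

In this sketch the synopsis is expanded back into values before any query is answered. The approach summarized in the abstract instead keeps processing in the coefficient domain and postpones expansion until the final result, which is where its reported speedups come from.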
Pages: 199-223
Number of pages: 24
Related papers
34 in total
  • [1] Acharya S., Gibbons P.B., Poosala V., Ramaswamy S., Join synopses for approximate query answering, Proc. 1999 ACM SIGMOD International Conference on Management of Data, pp. 275-286, (1999)
  • [2] Amsaleg L., Bonnet P., Franklin M.J., Tomasic A., Urban T., Improving responsiveness for wide-area data access, IEEE Data Engineering Bulletin, 20, 3, pp. 3-11, (1997)
  • [3] Barbara D., DuMouchel W., Faloutsos C., Haas P.J., Hellerstein J.M., Ioannidis Y., Jagadish H.V., Johnson T., Ng R., Poosala V., Ross K.A., Sevcik K.C., The New Jersey data reduction report, IEEE Data Engineering Bulletin, 20, 4, pp. 3-45, (1997)
  • [4] Cochran W.G., Sampling Techniques, (1977)
  • [5] Deshpande P.M., Ramasamy K., Shukla A., Naughton J.F., Caching multidimensional queries using chunks, Proc. 1998 ACM SIGMOD International Conference on Management of Data, pp. 259-270, (1998)
  • [6] Ester M., Kohlhammer J., Kriegel H.-P., The DC-tree: A fully dynamic index structure for data warehouses, Proc. Sixteenth International Conference on Data Engineering, pp. 379-388, (2000)
  • [7] Gibbons P.B., Matias Y., New sampling-based summary statistics for improving approximate query answers, Proc. 1998 ACM SIGMOD International Conference on Management of Data, pp. 331-342, (1998)
  • [8] Gibbons P.B., Matias Y., Poosala V., Fast incremental maintenance of approximate histograms, Proc. 23rd International Conference on Very Large Data Bases, pp. 466-475, (1997)
  • [9] Gibbons P.B., Matias Y., Poosala V., Aqua Project White Paper, (1997)
  • [10] Haas P.J., Large-sample and deterministic confidence intervals for online aggregation, Proc. Ninth International Conference on Scientific and Statistical Database Management, (1997)