Maintaining stream statistics over sliding windows

被引:262
作者
Datar, M [1 ]
Gionis, A
Indyk, P
Motwani, R
机构
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[2] MIT, Comp Sci Lab, Cambridge, MA 02139 USA
关键词
statistics; data streams; sliding windows; approximation algorithms;
D O I
10.1137/S0097539701398363
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We consider the problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far. We refer to this model as the sliding window model. We consider the following basic problem: Given a stream of bits, maintain a count of the number of 1 s in the last N elements seen from the stream. We show that, using O(1/epsilon log(2) N) bits of memory, we can estimate the number of 1 s to within a factor of 1 + epsilon. We also give a matching lower bound of Omega(1/epsilon log(2) N) memory bits for any deterministic or randomized algorithms. We extend our scheme to maintain the sum of the last N positive integers and provide matching upper and lower bounds for this more general problem as well. We also show how to efficiently compute the L-p norms (p is an element of[1, 2]) of vectors in the sliding window model using our techniques. Using our algorithm, one can adapt many other techniques to work for the sliding window model with a multiplicative overhead of O(1/epsilon log N) in memory and a 1 + epsilon factor loss in accuracy. These include maintaining approximate histograms, hash tables, and statistics or aggregates such as sum and averages.
引用
收藏
页码:1794 / 1813
页数:20
相关论文
共 16 条
  • [1] Alon N., 1996, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, P20, DOI 10.1145/237814.237823
  • [2] [Anonymous], P 1998 INT C VER LAR
  • [3] Chaudhuri S, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P263, DOI 10.1145/304181.304206
  • [4] *CISC SYST, 2000, NETFL SERV APPL
  • [5] Cortes C., 2000, Proceedings. KDD-2000. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, P9, DOI 10.1145/347090.347094
  • [6] FEIGENBAUM J, 1999, P 40 ANN S FDN COMP, P501
  • [7] FRALEIGH C, 2000, TR00ATL101801 SPRINT
  • [8] Gilbert A. C., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P79
  • [9] Approximating a data stream for querying and estimation: Algorithms and performance evaluation
    Guha, S
    Koudas, N
    [J]. 18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2002, : 567 - 576
  • [10] Clustering data streams
    Guha, S
    Mishra, N
    Motwani, R
    O'Callaghan, L
    [J]. 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, : 359 - 366