Estimating Statistical Aggregates on Probabilistic Data Streams

被引:20
作者
Jayram, T. S. [1 ]
McGregor, Andrew [2 ]
Muthukrishnan, S. [3 ]
Vee, Erik
机构
[1] IBM Almaden Res, Almaden, CA USA
[2] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
[3] Google Inc, New York, NY USA
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2008年 / 33卷 / 04期
关键词
Algorithms; Theory; Probabilistic streams; OLAP; mean; median; frequency moments;
D O I
10.1145/1412331.1412338
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The probabilistic stream model was introduced by Jayram et al. [2007]. It is a generalization of the data stream model that is suited to handling probabilistic data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over a potentially exponential number of classical deterministic streams, where each item is deterministically one of the domain values. We present algorithms for computing commonly used aggregates on a probabilistic stream. We present the first one pass streaming algorithms for estimating the expected mean of a probabilistic stream. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F-2, the second frequency moment. We present the first known streaming algorithms for estimating F-0, the number of distinct items on probabilistic streams. Our work also gives an efficient one-pass algorithm for estimating the median, and a two-pass algorithm for estimating the range.
引用
收藏
页数:30
相关论文
共 23 条
[1]   The space complexity of approximating the frequency moments [J].
Alon, N ;
Matias, Y ;
Szegedy, M .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1999, 58 (01) :137-147
[2]  
[Anonymous], 2004, P 2004 ACM SIGMOD IN
[3]  
[Anonymous], P 2005 INT C VER LAR
[4]  
Babcock B., 2002, Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), P1, DOI DOI 10.1145/543613.543615
[5]  
Bar-Yossef Z., 2002, P 6 INT WORKSH RAND, P1
[6]  
Batu T., 2004, P 15 ANN ACM SIAM S, P910
[7]  
BURDICK D, 2006, INT C VER LARG DAT B, P391
[8]   The Space Complexity of Pass-Efficient Algorithms for Clustering [J].
Chang, Kevin L. ;
Kannan, Ravi .
PROCEEDINGS OF THE SEVENTHEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2006, :1157-1166
[9]  
Cormode G., 2007, P ACM SIGMOD INT C M
[10]  
Dalvi Nilesh N., 2004, P INT C VER LARG DAT