Comparing data streams using Hamming norms (How to zero in)

被引:47
作者
Cormode, G
Datar, M
Indyk, P
Muthukrishnan, S
机构
[1] Rutgers State Univ, Ctr Discrete Math & Comp Sci, Piscataway, NJ 08854 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[3] MIT, Comp Sci Lab, Cambridge, MA 02139 USA
[4] Rutgers State Univ, Div Comp & Informat Sci, Piscataway, NJ 08854 USA
[5] AT&T Res, Florham Pk, NJ 07932 USA
关键词
data stream analysis; approximate query processing; data structures and algorithms; data reduction;
D O I
10.1109/TKDE.2003.1198388
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l(0) sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
引用
收藏
页码:529 / 540
页数:12
相关论文
共 42 条
[1]   The space complexity of approximating the frequency moments [J].
Alon, N ;
Matias, Y ;
Szegedy, M .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1999, 58 (01) :137-147
[2]  
Babu S, 2001, SIGMOD REC, V30, P109, DOI 10.1145/603867.603884
[3]  
Bar-Yossef Z, 2002, SIAM PROC S, P623
[4]  
Bar-Yossef Z., 2002, P 6 INT WORKSH RAND, P1
[5]   ESTIMATING THE NUMBER OF SPECIES - A REVIEW [J].
BUNGE, J ;
FITZPATRICK, M .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1993, 88 (421) :364-373
[6]   METHOD FOR SIMULATING STABLE RANDOM-VARIABLES [J].
CHAMBERS, JM ;
MALLOWS, CL ;
STUCK, BW .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1976, 71 (354) :340-344
[7]  
Charikar Moses, 2000, PODS, P268
[8]  
CHAUDHURI S, 1998, P ACM SIGMOD, V27, P436
[9]  
Cohen E., 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073), P489, DOI 10.1109/ICDE.2000.839448
[10]  
CORMODE G, 2002, 200235 DIMACS