A LINEAR-TIME PROBABILISTIC COUNTING ALGORITHM FOR DATABASE APPLICATIONS

被引:287
作者
WHANG, KY
VANDERZANDEN, BT
TAYLOR, HM
机构
[1] CORNELL UNIV,DEPT COMP SCI,ITHACA,NY 14853
[2] UNIV DELAWARE,DEPT MATH SCI,NEWARK,DE 19716
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 1990年 / 15卷 / 02期
关键词
D O I
10.1145/78922.78925
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O1990 time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design. © 1990, ACM. All rights reserved.
引用
收藏
页码:208 / 229
页数:22
相关论文
共 19 条
[1]  
[Anonymous], 1968, INTRO PROBABILITY TH
[2]   APPROXIMATING THE NUMBER OF UNIQUE VALUES OF AN ATTRIBUTE WITHOUT SORTING [J].
ASTRAHAN, MM ;
SCHKOLNICK, M ;
WHANG, KY .
INFORMATION SYSTEMS, 1987, 12 (01) :11-15
[3]   DUPLICATE RECORD ELIMINATION IN LARGE DATA FILES [J].
BITTON, D ;
DEWITT, DJ .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 1983, 8 (02) :255-265
[4]  
Conte C.D.B. S. D., 2017, SOC INDUS APPL MATH
[5]  
DEMOLOMBE R, 1980, P VLDB, P55
[6]  
GELENBE R, 1982, 8TH P INT C VER LARG, P325
[7]  
Johnson N.L., 1977, URN MODELS THEIR APP
[8]  
MOOD AM, 1974, INTRO THEORY STATIST
[9]  
OLKEN F, 1986, 12TH P INT C VER LAR, P160
[10]  
ROWE NC, 1985, IEEE T SOFTWARE ENG, V10, P1081