A block sampling approach to distinct value estimation

Cited by: 3
Authors
Brutlag, JD
Richardson, TS
Affiliations
[1] Microsoft WebTV Networks, Mt View, CA 94043 USA
[2] Univ Washington, Dept Stat, Seattle, WA 98195 USA
Funding
National Science Foundation (USA)
Keywords
block sample estimators; estimating the number of unknown species; sampling from databases; Zipf distribution
DOI
10.1198/106186002760180572
Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
To process queries efficiently in a relational database, it is often of value to estimate the number of distinct values occurring in a particular field. In contexts where complete enumeration is costly, estimators based on a subsample are an attractive alternative. Though many such estimators have already been proposed, most are based on simple random sampling of records, which is wasteful in a database where records are retrieved in blocks. This article seeks to develop estimators that perform well in contexts where blocks of records are sampled randomly. Building on a result of Good's, we derive six such estimators and evaluate their performance.
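To make the setting concrete, the following is a minimal sketch of block sampling followed by a generic coverage-based estimate of the number of distinct values. It is not one of the six estimators derived in the article. The function names `sample_blocks` and `coverage_estimate`, the representation of a table as a list of blocks, and the choice of the estimate d / (1 - f1/n) are all assumptions made for illustration, where d is the number of distinct values seen, f1 the number of values seen exactly once, and n the number of sampled records.

```python
import math
import random
from collections import Counter


def sample_blocks(blocks, k, seed=0):
    """Draw k whole blocks uniformly at random without replacement.

    `blocks` is a hypothetical in-memory stand-in for database pages:
    a list of lists, each inner list holding the field values in one block.
    """
    rng = random.Random(seed)
    return rng.sample(blocks, k)


def coverage_estimate(sampled_blocks):
    """Illustrative coverage-based estimate of the number of distinct values.

    Uses d / (1 - f1 / n). This treats the sampled records as if they were
    drawn independently, which block sampling generally violates, so it is
    only a baseline sketch and not the article's method.
    """
    counts = Counter(v for block in sampled_blocks for v in block)
    n = sum(counts.values())          # records in the sample
    d = len(counts)                   # distinct values observed
    f1 = sum(1 for c in counts.values() if c == 1)  # values seen once
    if n == 0:
        return 0.0
    coverage = 1.0 - f1 / n           # estimated sample coverage
    if coverage <= 0.0:
        return math.inf               # every value seen once: no upper bound
    return d / coverage


if __name__ == "__main__":
    table = [[1, 1, 2], [2, 3, 3], [4, 4, 4], [5, 6, 7]]
    sample = sample_blocks(table, k=2, seed=1)
    print(coverage_estimate(sample))
```

Values stored in the same block are often correlated (for example, when a table is clustered on the field), which is one reason an estimator designed for simple random samples can behave poorly when applied to sampled blocks.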
Pages: 389-404
Page count: 16