SAMPLING TO ESTIMATE THE NUMBER OF DUPLICATES IN A DATABASE

Cited by: 1
Authors
BUNGE, JA [1]
HANDLEY, JC [1]
Affiliation
[1] ONLINE COMP LIB CTR INC, DUBLIN, OH 43017
DOI
10.1016/0167-9473(91)90053-5
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
The problem of estimating the number of duplicate records in a database by sampling is explored. Sampling is done with probability of selection proportional to size, where the size of a record is the number of records equivalent to it. The cost of sampling in this way forces the sample sizes to be small. Two estimators are given, for with-replacement and without-replacement sampling. A sample-based estimator of the variance is available for the with-replacement estimator but not for the without-replacement estimator, whose variance is therefore determined by computer simulation. Because of the size of the population and the many possible populations, efficient simulation software was constructed and is described. Simulations show that, in our application, both estimators are accurate for small samples and their estimates are nearly identical. This is supported by mathematical analysis.
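The abstract does not state the estimators themselves, so the following is only a minimal sketch of the general idea, assuming the textbook Hansen-Hurwitz estimator for with-replacement sampling proportional to size; it is not claimed to be the authors' estimator. Drawing a record uniformly at random selects its equivalence class with probability s/N, where s is the class size and N the total number of records. Each draw then contributes N/s to an estimate of the number of distinct classes, and the number of duplicates is estimated as N minus that value. The names `estimate_duplicates_pps` and `simulate_once` are hypothetical.

```python
import random
from statistics import mean, variance


def estimate_duplicates_pps(sampled_sizes, total_records):
    """Estimate the number of duplicate records from a with-replacement sample.

    sampled_sizes: for each sampled record, the number of records
        equivalent to it (its class size, counting the record itself).
    total_records: N, the total number of records in the database.
    Returns (duplicates_hat, variance_hat).
    """
    n = len(sampled_sizes)
    # A class of size s is drawn with probability s / N, so each draw
    # contributes 1 / (s / N) = N / s to the estimated number of classes.
    contributions = [total_records / s for s in sampled_sizes]
    distinct_hat = mean(contributions)
    duplicates_hat = total_records - distinct_hat
    # Standard sample-based variance estimate for with-replacement draws.
    variance_hat = variance(contributions) / n if n > 1 else float("nan")
    return duplicates_hat, variance_hat


def simulate_once(class_sizes, n, rng):
    """Draw n records uniformly with replacement from a synthetic population
    described by its class sizes, then apply the estimator."""
    # One entry per record, holding the size of that record's class.
    record_sizes = [s for s in class_sizes for _ in range(s)]
    sample = [rng.choice(record_sizes) for _ in range(n)]
    return estimate_duplicates_pps(sample, len(record_sizes))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical population: 900 unique records plus 50 pairs.
    sizes = [1] * 900 + [2] * 50
    true_duplicates = sum(sizes) - len(sizes)
    est, var_hat = simulate_once(sizes, n=30, rng=rng)
    print(true_duplicates, est, var_hat)
```

Repeating `simulate_once` many times over a range of synthetic populations gives a simulation-based view of the estimator's spread, which is the kind of study the abstract describes for the without-replacement case.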
Pages: 65-74
Page count: 10