Compressing massive geophysical datasets using vector quantization

Cited by: 12
Authors
Braverman, A [1]
Affiliations
[1] CALTECH, Jet Prop Lab, Multi Angle Imaging SpectroRadiometer Instrument, Pasadena, CA 91109 USA
Keywords
clustering; ECVQ algorithm; K-means algorithm; Monte Carlo methods; self-consistency;
DOI
10.1198/106186002317375613
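Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract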
This article presents a procedure for compressing massive geophysical datasets. A dataset is stratified geographically, and a penalized clustering algorithm is applied to each stratum independently. The algorithm, called Monte Carlo extended ECVQ, is based on the entropy-constrained vector quantizer algorithm (ECVQ). ECVQ trades off error induced by compression against data reduction to produce a set of representative points, each of which stands for some number of input observations. Since the data are massive, a preliminary set of representatives is determined from a stratum sample, then the full stratum is clustered by assigning each observation to the nearest representative. After replacing the initial representatives with the means of these clusters, the new representatives and their associated counts are a compressed version, or summary, of the original stratum data. With the initial set of representatives determined from a sample, the final summary is subject to sampling variation. A statistical model for the relationship between compressed and uncompressed data provides a framework for assessing this variability. Test data from the International Satellite Cloud Climatology Project are used to demonstrate the procedure.
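The sample-then-assign procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the article's Monte Carlo extended ECVQ: it assumes squared Euclidean distortion, an ECVQ-style cost of distortion plus `lam` times the code length `-log2(p_k)`, and synthetic data in place of a real stratum. The function names `ecvq_on_sample` and `summarize_stratum` and the parameters `k`, `lam`, and `n_iter` are hypothetical and chosen only for illustration.

```python
import numpy as np

def ecvq_on_sample(sample, k, lam, n_iter=50, seed=0):
    """ECVQ-style penalized clustering of a stratum sample (illustrative only)."""
    rng = np.random.default_rng(seed)
    reps = sample[rng.choice(len(sample), size=k, replace=False)].astype(float)
    probs = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Cost of assigning a point to cluster j: squared error plus
        # lam times the code length -log2(p_j) of that cluster.
        d2 = ((sample[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
        labels = (d2 - lam * np.log2(probs)[None, :]).argmin(axis=1)
        counts = np.bincount(labels, minlength=len(reps))
        kept = np.flatnonzero(counts)  # empty clusters are dropped
        reps = np.array([sample[labels == j].mean(axis=0) for j in kept])
        probs = counts[kept] / counts.sum()
    return reps

def summarize_stratum(stratum, reps):
    """Assign every observation to its nearest representative, then
    replace each representative with the mean of its cluster."""
    d2 = ((stratum[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    counts = np.bincount(labels, minlength=len(reps))
    kept = np.flatnonzero(counts)
    means = np.array([stratum[labels == j].mean(axis=0) for j in kept])
    return means, counts[kept]

# Synthetic stand-in for one geographic stratum (assumption, not ISCCP data).
rng = np.random.default_rng(1)
stratum = np.vstack([rng.normal(0.0, 1.0, (5000, 3)),
                     rng.normal(8.0, 1.0, (5000, 3))])
sample = stratum[rng.choice(len(stratum), size=500, replace=False)]

initial_reps = ecvq_on_sample(sample, k=8, lam=2.0)
means, counts = summarize_stratum(stratum, initial_reps)
print(len(means), "representatives summarize", counts.sum(), "observations")
```

In this sketch a larger `lam` penalizes rarely used clusters more heavily, so fewer representatives tend to survive. Because `initial_reps` depends on which rows were drawn into `sample`, rerunning with a different sample generally gives a different summary, which is the sampling variation the abstract refers to.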
Pages: 44-62
Page count: 19