THE ZERO-FREQUENCY PROBLEM - ESTIMATING THE PROBABILITIES OF NOVEL EVENTS IN ADAPTIVE TEXT COMPRESSION

被引:298
作者
WITTEN, IH [1 ]
BELL, TC [1 ]
机构
[1] UNIV CANTERBURY,DEPT COMP SCI,CHRISTCHURCH 1,NEW ZEALAND
基金
加拿大自然科学与工程研究理事会;
关键词
ZERO-FREQUENCY PROBLEM; NOVEL EVENTS; ADAPTIVE TEXT COMPRESSION; POISSON PROCESS; PROBABILITY ESTIMATION;
D O I
10.1109/18.87000
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The zero-frequency problem is the problem of estimating the likelihood of a novel event occurring. It is important in adaptive statistical text compression because it is almost always necessary to reserve a small part of the code space for the unexpected (say the appearance of a new word); the alternative of allocating code space to every possible event (say a code for each ASCII character) invariably impairs coding efficiency since not all possible events actually occur. This paper reviews approaches that have been taken to the problem in adaptive text compression. Although several methods have been used, their suitability has been based on empirical evaluation rather than a well-founded model. We propose the application of a Poisson process model of novelty. Its ability to predict novel tokens is evaluated, and it consistently outperforms existing methods. It is also applied to a practical statistical coding scheme, where a slight modification is required to avoid divergence. The result is a well-founded zero-frequency model that explains observed differences in the performance of existing methods, and offers a small improvement in the coding efficiency of text compression over the best method previously known.
引用
收藏
页码:1085 / 1094
页数:10
相关论文
共 23 条
[21]   UNIVERSAL ALGORITHM FOR SEQUENTIAL DATA COMPRESSION [J].
ZIV, J ;
LEMPEL, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1977, 23 (03) :337-343
[22]   COMPRESSION OF INDIVIDUAL SEQUENCES VIA VARIABLE-RATE CODING [J].
ZIV, J ;
LEMPEL, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1978, 24 (05) :530-536
[23]  
[No title captured]