COMPRESSION, INFORMATION-THEORY, AND GRAMMARS - A UNIFIED APPROACH

被引:14
作者
BOOKSTEIN, A
KLEIN, ST
机构
[1] Univ. of Chicago, Chicago, IL
[2] Univ. of Chicago, Chicago, IL
关键词
Automata Theory--Grammars - Codes; Symbolic--Encoding - Computer Programming--Algorithms - Probability--Random Processes;
D O I
10.1145/78915.78917
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Text compression is of considerable theoretical and practical interest. It is, for example, becoming increasingly important for satisfying the requirements of fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model based. We here propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before as well as, in principle, extending the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well known Shannon entropy formula, making possible a theory of information based upon text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. This algorithm allows us to cluster Markov states, and thereby base our compression algorithm on a smaller number of probability distributions than would otherwise have been required. A number of theoretical consequences of this approach to compression are explored, and a detailed example is given. © 1990, ACM. All rights reserved.
引用
收藏
页码:27 / 49
页数:23
相关论文
共 30 条
[1]  
Aho A. V., 1974, DESIGN ANAL COMPUTER
[2]  
ASH R, 1965, INFORMATION THEORY
[3]   DATA-COMPRESSION USING ADAPTIVE CODING AND PARTIAL STRING MATCHING [J].
CLEARY, JG ;
WITTEN, IH .
IEEE TRANSACTIONS ON COMMUNICATIONS, 1984, 32 (04) :396-402
[4]  
EVEN S., 1979, GRAPH ALGORITHMS
[5]  
Feller W., 1950, INTRO PROBABILITY TH, V1
[6]  
FRAENKEL AS, 1983, ACTA INFORM, V20, P371, DOI 10.1007/BF00264280
[7]  
FUBIN F, 1976, COMMUN ACM, V19, P617
[8]  
Hadley G., 1964, NONLINEAR DYNAMIC PR
[9]  
HAMMING RW, 1986, CODING INFORMATION T
[10]  
Heaps HS, 1978, INFORMATION RETRIEVA