



Notes from U.C. Berkeley's CS170: Intro to CS Theory, lecture 16, covering the Lempel-Ziv algorithm for data compression: how the algorithm builds an alphabet and encodes a file using it, along with its advantages and limitations. The notes also discuss lower bounds on data compression and the concept of entropy.




U.C. Berkeley — CS170: Intro to CS Theory Handout N Professor Luca Trevisan October 30, 2001
There is a sense in which the Huffman coding was “optimal”, but this is under several assumptions:
F = A(1) ◦ 0 ◦ A(2) ◦ 1 ◦ · · · ◦ A(7) ◦ 0 ◦ · · · ◦ A(5) ◦ 1 ◦ · · · ◦ A(i) ◦ j ◦ · · ·
and encode this by
1 ◦ 0 ◦ 2 ◦ 1 ◦ · · · ◦ 7 ◦ 0 ◦ · · · ◦ 5 ◦ 1 ◦ · · · ◦ i ◦ j ◦ · · ·
Figure 1: An example of the Lempel-Ziv algorithm.
The indices i of A(i) are in turn encoded as fixed length binary integers, and the bits j are just bits. Given the fixed length (say r) of the binary integers, we decode by taking every group of r + 1 bits of a compressed file, using the first r bits to look up a string in A, and concatenating the last bit. So when storing (or sending) an encoded file, a header containing A is also stored (or sent). Notice that while Huffman’s algorithm encodes blocks of fixed size into binary sequences of variable length, Lempel-Ziv encodes blocks of varying length into blocks of fixed size. Here is the algorithm for encoding, including building A. Typically a fixed size is available for A, and once it fills up, the algorithm stops looking for new characters.
A = {∅}     ... start with an alphabet containing only an empty string
i = 0       ... points to next place in file F to start encoding
repeat
    find A(k) in the current alphabet that matches as many leading bits f_i f_{i+1} f_{i+2} · · · as possible
        ... initially only A(0) = empty string matches
    let b be the number of bits in A(k)
    if A is not full, add A(k) ◦ f_{i+b} to A
        ... f_{i+b} is the first bit unmatched by A(k)
    output k ◦ f_{i+b}
    i = i + b + 1
until i > length(F)
Note that A is built “greedily”, based on the beginning of the file. Thus there are no optimality guarantees for this algorithm. It can perform badly if the nature of the file changes substantially after A is filled up. On the other hand, the algorithm makes only one pass through the file. (There are other possible implementations: A may be unbounded, and the index k would then itself be encoded with a variable-length code.) In Figure 1 there is an example of the algorithm running, where the alphabet A fills up after 6 characters are inserted. In this small example no compression is obtained, but if A were large, and the same long bit strings appeared frequently, compression would be substantial. The gzip manpage claims that source code and English text are typically compressed by 60%-70%.
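The encoding loop above can be sketched in Python. This is a minimal illustration, not a reference implementation: the names `lz_encode`, `lz_decode`, and `max_entries` are hypothetical, the file is represented as a string of '0' and '1' characters, and the output is kept as a list of (k, bit) pairs rather than packed into r + 1 bit groups. One assumption goes beyond the pseudocode: if the last match runs to the end of the file, the final pair carries an empty bit.

```python
def lz_encode(bits: str, max_entries: int = 8):
    """Encode a '0'/'1' string as (index, next_bit) pairs, building A greedily."""
    alphabet = [""]          # A(0) is the empty string
    index = {"": 0}          # maps each string in A to its position k
    pairs = []
    i = 0
    while i < len(bits):
        # A is prefix-closed (every entry is an older entry plus one bit),
        # so the longest match can be found by extending one bit at a time.
        b = 0
        while i + b < len(bits) and bits[i:i + b + 1] in index:
            b += 1
        k = index[bits[i:i + b]]
        # Assumption: an empty "bit" marks a match that reaches the end of the file.
        next_bit = bits[i + b] if i + b < len(bits) else ""
        if next_bit and len(alphabet) < max_entries:
            entry = bits[i:i + b + 1]
            index[entry] = len(alphabet)
            alphabet.append(entry)
        pairs.append((k, next_bit))
        i += b + 1
    return pairs, alphabet


def lz_decode(pairs, alphabet):
    """Rebuild the file by looking up each index in A and appending the bit."""
    return "".join(alphabet[k] + bit for k, bit in pairs)
```

For the input "0001" the sketch produces the pairs (0, "0"), (1, "0"), (0, "1") and the alphabet ["", "0", "00", "1"]. With r bits per index, each pair would cost r + 1 bits in the compressed file, which is why a tiny example like this one yields no compression.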
This means that, from the average point of view, the optimum prefix-free encoding of a random file is just to leave the file as it is. In practice, however, files are not completely random. Once we formalize the notion of a not-completely-random file, we can show that some compression is possible, but not below a certain limit. First, we observe that even if not all n-bit strings are possible files, we still have lower bounds.
Theorem 4 Let $F \subseteq \{0,1\}^n$ be a set of possible files, and let $C : \{0,1\}^n \to \{0,1\}^*$ be an injective function. Then
$$\Pr\big[\,|C(f)| \le (\log_2 |F|) - t\,\big] \le \frac{1}{2^{t-1}}$$
Proof: Parts 1 and 2 are proved with the same ideas as in Theorem 2 and Theorem 3. Part 3 has a more complicated proof that we omit. □
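As a hedged numeric illustration of the probability bound in Theorem 4, assume it rests on the standard counting argument: there are only 2^(L+1) − 1 binary strings of length at most L, and an injective C can assign each of them to at most one file. The function name `max_fraction_short` is hypothetical.

```python
import math


def max_fraction_short(num_files: int, t: int) -> float:
    """Largest possible fraction of files whose code has length <= log2(num_files) - t."""
    limit = math.floor(math.log2(num_files) - t)
    if limit < 0:
        return 0.0           # no string has negative length
    short_strings = 2 ** (limit + 1) - 1   # strings of length 0, 1, ..., limit
    return min(1.0, short_strings / num_files)
```

For example, with 1024 possible files and t = 3, at most 255 of them can receive a code of 7 bits or fewer, a fraction just under 1/4.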
Suppose now that we are in the following setting:
• the file contains n characters
• there are c different characters possible
• character i has probability p(i) of appearing in the file
What can we say about the probable and expected length of the output of an encoding algorithm? Let us first do a very rough approximate calculation. When we pick a file according to the above distribution, very likely there will be about $p(i) \cdot n$ characters equal to i. Each file with these “typical” frequencies has a probability of about $p = \prod_i p(i)^{p(i) \cdot n}$ of being generated. Since files with typical frequencies make up almost all the probability mass, there must be about $1/p = \prod_i (1/p(i))^{p(i) \cdot n}$ files of typical frequencies. Now we are in a setting similar to that of parts 2 and 3 of Theorem 4, where F is the set of files with typical frequencies. We then expect the encoding to be of length at least
$$\log_2 \prod_i (1/p(i))^{p(i) \cdot n} = n \cdot \sum_i p(i) \log_2 (1/p(i)).$$
The quantity $\sum_i p(i) \log_2 (1/p(i))$ is the expected number of bits that it takes to encode each character, and is called the entropy of the distribution over the characters. The notion of entropy, the discovery of several of its properties, (a formal version of) the calculation above, as well as an (inefficient) optimal compression algorithm, and much, much more, are due to Shannon, and appeared in the late 40s in one of the most influential research papers ever written.
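The entropy of a character distribution is straightforward to compute. The following is a small sketch; the function name `entropy_bits` is an assumption, and characters with probability 0 are skipped because they contribute nothing to the sum.

```python
import math


def entropy_bits(probs):
    """Entropy sum_i p(i) * log2(1/p(i)), in bits per character."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
```

For instance, four equally likely characters give 2 bits per character, while the distribution (1/2, 1/4, 1/4) gives 1.5 bits per character, so a file of n such characters is expected to need about 1.5n bits.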
Making the above calculation precise would be long, and involve a lot of ≤s. Instead, we will formalize a slightly different setting. Consider the set F of files such that
• the file contains n characters
• there are c different characters possible
• character i occurs n · p(i) times in the file
We will show that F contains roughly $2^{n \sum_i p(i) \log_2 (1/p(i))}$ files, and so a random element of F cannot be compressed to less than $n \sum_i p(i) \log_2 (1/p(i))$ bits. Picking a random element of F is almost but not quite the setting that we described before, but it is close enough, and interesting in its own right. Let us call $f(i) = n \cdot p(i)$ the number of occurrences of character i in the file. We need two results, both from Math 55. The first gives a formula for |F|:
$$|F| = \frac{n!}{f(1)! \cdot f(2)! \cdots f(c)!}$$
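The formula for |F| can be checked on small cases by brute force, and compared against the entropy estimate on larger ones. This is an illustrative sketch: the function names are hypothetical, and `counts` stands for the list f(1), ..., f(c).

```python
import math
from itertools import permutations


def count_files(counts):
    """|F| = n! / (f(1)! * ... * f(c)!) computed with exact integers."""
    total = math.factorial(sum(counts))
    for f in counts:
        total //= math.factorial(f)   # each division is exact
    return total


def count_files_brute_force(counts):
    """Count distinct arrangements directly (only feasible for small n)."""
    chars = [ch for ch, f in enumerate(counts) for _ in range(f)]
    return len(set(permutations(chars)))


def entropy_estimate_bits(counts):
    """n * sum_i p(i) * log2(1/p(i)) with p(i) = f(i) / n."""
    n = sum(counts)
    return sum(f * math.log2(n / f) for f in counts if f > 0)
```

With counts (2, 1, 1) both counting functions return 12. For larger inputs, `math.log2(count_files(counts))` stays below `entropy_estimate_bits(counts)`, and the gap is small relative to n when the file is long.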
Here is a sketch of the proof of this formula. There are n! permutations of n characters, but many are the same because there are only c different characters. In particular, the f(1) appearances of character 1 are the same, so all f(1)! orderings of these locations are identical. Thus we need to divide n! by f(1)!. The same argument leads us to divide by all other f(i)!. Now we have an exact formula for |F|, but it is hard to interpret, so we replace it by a simpler approximation. We need a second result from Math 55, namely Stirling’s formula for approximating n!:
$$n! \approx \sqrt{2\pi}\, n^{n+.5} e^{-n}$$
This is a good approximation in the sense that the ratio $n! / [\sqrt{2\pi}\, n^{n+.5} e^{-n}]$ approaches 1 quickly as n grows. (In Math 55 we motivated this formula by the approximation $\log n! = \sum_{i=2}^{n} \log i \approx \int_1^n \log x \, dx$.) We will use Stirling’s formula in the form
$$\log_2 n! \approx \log_2 \sqrt{2\pi} + (n + .5) \log_2 n - n \log_2 e$$
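The accuracy of the logarithmic form of Stirling’s formula can be checked numerically. In this sketch the function names are assumptions, and the exact value of log_2 n! is obtained from the standard-library `math.lgamma`, using the fact that ln n! = lgamma(n + 1).

```python
import math


def log2_factorial_stirling(n: int) -> float:
    """log2 sqrt(2*pi) + (n + .5) * log2(n) - n * log2(e)."""
    return (0.5 * math.log2(2 * math.pi)
            + (n + 0.5) * math.log2(n)
            - n * math.log2(math.e))


def log2_factorial_exact(n: int) -> float:
    """log2(n!) via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)
```

Comparing the two for n = 10, 100, 1000 shows the approximation falling slightly below the exact value, with an error that shrinks as n grows.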
Stirling’s formula is accurate for large arguments, so we will be interested in approximating $\log_2 |F|$ for large n. Furthermore, we will actually estimate $\frac{\log_2 |F|}{n}$, which can be interpreted as the average number of bits per character to send a long file. Here goes:
$$\begin{aligned}
\frac{\log_2 |F|}{n} &= \frac{\log_2 \big( n! / (f(1)! \cdots f(c)!) \big)}{n} \\
&= \frac{\log_2 n! - \sum_{i=1}^{c} \log_2 f(i)!}{n} \\
&\approx \frac{1}{n} \cdot \Big[ \log_2 \sqrt{2\pi} + (n + .5) \log_2 n - n \log_2 e \\
&\qquad\quad - \sum_{i=1}^{c} \big( \log_2 \sqrt{2\pi} + (f(i) + .5) \log_2 f(i) - f(i) \log_2 e \big) \Big]
\end{aligned}$$