Lempel-Ziv Algorithm and Data Compression: U.C. Berkeley CS170 Lecture 16, Study notes of Computer Science

Notes from U.C. Berkeley's CS170 (Intro to CS Theory), Lecture 16, covering the Lempel-Ziv algorithm for data compression. They explain how the algorithm builds an alphabet and encodes a file using it, as well as its advantages and limitations. They also discuss lower bounds on data compression and the concept of entropy.

Typology: Study notes

Pre 2010

Uploaded on 10/01/2009

U.C. Berkeley CS170: Intro to CS Theory Handout N16
Professor Luca Trevisan October 30, 2001
Notes for Lecture 16
1 The Lempel-Ziv algorithm


There is a sense in which the Huffman coding was “optimal”, but this is under several assumptions:

  1. The compression is lossless, i.e. uncompressing the compressed file yields exactly the original file. When lossy compression is permitted, as for video, other algorithms can achieve much greater compression, and this is a very active area of research because people want to be able to send video and audio over the Web.
  2. We know all the frequencies f(i) with which each character appears. How do we get this information? We could make two passes over the data, the first to compute the f(i), and the second to encode the file. But this can be much more expensive than passing over the data once for large files residing on disk or tape. One way to do just one pass over the data is to assume that the fractions f(i)/n of each character in the file are similar to those in files you've compressed before. For example you could assume all Java programs (or English text, or PowerPoint files, or ...) have about the same fractions of characters appearing. A second, cleverer way is to estimate the fractions f(i)/n on the fly as you process the file. One can make Huffman coding adaptive this way.
  3. We know the set of characters (the alphabet) appearing in the file. This may seem obvious, but there is a lot of freedom of choice. For example, the alphabet could be the characters on a keyboard, or it could be the keywords and variable names appearing in a program. To see what difference this can make, suppose we have a file consisting of n strings aaaa and n strings bbbb concatenated in some order. If we choose the alphabet {a, b} then 8n bits are needed to encode the file (8n characters at one bit each). But if we choose the alphabet {aaaa, bbbb} then only 2n bits are needed (2n symbols at one bit each).

Picking the correct alphabet turns out to be crucial in practical compression algorithms. Both the UNIX compress and GNU gzip algorithms use a greedy algorithm due to Lempel and Ziv to compute a good alphabet in one pass while compressing. Here is how it works.

If s and t are two bit strings, we will use the notation s ◦ t to mean the bit string obtained by concatenating s and t.

We let F be the file we want to compress, and think of it just as a string of bits, that is 0's and 1's. We will build an alphabet A of common bit strings encountered in F, and use it to compress F. Given A, we will break F into shorter bit strings like

F = A(1) ◦ 0 ◦ A(2) ◦ 1 ◦ · · · ◦ A(7) ◦ 0 ◦ · · · ◦ A(5) ◦ 1 ◦ · · · ◦ A(i) ◦ j ◦ · · ·

and encode this by

1 ◦ 0 ◦ 2 ◦ 1 ◦ · · · ◦ 7 ◦ 0 ◦ · · · ◦ 5 ◦ 1 ◦ · · · ◦ i ◦ j ◦ · · ·

F = 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0

Blocks of F, in order (A(0) is the empty string):

A(1) = A(0) ◦ 0 = 0

A(2) = A(1) ◦ 0 = 00

A(3) = A(0) ◦ 1 = 1

A(4) = A(3) ◦ 1 = 11

A(5) = A(3) ◦ 0 = 10

A(6) = A(5) ◦ 1 = 101 (the set A is now full)

A(6) ◦ 0 = 1010 (not added to A)

A(1) ◦ 0 = 00 (not added to A)

Encoded F = (0,0), (1,0), (0,1), (3,1), (3,0), (5,1), (6,0), (1,0)

Figure 1: An example of the Lempel-Ziv algorithm.

The indices i of A(i) are in turn encoded as fixed length binary integers, and the bits j are just bits. Given the fixed length (say r) of the binary integers, we decode by taking every group of r + 1 bits of a compressed file, using the first r bits to look up a string in A, and concatenating the last bit. So when storing (or sending) an encoded file, a header containing A is also stored (or sent). Notice that while Huffman’s algorithm encodes blocks of fixed size into binary sequences of variable length, Lempel-Ziv encodes blocks of varying length into blocks of fixed size. Here is the algorithm for encoding, including building A. Typically a fixed size is available for A, and once it fills up, the algorithm stops looking for new characters.

A = {∅}    ... start with an alphabet containing only an empty string
i = 0      ... points to next place in file F to start encoding
repeat
    find A(k) in the current alphabet that matches as many leading bits F_i F_{i+1} F_{i+2} · · · as possible
        ... initially only A(0) = empty string matches
    let b be the number of bits in A(k)
    if A is not full, add A(k) ◦ F_{i+b} to A
        ... F_{i+b} is the first bit unmatched by A(k)
    output k ◦ F_{i+b}
    i = i + b + 1
until i > length(F)
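The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used by compress or gzip. The function name `lz_encode` and the parameter `max_entries` (the capacity of A, counting the empty string A(0)) are assumptions made for this example. It also assumes that a match must leave at least one bit over, so that the literal bit j always exists; this is how the last block of Figure 1 is handled.

```python
def lz_encode(bits, max_entries):
    """Greedy Lempel-Ziv parse of a bit string, following the pseudocode.

    bits        -- the file F as a string of '0'/'1' characters
    max_entries -- capacity of A, counting A(0) = "" (assumed parameter)
    Returns (pairs, alphabet): the (k, j) output pairs and the final list A.
    """
    alphabet = [""]          # A(0) is the empty string
    index_of = {"": 0}       # reverse lookup: string -> its index k
    longest = 0              # length of the longest entry currently in A
    pairs = []
    i = 0
    while i < len(bits):
        # Find the longest A(k) matching the bits at position i.
        # Assumption: the match leaves at least one bit for the literal j.
        k, b = 0, 0
        limit = min(longest, len(bits) - i - 1)
        for length in range(limit, 0, -1):
            candidate = bits[i:i + length]
            if candidate in index_of:
                k, b = index_of[candidate], length
                break
        j = bits[i + b]                      # first bit unmatched by A(k)
        new_entry = alphabet[k] + j
        if len(alphabet) < max_entries and new_entry not in index_of:
            index_of[new_entry] = len(alphabet)
            alphabet.append(new_entry)
            longest = max(longest, len(new_entry))
        pairs.append((k, j))
        i += b + 1
    return pairs, alphabet


# The file from Figure 1, with room for A(0) through A(6).
pairs, alphabet = lz_encode("00011110101101000", max_entries=7)
print(pairs)
print(alphabet)
```

With a capacity of 7 entries, this run produces the same alphabet and the same (k, j) pairs as Figure 1. The dictionary `index_of` is only a lookup convenience; the pseudocode does not say how the search for A(k) should be organized.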

Note that A is built “greedily”, based on the beginning of the file. Thus there are no optimality guarantees for this algorithm, and it can perform badly if the nature of the file changes substantially after A is filled up. On the other hand, the algorithm makes only one pass through the file. (There are other possible implementations: A may be unbounded, and the index k would then itself be encoded with a variable-length code.) Figure 1 shows an example of the algorithm running, where the alphabet A fills up after 6 characters are inserted. In this small example no compression is obtained, but if A were large, and the same long bit strings appeared frequently, compression would be substantial. The gzip manpage claims that source code and English text are typically compressed by 60%-70%.
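The decoding rule described earlier (read r + 1 bits at a time, look up the first r bits in A, append the last bit) can be sketched as follows. The helper names `pack` and `lz_decode` are hypothetical, and the alphabet is passed in directly, standing in for the header that carries A. The data is copied from Figure 1, with r = 3 chosen because three bits are enough for the indices 0 to 6.

```python
def pack(pairs, r):
    """Write each (k, j) pair as r index bits followed by the literal bit j."""
    return "".join(format(k, "0{}b".format(r)) + j for k, j in pairs)


def lz_decode(packed, alphabet, r):
    """Decode groups of r + 1 bits using the alphabet A from the header."""
    out = []
    for pos in range(0, len(packed), r + 1):
        k = int(packed[pos:pos + r], 2)      # first r bits: index into A
        j = packed[pos + r]                  # last bit: the literal bit
        out.append(alphabet[k] + j)
    return "".join(out)


# Alphabet and output pairs taken from Figure 1.
alphabet = ["", "0", "00", "1", "11", "10", "101"]
pairs = [(0, "0"), (1, "0"), (0, "1"), (3, "1"),
         (3, "0"), (5, "1"), (6, "0"), (1, "0")]
packed = pack(pairs, r=3)
decoded = lz_decode(packed, alphabet, r=3)
print(len(packed), decoded)
```

Here the packed output is 32 bits long, while the original file has 17 bits, even before counting the header. This matches the remark that the small example achieves no compression.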

This means that, on average, the optimum prefix-free encoding of a random file is just to leave the file as it is. In practice, however, files are not completely random. Once we formalize the notion of a not-completely-random file, we can show that some compression is possible, but not below a certain limit. First, we observe that even if not all n-bit strings are possible files, we still have lower bounds.

Theorem 4 Let $F \subseteq \{0,1\}^n$ be a set of possible files, and let $C : \{0,1\}^n \to \{0,1\}^*$ be an injective function. Then

  1. There is a file $f \in F$ such that $|C(f)| \ge \log_2 |F|$.
  2. If we pick a file $f$ uniformly at random from $F$, then for every $t$ we have

$$\Pr\big[\,|C(f)| \le (\log_2 |F|) - t\,\big] \le \frac{1}{2^{t-1}}$$

  3. If $C$ is prefix-free, then when we pick a file $f$ uniformly at random from $F$ we have $\mathrm{E}[|C(f)|] \ge \log_2 |F|$.

Proof: Parts 1 and 2 are proved with the same ideas as in Theorem 2 and Theorem 3. Part 3 has a more complicated proof that we omit. □
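The counting idea behind part 2 can be checked numerically in the special case where F is all of {0,1}^n, so that log_2 |F| = n. This is only an illustrative sketch; the function names are made up for this example. An injective C can send at most as many files to short outputs as there are short binary strings, and there are 2^(m+1) − 1 binary strings of length at most m.

```python
from fractions import Fraction


def strings_of_length_at_most(m):
    """Number of binary strings of length 0, 1, ..., m (zero if m < 0)."""
    return 2 ** (m + 1) - 1 if m >= 0 else 0


def max_fraction_shortened(n, t):
    """Largest fraction of the 2**n files that an injective C can map
    to outputs of length at most n - t."""
    return Fraction(strings_of_length_at_most(n - t), 2 ** n)


# Compare the achievable fraction with the bound 1 / 2**(t - 1).
for t in range(1, 6):
    print(t, max_fraction_shortened(16, t), Fraction(1, 2 ** (t - 1)))
```

In every row the first fraction stays below the second, which is the bound stated in part 2 for this special case.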

2.2 Introduction to Entropy

Suppose now that we are in the following setting:

  - the file contains n characters
  - there are c different characters possible
  - character i has probability p(i) of appearing in the file

What can we say about the probable and expected length of the output of an encoding algorithm? Let us first do a very rough approximate calculation. When we pick a file according to the above distribution, very likely there will be about $p(i) \cdot n$ characters equal to $i$. Each file with these “typical” frequencies has a probability of about $p = \prod_i p(i)^{p(i) \cdot n}$ of being generated. Since files with typical frequencies make up almost all the probability mass, there must be about $1/p = \prod_i (1/p(i))^{p(i) \cdot n}$ files of typical frequencies. Now, we are in a setting which is similar to the one of parts 2 and 3 of Theorem 4, where F is the set of files with typical frequencies. We then expect the encoding to be of length at least

$$\log_2 \prod_i (1/p(i))^{p(i) \cdot n} = n \cdot \sum_i p(i) \log_2 (1/p(i)).$$

The quantity $\sum_i p(i) \log_2 (1/p(i))$ is the expected number of bits that it takes to encode each character, and is called the entropy of the distribution over the characters. The notion of entropy, the discovery of several of its properties, (a formal version of) the calculation above, as well as an (inefficient) optimal compression algorithm, and much, much more, are due to Shannon, and appeared in the late 40s in one of the most influential research papers ever written.
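As a small illustration of the entropy formula, the sketch below evaluates $\sum_i p(i) \log_2 (1/p(i))$ for a few distributions. The function name `entropy` and the example distributions are choices made for this illustration. Characters with probability 0 are skipped, which is the usual convention and an assumption here.

```python
import math


def entropy(probs):
    """Entropy in bits per character of a distribution given as a list p(i)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))          # two equally likely characters
print(entropy([0.5, 0.25, 0.25]))   # three characters, one more likely
print(entropy([0.9, 0.1]))          # a heavily skewed two-character file
```

The more skewed the distribution, the smaller the entropy, and so the fewer bits per character a good encoding needs on average.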

2.3 A Calculation

Making the above calculation precise would be long, and involve a lot of ≤s. Instead, we will formalize a slightly different setting. Consider the set F of files such that

  - the file contains n characters
  - there are c different characters possible
  - character i occurs n · p(i) times in the file

We will show that F contains roughly $2^{n \sum_i p(i) \log_2 (1/p(i))}$ files, and so a random element of F cannot be compressed to less than $n \sum_i p(i) \log_2 (1/p(i))$ bits. Picking a random element of F is almost but not quite the setting that we described before, but it is close enough, and interesting in its own right. Let us call $f(i) = n \cdot p(i)$ the number of occurrences of character i in the file. We need two results, both from Math 55. The first gives a formula for |F|:

$$|F| = \frac{n!}{f(1)! \cdot f(2)! \cdots f(c)!}$$

Here is a sketch of the proof of this formula. There are n! permutations of n characters, but many are the same because there are only c different characters. In particular, the f(1) appearances of character 1 are the same, so all f(1)! orderings of these locations are identical. Thus we need to divide n! by f(1)!. The same argument leads us to divide by all other f(i)!.

Now we have an exact formula for |F|, but it is hard to interpret, so we replace it by a simpler approximation. We need a second result from Math 55, namely Stirling's formula for approximating n!:

$$n! \approx \sqrt{2\pi}\, n^{n+.5}\, e^{-n}$$

This is a good approximation in the sense that the ratio $n!/[\sqrt{2\pi}\, n^{n+.5}\, e^{-n}]$ approaches 1 quickly as n grows. (In Math 55 we motivated this formula by the approximation $\log n! = \sum_{i=2}^{n} \log i \approx \int_1^n \log x \, dx$.) We will use Stirling's formula in the form

$$\log_2 n! \approx \log_2 \sqrt{2\pi} + (n + .5) \log_2 n - n \log_2 e$$
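To get a feel for how accurate the logarithmic form of Stirling's formula is, the following sketch compares it with a direct floating-point evaluation of log_2 n!. The function names are hypothetical; `math.lgamma(n + 1)` is used to obtain ln n! without computing the huge integer n!.

```python
import math


def log2_factorial(n):
    """log_2(n!) computed from the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)


def log2_factorial_stirling(n):
    """Stirling's approximation in the logarithmic form given above."""
    return (math.log2(math.sqrt(2 * math.pi))
            + (n + 0.5) * math.log2(n)
            - n * math.log2(math.e))


for n in (5, 50, 500):
    direct = log2_factorial(n)
    approx = log2_factorial_stirling(n)
    print(n, direct, approx, direct - approx)
```

The difference printed in the last column shrinks as n grows, which is the sense in which the approximation is accurate for large arguments.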

Stirling's formula is accurate for large arguments, so we will be interested in approximating $\log_2 |F|$ for large n. Furthermore, we will actually estimate $\frac{\log_2 |F|}{n}$, which can be interpreted as the average number of bits per character to send a long file. Here goes:

$$\begin{aligned}
\frac{\log_2 |F|}{n} &= \frac{\log_2 \big(n!/(f(1)! \cdots f(c)!)\big)}{n} \\
&= \frac{\log_2 n! - \sum_{i=1}^{c} \log_2 f(i)!}{n} \\
&\approx \frac{1}{n} \cdot \Big[ \log_2 \sqrt{2\pi} + (n + .5) \log_2 n - n \log_2 e - \sum_{i=1}^{c} \big( \log_2 \sqrt{2\pi} + (f(i) + .5) \log_2 f(i) - f(i) \log_2 e \big) \Big]
\end{aligned}$$
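The quantity being estimated, log_2 |F| / n, can also be evaluated numerically from the exact formula for |F| and compared with the entropy. The sketch below does this for one example distribution; the function names and the distribution (0.5, 0.25, 0.25) are assumptions chosen for illustration, and n is picked so that every n · p(i) is an integer.

```python
import math


def bits_per_character(counts):
    """log_2 |F| / n, where |F| = n! / (f(1)! ... f(c)!) and counts = f(i)."""
    n = sum(counts)
    ln_size = math.lgamma(n + 1) - sum(math.lgamma(f + 1) for f in counts)
    return ln_size / math.log(2) / n


def entropy(probs):
    """Entropy in bits per character of the distribution p(i)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


probs = [0.5, 0.25, 0.25]
for n in (40, 400, 4000, 40000):
    counts = [round(p * n) for p in probs]   # f(i) = n * p(i)
    print(n, bits_per_character(counts), entropy(probs))
```

For this distribution the per-character value stays below the entropy and moves toward it as n increases, in line with the claim that F contains roughly $2^{n \sum_i p(i) \log_2 (1/p(i))}$ files.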