Lempel-Ziv Algorithm and Data Compression: U.C. Berkeley CS170 Lecture 16

U.C. Berkeley — CS170: Intro to CS Theory Handout N16

Professor Luca Trevisan October 30, 2001

Notes for Lecture 16

1 The Lempel-Ziv algorithm

There is a sense in which Huffman coding was “optimal”, but this is under several

assumptions:

1. The compression is lossless, i.e. uncompressing the compressed file yields exactly the

original file. When lossy compression is permitted, as for video, other algorithms can

achieve much greater compression, and this is a very active area of research because

people want to be able to send video and audio over the Web.

2. We know all the frequencies f(i) with which each character appears. How do we get

this information? We could make two passes over the data, the first to compute the

f(i), and the second to encode the file. But this can be much more expensive than

passing over the data once for large files residing on disk or tape. One way to do

just one pass over the data is to assume that the fractions f(i)/n of each character in

the file are similar to those in files you’ve compressed before. For example you could assume

all Java programs (or English text, or PowerPoint files, or ...) have about the same

fractions of characters appearing. A second, cleverer way is to estimate the fractions

f(i)/n on the fly as you process the file. One can make Huffman coding adaptive this

way.

3. We know the set of characters (the alphabet) appearing in the file. This may seem

obvious, but there is a lot of freedom of choice. For example, the alphabet could be

the characters on a keyboard, or it could be the keywords and variable names

appearing in a program. To see what difference this can make, suppose we have a

file consisting of n strings aaaa and n strings bbbb concatenated in some order. If we

choose the alphabet {a, b} then 8n bits are needed to encode the file. But if we choose

the alphabet {aaaa, bbbb} then only 2n bits are needed.
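
The 8n versus 2n count can be checked with a minimal sketch. It assumes a fixed-length code (one bit per symbol when the alphabet has two symbols); the helper names `fixed_length_bits` and `split_into_blocks` are hypothetical and do not come from the notes.

```python
import math


def fixed_length_bits(symbols, alphabet):
    """Bits needed to encode `symbols` with a fixed-length code over `alphabet`."""
    # Assumption: every symbol gets ceil(log2 |alphabet|) bits, at least 1.
    bits_per_symbol = max(1, math.ceil(math.log2(len(alphabet))))
    return bits_per_symbol * len(symbols)


def split_into_blocks(text, block_len):
    """Cut `text` into consecutive blocks of `block_len` characters."""
    return [text[i:i + block_len] for i in range(0, len(text), block_len)]


n = 3
# n copies of "aaaa" and n copies of "bbbb", concatenated in some order.
text = "aaaa" + "bbbb" + "bbbb" + "aaaa" + "aaaa" + "bbbb"

bits_small_alphabet = fixed_length_bits(list(text), {"a", "b"})
bits_large_alphabet = fixed_length_bits(split_into_blocks(text, 4), {"aaaa", "bbbb"})

assert bits_small_alphabet == 8 * n   # one bit for each of the 8n characters
assert bits_large_alphabet == 2 * n   # one bit for each of the 2n blocks
```

The file is the same in both cases; only the choice of what counts as a symbol changes.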

Picking the correct alphabet turns out to be crucial in practical compression algorithms.

Both the UNIX compress and GNU gzip algorithms use a greedy algorithm due to Lempel

and Ziv to compute a good alphabet in one pass while compressing. Here is how it works.

If s and t are two bit strings, we will use the notation s ◦ t to mean the bit string gotten

by concatenating s and t.

We let F be the file we want to compress, and think of it just as a string of bits, that is

0’s and 1’s. We will build an alphabet A of common bit strings encountered in F, and use

it to compress F. Given A, we will break F into shorter bit strings like

F = A(1) ◦ 0 ◦ A(2) ◦ 1 ◦ ··· ◦ A(7) ◦ 0 ◦ ··· ◦ A(5) ◦ 1 ◦ ··· ◦ A(i) ◦ j ◦ ···

and encode this by

1 ◦ 0 ◦ 2 ◦ 1 ◦ ··· ◦ 7 ◦ 0 ◦ ··· ◦ 5 ◦ 1 ◦ ··· ◦ i ◦ j ◦ ···

F = 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0

A(1) = A(0)0 = 0

A(2) = A(1)0 = 00

A(3) = A(0)1 = 1

A(4) = A(3)1 = 11

A(5) = A(3)0 = 10

A(6) = A(5)1 = 101

set A is full

= A(6)0 = 1010

= A(1)0 = 00

Encoded F = (0,0),(1,0),(0,1), (3,1),(3,0), (5,1),(6,0),(1,0)
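
The parse in this example can be reproduced with a short sketch. The parsing rule used here is inferred from the example rather than quoted from the notes: at each position, take the longest entry of A that matches the upcoming bits while still leaving one bit to serve as j, emit the pair (i, j), and add A(i) ◦ j to A if A is not yet full. The names `lz_parse`, `lz_unparse`, and `max_entries` are hypothetical, and the cap of 7 entries (A(0), the empty string, through A(6)) is an assumption chosen to match the point where the example says the set is full.

```python
def lz_parse(bits, max_entries=7):
    """Greedily split the bit string `bits` into (index, next_bit) pairs.

    A(0) is the empty string. `max_entries` counts A(0), so the default
    allows A(0)..A(6); this cap is an assumption matching the example.
    """
    index = {"": 0}  # phrase -> its number in A
    pairs = []
    pos = 0
    while pos < len(bits):
        end = pos
        # Grow the match while it is still in A and at least one bit
        # would remain to serve as the explicit next bit j.
        while end + 1 < len(bits) and bits[pos:end + 1] in index:
            end += 1
        phrase, next_bit = bits[pos:end], bits[end]
        pairs.append((index[phrase], int(next_bit)))
        new_phrase = phrase + next_bit
        if len(index) < max_entries and new_phrase not in index:
            index[new_phrase] = len(index)
        pos = end + 1
    return pairs


def lz_unparse(pairs, max_entries=7):
    """Rebuild the bit string from (index, next_bit) pairs."""
    table = [""]  # table[i] plays the role of A(i)
    out = []
    for i, j in pairs:
        phrase = table[i] + str(j)
        out.append(phrase)
        # Mirror the encoder's rule for growing A.
        if len(table) < max_entries and phrase not in table:
            table.append(phrase)
    return "".join(out)


F = "00011110101101000"
pairs = lz_parse(F)
print(pairs)
assert lz_unparse(pairs) == F
```

On the F above this yields the same eight pairs as the encoding in the example, and decoding them returns F. The decoder needs no copy of A because it rebuilds the table in the same order the encoder did.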

2.2 Introduction to Entropy

2.3 A Calculation

|F | =