




This document from Wellesley College covers various compression algorithms, including uniform- and variable-length encoding, Huffman coding, dictionary methods, LZ78, and LZ77. It explains the concepts behind these techniques and their advantages and disadvantages, and provides examples. The document also touches on the importance of information theory and the role of entropy in compression.





CS231 Algorithms, Handout # 31
Prof. Lyn Turbak, November 20, 2001
Wellesley College
The Big Picture
We want to be able to store and retrieve data, as well as communicate it with others. In general, this requires encoding the data and decoding the encoded data:
[Figure: data --encode--> storage medium / communications network --decode--> data]
For the purpose of this lecture, we observe the following constraints:
Uniform-Length Encoding of Textual Data
For textual data, it is common to encode each character as an 8-bit byte using a uniform-length encoding known as ASCII. Each byte can be written as a decimal integer in the range [0 .. 255]; standard ASCII itself assigns characters only to the values [0 .. 127]. Below is a table showing ASCII values in the range [0 .. 127] and their associated characters:
  0:^@    1:^A    2:^B    3:^C    4:^D    5:^E    6:^F    7:^G
  8:^H    9:\t   10:\n   11:^K   12:^L   13:^M   14:^N   15:^O
 16:^P   17:^Q   18:^R   19:^S   20:^T   21:^U   22:^V   23:^W
 24:^X   25:^Y   26:^Z   27:^[   28:^\   29:^]   30:^^   31:^_
 32:     33:!    34:"    35:#    36:$    37:%    38:&    39:'
 40:(    41:)    42:*    43:+    44:,    45:-    46:.    47:/
 48:0    49:1    50:2    51:3    52:4    53:5    54:6    55:7
 56:8    57:9    58::    59:;    60:<    61:=    62:>    63:?
 64:@    65:A    66:B    67:C    68:D    69:E    70:F    71:G
 72:H    73:I    74:J    75:K    76:L    77:M    78:N    79:O
 80:P    81:Q    82:R    83:S    84:T    85:U    86:V    87:W
 88:X    89:Y    90:Z    91:[    92:\    93:]    94:^    95:_
 96:`    97:a    98:b    99:c   100:d   101:e   102:f   103:g
104:h   105:i   106:j   107:k   108:l   109:m   110:n   111:o
112:p   113:q   114:r   115:s   116:t   117:u   118:v   119:w
120:x   121:y   122:z   123:{   124:|   125:}   126:~   127:^?
For a letter or symbol σ, the notation ^σ stands for the character specified by pressing the Control key and σ key at the same time. The notation \t stands for the tab character, and \n for the newline character. Integers in the range [128 .. 255] correspond to other special characters.
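As a small illustration of uniform-length encoding, the sketch below turns a string into its ASCII byte values and back, and shows that the encoded size is always 8 bits per character. The function names are hypothetical and chosen only for this example; it assumes the text contains only characters in the range [0 .. 127].

```python
def ascii_encode(text):
    """Return the list of ASCII values (one byte per character)."""
    # str.encode raises UnicodeEncodeError for characters outside [0 .. 127].
    return list(text.encode("ascii"))


def ascii_decode(codes):
    """Inverse of ascii_encode: rebuild the string from byte values."""
    return bytes(codes).decode("ascii")


def to_bits(codes):
    """Write each byte as exactly 8 binary digits (uniform length)."""
    return "".join(format(c, "08b") for c in codes)


codes = ascii_encode("miss")
print(codes)                 # decimal values of m, i, s, s
print(len(to_bits(codes)))   # 8 bits per character
```

Because every character costs the same 8 bits, the decoder needs no separators: it simply cuts the bit string into groups of 8.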
Variable-Length Encoding of Textual Data
Uniform-length encodings do not take advantage of the fact that different letters have different frequencies. For instance, here is a table of letter frequencies in English (per 1000 letters)^1:
E 130    I 74    D 44    F 28    Y 19    B 9    J 2
T  93    O 74    H 35    P 27    G 16    X 5    Z 1
N  78    A 73    L 35    U 27    W 16    K 3
R  77    S 63    C 30    M 25    V 13    Q 3
This suggests a variable-length encoding in which frequent letters, like E and T, have shorter representations than infrequent letters. Morse code is an example of a variable-length encoding:
A .-      K -.-     U ..-                0 -----
B -...    L .-..    V ...-               1 .----
C -.-.    M --      W .--                2 ..---
D -..     N -.      X -..-               3 ...--
E .       O ---     Y -.--               4 ....-
F ..-.    P .--.    Z --..               5 .....
G --.     Q --.-                         6 -....
H ....    R .-.     Full Stop .-.-.-     7 --...
I ..      S ...     Comma --..--         8 ---..
J .---    T -       Query ..--..         9 ----.
(^1) as reported by http://library.thinkquest.org/28005/flashed/thelab/cryptograms/frequency.shtml
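To see how much a variable-length code can gain from letter frequencies, the sketch below computes the average number of Morse symbols (dots and dashes) per letter, weighted by the frequency table above. The names FREQ, MORSE, and average_code_length are hypothetical. One caveat: Morse code also needs a pause between letters, which this count ignores, so the figure illustrates the idea rather than giving a fair bit count.

```python
# Letter frequencies per 1000 letters, copied from the frequency table.
FREQ = {
    "E": 130, "T": 93, "N": 78, "R": 77, "I": 74, "O": 74, "A": 73,
    "S": 63, "D": 44, "H": 35, "L": 35, "C": 30, "F": 28, "P": 27,
    "U": 27, "M": 25, "Y": 19, "G": 16, "W": 16, "V": 13, "B": 9,
    "X": 5, "K": 3, "Q": 3, "J": 2, "Z": 1,
}

# International Morse code for the 26 letters.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
    "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
    "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
    "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def average_code_length(freq, code):
    """Frequency-weighted mean number of code symbols per letter."""
    total = sum(freq.values())
    return sum(freq[ch] * len(code[ch]) for ch in freq) / total


weighted = average_code_length(FREQ, MORSE)
# Baseline: every letter equally likely, so plain mean of the code lengths.
unweighted = sum(len(c) for c in MORSE.values()) / len(MORSE)
print(weighted, unweighted)
```

The weighted average comes out lower than the unweighted one because the shortest codes are assigned to the most frequent letters.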
Huffman Coding
Huffman coding is a technique for choosing variable-length codes for symbols based on their relative probabilities. It was invented by David Huffman in 1952.
Main Encoding Steps:
Decoding Step: Use bit strings to find each character of text in the encoding trie.
Example: Make the encoding trie for "miss mississippi"
Note: The trie itself must be encoded and transmitted in addition to the encoded text!
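The following is a minimal sketch of Huffman coding for the "miss mississippi" example. It uses the standard construction (repeatedly merge the two least frequent subtrees), which may not match this handout's steps or tie-breaking exactly, so the individual bit strings can differ from a hand-built trie. All function names are hypothetical, and the sketch assumes the text has at least two distinct characters.

```python
import heapq
from collections import Counter


def build_trie(text):
    """Build a Huffman trie: a leaf is a character, an internal node a pair."""
    # Heap entries are (weight, tiebreak, node); the unique tiebreak integer
    # keeps Python from ever comparing two nodes directly.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # least frequent subtree
        w2, _, b = heapq.heappop(heap)   # second least frequent subtree
        heapq.heappush(heap, (w1 + w2, tiebreak, (a, b)))
        tiebreak += 1
    return heap[0][2]


def code_table(trie, prefix=""):
    """Map each character to its bit string (0 = left branch, 1 = right)."""
    if isinstance(trie, str):
        return {trie: prefix}
    left, right = trie
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


def encode(text, table):
    return "".join(table[ch] for ch in text)


def decode(bits, trie):
    """Walk the trie bit by bit; emit a character at each leaf and restart."""
    out, node = [], trie
    for b in bits:
        node = node[int(b)]
        if isinstance(node, str):
            out.append(node)
            node = trie
    return "".join(out)


text = "miss mississippi"
trie = build_trie(text)
bits = encode(text, code_table(trie))
print(len(bits), decode(bits, trie) == text)
```

For this text the encoded message takes 34 bits, compared with 128 bits at 8 bits per character, although the cost of transmitting the trie itself is not counted here.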
Dictionary Methods
Key Idea: Replace substrings in text by codewords = indices into a dictionary.
Simple version:
More complex version: Interleave codewords and character strings. E.g., ("m", 17, 17, "ippi"), where 17 is a codeword for "iss".
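A minimal sketch of decoding the interleaved form: integers are treated as codewords to look up in the dictionary, and strings are copied through unchanged. The function name expand and the one-entry dictionary are assumptions made for this example only.

```python
def expand(items, dictionary):
    """Replace each integer codeword by its dictionary string; keep literals."""
    return "".join(dictionary[x] if isinstance(x, int) else x for x in items)


dictionary = {17: "iss"}          # codeword 17 stands for "iss"
print(expand(["m", 17, 17, "ippi"], dictionary))
```

The decoder must use the same dictionary as the encoder, which is why the choice of dictionary matters.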
Which dictionary?
LZ77: Another Adaptive Dictionary Method
LZ77 is another adaptive dictionary compression method, invented by Ziv and Lempel in 1977. The gzip utility uses a variant of LZ77 (its DEFLATE format combines LZ77 with Huffman coding).
Idea: Encode text via triples of the form (back, len, char), where:
Mississippi Example: Here is the LZ77 encoding of "i miss mississippi":
[(0,0,i), (0,0, ), (0,0,m), (0,0,i), (0,0,s), (1,1, ), (5,4,i), (3,3,p), (1,1,i)]
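The triples can be checked with a small decoder. This sketch assumes the interpretation that fits the Mississippi example: back is the distance to move back from the end of the text decoded so far, len is the number of characters to copy from there, and char is the literal character appended afterwards. The function name lz77_decode is hypothetical.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, char) triples into a string."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        # Copy one character at a time so that a copy may overlap the
        # text it is producing (length greater than back).
        for k in range(length):
            out.append(out[start + k])
        out.append(char)
    return "".join(out)


triples = [(0, 0, "i"), (0, 0, " "), (0, 0, "m"), (0, 0, "i"), (0, 0, "s"),
           (1, 1, " "), (5, 4, "i"), (3, 3, "p"), (1, 1, "i")]
print(lz77_decode(triples))
```

Copying character by character matters for triples whose length exceeds their back distance, because the copied region then extends into text that is being produced by the same copy.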
Doodley Example: Using LZ77, the doodley poem can be encoded with 35 triples = 105 bytes (at one byte per triple component) vs. the 200 uncompressed bytes:^2
[(0,0,w), (0,0,e), (0,0, ), (0,0,d), (0,0,o), (0,0,‘,’), (4,3,o), (3,1,l), (0,0,e), (0,0,y), (12,28,↵), (43,1,h), (0,0,a), (0,0,t), (9,1,w), (13,1, ), (0,0,m), (0,0,u), (0,0,s), (8,1,‘,’), (6,3,d), (21,1,i), (27,1,y), (14,34,↵), (14,8,d), (80,3,m), (12,34,↵), (11,1,n), (53,1,i), (11,1, ), (105,3,b), (77,5,b), (130,2,i), (26,3,b), (13,29,‘.’)]
(^2) In practice, the compression factor is better than this because the indices themselves are compressed, say via Huffman coding.
Lossless Image Compression
An image is a two-dimensional array of pixels.
Uniform Encoding: A simple bitmap encoding consists of the width and height of the image followed by a sequence of pixel values (often in row-major order).
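A minimal sketch of such a bitmap encoding, assuming the image is given as a list of equal-length rows of integer pixel values. The function names and the choice of a flat Python list as the output format are assumptions for illustration, not a real image file format.

```python
def bitmap_encode(pixels):
    """Return [width, height, p0, p1, ...] with pixels in row-major order."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    flat = [p for row in pixels for p in row]   # row 0 first, then row 1, ...
    return [width, height] + flat


def bitmap_decode(data):
    """Rebuild the list of rows from the width, height, and pixel sequence."""
    width, height = data[0], data[1]
    body = data[2:]
    return [body[r * width:(r + 1) * width] for r in range(height)]


image = [[1, 2, 3],
         [4, 5, 6]]
encoded = bitmap_encode(image)
print(encoded)
print(bitmap_decode(encoded) == image)
```

The width and height come first because the decoder needs them to know where each row of the flat pixel sequence ends.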
Some Image Compression Techniques:
Other Issues