Data Compression-Data Representation And Algorithm Design-Lecture Slides, Slides of Data Representation and Algorithm Design

This lecture was delivered by Dr. Ameet Shashank at B R Ambedkar National Institute of Technology. It relates to the Data Representation and Algorithm Design course. Its main points are: Data, Compression, Encoding, Decoding, Message, Encode, Decode, Communication, Ratio

Typology: Slides

2011/2012

Uploaded on 07/15/2012


Data Compression

Compression reduces the size of a file:
• To save space when storing it.
• To save time when transmitting it.
• Most files have lots of redundancy.

Who needs compression?
• Moore's law: # transistors on a chip doubles every 18-24 months.
• Parkinson's law: data expands to fill space available.
• Text, images, sound, video, …

Basic concepts ancient (1950s), best technology recently developed.

"All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value." - Carl Sagan

Applications of Data Compression

Generic file compression.
• Files: GZIP, BZIP, BOA.
• Archivers: PKZIP.
• File systems: NTFS.

Multimedia.
• Images: GIF, JPEG.
• Sound: MP3.
• Video: MPEG, DivX™, HDTV.

Communication.
• ITU-T T4 Group 3 Fax.
• V.42bis modem.

Databases. Google.

Encoding and Decoding

Message. Binary data M we want to compress.
Encode. Generate a "compressed" representation C(M), which hopefully uses fewer bits.
Decode. Reconstruct original message or some approximation M'.

M → Encoder → C(M) → Decoder → M'

Compression ratio. Bits in C(M) / bits in M.

Lossless. M = M', compression ratio 50-75% or lower.
Ex. Natural language, source code, executables.

Lossy. M ≈ M', compression ratio 10% or lower.
Ex. Images, sound, video.
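The definitions above can be tried out with a few lines of Java. The sketch below is only an illustration: it assumes the JDK's built-in `java.util.zip.Deflater` and `Inflater` as stand-ins for the encoder and decoder, and the class and method names (`RatioSketch`, `encode`, `decode`, `ratio`) are made up for this example. It measures the ratio in bytes rather than bits, which gives the same value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class RatioSketch {
    /** Encode: produce C(M) from M using the JDK's deflate implementation. */
    static byte[] encode(byte[] message) {
        Deflater deflater = new Deflater();
        deflater.setInput(message);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decode: reconstruct M' from C(M); for a lossless scheme M' equals M. */
    static byte[] decode(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compression ratio: size of C(M) divided by size of M. */
    static double ratio(byte[] message, byte[] compressed) {
        return (double) compressed.length / message.length;
    }

    public static void main(String[] args) {
        // A deliberately redundant message (hypothetical test data).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("abracadabra ");
        byte[] m = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] c = encode(m);
        System.out.println("lossless: " + Arrays.equals(m, decode(c)));
        System.out.println("ratio:    " + ratio(m, c));
    }
}
```

Highly repetitive input like this compresses far below the 50-75% quoted for typical lossless data; less redundant input gives a ratio closer to 1.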

Ancient Ideas

Ancient ideas.
• Braille.
• Morse code.
• Natural languages.
• Mathematical notation.
• Decimal number system.

"Poetry is the art of lossy data compression."

Natural Encoding

Natural encoding. (19 × 51) + 6 = 975 bits. (The extra 6 bits are needed to encode the number of characters per line.)

000000000000000000000000000011111111111111000000000
000000000000000000000000001111111111111111110000000
000000000000000000000001111111111111111111111110000
000000000000000000000011111111111111111111111111000
000000000000000000001111111111111111111111111111110
000000000000000000011111110000000000000000001111111
000000000000000000011111000000000000000000000011111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000001111000000000000000000000001110
000000000000000000000011100000000000000000000111000
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011000000000000000000000000000000000000000000000011

19-by-51 raster of letter 'q' lying on its side

Run-Length Encoding

Natural encoding. (19 × 51) + 6 = 975 bits.
Run-length encoding. (63 × 6) + 6 = 384 bits.

Run lengths for each row of the 19-by-51 raster of letter 'q' lying on its side (63 6-bit run lengths):

28 14 9
26 18 7
23 24 4
22 26 3
20 30 1
19 7 18 7
19 5 22 5
19 3 26 3
19 3 26 3
19 3 26 3
19 3 26 3
20 4 23 3 1
22 3 20 3 3
1 50
1 50
1 50
1 50
1 50
1 2 46 2

Run-length encoding (RLE).
• Exploit long runs of repeated characters.
• Binary alphabet: runs alternate between 0 and 1; output counts.
• "File inflation" possible if runs are short.

Applications.
• JPEG.
• ITU-T T4 fax machines. (black and white graphics)
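The per-row run lengths can be computed with a short routine. What follows is a minimal sketch, not the slides' own code: the names `RunLengthSketch`, `encodeRow`, and `decodeRow` are invented here, each row is assumed to be a string of '0' and '1' characters, and every row is assumed to start with a (possibly empty) run of 0s. Packing the counts into 6-bit fields is left out.

```java
import java.util.ArrayList;
import java.util.List;

class RunLengthSketch {
    /** Encodes one row as alternating run lengths, starting with the run of 0s. */
    static List<Integer> encodeRow(String row) {
        List<Integer> runs = new ArrayList<>();
        char current = '0';   // by convention the first run counts 0s
        int count = 0;
        for (int i = 0; i < row.length(); i++) {
            char bit = row.charAt(i);
            if (bit != current) {
                runs.add(count);   // close the current run (may be 0 if the row starts with 1)
                current = bit;
                count = 0;
            }
            count++;
        }
        runs.add(count);           // close the final run
        return runs;
    }

    /** Rebuilds the row from its run lengths. */
    static String decodeRow(List<Integer> runs) {
        StringBuilder row = new StringBuilder();
        char current = '0';
        for (int run : runs) {
            for (int i = 0; i < run; i++) row.append(current);
            current = (current == '0') ? '1' : '0';   // runs alternate between 0 and 1
        }
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeRow("0001100"));   // prints [3, 2, 2]
    }
}
```

A row with many short runs produces more counts than it has bits to spare, which is the "file inflation" case mentioned above.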

ITU-T T4 Group 3 Fax

Group 3 fax. Transmit image comprised of up to 1728 pels per line, typically mostly white. (pel = picture element, black or white)

RLE. Compute run-lengths of white and black pels.

Prefix-free code. Encode run-lengths using following prefix-free code.

run    white       black
0      00110101    0000110111
1      000111      010
2      0111        11
3      1000        10
…      …           …
63     00110100    000001100111
64     11011       0000001111
128    10010       000011001000
192    010111      000011001001
…      …           …
1728   010011011   0000001100101

Ex. 3W 1B 2W 2B 194W
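To make the example line concrete, here is a rough sketch that encodes it using only the table rows shown above. It relies on how T.4 handles long runs: a run of 64 or more pels is sent as the code for the largest multiple of 64 (a "make-up" code) followed by the code for the remainder 0-63. The class and method names (`FaxCodeSketch`, `encodeRun`, `encodeLine`) are hypothetical, and any run length missing from the partial table is rejected rather than guessed.

```java
import java.util.HashMap;
import java.util.Map;

class FaxCodeSketch {
    // Only the rows listed on the slide; the full T.4 table has many more entries.
    private static final Map<Integer, String> WHITE = new HashMap<>();
    private static final Map<Integer, String> BLACK = new HashMap<>();
    static {
        WHITE.put(0, "00110101");    BLACK.put(0, "0000110111");
        WHITE.put(1, "000111");      BLACK.put(1, "010");
        WHITE.put(2, "0111");        BLACK.put(2, "11");
        WHITE.put(3, "1000");        BLACK.put(3, "10");
        WHITE.put(63, "00110100");   BLACK.put(63, "000001100111");
        WHITE.put(64, "11011");      BLACK.put(64, "0000001111");
        WHITE.put(128, "10010");     BLACK.put(128, "000011001000");
        WHITE.put(192, "010111");    BLACK.put(192, "000011001001");
        WHITE.put(1728, "010011011"); BLACK.put(1728, "0000001100101");
    }

    private static String lookup(Map<Integer, String> table, int run) {
        String code = table.get(run);
        if (code == null) {
            throw new IllegalArgumentException("run length " + run + " is not in the partial table");
        }
        return code;
    }

    /** Encodes a single run; runs of 64 or more use a make-up code plus a remainder code. */
    static String encodeRun(int length, boolean white) {
        Map<Integer, String> table = white ? WHITE : BLACK;
        StringBuilder bits = new StringBuilder();
        if (length >= 64) {
            int makeUp = (length / 64) * 64;
            bits.append(lookup(table, makeUp));
            length -= makeUp;
        }
        bits.append(lookup(table, length));
        return bits.toString();
    }

    /** Encodes a scan line given as alternating white/black run lengths, white first. */
    static String encodeLine(int... runs) {
        StringBuilder bits = new StringBuilder();
        boolean white = true;
        for (int run : runs) {
            bits.append(encodeRun(run, white));
            white = !white;
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // 3W 1B 2W 2B 194W, where 194W = 192W make-up code + 2W remainder code
        System.out.println(encodeLine(3, 1, 2, 2, 194));
    }
}
```

Under these assumptions the line 3W 1B 2W 2B 194W becomes 1000 010 0111 11 010111 0111 (spaces added for readability): 23 bits for 202 pels.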

Prefix-Free Code: Encoding and Decoding

How to represent? Use a binary trie.
• Symbols are stored in leaves.
• Encoding is path to leaf.

Encoding.
• Method 1: start at leaf; follow path up to the root, and print bits in reverse order.
• Method 2: create ST (symbol table) of symbol-encoding pairs.

Decoding.
• Start at root of tree.
• Go left if bit is 0; go right if 1.
• If leaf node, print symbol and return to root.

char  encoding
a     0
b     111
c     1011
d     100
r     110
!     1010

How to Transmit the Trie

How to transmit the trie?
• Send preorder traversal of trie.
  – we use * as sentinel for internal nodes
  – what if there is no sentinel?
• Send number of characters to decode.
• Send bits (packed 8 to the byte).
• If message is long, overhead of sending trie is small.

Preorder traversal of the trie for the code above: *a**d*!c*rb

Prefix-Free Decoding Implementation

public class HuffmanDecoder {
    private Node root = new Node();

    private class Node {
        char ch;
        Node left, right;

        // build tree from preorder traversal, e.g. *a**d*!c*rb
        Node() {
            ch = StdIn.readChar();
            if (ch == '*') {
                left  = new Node();
                right = new Node();
            }
        }

        // body elided on the slide; a node is assumed internal when it has children
        boolean isInternal() {
            return left != null && right != null;
        }
    }

    public void decode() {
        int N = StdIn.readInt();                  // number of characters to decode
        for (int i = 0; i < N; i++) {
            Node x = root;
            while (x.isInternal()) {
                // use bits in real applications instead of chars;
                // any character other than '0' or '1' (e.g. a newline) is skipped
                char bit = StdIn.readChar();
                if      (bit == '0') x = x.left;
                else if (bit == '1') x = x.right;
            }
            System.out.print(x.ch);
        }
    }
}
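The decoder above reads from `StdIn`, a helper class that is not shown in these slides. As a self-contained illustration, the sketch below builds the same kind of trie from a preorder string and then both encodes (Method 2, a symbol table of symbol-encoding pairs) and decodes. The class name `PrefixFreeSketch`, its methods, and the sample message `abracadabra!` are assumptions made for this example; bits are kept as '0'/'1' characters for readability.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixFreeSketch {
    private static final class Node {
        final char ch;
        final Node left, right;
        Node(char ch, Node left, Node right) { this.ch = ch; this.left = left; this.right = right; }
        boolean isInternal() { return left != null && right != null; }
    }

    private final Node root;
    private int pos = 0;   // cursor into the preorder string while the trie is built

    /** Builds the trie from a preorder traversal in which '*' marks internal nodes. */
    PrefixFreeSketch(String preorder) {
        root = build(preorder);
    }

    private Node build(String preorder) {
        char ch = preorder.charAt(pos++);
        if (ch != '*') return new Node(ch, null, null);   // leaf holds a symbol
        Node left = build(preorder);
        Node right = build(preorder);
        return new Node(ch, left, right);
    }

    /** Symbol table of symbol-encoding pairs: the path to each leaf (0 = left, 1 = right). */
    Map<Character, String> codeTable() {
        Map<Character, String> table = new HashMap<>();
        collect(root, "", table);
        return table;
    }

    private static void collect(Node x, String path, Map<Character, String> table) {
        if (!x.isInternal()) { table.put(x.ch, path); return; }
        collect(x.left, path + "0", table);
        collect(x.right, path + "1", table);
    }

    String encode(String message) {
        Map<Character, String> table = codeTable();
        StringBuilder bits = new StringBuilder();
        for (char c : message.toCharArray()) {
            String code = table.get(c);
            if (code == null) throw new IllegalArgumentException("no code for symbol " + c);
            bits.append(code);
        }
        return bits.toString();
    }

    /** Decodes n symbols: walk from the root, emit the leaf's symbol, return to the root. */
    String decode(String bits, int n) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        for (int k = 0; k < n; k++) {
            Node x = root;
            while (x.isInternal()) {
                x = (bits.charAt(i++) == '0') ? x.left : x.right;
            }
            out.append(x.ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PrefixFreeSketch code = new PrefixFreeSketch("*a**d*!c*rb");
        String bits = code.encode("abracadabra!");
        System.out.println(bits);
        System.out.println(code.decode(bits, 12));
    }
}
```

Because no codeword is a prefix of another, the decoder never needs separators between codewords; it only needs the number of symbols so that it knows when to stop.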

Huffman Codes

Huffman Coding

Q. How to create a good prefix-free code?

A. Huffman code. [David Huffman, 1952]

To compute Huffman code:
• Count frequencies p_s for each symbol s in message.
• Start with a node corresponding to each symbol s with weight p_s.
• Repeat:
  – select two trees with min weight p_1 and p_2
  – merge into single tree with weight p_1 + p_2

Applications. JPEG, MP3, MPEG, PKZIP, GZIP, …
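The three steps above map directly onto a priority queue. The following is a minimal sketch under stated assumptions rather than the lecture's own implementation: the names `HuffmanSketch`, `buildCode`, and `encodedLength` are invented here, ties between equal weights are broken arbitrarily (so individual codewords may differ from run to run of a different implementation, although the total encoded length does not), and a message with a single distinct symbol is given the one-bit code "0".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    /** Trie node: leaves carry a symbol, internal nodes carry two subtrees. */
    private static final class Node implements Comparable<Node> {
        final char ch;
        final int weight;
        final Node left, right;

        Node(char ch, int weight, Node left, Node right) {
            this.ch = ch; this.weight = weight; this.left = left; this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        @Override
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    /** Builds a Huffman code (symbol -> bit string) for the symbols of the message. */
    static Map<Character, String> buildCode(String message) {
        // Step 1: count frequencies p_s for each symbol s.
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : message.toCharArray()) freq.merge(c, 1, Integer::sum);

        // Step 2: start with one single-node tree per symbol, weighted by p_s.
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            pq.add(new Node(e.getKey(), e.getValue(), null, null));
        }

        // Step 3: repeatedly merge the two trees of minimum weight.
        while (pq.size() > 1) {
            Node a = pq.poll();
            Node b = pq.poll();
            pq.add(new Node('\0', a.weight + b.weight, a, b));
        }

        Map<Character, String> code = new HashMap<>();
        Node root = pq.poll();
        if (root != null) assign(root, "", code);
        return code;
    }

    private static void assign(Node x, String path, Map<Character, String> code) {
        if (x.isLeaf()) {
            code.put(x.ch, path.isEmpty() ? "0" : path);   // lone symbol still needs one bit
            return;
        }
        assign(x.left, path + "0", code);
        assign(x.right, path + "1", code);
    }

    /** Total number of bits the code spends on the message. */
    static int encodedLength(String message, Map<Character, String> code) {
        int bits = 0;
        for (char c : message.toCharArray()) bits += code.get(c).length();
        return bits;
    }

    public static void main(String[] args) {
        String message = "abracadabra!";   // hypothetical sample message
        Map<Character, String> code = buildCode(message);
        System.out.println(code);
        System.out.println(encodedLength(message, code) + " bits");
    }
}
```

For the sample message the total is 28 bits, the same as the prefix-free code a = 0, b = 111, c = 1011, d = 100, r = 110, ! = 1010 would spend on it.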

Huffman Coding Example

[Figure: table with columns Char, Freq, and Huff for the characters E, T, A, O, I, N, R, H, L, D, C, U, S (total frequency 838), shown next to the resulting Huffman trie.]

What Data Can be Compressed?

Theorem. Impossible to losslessly compress all files.

Pf.
• Consider all 1,000 bit messages.
• 2^1000 possible messages.
• Only 2^999 + 2^998 + … + 1 can be encoded with ≤ 999 bits.
• Only 1 in 2^499 can be encoded with ≤ 500 bits!

A Difficult File To Compress

gcglmklklamcnieffonbhjjoeflmmkggjdnccojiciicdnlfhmkcgplchjcfecncianbicdkmjmagmnoolbnkiehgklbpgnoabdcajnbfnbgejciocoenebeilephfgglcmjpinihhkpkpaemndeffflplahcgjlmfgeomjdmecmagaabhplcbjpcie …

One million pseudo-random characters (a – p)

% javac Rand.java
% java Rand > temp.txt
% compress -c temp.txt > temp.Z
% gzip -c temp.txt > temp.gz
% bzip2 -c temp.txt > temp.bz2
% ls -l
    231 Rand.java
1000000 temp.txt
 576861 temp.Z
 570872 temp.gz
 499329 temp.bz2

resulting file sizes (bytes)

public class Rand {
    public static void main(String[] args) {
        for (int i = 0; i < 1000000; i++) {
            char c = 'a';
            c += (char) (Math.random() * 16);   // one of the 16 letters a-p
            System.out.print(c);
        }
    }
}

231 bytes, but its output is hard to compress (assume random seed is fixed)

Information Theory

Intrinsic difficulty of compression.
• Short program generates large data file.
• Optimal compression algorithm has to discover program!
• Undecidable problem.

Q. How do we know if our algorithm is doing well?
A. Want lower bound on # bits required by any compression scheme.

Language Model

Q. How do compression algorithms work?
A. Exploit statistical biases of input messages.
• White patches occur in typical images.
• Word Princeton occurs more frequently than Yale.

Compression is all about probability.
• Formulate probabilistic model to predict symbols.
  – simple: character counts, repeated strings
  – complex: models of a human face
• Use model to encode message.
• Use same model to decode message.

Ex. Order 0 Markov model: each symbol s generated independently at random, with fixed probability p(s).

Entropy

Entropy. [Shannon, 1948]
• Information content of symbol s is proportional to -log2 p(s).
• Weighted average of information content over all symbols:

$$H(S) = -\sum_{s \in S} p(s) \log_2 p(s)$$

• Interface between coding and model.

          p(a)    p(b)    H(S)
Model 1   1/2     1/2     1
Model 2   0.900   0.100   0.469
Model 3   0.990   0.010   0.0808
Model 4   1       0       0

          p(a)   p(b)   p(c)   p(d)   p(e)   p(f)   H(S)
Fair die  1/6    1/6    1/6    1/6    1/6    1/6    2.585
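The table entries can be checked numerically. This is a small sketch of the entropy formula for an order 0 model; the class and method names (`EntropySketch`, `entropy`) are made up for the example, and symbols with probability 0 are taken to contribute nothing to the sum.

```java
class EntropySketch {
    /** Shannon entropy in bits per symbol: H(S) = -sum over s of p(s) * log2 p(s). */
    static double entropy(double... probabilities) {
        double h = 0.0;
        for (double p : probabilities) {
            if (p > 0.0) {                               // a symbol that never occurs adds nothing
                h -= p * (Math.log(p) / Math.log(2.0));  // log2 via change of base
            }
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(0.5, 0.5));      // Model 1
        System.out.println(entropy(0.9, 0.1));      // Model 2
        System.out.println(entropy(0.99, 0.01));    // Model 3
        System.out.println(entropy(1.0, 0.0));      // Model 4
        double sixth = 1.0 / 6.0;
        System.out.println(entropy(sixth, sixth, sixth, sixth, sixth, sixth));   // fair die
    }
}
```

The more skewed the distribution, the lower the entropy: Model 3 needs under a tenth of a bit per symbol on average, while the fair die needs about 2.585.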

Entropy and Compression

Theorem. [Shannon, 1948] If data source is an order 0 Markov model, any compression scheme must use ≥ H(S) bits per symbol on average.
• Cornerstone result of information theory.
• Ex: to transmit results of fair die, need ≥ 2.58 bits per roll.

Theorem. [Huffman, 1952] If data source is an order 0 Markov model, Huffman code uses ≤ H(S) + 1 bits per symbol on average.

Q. Is there any hope of doing better than Huffman coding?
A. Yes. Huffman wastes up to 1 bit per symbol.
  – if H(S) is close to 0, this matters
  – can do better with "arithmetic coding"
A. Yes. Source may not be order 0 Markov model.

Entropy of the English Language

Q. How much redundancy is in the English language?
A. Quite a bit.

"... randomising letters in the middle of words [has] little or no effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase the thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang."

LZW Algorithm

Lempel-Ziv-Welch. [variant of LZ78]
• Create ST and associate an integer with each useful string.
• When input matches string in ST, output associated integer.

Encoding.
• Find longest string s in ST that is a prefix of remaining part of string to compress.
• Output integer associated with s.
• Add s + x to dictionary, where x is next char in string to compress.

Ex. Dictionary: a, aa, ab, aba, abb, abaa, abaab, abaaa.
• String to be compressed: abaababbb…
• s = abaab, x = a.
• Output integer associated with s; insert abaaba into ST.

LZW Example

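[Figure: LZW example. A dictionary table (Word, Index) starts with the single characters plus SEND (256) and STOP (257); as the input is encoded, the strings it, tt, ty, y_, _b, bi, itt (264), ty_, _bi, it_, and _bin are added. A second table (Input, Send) lists each input character next to the code that is sent, ending with STOP (257).]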

LZW Implementation

Implementation.
• Use trie to create symbol table on-the-fly.
• Note that prefix of every word is also in ST.

Encode.
• Lookup string suffix in trie.
• Output ST index at bottom.
• Add new node to bottom of trie.

Decode.
• Lookup index in array.
• Output string.
• Insert string + next letter.

[Figure: trie holding the strings aa, ab, aba, abb, abaa, abaab, abaaba for the input abaaba…]

LZW Encoder: Java Implementation

public class LZWEncoder {
    public static void main(String[] args) {
        String text = StdIn.readAll();
        StringST st = new StringST();
        int i;
        for (i = 0; i < 256; i++) {
            String s = Character.toString((char) i);
            st.put(s, i);
        }
        while (text.length() > 0) {
            String s = st.prefix(text);             // longest prefix match
            System.out.println(st.get(s));          // in real applications, encode integers in binary
            int length = s.length();
            if (length < text.length())
                st.put(text.substring(0, length + 1), i++);
            text = text.substring(length);          // not the most efficient way
        }
    }
}
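The encoder above depends on helper classes (`StdIn`, `StringST`) that are not included in these slides. As a rough self-contained sketch, the version below substitutes `java.util.HashMap` and `java.util.ArrayList` for them and works on in-memory strings and lists; the class name `LzwSketch` and its methods are hypothetical. It also includes a decoder that follows the same logic as the decoder slide, including the special case where a code arrives before the decoder has added it to its table. Input characters are assumed to fit in 8 bits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LzwSketch {
    static List<Integer> encode(String text) {
        Map<String, Integer> st = new HashMap<>();
        for (int c = 0; c < 256; c++) st.put(Character.toString((char) c), c);
        int next = 256;                       // next unused code
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Longest prefix match: grow the candidate while it is still in the table.
            // This works because every prefix of a table entry is also in the table.
            int end = pos + 1;
            while (end < text.length() && st.containsKey(text.substring(pos, end + 1))) end++;
            String s = text.substring(pos, end);
            Integer code = st.get(s);
            if (code == null) throw new IllegalArgumentException("character outside 0-255: " + s);
            out.add(code);
            if (end < text.length()) st.put(text.substring(pos, end + 1), next++);   // add s + x
            pos = end;
        }
        return out;
    }

    static String decode(List<Integer> codes) {
        if (codes.isEmpty()) return "";
        List<String> st = new ArrayList<>();  // index -> string
        for (int c = 0; c < 256; c++) st.add(Character.toString((char) c));
        String prev = st.get(codes.get(0));
        StringBuilder out = new StringBuilder(prev);
        for (int k = 1; k < codes.size(); k++) {
            int code = codes.get(k);
            // Special case: the code refers to the entry that is about to be added.
            String s = (code == st.size()) ? prev + prev.charAt(0) : st.get(code);
            out.append(s);
            st.add(prev + s.charAt(0));       // insert previous string + next letter
            prev = s;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = encode("ababababab");
        System.out.println(codes);            // prints [97, 98, 256, 258, 257, 98]
        System.out.println(decode(codes));    // prints ababababab
    }
}
```

In the sample run the code 258 reaches the decoder while its table only goes up to 257, which is exactly the special case the decoder has to handle.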

LZW Decoder: Java Implementation

public class LZWDecoder {
    public static void main(String[] args) {
        ST<Integer, String> st = new ST<Integer, String>();
        int i;
        for (i = 0; i < 256; i++) {
            String s = Character.toString((char) i);
            st.put(i, s);
        }
        int code = StdIn.readInt();                 // in real applications, integers will be encoded in binary
        String prev = st.get(code);
        System.out.print(prev);
        while (!StdIn.isEmpty()) {
            code = StdIn.readInt();
            String s = st.get(code);
            if (i == code) s = prev + prev.charAt(0);   // special case, e.g., for "ababababab"
            System.out.print(s);
            st.put(i++, prev + s.charAt(0));
            prev = s;
        }
    }
}

LZW Implementation Details

What to do when ST gets too large?
• Throw away and start over. (GIF)
• Throw away when not effective. (Unix compress)

LZW in the Real World

Lempel-Ziv and friends.
• LZ77. (not patented ⇒ widely used in open source)
• LZ78.
• LZW. (patent #4,558,302 expired in US on June 20, 2003; some versions copyrighted)
• Deflate = LZ77 variant + Huffman.

PNG: LZ77.
Winzip, gzip, jar: deflate.
Unix compress: LZW.
Pkzip: LZW + Shannon-Fano.
GIF, TIFF, V.42bis modem (never expands a file): LZW.
Google: zlib which is based on deflate.

Summary

Lossless compression.
• Simple approaches. [RLE]
• Represent fixed length symbols with variable length codes. [Huffman]
• Represent variable length symbols with fixed length codes. [LZW]

Lossy compression. [not covered in this course]
• JPEG, MPEG, MP3.
• FFT, wavelets, fractals, SVD, …

Limits on compression. Shannon entropy.

Theoretical limits closely match what we can achieve in practice!