Lecture 4: Lossless Compression - Arithmetic Code & Dictionary Techniques, Study notes of Electrical and Electronics Engineering

A set of slides from a lecture in the ECE 8873 Data Compression and Modeling course at Georgia Tech, Spring 2004. The lecture covers lossless compression techniques, specifically arithmetic coding and dictionary methods. Arithmetic coding maps a whole sequence of symbols to a single number (a tag) in the unit interval, with a code length governed by the probability of the sequence. Dictionary methods, such as LZ77 and LZ78, encode repeated patterns as references to earlier occurrences or to dictionary entries. The lecture also compares the efficiency of Huffman and arithmetic codes.

Uploaded on 08/05/2009

Spring 2004 ECE-8873 B. H. Juang Copyright 2004 Lecture #4, Slide #1
ECE 8873
Data Compression and Modeling
Lecture 4:
Lossless Compression – Arithmetic Code and
Dictionary Techniques
School of Electrical and Computer Engineering
Georgia Institute of Technology
Spring, 2004
Arithmetic Coding
• A skewed source often has low entropy.
• A source with a small alphabet also has low entropy in most practical situations.
• Huffman codes are inefficient for skewed sources; their deviation from optimality depends on $P_{\max}$, the maximum of the symbol probabilities.
• An extended Huffman code may improve efficiency, but its implementation requires a large table.
• Is it possible to sequentially construct only the part of the table that is needed?
Arithmetic Code
Key mechanism:
– Code k-tuples of input symbols.
– For each k-tuple input, generate a tag, a real number in (0,1), according to the probabilities in the alphabet; the tag can be generated sequentially.
– Represent each tag in binary code with length commensurate (inversely) with the probability of the coded k-tuple; the binary code can be truncated without compromising decodability, owing to the tag structure.

Let $A = \{a_1, a_2, \ldots, a_m\}$ and $X(a_i) = i$, and
$$P(X = i) = P(a_i), \qquad F_X(i) = \sum_{k=1}^{i} P(X = k).$$
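Mapping a Symbol Sequence into a Tag
Let $A = \{a_1, a_2, a_3\}$ and $P(a_1) = 0.7$, $P(a_2) = 0.1$, $P(a_3) = 0.2$.
For sequence $(a_1, a_2, a_3, a_3, \ldots)$, $T[(a_1, a_2, a_3, a_3)] = 0.5586$.

[Figure: successive subdivision of the unit interval]
Step         Interval          Subdivision boundaries
start        [0, 1.0)          0, 0.7, 0.8, 1.0
after a_1    [0, 0.7)          0, 0.49, 0.56, 0.7
after a_2    [0.49, 0.56)      0.49, 0.539, 0.546, 0.56
after a_3    [0.546, 0.56)     0.546, 0.5558, 0.5572, 0.56
after a_3    [0.5572, 0.56)    midpoint (tag) = 0.5586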

Generating and Deciphering the Tag
Let $l^{(n)}$ and $u^{(n)}$ be the lower and upper bounds of the tag associated with a sequence of $n$ symbols, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.
These bounds can be computed via the following recursion:
$$l^{(n)} = l^{(n-1)} + \left(u^{(n-1)} - l^{(n-1)}\right) F_X(x_n - 1)$$
$$u^{(n)} = l^{(n-1)} + \left(u^{(n-1)} - l^{(n-1)}\right) F_X(x_n)$$
with $l^{(0)} = 0$ and $u^{(0)} = 1$.
The same recursion is used to convert a tag back into the sequence of symbols. This requires knowing the cumulative probabilities $F_X(x)$.
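The bound recursion can be tried out on the three-symbol example from the tag-mapping slide (probabilities 0.7, 0.1, 0.2). The following is a minimal sketch, not the lecture's implementation: the function names are hypothetical, symbols are assumed to be numbered 1..m, exact fractions are used to avoid rounding, and the tag is taken as the midpoint of the final interval.

```python
from fractions import Fraction

def cumulative(probs):
    """Return [F(0), F(1), ..., F(m)] with F(0) = 0 for symbols numbered 1..m."""
    cdf = [Fraction(0)]
    for p in probs:
        cdf.append(cdf[-1] + p)
    return cdf

def tag_interval(symbols, probs):
    """Narrow [low, high) once per symbol using the lower/upper bound recursion."""
    cdf = cumulative(probs)
    low, high = Fraction(0), Fraction(1)
    for x in symbols:
        width = high - low
        # Both updates use the previous lower bound, so update high first.
        high = low + width * cdf[x]
        low = low + width * cdf[x - 1]
    return low, high

probs = [Fraction(7, 10), Fraction(1, 10), Fraction(2, 10)]
low, high = tag_interval([1, 2, 3, 3], probs)
tag = (low + high) / 2  # midpoint of the final interval (an assumed choice of tag)
print(float(low), float(high), float(tag))
```

For the sequence $(a_1, a_2, a_3, a_3)$ the final interval is [0.5572, 0.56) and the midpoint is 0.5586, in agreement with the slide's tag value.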

The Deciphering Algorithm
1. Initialize $l^{(0)} = 0$ and $u^{(0)} = 1$.
2. For each $k$, find $t^* = (\text{tag} - l^{(k-1)}) / (u^{(k-1)} - l^{(k-1)})$.
3. Find the value of $x_k$ for which $F_X(x_k - 1) \le t^* < F_X(x_k)$.
4. Update $l^{(k)}$ and $u^{(k)}$.
5. Repeat steps 2-4 until the entire sequence is found, $k = 1, 2, \ldots, n$.
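The deciphering steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `decode_tag` is a hypothetical name, the decoder is told the sequence length `n` in advance, and symbols are numbered 1..m. It uses the example probabilities 0.7, 0.1, 0.2 and the tag 0.5586 from the tag-mapping slide.

```python
from fractions import Fraction

def decode_tag(tag, probs, n):
    """Recover n symbols (numbered 1..m) from a tag by rescaling and interval lookup."""
    cdf = [Fraction(0)]
    for p in probs:
        cdf.append(cdf[-1] + p)
    low, high = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(n):
        # Step 2: rescale the tag into the current interval.
        t = (tag - low) / (high - low)
        # Step 3: find x with F(x - 1) <= t < F(x).
        x = next(k for k in range(1, len(cdf)) if cdf[k - 1] <= t < cdf[k])
        decoded.append(x)
        # Step 4: update the bounds exactly as the encoder does.
        width = high - low
        high = low + width * cdf[x]
        low = low + width * cdf[x - 1]
    return decoded

probs = [Fraction(7, 10), Fraction(1, 10), Fraction(2, 10)]
decoded = decode_tag(Fraction(5586, 10000), probs, 4)
print(decoded)  # [1, 2, 3, 3]
```

In practice the decoder needs some way to know when to stop (a known length, as assumed here, or an end-of-sequence symbol).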

Another Way to Look at Arithmetic Code
[Figure: the unit interval partitioned by $a_1$, $a_2$, $a_3$ and, at the next level, by the pairs $a_1a_1, a_1a_2, \ldots, a_3a_3$; the value 0.49 and a tag are marked.]
Use binary format for the tag, but truncate it to $l(\mathbf{x}) = \left\lceil \log \frac{1}{P(\mathbf{x})} \right\rceil + 1$ bits.
The truncation, while causing a deviation from the original tag value, will not move the value out of the range the tag was originally in, thereby maintaining decodability.

Binary Code Assignment
Symbol   $F_X$    $\bar{T}_X$   In Binary   $\lceil \log \frac{1}{P(x)} \rceil + 1$   Code
1        .5       .25           .010        2                                          01
2        .75      .625          .101        3                                          101
3        .875     .8125         .1101       4                                          1101
4        1.0      .9375         .1111       4                                          1111

Each tag in [0,1) is represented by a binary fractional number and truncated to $l(\mathbf{x}) = \left\lceil \log \frac{1}{P(\mathbf{x})} \right\rceil + 1$ bits.
This binary code is uniquely decodable because the truncated tag always remains in $[F_X(\mathbf{x} - 1), F_X(\mathbf{x}))$ and it is a prefix code.
Average code length: $H(X) \le l_A < H(X) + \frac{2}{m}$, where $m$ is the block length.
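The truncation rule can be checked numerically. The sketch below assumes a four-symbol source with probabilities 1/2, 1/4, 1/8, 1/8 (an assumption, chosen because it yields the codewords 01, 101, 1101, 1111 listed on the slide), takes each tag as the midpoint of the symbol's interval, and keeps the first $\lceil \log_2 (1/P) \rceil + 1$ fractional bits. The function name is hypothetical.

```python
from fractions import Fraction
from math import ceil, log2

def truncated_tag_codes(probs):
    """Return one codeword per symbol: the midpoint tag truncated to ceil(log2(1/p)) + 1 bits."""
    codes = []
    low = Fraction(0)
    for p in probs:
        tag = low + p / 2                  # midpoint of [F(x-1), F(x))
        length = ceil(log2(1 / p)) + 1     # code length in bits
        # int() floors tag * 2**length, i.e. keeps the first `length` fractional bits.
        bits = int(tag * 2 ** length)
        codes.append(format(bits, f"0{length}b"))
        low += p
    return codes

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
codes = truncated_tag_codes(probs)
print(codes)  # ['01', '101', '1101', '1111']
```

No codeword is a prefix of another here, which is the property the slide relies on. Note that `log2` works in floating point, so for probabilities that are not exact powers of two the length computation may need more care.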

LZ77 Example
Sequence: c a b r a c a d a b r a r r a r r a d
[Figure: the sequence annotated with offset (o) and length (l) markers for each match.]
Triples [offset, length, code of next symbol] and the text segments shown with them:
[0,0,C(d)]   c a b r a c a d
[7,4,C(r)]   a b r a
[2,1,C(r)]   r
[3,5,C(d)]   r a r r a
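A sliding-window encoder of this kind can be sketched as follows. The slide does not state the buffer sizes, so the search buffer of 7 symbols, the lookahead buffer of 6 symbols, and the choice to start with the first seven symbols already in the search buffer are all assumptions, as are the function names. With these assumptions the sketch emits three triples, so it is not claimed to reproduce every triple in the figure above.

```python
def lz77_encode(data, search_size=7, lookahead_size=6, start=0):
    """Encode data[start:] as (offset, length, next_symbol) triples."""
    pos, triples = start, []
    while pos < len(data):
        best_off, best_len = 0, 0
        # Reserve one symbol of the lookahead for the explicit next_symbol.
        max_len = min(lookahead_size - 1, len(data) - pos - 1)
        for off in range(1, min(search_size, pos) + 1):
            length = 0
            # A match may run past pos into the lookahead buffer itself.
            while length < max_len and data[pos - off + length] == data[pos + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        triples.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return triples

def lz77_decode(triples, prefix=""):
    """Rebuild the text by copying `length` symbols from `offset` back, then the next symbol."""
    out = list(prefix)
    for off, length, symbol in triples:
        for _ in range(length):
            out.append(out[-off])
        out.append(symbol)
    return "".join(out)

text = "cabracadabrarrarrad"
triples = lz77_encode(text, start=7)
print(triples)
print(lz77_decode(triples, prefix=text[:7]) == text)
```

The decoder copies one symbol at a time, which is what allows a match length (5) to exceed its offset (3) in the last triple.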

LZ77
• Asymptotic optimality: approaches entropy.
• Assumes that recurrences of patterns happen in recent memory.
• Variations:
  – Encode the triple with a variable-length code: PKZip, Zip, Lharc, PNG, gzip, and ARJ.
  – Improved buffer search algorithms: hash table, …
  – Use a flag bit in the case of no match instead of the original triple.

LZ78
• Avoids the performance dependency on the buffer length found in LZ77.
• Codebook may grow unbounded unless constrained.

Input sequence: wabba_wabba_wabba_wabba_woo_
Initial codebook: empty
Index   Encoded output   Dictionary entry
1       <0,C(w)>         w
2       <0,C(a)>         a
3       <0,C(b)>         b
4       <3,C(a)>         ba
5       <0,C(_)>         _
6       <1,C(a)>         wa
7       <3,C(b)>         bb
8       <2,C(_)>         a_
9       <6,C(b)>         wab
10      <4,C(_)>         ba_
11      <9,C(b)>         wabb
12      <8,C(w)>         a_w
13      <0,C(o)>         o
14      <13,C(_)>        o_
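The dictionary construction can be reproduced with a short sketch. It assumes an initially empty dictionary with indices starting at 1 and index 0 meaning "no match"; the function name is hypothetical.

```python
def lz78_encode(text):
    """Return (pairs, dictionary) where each pair is (index of longest known prefix, next symbol)."""
    dictionary = {}            # phrase -> index, first entry gets index 1
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch       # keep extending the match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                 # input ended in the middle of a known phrase
        output.append((dictionary[phrase], ""))
    return output, dictionary

pairs, dictionary = lz78_encode("wabba_wabba_wabba_wabba_woo_")
for index, (pair, entry) in enumerate(zip(pairs, dictionary), start=1):
    print(index, pair, entry)
```

Run on the input sequence above, this yields the 14 pairs from <0,C(w)> through <13,C(_)> and the entries w, a, b, ba, _, wa, bb, a_, wab, ba_, wabb, a_w, o, o_.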

The Lempel-Ziv-Welch (LZW) Algorithm
• The 2nd element of the code is not transmitted; only the index is sent.
• If p is in the dictionary but pa is not, augment the dictionary with pa.

Input sequence: wabba_wabba_wabba_wabba_woo_woo_woo

Initial primed dictionary:
index   1   2   3   4   5
entry   _   a   b   o   w

Dictionary after encoding:
index   1   2   3   4   5   6    7    8    9    10   11   12    13
entry   _   a   b   o   w   wa   ab   bb   ba   a_   _w   wab   bba
index   14    15     16    17    18    19     20   21   22   23    24    25
entry   a_w   wabb   ba_   _wa   abb   ba_w   wo   oo   o_   _wo   oo_   _woo

Encoded sequence: 5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23
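The rule "if p is in the dictionary but pa is not, output the index of p and add pa" can be sketched as below. The function name is hypothetical, and the dictionary is assumed to be primed with the five symbols in the order _, a, b, o, w.

```python
def lzw_encode(text, alphabet):
    """Return (indices, dictionary); the dictionary is primed with the alphabet, indices from 1."""
    dictionary = {s: i for i, s in enumerate(alphabet, start=1)}
    output, p = [], ""
    for a in text:
        if p + a in dictionary:
            p += a                               # pa is known: keep growing the pattern
        else:
            output.append(dictionary[p])         # send only the index of p
            dictionary[p + a] = len(dictionary) + 1
            p = a                                # restart from the unmatched symbol
    if p:
        output.append(dictionary[p])             # flush the last pattern
    return output, dictionary

indices, table = lzw_encode("wabba_wabba_wabba_wabba_woo_woo_woo", ["_", "a", "b", "o", "w"])
print(indices)
print(table["_woo"])
```

The output begins 5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23, and the dictionary grows to 25 entries ending with _woo. The sketch also emits a final index 4 for the last o, which is flushed when the input ends.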

Predictive Coding
• Symbols or letters in the sequence may have recursive dependency.
• Conventional variable-length coding may not fully capture this type of redundancy.
• Instead of coding each symbol in a memoryless fashion (even when sophisticated parsing is involved), predict the symbol based on information that the decoder also possesses and can use, obtain the discrepancy between the input and the predicted symbol, and code that discrepancy. (Recall the structure of information and the representation of it.)
• The decoder combines the decoded discrepancy with its version of the predicted result to recover the original.
• Cleary & Witten (1984) – predictive coding with partial match (PPM, prediction by partial matching)

Structure in Information
• Again, work on (or find) the structure first.
• At times, information structure is recursive.
  $x_i$: (1 2 5 7 1 3 0 -5 -3 -1 1 -2 -7 -4 -2 1 3 4)
  $y_i = x_i - x_{i-1} - 2$: (1 -1 1 0 -8 0 -5 -7 0 0 0 -5 -7 1 0 1 0 -1)
• Two options to transform the coding task:
  – “Predict” the current symbol based on past symbols; code the residual
    $$\Delta_i = x_i - \tilde{x}_i = x_i - f(x_{i-1}, x_{i-2}, \ldots) = x_i - f(H_i)$$
  – “Adapt” the probability distribution based on past symbols; i.e., use $\{P(x_i \mid x_{i-1}, x_{i-2}, \ldots)\}$
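The "predict, then code the residual" option can be illustrated on the first sequence listed on this slide. The sketch assumes the predictor $\tilde{x}_i = x_{i-1} + 2$ (one reading of the slide's relation between the two sequences) and assumes the first sample is sent unchanged; the function names are hypothetical.

```python
def residuals(x, step=2):
    """Residual of each sample against the prediction prev + step; first sample is kept as is."""
    out, prev = [], None
    for value in x:
        prediction = 0 if prev is None else prev + step
        out.append(value - prediction)
        prev = value
    return out

def reconstruct(y, step=2):
    """Invert residuals(): the decoder forms the same prediction and adds the residual."""
    out, prev = [], None
    for r in y:
        prediction = 0 if prev is None else prev + step
        prev = prediction + r
        out.append(prev)
    return out

x = [1, 2, 5, 7, 1, 3, 0, -5, -3, -1, 1, -2, -7, -4, -2, 1, 3, 4]
y = residuals(x)
print(y)
print(reconstruct(y) == x)
```

Under this predictor, 7 of the 18 residuals are zero and the nonzero values repeat often (for example -5 and -7 each appear twice), whereas the samples themselves are spread over more distinct values, which is the kind of skew a variable-length code can exploit.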

General Block Diagram
[Figure: block diagram with input $X_i$, Predictor output $\tilde{X}_i$, Encoder (& Decoder) output $\hat{\Delta}_i$, and reconstruction $\hat{X}_i$.]
$$\Delta_i = x_i - \tilde{x}_i = x_i - f(\hat{x}_{i-1}, \hat{x}_{i-2}, \ldots) = x_i - f(H_i)$$
$$\hat{\Delta}_i = \beta(\alpha(\Delta_i)) = \Delta_i \quad \text{in case of lossless coding}$$
$$\hat{x}_i = \tilde{x}_i + \hat{\Delta}_i$$
But the notion of prediction can also be applied to probability measures: foresee the change in distribution based on what has already been observed.

Run Length Encoding
• Simple but still useful.
• Binary (e.g., FAX): instead of sending 1's and 0's (mostly 1's for white, with 0's for black), send the lengths of the runs of white.
• Often used for data with many repeated symbols, e.g., FAX and quantized transform coefficients (DCT, wavelet).
• Can be combined with other methods; e.g., JPEG does lossless coding (Huffman or arithmetic) of combined quantizer levels/run lengths.
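A run-length coder takes only a few lines. This sketch stores (symbol, run length) pairs, which is a slight generalization of the binary FAX case described above, where only the lengths would be sent; the function names and the sample scan line are illustrative assumptions.

```python
from itertools import groupby

def run_lengths(symbols):
    """Collapse consecutive repeats into (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(symbols)]

def expand(runs):
    """Invert run_lengths() for string inputs."""
    return "".join(symbol * count for symbol, count in runs)

line = "1111100111"          # hypothetical scan line: 1 = white, 0 = black
runs = run_lengths(line)
print(runs)                  # [('1', 5), ('0', 2), ('1', 3)]
print(expand(runs) == line)  # True
```

The run lengths themselves are usually passed to an entropy coder (Huffman or arithmetic) rather than sent as raw integers.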