An Introduction to Arithmetic Coding

Glen G. Langdon, Jr.

Arithmetic coding is a data compression technique that encodes data (the data string) by creating a code string which represents a fractional value on the number line between 0 and 1. The coding algorithm is symbolwise recursive; i.e., it operates upon and encodes (decodes) one data symbol per iteration or recursion. On each recursion, the algorithm successively partitions an interval of the number line between 0 and 1, and retains one of the partitions as the new interval. Thus, the algorithm successively deals with smaller intervals, and the code string, viewed as a magnitude, lies in each of the nested intervals. The data string is recovered by using magnitude comparisons on the code string to recreate how the encoder must have successively partitioned and retained each nested subinterval. Arithmetic coding differs considerably from the more familiar compression coding techniques, such as prefix (Huffman) codes. Also, it should not be confused with error control coding, whose object is to detect and correct errors in computer operations. This paper presents the key notions of arithmetic compression coding by means of simple examples.

Introduction

Arithmetic coding maps a string of data (source) symbols to a code string in such a way that the original data can be recovered from the code string. The encoding and decoding algorithms perform arithmetic operations on the code string. One recursion of the algorithm handles one data symbol. Arithmetic coding is actually a family of codes which share the property of treating the code string as a magnitude. For a brief history of the development of arithmetic coding, refer to the Appendix.

Compression systems

The notion of compression systems captures the idea that data may be transformed into something which is encoded, then transmitted to a destination, then transformed back into the original data. Any data compression approach, whether employing arithmetic coding, Huffman codes, or any other coding technique, has a model which makes some assumptions about the data and the events encoded.

The code itself can be independent of the model. Some systems which compress waveforms (e.g., digitized speech) may predict the next value and encode the error. In this model the error and not the actual data is encoded. Typically, at the encoder side of a compression system, the data to be compressed feed a model unit. The model determines the event(s) to be encoded, and the estimate of the relative frequency (probability) of the events. The encoder accepts the event and some indication of its relative frequency and generates the code string.

A simple model is the memoryless model, where the data symbols themselves are encoded according to a single code. Another model is the first-order Markov model, which uses the previous symbol as the context for the current symbol. Consider, for example, compressing English sentences. If the data symbol (in this case, a letter) “q” is the previous letter, we would expect the next letter to be “u.” The first-order Markov model is a dependent model; we have a different expectation for each symbol (or in the example, each letter), depending on the context. The context is, in a sense, a state governed by the past sequence of symbols. The purpose of a context is to provide a probability distribution, or statistics, for encoding (decoding) the next symbol.
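To make the role of a context concrete, the following Python sketch gathers first-order statistics by counting, for each previous symbol, how often each symbol follows it. The function names and the sample sentence are illustrative assumptions and do not come from the paper.

```python
from collections import Counter, defaultdict


def context_statistics(text):
    """Count next-symbol occurrences for each one-symbol context."""
    counts = defaultdict(Counter)
    for previous, current in zip(text, text[1:]):
        counts[previous][current] += 1
    return counts


def probability(counts, context, symbol):
    """Relative frequency of `symbol` given that `context` preceded it."""
    total = sum(counts[context].values())
    return counts[context][symbol] / total if total else 0.0


# Made-up sample text; every "q" in it is followed by "u".
sample = "the quick queen quietly quit the quiz"
stats = context_statistics(sample)
print(probability(stats, "q", "u"))  # 1.0 for this sample
```

A memoryless model would keep a single `Counter` for the whole text; the dependent model keeps one per context, which is what lets it assign “u” a high probability after “q.”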

Corresponding to the symbols are statistics. To simplify the discussion, consider a single-context model, i.e., the memoryless model. Data compression results from encoding the more-frequent symbols with short code-string length increases, and encoding the less-frequent events with long code length increases. Let $c_i$ denote the occurrences of the ith symbol in a data string. For the memoryless model and a given code, let $\ell_i$ denote the length (in bits) of the code-string increase associated with symbol i.

Copyright 1984 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

IBM J. Res. Develop., March 1984, p. 135.




Table 1. Example Huffman code.

Symbol   Codeword   Probability p (binary)   Cumulative probability P (binary)
a        0          .100                     .000

with symbol i. The code-string length corresponding to the

If $\ell_i$ is large for data symbols of high relative frequency (large

compression. The wrong statistics (a popular symbol with a

frequent events. Denote the relative frequency of symbol i as $p_i$, where $p_i = c_i/N$, and N is the total number of symbols in the data string. If we use a fixed frequency for each data symbol

is to assign length $\ell_i$ as $-\log p_i$. Here, logarithms are taken to

length for each symbol, we calculate the ideal code length for

instance of symbol i in the data string by length value $-\log p_i$,
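The length bookkeeping can be sketched in a few lines of Python. The sketch assumes base-2 logarithms (lengths in bits), takes the relative frequency of each symbol as its count divided by the string length N, and compares the ideal total, the sum of -log p over all symbol instances, with the total produced by a given table of per-symbol lengths. The function names and the sample string are illustrative assumptions.

```python
from collections import Counter
from math import log2


def ideal_code_length(data):
    """Sum of -log2(p_i) over every symbol instance, with p_i = c_i / N."""
    counts = Counter(data)
    n = len(data)
    return sum(c * -log2(c / n) for c in counts.values())


def code_length(data, lengths):
    """Total length when each instance of symbol i costs lengths[i] bits."""
    counts = Counter(data)
    return sum(c * lengths[symbol] for symbol, c in counts.items())


data = "aaaabbcd"  # made-up string with p = 1/2, 1/4, 1/8, 1/8
print(ideal_code_length(data))                                # 14.0
print(code_length(data, {"a": 1, "b": 2, "c": 3, "d": 3}))    # 14
print(code_length(data, {"a": 3, "b": 3, "c": 2, "d": 1}))    # 21
```

The last call shows the effect of wrong statistics: giving the most popular symbol a long length yields 21 bits, more than the 16 bits a fixed 2-bit code would need for the same eight symbols.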

Model structure

reconstructed. For example, one could define an event such

Statistics estimation

distribution used for each context. The computation may be performed beforehand, or may be performed during the encoding process, typically by a counting technique. For Huffman

complex, i.e., has several contexts and a need to adapt to the data

Desirable properties of a coding method

For most applications, we desire the first-in first-out (FIFO)

encoded. FIFO coding allows for adapting to the statistics of the data string. With last-in first-out (LIFO) coding, the last event encoded is the first event decoded, so adapting is difficult.

An initial view of Huffman and arithmetic codes

of 1, 2, 3, and 3. Let us order the alphabet {a, b, c, d} according

The encoding for the data string “a a b c” is 0.0.10.110, where “.” is used as a delimiter to show the substitution of the codeword for the symbol. The code also has the prefix

performed by a matching or comparison process starting with the first bit of the code string. For decoding code string

and data string “a a b ....”
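The substitution and matching steps for the prefix code can be illustrated with a short Python sketch. The codewords for a, b, and c are read off the encoding 0.0.10.110 given in the text; the codeword 111 for d is an assumption chosen to fit the stated lengths 1, 2, 3, and 3. The names `CODE`, `encode`, and `decode` are hypothetical.

```python
# Codewords for a, b, c follow the text; "111" for d is an assumption.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(data):
    """Replace each symbol by its codeword and concatenate."""
    return "".join(CODE[symbol] for symbol in data)


def decode(bits):
    """Match codewords starting from the first bit of the code string."""
    inverse = {word: symbol for symbol, word in CODE.items()}
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:  # the prefix property makes the first match final
            symbols.append(inverse[word])
            word = ""
    if word:
        raise ValueError("code string ends inside a codeword")
    return "".join(symbols)


print(encode("aabc"))     # 0010110
print(decode("0010110"))  # aabc
```

Because no codeword is a prefix of another, the decoder never needs to look ahead or backtrack.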

d    .000    .001    3

a    .011    .100    1

for the symbol i being encoded:

is 0 and the interval width A is .01. For “a a b,” the new code point is .001, determined as 0 (current code point C), plus the product (.01) × (.100). The factor on the left is the width A of

New interval width A

probabilities of the data symbols encoded so far. Thus, the

where the current symbol is i. For example, after encoding

interval from the leftmost point C and width A of the current

values in Table 1 for the symbol to be encoded. The two operations (new code point and new width) thus form a double recursion. This double recursion is central to arithmetic coding
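A minimal Python sketch of the double recursion follows, using exact fractions. It assumes the update rules are new code point = C + A × P and new width = A × p for the symbol being encoded, which agrees with the worked step 0 + (.01) × (.100) = .001. The probability of a and the cumulative probability of b come from the text; the remaining table entries are assumptions consistent with codeword lengths 1, 2, 3, and 3.

```python
from fractions import Fraction as F

# symbol: (probability p, cumulative probability P).
# Entries for c and d (and p for b) are assumptions, not taken from the text.
TABLE = {
    "a": (F(1, 2), F(0)),
    "b": (F(1, 4), F(1, 2)),
    "c": (F(1, 8), F(3, 4)),
    "d": (F(1, 8), F(7, 8)),
}


def encode(data):
    """Return (code point C, interval width A) after encoding `data`."""
    C, A = F(0), F(1)
    for symbol in data:
        p, P = TABLE[symbol]
        C = C + A * P  # new code point uses the current width
        A = A * p      # new width is the product of the probabilities so far
    return C, A


def binary_fraction(x, bits):
    """Render 0 <= x < 1 as a binary fraction with `bits` digits."""
    digits = []
    for _ in range(bits):
        x *= 2
        digit = int(x)
        digits.append(str(digit))
        x -= digit
    return "." + "".join(digits)


C, A = encode("aab")
print(binary_fraction(C, 3), binary_fraction(A, 4))  # .001 .0001
```

The order of the two updates matters: the code point must be advanced with the width of the current interval before that width is narrowed.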

corresponds to the cumulative probability P of the preceding symbols

data string “a a b,” is shown in Figure 3. In this example, we retain Points 1 and 2 of the previous example, but no longer

and 3 to see that the interval widths are the same but the locations of the intervals have been changed in Fig. 3 to

Let us again code the string “a a b c.” This example

“picture” of Fig. 3 for motivation.

The first “a” symbol yields the code point .011 and interval

First symbol (a)

C: New code point C = 0 + 1 × (.011) = .011.
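Decoding by magnitude comparison can be sketched in the same style. The Python example below assumes the symbol ordering d, b, a, c, with the rows for d (P = .000, p = .001) and a (P = .011, p = .100) taken from the text and the rows for b and c filled in as assumptions so that the probabilities sum to 1. It also assumes the decoder is told how many symbols to recover. The function names are hypothetical.

```python
from fractions import Fraction as F

# symbol: (probability p, cumulative probability P), in the order d, b, a, c.
# Rows for b and c are assumptions derived from the d and a rows.
TABLE = {
    "d": (F(1, 8), F(0)),
    "b": (F(1, 4), F(1, 8)),
    "a": (F(1, 2), F(3, 8)),
    "c": (F(1, 8), F(7, 8)),
}


def encode(data):
    """Return the leftmost point C of the final interval."""
    C, A = F(0), F(1)
    for symbol in data:
        p, P = TABLE[symbol]
        C, A = C + A * P, A * p
    return C


def decode(code_point, count):
    """Recover `count` symbols by comparing the code point with subintervals."""
    C, A = F(0), F(1)
    symbols = []
    for _ in range(count):
        for symbol, (p, P) in TABLE.items():
            low = C + A * P
            if low <= code_point < low + A * p:
                symbols.append(symbol)
                C, A = low, A * p  # retain this subinterval, as the encoder did
                break
        else:
            raise ValueError("code point lies outside the current interval")
    return "".join(symbols)


print(encode("a"))                # 3/8, i.e. .011 in binary
print(decode(encode("aabc"), 4))  # aabc
```

The decoder repeats the encoder's partitioning and keeps whichever subinterval contains the code point, so the symbols come out in the order in which they were encoded.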

In the arithmetic coding literature, we have called the value A