Encoding and Compression - Information Retrieval | CMPSCI 646, Study notes of Computer Science

University of Massachusetts - Amherst Computer Science

Prof. James Allan

Material Type: Notes; Professor: Allan; Class: Information Retrieval; Subject: Computer Science; University: University of Massachusetts - Amherst; Term: Fall 2007;

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009

Information Retrieval

James Allan

University of Massachusetts Amherst

Compression and Statistics of text

CMPSCI 646

Fall 2007

Encoding and Compression

• Encoding transforms data from one representation to another
• Compression is an encoding that takes less space
– e.g., to reduce load on memory, disk, I/O, network
• Lossless: decoder can reproduce message exactly
• Lossy: can reproduce message approximately

[Diagram: Data → Encoder → Encoded Data → Decoder → Data, with a Model feeding both the Encoder and the Decoder]

• Measuring compression
– Compression ratio: Enc/Orig → 25Mb / 125Mb = 20%
– Or sometimes: Orig/Enc → 125Mb / 25Mb = 500%
– Or sometimes (Orig-Enc)/Orig → 100Mb/125Mb = 80%
– Or sometimes (Orig-Enc)/Enc → 100Mb/25Mb = 400%
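The four ways of reporting compression can be compared side by side. The following is a minimal sketch; the function name `compression_measures` and the dictionary keys are illustrative choices, not part of the notes.

```python
def compression_measures(orig_size, enc_size):
    """Return the four compression figures from the slide as fractions.

    orig_size and enc_size must be in the same unit (here: Mb).
    """
    saved = orig_size - enc_size
    return {
        "enc/orig": enc_size / orig_size,        # 25/125  -> 20%
        "orig/enc": orig_size / enc_size,        # 125/25  -> 500%
        "(orig-enc)/orig": saved / orig_size,    # 100/125 -> 80%
        "(orig-enc)/enc": saved / enc_size,      # 100/25  -> 400%
    }


for name, value in compression_measures(125, 25).items():
    print(f"{name}: {value:.0%}")
```

Because the same file can be reported as 20%, 500%, 80%, or 400%, it is worth stating which definition is in use whenever a compression figure is quoted.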


Compression

Advantages of Compression
- Save space in memory (e.g., compressed cache)
- Save space when storing (e.g., disk, CD/DVD)
- Save time when accessing (e.g., I/O)
- Save time when communicating (e.g., over network)
Disadvantages of Compression
- Costs time and computation to compress and uncompress
- Complicates or prevents random access

- May involve loss of information (e.g., JPEG)
- Makes data corruption much more costly.
  - Small errors may make all of the data inaccessible.

Text vs. data compression

Text compression (TC) predates most work on general data compression (DC)
TC is a kind of data compression optimized for text
- i.e., based on a language and a language model
TC can be faster or simpler than general DC
- Assumptions made about the data.
Text compression assumes a language and model
- Data compression typically learns the model on the fly
Some forms of text compression are really data compression
Text compression is effective when its assumptions are met
Data compression is effective on almost any data with a skewed distribution

Fixed Length Compression: Short bytes

If alphabet ≤ 32 symbols, use 5 bits per symbol
- Saves 3 bits per character
If alphabet > 32 symbols and ≤ 60
- use 1-30 for most frequent symbols (“base case”)
- use 1-30 for less frequent symbols (“shift case”)
- use 0 and 31 to shift back and forth (e.g., typewriter)
- Example:
  - Fast → 5·5 = 25 bits rather than 4·8 = 32 bits
  - LaTeX → 9·5 = 45 bits rather than 5·8 = 40 bits
- Works well when shifts do not occur often
Optimizations: Just one shift toggle, Temporary shift, and shift-lock
Variations: Multiple “cases”
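The short-byte scheme can be sketched as below. The two 30-symbol tables and the choice to start in the shift (upper) case are assumptions made for illustration; the notes do not say which symbols belong to which case. The starting case is chosen because it reproduces the code counts in the two examples.

```python
import string

# Hypothetical 30-symbol tables; the slides do not list the actual symbols.
BASE_CASE = string.ascii_lowercase + " .,-"
SHIFT_CASE = string.ascii_uppercase + "0123"
TO_BASE, TO_SHIFT = 0, 31  # the two reserved shift codes


def short_byte_encode(text, start_in_shift=True):
    """Return a list of 5-bit codes (values 0-31) for text."""
    codes = []
    in_shift = start_in_shift  # assumed starting case
    for ch in text:
        if ch in BASE_CASE:
            wants_shift, table = False, BASE_CASE
        elif ch in SHIFT_CASE:
            wants_shift, table = True, SHIFT_CASE
        else:
            raise ValueError(f"symbol {ch!r} is not in either case")
        if wants_shift != in_shift:
            # Emit a shift code whenever the case changes.
            codes.append(TO_SHIFT if wants_shift else TO_BASE)
            in_shift = wants_shift
        codes.append(table.index(ch) + 1)  # symbol codes are 1-30
    return codes


for word in ("Fast", "LaTeX"):
    bits = 5 * len(short_byte_encode(word))
    print(word, bits, "bits versus", 8 * len(word), "bits as bytes")
```

"Fast" needs one shift and comes out shorter than its byte form, while "LaTeX" needs four shifts and comes out longer, which is the point of the two examples.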

Fixed Length Compression: Bigrams

Use a byte (8 bits) as storage unit (values 0-255)
Use values 0-87 for blank, upper case, lower case, digits and 25 special characters
Use values 88-255 for common bigrams (master + combining)
- Master (8): blank, A, E, I, O, N, T, U
- Combining (21): blank, plus everything except J, K, Q, X, Y, Z
- Total codes: 88 + 8 * 21 = 88 + 168 = 256
Example: Fast → saves one byte
Pro: Simple, fast, requires little memory.

Con: Based on a small symbol set
Con: Maximum compression is 50%.
- Average is lower (33%?).
Variation: 128 ASCII characters and 128 bigrams.
Extension: Escape character for ASCII 128-255.
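A rough sketch of the bigram code table follows. It assumes the master and combining letters are matched in lower case (so that "as" in "Fast" forms a bigram) and uses the first 25 characters of Python's `string.punctuation` as the special characters; both choices are illustrative, not taken from the notes.

```python
import string

# Assumed single-character table: blank, upper, lower, digits, 25 specials.
SINGLES = (" " + string.ascii_uppercase + string.ascii_lowercase
           + string.digits + string.punctuation[:25])           # 88 codes
MASTER = " aeiontu"                                              # 8 masters
COMBINING = " " + "".join(c for c in string.ascii_lowercase
                          if c not in "jkqxyz")                  # 21 combining


def bigram_encode(text):
    """Greedily encode text, one byte per single character or bigram."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair[0] in MASTER and pair[1] in COMBINING:
            out.append(88 + MASTER.index(pair[0]) * 21 + COMBINING.index(pair[1]))
            i += 2
        else:
            out.append(SINGLES.index(text[i]))  # ValueError if not in table
            i += 1
    return bytes(out)


def bigram_decode(data):
    """Invert bigram_encode."""
    chars = []
    for b in data:
        if b < 88:
            chars.append(SINGLES[b])
        else:
            m, c = divmod(b - 88, 21)
            chars.append(MASTER[m] + COMBINING[c])
    return "".join(chars)


print(len(bigram_encode("Fast")), "bytes for 'Fast'")
```

Under these assumptions "Fast" is stored as F, "as", t, so three bytes instead of four.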

Fixed Length Compression: n-grams

Similar to bigrams, but extended to cover sequences of 2 or more characters
Can select common n-grams in advance
Could learn them for specific applications
The goal is that each encoded unit of length > 1 occurs with very high (and roughly equal) probability
Popular today for:
- OCR data
  - scanning errors make bigram assumptions less applicable
- Asian languages
  - two and three symbol words are common
  - longer n-grams can capture phrases and names

Fixed length compression summary

Three methods presented. All are
- simple
- very effective when their assumptions are correct
All are based on a small symbol set, to varying degrees
- some only handle a small symbol set
- some handle a larger symbol set, but compress best when a few symbols comprise most of the data
All are based on a strong assumption about the choice of language (English, here)
Bigram and n-gram methods are also based on strong assumptions about common sequences of symbols
- Intuitions about which characters/sequences are frequent

Generalization for More Symbols

Use more than 2 cases
- 1xxx for 2^3 = 8 most frequent symbols, and
- 0xxx1xxx for next 2^6 = 64 symbols, and
- 0xxx0xxx1xxx for next 2^9 = 512 symbols, and
- …
Average code length on WSJ89 is 6.2 bits per symbol
- 23.0% compression ratio

Pro: Variable number of symbols
Con: Only 72 symbols in 1 byte (or less)
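The multi-case scheme can be sketched as a function from a symbol's frequency rank to its code. This is an illustrative reading of the slide, in which rank 0 is the most frequent symbol and each 4-bit group carries a flag bit (1 marks the last group) plus 3 data bits; the name `encode_rank` and the `bits` parameter are assumptions.

```python
def encode_rank(rank, bits=3):
    """Return the code for a frequency rank (0 = most frequent) as a bit string.

    Each group holds one flag bit followed by `bits` data bits.
    The flag is 1 on the final group and 0 on all earlier groups.
    """
    base = 1 << bits
    groups, offset = 1, 0
    # Find how many groups are needed: 8 symbols, then the next 64, then 512...
    while rank >= offset + base ** groups:
        offset += base ** groups
        groups += 1
    value = rank - offset
    pieces = []
    for i in reversed(range(groups)):
        digit = (value >> (bits * i)) & (base - 1)
        flag = 1 if i == 0 else 0
        pieces.append(format((flag << bits) | digit, f"0{bits + 1}b"))
    return "".join(pieces)


for r in (0, 7, 8, 71, 72):
    print(r, encode_rank(r))
```

Ranks 0 to 71 fit in at most 8 bits, which matches the 72-symbol figure above; calling `encode_rank(r, bits=7)` gives the byte-oriented variant with 128 one-byte codes.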

Example: Unicode as UTF-n

Encode Unicode into variable-length pieces
- Unicode allows 1,114,112 code values (0 to 0x10FFFF, or 21 bits)
  - 17 “planes” (0 to 16) of 65,536 coding values
  - Basic Multilingual Plane is 0 to 0xFFFF
All UTF-n are equally valid ways to encode Unicode
- It is possible to map between them without loss of information
- All Unicode code values can be represented
Consider UTF-32
- Each Unicode value encoded in 32 bits directly
- Fixed-length and fixed-location characters
- Simple to find character offsets and starting positions
- Wasteful
  - 11 bits never used (remember, Unicode uses only 21 bits)
  - In some cases, many other bits wasted (e.g., 7-bit ASCII)

UTF-16

Represent Unicode in 16-bit (2 byte) units
- (Prior to v2.1, Unicode was limited to 16 bits)
Unicode values 0x0000 to 0xFFFF (common chars)
- Represent directly as a 16-bit code value
- Two blocks of 1024 (2^10) values unused in there
For values 0x10000 to 0x10FFFF
- Subtract 0x10000 so range is 0x0 to 0xFFFFF, needs 20 bits
- Encode first 10 bits in high 16-bit “unused range” number
- Encode remaining 10 bits in a low 16-bit “unused” number
During decoding
- If character is “unused” then it is part of a 32-bit encoding
  - Which block (high/low) determines which of the parts it is
- Else it is a 16-bit encoding
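The steps above can be sketched in a few lines. The two "unused" blocks are the surrogate ranges 0xD800-0xDBFF (high) and 0xDC00-0xDFFF (low); the function names are illustrative, and the sketch assumes the input is a valid code value outside the surrogate ranges.

```python
def utf16_units(code_point):
    """Return the list of 16-bit units that encode one Unicode value."""
    if code_point < 0x10000:
        return [code_point]                 # stored directly
    v = code_point - 0x10000                # now fits in 20 bits
    high = 0xD800 | (v >> 10)               # first 10 bits
    low = 0xDC00 | (v & 0x3FF)              # remaining 10 bits
    return [high, low]


def utf16_decode_units(units):
    """Rebuild code values from a list of 16-bit units."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:           # high half: a low half follows
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


print([hex(u) for u in utf16_units(0x1F600)])
```

The result can be checked against Python's own codec with `chr(0x1F600).encode("utf-16-be")`.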

UTF-8

Base unit is a single byte
For Unicode values 0 to 0x7F (127)
- Encode directly in a single byte using 7 bits (0xxxxxxx)
- Corresponds to ASCII values
For all other values…
- First byte indicates how long the encoding is
  - Encode the number of bytes needed in unary (count the 1’s)
  - 110xxxxx means “two bytes” and has first 5 bits of number
  - 1110xxxx for 3 bytes, 11110xxx means 4 bytes
  - (Maximum needed is 4 bytes, though can encode more in theory)
- All other bytes include next 6 bits
  - Bytes are of form 10xxxxxx
  - (Why not 7 bits, 1xxxxxxx or 8 bits, xxxxxxxx?)
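The byte layout described above can be written out directly. This is a minimal sketch for illustration (the function name `utf8_encode` is not from the notes); in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point):
    """Encode one Unicode value into 1-4 bytes using the UTF-8 layout."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx + three 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond the Unicode range")


for cp in (0x41, 0xA9, 0x20AC, 0x1F600):
    print(hex(cp), " ".join(f"{b:08b}" for b in utf8_encode(cp)))
```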

Notes on UTF-8 formatting

Number must be encoded in shortest possible encoding
- Copyright © is 0xA9 = 1010 1001
- Valid: 11000010 10101001
- Invalid: 11100000 10000010 10101001 (and so on)
One-byte characters and those that are part of a sequence are distinguishable
- High-order bit is always clear in former and always set in latter
- (True for UTF-16, too, but for different reason)
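The shortest-form rule can be observed with a strict decoder. The sketch below uses Python's built-in UTF-8 codec, which rejects overlong sequences; `is_valid_utf8` is a helper name introduced here for illustration.

```python
def is_valid_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two byte sequences for the copyright sign from the slide.
shortest = bytes([0b11000010, 0b10101001])
overlong = bytes([0b11100000, 0b10000010, 0b10101001])

print(is_valid_utf8(shortest), is_valid_utf8(overlong))
```

Rejecting the longer forms keeps the mapping one-to-one, so two byte strings are equal exactly when the text they encode is equal.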

Non-overlapping aspects of UTF-n

Consider older mixed-width encodings (e.g., Windows)
- One character could be a subsequence of another
- Causes problems during string search
Imagine looking for “D”
- Is the 0x44 a “D” or the end of the 0x414 character?
- To check, look backward one character
- Could it be that this is a 0x442 character?
- To check, look backward… (to start of text, worst case)
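In UTF-8 the same check never needs to scan far. Continuation bytes always have the form 10xxxxxx, so the start of a character is found by stepping back over at most three of them. A minimal sketch, with `char_start` as an illustrative helper name:

```python
def char_start(data, i):
    """Return the index where the UTF-8 character containing byte i begins."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        i -= 1
    return i


data = "\u0414\u0442D".encode("utf-8")   # two Cyrillic letters, then ASCII "D"
hit = data.find(b"D")
print(hit, char_start(data, hit) == hit)
```

Because every byte of a multi-byte character has its high bit set, a byte search for an ASCII character such as "D" cannot match inside another character.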

Pros and cons of various UTF-n

All are equally valid encodings
UTF-32 simpler (trivial) to encode
UTF-16 captures more commonly used Unicode characters simply
UTF-8 is most compact for ASCII-like text
- At a disadvantage for some East Asian text
Binary sort of UTF-8 matches Unicode value sort
Big-/little-endian storage not an issue for UTF-8
- For UTF-16 and UTF-32, partly addressed by unused characters 0xFFFE and 0xFEFF
- Put that “byte order marker” at start of text to flag endian choice
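The size trade-off and the sort-order property are easy to check with Python's built-in codecs. The sample strings below are arbitrary illustrations, and `encoded_sizes` is a helper name introduced here.

```python
def encoded_sizes(text):
    """Return the byte length of text under each UTF-n (no byte order mark)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))    # ASCII-like text
print(encoded_sizes("日本語"))    # East Asian text

# Sorting by raw UTF-8 bytes gives the same order as sorting by code value.
words = ["zebra", "éclair", "日本", "apple", "😀"]
print(sorted(words) == sorted(words, key=lambda w: w.encode("utf-8")))
```

Python compares strings by code value, so the last line is a direct check of the binary-sort claim on this sample.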

Back to restricted variable-length codes

Generalization for numeric data
- 1xxxxxxx for 2^7 = 128 most frequent symbols,
- 0xxxxxxx1xxxxxxx for next 2^14 = 16,384 symbols, and so on
- Con: totally fails to compress text
  - On WSJ89 uses 8.0 bits per symbol, for a 0.0% compression ratio (!!).
- Pro: Can be used for integer data
  - Examples: word frequencies (usually small), inverted lists
Word-based encoding
- Build dictionary, sorted by word frequency
- Represent each word as an offset/index into the dictionary
- Pro: A vocabulary of 20,000-50,000 words with a Zipf distribution requires 12-13 bits per word
  - compared with 10-11 bits for completely variable-length codes (in a moment)
- Con: Decoding dictionary is large, compared with other methods
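For integer data, the byte-oriented scheme is commonly implemented as a variable-byte code. The sketch below is one such variant, not necessarily the exact layout intended in the notes: it stores 7 data bits per byte and sets the high bit on the last byte of each number, and it does not apply the "next 16,384" offset between lengths.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as bytes; high bit marks the last byte."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()            # most significant 7-bit group first
    groups[-1] |= 0x80          # 1xxxxxxx terminates the number
    return bytes(groups)


def vbyte_decode(data):
    """Decode a concatenation of vbyte-encoded integers."""
    numbers, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:            # last byte of this number
            numbers.append(n)
            n = 0
    return numbers


freqs = [5, 130, 1, 20000]
packed = b"".join(vbyte_encode(f) for f in freqs)
print(len(packed), "bytes for", len(freqs), "integers")
```

Small values such as typical word frequencies take a single byte each, which is why the approach suits inverted lists even though it does nothing for text.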

Zipf’s Law

A few words occur very often
- 2 most frequent words can account for 10% of occurrences
- top 6 words are 20%, top 50 words are 50%
Many words are infrequent
“Principle of Least Effort”
- easier to repeat words than coin new ones
Rank · Frequency ≈ Constant
- $p_r$ = (Number of occurrences of word of rank r)/N
  - N total word occurrences
  - probability that a word chosen randomly from the text will be the word of rank r
  - for D unique words, $\sum_{r=1}^{D} p_r = 1$
- $r \cdot p_r = A$
- A ≈ 0.

George Kingsley Zipf, 1902-

Linguistic professor at Harvard
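The quantities in Zipf's law can be computed from any token list. The following is a small sketch (the function name `zipf_table` and the toy sentence are illustrative); on a real corpus the last column, r·p_r, should stay roughly constant across the middle ranks.

```python
from collections import Counter


def zipf_table(tokens):
    """Return rows of (word, frequency, rank r, p_r, r * p_r), by rank."""
    counts = Counter(tokens)
    total = sum(counts.values())            # N, total word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        p = freq / total
        rows.append((word, freq, rank, p, rank * p))
    return rows


sample = "the cat and the dog and the bird saw the cat".split()
for word, freq, rank, p, rp in zipf_table(sample):
    print(f"{word:5} {freq:2} {rank:2} {p:.3f} {rp:.3f}")
```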

Example of Frequent Words

Frequent Word    Number of Occurrences    Percentage of Total
the              7,398,934                5.9
of               3,893,790                3.
to               3,364,653                2.
and              3,320,687                2.
in               2,311,785                1.
is               1,559,147                1.
for              1,313,561                1.
The              1,144,860                0.
that             1,066,503                0.
said             1,027,713                0.

Artifact of InQuery’s stemming technique

Frequencies from 336,310 documents in the 1GB TREC Volume 3 Corpus

125,720,891 total word occurrences; 508,209 unique words

Examples of Zipf

Top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB)

Word   Freq   r  Pr(%)  r·Pr    Word        Freq   r  Pr(%)  r·Pr
the   15659   1  6.422  0.0642  has          880  26  0.361  0.
of     7179   2  2.944  0.0589  not          875  27  0.359  0.
to     6287   3  2.578  0.0774  an           863  28  0.354  0.
a      5830   4  2.391  0.0956  s            862  29  0.354  0.
and    5580   5  2.288  0.1144  have         860  30  0.353  0.
in     5245   6  2.151  0.1291  were         858  31  0.352  0.
that   2494   7  1.023  0.0716  their        812  32  0.333  0.
for    2197   8  0.901  0.0721  are          807  33  0.331  0.
was    2147   9  0.881  0.0792  one          742  34  0.304  0.
with   1824  10  0.748  0.0748  they         679  35  0.278  0.
his    1813  11  0.744  0.0818  its          668  36  0.274  0.
is     1800  12  0.738  0.0886  all          646  37  0.265  0.
he     1687  13  0.692  0.0899  week         626  38  0.257  0.
as     1576  14  0.646  0.0905  government   582  39  0.239  0.
on     1523  15  0.625  0.0937  when         577  40  0.237  0.
by     1443  16  0.592  0.0947  would        572  41  0.235  0.
at     1318  17  0.541  0.0919  been         554  42  0.227  0.
it     1232  18  0.505  0.0909  out          553  43  0.227  0.
from   1217  19  0.499  0.0948  new          544  44  0.223  0.
but    1136  20  0.466  0.0932  which        539  45  0.221  0.
u       949  21  0.389  0.0817  up           539  45  0.221  0.
had     937  22  0.384  0.0845  more         535  47  0.219  0.
last    909  23  0.373  0.0857  into         516  48  0.212  0.
be      906  24  0.372  0.0892  only         504  49  0.207  0.
who     883  25  0.362  0.0905  will         488  50  0.2    0.

Examples of Zipf

Top 50 words from 84,678 Associated Press 1989 articles (37,309,114 word occurrences, lowercased, punctuation removed, 266MB)

Word       Freq   r  Pr(%)  r·Pr    Word        Freq   r  Pr(%)  r·Pr
the   2,420,778   1  6.488  0.0649  has      136,007  26  0.365  0.
of    1,045,733   2  2.803  0.0561  are      130,322  27  0.349  0.
to      968,882   3  2.597  0.0779  not      127,493  28  0.342  0.
a       892,429   4  2.392  0.0957  who      116,364  29  0.312  0.
and     865,644   5  2.32   0.116   they     111,024  30  0.298  0.
in      847,825   6  2.272  0.1363  its      111,021  31  0.298  0.
said    504,593   7  1.352  0.0947  had      103,943  32  0.279  0.0892
for     363,865   8  0.975  0.078   will     102,949  33  0.276  0.
that    347,072   9  0.93   0.0837  would     99,503  34  0.267  0.
was     293,027  10  0.785  0.0785  about     92,983  35  0.249  0.
on      291,947  11  0.783  0.0861  i         92,005  36  0.247  0.
he      250,919  12  0.673  0.0807  been      88,786  37  0.238  0.
is      245,843  13  0.659  0.0857  this      87,286  38  0.234  0.
with    223,846  14  0.6    0.084   their     84,638  39  0.227  0.
at      210,064  15  0.563  0.0845  new       83,449  40  0.224  0.
by      209,586  16  0.562  0.0899  or        81,796  41  0.219  0.
it      195,621  17  0.524  0.0891  which     80,385  42  0.215  0.
from    189,451  18  0.508  0.0914  we        80,245  43  0.215  0.
as      181,714  19  0.487  0.0925  more      76,388  44  0.205  0.
be      157,300  20  0.422  0.0843  after     75,165  45  0.201  0.
were    153,913  21  0.413  0.0866  us        72,045  46  0.193  0.
an      152,576  22  0.409  0.09    percent   71,956  47  0.193  0.
have    149,749  23  0.401  0.0923  up        71,082  48  0.191  0.
his     142,285  24  0.381  0.0915  one       70,266  49  0.188  0.
but     140,880  25  0.378  0.0944  people    68,988  50  0.185  0.

Does Real Data Fit Zipf’s Law?

A law of the form $y = kx^c$ is called a power law.
Zipf’s law is a power law with $c = -1$
- $r = (AN) \cdot n^{-1} \rightarrow n = (AN) \cdot r^{-1}$
- AN is a constant for a fixed collection
On a log-log plot, power laws give a straight line with slope c.
- $\log(y) = \log(kx^c) = \log k + c \log(x)$
- $\log(n) = \log(AN \cdot r^{-1}) = \log(AN) - 1 \cdot \log(r)$
Zipf is quite accurate except for very high and low rank.

[Foundations of Statistical NLP, Manning and Schutze, 2000]
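One way to test the fit is to regress log frequency on log rank and read off the slope. The sketch below uses ordinary least squares from the standard library only; `loglog_fit` is an illustrative name, and the synthetic data are generated to follow n = 1000/r exactly rather than taken from a corpus.

```python
import math


def loglog_fit(ranks, freqs):
    """Least-squares fit of log(freq) = log(k) + c * log(rank).

    Returns (c, k): the slope on the log-log plot and the constant k.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c = sxy / sxx
    k = math.exp(mean_y - c * mean_x)
    return c, k


ranks = list(range(1, 51))
freqs = [1000 / r for r in ranks]       # synthetic, exactly Zipfian
print(loglog_fit(ranks, freqs))
```

On real rank-frequency data, a slope near -1 over the middle ranks is what Zipf's law predicts, with departures expected at the very top and bottom of the ranking.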

Fit to Zipf for Brown Corpus

[Foundations of Statistical NLP, Manning and Schutze, 2000]

k = AN = 100,

Mandelbrot (1954) Correction

The following more general form gives a bit better fit
- Adds a constant to the denominator
- $y = k(x+t)^c$
- Here, $n = (AN) \cdot (r+t)^{-1}$

[Figure: fit with $k = 10^{5.4}$, $c = -1.15$, $t = 100$; from Foundations of Statistical NLP, Manning and Schutze, 2000]
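The corrected form is simple to evaluate. The sketch below plugs in the parameter values quoted above; the function name `mandelbrot_freq` is illustrative.

```python
def mandelbrot_freq(rank, k, c, t):
    """Predicted frequency n = k * (rank + t) ** c."""
    return k * (rank + t) ** c


K, C, T = 10 ** 5.4, -1.15, 100     # values quoted for the fitted curve
for r in (1, 10, 100, 1000, 10000):
    print(r, round(mandelbrot_freq(r, K, C, T), 1))
```

With t = 0 and c = -1 the function reduces to plain Zipf, n = k/r. The added constant flattens the curve at the highest-frequency ranks, which is where plain Zipf fits worst.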

Explanations for Zipf’s Law

Zipf’s explanation was his “principle of least effort.”
- Balance between speaker’s desire for a small vocabulary and hearer’s desire for a large one.
Debate (1955-61) between Mandelbrot and H. Simon over explanation.
Li (1992) shows that just random typing of letters including a space will generate “words” with a Zipfian distribution.
- http://linkage.rockefeller.edu/wli/zipf/
- Short words more likely to be generated
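The random-typing observation can be tried with a short simulation. This is only a sketch of the idea, not Li's actual experiment: the five-letter alphabet, the uniform key probabilities, the text length, and the seed are all arbitrary assumptions.

```python
import random
from collections import Counter


def random_typing(n_chars, letters="abcde", seed=0):
    """Type n_chars random keys (letters plus space) and count the 'words'."""
    rng = random.Random(seed)
    alphabet = letters + " "                     # space acts as word separator
    text = "".join(rng.choices(alphabet, k=n_chars))
    return Counter(text.split())


counts = random_typing(200_000)
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, repr(word), freq)
```

The most frequent "words" are the single letters, followed by the two-letter combinations, so frequency falls off with rank even though no language is involved.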

Vocabulary Growth (Heaps’ Law)

How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
- Vocabulary has no upper bound due to proper names, typos, etc.
- New words occur less frequently as vocabulary grows
If V is the size of the vocabulary and n is the length of the corpus in words:
- $V = K n^{\beta}$ (0 < β < 1)
Typical constants:
- K ≈ 10−100
- β ≈ 0.4−0.6 (approx. square-root of n)
Can be derived from Zipf’s law by assuming documents are generated by randomly sampling words from a Zipfian distribution

Heaps’ Law Data

[Figure: Heaps’ Law data with the fit $V = K n^{\beta}$]
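Both sides of Heaps' law are easy to compute: the prediction from K and β, and the observed vocabulary curve from a token stream. A minimal sketch, with illustrative function names and a toy token list:

```python
def heaps_estimate(n, K, beta):
    """Predicted vocabulary size V = K * n ** beta for a corpus of n words."""
    return K * n ** beta


def vocabulary_growth(tokens):
    """Return (n, V) pairs: unique words seen after each of the first n tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


print(heaps_estimate(1_000_000, K=10, beta=0.5))
print(vocabulary_growth("a rose is a rose is a rose".split()))
```

Plotting the observed (n, V) pairs on log-log axes and fitting a straight line gives estimates of β (the slope) and K (from the intercept) for a particular collection.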

Compression and statistics: Outline

Introduction
Fixed Length Codes
- Short-bytes, bigrams, n-grams
Restricted Variable-Length Codes
- Basic method, extension for larger symbol sets, UTF-8
Some diversions
- Statistics of text: Zipf, Heaps
- Information theory
Variable-Length Codes
- Huffman Codes / Canonical Huffman Codes
- Lempel-Ziv (LZ77, Gzip, LZ78, LZW, Unix compress)
Synchronization
Compressing inverted files
Compression for block-level retrieval

Information Theory

Shannon studied theoretical limits for data compression and transmission rate
- Compression limits given by Entropy (H)
- Transmission limits given by Channel Capacity (C)
A number of language tasks have been formulated as a “noisy channel” problem
- i.e., determine the most likely input given the noisy output
  - OCR
  - Speech recognition
  - Question answering
  - Machine translation
  - …
