Encoding and Compression - Information Retrieval | CMPSCI 646, Study notes of Computer Science

Material Type: Notes; Professor: Allan; Class: Information Retrieval; Subject: Computer Science; University: University of Massachusetts - Amherst; Term: Fall 2007;

Information Retrieval
Compression and Statistics of text

James Allan
University of Massachusetts Amherst
CMPSCI 646, Fall 2007

All slides copyright © James Allan

Encoding and Compression

• Encoding transforms data from one representation to another
• Compression is an encoding that takes less space
  • e.g., to reduce load on memory, disk, I/O, network
• Lossless: decoder can reproduce message exactly
• Lossy: can reproduce message approximately

[Figure: Data → Encoder → Encoded Data → Decoder → Data; the Encoder and the Decoder are each labeled with a Model]

• Measuring compression
  • Compression ratio: Enc/Orig → 25Mb / 125Mb = 20%
  • Or sometimes: Orig/Enc → 125Mb / 25Mb = 500%
  • Or sometimes (Orig-Enc)/Orig → 100Mb/125Mb = 80%
  • Or sometimes (Orig-Enc)/Enc → 100Mb/25Mb = 400%
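
The four measures can be checked with a few lines of arithmetic. This is only a sketch; the function name `compression_measures` and the dictionary keys are made up for illustration, and the sizes are the 125Mb / 25Mb example from the slide.

```python
def compression_measures(orig_size, enc_size):
    """Return the four ratios from the slide as fractions (x100 gives %)."""
    saved = orig_size - enc_size
    return {
        "enc/orig": enc_size / orig_size,    # 20% in the slide's example
        "orig/enc": orig_size / enc_size,    # 500%
        "saved/orig": saved / orig_size,     # 80%
        "saved/enc": saved / enc_size,       # 400%
    }

m = compression_measures(orig_size=125, enc_size=25)
for name, value in m.items():
    print(f"{name}: {value:.0%}")
```

Because the same file can be reported as 20%, 500%, 80%, or 400%, it is worth stating which definition is in use whenever a "compression ratio" is quoted.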

Compression

  • Advantages of Compression
    • Save space in memory (e.g., compressed cache)
    • Save space when storing (e.g., disk, CD/DVD)
    • Save time when accessing (e.g., I/O)
    • Save time when communicating (e.g., over network)
  • Disadvantages of Compression
    • Costs time and computation to compress and uncompress
    • Complicates or prevents random access

    • May involve loss of information (e.g., JPEG)
    • Makes data corruption much more costly
      • Small errors may make all of the data inaccessible

Text vs. data compression

  • Text compression (TC) predates most work on general data compression (DC)
  • TC is a kind of data compression optimized for text
    • i.e., based on a language and a language model
  • TC can be faster or simpler than general DC
    • Assumptions made about the data.
  • Text compression assumes a language and model
    • Data compression typically learns the model on the fly

  • Some forms of text compression are really data compression
  • Text compression is effective when its assumptions are met
  • Data compression is effective on almost any data with a skewed distribution

Fixed Length Compression: Short bytes

  • If alphabet ≤ 32 symbols, use 5 bits per symbol
    • Saves 3 bits per character
  • If alphabet > 32 symbols and ≤ 60
    • use 1-30 for most frequent symbols (“base case”)
    • use 1-30 for less frequent symbols (“shift case”)
    • use 0 and 31 to shift back and forth (e.g., typewriter)
    • Example:
      • Fast → 5×5 = 25 bits rather than 4×8 = 32 bits
      • LaTeX → 9×5 = 45 bits rather than 5×8 = 40 bits
    • Works well when shifts do not occur often
  • Optimizations: Just one shift toggle, Temporary shift, and shift-lock
  • Variations: Multiple “cases”
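
A minimal sketch of the two-case scheme, under some loud assumptions: the base case is taken to be the 26 lowercase letters and the shift case the 26 uppercase letters (so only codes 1-26 of the 1-30 range are used), code 0 switches to the shift case, and code 31 switches back. The function name and the `start_shifted` parameter are hypothetical. The slide's bit counts appear to be reproduced when the encoder starts in the case of the first symbol, which is an assumption here rather than something the slide states.

```python
import string

BASE = string.ascii_lowercase    # assumed base-case symbols
SHIFT = string.ascii_uppercase   # assumed shift-case symbols
TO_SHIFT, TO_BASE = 0, 31        # shift codes from the slide

def short_byte_encode(text, start_shifted=False):
    """Return the list of 5-bit codes for text (letters only in this sketch)."""
    codes = []
    shifted = start_shifted
    for ch in text:
        if ch in BASE:
            if shifted:                      # leave the shift case first
                codes.append(TO_BASE)
                shifted = False
            codes.append(1 + BASE.index(ch))
        elif ch in SHIFT:
            if not shifted:                  # enter the shift case first
                codes.append(TO_SHIFT)
                shifted = True
            codes.append(1 + SHIFT.index(ch))
        else:
            raise ValueError(f"symbol {ch!r} is outside this sketch's alphabet")
    return codes

for word in ("Fast", "LaTeX"):
    codes = short_byte_encode(word, start_shifted=True)
    print(word, codes, f"{5 * len(codes)} bits vs {8 * len(word)} bits")
```

"LaTeX" alternates case on every letter, so it needs four shift codes and ends up larger than the 8-bit original, which is the failure mode the slide warns about.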

Fixed Length Compression: Bigrams

  • Use a byte (8 bits) as storage unit (values 0-255)
  • Use values 0-87 for blank, upper case, lower case, digits and 25 special characters

  • Use values 88-255 for common bigrams (master + combining)
    • Master (8): blank, A, E, I, O, N, T, U
    • Combining (21): blank, plus everything except J, K, Q, X, Y, Z
    • Total codes: 88 + 8 * 21 = 88 + 168 = 256
  • Example: Fast → saves one byte
  • Pro: Simple, fast, requires little memory.

  • Con: Based on a small symbol set
  • Con: Maximum compression is 50%.
    • Average is lower (33%?).
  • Variation: 128 ASCII characters and 128 bigrams.
  • Extension: Escape character for ASCII 128-255.
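
The code layout above can be sketched as follows. Several details are assumptions made only for illustration: the bigram letters are treated as lowercase, the 25 special characters are simply the first 25 entries of Python's `string.punctuation`, the order of symbols within each table is arbitrary, and matching is greedy from left to right.

```python
import string

SPECIALS = string.punctuation[:25]   # assumed choice of 25 special characters
SINGLES = (" " + string.ascii_uppercase + string.ascii_lowercase
           + string.digits + SPECIALS)             # codes 0-87
MASTER = " aeinotu"                                # 8 master symbols
COMBINING = " " + "".join(c for c in string.ascii_lowercase
                          if c not in "jkqxyz")    # 21 combining symbols
assert len(SINGLES) == 88 and len(MASTER) == 8 and len(COMBINING) == 21

def bigram_encode(text):
    """Encode text one byte per single symbol or per master+combining pair."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair_ok = (i + 1 < len(text) and text[i] in MASTER
                   and text[i + 1] in COMBINING)
        if pair_ok:
            m, c = MASTER.index(text[i]), COMBINING.index(text[i + 1])
            out.append(88 + 21 * m + c)            # codes 88-255
            i += 2
        else:
            out.append(SINGLES.index(text[i]))     # ValueError if unsupported
            i += 1
    return bytes(out)

def bigram_decode(data):
    """Invert bigram_encode."""
    chars = []
    for b in data:
        if b < 88:
            chars.append(SINGLES[b])
        else:
            m, c = divmod(b - 88, 21)
            chars.append(MASTER[m] + COMBINING[c])
    return "".join(chars)

print(len(bigram_encode("Fast")), "bytes for 'Fast'")
```

With these assumed tables, "Fast" becomes F, the bigram "as", and t, so three bytes instead of four, which agrees with the slide's "saves one byte".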

Fixed Length Compression: n-grams

  • Similar to bigrams, but extended to cover sequences of 2 or more characters
  • Can select common n-grams in advance
  • Could learn them for specific applications
  • The goal is that each encoded unit of length > 1 occurs with very high (and roughly equal) probability
  • Popular today for:
    • OCR data
      • scanning errors make bigram assumptions less applicable
    • Asian languages
      • two and three symbol words are common
      • longer n-grams can capture phrases and names

Fixed length compression summary

  • Three methods presented. All are
    • simple
    • very effective when their assumptions are correct
  • All are based on a small symbol set, to varying degrees
    • some only handle a small symbol set
    • some handle a larger symbol set, but compress best when a few symbols comprise most of the data
  • All are based on a strong assumption about the choice of language (English, here)
  • Bigram and n-gram methods are also based on strong assumptions about common sequences of symbols
  • Intuitions about which characters/sequences are frequent

Generalization for More Symbols

  • Use more than 2 cases
    • 1xxx for 2^3 = 8 most frequent symbols, and
    • 0xxx1xxx for next 2^6 = 64 symbols, and
    • 0xxx0xxx1xxx for next 2^9 = 512 symbols, and so on
  • Average code length on WSJ89 is 6.2 bits per symbol
    • 23.0% compression ratio

  • Pro: Variable number of symbols
  • Con: Only 72 symbols in 1 byte (or less)
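
One way to read the nibble scheme above is sketched here. It assumes symbols are identified by a 0-based frequency rank, that each longer code covers the "next" block of symbols (so ranks 0-7 take one nibble, 8-71 two, 72-583 three), and that the nibble whose high bit is 1 ends the code. The function names are hypothetical.

```python
def encode_rank(rank):
    """Return the list of 4-bit nibbles for a 0-based frequency rank."""
    n, base = 1, 0
    while rank >= base + 8 ** n:     # find how many nibbles are needed
        base += 8 ** n
        n += 1
    offset = rank - base             # position inside the n-nibble block
    digits = [(offset >> (3 * (n - 1 - i))) & 7 for i in range(n)]
    # continuation nibbles are 0xxx, the final nibble is 1xxx
    return [d | (8 if i == n - 1 else 0) for i, d in enumerate(digits)]

def decode_nibbles(nibbles):
    """Invert encode_rank over a concatenated nibble stream."""
    ranks, offset, n = [], 0, 0
    for nib in nibbles:
        offset = (offset << 3) | (nib & 7)
        n += 1
        if nib & 8:                  # high bit set: last nibble of this code
            base = sum(8 ** k for k in range(1, n))
            ranks.append(base + offset)
            offset, n = 0, 0
    return ranks

stream = [nib for r in (0, 7, 8, 71, 72) for nib in encode_rank(r)]
print(stream, decode_nibbles(stream))
```

Two nibbles (one byte) reach ranks 0-71, which is the "only 72 symbols in 1 byte" limit noted in the Con bullet.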

Example: Unicode as UTF-n

  • Encode Unicode into variable-length pieces
    • Unicode allows 1,114,112 code values (0 to 0x10FFFF, or 21 bits)
      • 17 “planes” (0 to 16) of 65,536 coding values
      • Basic Multilingual Plane is 0 to 0xFFFF
  • All UTF-n are equally valid ways to encode Unicode
    • It is possible to map between them without loss of information
    • All Unicode code values can be represented
  • Consider UTF-32
    • Each Unicode value encoded in 32 bits directly
    • Fixed-length and fixed-location characters
    • Simple to find character offsets and starting positions
    • Wasteful
      • 11 bits never used (remember, Unicode uses only 21 bits)
      • In some cases, many other bits wasted (e.g., 7-bit ASCII)

UTF-16

  • Represent Unicode in 16-bit (2 byte) units
    • (Prior to v2.1, Unicode was limited to 16 bits)
  • Unicode values 0x0000 to 0xFFFF (common chars)
    • Represent directly as a 16-bit code value
    • Two blocks of 1024 (2^10) values unused in there
  • For values 0x10000 to 0x10FFFF
    • Subtract 0x10000 so range is 0x0 to 0xFFFFF, needs 20 bits
    • Encode first 10 bits in high 16-bit “unused range” number
    • Encode remaining 10 bits in a low 16-bit “unused” number

  • During decoding
    • If character is “unused” then it is part of a 32-bit encoding
      • Which block (high/low) determines which of the parts it is
    • Else it is a 16-bit encoding
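
The encoding and decoding steps can be sketched directly. The slides do not name the two unused blocks; the constants 0xD800 (high block) and 0xDC00 (low block) used below are the values the Unicode standard reserves for this purpose. The function names are made up, and the sketch does not reject malformed input such as a lone high unit.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00   # the two reserved blocks of 1024 values

def utf16_units(code_point):
    """Return the list of 16-bit units for one Unicode code value."""
    if code_point < 0x10000:
        return [code_point]                    # represented directly
    v = code_point - 0x10000                   # now fits in 20 bits
    return [HIGH_START | (v >> 10),            # first 10 bits
            LOW_START | (v & 0x3FF)]           # remaining 10 bits

def utf16_decode(units):
    """Turn a list of 16-bit units back into code values (well-formed input)."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if HIGH_START <= u < LOW_START:        # high half of a 32-bit encoding
            low = units[i + 1]
            out.append(0x10000 + ((u - HIGH_START) << 10) + (low - LOW_START))
            i += 2
        else:                                  # plain 16-bit encoding
            out.append(u)
            i += 1
    return out

print([hex(u) for u in utf16_units(0x1F600)])
```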

UTF-8

  • Base unit is a single byte
  • For Unicode values 0 to 0x7F (127)
    • Encode directly in a single byte using 7 bits (0xxxxxxx)
    • Corresponds to ASCII values
  • For all other values…
    • First byte indicates how long the encoding is
      • Encode the number of bytes needed in unary (count the 1’s)
      • 110xxxxx means “two bytes” and has first 5 bits of number
      • 1110xxxx for 3 bytes, 11110xxx means 4 bytes
      • (Maximum needed is 4 bytes, though can encode more in theory)
    • All other bytes include next 6 bits
      • Bytes are of form 10xxxxxx
      • (Why not 7 bits, 1xxxxxxx or 8 bits, xxxxxxxx?)
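
The byte layout described above can be written out as a small encoder. This is a sketch with a hypothetical function name; it always picks the shortest form and does not check for the surrogate range 0xD800-0xDFFF, which real UTF-8 encoders refuse.

```python
def utf8_encode(cp):
    """Encode one Unicode code value using the bit patterns from the slide."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xA9, 0x20AC, 0x1F600):
    print(hex(cp), " ".join(f"{b:08b}" for b in utf8_encode(cp)))
```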

Notes on UTF-8 formatting

  • Number must be encoded in shortest possible encoding
    • Copyright © is 0xA9 = 1010 1001
    • Valid 11000010 10101001
    • Invalid 11100000 10000010 10101001 (and so on)
  • One-byte characters and those that are part of a sequence are distinguishable
    • High-order bit is always clear in former and always set in latter
    • (True for UTF-16, too, but for different reason)
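
The shortest-encoding rule can be observed with Python's built-in UTF-8 codec, which is used here only as a convenient checker. The two byte sequences are the valid and invalid forms from the slide; the variable names are illustrative.

```python
valid = bytes([0b11000010, 0b10101001])                  # two-byte form
overlong = bytes([0b11100000, 0b10000010, 0b10101001])   # padded three-byte form

decoded = valid.decode("utf-8")
print(repr(decoded), hex(ord(decoded)))

try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True            # a strict decoder refuses the longer form
print("overlong form rejected:", rejected)

# Every byte of a multi-byte sequence has its high-order bit set,
# while a one-byte (ASCII) character has it clear.
print([f"{b:08b}" for b in "A©".encode("utf-8")])
```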

Non-overlapping aspects of UTF-n

  • Consider older mixed-width encodings (e.g., Windows)
  • One character could be a subsequence of another
  • Causes problems during string search
  • Imagine looking for “D”
    • Is the 0x44 a “D” or the end of the 0x414 character?
    • To check, look backward one character
    • Could it be that this is a 0x442 character?
    • To check, look backward… (to start of text worst case)
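
The search problem can be reproduced with a real mixed-width encoding. Shift_JIS is used below as an assumed stand-in for the "older mixed-width encodings" the slide mentions (the slide does not name a specific one), and the katakana block is just a convenient set of two-byte characters to scan. The helper name is hypothetical.

```python
def trail_byte_matches(ch, encoding, target=b"D"):
    """True if the multi-byte encoding of ch contains the byte for target."""
    data = ch.encode(encoding)
    return len(data) > 1 and target in data

katakana = [chr(cp) for cp in range(0x30A1, 0x30F7)]   # U+30A1 .. U+30F6

sjis_hits = [ch for ch in katakana if trail_byte_matches(ch, "shift_jis")]
utf8_hits = [ch for ch in katakana if trail_byte_matches(ch, "utf-8")]

print("Shift_JIS characters containing byte 0x44:", len(sjis_hits))
print("UTF-8 characters containing byte 0x44:", len(utf8_hits))
```

A byte-level search for "D" can therefore report a false match inside a Shift_JIS character, while in UTF-8 every byte of a multi-byte character is 0x80 or above and can never be mistaken for an ASCII letter.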

Pros and cons of various UTF-n

  • All are equally valid encodings
  • UTF-32 simpler (trivial) to encode
  • UTF-16 captures more commonly used Unicode characters simply
  • UTF-8 is most compact for ASCII-like text
    • At a disadvantage for some East Asian text
  • Binary sort of UTF-8 matches Unicode value sort
  • Big-/little-endian storage not an issue for UTF-8
    • For UTF-16 and UTF-32, partly addressed by unused characters 0xFFFE and 0xFEFF
    • Put that “byte order marker” at start of text to flag endian choice
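
A few of these trade-offs can be measured with Python's standard codecs. The sample strings are arbitrary choices made for this sketch, not data from the lecture.

```python
import codecs

def sizes(text):
    """Encoded size in bytes under each UTF-n (no byte order marker)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_text = "compression"
cjk_text = "情報検索"          # four characters from the Basic Multilingual Plane
print(sizes(ascii_text))
print(sizes(cjk_text))

# Sorting by raw bytes: UTF-8 agrees with code-value order, UTF-16 may not.
words = ["z", "é", "中", "\uFFFD", "\U0001F600"]
by_code_value = sorted(words)                     # Python compares code values
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))
by_utf16 = sorted(words, key=lambda w: w.encode("utf-16-be"))
print(by_utf8 == by_code_value, by_utf16 == by_code_value)

# The generic "utf-16" codec writes a byte order marker first.
bom = "a".encode("utf-16")[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```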

Back to restricted variable-length codes

  • Generalization for numeric data
    • 1xxxxxxx for 2^7 = 128 most frequent symbols,
    • 0xxxxxxx1xxxxxxx for next 2^14 = 16,384 symbols, and so on
    • Con: totally fails to compress text
      • On WSJ89 uses 8.0 bits per symbol, for a 0.0% compression ratio (!!).
    • Pro: Can be used for integer data
      • Examples: word frequencies (usually small), inverted lists
  • Word-based encoding
    • Build dictionary, sorted by word frequency
    • Represent each word as an offset/index into the dictionary
    • Pro: A vocabulary of 20,000-50,000 words with a Zipf distribution requires 12-13 bits per word
      • compared with 10-11 bits for completely variable length
    • Con: Decoding dictionary is large, compared with other methods
(In a moment)
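
For integer data, the byte-level version of this code can be sketched as below. It follows the slide's convention that the byte with the high bit set ends a number. As a simplifying assumption, multi-byte codes here hold the integer itself in base 128 rather than an offset into the "next" block of symbols, and the function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer; the last byte has its high bit set."""
    chunks = [n & 0x7F]
    n >>= 7
    while n:
        chunks.append(n & 0x7F)
        n >>= 7
    chunks.reverse()              # most significant 7-bit group first
    chunks[-1] |= 0x80            # 1xxxxxxx marks the final byte
    return bytes(chunks)

def vbyte_decode(data):
    """Decode a concatenation of vbyte_encode outputs."""
    numbers, current = [], 0
    for b in data:
        current = (current << 7) | (b & 0x7F)
        if b & 0x80:              # final byte of this integer
            numbers.append(current)
            current = 0
    return numbers

gaps = [5, 130, 1, 20000]         # e.g., small counts or inverted-list gaps
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(len(encoded), "bytes:", vbyte_decode(encoded))
```

Small values, which dominate word frequencies and inverted-list gaps, cost a single byte each, which is why the scheme suits integer data even though it cannot shrink 8-bit text.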

Zipf’s Law

  • A few words occur very often
    • 2 most frequent words can account for 10% of occurrences
    • top 6 words are 20%, top 50 words are 50%
  • Many words are infrequent
  • “Principle of Least Effort”
    • easier to repeat words than coin new ones
  • Rank · Frequency ≈ Constant
    • $p_r$ = (Number of occurrences of word of rank r)/N
      • N total word occurrences
      • probability that a word chosen randomly from the text will be the word of rank r
      • for D unique words $\sum_{r=1}^{D} p_r = 1$
    • $r \cdot p_r = A$
    • $A \approx 0.1$

George Kingsley Zipf, 1902-1950

Linguistics professor at Harvard
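
The quantities defined above are easy to tabulate for any token list. This is a small sketch; the function name `zipf_table` and the toy input are made up for illustration.

```python
from collections import Counter

def zipf_table(tokens):
    """Return rows of (word, freq, rank r, p_r, r * p_r), most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())            # N, total word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        p = freq / total                    # p_r
        rows.append((word, freq, rank, p, rank * p))
    return rows

tokens = "the of the to the of".split()
rows = zipf_table(tokens)
for word, freq, rank, p, rp in rows:
    print(f"{word:4} {freq:3} {rank:2} {p:.3f} {rp:.3f}")
```

On a realistic corpus the last column should hover around a roughly constant value A if Zipf's law holds; the toy input here is far too small to show that.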

Example of Frequent Words

Frequent Word   Number of Occurrences   Percentage of Total
the                         7,398,934                   5.9
of                          3,893,790                   3.1
to                          3,364,653                   2.7
and                         3,320,687                   2.6
in                          2,311,785                   1.8
is                          1,559,147                   1.2
for                         1,313,561                   1.0
The                         1,144,860                   0.9
that                        1,066,503                   0.8
said                        1,027,713                   0.8

(Slide callout: artifact of InQuery’s stemming technique)

Frequencies from 336,310 documents in the 1GB TREC Volume 3 Corpus
125,720,891 total word occurrences; 508,209 unique words

Examples of Zipf

Top 50 words from 423 short TIME magazine articles
(243,836 word occurrences, lowercased, punctuation removed, 1.6 MB)

Word    Freq    r   Pr(%)   rPr     | Word         Freq   r   Pr(%)   rPr
the     15659   1   6.422   0.0642  | has          880    26  0.361   0.0938
of      7179    2   2.944   0.0589  | not          875    27  0.359   0.0969
to      6287    3   2.578   0.0774  | an           863    28  0.354   0.0991
a       5830    4   2.391   0.0956  | s            862    29  0.354   0.1025
and     5580    5   2.288   0.1144  | have         860    30  0.353   0.1058
in      5245    6   2.151   0.1291  | were         858    31  0.352   0.1091
that    2494    7   1.023   0.0716  | their        812    32  0.333   0.1066
for     2197    8   0.901   0.0721  | are          807    33  0.331   0.1092
was     2147    9   0.881   0.0792  | one          742    34  0.304   0.1035
with    1824    10  0.748   0.0748  | they         679    35  0.278   0.0975
his     1813    11  0.744   0.0818  | its          668    36  0.274   0.0986
is      1800    12  0.738   0.0886  | all          646    37  0.265   0.0980
he      1687    13  0.692   0.0899  | week         626    38  0.257   0.0976
as      1576    14  0.646   0.0905  | government   582    39  0.239   0.0931
on      1523    15  0.625   0.0937  | when         577    40  0.237   0.0947
by      1443    16  0.592   0.0947  | would        572    41  0.235   0.0962
at      1318    17  0.541   0.0919  | been         554    42  0.227   0.0954
it      1232    18  0.505   0.0909  | out          553    43  0.227   0.0975
from    1217    19  0.499   0.0948  | new          544    44  0.223   0.0982
but     1136    20  0.466   0.0932  | which        539    45  0.221   0.0995
u       949     21  0.389   0.0817  | up           539    45  0.221   0.0995
had     937     22  0.384   0.0845  | more         535    47  0.219   0.1031
last    909     23  0.373   0.0857  | into         516    48  0.212   0.1016
be      906     24  0.372   0.0892  | only         504    49  0.207   0.1013
who     883     25  0.362   0.0905  | will         488    50  0.2     0.1001

Examples of Zipf

Top 50 words from 84,678 Associated Press 1989 articles
(37,309,114 word occurrences, lowercased, punctuation removed, 266MB)

Word    Freq        r   Pr(%)   rPr     | Word      Freq      r   Pr(%)   rPr
the     2,420,778   1   6.488   0.0649  | has       136,007   26  0.365   0.0948
of      1,045,733   2   2.803   0.0561  | are       130,322   27  0.349   0.0943
to      968,882     3   2.597   0.0779  | not       127,493   28  0.342   0.0957
a       892,429     4   2.392   0.0957  | who       116,364   29  0.312   0.0904
and     865,644     5   2.32    0.116   | they      111,024   30  0.298   0.0893
in      847,825     6   2.272   0.1363  | its       111,021   31  0.298   0.0922
said    504,593     7   1.352   0.0947  | had       103,943   32  0.279   0.0892
for     363,865     8   0.975   0.078   | will      102,949   33  0.276   0.0911
that    347,072     9   0.93    0.0837  | would     99,503    34  0.267   0.0907
was     293,027     10  0.785   0.0785  | about     92,983    35  0.249   0.0872
on      291,947     11  0.783   0.0861  | i         92,005    36  0.247   0.0888
he      250,919     12  0.673   0.0807  | been      88,786    37  0.238   0.0881
is      245,843     13  0.659   0.0857  | this      87,286    38  0.234   0.0889
with    223,846     14  0.6     0.084   | their     84,638    39  0.227   0.0885
at      210,064     15  0.563   0.0845  | new       83,449    40  0.224   0.0895
by      209,586     16  0.562   0.0899  | or        81,796    41  0.219   0.0899
it      195,621     17  0.524   0.0891  | which     80,385    42  0.215   0.0905
from    189,451     18  0.508   0.0914  | we        80,245    43  0.215   0.0925
as      181,714     19  0.487   0.0925  | more      76,388    44  0.205   0.0901
be      157,300     20  0.422   0.0843  | after     75,165    45  0.201   0.0907
were    153,913     21  0.413   0.0866  | us        72,045    46  0.193   0.0888
an      152,576     22  0.409   0.09    | percent   71,956    47  0.193   0.0906
have    149,749     23  0.401   0.0923  | up        71,082    48  0.191   0.0915
his     142,285     24  0.381   0.0915  | one       70,266    49  0.188   0.0923
but     140,880     25  0.378   0.0944  | people    68,988    50  0.185   0.0925

Does Real Data Fit Zipf’s Law?

  • A law of the form $y = kx^{c}$ is called a power law.
  • Zipf’s law is a power law with $c = -1$
    • $r = (AN) \cdot n^{-1} \rightarrow n = (AN) \cdot r^{-1}$, where n is the number of occurrences of the word of rank r
    • AN is a constant for a fixed collection
  • On a log-log plot, power laws give a straight line with slope c.
    • $\log(y) = \log(kx^{c}) = \log k + c \log(x)$
    • $\log(n) = \log(AN \cdot r^{-1}) = \log(AN) - 1 \cdot \log(r)$
  • Zipf is quite accurate except for very high and low rank.

[Foundations of Statistical NLP, Manning and Schutze, 2000]
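
The straight-line claim can be checked with an ordinary least-squares fit in log-log space. The data below are synthetic counts generated exactly from n = AN / r with an assumed constant AN = 100,000, so the fit should recover slope -1; the function name is hypothetical and real corpus counts would only approximate this.

```python
import math

def loglog_slope(ranks, counts):
    """Least-squares slope and intercept of log(count) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(n) for n in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

ranks = range(1, 1001)
counts = [100_000 / r for r in ranks]      # synthetic, exactly Zipfian
slope, intercept = loglog_slope(ranks, counts)
print(f"slope c = {slope:.3f}, k = {math.exp(intercept):.0f}")
```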

Fit to Zipf for Brown Corpus

[Foundations of Statistical NLP, Manning and Schutze, 2000]

k = AN = 100,

Mandelbrot (1954) Correction

  • The following more general form gives a bit better fit
    • Adds a constant to the denominator
    • $y = k(x+t)^{c}$
  • Here, $n = (AN) \cdot (r+t)^{c}$

[Foundations of Statistical NLP, Manning and Schutze, 2000]

$k = 10^{5.4}$, $c = -1.15$, $t = 100$

Explanations for Zipf’s Law

  • Zipf’s explanation was his “principle of least effort.” Balance between speaker’s desire for a small vocabulary and hearer’s desire for a large one.
  • Debate (1955-61) between Mandelbrot and H. Simon over explanation.
  • Li (1992) shows that just random typing of letters including a space will generate “words” with a Zipfian distribution.
    • http://linkage.rockefeller.edu/wli/zipf/
    • Short words more likely to be generated
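
The random-typing experiment is simple to simulate. The five-letter alphabet, the text length, and the seed below are arbitrary choices for this sketch and are not taken from Li's paper.

```python
import random
from collections import Counter

def random_typing(n_chars, alphabet="abcde", seed=0):
    """Type n_chars random symbols (letters plus space) and count the 'words'."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    return Counter(text.split())

counts = random_typing(60_000)
top = counts.most_common(5)
print(top)

# Average frequency by word length: short "words" dominate.
by_length = {}
for word, freq in counts.items():
    by_length.setdefault(len(word), []).append(freq)
for length in sorted(by_length)[:4]:
    freqs = by_length[length]
    print(length, round(sum(freqs) / len(freqs), 1))
```

Each extra letter makes a specific "word" several times less likely, so frequency drops steeply with rank even though no language is involved.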

Vocabulary Growth (Heaps’ Law)

  • How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
    • Vocabulary has no upper bound due to proper names, typos, etc.
    • New words occur less frequently as vocabulary grows
  • If V is the size of the vocabulary and n is the length of the corpus in words:
    • $V = K n^{\beta}$ $(0 < \beta < 1)$
    • Typical constants:
      • K ≈ 10−100
      • β ≈ 0.4−0.6 (approx. square-root of n)
  • Can be derived from Zipf’s law by assuming documents are generated by randomly sampling words from a Zipfian distribution

Heaps’ Law Data

[Figure: vocabulary growth data with the fit $V = K n^{\beta}$]
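
Both sides of the comparison, the predicted V and the measured vocabulary growth, can be sketched in a few lines. The function names are hypothetical, and the constants in the example call (K = 10, β = 0.5) are simply picked from the typical ranges above.

```python
def heaps_prediction(n, K, beta):
    """Predicted vocabulary size V = K * n**beta for a corpus of n words."""
    return K * n ** beta

def vocabulary_growth(tokens):
    """Return (n, V) after each token: words seen so far, unique words so far."""
    seen, curve = set(), []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve

print(heaps_prediction(10_000, K=10, beta=0.5))   # about 1,000 unique words
curve = vocabulary_growth("a b a c b d".split())
print(curve)
```

Plotting the measured curve against the prediction on a real corpus is the usual way to estimate K and β.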

Compression and statistics: Outline

  • Introduction
  • Fixed Length Codes
    • Short-bytes, bigrams, n-grams
  • Restricted Variable-Length Codes
    • Basic method, extension for larger symbol sets, UTF-8
  • Some diversions
    • Statistics of text: Zipf, Heaps
    • Information theory
  • Variable-Length Codes
    • Huffman Codes / Canonical Huffman Codes
    • Lempel-Ziv (LZ77, Gzip, LZ78, LZW, Unix compress)
  • Synchronization
  • Compressing inverted files
  • Compression for block-level retrieval

Information Theory

  • Shannon studied theoretical limits for data compression and transmission rate
    • Compression limits given by Entropy (H)
    • Transmission limits given by Channel Capacity (C)
  • A number of language tasks have been formulated as a “noisy channel” problem
    • i.e., determine the most likely input given the noisy output
    • OCR
    • Speech recognition
    • Question answering
    • Machine translation
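
As a concrete illustration of the entropy limit, the sketch below computes the standard Shannon entropy, H = -Σ p·log2(p), of the symbol frequencies in a message. Treating each symbol as independent (an order-0 model) is an assumption of this example; the slides have not yet said which model is used, and the function name is made up.

```python
import math
from collections import Counter

def entropy_bits(message):
    """Shannon entropy in bits per symbol of the symbol frequencies in message."""
    counts = Counter(message)
    total = len(message)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

for sample in ("aaaa", "aabb", "abcd", "compression and statistics of text"):
    print(f"{sample!r}: {entropy_bits(sample):.3f} bits/symbol")
```

Under that independence assumption, no lossless code that assigns a codeword to each symbol can average fewer bits per symbol than this value, so a skewed distribution (low entropy) leaves more room for compression than a uniform one.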