Information Retrieval
James Allan
University of Massachusetts Amherst
Compression and Statistics of text
CMPSCI 646
Fall 2007
All slides copyright © James Allan
Encoding and Compression
• Encoding transforms data from one representation to another
• Compression is an encoding that takes less space
- e.g., to reduce load on memory, disk, I/O, network
• Lossless: decoder can reproduce message exactly
• Lossy: can reproduce message approximately
[Figure: Data → Encoder → Encoded Data → Decoder → Data; the encoder and the decoder each use a Model]
• Measuring compression
- Compression ratio: Enc/Orig → 25Mb / 125Mb = 20%
- Or sometimes: Orig/Enc → 125Mb / 25Mb = 500%
- Or sometimes (Orig-Enc)/Orig → 100Mb / 125Mb = 80%
- Or sometimes (Orig-Enc)/Enc → 100Mb / 25Mb = 400%
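The four conventions above are easy to confuse, so a small sketch may help make the arithmetic explicit. The function name `compression_measures` and the dictionary keys are illustrative choices, not terminology from the slides.

```python
def compression_measures(orig_size, enc_size):
    """Return the four measures from the slide, as percentages.

    orig_size and enc_size must use the same unit (the unit cancels out).
    """
    saved = orig_size - enc_size
    return {
        "enc_over_orig": 100.0 * enc_size / orig_size,   # Enc/Orig
        "orig_over_enc": 100.0 * orig_size / enc_size,   # Orig/Enc
        "saved_over_orig": 100.0 * saved / orig_size,    # (Orig-Enc)/Orig
        "saved_over_enc": 100.0 * saved / enc_size,      # (Orig-Enc)/Enc
    }


# The slide's example: 125 units compressed to 25 units.
measures = compression_measures(125, 25)
```

Because the same pair of sizes yields 20%, 500%, 80% or 400% depending on the convention, it is worth stating which one is meant whenever a "compression ratio" is reported.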
Compression
- Advantages of Compression
- Save space in memory (e.g., compressed cache)
- Save space when storing (e.g., disk, CD/DVD)
- Save time when accessing (e.g., I/O)
- Save time when communicating (e.g., over network)
- Disadvantages of Compression
- Costs time and computation to compress and uncompress
- Complicates or prevents random access
- May involve loss of information (e.g., JPEG)
- Makes data corruption much more costly.
- Small errors may make all of the data inaccessible.
Text vs. data compression
- Text compression (TC) predates most work on
general data compression (DC)
- TC is a kind of data compression optimized for text
- i.e., based on a language and a language model
- TC can be faster or simpler than general DC
- Assumptions made about the data.
- Text compression assumes a language and model
- Data compression typically learns the model on the fly
- Some forms of text compression are really data compression
- Text compression is effective when assumptions met
- Data compression is effective on almost any data with a skewed
distribution
Fixed Length Compression: Short bytes
- If alphabet ≤ 32 symbols, use 5 bits per symbol
- Saves 3 bits per character
- If alphabet > 32 symbols and ≤ 60
- use 1-30 for most frequent symbols (“base case”)
- use 1-30 for less frequent symbols (“shift case”)
- use 0 and 31 to shift back and forth (e.g., typewriter)
- Example:
- Fast → 5×5 = 25 bits rather than 4×8 = 32 bits
- Works well when shifts do not occur often.
- LaTeX → 9×5 = 45 bits rather than 5×8 = 40 bits
- Optimizations: Just one shift toggle, Temporary shift, and shift-lock
- Variations: Multiple “cases”
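A minimal sketch of the two-case scheme, under stated assumptions: the 30 symbols placed in each case are a hypothetical choice (the slides do not list them), and the encoder is assumed to start in the shift case, which is the reading under which "Fast" needs 5 codes and "LaTeX" needs 9, as in the example above.

```python
import string

# Hypothetical symbol assignment: 30 symbols per case, codes 1..30.
BASE_CASE = string.ascii_lowercase + " .,;"
SHIFT_CASE = string.ascii_uppercase + "!?'-"
TO_BASE, TO_SHIFT = 0, 31  # the two reserved shift codes


def encode(text, start_shifted=True):
    """Return a list of 5-bit codes (integers 0..31)."""
    codes, shifted = [], start_shifted
    for ch in text:
        if ch in BASE_CASE:
            want_shift, table = False, BASE_CASE
        elif ch in SHIFT_CASE:
            want_shift, table = True, SHIFT_CASE
        else:
            raise ValueError(f"symbol {ch!r} is outside the 60-symbol alphabet")
        if want_shift != shifted:
            # Emit a shift code only when the case actually changes.
            codes.append(TO_SHIFT if want_shift else TO_BASE)
            shifted = want_shift
        codes.append(table.index(ch) + 1)
    return codes


def decode(codes, start_shifted=True):
    out, shifted = [], start_shifted
    for code in codes:
        if code == TO_BASE:
            shifted = False
        elif code == TO_SHIFT:
            shifted = True
        else:
            out.append((SHIFT_CASE if shifted else BASE_CASE)[code - 1])
    return "".join(out)


bits_fast = 5 * len(encode("Fast"))
bits_latex = 5 * len(encode("LaTeX"))
```

Every case change costs a full 5-bit code, which is why text that alternates cases on each character can come out longer than plain 8-bit storage.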
Fixed Length Compression: Bigrams
- Use a byte (8 bits) as storage unit (values 0-255)
- Use values 0-87 for blank, upper case, lower case, digits and
25 special characters
- Use values 88-255 for common bigrams (master + combining)
- Master (8): blank, A, E, I, O, N, T, U
- Combining (21): blank, plus everything except J, K, Q, X, Y, Z
- Total codes: 88 + 8 * 21 = 88 + 168 = 256
- Example: Fast → 3 bytes instead of 4 (saves one byte)
- Pro: Simple, fast, requires little memory.
- Con: Based on a small symbol set
- Con: Maximum compression is 50%.
- Variation: 128 ASCII characters and 128 bigrams.
- Extension: Escape character for ASCII 128-255.
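The table layout above can be sketched as follows. Two details are assumptions rather than facts from the slides: the 25 special characters are simply the first 25 of Python's `string.punctuation`, and the master and combining letters are taken in lower case (the slide lists them in upper case), since that is the reading under which "Fast" saves one byte.

```python
import string

# 88 single-character codes: blank, A-Z, a-z, 0-9, 25 specials (hypothetical set).
SINGLES = (" " + string.ascii_uppercase + string.ascii_lowercase
           + string.digits + string.punctuation[:25])

# 8 master x 21 combining = 168 bigram codes (lower case assumed).
MASTER = " aeiontu"
COMBINING = " " + "".join(c for c in string.ascii_lowercase if c not in "jkqxyz")
BIGRAMS = [m + c for m in MASTER for c in COMBINING]

TABLE = list(SINGLES) + BIGRAMS          # index in this list = byte value
assert len(TABLE) == 256
CODE = {symbol: value for value, symbol in enumerate(TABLE)}


def encode(text):
    """Greedy left-to-right encoding: use a bigram code whenever one applies."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in CODE:
            out.append(CODE[pair])
            i += 2
        else:
            out.append(CODE[text[i]])    # KeyError for symbols outside the table
            i += 1
    return bytes(out)


def decode(data):
    return "".join(TABLE[b] for b in data)


encoded_fast = encode("Fast")            # F, "as", t
```

Since each byte stands for at most two characters, the output can never be shorter than half the input, which is the 50% bound noted above.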
Fixed Length Compression: n-grams
- Similar to bigrams, but extended to cover sequences
of 2 or more characters
- Can select common n-grams in advance
- Could learn them for specific applications
- The goal is that each encoded unit of length > 1
occur with very high (and roughly equal) probability
- Popular today for:
- OCR data
- scanning errors make bigram assumptions less applicable
- Asian languages
- two and three symbol words are common
- longer n-grams can capture phrases and names
Fixed length compression summary
- Three methods presented. All are
- simple
- very effective when their assumptions are correct
- All are based on a small symbol set, to varying degrees
- some only handle a small symbol set
- some handle a larger symbol set, but compress best when a few symbols
comprise most of the data
- All are based on a strong assumption about the choice of
language (English, here)
- Bigram and n-gram methods are also based on strong
assumptions about common sequences of symbols
- Intuitions about which characters/sequences are frequent
Generalization for More Symbols
- Use more than 2 cases
- 1xxx for 2^3 = 8 most frequent symbols, and
- 0xxx1xxx for next 2^6 = 64 symbols, and
- 0xxx0xxx1xxx for next 2^9 = 512 symbols, and
- …
- Average code length on WSJ89 is 6.2 bits per symbol
- Pro: Variable number of symbols
- Con: Only 72 symbols in 1 byte (or less)
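A sketch of the multi-case code above, assuming symbols are identified by a 0-based frequency rank (rank 0 is the most frequent symbol). The function names and the `bits` parameter are illustrative. Each chunk is a flag bit followed by `bits` payload bits, and the flag is set only on the last chunk of a code, matching the 1xxx / 0xxx1xxx patterns.

```python
def encode_rank(rank, bits=3):
    """Encode a 0-based frequency rank as a list of (bits+1)-bit chunks."""
    base = 1 << bits              # payload values per chunk (8 when bits=3)
    start, size, n_chunks = 0, base, 1
    # Find the length class: 8 ranks use 1 chunk, the next 64 use 2, and so on.
    while rank >= start + size:
        start += size
        size *= base
        n_chunks += 1
    offset = rank - start         # position inside the length class
    chunks = []
    for i in range(n_chunks):
        shift = bits * (n_chunks - 1 - i)
        payload = (offset >> shift) & (base - 1)
        flag = base if i == n_chunks - 1 else 0   # flag bit marks the last chunk
        chunks.append(flag | payload)
    return chunks


def decode_rank(chunks, bits=3):
    """Inverse of encode_rank for one complete code."""
    base = 1 << bits
    offset = 0
    for chunk in chunks:
        offset = (offset << bits) | (chunk & (base - 1))
    # Ranks consumed by all shorter length classes: base + base^2 + ...
    start = sum(base ** k for k in range(1, len(chunks)))
    return start + offset
```

With `bits=3`, ranks 0 to 7 take one 4-bit chunk and ranks 8 to 71 take two, so exactly 72 symbols fit in one byte or less, as the slide notes.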
Example: Unicode as UTF-n
- Encode Unicode into variable-length pieces
- Unicode allows 1,114,112 code values (0 to 0x10FFFF, or 21 bits)
- 17 “planes” (0 to 16) of 65,536 coding values
- Basic Multilingual Plane is 0 to 0xFFFF
- All UTF-n are equally valid ways to encode Unicode
- It is possible to map between them without loss of information
- All Unicode code values can be represented
- Consider UTF-32
- Each Unicode value encoded in 32 bits directly
- Fixed-length and fixed-location characters
- Simple to find character offsets and starting positions
- Wasteful
- 11 bits never used (remember, Unicode uses only 21 bits)
- In some cases, many other bits wasted (e.g., 7-bit ASCII)
UTF-16
- Represent Unicode in 16-bit (2 byte) units
- (Prior to v2.1, Unicode was limited to 16 bits)
- Unicode values 0x0000 to 0xFFFF (common chars)
- Represent directly as a 16-bit code value
- Two blocks of 1024 (2^10) values unused in there
- For values 0x10000 to 0x10FFFF
- Subtract 0x10000 so range is 0x0 to 0xFFFFF, needs 20 bits
- Encode first 10 bits in high 16-bit “unused range” number
- Encode remaining 10 bits in a low 16-bit “unused” number
- During decoding
- If character is “unused” then it is part of a 32-bit encoding
- Which block (high/low) determines which of the parts it is
- Else it is a 16-bit encoding
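The encoding and decoding steps above can be sketched as follows. The slides do not name the two "unused" blocks; in the Unicode standard they are the high surrogates 0xD800 to 0xDBFF and the low surrogates 0xDC00 to 0xDFFF, and the sketch relies on those values. The function names are illustrative.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00   # the two reserved 1024-value blocks


def utf16_encode(cp):
    """Return the list of 16-bit units for one Unicode code value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                       # represented directly
    v = cp - 0x10000                      # now fits in 20 bits
    return [HIGH_START | (v >> 10),       # first 10 bits
            LOW_START | (v & 0x3FF)]      # remaining 10 bits


def utf16_decode(units):
    """Return the list of code values for a sequence of 16-bit units."""
    out, i = [], 0
    while i < len(units):
        unit = units[i]
        if HIGH_START <= unit <= 0xDBFF:  # part of a 32-bit encoding
            low = units[i + 1]
            out.append(0x10000 + ((unit - HIGH_START) << 10) + (low - LOW_START))
            i += 2
        else:                             # plain 16-bit encoding
            out.append(unit)
            i += 1
    return out


pair = utf16_encode(0x1F600)              # a code value outside the BMP
```

The result can be cross-checked against Python's built-in codec, for example by comparing the units with `chr(0x1F600).encode("utf-16-be")`.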
UTF-8
- Base unit is a single byte
- For Unicode values 0 to 0x7F (127)
- Encode directly in a single byte using 7 bits (0xxxxxxx)
- Corresponds to ASCII values
- For all other values…
- First byte indicates how long the encoding is
- Encode the number of bytes needed in unary (count the 1’s)
- 110xxxxx means “two bytes” and has first 5 bits of number
- 1110xxxx means 3 bytes, 11110xxx means 4 bytes
- (Maximum needed is 4 bytes, though can encode more in theory)
- All other bytes include next 6 bits
- Bytes are of form 10xxxxxx
- (Why not 7 bits, 1xxxxxxx, or 8 bits, xxxxxxxx?)
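A minimal sketch of the byte layout described above, for a single code value. The function name is illustrative, and for brevity the sketch does not reject the reserved surrogate range, which a production encoder would.

```python
def utf8_encode(cp):
    """Encode one Unicode code value as UTF-8 bytes (shortest form)."""
    if cp < 0x80:
        return bytes([cp])                 # 0xxxxxxx, same as ASCII
    if cp < 0x800:
        n = 2                              # 110xxxxx 10xxxxxx
    elif cp < 0x10000:
        n = 3                              # 1110xxxx 10xxxxxx 10xxxxxx
    elif cp <= 0x10FFFF:
        n = 4                              # 11110xxx followed by three 10xxxxxx
    else:
        raise ValueError(f"code value out of range: {cp:#x}")
    # Lead byte prefix: n ones followed by a zero (the unary byte count).
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))    # continuation byte, low 6 bits
        cp >>= 6
    return bytes([lead_prefix | cp] + tail[::-1])


copyright_sign = utf8_encode(0xA9)
```

Picking the smallest `n` whose range contains the code value is what guarantees the shortest possible encoding.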
Notes on UTF-8 formatting
- Number must be encoded in shortest possible
encoding
- Copyright © is 0xA9 = 1010 1001
- Valid 11000010 10101001
- Invalid 11100000 10000010 10101001 (and so on)
- One-byte characters and those that are part of a
sequence are distinguishable
- High-order bit is always clear in former and always set in latter
- (True for UTF-16, too, but for different reason)
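Both notes can be checked with a short sketch. The helper names are illustrative; the validity check simply leans on Python's built-in UTF-8 decoder, which rejects encodings that are longer than necessary.

```python
def byte_kind(b):
    """Classify one UTF-8 byte by its high-order bits."""
    if (b & 0x80) == 0:
        return "single"            # 0xxxxxxx: a complete one-byte character
    if (b & 0xC0) == 0x80:
        return "continuation"      # 10xxxxxx: inside a multi-byte sequence
    return "lead"                  # 11xxxxxx: starts a multi-byte sequence


def is_valid_utf8(data):
    """True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = bytes([0b11000010, 0b10101001])               # valid form of 0xA9
overlong = bytes([0b11100000, 0b10000010, 0b10101001])   # same bits in 3 bytes
kinds = [byte_kind(b) for b in "D©".encode("utf-8")]
```

Because a one-byte character always has its high bit clear and every byte of a longer sequence has it set, a byte-level search for an ASCII character cannot match in the middle of another character.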
Non-overlapping aspects of UTF-n
- Consider older mixed-width encodings (e.g., Windows)
- One character could be a subsequence of another
- Problem for string search
- Imagine looking for “D”
- Is the 0x44 a “D” or the end of the 0x414 character?
- To check, look backward one character
- Could it be that this is a 0x442 character?
- Keep looking backward (to start of text, worst case)
Pros and cons of various UTF-n
- All are equally valid encodings
- UTF-32 simpler (trivial) to encode
- UTF-16 captures more commonly used Unicode characters simply
- UTF-8 is most compact for ASCII-like text
- At a disadvantage for some East Asian text
- Binary sort of UTF-8 matches Unicode value sort
- Big-/little-endian storage not an issue for UTF-8
- Partly addressed by unused characters 0xFFFE and 0xFEFF
- Put that “byte order marker” at start of text to flag endian choice
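A short sketch, using Python's built-in codecs, that illustrates three of the points above: relative size on ASCII-like versus East Asian text, the binary sort property of UTF-8, and the byte order marker. The sample strings and the helper name are arbitrary illustrations.

```python
import codecs


def encoded_sizes(text):
    """Byte length of text under each encoding (explicit endianness, no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_sizes = encoded_sizes("hello")     # ASCII-like text
cjk_sizes = encoded_sizes("日本語")        # East Asian text

# Sorting by raw UTF-8 bytes gives the same order as sorting by code value
# (Python compares str objects by code value).
words = ["z", "é", "日", "😀", "a"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_value = sorted(words)

# The generic "utf-16" codec writes a byte order marker first.
bom = "x".encode("utf-16")[:2]
```

UTF-8 has no byte-order question because its unit is a single byte; the marker only matters for the 16-bit and 32-bit forms.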
Back to restricted variable-length codes
- Generalization for numeric data
- 1xxxxxxx for 2^7 = 128 most frequent symbols,
- 0xxxxxxx1xxxxxxx for next 2^14 = 16,384 symbols, and so on
- Con: totally fails to compress text
- On WSJ89 uses 8.0 bits per symbol, for a 0.0% compression ratio (!!).
- Pro: Can be used for integer data
- Examples: word frequencies (usually small), inverted lists
- Word-based encoding
- Build dictionary sorted by word frequency
- Represent each word as an offset/index into the dictionary
- Pro: A vocabulary of 20,000-50,000 words with a Zipf distribution
requires 12-13 bits per word
- compared with 10-11 bits for completely variable-length codes (covered in a moment)
- Con: Decoding dictionary is large, compared with other methods
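For integer data such as word frequencies, the byte-oriented code can be sketched as below. This is a common simplified variant, assumed here rather than taken from the slides: it stores the integer's own 7-bit groups, most significant first, and sets the high bit on the final byte, instead of offsetting each length class by the number of shorter codes.

```python
def vbyte_encode(n):
    """Encode a non-negative integer; the last byte has its high bit set."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()               # most significant 7-bit group first
    groups[-1] |= 0x80             # 1xxxxxxx marks the end of this number
    return bytes(groups)


def vbyte_decode_stream(data):
    """Decode a concatenation of encoded integers."""
    values, current = [], 0
    for b in data:
        current = (current << 7) | (b & 0x7F)
        if b & 0x80:               # end marker reached
            values.append(current)
            current = 0
    return values


# Small numbers, which dominate frequency lists, take one byte each.
stream = b"".join(vbyte_encode(n) for n in (5, 1, 128, 20000))
```

A full-byte code like this never goes below 8 bits per item, which is why it fails on character data but still pays off for integers that would otherwise occupy 4 bytes each.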
Zipf’s Law
- A few words occur very often
- 2 most frequent words can account for 10% of occurrences
- top 6 words are 20%, top 50 words are 50%
- Many words are infrequent
- “Principle of Least Effort”
- easier to repeat words than coin new ones
- Rank · Frequency ≈ Constant
- $p_r$ = (number of occurrences of the word of rank $r$) / $N$
- probability that a word chosen randomly from the text will be the word of rank $r$
- for $D$ unique words, $\sum_{r=1}^{D} p_r = 1$
- $r \cdot p_r = A$
- A ≈ 0.
George Kingsley Zipf, 1902-
Linguistics professor at Harvard
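The quantities above can be computed for any tokenized text with a few lines. This is a sketch: the function name and the toy word list are illustrative, and on a real corpus the last column (rank times probability) is what Zipf's law predicts to be roughly constant.

```python
from collections import Counter


def rank_table(words):
    """Return rows (word, frequency, rank r, p_r, r * p_r), most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())           # N, the number of word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        p = freq / total
        rows.append((word, freq, rank, p, rank * p))
    return rows


# Toy input only; a real test needs a corpus of many thousands of words.
rows = rank_table("the cat and the dog and the bird".split())
```

Words with equal frequency are ranked in arbitrary order, so ties should be kept in mind when reading the low-frequency end of such a table.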
Example of Frequent Words
Frequent Word   Number of Occurrences   Percentage of Total
the             7,398,934               5.9
of              3,893,790               3.
to              3,364,653               2.
and             3,320,687               2.
in              2,311,785               1.
is              1,559,147               1.
for             1,313,561               1.
The             1,144,860               0.
that            1,066,503               0.
said            1,027,713               0.
[Slide callout: Artifact of InQuery’s stemming technique]
Frequencies from 336,310 documents in the 1GB TREC Volume 3 Corpus
125,720,891 total word occurrences; 508,209 unique words
Examples of Zipf
Top 50 words from 423 short TIME magazine articles
(243,836 word occurrences, lowercased, punctuation removed, 1.6 MB)
Word   Freq    r    Pr      rPr       Word         Freq   r    Pr      rPr
the    15659   1    6.422   0.0642    has          880    26   0.361   0.
of     7179    2    2.944   0.0589    not          875    27   0.359   0.
to     6287    3    2.578   0.0774    an           863    28   0.354   0.
a      5830    4    2.391   0.0956    s            862    29   0.354   0.
and    5580    5    2.288   0.1144    have         860    30   0.353   0.
in     5245    6    2.151   0.1291    were         858    31   0.352   0.
that   2494    7    1.023   0.0716    their        812    32   0.333   0.
for    2197    8    0.901   0.0721    are          807    33   0.331   0.
was    2147    9    0.881   0.0792    one          742    34   0.304   0.
with   1824    10   0.748   0.0748    they         679    35   0.278   0.
his    1813    11   0.744   0.0818    its          668    36   0.274   0.
is     1800    12   0.738   0.0886    all          646    37   0.265   0.
he     1687    13   0.692   0.0899    week         626    38   0.257   0.
as     1576    14   0.646   0.0905    government   582    39   0.239   0.
on     1523    15   0.625   0.0937    when         577    40   0.237   0.
by     1443    16   0.592   0.0947    would        572    41   0.235   0.
at     1318    17   0.541   0.0919    been         554    42   0.227   0.
it     1232    18   0.505   0.0909    out          553    43   0.227   0.
from   1217    19   0.499   0.0948    new          544    44   0.223   0.
but    1136    20   0.466   0.0932    which        539    45   0.221   0.
u      949     21   0.389   0.0817    up           539    45   0.221   0.
had    937     22   0.384   0.0845    more         535    47   0.219   0.
last   909     23   0.373   0.0857    into         516    48   0.212   0.
be     906     24   0.372   0.0892    only         504    49   0.207   0.
who    883     25   0.362   0.0905    will         488    50   0.2     0.
Examples of Zipf
Top 50 words from 84,678 Associated Press 1989 articles
(37,309,114 word occurrences, lowercased, punctuation removed, 266MB)
Word   Freq        r    Pr(%)   rPr       Word      Freq      r    Pr(%)   rPr
the    2,420,778   1    6.488   0.0649    has       136,007   26   0.365   0.
of     1,045,733   2    2.803   0.0561    are       130,322   27   0.349   0.
to     968,882     3    2.597   0.0779    not       127,493   28   0.342   0.
a      892,429     4    2.392   0.0957    who       116,364   29   0.312   0.
and    865,644     5    2.32    0.116     they      111,024   30   0.298   0.
in     847,825     6    2.272   0.1363    its       111,021   31   0.298   0.
said   504,593     7    1.352   0.0947    had       103,943   32   0.279   0.0892
for    363,865     8    0.975   0.078     will      102,949   33   0.276   0.
that   347,072     9    0.93    0.0837    would     99,503    34   0.267   0.
was    293,027     10   0.785   0.0785    about     92,983    35   0.249   0.
on     291,947     11   0.783   0.0861    i         92,005    36   0.247   0.
he     250,919     12   0.673   0.0807    been      88,786    37   0.238   0.
is     245,843     13   0.659   0.0857    this      87,286    38   0.234   0.
with   223,846     14   0.6     0.084     their     84,638    39   0.227   0.
at     210,064     15   0.563   0.0845    new       83,449    40   0.224   0.
by     209,586     16   0.562   0.0899    or        81,796    41   0.219   0.
it     195,621     17   0.524   0.0891    which     80,385    42   0.215   0.
from   189,451     18   0.508   0.0914    we        80,245    43   0.215   0.
as     181,714     19   0.487   0.0925    more      76,388    44   0.205   0.
be     157,300     20   0.422   0.0843    after     75,165    45   0.201   0.
were   153,913     21   0.413   0.0866    us        72,045    46   0.193   0.
an     152,576     22   0.409   0.09      percent   71,956    47   0.193   0.
have   149,749     23   0.401   0.0923    up        71,082    48   0.191   0.
his    142,285     24   0.381   0.0915    one       70,266    49   0.188   0.
but    140,880     25   0.378   0.0944    people    68,988    50   0.185   0.
Does Real Data Fit Zipf’s Law?
- A law of the form $y = k x^{c}$ is called a power law.
- Zipf’s law is a power law with $c = -1$
- $r = (AN) \cdot n^{-1} \Rightarrow n = (AN) \cdot r^{-1}$
- $AN$ is a constant for a fixed collection
- On a log-log plot, power laws give a straight line with slope $c$.
- $\log(y) = \log(k x^{c}) = \log k + c \log(x)$
- $\log(n) = \log(AN \cdot r^{-1}) = \log(AN) - 1 \cdot \log(r)$
- Zipf is quite accurate except for very high and low rank.
[Foundations of Statistical NLP, Manning and Schutze, 2000]
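One way to test the fit on real data is to regress log frequency on log rank and inspect the slope. The sketch below uses ordinary least squares written out by hand; the function name is illustrative, and the input shown is synthetic data that follows Zipf's law exactly, so it only demonstrates the mechanics.

```python
import math


def loglog_fit(ranks, freqs):
    """Least-squares line through (log r, log n); returns (slope, intercept)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(n) for n in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # estimate of the exponent c
    intercept = mean_y - slope * mean_x    # estimate of log k
    return slope, intercept


# Synthetic check: n = k / r with k = 1000 (an arbitrary illustrative constant).
ranks = list(range(1, 101))
freqs = [1000.0 / r for r in ranks]
slope, intercept = loglog_fit(ranks, freqs)
```

On real rank-frequency data the estimated slope usually departs from -1 at the extremes, which is the deviation at very high and very low rank noted above.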
Fit to Zipf for Brown Corpus
[Figure: fit of Zipf’s law to the Brown Corpus; from Foundations of Statistical NLP, Manning and Schutze, 2000]
k = AN = 100,
Mandelbrot (1954) Correction
- The following more general form gives a bit better fit
- Adds a constant to the denominator
- $n = (AN) \cdot (r + t)^{-1}$
[Figure: Mandelbrot fit with $k = 10^{5.4}$, $C = -1.15$, $t = 100$; from Foundations of Statistical NLP, Manning and Schutze, 2000]
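The parameters listed with the figure ($k$, $C$, $t$) suggest a fitted curve of the form $n = k \cdot (r + t)^{C}$; that reading is an assumption here, since the exponent is not legible in the extracted slide. Under that assumption, a small sketch shows how the shift $t$ flattens the curve at the top ranks:

```python
def mandelbrot_freq(rank, k=10 ** 5.4, c=-1.15, t=100):
    """Predicted frequency for a rank under the assumed form k * (rank + t) ** c.

    The defaults are the values shown with the figure; setting t=0 and c=-1
    reduces the formula to plain Zipf, n = k / rank.
    """
    return k * (rank + t) ** c


# Ratio between the predicted frequencies of the two most frequent words.
mandelbrot_ratio = mandelbrot_freq(1) / mandelbrot_freq(2)
zipf_ratio = mandelbrot_freq(1, c=-1, t=0) / mandelbrot_freq(2, c=-1, t=0)
```

Plain Zipf predicts that the top word is twice as frequent as the second, while the shifted form predicts a much smaller gap, which is the kind of correction needed at very high rank.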
Explanations for Zipf’s Law
- Zipf’s explanation was his “principle of least effort.”
Balance between speaker’s desire for a small
vocabulary and hearer’s desire for a large one.
- Debate (1955-61) between Mandelbrot and H. Simon
over explanation.
- Li (1992) shows that just random typing of letters
including a space will generate “words” with a Zipfian
distribution.
- http://linkage.rockefeller.edu/wli/zipf/
- Short words more likely to be generated
Vocabulary Growth (Heaps’ Law)
- How does the size of the overall vocabulary (number
of unique words) grow with the size of the corpus?
- Vocabulary has no upper bound due to proper names, typos, etc.
- New words occur less frequently as vocabulary grows
- If V is the size of the vocabulary and n is the
length of the corpus in words:
- $V = K n^{\beta}$ ($0 < \beta < 1$)
- Typical constants:
- K ≈ 10−100
- β ≈ 0.4−0.6 (approx. square-root of n)
- Can be derived from Zipf’s law by assuming
documents are generated by randomly sampling
words from a Zipfian distribution
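A sketch of both sides of the law: the predicted vocabulary size for given constants, and the observed growth curve measured on a token stream. The default constants are merely picked from inside the typical ranges quoted above, and the function names are illustrative.

```python
def heaps_vocabulary(n, k=50.0, beta=0.5):
    """Predicted vocabulary size V = K * n ** beta (K and beta are assumed values)."""
    return k * n ** beta


def vocabulary_growth(words):
    """Observed (corpus length so far, unique words so far) after each token."""
    seen, curve = set(), []
    for position, word in enumerate(words, start=1):
        seen.add(word)
        curve.append((position, len(seen)))
    return curve


predicted = heaps_vocabulary(1_000_000)            # one million tokens
observed = vocabulary_growth("a b a c b d".split())
```

Plotting the observed curve on log-log axes and fitting a line gives estimates of $\beta$ (the slope) and $K$ (from the intercept) for a particular collection.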
Heaps’ Law Data
[Figure: vocabulary size versus corpus size, with the fitted curve $V = K n^{\beta}$]
Compression and statistics: Outline
- Introduction
- Fixed Length Codes
- Short-bytes, bigrams, n-grams
- Restricted Variable-Length Codes
- Basic method, extension for larger symbol sets, UTF-
- Some diversions
- Statistics of text: Zipf, Heaps
- Information theory
- Variable-Length Codes
- Huffman Codes / Canonical Huffman Codes
- Lempel-Ziv (LZ77, Gzip, LZ78, LZW, Unix compress)
- Synchronization
- Compressing inverted files
- Compression for block-level retrieval
Information Theory
- Shannon studied theoretical limits for data
compression and transmission rate
- Compression limits given by Entropy (H)
- Transmission limits given by Channel Capacity (C)
- A number of language tasks have been formulated as
a “noisy channel” problem
- i.e., determine the most likely input given the noisy output
- OCR
- Speech recognition
- Question answering
- Machine translation
- …
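The slide names entropy without giving its formula; the standard definition for a source with symbol probabilities $p_i$ is $H = -\sum_i p_i \log_2 p_i$ bits per symbol. The sketch below estimates it from observed symbol counts, under the simplifying assumption that symbols are independent (a zero-order model), so for natural text it is only a bound for codes that treat each symbol in isolation.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Zero-order entropy estimate in bits per symbol, from observed counts."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = entropy_bits("abcd")       # four equally likely symbols
skewed = entropy_bits("aaab")        # one symbol dominates
```

The more skewed the distribution, the lower the entropy and the more room a variable-length code has to beat a fixed-length one.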