
CS336: Intelligent Information Retrieval
Lecture 6: Compression
Compression
• Change the representation of a file so that
– takes less space to store
– less time to transmit
• Text compression: the original file can be reconstructed exactly
• Data compression: can tolerate small changes or noise
– typically sound or images
• already a digital approximation of an analog waveform
Text Compression: Two classes
• Symbol-wise Methods
– estimate the probability of occurrence of symbols
– More accurate estimates => Greater compression
• Dictionary Methods
– represent several symbols as one codeword
• compression performance based on length of codewords
– replace words and other fragments of text with an index
to an entry in a “dictionary”
Compression: Based on text model
[Diagram: text → encoder → compressed text → decoder → text, with a model attached to both the encoder and the decoder]
• Determine the output code based on the probability distribution of the model
– # bits used to encode a symbol should equal its information content, I(s) = − log₂ Pr(s)
– Recall Information Theory and Entropy (H)
• avg amount of information per symbol over an alphabet
• H = Σ Pr(s) · I(s) = − Σ Pr(s) log₂ Pr(s)
– The more probable a symbol is, the less information it provides (easier to predict)
– The less probable, the more information it provides
Compression: Based on text model
• Symbol-wise methods (statistical)
– May assume symbols occur independently (independence assumption)
– Considering context achieves best compression rates
• e.g. probability of seeing ‘u’ next is higher if we’ve just seen ‘q’
– Must yield a probability distribution in which the symbol that actually occurs next is given a very high probability
• Dictionary methods
– string representations typically based upon text seen so far
• most chars coded as part of a string that has already occurred
• Space is saved if the reference or pointer takes fewer bits than
the string it replaces
Statistical Models
• Static: same model regardless of text processed
– But, won’t be good for all inputs
• Recall entropy: measures information content over a collection of symbols (computed in the sketch after this slide):

  E = − Σᵢ pᵢ log₂ pᵢ

– high pr(s) => low entropy; low pr(s) => high entropy
– pr(s) = 1 => only s is possible, no encoding needed
– pr(s) = 0 => s won’t occur, so s cannot be encoded
– What is implication for models in which some pr(s) = 0,
but s does in fact occur?
s can not be encoded!
Therefore in practice, all symbols must be assigned pr(s) > 0
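As a concrete check on the entropy formula above, here is a small Python sketch (my own illustration, not code from the lecture); the probabilities are the ones used in the Huffman examples later in this lecture.

```python
import math

def entropy(probs):
    """E = -sum(p_i * log2(p_i)), summed over symbols with p_i > 0."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Symbol probabilities from the Huffman example below.
model = {'a': 0.05, 'b': 0.05, 'c': 0.1, 'd': 0.2, 'e': 0.3, 'f': 0.2, 'g': 0.1}
print(round(entropy(model), 3))  # ~2.546 bits per symbol on average
```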
Semi-static Statistical Models
• Generate the model on the fly for the file being processed
– first pass to estimate probabilities
– probabilities transmitted to decoder before encoded
transmission
– disadvantage is having to transmit model first
Adaptive Statistical Models
• Model constructed from the text just encoded
– begin with general model, modify gradually as more
text is seen
• best models take into account context
– decoder can generate same model at each time step
since it has seen all characters up to that point
– advantages: effective w/out fine-tuning to particular
text and only 1 pass needed
Adaptive Statistical Models
• What about pr(0) problem?
– allow 1 extra count divided evenly amongst unseen
symbols
• recall probabilities estimated via occurrence counts
• assume 72,300 total occurrences, 5 unseen characters:
• each unseen char u gets (see the sketch after this slide)
pr(u) = pr(char is u, given an unseen char occurs) * pr(an unseen char occurs)
= 1/5 * 1/72,301
(the single extra count raises the total from 72,300 to 72,301)
• Not good for full-text retrieval
– must be decoded from beginning
• therefore not good for random access to files
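A rough sketch of the “one extra count” treatment (my own code, not from the lecture; the function name, the seen count of 9,000, and the particular unseen characters are hypothetical, while 72,300 and the 5 unseen characters come from the slide):

```python
def symbol_probability(counts, total, unseen, symbol):
    """Estimate Pr(symbol), reserving one extra count for all unseen symbols."""
    if symbol in counts:
        # Seen symbols share total/(total+1) of the probability mass.
        return counts[symbol] / (total + 1)
    # The single extra count is split evenly among the unseen symbols.
    return (1 / len(unseen)) * (1 / (total + 1))

# Numbers from the slide: 72,300 occurrences seen so far, 5 characters never seen.
unseen = {'#', '@', '~', '^', '|'}   # hypothetical choice of unseen characters
print(symbol_probability({'e': 9000}, 72_300, unseen, '#'))  # 1/5 * 1/72301
```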
Symbol-wise Methods
• Morse Code
– common symbols assigned short codes (few bits)
– rare symbols have longer codes
• Huffman coding (prefix-free code)
– no codeword is a prefix of another symbol’s codeword
Symbol   Codeword   Probability
a        0000       0.05
b        0001       0.05
c        001        0.1
d        01         0.2
e        10         0.3
f        110        0.2
g        111        0.1
Huffman coding
Symbol   Codeword   Probability
a        0000       0.05
b        0001       0.05
c        001        0.1
d        01         0.2
e        10         0.3
f        110        0.2
g        111        0.1

• What is the string for 1010110111111110?
  – eefggf
• The decoder identifies 10 as the first codeword => e
• Decoding then proceeds left to right on the remainder of the string (see the sketch below)
• Good when the probability distribution is static
• Best when the model is word-based
• Typically used for compressing text in IR
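That left-to-right walk can be sketched in a few lines of Python (my own illustration of prefix-free decoding, not code from the lecture; it simply grows a buffer until it matches a codeword from the table above):

```python
# Codeword table from the slide (prefix-free: no code is a prefix of another).
CODES = {'a': '0000', 'b': '0001', 'c': '001', 'd': '01',
         'e': '10', 'f': '110', 'g': '111'}
DECODE = {code: sym for sym, code in CODES.items()}

def huffman_decode(bits):
    """Scan left to right, emitting a symbol as soon as a codeword matches."""
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in DECODE:          # prefix-free => the first match is the symbol
            out.append(DECODE[buf])
            buf = ''
    return ''.join(out)

print(huffman_decode('1010110111111110'))  # eefggf
```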
Huffman coding
Symbol   Codeword   Probability
a        0000       0.05
b        0001       0.05
c        001        0.1
d        01         0.2
e        10         0.3
f        110        0.2
g        111        0.1

[Figure: Huffman decoding tree for the table above; leaves a (0.05), b (0.05), c (0.1), d (0.2), e (0.3), f (0.2), g (0.1); internal nodes 0.1, 0.2, 0.3, 0.4, 0.6 and root 1.0; each left branch labeled 0, each right branch labeled 1]
Create a decoding tree bottom-up (sketched in code below):
1. Each symbol and its probability form a leaf
2. The 2 nodes w/ smallest p are joined under the same parent (p = p(s1) + p(s2))
3. Repeat, ignoring children already joined, until all nodes are connected
4. Label the branches
• Left branch gets 0
• Right branch gets 1
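A minimal sketch of this bottom-up construction using Python’s heapq (my own code, not from the lecture). Ties among equal probabilities can be broken differently than on the slide, so the individual codewords may differ, but the average code length comes out the same.

```python
import heapq
import itertools

def huffman_codes(probs):
    """Bottom-up Huffman construction: repeatedly join the two least
    probable nodes under a new parent, labeling left 0 and right 1."""
    tie = itertools.count()   # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(tie), {sym: ''}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)      # smallest probability
        p2, _, right = heapq.heappop(heap)     # second smallest
        merged = {s: '0' + c for s, c in left.items()}          # left branch gets 0
        merged.update({s: '1' + c for s, c in right.items()})   # right branch gets 1
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

probs = {'a': 0.05, 'b': 0.05, 'c': 0.1, 'd': 0.2, 'e': 0.3, 'f': 0.2, 'g': 0.1}
codes = huffman_codes(probs)
avg = sum(probs[s] * len(c) for s, c in codes.items())
print(codes)
print(avg)   # 2.6 bits/symbol, the same average length as the slide's codewords
```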
Huffman coding
• Requires 2 passes over the document
1. gather statistics and build the coding table
2. encode the document
• Must explicitly store coding table with document
– eats into your space savings on short documents
• Exploits only non-uniformity in the symbol distribution
– adaptive algorithms, by contrast, can recognize the higher-order redundancy in strings such as 0101010101....
Dictionary Methods
• Braille
– special codes are used to represent whole words
• Dictionary: list of sub-strings with corresponding
code-words
– we do this naturally, e.g. replace “december” w/ “12”
– codebook for ascii might contain
• 128 characters and 128 common letter pairs
• At best each char encoded in 4 bits rather than 7
– can be static, semi-static, or adaptive (best)
Dictionary Methods
• Ziv-Lempel
– Built on the fly
– Replace strings w/ reference to a previous occurrence
– codebook
• all words seen prior to current position
• codewords represented by “pointers”
– Compression: pointer stored in fewer bits than string
– Especially useful when resources used must be
minimized (e.g. 1 machine distributes data to many)
• Relatively easy to implement
• Decoding is fast
• Requires only small amount of memory
Ziv-Lempel
• Coded output: series of triples <x,y,z>
  • x: how far back to look in the previously decoded text to find the next string (a pointer into the text)
  • y: how long the string is
  • z: next character from the input
    – only necessary if the char to be coded did not occur previously
<0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <1,10,a>
decodes to: a b a a b a b a a b b, then 10 b’s followed by a (see the decoder sketch below)
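A minimal decoder for these triples (my own sketch; the lecture gives no code, and the function name is mine). Copying one character at a time lets a pointer reach into text the triple is itself producing, which is how <1,10,a> expands into ten b’s:

```python
def lz_decode(triples):
    """Decode a list of (offset-back, length, next-char) triples."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            # Copy one char at a time so the copy may overlap what it writes.
            out.append(out[start + i])
        out.append(ch)
    return ''.join(out)

triples = [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'a'),
           (3, 2, 'b'), (5, 3, 'b'), (1, 10, 'a')]
print(lz_decode(triples))   # abaababaabb followed by ten b's and a final a
```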
Accessing Compressed Text
• Store information identifying a document’s
location relative to the beginning of the file
– Store bit offset
• variable length codes mean docs may not begin/end on
byte boundaries
– insist that docs begin on byte boundaries
• some docs will have waste bits at the end
• need a method to explicitly identify end of coded text
Text Compression
• Key difference b/w symbol-wise & dictionary methods
– symbol-wise methods base the coding of a symbol on its context
– dictionary methods group symbols together, creating an implicit context
• Present techniques give compression of ~2 bits/character
for general English text
• In order to do better, techniques must leverage
– semantic content
– external world knowledge
• Rule of thumb: The greater the compression,
– the slower the program runs or
– the more memory it uses.
• Inverted lists contain skewed data, so text compression
methods are not appropriate.
Inverted File Compression
• Note: each inverted list can be stored as an
ascending sequence of integers
– <8; 2, 4, 9, 45, 57, 76, 78, 80>
• Replace the document numbers with d-gaps (differences between successive entries; the first entry is unchanged)
– <8; 2, 2, 5, 36, 12, 19, 2, 2>
– on average, gaps are much smaller than the largest document number (see the sketch after this slide)
• Compression models describe probability
distribution of d-gap sizes
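Converting between document numbers and d-gaps is a one-liner in each direction; this sketch (mine, not the lecture’s, with my own function names) uses the posting list from the slide:

```python
def to_dgaps(postings):
    """Store each document number as its difference from the previous one."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_dgaps(gaps):
    """Rebuild document numbers by summing the gaps back up."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

postings = [2, 4, 9, 45, 57, 76, 78, 80]
gaps = to_dgaps(postings)
print(gaps)                          # [2, 2, 5, 36, 12, 19, 2, 2]
assert from_dgaps(gaps) == postings  # round-trips back to the original list
```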
Fixed-Length Compression
• Typically, integer stored in 4 bytes
• To compress the d-gaps for:
  <5; 1, 3, 7, 70, 250>   (document numbers)
  <5; 1, 2, 4, 63, 180>   (d-gaps)
• Use fewer bytes:
– Two leftmost bits store the number of bytes (1-4)
– The d-gap is stored in the next 6, 14, 22, 30 bits
Length                            Bytes
0 ≤ x < 64                        1
64 ≤ x < 16,384                   2
16,384 ≤ x < 4,194,304            3
4,194,304 ≤ x < 1,073,741,824     4
Example
The inverted list is: 1, 3, 7, 70, 250
After computing gaps: 1, 2, 4, 63, 180
Value   Compressed
1       00 000001
2       00 000010
4       00 000100
63      00 111111
180     01 000000 10110100

Number of bytes reduced from 5*4 = 20 to 6.
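The byte scheme above (2 length bits, then 6/14/22/30 value bits) can be sketched as follows; this is my own illustration of the rule described on the previous slide, not code from the lecture:

```python
def encode_gap(x):
    """Encode one d-gap: the two leftmost bits of the first byte give the
    number of bytes (00 = 1 ... 11 = 4); the value fills the remaining bits."""
    for nbytes, limit in enumerate((64, 16_384, 4_194_304, 1_073_741_824), start=1):
        if x < limit:
            word = ((nbytes - 1) << (8 * nbytes - 2)) | x   # length prefix + value
            return word.to_bytes(nbytes, 'big')
    raise ValueError("d-gap too large for this 4-byte scheme")

encoded = b''.join(encode_gap(g) for g in [1, 2, 4, 63, 180])
print(len(encoded))                                  # 6 bytes, as on the slide
print(' '.join(f'{byte:08b}' for byte in encoded))   # ... 180 -> 01000000 10110100
```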
γ Code
• This method is superior to a binary encoding
• Represent the number x in two parts:
  1. Unary code: specifies the number of bits needed to code x
     – 1 + ⌊log₂ x⌋, coded in unary, followed by
  2. Binary code: codes x in that many bits
     – ⌊log₂ x⌋ bits that represent x − 2^⌊log₂ x⌋
Let x = 9: ⌊log₂ x⌋ = 3 (8 = 2³ ≤ 9 < 2⁴ = 16)
  part 1: 1 + 3 = 4 => 1110 (unary)
  part 2: 9 − 8 = 1 => 001 (1 in 3 binary bits)
  code: 1110 001
Unary code: (n-1) 1-bits followed by a 0, therefore 1 represented by 0,
2 represented by 10, 3 by 110, etc.
Example
The inverted list is: 1, 3, 7, 70, 250
After computing gaps: 1, 2, 4, 63, 180
Value   Compressed: 1 + ⌊log₂ x⌋ in unary, then x − 2^⌊log₂ x⌋ in binary
1       0 0
2       10 0
4       110 00
63      111110 11111
180     11111110 0110100

Number of bits reduced from 5*32 = 160 to 36.

Decode: let u = the value coded by the unary part, b = the value coded by the binary part
x = 2^(u−1) + b
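Putting the two parts and the decode rule together, a γ-coder can be sketched as below (my own code with my own function names; note that for x = 1 this standard form emits the single bit 0, one bit fewer than the table above counts for that entry):

```python
from math import floor, log2

def gamma_encode(x):
    """1 + floor(log2 x) in unary, then x - 2**floor(log2 x) in floor(log2 x) bits."""
    k = floor(log2(x))
    unary = '1' * k + '0'                              # k + 1 coded in unary
    binary = format(x - 2**k, f'0{k}b') if k else ''   # k low-order bits of x
    return unary + binary

def gamma_decode(bits):
    """Decode one gamma-coded value from a bit string: x = 2**(u-1) + b."""
    u = bits.index('0') + 1              # unary part: (u-1) ones followed by a zero
    k = u - 1                            # number of binary bits that follow
    b = int(bits[u:u + k], 2) if k else 0
    return 2**(u - 1) + b, bits[u + k:]  # decoded value and any leftover bits

print(gamma_encode(9))                   # 1110001  (unary 1110, binary 001)
print(gamma_decode('1110001')[0])        # 9 = 2**(4-1) + 1
print([gamma_encode(g) for g in [1, 2, 4, 63, 180]])
```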