CS336: Intelligent Information Retrieval
Lecture 6: Compression
Compression
• Change the representation of a file so that it
– takes less space to store
– takes less time to transmit
• Text compression: the original file can be reconstructed exactly (lossless)
• Data compression: can tolerate small changes or noise (lossy)
– typically sound or images
• already a digital approximation of an analog waveform
Text Compression: Two classes
• Symbol-wise Methods
– estimate the probability of occurrence of symbols
– More accurate estimates => Greater compression
• Dictionary Methods
– represent several symbols as one codeword
• compression performance based on length of codewords
– replace words and other fragments of text with an index
to an entry in a “dictionary”
Compression: Based on text model
[Diagram: text -> encoder (uses model) -> compressed text -> decoder (uses the same model) -> text]
• Determine the output code based on the probability distribution of the model
– the number of bits used to encode a symbol should equal its information content, I(s) = -log2 Pr(s)
– Recall information theory and entropy (H)
• average amount of information per symbol over an alphabet
• H = Σ Pr(s) * I(s) = -Σ Pr(s) log2 Pr(s)
– The more probable a symbol is, the less information it provides (easier to predict)
– The less probable a symbol is, the more information it provides
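A minimal Python sketch of these two quantities, using the symbol probabilities from the Huffman example later in this lecture (function name is ours):

import math

# Symbol probabilities taken from the Huffman example later in the lecture
probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}

def information_content(p):
    # I(s) = -log2 Pr(s): bits an ideal coder spends on a symbol of probability p
    return -math.log2(p)

# H = Σ Pr(s) * I(s) over the alphabet
H = sum(p * information_content(p) for p in probs.values())
print(round(H, 2))   # about 2.55 bits/symbol; rarer symbols contribute more bits each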
Compression: Based on text model
• Symbol-wise methods (statistical)
– may assume symbols occur independently of one another
– considering context achieves the best compression rates
• e.g. the probability of seeing 'u' next is higher if we've just seen 'q'
– a good model assigns a very high probability to the symbols that actually occur
• Dictionary methods
– string representations typically based upon text seen so far
• most chars coded as part of a string that has already occurred
• Space is saved if the reference or pointer takes fewer bits than
the string it replaces
Static Statistical Models
• Static: same model regardless of text processed
– But, won’t be good for all inputs
• Recall entropy: measures information content over a collection of symbols
E = -Σ_i p_i log2 p_i
– high pr(s) => low entropy; low pr(s) => high entropy
– pr(s) = 1 => only s is possible, so no encoding is needed
– pr(s) = 0 => s will never occur, so s cannot be encoded
– What is the implication for models in which some pr(s) = 0, but s does in fact occur?
s cannot be encoded!
Therefore, in practice all symbols must be assigned pr(s) > 0
Semi-static Statistical Models
• Generate the model on the fly for the file being processed
– a first pass estimates the probabilities
– the probabilities are transmitted to the decoder before the encoded transmission
– the disadvantage is having to transmit the model first
Adaptive Statistical Models
• Model constructed from the text just encoded
– begin with general model, modify gradually as more
text is seen
• best models take into account context
– decoder can generate same model at each time step
since it has seen all characters up to that point
– advantages: effective w/out fine-tuning to particular
text and only 1 pass needed
Adaptive Statistical Models
• What about pr(0) problem?
– allow 1 extra count divided evenly amongst unseen
symbols
• recall probabilities estimated via occurrence counts
• assume 72,300 total occurrences and 5 unseen characters
• each unseen character u gets
pr(u) = pr(u | char is unseen) * pr(char is unseen)
= 1/5 * 1/72,301
• Not good for full-text retrieval
– must be decoded from beginning
• therefore not good for random access to files
Symbol-wise Methods
• Morse Code
– common symbols assigned short codes (few bits)
– rare symbols have longer codes
• Huffman coding (prefix-free code)
– no codeword is a prefix of another symbol’s codeword
Symbol   Codeword   Probability
a        0000       0.05
b        0001       0.05
c        001        0.1
d        01         0.2
e        10         0.3
f        110        0.2
g        111        0.1
Huffman coding
Symbol   Codeword   Probability
a        0000       0.05
b        0001       0.05
c        001        0.1
d        01         0.2
e        10         0.3
f        110        0.2
g        111        0.1

• What is the string for 1010110111111110?
– eefggf
• The decoder identifies 10 as the first codeword => e
• Decoding proceeds left to right on the remainder of the string
• Good when the probability distribution is static
• Best when the model is word-based
• Typically used for compressing text in IR
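A minimal Python sketch of this left-to-right, prefix-free decoding, using the codeword table above (function name is ours):

# Codeword table from the slide; no codeword is a prefix of another
CODES = {"a": "0000", "b": "0001", "c": "001", "d": "01",
         "e": "10", "f": "110", "g": "111"}
DECODE = {code: sym for sym, code in CODES.items()}

def huffman_decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:          # a complete codeword has been read
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

print(huffman_decode("1010110111111110"))   # -> eefggf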
Symbol   Codeword   Probability
a        0000       0.05
b        0001       0.05
c        001        0.1
d        01         0.2
e        10         0.3
f        110        0.2
g        111        0.1

[Diagram: Huffman decoding tree built from these probabilities; left branches labeled 0, right branches labeled 1, root probability 1.0]

Create a decoding tree bottom-up (see the sketch below)
1. Each symbol and its probability form a leaf
2. The 2 nodes with the smallest p are joined under the same parent (p = p(s1) + p(s2))
3. Repeat, ignoring children, until all nodes are connected
4. Label the branches
• left branch gets 0
• right branch gets 1
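A minimal Python sketch of this bottom-up construction. Ties between equal probabilities are broken arbitrarily, so the exact 0/1 labels (and the lengths assigned to equally probable symbols such as d and f) may differ from the tree above, but the expected code length of 2.6 bits/symbol is the same:

import heapq
import itertools

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}

counter = itertools.count()              # tie-breaker so the heap never compares node labels
heap = [(p, next(counter), sym) for sym, p in probs.items()]
heapq.heapify(heap)

codes = {sym: "" for sym in probs}
while len(heap) > 1:
    p1, _, left = heapq.heappop(heap)    # the two nodes with the smallest probabilities
    p2, _, right = heapq.heappop(heap)
    for sym in left:                     # left branch gets 0 (prepended: we build bottom-up)
        codes[sym] = "0" + codes[sym]
    for sym in right:                    # right branch gets 1
        codes[sym] = "1" + codes[sym]
    heapq.heappush(heap, (p1 + p2, next(counter), left + right))

for sym in sorted(codes):
    print(sym, codes[sym])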
Huffman coding
• Requires 2 passes over the document
1. gather statistics and build the coding table
2. encode the document
• Must explicitly store coding table with document
– eats into your space savings on short documents
• Exploits only the non-uniformity of the symbol distribution
– adaptive algorithms can recognize higher-order redundancy in strings, e.g. 0101010101....
Dictionary Methods
• Braille
– special codes are used to represent whole words
• Dictionary: a list of sub-strings with corresponding codewords
– we do this naturally, e.g. replace "december" with "12"
– a codebook for ASCII might contain
• the 128 characters and 128 common letter pairs
• at best, each character is then encoded in 4 bits rather than 7 (see the sketch below)
– can be static, semi-static, or adaptive (best)
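A rough Python sketch of such a static codebook. The letter pairs below are hypothetical placeholders (a real codebook would hold 128 of them), chosen only to show the mechanism:

PAIRS = ["th", "he", "in", "er", "an"]            # hypothetical common pairs
CODEBOOK = {chr(i): i for i in range(128)}        # the 128 ASCII characters
CODEBOOK.update({p: 128 + i for i, p in enumerate(PAIRS)})

def dict_encode(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in CODEBOOK:   # greedily prefer a pair match
            out.append(CODEBOOK[pair])
            i += 2
        else:
            out.append(CODEBOOK[text[i]])
            i += 1
    return out

print(dict_encode("the man"))   # 5 codes for 7 characters; pair codes are >= 128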
Dictionary Methods
• Ziv-Lempel
– Built on the fly
– Replace strings w/ reference to a previous occurrence
– codebook
• all words seen prior to current position
• codewords represented by “pointers”
– Compression: pointer stored in fewer bits than string
– Especially useful when resources used must be
minimized (e.g. 1 machine distributes data to many)
• Relatively easy to implement
• Decoding is fast
• Requires only small amount of memory
Ziv-Lempel
• Coded output: a series of triples <x, y, z>
– x: how far back to look in the previously decoded text (a pointer into the text) to find the next string
– y: how long the string is
– z: the next character from the input
• only necessary if the character to be coded did not occur previously
• Example: <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <1,10,a>
decodes to: a b a a b a b a a b b, then 10 b's followed by a
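A minimal Python decoder for these triples (function name is ours); the overlapping copy in <1,10,a> is what produces the run of b's:

def lz_decode(triples):
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):          # the copy may overlap its own output
            out.append(out[start + i])
        out.append(char)
    return "".join(out)

triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"),
           (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz_decode(triples))   # abaababaabb followed by ten b's and an a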
Accessing Compressed Text
• Store information identifying a document’s
location relative to the beginning of the file
– Store bit offset
• variable length codes mean docs may not begin/end on
byte boundaries
– or insist that docs begin on byte boundaries
• some docs will have wasted bits at the end
• need a method to explicitly identify the end of the coded text
Text Compression
• Key difference b/w symbol-wise & dictionary methods
– symbol-wise methods base the coding of each symbol on its context
– dictionary methods group symbols together, creating an implicit context
• Present techniques give compression of ~2 bits/character
for general English text
• In order to do better, techniques must leverage
– semantic content
– external world knowledge
• Rule of thumb: The greater the compression,
– the slower the program runs or
– the more memory it uses.
• Inverted lists contain skewed data, so text compression
methods are not appropriate.
Inverted File Compression
• Note: each inverted list can be stored as an
ascending sequence of integers
– <8; 2, 4, 9, 45, 57, 76, 78, 80>
• Replace each document number after the first with the d-gap from its predecessor (a code sketch follows below)
– <8; 2, 2, 5, 36, 12, 19, 2, 2>
– on average, gaps are much smaller than the largest document number
• Compression models describe probability
distribution of d-gap sizes
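A small Python sketch of the d-gap transformation and its inverse (function names are ours):

def to_gaps(postings):
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [2, 4, 9, 45, 57, 76, 78, 80]
print(to_gaps(postings))              # [2, 2, 5, 36, 12, 19, 2, 2]
print(from_gaps(to_gaps(postings)))   # round-trips to the original list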
Fixed-Length Compression
• Typically, an integer is stored in 4 bytes
• To compress the d-gaps for <5; 1, 3, 7, 70, 250>, which are <5; 1, 2, 4, 63, 180>:
• Use fewer bytes:
– the two leftmost bits store the number of bytes used (1-4)
– the d-gap is stored in the remaining 6, 14, 22, or 30 bits

Length                               Bytes
0 <= x < 64                          1
64 <= x < 16,384                     2
16,384 <= x < 4,194,304              3
4,194,304 <= x < 1,073,741,824       4
Example
The inverted list is: 1, 3, 7, 70, 250
After computing gaps: 1, 2, 4, 63, 180

Value   Compressed
1       00 000001
2       00 000010
4       00 000100
63      00 111111
180     01 000000 10110100

Number of bytes reduced from 5*4 = 20 to 6
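A minimal Python sketch of this length-prefixed scheme (the helper name encode_gap is ours, not from the lecture): the top two bits of the first byte store (number of bytes - 1), and the remaining 6, 14, 22 or 30 bits hold the d-gap.

def encode_gap(x):
    for nbytes, limit in enumerate((1 << 6, 1 << 14, 1 << 22, 1 << 30), start=1):
        if x < limit:
            value = ((nbytes - 1) << (8 * nbytes - 2)) | x   # 2-bit length prefix, then the gap
            return value.to_bytes(nbytes, "big")
    raise ValueError("gap too large for this scheme")

for gap in (1, 2, 4, 63, 180):
    print(gap, " ".join(f"{byte:08b}" for byte in encode_gap(gap)))
# 180 -> 01000000 10110100; the five gaps take 6 bytes instead of 5*4 = 20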
γ Code
• This method is superior to a binary encoding
• Represent the number x in two parts:
1. Unary code: specifies the number of bits needed to code x
– 1 + floor(log2 x), followed by
2. Binary code: codes x in that many bits
– floor(log2 x) bits that represent x - 2^floor(log2 x)
• Let x = 9: floor(log2 x) = 3 (2^3 = 8 <= 9 < 16 = 2^4)
part 1: 1 + 3 = 4 => 1110 (unary)
part 2: 9 - 8 = 1 => 001 (binary, 3 bits)
code: 1110 001
• Unary code: (n-1) 1-bits followed by a 0, therefore 1 is represented by 0, 2 by 10, 3 by 110, etc.
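A minimal Python sketch of the γ encoder described above (the function name is ours):

def gamma_encode(x):
    assert x >= 1
    k = x.bit_length() - 1                                # k = floor(log2 x)
    unary = "1" * k + "0"                                 # (k + 1) in unary: k one-bits, then a zero
    binary = format(x - (1 << k), f"0{k}b") if k else ""  # k binary bits for x - 2^k
    return unary + binary

print(gamma_encode(9))                                 # "1110" + "001" -> 1110001
print([gamma_encode(g) for g in (1, 2, 4, 63, 180)])   # 35 bits in total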
Example
The inverted list is: 1, 3, 7, 70, 250
After computing gaps: 1, 2, 4, 63, 180

Value   Unary part (1 + floor(log2 x))   Binary part (x - 2^floor(log2 x))
1       0                                (none)
2       10                               0
4       110                              00
63      111110                           11111
180     11111110                         0110100

Number of bits reduced from 5*32 = 160 to 35.
Decode: let u = the unary part and b = the binary part; then x = 2^(u-1) + b
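And a matching decoder sketch (function name is ours) that applies x = 2^(u-1) + b:

def gamma_decode(bits):
    values, i = [], 0
    while i < len(bits):
        u = 1
        while bits[i] == "1":                 # count the 1-bits of the unary part
            u += 1
            i += 1
        i += 1                                # skip the terminating 0
        b = int(bits[i:i + u - 1] or "0", 2)  # u - 1 binary bits (none when u = 1)
        i += u - 1
        values.append((1 << (u - 1)) + b)     # x = 2^(u-1) + b
    return values

print(gamma_decode("1110001"))                # [9]
print(gamma_decode("111111100110100"))        # [180]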
Comparison of text compression techniques

                              Arithmetic   Huffman (character)   Huffman (word)   Ziv-Lempel
Compression ratio             Very good    Poor                  Very good        Good
Compression speed             Slow         Fast                  Fast             Very fast
Decompression speed           Slow         Fast                  Very fast        Very fast
Memory space                  Low          Low                   High             Moderate
Compressed pattern matching   No           Yes                   Yes              Yes
Random access                 No           Yes                   Yes              No