Lecture9AN - School of Computer Science

LZW Compression
15-211
Fundamental Data Structures and Algorithms
Aleks Nanevski
February 10, 2004
based on a lecture by Peter Lee
Last Time…
Problem: data compression
• Convert a string into a shorter string.
  Lossless – represents exactly the same information.
  Lossy – approximates the original information.
• Uses of compression:
  Images over the web: JPEG
  Music: MP3
  General-purpose: ZIP, GZIP, JAR, …
Huffman trees
Huffman’s algorithm
• Huffman’s algorithm gives the optimal prefix code for a given set of symbol frequencies.
• For a nice online demo, see
  http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/huffman.html
Huffman compression
• Huffman trees provide a straightforward method for file compression.
1. Scan the file and compute frequencies
2. Build the code tree
3. Write code tree to the output file as a header
4. Scan input, encode, and write into the output file
Huffman decompression
• Read the header in the compressed file, and build the code tree
• Read the rest of the file, decode using the tree
• Write to output
Beating Huffman
• How about doing better than Huffman?
• Impossible! Huffman’s algorithm gives the optimal prefix code!
• Right. But who says we have to use a prefix code?
Example
• Suppose we have a file containing
  abcdabcdabcdabcdabcdabcd…abcdabcd
• This could be expressed very compactly as
  abcd^1000
Dictionary-Based Compression
Dictionary-based methods
• Here is a simple idea:
  Keep track of “words” that we have seen, and replace them with a code number when we see them again.
  The code is typically shorter than the word.
• We can maintain dictionary entries
  (word, code)
  and make additions to the dictionary as we read the input file.
Lempel & Ziv (1977/78)
Fred Hacker’s algorithm…
• Fred now knows what to do…
• Create the dictionary:
  ( <the-whole-file>, 1 )
• Transmit 1, done.
Right?
• Fred’s algorithm provides excellent compression, but…
• …the receiver does not know what is in the dictionary!
  And sending the dictionary is the same as sending the entire uncompressed file.
  Thus, we can’t decompress the “1”.
Hence…
• …we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.
LZW Compression: The Binary Version
LZW = variant of Lempel-Ziv compression, by Terry Welch (1984)
Maintaining a dictionary
• We need a way of incrementally building up a dictionary during compression in such a way that…
• …someone who wants to uncompress can “rediscover” the very same dictionary
• And we already know that a convenient way to build a dictionary incrementally is to use a trie
Binary LZW
• In this method, we build up binary tries
• In a binary trie, each internal node has two children
• In addition, we will add the following:
  each left edge is marked 0
  each right edge is marked 1
  each leaf has a label from the set {0,…,n}
A binary trie
[Figure: a binary trie; left edges are marked 0, right edges are marked 1, and the leaves carry the labels 0 through 5]
Binary LZW: Compression
1. We start with a binary trie consisting of a root node and two children:
   left child labeled 0, and right child labeled 1
2. We read the bits of the input file, and follow the trie
3. When a leaf is reached, we emit the label at the leaf
4. Then, add two new children to that leaf (converting it into an internal node)
Binary LZW: Compression, pt.2
5. The new left child takes the old label
6. The new right child takes a new label value that is one greater than the current maximum label value
7. Return to the root and continue with the next input bit
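The compression steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the input is a string of '0' and '1' characters, trie nodes are plain dictionaries, and the name binary_lzw_compress is made up for this sketch. Raising an error when the input ends in the middle of the trie is also just one possible choice.

```python
def binary_lzw_compress(bits):
    """Compress a string of '0'/'1' characters into a list of leaf labels."""
    # A leaf is {"label": k}; an internal node is {"0": child, "1": child}.
    root = {"0": {"label": 0}, "1": {"label": 1}}
    max_label = 1
    codes = []
    node = root
    for bit in bits:
        node = node[bit]                      # follow the edge marked with this bit
        if "label" in node:                   # we reached a leaf
            label = node.pop("label")
            codes.append(label)               # emit the label at the leaf
            max_label += 1
            node["0"] = {"label": label}      # new left child takes the old label
            node["1"] = {"label": max_label}  # new right child takes a new label
            node = root                       # start again from the root
    if node is not root:
        # Assumption of this sketch: treat a dangling partial match as an error.
        raise ValueError("input ended in the middle of the trie")
    return codes


print(binary_lzw_compress("10010110011"))
```

For the input 10010110011 this prints the label sequence 1, 0, 3, 4, 0, 2.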
Binary LZW: Compression example
Input: 10010110011
Initial dictionary (path:label): 0:0  1:1

Bits read   Code emitted   Leaf is split into    Output so far
1           1              10:1    11:2          1
0           0              00:0    01:3          10
01          3              010:3   011:4         103
011         4              0110:4  0111:5        1034
00          0              000:0   001:6         10340
11          2              110:2   111:7         103402
Binary LZW output
• So from the input
  10010110011
• we get output
  103402
• To represent this output we can keep track of the number of labels n each time we emit a code, and use ⌈log2(n)⌉ bits for that code
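As a rough illustration of this idea, the sketch below writes each code with ⌈log2(n)⌉ bits, where n is the number of labels that exist at the moment the code is emitted. The function name pack_codes and the parameter initial_labels are assumptions of this sketch, not something from the slides; n starts at 2 (labels 0 and 1) and grows by one after every emitted code, so a decoder can track the same widths.

```python
def pack_codes(codes, initial_labels=2):
    """Write each code with just enough bits for the labels that exist so far."""
    n = initial_labels                 # labels 0 .. n-1 exist before the first code
    chunks = []
    for code in codes:
        width = (n - 1).bit_length()   # equals ceil(log2(n)) for n >= 2
        chunks.append(format(code, "0{}b".format(width)))
        n += 1                         # each emitted code creates one new label
    return chunks


chunks = pack_codes([1, 0, 3, 4, 0, 2])
print(" ".join(chunks), "=", sum(len(c) for c in chunks), "bits")
```

For the codes 1 0 3 4 0 2 this gives 1 00 11 100 000 010, which is 14 bits, compared with 18 bits when every code is written with 3 bits.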
Binary LZW output
• We started with input
  10010110011
• Encoded it as 103402, for which, at a fixed 3 bits per code, we get the bit sequence
  001 000 011 100 000 010
• This looks like an expansion instead of a compression (18 bits out for 11 bits in)
• But what if we have a larger input, with more repeating sequences?
• Try it!
Binary LZW output
• One can also use Huffman compression on the output…
Binary LZW termination
• Note that binary LZW has a serious problem, in that the input might end while we are in the middle of the trie (instead of at a leaf node)
• This is a nasty problem, which is why we won’t use this binary method
  But this is still good for illustration purposes…
Binary LZW: Uncompress
• To uncompress, we need to read the compressed file and rebuild the same trie as we go along
• To do this, we need to maintain the trie and also the maximum label value
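A minimal sketch of the uncompress direction, again in Python and again with made-up names. Instead of an explicit trie, this version keeps a map from each leaf label to the bit path that leads to that leaf, which carries the same information for decoding purposes; that representation is a simplification chosen for the sketch.

```python
def binary_lzw_uncompress(codes):
    """Rebuild the bit string from a list of leaf labels."""
    path = {0: "0", 1: "1"}       # label -> bit path from the root to that leaf
    max_label = 1                 # the maximum label value seen so far
    out = []
    for code in codes:
        p = path[code]
        out.append(p)             # the bits the compressor read to reach this leaf
        max_label += 1
        path[code] = p + "0"      # left child takes the old label
        path[max_label] = p + "1" # right child takes the new label
    return "".join(out)


print(binary_lzw_uncompress([1, 0, 3, 4, 0, 2]))
```

Feeding it the codes 1 0 3 4 0 2 gives back 10010110011.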
Binary LZW: Uncompress example
Input: 103402
Initial dictionary (path:label): 0:0  1:1

Code read   Bits output   Leaf is split into    Output so far
1           1             10:1    11:2          1
0           0             00:0    01:3          10
3           01            010:3   011:4         1001
4           011           0110:4  0111:5        1001011
0           00            000:0   001:6         100101100
2           11            110:2   111:7         10010110011
LZW Compression: The Byte Version
Byte method
• The binary LZW method doesn’t really work; we show it for illustrative purposes
• Instead, we use a slightly more complicated version that works on bytes or characters
  We can think of each byte as a “character” in the range {0…255}
Byte method trie
• Instead of a binary trie, we use a more general trie in which
  each node can have up to n children (where n is the size of the alphabet), one for each byte/character
  every node (not just the leaves) has an integer label from the set {0…m}, for some m
  • except the root node, which has no label
Byte method LZW
• We start with a trie that contains a root and n children
  one child for each possible character
  the children are labeled 0…n-1
• We compress as before, by walking down the trie
  but, after emitting a code and growing the trie, we must start from the root’s child for c, where c is the character that caused us to grow the trie
LZW: Byte method example
• Suppose our entire character set consists only of the four letters:
  {a, b, c, d}
• Let’s consider the compression of the string
  baddad
Byte LZW: Compress example
Input: baddad
Initial dictionary: a:0 b:1 c:2 d:3

Word matched   Next character   Code emitted   Added to dictionary   Output so far
b              a                1              ba:4                  1
a              d                0              ad:5                  10
d              d                3              dd:6                  103
d              a                3              da:7                  1033
ad             (end of input)   5              (nothing)             10335

Final dictionary: a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6 da:7
Byte LZW output
• So, the input
  baddad
• compresses to
  10335
• which again can be given in bit form, just like in the binary method…
• …or compressed again using Huffman
Byte LZW: Uncompress example
• The uncompress step for byte LZW is the most complicated part of the entire process, but is largely similar to the binary method
Byte LZW: Uncompress example
Input: 10335
Initial dictionary: a:0 b:1 c:2 d:3

Code read   Word output   Added to dictionary   Output so far
1           b             (nothing)             b
0           a             ba:4                  ba
3           d             ad:5                  bad
3           d             dd:6                  badd
5           ad            da:7                  baddad
LZW Byte method: An alternative presentation
Getting off the ground
Suppose we want to compress a file
containing only letters a, b, c and d.
It seems reasonable to start with a
dictionary
a:0
b:1
c:2
d:3
At least we can then deal with the first
letter.
And the receiver knows how to start.
Growing pains
Now suppose the file starts like so:
abbabb…
We scan the a, look it up and output a 0.
After scanning the b, we have seen the
word ab. So, we add it to the dictionary
a:0
b:1
c:2
d:3
ab:4
Growing pains
We already scanned the first b.
abbabb…
Then we get another b.
We output a 1 for the first b, and add bb to
the dictionary
a:0
b:1
c:2
d:3
ab:4
bb:5
So?
Right, so far zero compression.
We already scanned the second b.
abbabb…
After scanning a, we output 1 for the b, and put ba in the dictionary
…
d:3
ab:4
bb:5
ba:6
Still zero compression.
But now…
We already scanned a.
abbabb…
We scan the next b, and ab : 4 is in the
dictionary.
We scan the next b, output 4, and put abb
into the dictionary.
…
d:3
ab:4
bb:5
ba:6
abb:7
We got compression, because 4 is shorter
than ab.
And so on
We already scanned the last b
abbabb…
Suppose the input continues
abbabbbba…
We scan the next b, and bb:5 is in the
dictionary
We scan the next b, output 5, and put bbb
into the dictionary
…
ab:4
bb:5
ba:6
abb:7
bbb:8
More Hits
As our dictionary grows, we are able to
replace longer and longer blocks by short
code numbers.
abbabbbba… ==> 0 1 1 4 5 6
And we increase the dictionary at each step
by adding another word.
Summary
We scan a sequence of symbols
a1 a2 a3 …. ak
where each prefix is in the dictionary.
We stop when we fall out of the dictionary:
a1 a2 a3 …. ak b
Summary (cont’d)
We output the code for a1 a2 a3 …. ak and
put a1 a2 a3 …. ak b into the dictionary.
Then we set
a1 = b
And start all over.
More importantly
• Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.
  Start with the same initialization, then read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.
Sort of
Let's take a closer look at an example.
Assume alphabet {a,b,c}.
The code for aabbaabb is 0 0 1 1 3 5.
The decoding starts with dictionary D:
0:a, 1:b, 2:c
Moving along
The first 4 code words are already in D.
0 0 1 1 3 5
They produce output a a b b.
As we go along, we extend D:
0:a, 1:b, 2:c, 3:aa, 4:ab, 5:bb
For the code numbers 3 5, we get aa bb, which completes the output
aabbaabb
Done
We have also added to D:
6:ba, 7:aab
But these entries are never used.
Everything is easy, since there is already an
entry in D for each code number when we
encounter it.
Is this it?
Unfortunately, no.
It may happen that we run into a code
word without having an appropriate entry
in D.
But, it can only happen in very special
circumstances, and we can manufacture
the missing entry.
A Bad Run
Consider input
a a b b b a a ==> 0 0 1 5 3
After reading 0 0 1, we output
aab
and extend D with codes for aa and ab
0:a, 1:b, 2:c, 3:aa, 4:ab
Disaster
We have read 0 0 1 from the input
0 0 1 5 3
The dictionary is
0:a, 1:b, 2:c, 3:aa, 4:ab
The next code number to read is 5, but it’s
not in D.
How could this have happened?
Can we recover?
… narrowly averted
This problem only arises when, on the compressor end:
• the input contains a substring
  …s w s w s…
• the compressor read s w, output code c for s w, and added s w s to the dictionary under the next free code number n.
• Here s is a single symbol, but w is a (possibly empty) word.
… narrowly averted (pt. 2)
On the decompressor end, D contains
c: s w
• but does not yet contain n: s w s
• the decompressor has already output
  x = s w
  and is now looking at the unknown code number n.
… narrowly averted (pt. 3)
But then the fix is to output
x + first(x)
where x is the last decompressed word, and first(x) the first symbol of x.
Because x = s w was already output, we get the required
s w s w s
We also update the dictionary to contain the new entry x + first(x) = s w s.
Example
In our example we have read 0 0 1 from the input
0 0 1 5 3
The last decompressed word is b, and the next code number to read is 5. Thus
• s = b
• w = empty
• The next word to output and add to D is s w s = bb
Summary
Let x be the last decompressed word.
Ordinarily, D contains a word y matching the input code number.
We output y and extend D with
x + first(y)
But sometimes the input code number is the very entry we are about to add.
Then it must be that x = s w, and we output
x + first(x) = s w s
Example (extended)
0 0 1 5 3 6 7 9 5 ==> aabbbaabbaaabaababb

Input   In D?   Output   Add to D
0               a
0       +       a        3:aa
1       +       b        4:ab
5       -       bb       5:bb
3       +       aa       6:bba
6       +       bba      7:aab
7       +       aab      8:bbaa
9       -       aaba     9:aaba
5       +       bb       10:aabab
Pseudo Code: Compression
Initialize dictionary D to all words of length 1.
Read all input characters:
output code numbers from D,
extend D whenever a new word appears.
New code words: just an integer counter.
Less Pseudo
initialize D;                 // all words of length 1
c = nextchar;                 // next input character
W = c;                        // a string
while( c = nextchar ) {
    if( W+c is in D ) {       // dictionary
        W = W + c;
    } else {
        output code(W); add W+c to D; W = c;
    }
}
output code(W);
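The pseudo code above translates almost line by line into Python. The sketch below is one possible rendering, not a reference implementation: the function name lzw_compress and the explicit alphabet parameter are assumptions made for illustration, and the dictionary is a plain Python dict rather than a trie. A character outside the alphabet would raise a KeyError.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW code numbers for text over the given alphabet."""
    if not text:
        return []
    D = {ch: i for i, ch in enumerate(alphabet)}   # all words of length 1
    next_code = len(D)                             # new code words: an integer counter
    codes = []
    W = text[0]
    for c in text[1:]:
        if W + c in D:
            W = W + c                # still inside the dictionary, keep scanning
        else:
            codes.append(D[W])       # output code(W)
            D[W + c] = next_code     # add W+c to D
            next_code += 1
            W = c                    # start over with the character that fell out
    codes.append(D[W])               # flush the last word
    return codes


print(lzw_compress("baddad", "abcd"))
print(lzw_compress("aabbbaa", "abc"))
```

The two calls reproduce the examples from the slides: baddad gives 1 0 3 3 5, and aabbbaa gives 0 0 1 5 3.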
Pseudo Code: Decompression
Initialize dictionary D with all words of length
1.
Read all code numbers and
- output corresponding words from D,
- extend D at each step.
This time the dictionary is of the form
( integer, word )
Keys are integers, values words.
Less Pseudo
initialize D;
pc = nextcode;     // first code number
x = word(pc);      // corresponding word
output x;
First code number is easy: it codes only a single symbol.
Remember as pc (previous code) and x (previous word).
More Less Pseudo
while ( c = nextcode ) {
    if ( c is in D ) {
        y = word(c);
        ww = x + first(y);
        insert ww in D;
        output y;
    }
    else {                    // the hard case: c is not yet in D
        y = x + first(x);
        insert y in D;
        output y;
    }
    pc = c;
    x = y;
}
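Here is a hedged Python sketch of the same decompression loop, including the hard case. The name lzw_decompress and the alphabet parameter are assumptions of this sketch. It also rejects any code that is neither in D nor equal to the next free code number, a check the pseudo code does not spell out.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the text from LZW code numbers over the given alphabet."""
    if not codes:
        return ""
    D = {i: ch for i, ch in enumerate(alphabet)}   # keys are integers, values words
    next_code = len(D)
    x = D[codes[0]]              # the first code always stands for a single symbol
    out = [x]
    for c in codes[1:]:
        if c in D:
            y = D[c]
            D[next_code] = x + y[0]      # ordinary case: add x + first(y)
        elif c == next_code:
            y = x + x[0]                 # hard case: manufacture x + first(x)
            D[next_code] = y
        else:
            raise ValueError("code %d cannot occur here" % c)
        next_code += 1
        out.append(y)
        x = y
    return "".join(out)


print(lzw_decompress([0, 0, 1, 5, 3, 6, 7, 9, 5], "abc"))
```

On the extended example 0 0 1 5 3 6 7 9 5 this prints aabbbaabbaaabaababb, hitting the hard case at the codes 5 and 9.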
One more detail…
One detail remains: how to build the
dictionary for compression (decompression
is easy).
We need to be able to scan through a
sequence of symbols and check if they form
a prefix of a word already in the dictionary.
We use tries for dictionaries.
Tries!
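[Figure: a trie whose root has children a (label 0), b (label 1), c (label 2) and d (label 3); a has a child d (label 5), b has a child a (label 4), and d has a child d (label 6)]
Corresponds to dictionary
a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6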
Tries
In the LZW situation, we can add the new
word to the trie dictionary in O(1) steps
after discovering that the string is no
longer a prefix of a dictionary word.
Just add a new leaf to the last node
touched.
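To make the O(1) insertion concrete, here is a small Python sketch of compression with a trie as the dictionary. The class TrieNode and the function lzw_compress_trie are hypothetical names chosen for this illustration. Each node keeps its children in a Python dict, so following an edge or adding a leaf is a single expected constant-time step.

```python
class TrieNode:
    def __init__(self, code):
        self.code = code         # integer label (None for the root)
        self.children = {}       # character -> TrieNode


def lzw_compress_trie(text, alphabet):
    """LZW compression where the dictionary is a trie."""
    root = TrieNode(None)
    for i, ch in enumerate(alphabet):
        root.children[ch] = TrieNode(i)
    next_code = len(alphabet)
    codes = []
    node = root
    for c in text:
        if c not in root.children:
            raise ValueError("character %r is not in the alphabet" % c)
        child = node.children.get(c)
        if child is not None:
            node = child                              # still a prefix of a known word
        else:
            codes.append(node.code)                   # emit code of the last node touched
            node.children[c] = TrieNode(next_code)    # add one new leaf: O(1)
            next_code += 1
            node = root.children[c]                   # restart at the root's child for c
    if node is not root:
        codes.append(node.code)
    return codes


print(lzw_compress_trie("baddad", "abcd"))
```

For baddad over {a, b, c, d} this again yields 1 0 3 3 5, and the trie it builds along the way contains ba:4, ad:5, dd:6 and da:7.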
LZW details
• In reality, one usually restricts the code words to be 12 or 16 bit integers.
• Hence, one may have to flush the dictionary every so often (i.e. proceed to compress the rest of the input with a freshly initialized dictionary).
• But we won’t bother with this.
LZW details
Lastly, LZW generates as output a stream
of integers.
It makes perfect sense to try to compress
these further, e.g., by Huffman.
Summary of LZW
LZW is an adaptive, dictionary-based compression method.
Encoding is easy in LZW, but uses a special data structure (trie).
Decoding is slightly complicated, but requires no special data structures.