Data Compression


ICS 220 – Data Structures and Algorithms
Lecture 11
Dr. Ken Cosh
Review
• Hash Functions
– Spreading data evenly through a table, avoiding collisions while keeping the hash function simple and efficient.
• Linear Probing, Quadratic Probing, General Increment Probing.
• Tombstone deletion.
This week
• Data Compression Techniques
– Compressing data can speed up many algorithms.
– Obviously storing ‘M’ or ‘F’, rather than ‘Male’ or ‘Female’, takes less space, AND less time to transfer between parts of a program.
– Let’s investigate how we can encode data efficiently.
Encoding 128
• 128 (decimal)
• 80 (hexadecimal)
• 10000000 (binary)
• CXXVIII (Roman numerals)
• ρκη (Greek numerals)
• |||||||||| … (128 in tally marks)
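The positional notations above are easy to check mechanically; a small illustrative snippet (Python is our choice here, not part of the lecture):

```python
# The same quantity, 128, in several of the notations above.
n = 128
decimal = str(n)              # '128'
hexadecimal = format(n, 'x')  # '80'
binary = format(n, 'b')       # '10000000' - eight binary digits
tally = '|' * n               # 128 tally marks
print(decimal, hexadecimal, binary, len(tally))
```

Note how the length of the representation varies with the base: the larger the symbol set, the shorter the encoding.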
Assumptions
• We have ‘n’ different symbols we can use to
encode a message.
– In Morse code n=3.
– In Binary n=2.
• All symbols mi, forming the set M, have probabilities of occurrence P(mi) such that P(m1) + … + P(mn) = 1.
– Infrequently occurring symbols can be assigned long code words, while short code words are reserved for frequent symbols.
Coding Objectives
• Each codeword corresponds to exactly one symbol.
• Decoding should not require any look-ahead.
– This is known as the ‘prefix’ property.
Prefix Property
• Symbols: A, B, C
• Codes: 1, 2, 12
• Message: 12
• Is it ‘AB’? Is it ‘C’?
Prefix Property
• Symbols: A, B, C
• Codes: 1, 22, 12
• Message: 1222
• Read in 1 – is it an A?
• Read in 2 – was it a C?
• Read in 2 – should it be AB?
• Read in 2 – ah, finally we can assume it was CB.
Prefix Property
• Symbols: A, B, C
• Codes: 11, 12, 21
• No Ambiguity here.
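The prefix property can be tested mechanically. A minimal sketch (Python, illustrative; `is_prefix_free` is our own name, not a standard function):

```python
def is_prefix_free(codes):
    """True if no codeword is a prefix of another codeword,
    i.e. decoding never needs look-ahead."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

print(is_prefix_free(["1", "2", "12"]))    # ambiguous example    -> False
print(is_prefix_free(["1", "22", "12"]))   # look-ahead needed    -> False
print(is_prefix_free(["11", "12", "21"]))  # no ambiguity here    -> True
```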
Code Optimisation
• The length of the code for one symbol should not exceed the length of the code for a less likely symbol;
– if P(mi) ≤ P(mj) then L(mi) ≥ L(mj).
• There should be no unused short codes, either as stand-alone encodings or as prefixes of longer codes.
– 01, 000, 001, 100, 101 is not ideal, as 11 is not used.
Huffman Coding
• Huffman coding is a method for choosing a representation for each symbol, resulting in a prefix-free code.
– The bit string representing one particular symbol is never a prefix of the bit string representing any other symbol.
• The most common characters are expressed using shorter strings of bits than are used for less common symbols.
Huffman Coding
• Huffman coding starts by creating a heap, based on the frequencies of the symbols to be encoded.
– The root should be 1.0 (or the total frequency) – the sum of both child nodes.
– Each symbol appears in a leaf node.
• Take the following set (with occurrence frequencies out of 120);

A    B    C    D    E    F    G    H    I
10   15   5    15   20   5    15   30   5
Creating a Heap
• There are several ways a heap can be created from the
given data, here is one option;
120
├── 55
│    ├── 25
│    │    ├── A (10)
│    │    └── 15
│    │         ├── C (5)
│    │         └── 10
│    │              ├── F (5)
│    │              └── I (5)
│    └── 30
│         ├── B (15)
│         └── D (15)
└── 65
     ├── H (30)
     └── 35
          ├── G (15)
          └── E (20)
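The construction can be sketched with a priority queue of subtrees: repeatedly merge the two least frequent, until one tree remains. An illustrative Python sketch using the lecture’s frequencies (tie-breaking may produce codes that differ from the slides, but the total encoded length comes out the same):

```python
import heapq

def huffman_codes(freqs):
    # One heap entry per symbol: (frequency, tie-break index, subtree).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, nid, (left, right)))  # ...are merged
        nid += 1
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):      # internal node: recurse both ways
            walk(node[0], prefix + "0")  # 0 for the left branch
            walk(node[1], prefix + "1")  # 1 for the right branch
        else:
            codes[node] = prefix or "0"  # leaf: record the codeword
    walk(heap[0][2])
    return codes

freqs = {"A": 10, "B": 15, "C": 5, "D": 15, "E": 20,
         "F": 5, "G": 15, "H": 30, "I": 5}
codes = huffman_codes(freqs)
```

Whatever tie-breaks are taken, every Huffman tree for these frequencies encodes the full 120 symbols in 355 bits, and the most frequent symbol, H, gets a 2-bit code.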
Huffman Coding
• Each node is then given a code based on its position in the tree – 0 for a left branch, 1 for a right branch;

A = 000    D = 011    G = 110
B = 010    E = 111    H = 10
C = 0010   F = 00110  I = 00111
Huffman Decoding
A = 000    D = 011    G = 110
B = 010    E = 111    H = 10
C = 0010   F = 00110  I = 00111

Decode these messages:
0010000010010000110111
000011011111011
01111100000110
0100011111010111000011111011
10111000011000001010111

Does this code fit our prefix requirements?
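Because the code is prefix-free, decoding needs no look-ahead: read bits until the buffer matches a codeword, emit that symbol, and start again. A small sketch (Python, illustrative) decoding the first two messages above:

```python
codes = {"A": "000", "B": "010", "C": "0010", "D": "011", "E": "111",
         "F": "00110", "G": "110", "H": "10", "I": "00111"}
table = {bits: sym for sym, bits in codes.items()}  # invert for decoding

def decode(bits):
    """Read bit by bit; since no codeword is a prefix of another,
    the first match is the only possible one - no look-ahead needed."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in table:
            out.append(table[buf])
            buf = ""
    return "".join(out)

print(decode("0010000010010000110111"))  # -> CABBAGE
print(decode("000011011111011"))         # -> ADDED
```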
Adaptive Huffman
• Huffman coding was invented back in the 1950s, and has a key assumption – that we know the expected frequency of each symbol.
• We could get these frequencies from a large corpus, such as the BNC, but then the resulting Huffman tree might not be ideal for all circumstances.
– A technical computing paper is likely to contain more non-alphabetic symbols than ‘War and Peace’.
• Adaptive Huffman is usable in this situation, as it reconstructs the tree as data is read in.
Adaptive Huffman
• Initially all symbols are unused, and so are stored in a leaf with value 0. As symbols are used they are moved up the tree.
• As symbols are used more frequently, they are compared and, if necessary, swapped with their siblings.
• Sibling comparison is easy if the heap is stored as a doubly linked list.
Doubly Linked List
[Figure: the Huffman tree from the earlier example, with its nodes threaded into a doubly linked list ordered by weight, so each node can be compared – and, if necessary, swapped – with its sibling efficiently.]
Cabbages are bad
[Figure: an Adaptive Huffman trace for the message ‘cabbages are bad’. The unused symbols (a, b, c, d, e, g, r, s and space) start together in a single leaf of weight 0; as ‘c’, then ‘a’, then ‘b’ are read, each is split out into its own leaf, its count is incremented, and nodes are transposed where a count overtakes a sibling’s.]
Adaptive Huffman
• Note that Adaptive Huffman requires just one pass through the message to encode it;
– The first time ‘b’ is transmitted it has the code 101.
– The second time ‘b’ is transmitted it has the code 0.
• As symbols rise up the tree, their codes become shorter.
• So long as the decoder uses the same formula, it can decode the same message.
Run-Length Encoding
• The word ‘cabbage’ has a repeated b, so run-length encoding might be able to shorten the code, by indicating that there are 2 b’s.
– This is similar to the ๆ symbol in Thai.
• Naive run-length encoding breaks down if there are digits in the sequence – a run count can’t be told apart from the data – so counts must be kept in a separate field or behind an escape symbol.
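A minimal run-length sketch (Python, illustrative) that sidesteps the digit problem by keeping counts in separate (character, length) pairs rather than splicing them into the text:

```python
def rle_encode(text):
    """Collapse each run of repeated characters into a (char, length) pair."""
    runs = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                   # extend the current run
        runs.append((text[i], j - i))
        i = j
    return runs

def rle_decode(runs):
    """Expand the (char, length) pairs back into the original string."""
    return "".join(ch * n for ch, n in runs)

print(rle_encode("cabbage"))
# [('c', 1), ('a', 1), ('b', 2), ('a', 1), ('g', 1), ('e', 1)]
```

Because the counts live in their own field, a string such as "aa11bb" round-trips safely, which a flattened form like "a211b2" would not.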
Homework
• Taking the following phrase;
– “Data Structures And Algorithms Analysis”
• Use the Adaptive Huffman Coding method
to generate a code for it.