IT-101, Section 001
Introduction to Information Technology
Lecture #6
Overview
Chapter 7: Compression
Introduction
Entropy
Huffman coding
Universal coding
Introduction
World Wide Web, not World Wide Wait.
Compression techniques can significantly reduce the bandwidth and
memory required for sending, receiving, and storing data.
Most computers are equipped with modems that compress or
decompress all information leaving or entering via the phone line.
With a mutually recognized system (e.g., WinZip), the amount of data
can be significantly reduced.
Examples of compression techniques:
Compressing BINARY DATA STREAMS
Variable length coding (e.g. Huffman coding)
Universal Coding (e.g. WinZip)
IMAGE-SPECIFIC COMPRESSION (we will see that images are well
suited for compression)
GIF and JPEG
VIDEO COMPRESSION
MPEG
Why can we compress information?
Compression is possible because information usually contains
redundancies, or information that is often repeated.
For example, two still images from a video sequence of
images are often similar. This fact can be exploited by
transmitting only the changes from one image to the next.
For example, a line of data often contains redundancies:
“Ask not what your country can do for you - ask what you can do for your country.”
File compression programs remove this redundancy.
Some characters occur more frequently than others.
It’s possible to represent frequently occurring characters with a
smaller number of bits during transmission.
This may be accomplished by a variable length code, as
opposed to a fixed length code like ASCII.
An example of a simple variable length code is Morse Code.
“E” occurs more frequently than “Z”, so we represent “E”
with a shorter code:
E = .
T = -
Z = --..
Q = --.-
Information Theory
Variable length coding exploits the fact that some pieces of
information occur more frequently than others.
The mathematical theory behind this concept is
known as: INFORMATION THEORY
Claude E. Shannon developed modern Information
Theory at Bell Labs in 1948.
He saw the relationship between the probability of
appearance of a transmitted signal and its
information content.
This realization enabled the development of
compression techniques.
A Little Probability
Shannon (and others) found that information can be related to probability.
An event has a probability of 1 (or 100%) if we believe this event will occur.
An event has a probability of 0 (or 0%) if we believe this event will not occur.
The probability that an event will occur takes on values anywhere from 0 to 1.
Consider a coin toss: heads or tails each has a probability of .50
In two tosses, the probability of tossing two heads is:
1/2 x 1/2 = 1/4 or .25
In three tosses, the probability of tossing all tails is:
1/2 x 1/2 x 1/2 = 1/8 or .125
We compute probability this way because the result of each toss is
independent of the results of other tosses.
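This multiplication rule for independent events is easy to check in
Python (a minimal sketch of the coin-toss arithmetic above):

```python
# Probabilities of independent events multiply.
p = 0.5                   # probability of heads (or tails) on one toss

print(p * p)              # two heads in two tosses   -> 0.25
print(p * p * p)          # all tails in three tosses -> 0.125
```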
Entropy
As part of information theory, Shannon developed the concept of
ENTROPY.
If the probability of a binary event is .5 (like a coin), then, on
average, you need one bit to represent the result of this event.
As the probability of a binary event increases or decreases, the
number of bits you need, on average, to represent the result
decreases.
[Figure: binary entropy curve; bits (y-axis) vs. probability of an
event (x-axis)]
The figure expresses that unless an event is totally random, you can
convey the information of the event in fewer bits, on average, than
it might first appear.
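The curve in the figure is the binary entropy function. Its formula
does not appear on the slide, but it is the standard one, and a small
Python sketch reproduces the behavior just described:

```python
import math

def binary_entropy(p):
    """Average number of bits needed to represent the outcome of a
    binary event that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0   -> a fair coin needs a full bit
print(binary_entropy(0.8))  # ~0.72 -> a biased event needs fewer bits
```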
Let’s do an example...
Example from the text...
A MEN’S SPECIALTY STORE
The probability of male patrons is .8
The probability of female patrons is .2
Assume for this example that patrons enter the store in groups of two.
Calculate the probabilities of different pairings:
Event A, Male-Male. P(MM) = .8 x .8 = .64
Event B, Male-Female. P(MF) = .8 x .2 = .16
Event C, Female-Male. P(FM) = .2 x .8 = .16
Event D, Female-Female. P(FF) = .2 x .2 = .04
We could assign the longest codes to the most infrequent
events while maintaining unique decodability.
Example (cont.)
Let’s assign a unique string of bits to each event based on the
probability of that event occurring.

Event   Name             Code
A       Male-Male        0
B       Male-Female      10
C       Female-Male      110
D       Female-Female    111

Given a received code of 01010110100, determine the events:

0    10   10   110   10   0
A    B    B    C     B    A
MM   MF   MF   FM    MF   MM

The above example has used a variable length code.
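Decoding works because no code word is a prefix of any other, so a
left-to-right scan can match code words unambiguously. A minimal
Python sketch of this (an illustration, not the text's procedure):

```python
# Code table for the store example above.
CODES = {"0": "MM", "10": "MF", "110": "FM", "111": "FF"}

def decode_prefix(bits, codes=CODES):
    """Decode a bit string coded with a prefix code: extend the
    current chunk one bit at a time until it matches a code word."""
    events, chunk = [], ""
    for bit in bits:
        chunk += bit
        if chunk in codes:
            events.append(codes[chunk])
            chunk = ""
    return events

print(decode_prefix("01010110100"))
# ['MM', 'MF', 'MF', 'FM', 'MF', 'MM'], matching the slide
```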
Variable Length Coding
Takes advantage of the probabilistic nature of information.
Unlike fixed length codes like ASCII, variable length codes:
Assign the longest codes to the most infrequent events.
Assign the shortest codes to the most frequent events.
Each code word must be uniquely identifiable regardless of
length.
Examples of Variable Length Coding
Morse Code
Huffman Coding
If we have total uncertainty about the information
we are conveying, fixed length codes are
preferred.
Morse Code
Characters represented by patterns of dots and dashes.
More frequently used letters use short code symbols.
Short pauses are used to separate the letters.
Represent “Hello” using Morse Code:
H = ....
E = .
L = .-..
L = .-..
O = ---
Hello = .... . .-.. .-.. ---
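A tiny Python sketch of this lookup, covering only the letters on
this slide (a full table would hold all 26 letters):

```python
# Morse code for just the letters used on this slide.
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}

def to_morse(word):
    # A short pause (a space here) separates the letters.
    return " ".join(MORSE[letter] for letter in word.upper())

print(to_morse("Hello"))  # .... . .-.. .-.. ---
```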
Huffman Coding
The Huffman coding procedure finds the optimum, uniquely decodable, variable
length code associated with a set of events, given their probabilities of occurrence.
Creates a binary code tree:
Nodes connected by branches, ending in leaves
Top node: the root
Two branches from each node
[Figure: binary code tree; left branches labeled 0 and right branches
labeled 1, giving leaves A = 0, B = 10, C = 110, D = 111]
Huffman Coding
Given the Huffman code tree above (A = 0, B = 10, C = 110, D = 111),
decode the following sequence: 11010001110

110   10   0   0   111   0
C     B    A   A   D     A
Huffman Code Construction
First list all events in descending order of probability:

Event A   .3
Event B   .3
Event C   .13
Event D   .12
Event E   .1
Event F   .05

Pair the two events with the lowest probabilities and add their
probabilities: Event E (.1) + Event F (.05) = 0.15.
Huffman Code Construction
Repeat for the pair with the next lowest probabilities:
Event C (.13) + Event D (.12) = 0.25.
Huffman Code Construction
Repeat for the pair with the next lowest probabilities:
{C, D} (0.25) + {E, F} (0.15) = 0.4.
Huffman Code Construction
Repeat for the pair with the next lowest probabilities:
Event A (.3) + Event B (.3) = 0.6.
Huffman Code Construction
Repeat for the last pair (0.6 + 0.4 = 1.0) and add 0s to the left
branches and 1s to the right branches.

[Figure: completed Huffman tree; reading the branch labels from the
root down to each leaf gives that event's code]

Event A   .3    00
Event B   .3    01
Event C   .13   100
Event D   .12   101
Event E   .1    110
Event F   .05   111
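The same bottom-up pairing can be written compactly with a priority
queue. Here is a sketch; when probabilities tie, the pairing order
(and therefore the exact bit patterns) can differ from the slides,
but the code-word lengths, and so the average length, come out the
same:

```python
import heapq

def huffman(probabilities):
    """Build a Huffman code by repeatedly merging the two
    lowest-probability nodes, as on the preceding slides."""
    # Heap entries: (probability, tie-breaker, {event: code so far}).
    heap = [(p, i, {event: ""})
            for i, (event, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # lowest probability
        p2, _, right = heapq.heappop(heap)   # next lowest
        # Prepend 0 on one branch and 1 on the other.
        merged = {e: "0" + c for e, c in left.items()}
        merged.update({e: "1" + c for e, c in right.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": .3, "B": .3, "C": .13, "D": .12, "E": .1, "F": .05}
print(huffman(probs))
# A and B get 2-bit codes; C, D, E, F get 3-bit codes, matching the
# lengths on the slide.
```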
Exercise
Given the code we just constructed:
Event A: 00
Event B: 01
Event C: 100
Event D: 101
Event E: 110
Event F: 111
How can you decode the string: 0000111010110001000000111?
Starting from the leftmost bit, find the shortest bit pattern that matches
one of the codes in the list. The first bit is 0, but we don’t have an
event represented by 0. We do have one represented by 00, which is
event A. Continue applying this procedure:
00  00  111  01  01  100  01  00  00  00  111
A   A   F    B   B   C    B   A   A   A   F
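A few self-contained lines confirm the answer with the same
shortest-match scan used earlier (variable names are mine):

```python
HUFFMAN_CODES = {"00": "A", "01": "B", "100": "C",
                 "101": "D", "110": "E", "111": "F"}

events, chunk = [], ""
for bit in "0000111010110001000000111":
    chunk += bit
    if chunk in HUFFMAN_CODES:       # matched a complete code word
        events.append(HUFFMAN_CODES[chunk])
        chunk = ""
print(events)
# ['A', 'A', 'F', 'B', 'B', 'C', 'B', 'A', 'A', 'A', 'F']
```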
Universal Coding
Huffman coding has its limits:
You must know a priori the probability of the characters or symbols
you are encoding.
What if a document is “one of a kind?”
Universal Coding schemes do not require knowledge of the statistics
of the events to be coded.
Universal Coding is based on the realization that any stream of
data consists of some repetition.
Lempel-Ziv coding is one form of Universal Coding presented in the
text.
Compression results from reusing frequently occurring strings.
Works better for long data streams. Inefficient for short strings.
Used by WinZip to compress information.
Lempel-Ziv Coding
The basis for Lempel-Ziv coding is the idea that we can achieve
compression of a string by always coding a series of zeroes and ones
as some previous string (a prefix string) plus one new bit.
Compression results from reusing frequently occurring strings.
We will not go through Lempel-Ziv coding in detail, but a rough
sketch of the idea appears below.
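Since the slides skip the details, the following is only a rough
LZ78-style sketch of the idea just stated, coding each phrase as a
previously seen prefix string plus one new bit; it illustrates the
principle rather than the textbook's exact procedure:

```python
def lz_parse(bits):
    """LZ78-style parse: emit (index of the longest previously seen
    prefix string, one new bit) for each new phrase."""
    phrases = {"": 0}         # phrase -> index; 0 is the empty prefix
    output, current = [], ""
    for bit in bits:
        if current + bit in phrases:
            current += bit    # keep extending a known prefix string
        else:
            output.append((phrases[current], bit))
            phrases[current + bit] = len(phrases)
            current = ""
    if current:               # trailing bits that match a known phrase
        output.append((phrases[current[:-1]], current[-1]))
    return output

print(lz_parse("1011010100010"))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
# Repeated substrings become short references to earlier phrases.
```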