Transcript: Chapter 9
Data Compression
Why do we care?
Secondary storage capacity doubles every year
However, disk space fills up quickly on every computer system
More data to compress than ever before
What’s the difference between compression for .mp3 files and
compression for .zip files? Between .gif and .jpg?
Must we exactly reconstruct the data?
Lossy methods
• Generally fine for pictures, video, and audio (JPEG, MPEG, etc.)
Lossless methods
• Run-length encoding
11 3 5 3 2 6 2 6 5 3 5 3 5 3 10
• Text compression
Is it possible to compress (lossless compression rather than lossy)
every file? Every file of a given size?
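Run-length encoding can be sketched in a few lines; this is an illustrative version, not the course's code. It replaces each run of repeated values with a (value, run length) pair, which wins exactly when long runs dominate:

```cpp
#include <string>
#include <utility>
#include <vector>

// Run-length encoding sketch: lossless, replaces each run of repeated
// characters with a (character, run length) pair.
std::vector<std::pair<char, int>> rle(const std::string& s) {
    std::vector<std::pair<char, int>> runs;
    for (char c : s) {
        if (!runs.empty() && runs.back().first == c)
            runs.back().second++;      // extend the current run
        else
            runs.push_back({c, 1});    // start a new run
    }
    return runs;
}
```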
CPS 100
9.1
Priority Queue
Compression motivates the study of the ADT priority queue
Supports two basic operations
• insert – an element into the priority queue
• delete – the minimal element from the priority queue
Implementations may allow getmin separate from delete
• Analogous to top/pop, front/dequeue in stacks, queues
Simple sorting using a priority queue (see pqdemo.cpp and usepq.cpp)

    string s; priority_queue pq;
    while (cin >> s) pq.insert(s);
    while (pq.size() > 0) {
        pq.deletemin(s);
        cout << s << endl;
    }
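The course's priority_queue class isn't standard C++; as a hedged sketch, the same sorting idea can be written with the standard library's std::priority_queue (a max-heap by default, so std::greater makes it min-first):

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Sort strings by pushing them all into a min-first priority queue and
// then popping: each pop yields the smallest remaining element.
std::vector<std::string> pqSort(const std::vector<std::string>& words) {
    std::priority_queue<std::string, std::vector<std::string>,
                        std::greater<std::string>>
        pq(words.begin(), words.end());
    std::vector<std::string> sorted;
    while (!pq.empty()) {
        sorted.push_back(pq.top());  // smallest remaining string
        pq.pop();
    }
    return sorted;
}
```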
Priority Queue implementations
Implementing priority queues: average and worst case
                           Insert   Getmin   DeleteMin
    Unsorted vector        O(…)     O(…)     O(…)
    Sorted vector          O(…)     O(…)     O(…)
    Linked list (sorted?)  O(…)     O(…)     O(…)
    Search tree            O(…)     O(…)     O(…)
    Balanced tree          O(…)     O(…)     O(…)
    Heap                   O(…)     O(…)     O(…)
Quick look at class tpq<…>
Templated class like tstack, tqueue, tvector, tmap, …
If deletemin is supported, what properties must types put
into tpq have, e.g., can we insert string? double? struct?
Can we change what minimal means (think about anaword
and sorting)?
If we use a compare function object for comparing entries we
can make a min-heap act like a max-heap, see pqdemo.cpp
Notice that RevComp inherits from Comparer<Kind>
How is Comparer accessed?
How is this as a sorting method? Consider a vector of elements.
In practice heapsort uses the vector as the priority queue
From a big-Oh perspective no difference: O(n log n)
• Is there a difference? What’s hidden with O notation?
Priority Queue implementation
The class in tpq.h uses heaps, very fast and reasonably simple
Why not use inheritance hierarchy as was used with tmap?
Trade-offs when using HMap and BSTMap:
• Time, space
• Ordering properties
Mechanism for changing comparisons used for priority
Different from comparison used in sortall functions
(anaword)
• Functions are different from classes when templates used
• Functions instantiated when called, object/class instantiated
when object constructed
The tpq mechanism uses inheritance, sorting doesn’t
• In theory we could have template function in non-templated
class, but g++ doesn’t support template member functions
Creating Heaps
Heap is an array-based implementation of a binary tree used
for implementing priority queues, supports:
insert, findmin, deletemin: complexities?
Using array minimizes storage (no explicit pointers), faster too
--- children are located by index/position in array
Heap is a binary tree with a shape property and a heap/value property
shape: tree filled at all levels (except perhaps the last) and
filled left-to-right (a complete binary tree)
value: each node has a value smaller than (or equal to) both children's
Array-based heap
store “node values” in array
beginning at index 1
for node with index k
left child: index 2*k
right child: index 2*k+1
why is this conducive for
maintaining heap shape?
what about heap property?
is the heap a search tree?
where is minimal node?
where are nodes added?
deleted?
[Figure: the array 6 10 7 17 13 9 21 19 25 stored at indices 1 through 9, drawn as the corresponding binary tree with 6 at the root]
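The index arithmetic above can be written down directly. A small sketch, assuming 1-based storage with index 0 unused (as on the slide); the function names are illustrative:

```cpp
#include <vector>

// 1-based heap index arithmetic: for a node at index k, children live at
// 2k and 2k+1, and the parent at k/2 (integer division).
int parent(int k)     { return k / 2; }
int leftChild(int k)  { return 2 * k; }
int rightChild(int k) { return 2 * k + 1; }

// Verify the heap/value property: every node's value is <= both children.
// Index 0 of the vector is unused so the arithmetic above works directly.
bool isMinHeap(const std::vector<int>& h) {
    int n = (int)h.size() - 1;  // number of real elements
    for (int k = 1; k <= n; k++) {
        if (leftChild(k) <= n && h[k] > h[leftChild(k)]) return false;
        if (rightChild(k) <= n && h[k] > h[rightChild(k)]) return false;
    }
    return true;
}
```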
Adding values to heap
to maintain heap shape, must add the new value in left-to-right
order of the last level
this could violate the heap property (e.g., insert 8 into the heap above)
move the value "up" if it is too small
• change places with parent if the heap property is violated
• stop when the parent is smaller
• stop when the root is reached
pull the parent down instead; swapping isn't necessary (optimization)
[Figure: three snapshots of the heap as 8 bubbles up from the last position to its final place]
Adding values, details
    void pqueue::insert(int elt)
    {
        // add elt to heap in myList
        myList.push_back(elt);
        int loc = myList.size();
        while (1 < loc && elt < myList[loc/2])
        {
            myList[loc] = myList[loc/2];
            loc /= 2;   // go to parent
        }
        // what's true here?
        myList[loc] = elt;
    }

[Figure: the heap and its tvector myList, 6 10 7 17 13 9 21 19 25 at indices 1 through 9, before and after inserting 8]
Removing minimal element
Where is minimal element?
If we remove it, what
changes, shape/property?
How can we maintain shape?
“last” element moves to root
What property is violated?
After moving last element,
subtrees of root are heaps, why?
Move root down (pull child
up) does it matter where?
When can we stop “re-heaping”?
[Figure: snapshots of deleteMin, the last element (25) moves to the root, then sifts down, swapping with the smaller child at each level until the heap property is restored]
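A minimal deleteMin sketch for an int min-heap, following the slide's description (assumptions: 1-based vector storage with index 0 unused; this is illustrative, not the course's tpq code):

```cpp
#include <utility>
#include <vector>

// Remove the minimal element from a 1-based min-heap stored in a vector:
// the last element moves to the root, then sifts down, always swapping
// with the smaller child until the heap property is restored.
int deleteMin(std::vector<int>& h) {
    int n = (int)h.size() - 1;  // current number of elements
    int minVal = h[1];          // root holds the minimum
    h[1] = h[n];                // last element moves to root
    h.pop_back();
    n--;
    int loc = 1;
    while (2 * loc <= n) {
        int child = 2 * loc;    // left child
        if (child + 1 <= n && h[child + 1] < h[child])
            child++;            // pick the smaller child
        if (h[loc] <= h[child]) break;  // heap property restored
        std::swap(h[loc], h[child]);
        loc = child;
    }
    return minVal;
}
```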
Trie: efficient search of words/suffixes
A trie (from retrieval, but pronounced "try") supports
• Insert/delete a string
• Lookup of a string or string prefix
These operations are O(size of string) regardless of
how many strings are stored in the trie!
In some ways a trie is like a 128-way (or 26-way, or alphabet-size) tree,
with one branch/edge for each character/letter
Node stores branches to other nodes
Node stores whether it ends the string from root to it
Extremely useful in DNA/string processing
monkeys and typewriter simulation: similar to statistical
methods used in Natural Language understanding
Trie picture and code (see trie.cpp)
To add string
Start at root, for each char
create node as needed, go
down tree, mark last node
To find string
Start at root, follow links
If a link is null/0, the string is not contained
Check the word flag in the final node
To print all nodes
Visit every node, build
string as nodes traversed
What about union and
intersection?
[Figure: example trie; marked nodes indicate a word ends there]
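A minimal trie sketch along these lines, assuming a lowercase a-z alphabet (the course's trie.cpp differs in details; names here are illustrative):

```cpp
#include <memory>
#include <string>

// Minimal trie over lowercase a-z: each node has one branch per letter
// and a flag marking whether a word ends here.
struct TrieNode {
    std::unique_ptr<TrieNode> child[26];
    bool isWord = false;
};

// Walk down from the root, creating nodes as needed; mark the last node.
void insert(TrieNode* root, const std::string& word) {
    TrieNode* node = root;
    for (char c : word) {
        int i = c - 'a';
        if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    node->isWord = true;
}

// Follow links; a null branch means the word is absent. Both operations
// are O(length of word), independent of how many words are stored.
bool contains(const TrieNode* root, const std::string& word) {
    const TrieNode* node = root;
    for (char c : word) {
        int i = c - 'a';
        if (!node->child[i]) return false;
        node = node->child[i].get();
    }
    return node->isWord;
}
```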
Text Compression
Input: string S
Output: string S′ that is
shorter than S, and from which
S can be exactly reconstructed
Text Compression: Examples
Encodings
ASCII: 8 bits/character
Unicode: 16 bits/character

    Symbol   ASCII      Fixed length   Var. length
    a        01100001   000            000
    b        01100010   001            11
    c        01100011   010            01
    d        01100100   011            001
    e        01100101   100            10

"abcd" in the different formats
ASCII: "01100001011000100110001101100100"
Fixed: "000001010011"
Var:   "0001101001"

[Figure: two binary trees realizing the variable-length prefix code, left branches labeled 0, right branches labeled 1]
Huffman Coding
D. A. Huffman, early 1950s
Before compressing data, analyze the input stream
Represent data using variable-length codes
Variable-length codes that are prefix codes
Each letter is assigned a codeword
The codeword for a given letter is produced by traversing
the Huffman tree
Property: no codeword produced is the prefix of another
Letters appearing frequently have short codewords, while
those that appear rarely have longer ones
Huffman Tree 2
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL
NUMBER OF BITS”
E.g. “ A SIMPLE” → “10101101001000101001110011100000”
[Figure: a sequence of slides animates encoding “ A SIMPLE” by walking the Huffman tree, one codeword per character]
Building a tree
- Initial case:
Every character is a leaf/tree carrying its respective
character count: “the forest” of n trees
(n is the size of your alphabet)
- Base case:
there is only one tree in the forest
- Reduction:
Take the two trees with the smallest counts
and combine them into one tree whose count is
equal to the sum of the two subtrees’ counts
now n-1 trees in our forest
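The reduction can be sketched with std::priority_queue. Tracking only counts is enough to compute the total number of coded bits, since each merge of weight w adds w bits to the final encoding (the function name is illustrative, not from the course code):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Greedy reduction from the slide, tracking only counts: repeatedly pull
// the two smallest trees from a priority queue and push their combined
// count. The running sum of combined counts equals the total number of
// bits the final Huffman code uses for the whole input.
long long huffmanCost(const std::vector<long long>& counts) {
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>>
        forest(counts.begin(), counts.end());
    long long totalBits = 0;
    while (forest.size() > 1) {
        long long a = forest.top(); forest.pop();  // smallest tree
        long long b = forest.top(); forest.pop();  // next smallest
        totalBits += a + b;                        // cost of this merge
        forest.push(a + b);                        // combined tree
    }
    return totalBits;
}
```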
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
Initial forest of single-node trees, one per character, with counts:
    space:11  I:6  E:5  N:5  S:4  M:4  A:3  B:3  O:3  T:3
    G:2  D:2  L:2  R:2  U:2  P:1  F:1  C:1
Building a tree (continued)
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
[Figure: successive slides animate the reduction on the forest above. At each step the two smallest trees are merged, creating combined trees of weight 2, 3, 4, 4, 5, 6, 6, 8, 8, 10, 11, 12, 16, 21, 23, 37, and finally 60, at which point a single Huffman tree of total weight 60 remains]
Encoding
1. Count occurrences of the various characters in the string   O(…)
2. Build priority queue                                        O(…)
3. Build Huffman tree                                          O(…)
4. Write Huffman tree and coded data to file                   O(…)
Properties of Huffman coding
Want to minimize the weighted path length L(T) of tree T:
    L(T) = Σ_{i ∈ Leaf(T)} d_i · w_i
w_i is the weight or count of each codeword i
d_i is the depth of the leaf corresponding to codeword i
Huffman coding creates pretty full, bushy trees?
When would it produce a “bad” tree?
How do we produce coded compressed data from input
efficiently?
Writing code out to file
How do we go from characters to codewords?
Build a table as we build our tree
Keep links to leaf nodes and trace up the tree
Need way of writing bits out to file
Platform dependent?
UNIX read and write
See bitops.h
obstream and ibstream
Write bits from ints
How can we differentiate between compressed files and random
data from some other file?
Store a magic number
Decoding a message
01100000100001001101
[Figure: a sequence of slides traces these bits through the Huffman tree built above. Each 0 follows the left branch and each 1 the right; whenever a leaf is reached its letter is emitted and decoding restarts at the root. The decoded prefix grows G, GO, GOO, GOOD as the bits are consumed]
Decoding
1. Read in tree data                 O(…)
2. Decode bit string with tree       O(…)
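A decoding sketch along the lines of step 2, using a hypothetical minimal node type (the course's tree class differs):

```cpp
#include <string>

// Hypothetical minimal Huffman tree node -- internal nodes have two
// children, leaves carry a character.
struct Node {
    char ch = 0;            // meaningful only at leaves
    Node* left = nullptr;
    Node* right = nullptr;
};

// Walk the tree bit by bit ('0' = left, '1' = right); on reaching a
// leaf, emit its character and restart at the root.
std::string decode(const Node* root, const std::string& bits) {
    std::string out;
    const Node* node = root;
    for (char b : bits) {
        node = (b == '0') ? node->left : node->right;
        if (!node->left && !node->right) {  // leaf: one symbol decoded
            out += node->ch;
            node = root;                    // restart for next codeword
        }
    }
    return out;
}
```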
Other methods
Adaptive Huffman coding
Lempel-Ziv algorithms
Build the coding table on the fly while reading document
Coding table changes dynamically
Cool protocol between encoder and decoder so that
everyone is always using the right coding scheme
Works darn well (compress, gzip, etc.)
More complicated methods
Burrows-Wheeler (bzip2)
PPM statistical methods
Questions
How about ternary Huffman trees?
How would that affect the algorithm?
How about n-ary trees?
What would we gain?
Are Huffman trees optimal?
What does that mean? (Hint: L(T))
How can that be proven? (Hint: Induction will be your
friend again)
Sorting: From Theory to Practice
Why do we study sorting?
Because we have to
Because sorting is beautiful
Because … and …
There are n sorting algorithms, how many should we study?
O(n), O(log n), …
Why do we study more than one algorithm?
•
•
Which sorting algorithm is best?
Sorting out sorts (see also sortall.cpp)
Simple, O(n²) sorts --- for sorting n elements
Selection sort --- n² comparisons, n swaps, easy to code
Insertion sort --- n² comparisons, n² moves, stable, fast
Bubble sort --- n² everything, slow, slower, and ugly
Divide and conquer faster sorts: O(n log n) for n elements
Quick sort: fast in practice, O(n2) worst case
Merge sort: good worst case, great for linked lists, uses
extra storage for vectors/arrays
Other sorts:
Heap sort, basically priority queue sorting
Radix sort: doesn’t compare keys, uses digits/characters
Shell sort: quasi-insertion, fast in practice, non-recursive
Selection sort
Simple to code O(n²) sort: n² comparisons, n swaps
    void selectSort(tvector<string>& a)
    {
        int k;
        for(k=0; k < a.size(); k++)
        {
            int minIndex = findMin(a,k,a.size());
            swap(a[k],a[minIndex]);
        }
    }
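The helper findMin is assumed by the slide's selectSort but not shown there; a plausible sketch, using std::vector in place of the course's tvector:

```cpp
#include <string>
#include <vector>

// Return the index of the smallest element in a[first..last-1],
// as selectSort's findMin(a, k, a.size()) call expects.
int findMin(const std::vector<std::string>& a, int first, int last) {
    int minIndex = first;
    for (int k = first + 1; k < last; k++) {
        if (a[k] < a[minIndex]) minIndex = k;  // new smallest so far
    }
    return minIndex;
}
```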
# comparisons: Σ_{k=1}^{n} k = 1 + 2 + … + n = n(n+1)/2 = O(n²)
Swaps?
Invariant: [diagram -- prefix a[0..k-1] sorted and in final position, won’t move; “?????” beyond]
Insertion Sort
Stable sort, O(n²), good on nearly sorted vectors
Stable sorts maintain order of equal keys
Good for sorting on two criteria: name, then age
    void insertSort(tvector<string>& a)
    {
        int k, loc; string elt;
        for(k=1; k < a.size(); k++)
        {
            elt = a[k];
            loc = k;
            // shift until spot for elt is found
            while (0 < loc && elt < a[loc-1])
            {
                a[loc] = a[loc-1];   // shift right
                loc = loc-1;
            }
            a[loc] = elt;
        }
    }
Invariant: [diagram -- prefix a[0..k-1] sorted relative to each other; “?????” beyond]
Bubble sort
For completeness you should know about this sort
Few (if any) redeeming features. Really slow, really, really
Can code to recognize already sorted vector (see insertion)
• Not worth it for bubble sort, much slower than insertion
    void bubbleSort(tvector<string>& a)
    {
        int j,k;
        for(j=a.size()-1; j >= 0; j--)
        {
            for(k=0; k < j; k++)
            {
                if (a[k] > a[k+1])
                    swap(a[k],a[k+1]);
            }
        }
    }

“bubble” elements down the vector/array
Invariant: [diagram -- suffix a[j+1..] sorted and in final position; “?????” before it]
Summary of simple sorts
Selection sort has n swaps, good for “heavy” data
moving objects with lots of state, e.g., …
• A string isn’t heavy, why? (pointer and pointee)
• What happens in Java?
• Wrap heavy items in “smart pointer proxy”
Insertion sort is good on nearly sorted data, it’s stable, it’s fast
Also foundation for Shell sort, very fast non-recursive
More complicated to code, but relatively simple, and fast
Bubble sort is a travesty
Can be parallelized, but on one machine don’t go near it
Quicksort: fast in practice
Invented in 1962 by C.A.R. Hoare, didn’t understand recursion
Worst case is O(n²), but avoidable in nearly all cases
In 1997 Introsort published (Musser, introspective sort)
• Like quicksort in practice, but recognizes when it will be bad
and changes to heapsort
    void quick(tvector<string>& a, int left, int right)
    {
        if (left < right)
        {
            int pivot = partition(a,left,right);
            quick(a,left,pivot-1);
            quick(a,pivot+1, right);
        }
    }
Recurrence?
[Diagram: the partitioned array -- elements <= X, then X at the pivot index, then elements > X]
Partition code for quicksort
[Diagram: what we want -- a[left..pIndex] <= pivot, a[pIndex+1..right] > pivot;
 what we have -- a[left..right] unprocessed (??????????????);
 loop invariant -- a region <= pivot, a region > pivot, and “???” beyond index k]

    int partition(tvector<string>& a,
                  int left, int right)
    {
        string pivot = a[left];
        int k, pIndex = left;
        for(k=left+1; k <= right; k++)
        {
            if (a[k] <= pivot)
            {
                pIndex++;
                swap(a[k],a[pIndex]);
            }
        }
        swap(a[left], a[pIndex]);
        return pIndex;
    }

Easy to develop partition using a loop invariant:
a statement true each time the loop test is evaluated,
used to verify correctness of the loop
Can swap a middle element into a[left] before the loop
so nearly sorted data is still ok
Analysis of Quicksort
Average case and worst case analysis
Recurrence for worst case: T(n) =
What about average?
Reason informally:
Two calls vector size n/2
Four calls vector size n/4
… How many calls? Work done on each call?
Partition: typically find middle of left, middle, right, swap, go
Avoid bad performance on nearly sorted data
In practice: remove some (all?) recursion, avoid lots of “clones”
Tail recursion elimination
If the last statement is a recursive call, recursion can be replaced
with iteration
Call cannot be part of an expression
Some compilers do this automatically
    void foo(int n)
    {
        if (0 < n)
        {   cout << n << endl;
            foo(n-1);
        }
    }

    void foo2(int n)
    {
        while (0 < n)
        {   cout << n << endl;
            n = n-1;
        }
    }
What if cout << and recursive call switched?
What about recursive factorial?
Merge sort: worst case O(n log n)
Divide and conquer --- recursive sort
Divide list/vector into two halves
• Sort each half
• Merge sorted halves together
What is complexity of merging two sorted lists?
What is recurrence relation for merge sort as described?
T(n) =
What is advantage of vector over linked-list for merge sort?
What about merging, advantage of linked list?
Vector requires auxiliary storage (or very fancy coding)
Merge sort: lists or vectors
Mergesort for vectors
    void mergesort(tvector<string>& a, int left, int right)
    {
        if (left < right)
        {
            int mid = (right+left)/2;
            mergesort(a, left, mid);
            mergesort(a, mid+1, right);
            merge(a,left,mid,right);
        }
    }
What’s different when linked lists used?
Do differences affect complexity? Why?
How does merge work?
Mergesort continued
Vector code for merge isn’t pretty, but it’s not hard
Mergesort itself is elegant
    void merge(tvector<string>& a,
               int left, int middle, int right)
    // pre:  left <= middle <= right,
    //       a[left] <= … <= a[middle],
    //       a[middle+1] <= … <= a[right]
    // post: a[left] <= … <= a[right]
Why is this prototype potentially simpler for linked lists?
What will prototype be? What is complexity?
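The merge body itself might look like this, using std::vector in place of the course's tvector (a sketch, not the Tapestry code):

```cpp
#include <string>
#include <vector>

// Combine two adjacent sorted runs a[left..middle] and a[middle+1..right]
// into one sorted run, using an auxiliary vector, then copy back.
// O(right - left + 1) time, O(right - left + 1) extra storage.
void merge(std::vector<std::string>& a, int left, int middle, int right) {
    std::vector<std::string> aux;
    aux.reserve(right - left + 1);
    int i = left, j = middle + 1;
    while (i <= middle && j <= right)            // take smaller front element
        aux.push_back(a[i] <= a[j] ? a[i++] : a[j++]);
    while (i <= middle) aux.push_back(a[i++]);   // drain left leftovers
    while (j <= right)  aux.push_back(a[j++]);   // drain right leftovers
    for (int k = 0; k < (int)aux.size(); k++)    // copy back into place
        a[left + k] = aux[k];
}
```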
Summary of O(n log n) sorts
Quicksort is relatively straightforward to code, very fast
Worst case is very unlikely, but possible, therefore …
But, if lots of elements are equal, performance will be bad
• One million integers from range 0 to 10,000
• How can we change partition to handle this?
Merge sort is stable, it’s fast, good for linked lists, harder to code?
Worst case performance is O(n log n), compare quicksort
Extra storage for array/vector
Heapsort, more complex to code, good worst case, not stable
Basically heap-based priority queue in a vector
Sorting in practice
Rarely will you need to roll your own sort, but when you do …
What are key issues?
If you use a library sort, you need to understand the interface
In C++ we have STL and sortall.cpp in Tapestry
• STL has sort, and stable_sort
• Tapestry has lots of sorts, Quicksort is fast in practice
In C the generic sort is complex to use because arrays are ugly
• See csort.cpp
In Java guarantees and worst-case are important
• Why won’t quicksort be used?
Function objects permit sorting criteria to change simply
In practice: templated sort functions
Function templates permit us to write once, use several times
for several different types of vector
Template function “stamps out” real function
Maintenance is saved, code still large (why?)
What properties must hold for vector elements?
Comparable using < operator
Elements can be assigned to each other
Template functions capture property requirements in code
Part of generic programming
Some languages support this better than others (not Java)
Function object concept
To encapsulate comparison (like operator <) in a parameter
Need convention for parameter : name and behavior
Enforceable by templates or by inheritance (or both)
• Sorts don’t use inheritance, tpqueue<..> does
Name convention: class/object has a method named compare
Two parameters, the (vector) elements being compared
See comparer.h, used in sortall.h and in tpq.h
Behavior convention: compare returns an int
zero if elements equal
+1 (positive) if first > second
-1 (negative) if first < second
Function object example
class StrLenComp // : public Comparer<string>
{
public:
int compare(const string& a, const string& b) const
// post: return -1/+1/0 as a.length() < b.length()
{
if (a.length() < b.length()) return -1;
if (a.length() > b.length()) return 1;
return 0;
}
};
// to use this:
StrLenComp scomp;
if (scomp.compare("hello", "goodbye") < 0) …
We can use this to sort, see sortall.h
Call of sort: InsertSort(vec, vec.size(), scomp);
Non-comparison-based sorts
lower bound: Ω(n log n) for
comparison-based sorts (like the
searching lower bound)
bucket sort/radix sort are
not comparison-based; faster
asymptotically and in
practice
sort a vector of ints, all ints
in the range 1..100, how?
radix: examine each digit of
numbers being sorted
[Figure: radix/bucket sort of 23 34 56 25 44 73 42 26 10 16, with ten buckets labeled 0 through 9 for distributing values by digit]
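The answer to "how?" for ints known to lie in 1..100 is a counting/bucket sort; a sketch (names are illustrative):

```cpp
#include <vector>

// Counting/bucket sort for ints in 1..100: no comparisons between
// elements, just tallying. O(n + 100) time.
std::vector<int> bucketSort(const std::vector<int>& a) {
    std::vector<int> count(101, 0);  // count[v] = occurrences of v
    for (int v : a) count[v]++;
    std::vector<int> sorted;
    sorted.reserve(a.size());
    for (int v = 1; v <= 100; v++)
        for (int c = 0; c < count[v]; c++)
            sorted.push_back(v);     // emit each value count[v] times
    return sorted;
}
```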
Shell sort
Comparison-based, similar to insertion sort
Using Hibbard’s increments (see sortall.h) yields O(n^(3/2))
Sequence of insertion sorts; note the last value of h!!
    int k,loc,h; string elt;    // n == a.size()
    h = …;   // set h to 2^p - 1, just less than a.size()
    while (h > 0)
    {
        for(k=h; k < n; k++)
        {
            elt = a[k];
            loc = k;
            while (h <= loc && elt < a[loc-h])
            {
                a[loc] = a[loc-h];
                loc -= h;
            }
            a[loc] = elt;
        }
        h /= 2;
    }