Of -1? - Duke Computer Science

Download Report

Transcript Of -1? - Duke Computer Science

PF-AT-EOS
Data compression

Seriously important, why?
Priority Queue

Seriously important, why?
Assignment overview

Huff _______
Bits, Bytes, Atoms

What's an int? ASCII? double?
CPS 100, Fall 2011
12.1
PQ Application: Data Compression
Compression is a high-profile application



.zip, .mp3, .jpg, .gif, .gz, …
What property of MP3 was a significant factor in what
made Napster work (why did Napster ultimately fail?)
Who invented Napster, how old, when?
Why do we care?




Secondary storage capacity doubles every year
Disk space fills up quickly on every computer system
More data to compress than ever before
Will we ever need to stop worrying about storage?
CPS 100, Fall 2011
12.2
More on Compression
Different compression techniques



.mp3 files and .zip files?
.gif and .jpg?
Lossless and lossy
Impossible to compress/lossless everything: Why?
Lossy methods

Good for pictures, video, and audio (JPEG, MPEG, etc.)
Lossless methods

Run-length encoding, Huffman, LZW, …
11 3 5 3 2 6 2 6 5 3 5 3 5 3 10
CPS 100, Fall 2011
12.3
Huffman Coding
Understand Huffman Coding




Data compression
Priority Queue
Bits and Bytes
Greedy Algorithm
Many compression algorithms




Huffman is optimal, per-character compression
Still used, e.g., basis of Burrows-Wheeler
Other compression 'better', sometimes slower?
LZW, GZIP, BW, …
CPS 100, Fall 2011
12.4
Compression and Coding
What gets compressed?
 Save on storage, why is this a good idea?
 Save on data transmission, how and why?
What is information, how is it compressible?
 Exploit redundancy, without that, hard to
compress
Represent information: code (Morse cf. Huffman)
 Dots and dashes or 0's and 1's
 How to construct code?
CPS 100, Fall 2011
12.5
Coding/Compression/Concepts
For ASCII we use 8 bits, for Unicode 16 bits
 Minimum number of bits to represent N values?
 Representation of genomic data (a, c ,g, t)?
 What about noisy genomic data?
We can use a variable-length encoding, e.g., Huffman
 How do we decide on lengths? How do we decode?
 Values for Morse code encodings, why?

… ---…
CPS 100, Fall 2011
12.6
Huffman Coding
D.A Huffman in early 1950’s: story of invention
 Analyze and process data before compression
 Not developed to compress data “on-the-fly”
Represent data using variable length codes
 Each letter/chunk assigned a codeword/bitstring
 Codeword for letter/chunk is produced by traversing
the Huffman tree
 Property: No codeword produced is the prefix of
another
 Frequent letters/chunk have short encoding, while
those that appear rarely have longer ones
Huffman coding is optimal per-character coding method
CPS 100, Fall 2011
12.7
Mary Shaw
Software engineering and
software architecture
 Tools for constructing
large software systems
 Development is a small
piece of total cost,
maintenance is larger,
depends on well-designed
and developed techniques
Interested in computer
science, programming,
curricula, and canoeing,
health-care costs
ACM Fellow, Alan Perlis
Professor of Compsci at CMU
CPS 100, Fall 2011
12.8
Huffman coding: go go gophers
ASCII
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32
1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
3 bits
000
001
010
011
100
101
110
111
??
??
choose two smallest weights
 combine nodes + weights
 Repeat
 Priority queue?
Encoding uses tree:
 0 left/1 right
 How many bits?
CPS 100, Fall 2011
g
o
p
h
e
r
s
*
3
3
1
1
1
1
1
2
2
2
3
p
h
e
r
s
*
1
1
1
1
1
2
6
4
2
2
p
h
e
r
1
1
1
1
g
o
3
3
12.9
Huffman coding: go go gophers
ASCII
g 103
o 111
p 112
h 104
e 101
r 114
s 115
sp. 32
1100111
1101111
1110000
1101000
1100101
1110010
1110011
1000000
3 bits
000
001
010
011
100
101
110
111
00
01
1100
1101
1110
1111
100
101
13
7
6
g
o
3
3
Encoding uses tree/trie:
 0 left/1 right
 How many bits? 37!!
 Savings? Worth it?
4
3
s
*
1
2
2
2
p
h
e
r
1
1
1
1
6
4
2
2
g
o
3
3
3
CPS 100, Fall 2011
p
h
e
r
1
1
1
1
s
*
1
2
12.10
Building a Huffman tree
Begin with a forest of single-node trees/tries (leaves)
 Each node/tree/leaf is weighted with character count
 Node stores two values: character and count
Repeat until there is only one node left: root of tree
 Remove two minimally weighted trees from forest
 Create new tree/internal node with minimal trees as
children,
• Weight is sum of children’s weight (no char)
How does process terminate? Finding minimum?
 Remove minimal trees, hummm……
CPS 100, Fall 2011
12.11
How do we create Huffman Tree/Trie?
Insert weighted values into priority queue
 What are initial weights? Why?

Remove minimal nodes, weight by sums, re-insert
 Total number of nodes?
PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>();
for(int k=0; k < freq.length; k++){
pq.add(new TreeNode(k,freq[k],null,null));
}
while (pq.size() > 1){
TreeNode left = pq.remove();
TreeNode right = pq.remove();
pq.add(new TreeNode(0,left.weight+right.weight,
left,right));
}
TreeNode root = pq.remove();
CPS 100, Fall 2011
12.12
Creating compressed file
Once we have new encodings, read every character
 Write encoding, not the character, to compressed file
 Why does this save bits?
 What other information needed in compressed file?
How do we uncompress?
 How do we know foo.hf represents compressed file?
 Is suffix sufficient? Alternatives?
Why is Huffman coding a two-pass method?
 Alternatives?
CPS 100, Fall 2011
12.13
Uncompression with Huffman
We need the trie to uncompress
 000100100010011001101111
As we read a bit, what do we do?
 Go left on 0, go right on 1
 When do we stop? What to do?
How do we get the trie?
 How did we get it originally? Store 256 int/counts
• How do we read counts?
 How do we store a trie? 20 Questions relevance?
• Reading a trie? Leaf indicator? Node values?
CPS 100, Fall 2011
12.14
Other Huffman Issues
What do we need to decode?
 How did we encode? How will we decode?
 What information needed for decoding?
Reading and writing bits: chunks and stopping
 Can you write 3 bits? Why not? Why?
 PSEUDO_EOF
 BitInputStream and BitOutputStream: API
What should happen when the file won’t compress?
 Silently compress bigger? Warn user? Alternatives?
CPS 100, Fall 2011
12.15
Huffman Complexities
How do we measure? Size of input file, size of alphabet
 Which is typically bigger?
Accumulating character counts: ______
 How can we do this in O(1) time, though not really
Building the heap/priority queue from counts ____
 Initializing heap guaranteed
Building Huffman tree ____
 Why?
Create table of encodings from tree ____
 Why?
Write tree and compressed file _____
CPS 100, Fall 2011
12.16
Good Compsci 100 Assignment?
Array of character/chunk counts, or is this a map?
 Map character/chunk to count, why array?
Priority Queue for generating tree/trie
 Do we need a heap implementation? Why?
Tree traversals for code generation, uncompression
 One recursive, one not, why and which?
Deal with bits and chunks rather than ints and chars
 The good, the bad, the ugly
Create a working compression program
 How would we deploy it? Make it better?
Benchmark for analysis
 What’s a corpus?
CPS 100, Fall 2011
12.17
Anita Borg 1949-2003
“Dr. Anita Borg tenaciously
envisioned and set about to
change the world for women
and for technology. … she
fought tirelessly for the
development technology
with positive social and
human impact.”
“Anita Borg sought to
revolutionize the world and
the way we think about
technology and its impact on
our lives.”
http://www.youtube.com/wat
ch?v=1yPxd5jqz_Q
CPS 100, Fall 2011
12.18
YAQ, YAQ, haha! (Yet Another Queue)
What is the dequeue policy for a Queue?

Why do we implement Queue with LinkedList
• Interface and class in java.util

Can we remove an element other than first?
How does queue help word-ladder/shortest path?


First item enqueued/added is the one we want
What if different element is “best”?
PriorityQueue has a different dequeue policy

Best item is dequeued, queue manages itself to ensure
operations are efficient
CPS 100, Fall 2011
12.19
PriorityQueue raison d’être
Algorithms Using PQ for efficiency

Shortest Path: Google Maps/Garmin to Internet Routing
• How is this like word-ladder? How different?

Event based simulation
• Coping with explosion in number of particles or things

Optimal A* search, game-playing, AI,
• Can't explore entire search space, can estimate good move
Data compression facilitated by priority queue

Alltime best assignment in a Compsci 100 course?
• Subject to debate, of course

From A-Z, soup-to-nuts, bits to abstractions
CPS 100, Fall 2011
12.20
Priority Queue
Compression motivates ADT priority queue


Supports two basic operations
• add/insert -– an element into the priority queue
• remove/delete – the minimal element from the
priority queue
Implementations allow getmin/peek as well as delete
• Analogous to top/pop, peek/dequeue in stacks, queues
Think about implementing the ADT, choices?
 Add compared to min/remove
 Balanced search tree is ok, but can we do better?
CPS 100, Fall 2011
12.21
Priority Queue sorting
See PQDemo.java,

code below sorts, complexity?
String[] array = {...}; // array filled with data
PriorityQueue<String> pq = new PriorityQueue<String>();
for(String s : array) pq.add(s);
for(int k=0; k < array.length; k++){
array[k] = pq.remove();
}
Bottlenecks, operations in code above
 Add words one-at-a-time to PQ v. all-at-once
 What if PQ is an array, add or remove fast/slow?
 We’d like PQ to have tree characteristics, why?
CPS 100, Fall 2011
12.22
Priority Queue top-M sorting
What if we have lots and lots and lots of data

code below sorts top-M elements, complexity?
Scanner s = … // initialize;
PriorityQueue<String> pq =
new PriorityQueue<String>();
while (s.hasNext()) {
pq.add(s.next());
if (pq.size() > M) pq.remove();
}
What’s advantageous about this code?
 Store everything and sort everything?
 Store everything, sort first M?
 What is complexity of sort: O(n log n)
CPS 100, Fall 2011
12.23
Priority Queue implementations
Priority queues: average and worst case
Insert
average
Unsorted list
Sorted list
Search tree
Balanced tree
Heap
Getmin Insert
(delete) worst
Getmin
(delete)
O(1)
O(n)
O(1)
O(n)
O(n)
O(1)
O(n)
O(1)
log n
log n
O(n)
log n
log n
log n
log n
O(1)
log n
log n
log n
O(n)
Heap has O(n) build heap from n elements
CPS 100, Fall 2011
12.24
PriorityQueue.java [java.util]
What about objects inserted into pq?



Comparable, e.g., essentially sortable
How can we change what minimal means?
Implementation uses heap, tree stored in an array
Use a Comparator for comparing entries we can
make a min-heap act like a max-heap, see PQDemo


Where is class Comparator declaration? How used?
What if we didn't know about Collections.reverseOrder?
• How do we make this ourselves?
CPS 100, Fall 2011
12.25
Big-Oh and a tighter look at inserts
log(1) + log(2) + log(3) + … + log(n)


Property of logs, log(a) + log(b) = log(a*b)
log(1*2*3*…*n) = log(n!)
We can show using Sterling’s formula:
log(n!) = c1*log(n) + nlog(n) – c2*n
We can get O(n log n) easily, this is a tight bound,
lower, W(n log n) as well
CPS 100, Fall 2011
12.26
Priority Queue implementation
Heap data structure is fast and reasonably simple


Why not use inheritance hierarchy as was used with Map?
Trade-offs when using HashMap and TreeMap:
• Time, space, ordering properties, TreeMap support?
Changing comparison when calculating priority?

Create object to replace, or in lieu of compareTo
• Comparable interface compares this to passed object
• Comparator interface compares two passed objects

Both comparison methods: compareTo() and compare()
• Compare two objects (parameters or self and parameter)
• Returns –1, 0, +1 depending on <, ==, >
CPS 100, Fall 2011
12.27
Creating Heaps
Heap: array-based implementation of binary tree
used for implementing priority queues:

add/insert, peek/getmin, remove/deletemin, O(???)
Array minimizes storage (no explicit pointers),
faster too, contiguous (cache) and indexing
Heap has shape property and heap/value property


shape: tree filled at all levels (except perhaps last) and
filled left-to-right (complete binary tree)
each node has value smaller than both children
CPS 100, Fall 2011
12.28
Array-based heap
store “node values” in array
beginning at index 1
for node with index k
 left child: index 2*k
 right child: index 2*k+1
why is this conducive for
maintaining heap shape?
what about heap property?
is the heap a search tree?
where is minimal node?
where are nodes added?
deleted?
CPS 100, Fall 2011
6 10 7 17 13 9 21 19 25
0 1
2
3
4
5
6
7
8
9 10
6
7
10
13
17
19
9
21
25
12.29
Thinking about heaps
Where is minimal element?
 Root, why?
Where is maximal element?
 Leaves, why?
How many leaves are there in
an N-node heap (big-Oh)?

O(n), but exact?
What is complexity of find
max in a minheap? Why?

O(n), but ½ N?
Where is second smallest
element? Why?

Near root?
CPS 100, Fall 2011
6
7
10
13
17
9
21
25
19
6 10 7 17 13 9 21 19 25
0 1
2
3
4
5
6
7
8
9 10
12.30
Adding values to heap
to maintain heap shape, must
add new value in left-to-right
order of last level
 could violate heap property
 move value “up” if too
small
6
7
10
13
17
19
9
21
25 insert 8
change places with parent if
10
heap property violated
17
8
 stop when parent is smaller
19 25 13
 stop when root is reached
6
7
10
13
17
bubble 8 up
7
9
21
6
7
8
pull parent down, swapping
isn’t necessary (optimization)
CPS 100, Fall 2011
21
25 8
19
6
9
17
19
10
9
21
25 13
12.31
Adding values, details (pseudocode)
void add(Object elt)
{
// add elt to heap in myList
myList.add(elt);
int loc = myList.size()-1;
6
7
10
13
17
19
9
21
25
while (1 < loc &&
elt < myList.get(loc/2)){
myList.set(loc,myList.get(loc/2));
loc = loc/2; // go to parent
}
// what’s true here?
6
7
10
13
17
19
9
21
25 8
myList.set(loc,elt);
}
6 10 7 17 13 9 21 19 25
0 1
2
3
4
5
6
7
8
9 10
array myList
CPS 100, Fall 2011
12.32
Removing minimal element
Where is minimal element?
6
 If we remove it, what changes,
7
10
shape/property?
13
17
9
21
How can we maintain shape?
19 25
25
 “last” element moves to root
7
10
 What property is violated?
13
17
9
21
After moving last element,
19
subtrees of root are heaps, why?
7
 Move root down (pull child
25
10
up) does it matter where?
13
17
9
21
When can we stop “re-heaping”?
19
7
 Less than both children
9
10
 Reach a leaf
13 25
17
21
19
CPS 100, Fall 2011
12.33
Views of programming
Writing code from the method/function view is
pretty similar across languages


Organizing methods is different, organizing code is
different, not all languages have classes,
Loops, arrays, arithmetic, …
Program using abstractions and high level concepts


Do we need to understand 32-bit twos-complement storage
to understand x =x+1?
Do we need to understand how arrays map to contiguous
memory to use ArrayLists?
CPS 100, Fall 2011
12.34
Bottom up meets Top down
We teach programming by teaching abstractions



What's an array? What’s an ArrayList?
What's an array in C?
What's a map, Hashmap? Treemap?
How are programs stored and executed in the JVM?


Differences on different architectures?
Similarities between Java and C++ and Python and PHP?
What about how the CPU works? How memory
works? Shouldn't we study this too?
CPS 100, Fall 2011
12.35
From bit to byte to char to int to long
Ultimately everything is stored as either a 0 or 1



Bit is binary digit a byte is a binary term (8 bits)
We should be grateful we can deal with Strings rather than
sequences of 0's and 1's.
We should be grateful we can deal with an int rather than
the 32 bits that comprise an int
If we have 255 values for R, G, B, how can we pack
this into an int?


Why should we care, can’t we use one int per color?
How do we do the packing and unpacking?
CPS 100, Fall 2011
12.36
More information on bit, int, long
int values are stored as two's complement numbers
with 32 bits, for 64 bits use the type long, a char is
16 bits



Standard in Java, different in C/C++
Facilitates addition/subtraction for int values
We don't need to worry about this, except to note:
• Infinity + 1 = - Infinity (see Integer.MAX_VALUE)
• Math.abs(-Infinity) > Infinity
Java byte, int, long are signed values, char unsigned


What are values for 16-bit char? 8-bit byte?
Why will this matter in Burrows Wheeler?
CPS 100, Fall 2011
12.37
More details about bits
How is 13 represented?

… _0_ _0_ _1_ _1_ _0_ _1_
24

23
22
21
20
Total is 8+4+1 = 13
What is bit representation of 32? Of 15? Of 1023?


What is bit-representation of 2n - 1?
What is bit-representation of 0? Of -1?
• Study later, but -1 is all 1’s, left-most bit determines < 0
Java signed byte: -128..127, # bits?


What if we only want 0-255? (Huff, pixels, …)
Convert negative values or use char, trade-offs?
Java char unsigned: 0..65,536 # bits?

Why is char unsigned? Why not as in C++/C?
CPS 100, Fall 2011
12.38
How are data stored?
To facilitate Huffman coding we need to read/write
one bit



Why do we need to read one bit?
Why do we need to write one bit?
When do we read 8 bits at a time? 32 bits?
We can't actually write one bit-at-a-time. We can't
really write one char at a time either.


Output and input are buffered,minimize memory accesses
and disk accesses
Why do we care about this when we talk about data
structures and algorithms?
• Where does data come from?
CPS 100, Fall 2011
12.39
How do we buffer char output?
Done for us as part of InputStream and Reader
classes
InputStreams are for reading bytes
 Readers are for reading char values
 Why do we have both and how do they interact?
Reader r = new InputStreamReader(System.in);
 Do we need to flush our buffers?

In the past Java IO has been notoriously slow


Do we care about I? About O?
This is changing, and the java.nio classes help
• Map a file to a region in memory in one operation
CPS 100, Fall 2011
12.40