Hashing (Ch. 14)
Goal: to implement a symbol table or dictionary
(insert, delete, search)
What if you don’t need the ordered-key operations (pred, succ,
sort, select)?
Are O(log n) comparisons necessary? (no)
Hashing basic plan:
create a big array for the items to be stored
use a function to compute the storage location from the key (the
hash function)
use a collision resolution scheme when two keys map to the same location
Hashing Example
Simple Hash function:
Treat the key as a large integer K
h(K) = K mod M, where M is the table size
let M be a prime number.
Example:
Suppose we have 101 buckets in the hash table.
‘abcd’ in hex is 0x61626364
Converted to decimal it’s 1633837924
1633837924 % 101 = 11
Thus h(‘abcd’) = 11. Store the key at location 11.
“dcba” hashes to 57.
“abbc” also hashes to 57 – collision. What to do?
If you have billions of possible keys and hundreds of
buckets, lots of collisions are possible!
Hashing Strings
h(‘aVeryLongVariableName’)?
Instead of dealing with very large numbers, process one
character at a time using Horner’s method, reducing mod M at
each step (character codes: ‘a’ = 97, ‘V’ = 86, ‘e’ = 101, ‘r’ = 114):
(256 * 97 + 86) % 101 = 24918 % 101 = 72
(256 * 72 + 101) % 101 = 18533 % 101 = 50
(256 * 50 + 114) % 101 = 12914 % 101 = 87
Scramble by replacing the base 256 with 117
int hash(char *v, int M)
{
    int h, a = 117;
    for (h = 0; *v != '\0'; v++)
        h = (a * h + *v) % M;    /* Horner step, reduced mod M */
    return h;
}
Collisions
How likely are collisions?
Birthday paradox: among random values in 0..M-1, expect the
first repeat after about sqrt(pi M / 2), i.e. about 1.25 sqrt(M), draws:

M        1.25 sqrt(M)
100      12
1000     40
10000    125

[1.25 sqrt(365) is about 24: the classic birthday paradox]
Experiment: generate random numbers 0..100
84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61 …
Collision at 13th number, as predicted
What to do about collisions?
Separate Chaining
Build a linked list for each bucket
Linear search within list
[Figure: a table of M = 11 buckets (0-10); each non-empty bucket
(e.g. 1: L, 2: M, 3: N, 5: E, 7: G, 8: H, 9: I) heads a linked list,
and colliding keys such as the repeated A’s and E’s share a list]
Simple, practical, widely used
Cuts search time by a factor of M over sequential
search
Separate Chaining 2
Insertion time?
O(1) (insert at the front of the list)
Average search cost, successful search?
O(N/2M)
Average search cost, unsuccessful?
O(N/M)
M large: CONSTANT average search time
Worst case: N (“probabilistically unlikely”)
Keep lists sorted?
insert time O(N/2M)
unsuccessful search time O(N/2M)
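A minimal separate-chaining sketch in C, assuming string keys and the a = 117 hash from earlier (the `sc_*` names are illustrative, not from the lecture):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Separate chaining: M buckets, each the head of a linked list;
   colliding keys share a bucket's list. */
#define M 101

typedef struct node {
    char *key;
    struct node *next;
} node;

static node *sc_table[M];

static unsigned sc_hash(const char *v) {
    unsigned h = 0;
    for (; *v; v++) h = (117 * h + (unsigned char)*v) % M;
    return h;
}

void sc_insert(const char *key) {          /* O(1): prepend to bucket */
    node *n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    unsigned b = sc_hash(key);
    n->next = sc_table[b];
    sc_table[b] = n;
}

int sc_search(const char *key) {           /* scans ~N/(2M) nodes on a hit */
    for (node *n = sc_table[sc_hash(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}
```

(“a” and “a<” both hash to 97 under this function, so inserting both exercises the chain.)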
Linear Probing
Or, we could keep everything in the same table
Insert: upon collision, probe forward for a free spot
Search: probe the same sequence (if you reach an empty spot, the
search fails)
Runtime?
Still O(1) if the table is sparse
But: as the table fills, clustering occurs
Skipping c spots instead of 1 doesn’t help…
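A linear-probing sketch under the same assumptions (illustrative `lp_*` names; no growth or deletion, and the table is assumed never to fill):

```c
#include <assert.h>
#include <string.h>

/* Linear probing: everything lives in one array; on a collision,
   step to the next slot (wrapping) until a free one turns up. */
#define M 101

static const char *lp_table[M];            /* NULL = empty slot */

static unsigned lp_hash(const char *v) {
    unsigned h = 0;
    for (; *v; v++) h = (117 * h + (unsigned char)*v) % M;
    return h;
}

void lp_insert(const char *key) {
    unsigned i = lp_hash(key);
    while (lp_table[i] != NULL)            /* collision: probe forward */
        i = (i + 1) % M;
    lp_table[i] = key;
}

int lp_search(const char *key) {
    for (unsigned i = lp_hash(key); lp_table[i] != NULL; i = (i + 1) % M)
        if (strcmp(lp_table[i], key) == 0)
            return 1;
    return 0;                              /* reached an empty slot: miss */
}
```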
Clustering
Long clusters tend to get longer
Precise analysis difficult
Theorem (Knuth):
Insert cost: approx. (1 + 1/(1 - N/M)^2)/2
(50% full: 2.5 probes; 80% full: 13 probes)
Search (hit) cost: approx. (1 + 1/(1 - N/M))/2
(50% full: 1.5 probes; 80% full: 3 probes)
Search (miss): same as insert
Too slow when table gets 70-80% full
How to reduce/avoid clustering?
Double Hashing
Use a second hash function to compute the probe increment
sequence
Analysis extremely difficult
About like the ideal (random probe)
Thm (Guibas-Szemeredi):
Insert: approx. 1/(1 - N/M)
Search hit: approx. ln(1/(1 - N/M))/(N/M)
Search miss: same as insert
Not too slow until the table is about 90% full
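A double-hashing sketch; `dh2` here is one common choice of second hash (1 + value mod M-1), an assumption rather than anything prescribed by the slides. With M prime and an increment in 1..M-1, the probe sequence visits every slot:

```c
#include <assert.h>
#include <string.h>

/* Double hashing: h1 picks the start slot, h2 picks the stride. */
#define M 101

static const char *dh_table[M];

static unsigned dh1(const char *v) {
    unsigned h = 0;
    for (; *v; v++) h = (117 * h + (unsigned char)*v) % M;
    return h;
}

static unsigned dh2(const char *v) {       /* increment in 1..M-1 */
    unsigned h = 0;
    for (; *v; v++) h = (31 * h + (unsigned char)*v) % (M - 1);
    return 1 + h;
}

void dh_insert(const char *key) {
    unsigned i = dh1(key), step = dh2(key);
    while (dh_table[i] != NULL)
        i = (i + step) % M;                /* second-hash stride */
    dh_table[i] = key;
}

int dh_search(const char *key) {
    unsigned i = dh1(key), step = dh2(key);
    while (dh_table[i] != NULL) {
        if (strcmp(dh_table[i], key) == 0) return 1;
        i = (i + step) % M;
    }
    return 0;
}
```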
Dynamic Hash Tables
Suppose you are making a symbol table for a
compiler. How big should you make the hash table?
If you don’t know in advance how big a table to
make, what to do?
Could grow the table when it “fills” (e.g. 50% full)
Make a new table of twice the size.
Make a new hash function
Re-hash all of the items in the new table
Dispose of the old table
Table Growing Analysis
Worst case insertion: Θ(n), to re-hash all items
Can we make any better statements?
Average case?
O(1): insertions n+1 through 2n cost O(n) (on average) for the
insertions themselves plus O(2n) (on average) for the one rehash,
which is O(n) total for those n insertions (with 3x the constant)
Amortized analysis?
The result above is actually an amortized result for the
rehashing.
Any sequence of j insertions into an empty table has O(j)
average cost for insertions and O(2j) for rehashing.
Or, think of it as billing 3 time units for each insertion,
storing 2 in the bank. Withdraw them later for rehashing.
Separate Chaining vs. Double Hashing
Assume the same amount of space for keys and links
(use pointers for long or variable-length keys)
Separate chaining:
1M buckets, 4M keys, 4M links in nodes
9M words total; avg search time 2
Double hashing in the same space:
4M items, 9M buckets in table
average search time: 1/(1 - 4/9) = 1.8: 10% faster
Double hashing in the same time:
4M items, average search time 2
space needed: 8M words (1/(1 - 4/8) = 2) (11% less space)
Deletion
How to implement delete() with separate chaining?
Simply unlink the unwanted item
Runtime? Same as search()
How to implement delete() with linear probing?
Can’t just erase it. (Why not? An empty slot would cut off the
probe sequence for later keys in the same cluster)
Re-hash the rest of the cluster
Or mark the slot as deleted?
How to delete() with double hashing?
Re-hashing the cluster doesn’t work – which “cluster”?
Mark as deleted
Every so often re-hash the entire table to prune the “dead wood”
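The mark-as-deleted idea can be sketched for linear probing with a tombstone state (illustrative `td_*` names; DELETED slots are probed past by search but reusable by insert):

```c
#include <assert.h>
#include <string.h>

/* Tombstone deletion: a deleted slot is marked, not emptied, so
   probe sequences that pass through it remain intact. */
#define M 101
enum { EMPTY, FULL, DELETED };             /* EMPTY = 0: static init */

static const char *td_keys[M];
static unsigned char td_state[M];

static unsigned td_hash(const char *v) {
    unsigned h = 0;
    for (; *v; v++) h = (117 * h + (unsigned char)*v) % M;
    return h;
}

void td_insert(const char *key) {
    unsigned i = td_hash(key);
    while (td_state[i] == FULL)            /* may reuse DELETED slots */
        i = (i + 1) % M;
    td_keys[i] = key;
    td_state[i] = FULL;
}

int td_search(const char *key) {
    for (unsigned i = td_hash(key); td_state[i] != EMPTY; i = (i + 1) % M)
        if (td_state[i] == FULL && strcmp(td_keys[i], key) == 0)
            return 1;
    return 0;                              /* stopped at a truly empty slot */
}

void td_delete(const char *key) {
    for (unsigned i = td_hash(key); td_state[i] != EMPTY; i = (i + 1) % M)
        if (td_state[i] == FULL && strcmp(td_keys[i], key) == 0) {
            td_state[i] = DELETED;         /* mark, don't erase */
            return;
        }
}
```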
Comparisons and summary
Why use hashing?
constant time search and insert, on average
easy to implement
Why not use hashing?
No performance guarantees
Too much arithmetic on long keys – high constant
Uses extra space
Doesn’t support pred, succ, sort, etc. – no notion of order
Separate chaining advantages:
idiot-proof (degrades gracefully)
no large chunks of memory needed (but is this better?)
Where did perl “hashes” get their name?
Hashing Summary
Separate chaining: easiest to deploy
Linear probing: fastest (but takes more memory)
Double hashing: least memory (but takes more time
to compute the second hash function)
Dynamic (grow): handles any number of inserts
Curious use of hashing: early unix spell checker
(back in the days of the 3M machines…)
Construction

N      RB    Chain   Probe   Dbl    Grow
5k       6     1       4      4      3
50k     74    18      11     12     22
100k   182    35      21     23     47
190k   407    79     106     59    155
200k     -    84       -      -    159

Search Miss

N      RB    Chain   Probe   Dbl    Grow
5k       2     1       0      1      0
50k     36    15       8      8      8
100k    84    45      23     21     15
190k   186   144    2194    261     30
200k     -   156       -      -     33