Hashing (Ch. 14)

- Goal: to implement a symbol table or dictionary (insert, delete, search)
- What if you don't need the ordered-key operations (pred, succ, sort, select)?
- Are O(log n) comparisons necessary? (No.)

Hashing basic plan:

- create a big array for the items to be stored
- use a function to figure out the storage location from the key (the hash function)
- a collision resolution scheme is necessary
Hashing Example

Simple hash function:

- Treat the key as a large integer K
- h(K) = K mod M, where M is the table size
- Let M be a prime number
Example:

- Suppose we have 101 buckets in the hash table.
- ‘abcd’ in hex is 0x61626364.
- Converted to decimal, that's 1633837924.
- 1633837924 % 101 = 11
- Thus h(‘abcd’) = 11. Store the key at location 11.
- “dcba” hashes to 57.
- “abbc” also hashes to 57 – a collision. What to do?
- If you have billions of possible keys and only hundreds of buckets, lots of collisions are possible!
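As a quick sanity check, here is a minimal C sketch of that computation. It only packs the four ASCII bytes of ‘abcd’ into one integer and reduces it mod 101; the loop and variable names are just for illustration.

#include <stdio.h>

int main(void) {
    const char *key = "abcd";
    unsigned long k = 0;
    int i;
    for (i = 0; i < 4; i++)
        k = (k << 8) | (unsigned char)key[i];   /* k = 0x61626364 = 1633837924 */
    printf("h(\"%s\") = %lu\n", key, k % 101);  /* prints h("abcd") = 11 */
    return 0;
}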
Hashing Strings


- h(‘aVeryLongVariableName’)?
- Instead of dealing with very large numbers, you can use Horner's method, reducing mod 101 at each step:
  - 256 * 97 + 86 = 24918; 24918 % 101 = 72   ('a' = 97, 'V' = 86)
  - 256 * 72 + 101 = 18533; 18533 % 101 = 50   ('e' = 101)
  - 256 * 50 + 114 = 12914; 12914 % 101 = 87   ('r' = 114)
  - ...and so on for the remaining characters
- Scramble by replacing 256 with 117
int hash(char *v, int M)
{
    int h, a = 117;              /* multiplier: 117 instead of 256 "scrambles" the bytes */
    for (h = 0; *v; v++)         /* one Horner step per character */
        h = (a*h + *v) % M;      /* reduce mod M at every step to avoid overflow */
    return h;
}
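For instance, the function above could be called like this (a minimal sketch; the bucket count 101 is carried over from the earlier example):

#include <stdio.h>

int hash(char *v, int M);   /* the function defined above */

int main(void) {
    printf("%d\n", hash("aVeryLongVariableName", 101));
    printf("%d\n", hash("abcd", 101));
    return 0;
}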
Collisions


- How likely are collisions?
- Birthday paradox: expect the first collision after about sqrt(πM/2), i.e. about 1.25 sqrt(M), random insertions.

  M        sqrt(πM/2) ≈ 1.25 sqrt(M)
  100      12
  1000     40
  10000    125

  [1.25 sqrt(365) is about 24]

- Experiment: generate random numbers in 0..100
  84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61 …
  Collision at the 13th number, as predicted.
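A rough C sketch of that experiment, assuming nothing beyond the standard library: it draws values in 0..100 until one repeats and reports how many draws that took (about 13 on average).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    int seen[101] = {0};            /* 101 possible values: 0..100 */
    int i;
    srand((unsigned)time(NULL));
    for (i = 1; ; i++) {
        int r = rand() % 101;
        if (seen[r]) {              /* this value has appeared before */
            printf("collision at draw %d (value %d)\n", i, r);
            break;
        }
        seen[r] = 1;
    }
    return 0;
}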
What to do about collisions?
Separate Chaining


- Build a linked list for each bucket
- Do a linear search within a bucket's list (a code sketch follows below)

  [Diagram: a table of 11 buckets (0-10); each bucket heads a linked list of the keys that hash to it]

- Simple, practical, widely used
- Cuts search time by a factor of M compared with sequential search
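A minimal sketch of separate chaining under the assumptions above: an array of M list heads, insertion at the front of a bucket's chain, and a linear search within the chain. The names (Node, st_insert, st_search) and the fixed M = 101 are illustrative, not from the lecture.

#include <stdlib.h>
#include <string.h>

#define M 101                      /* number of buckets */

typedef struct Node {
    char *key;
    struct Node *next;
} Node;

static Node *table[M];             /* one chain (possibly empty) per bucket */

int hash(char *v, int m);          /* the Horner-style hash from the earlier slide */

void st_insert(char *key) {
    int h = hash(key, M);
    Node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[h];            /* push onto the front of the chain: O(1) */
    table[h] = n;
}

Node *st_search(char *key) {
    Node *n;
    for (n = table[hash(key, M)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return n;              /* hit */
    return NULL;                   /* miss: scanned the whole chain */
}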
Separate Chaining 2

- Insertion time? O(1): insert at the front of the bucket's list.
- Average cost of a successful search? O(N/2M)
- Average cost of an unsuccessful search? O(N/M)
- With M large: constant average search time
- Worst case: N ("probabilistically unlikely")
- Keep the lists sorted?
  - insert time becomes O(N/2M)
  - unsuccessful search time becomes O(N/2M)
Linear Probing

- Or, we could keep everything in the same table
- Insert: on a collision, probe forward until a free spot is found (see the sketch below)
- Search: follow the same probe sequence; if you reach an empty slot, the search fails
- Runtime? Still O(1) if the table is sparse
- But as the table fills, clustering occurs
- Skipping c spots instead of 1 doesn't help…
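A minimal sketch of linear probing under the same assumptions (fixed prime table size, string keys); the lp_ prefix is just to keep the names distinct.

#include <string.h>

#define M 101                      /* table size; keep it sparse (N well below M) */

static char *table[M];             /* NULL marks an empty slot */

int hash(char *v, int m);          /* hash function from the earlier slide */

void lp_insert(char *key) {
    int i = hash(key, M);
    while (table[i] != NULL)       /* collision: probe forward for a free spot */
        i = (i + 1) % M;           /* assumes the table is never completely full */
    table[i] = key;
}

char *lp_search(char *key) {
    int i;
    for (i = hash(key, M); table[i] != NULL; i = (i + 1) % M)
        if (strcmp(table[i], key) == 0)
            return table[i];       /* hit */
    return NULL;                   /* reached an empty slot: the search fails */
}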
Clustering

- Long clusters tend to get longer
- Precise analysis is difficult
- Theorem (Knuth):
  - Insert cost: approx. (1 + 1/(1 - N/M)^2)/2
    (50% full → 2.5 probes; 80% full → 13 probes)
  - Search (hit) cost: approx. (1 + 1/(1 - N/M))/2
    (50% full → 1.5 probes; 80% full → 3 probes)
  - Search (miss): same as insert
- Too slow when the table gets 70-80% full
- How to reduce/avoid clustering?
Double Hashing
- Use a second hash function to compute the probe increment sequence (see the sketch below)
- Analysis is extremely difficult
- Behaves about like the ideal (random probing)
- Theorem (Guibas-Szemerédi):
  - Insert: approx. 1/(1 - N/M)
  - Search (hit): approx. ln(1/(1 - N/M)) / (N/M)
  - Search (miss): same as insert
- Not too slow until the table is about 90% full
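A minimal sketch of double hashing along these lines. The particular second hash function here (hash2) is one common textbook-style choice, not the lecture's; the key requirements are that it never returns 0 and that, with M prime, any increment it produces is relatively prime to M.

#include <string.h>

#define M 101

static char *table[M];

int hash(char *v, int m);                 /* primary hash from the earlier slide */

static int hash2(char *v) {
    return 97 - hash(v, 97);              /* increment in 1..97, never 0 */
}

void dh_insert(char *key) {
    int i = hash(key, M), k = hash2(key);
    while (table[i] != NULL)
        i = (i + k) % M;                  /* step by k instead of by 1 */
    table[i] = key;
}

char *dh_search(char *key) {
    int i, k = hash2(key);
    for (i = hash(key, M); table[i] != NULL; i = (i + k) % M)
        if (strcmp(table[i], key) == 0)
            return table[i];
    return NULL;
}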
Dynamic Hash Tables

- Suppose you are building a symbol table for a compiler. How big should you make the hash table?
- If you don't know in advance how big a table to make, what can you do?
- Grow the table when it "fills" (e.g., reaches 50% full), as sketched below:
  - Make a new table of twice the size
  - Make a new hash function
  - Re-hash all of the items into the new table
  - Dispose of the old table
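A rough C sketch of that growth step, using a linear-probing table of string keys. The 50%-full trigger and the names are illustrative; note that simply doubling M loses primality, so a real implementation would round up to the next prime.

#include <stdlib.h>

int hash(char *v, int m);              /* hash function from the earlier slide */

static char **table;                   /* NULL entries are empty slots */
static int M = 101;                    /* current table size */
static int N = 0;                      /* number of keys stored */

static void put(char **t, int m, char *key) {
    int i = hash(key, m);
    while (t[i] != NULL)
        i = (i + 1) % m;
    t[i] = key;
}

void grow_insert(char *key) {
    if (table == NULL)
        table = calloc(M, sizeof *table);
    if (2 * N >= M) {                  /* "fills" at 50%: grow */
        int newM = 2 * M;              /* twice the size (ideally the next prime) */
        char **newtab = calloc(newM, sizeof *newtab);
        int i;
        for (i = 0; i < M; i++)        /* re-hash every old item into the new table */
            if (table[i] != NULL)
                put(newtab, newM, table[i]);
        free(table);                   /* dispose of the old table */
        table = newtab;
        M = newM;
    }
    put(table, M, key);
    N++;
}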
Table Growing Analysis

- Worst-case insertion: Θ(n), to re-hash all of the items
- Can we make any better statements?
- Average case? O(1) per insertion: insertions n through 2n cost O(n) (on average) for the insertions themselves and O(2n) (on average) for rehashing → O(n) total (with 3x the constant)
- Amortized analysis?
  - The result above is really an amortized bound on the rehashing.
  - Any sequence of j insertions into an empty table has O(j) cost for the insertions and O(2j) for rehashing.
  - Or, think of it as billing 3 time units for each insertion and storing 2 in the bank; withdraw them later to pay for rehashing.
Separate Chaining vs. Double Hashing


- Assume the same amount of space for keys and links (use pointers for long or variable-length keys)
- Separate chaining:
  - 1M buckets, 4M keys
  - 4M links in the list nodes
  - 9M words total; average search time 2
- Double hashing in the same space:
  - 4M items, 9M buckets in the table
  - average search time: 1/(1 - 4/9) = 1.8, about 10% faster
- Double hashing in the same time:
  - 4M items, average search time 2
  - space needed: 8M words, since 1/(1 - 4/8) = 2 (about 11% less space)
Deletion

- How to implement delete() with separate chaining?
  - Simply unlink the unwanted item
  - Runtime? Same as search()
- How to implement delete() with linear probing?
  - Can't just erase the item. (Why not?)
  - Re-hash the entire cluster
  - Or mark the slot as deleted? (see the sketch below)
- How to implement delete() with double hashing?
  - Re-hashing a cluster doesn't work – which "cluster"?
  - Mark the slot as deleted
  - Every so often, re-hash the entire table to prune the "dead wood"
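A minimal sketch of the "mark as deleted" idea for an open-addressing table: a deleted slot keeps a tombstone so that later searches probing past it do not stop early. The sentinel and the lp_ names are illustrative and follow the linear-probing sketch from earlier.

#include <string.h>

#define M 101

static char *table[M];                     /* NULL = never-used (truly empty) slot */
static char DELETED[1];                    /* unique address used as a tombstone */

int hash(char *v, int m);

void lp_delete(char *key) {
    int i;
    for (i = hash(key, M); table[i] != NULL; i = (i + 1) % M)
        if (table[i] != DELETED && strcmp(table[i], key) == 0) {
            table[i] = DELETED;            /* can't set it to NULL: that would cut off
                                              searches for keys further along this
                                              probe sequence */
            return;
        }
}

char *lp_search_with_tombstones(char *key) {
    int i;
    for (i = hash(key, M); table[i] != NULL; i = (i + 1) % M)
        if (table[i] != DELETED && strcmp(table[i], key) == 0)
            return table[i];
    return NULL;                           /* stop only at a truly empty slot */
}

Insertions can reuse tombstone slots, and occasionally re-hashing the whole table prunes the dead wood, as noted above.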
Comparisons and summary

- Why use hashing?
  - constant-time search and insert, on average
  - easy to implement
- Why not use hashing?
  - no performance guarantees
  - too much arithmetic on long keys – a high constant factor
  - uses extra space
  - doesn't support pred, succ, sort, etc. – no notion of order
- Separate chaining advantages:
  - idiot-proof (degrades gracefully)
  - no large chunks of memory needed (but is this better?)
- Where did perl "hashes" get their name?
Hashing Summary

- Separate chaining: easiest to deploy
- Linear probing: fastest (but takes more memory)
- Double hashing: least memory (but takes more time to compute the second hash function)
- Dynamic (grow): handles any number of inserts
- Curious use of hashing: the early unix spell checker (back in the days of the 3M machines…)

Construction:

  N      RB    Chain  Probe  Dbl   Grow
  5k     6     1      4      4     3
  50k    74    18     11     12    22
  100k   182   35     21     23    47
  190k   407   79     106    59    155
  200k   -     84     -      -     159

Search miss:

  N      RB    Chain  Probe  Dbl   Grow
  5k     2     1      0      1     0
  50k    36    15     8      8     8
  100k   84    45     23     21    15
  190k   186   144    2194   261   30
  200k   -     156    -      -     33