Transcript: Hashing

Hashing
Amihood Amir
Bar Ilan University
2014
Direct Addressing
In the old days:
LD 1,1
LD 2,2
AD 1,2
ST 1,3
Today:
C <- A+B
Direct Addressing
Compiler keeps track of:
C <- A+B
A in location 1
B in location 2
C in location 3
How?
Tables? Linked lists? Skip lists? AVL
trees? B-trees?
Encode?
Consider the ASCII (American Standard Code for Information Interchange) code of the alphabet:
Encode
In ASCII code:
A to Z are 1000001 to 1011010
In decimal:
65 to 90.
So if we subtract 64 from the ASCII code, we get locations 1 to 26.
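A minimal Python sketch of this encoding (illustrative; the function name is an assumption):

def location(letter):
    return ord(letter) - 64   # ASCII code minus 64

print(location('A'))          # 1
print(location('Z'))          # 26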
Map
In general:
Consider a function h: U → {0,…,m},
where U is the universe of keys and m is the number of keys.
When we want to access the record of key k ∈ U, just go to location h(k); it will point to the location of the record.
Problems:
1. What happens if keys are not
consecutive?
E.g. parameters:
pointer, variable, idnumber
2. What happens if not all keys are used?
E.g. ID numbers of students:
211101793, 31264992, 001684421
3. What happens if h(k) is not bijective?
Hash Tables:
If U is larger than m, h cannot be bijective, so we may encounter collisions: k1 ≠ k2 where h(k1) = h(k2).
What do we do then?
Solution 1:
chaining.
Chaining
Chaining puts elements that hash to the
same slot in a linked list:
[Figure: the universe U of keys, the set K ⊆ U of actual keys k1,…,k8, and a hash table T in which keys that hash to the same slot are chained together in a linked list.]
Hash operations:
CHAINED-HASH-INSERT(T,x)
Insert x at end of list T[h(key[x])]
CHAINED-HASH-SEARCH(T,k)
Search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T,x)
Delete x from list T[h(key[x])]
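A minimal sketch of these operations in Python (the class shape and the use of built-in lists for the chains are illustrative assumptions; inserting at the end of the list matches the analysis below):

class ChainedHashTable:
    def __init__(self, m, h):
        self.m = m                         # number of slots
        self.h = h                         # hash function into {0,...,m-1}
        self.T = [[] for _ in range(m)]    # one (initially empty) chain per slot

    def insert(self, key):                 # CHAINED-HASH-INSERT
        self.T[self.h(key)].append(key)    # insert at end of list T[h(key)]

    def search(self, key):                 # CHAINED-HASH-SEARCH
        return key in self.T[self.h(key)]  # scan the chain of slot h(key)

    def delete(self, key):                 # CHAINED-HASH-DELETE
        self.T[self.h(key)].remove(key)    # remove key from its chain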
Time:
Depends on chains.
We assume: simple uniform hashing.
h hashes any given key into each slot with equal likelihood, independently of the other keys.
Load Factor:
We will analyze average search time.
This depends on how large the hash table is relative to the number of keys.
Let n = number of keys, m = hash table size.
n/m is the load factor.
Write n/m = α.
Average Complexity of Search
Notation: the length of list T[j] is n_j, and Σ_j n_j = n.
The average value of n_j is E(n_j) = n/m = α.
Assumptions:
h(k) computed in constant time.
h(k) simple uniform hash function.
Searching for new key
Compute h(k) in time O(1).
Get to list T[h(k)] (uniformly distributed), whose length is n_j.
Need to search list T[h(k)].
Time for search: n_j.
E(n_j) = α.
Conclude: Average time to search for a new key is Θ(1+α).
What is worst case time?
Searching for Existing Key
Do we gain anything by searching for a key that is already in the list?
Note:
The expected number of elements examined in a successful search for k =
the expected length of list T[h(k)] when k was inserted, plus 1
(since we insert elements at the end).
Searching for Existing Key
Let k_1, …, k_n be the keys in order of their insertion.
When k_i is inserted, the expected list length is (i−1)/m.
So the expected length of a successful search is the average over all inserted keys:
(1/n) Σ_{i=1}^{n} ( 1 + (i−1)/m )
Calculate:
(1/n) Σ_{i=1}^{n} ( 1 + (i−1)/m )
= 1 + (1/(nm)) Σ_{i=1}^{n} (i−1)
= 1 + (1/(nm)) · ( (n−1)n/2 )
= 1 + (n−1)/(2m)
= 1 + n/(2m) − 1/(2m)
= 1 + α/2 − α/(2n)
Conclude:
Average time for searching in a hash table
with chaining is Θ(1+α)
Note:
Generally m is chosen as Θ(n), so α is a constant.
Thus the average operation on a hash table with chaining takes constant time.
Choosing a Hash Function
Assume keys are natural numbers:
{0, 1, 2, … }
A natural mapping of natural numbers to
{0, 1, … , m}
is by dividing by m+1 and taking the
remainder, i.e.
h(k) = k mod (m+1)
The Division Method
What are good m’s?
In ASCII example:
A to Z are 1000001 to 1011010
In decimal:
65 to 90.
We subtracted 64 from the ASCII code and got locations 1 to 26.
If we choose m=32 we achieve the same
result.
What does it Mean?
32 = 2^5.
Dividing by 32 and taking the remainder means taking the 5 LSBs of the binary representation.
A is 1000001, whose 5 LSBs are 00001 = 1.
Z is 1011010, whose 5 LSBs are 11010 = 26.
Is this good? – Here, yes. In general?
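A quick Python check (illustrative) that taking the remainder mod 32 is exactly keeping the 5 least significant bits:

for letter in "AZ":
    k = ord(letter)                  # ASCII code: A = 65, Z = 90
    assert k % 32 == k & 0b11111     # remainder mod 2^5 == 5 LSBs
    print(letter, k % 32)            # A -> 1, Z -> 26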
Want Even Distribution
We want even distribution of the bits.
If m = 2^b then only the b least significant bits participate in the hashing. We would like all bits to participate.
Solution:
Use for m a prime number not too close
to a power of 2.
Always a good idea: check the distribution of real application data.
The Multiplication Method
Choose:
0<σ<1
h(k) = m (k  k 
Meaning:
1. multiply key by σ and take fraction.
2. then multiply by m and take floor.
Multiplication Method Example
Choose:
σ = 0.618 m=32
(Knuth recommends (√5 -1)/2 )
k=2391
h(k) = m (k  k 
2391 x 0.618 = 1477.638
1477.638-1477 = 0.638
0.638 x 32 = 20.416
So
h(2391)=20.
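The same computation as a small Python sketch (the function name and defaults are illustrative assumptions):

import math

def mult_hash(k, m=32, sigma=0.618):
    frac = (k * sigma) % 1.0          # multiply key by sigma, keep the fraction
    return math.floor(m * frac)       # multiply by m, take the floor

print(mult_hash(2391))                # 20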
Why does this involve all bits?
Assume: k uses w bits, which is a computer
word.
Word operations take constant time.
Assume: m = 2^p.
Easy implementation of h(k):
[Figure: multiply the w-bit key k by σ·2^w; dropping the high w bits of the 2w-bit product divides by 2^w, i.e. keeps the fraction of kσ, and the p most significant bits of the remaining low word form h(k).]
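A sketch of this word-arithmetic implementation, with w = 32 and p = 5 as illustrative assumptions (Python masks simulate w-bit words):

w, p = 32, 5
A = int(0.618 * 2**w)                 # sigma * 2^w, stored as a w-bit integer

def mult_hash_word(k):
    prod = (k * A) & (2**w - 1)       # low w bits of the product = fraction of k*sigma
    return prod >> (w - p)            # its p most significant bits form h(k)

print(mult_hash_word(2391))           # 20, matching the example above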
Worst Case
Problem: A malicious adversary can make sure all keys hash to the same entry.
Usual solution: randomize.
But then how do we get to the correct hash table entry upon search?
Universal Hashing
We want to make sure that the keys k1, k2, …, kn do not always hash to the same table entry.
How can we make sure of that?
Use a number of different hash
functions and employ them at random.
Universal Hashing
A collection H of hash functions from universe U to {0,…,m} is universal if for every pair of keys x, y ∈ U, where x ≠ y, the number of hash functions h ∈ H for which h(x) = h(y) is |H|/m.
This means: if we choose a function h ∈ H at random, the chance of a collision between x and y, where x ≠ y, is 1/m.
Constructing a Universal
Collection of Hash Functions
Choose m to be prime.
Decompose a key k into r+1 pieces of log m bits each, so that each piece has value < m:
k = [k0, k1, …, kr]
Universal Hashing Construction
There are m^(r+1) sequences of length r+1 of elements from {0,…,m−1}.
Each sequence is a = [a0, …, ar], a_i ∈ {0,…,m−1}.
Each sequence defines a hash function
h_a(k) = ( Σ_{i=0}^{r} a_i·k_i ) mod m
The universal class is H = ∪_a {h_a}; we choose one h_a from it at random.
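A minimal sketch of this construction in Python (the prime m = 13, r = 3, and the assumption that the key is already split into pieces are illustrative):

import random

m, r = 13, 3                                      # prime table size, keys split into r+1 pieces

def random_ha():
    # one random sequence a = [a0, ..., ar], each ai in {0, ..., m-1}
    a = [random.randrange(m) for _ in range(r + 1)]
    def ha(k):                                    # k = [k0, ..., kr], each piece < m
        return sum(ai * ki for ai, ki in zip(a, k)) % m
    return ha

h = random_ha()                                   # choose h in H at random
print(h([3, 7, 0, 12]))                           # some slot in {0, ..., 12}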
This is a Universal Class
Need to show that the probability that two keys x ≠ y collide is 1/m.
Because x ≠ y, there is some i for which x_i ≠ y_i. Wlog assume i = 0.
Since m is prime, for any fixed [a1,…,ar] there is exactly one value of a0 that satisfies h(x) = h(y).
Why?
Proof Continuation…
h(x) − h(y) ≡ 0 (mod m) means:
a_0(x_0 − y_0) ≡ − Σ_{i=1}^{r} a_i(x_i − y_i) (mod m)
But x_0 − y_0 is non-zero, and for prime m every non-zero element has a unique multiplicative inverse modulo m.
Thus there is exactly one value in {0,…,m−1} that a_0 can take, namely
a_0 = (x_0 − y_0)^(−1) · ( − Σ_{i=1}^{r} a_i(x_i − y_i) ) mod m.
End of Proof
Conclude: Each pair x ≠ y collides for exactly one value of a_0 per choice of [a1,…,ar], and there are m^r possible values for [a1,…,ar].
Since there are m^(r+1) different hash functions in H, the keys x and y collide with probability
m^r / m^(r+1) = 1/m.
Open Addressing
Solution 2 to the collision problem.
We don’t want to use chaining (to save the space and complexity of pointers).
Where should we put the colliding key?
Inside the hash table, in an empty slot.
Problem: Which empty slot?
-- We assume a probing hashing function.
Idea behind Probing
Hashing function of the form:
h : U × {0,…,m} → {0,…,m}
The initial hashing function: h(k,0).
h(k,i) for i=1,…,m,
gives subsequent values. We try locations
h(k,0), h(k,1), etc. until an open slot is
found in the hash table.
If none is found, we report an overflow error.
Average Complexity of Probing
Assume uniform hashing, i.e., the probe sequence for key k is equally likely to be any permutation of {0,…,m}.
What is the expected number of probes
with load factor α?
Note: since this is open-address hashing, α = n/m < 1.
Probing Complexity – Case I
Case I : key not in hash table.
Key not in table ⇒ every probe, except for the last, is to an occupied slot.
Let
p_i = Pr{exactly i probes access occupied slots}.
Then the expected number of probes is
1 + Σ_{i=0}^{∞} i·p_i
Probing Complexity – Case I
Note: For i > n, p_i = 0.
Claim:
Σ_{i=0}^{∞} i·p_i = Σ_{i=1}^{∞} q_i ,
where
q_i = Pr{at least i probes access occupied slots}
Why?
Probing Complexity – Case I
Σ_{i=0}^{∞} i·p_i = Σ_{i=0}^{∞} i·(q_i − q_{i+1})
= (q_1 − q_2) + (2q_2 − 2q_3) + (3q_3 − 3q_4) + (4q_4 − 4q_5) + …
= q_1 + q_2 + q_3 + q_4 + q_5 + …
= Σ_{i=1}^{∞} q_i
Probing Complexity – Case I
What is q_i?
q_1 = n/m: there are n elements in the table, so the probability that the first probed slot is occupied is n/m.
q_2 = (n/m)·((n−1)/(m−1))
In general:
q_i = (n/m)·((n−1)/(m−1))⋯((n−i+1)/(m−i+1)) ≤ (n/m)^i = α^i
Probing Complexity – Case I
Conclude: The expected complexity of inserting an element into the hash table is:
1 + Σ_{i=0}^{∞} i·p_i = 1 + Σ_{i=1}^{∞} q_i ≤ 1 + Σ_{i=1}^{∞} α^i = 1/(1−α)
Example: If the hash table is half full, the
expected number of probes is 2. If it is 90%
full, the expected number is 10.
Probing Complexity – Case II
Case II : key in hash table.
As in the chaining case. Expected search
time for key in table = Expected
insertion time.
We have shown:
Expected time to insert key i+1 is:
1/(1 − i/m) = m/(m−i)
Probing Complexity – Case II
Conclude:
Expected key insertion time:
(1/n) Σ_{i=0}^{n−1} m/(m−i) = (1/α) Σ_{i=0}^{n−1} 1/(m−i)
Need to compute
Σ_{i=0}^{n−1} 1/(m−i)
Probing Complexity – Case II
n 1
m
1
1
 

i 0 m  i
i  m  n 1 i
In General:
b
What is
1

ia i
?
Probing Complexity – Case II
Consider any monotonic function f(x) > 0.
Approximate Σ_{i=a}^{b} f(i)
by the area under the function.
Probing Complexity – Case II
Consider any monotonic function f(x) > 0. For a decreasing f:
∫_{a}^{b+1} f(x) dx ≤ Σ_{i=a}^{b} f(i) ≤ ∫_{a−1}^{b} f(x) dx
[Figure: f(x) with unit-width rectangles at a−1, a, …, b, b+1, showing the sum squeezed between the two integrals.]
Probing Complexity – Case II
Conclude:
Approximate Σ_{i=a}^{b} f(i) by ∫_{a−1}^{b} f(x) dx.
Probing Complexity – Case II
Conclude:
Approximate Σ_{i=m−n+1}^{m} 1/i by
∫_{m−n}^{m} (1/x) dx = ln(m) − ln(m−n) = ln( m/(m−n) ) = ln( 1/(1−α) )
Probing Complexity – Case II
Conclude:
Expected key insertion time:
(1/α) Σ_{i=0}^{n−1} 1/(m−i) ≤ (1/α) ln( 1/(1−α) )
Example: If the hash table is half full, the expected number of probes is 2·ln 2 ≈ 1.39. If it is 90% full, the expected number is about 2.6.
Types of Probing
What Functions can be used for probing?
1. Linear
2. Quadratic
3. Double Hashing
Linear Probing
j <- h(k)
If T[j] is occupied, repeat until an empty slot is found:
j <- (j+1) mod m
End
T[j] <- k
Attention:
1. Wrap around when the end of the table is reached.
2. If the table is full, report an overflow error.
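A minimal sketch of linear-probing insert and search in Python (None marks an empty slot; the function names are illustrative):

def insert(T, k, h):
    m, j = len(T), h(k)
    for _ in range(m):            # at most m probes
        if T[j] is None:
            T[j] = k              # found an empty slot
            return j
        j = (j + 1) % m           # wrap around at end of table
    raise OverflowError("hash table full")

def search(T, k, h):
    m, j = len(T), h(k)
    for _ in range(m):
        if T[j] is None:          # empty slot: k cannot be in the table
            return None
        if T[j] == k:
            return j
        j = (j + 1) % m
    return None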
Linear Probing - Discussion
Pro: Easy to implement.
Con: Clustering.
An empty slot immediately following a cluster of length i has probability (i+1)/m of being filled next, so long clusters tend to grow longer.
Idea behind Probing
Hashing function of the form:
h : U × {0,…,m} → {0,…,m}
The initial hashing function h’(k) gives the first value; if that slot is taken, we try h(k,i) for i = 1,…,m.
In linear probing we have:
h(k,i) = (h’(k)+i) mod (m+1)
For Uniformity:
There are m! different probe paths of length m; ideally we should be able to generate them all.
Linear probing generates only m paths, so it is clearly not uniform.
The more different paths generated, the better.
Quadratic Probing
h(k,i) = (h’(k) + c1·i + c2·i^2) mod m
Works better but:
1. Still only m different paths.
2. Note that if h’(x)=h’(y) then
h(x,i)=h(y,i) for all i.
This causes what is called secondary
clustering.
Double Hashing
h(k,i)=(h1(k)+ih2(k))mod m
To get a permutation, h2(k) must be relatively prime to m.
Possibility:
h1(k)=k mod m
h2(k)=1+(k mod m’)
Either m is a power of 2 and m’ is odd, or m and m’ are both prime with distance 2 between them.
Double Hashing Example
h(k,i)=(h1(k)+ih2(k))mod m
h1(k)=k mod 13
h2(k)=1+(k mod 11)
Probing sequence of 14: h2(14) = 1 + (14 mod 11) = 4
14 mod 13 = 1
(1+4) mod 13 = 5
(1+8) mod 13 = 9
…
Probing sequence of 27: h2(27) = 1 + (27 mod 11) = 6
27 mod 13 = 1
(1+6) mod 13 = 7
(1+12) mod 13 = 0
…
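A minimal Python sketch that reproduces both sequences (the function name and the cut-off at three probes are illustrative):

def probe_seq(k, m=13, m2=11, length=3):
    h1 = k % m
    h2 = 1 + (k % m2)
    return [(h1 + i * h2) % m for i in range(length)]

print(probe_seq(14))   # [1, 5, 9]
print(probe_seq(27))   # [1, 7, 0]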
Double Hashing Example
Advantage:
m^2 different sequences generated.
Deletions
Problem: If a deleted element’s slot is simply emptied, a later search may stop at that slot and wrongly conclude that a key is not in the table!
Solution: When an element is deleted, mark its slot with a flag instead of emptying it.
Meaning: Searches can become long if there are many deletions, since flagged slots must still be probed past.
Therefore: In a very dynamic setting, use chaining.
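A minimal sketch of the flag idea on top of the linear-probing code above (the DELETED sentinel is an illustrative assumption):

DELETED = object()                # flag marking a deleted slot

def delete(T, k, h):
    m, j = len(T), h(k)
    for _ in range(m):
        if T[j] is None:          # true empty slot: k is not in the table
            return
        if T[j] == k:
            T[j] = DELETED        # flag the slot instead of emptying it
            return
        j = (j + 1) % m

# Search must keep probing past DELETED slots (they are not true
# empties), which is why many deletions make searches long;
# insertion, however, may reuse a DELETED slot.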