Universal Hashing



• Worst-case analysis
• Probabilistic analysis: needs knowledge of the distribution of the inputs
Indicator random variables

Given a sample space S and an event A, the indicator
random variable I{A} associated with event A is defined as:

    I{A} = 1 if A occurs,
           0 otherwise.

E.g.:
Consider flipping a fair coin:
• Sample space S = { H, T }
• Define random variable Y with Pr{ Y=H } = Pr{ Y=T } = 1/2
• We can define an indicator r.v. XH associated with the
  coin coming up heads, i.e. Y=H:

    XH = I{ Y=H } = 1 if Y = H,
                    0 if Y = T.

    E[XH] = E[ I{ Y=H } ]
          = 1·Pr{ Y=H } + 0·Pr{ Y=T }
          = Pr{ Y=H }
          = 1/2
Lemma:
Given a sample space S and an event A in the
sample space S, let XA = I{A}.
Then E[XA] = Pr{A}.
Proof:
E[XA] = E[ I{A} ] = 1·Pr{A} + 0·Pr{Ā} = Pr{A}.
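The lemma is easy to sanity-check by simulation. The sketch below is an illustration of my own (the function name and trial count are arbitrary choices): it flips a fair coin many times and averages the indicator XH, which should come out close to Pr{ Y=H } = 1/2.

```python
import random

# Estimate E[X_H] for the fair-coin indicator variable by simulation and
# compare it with Pr{Y = H} = 1/2.
def estimate_indicator_expectation(trials=100_000):
    hits = 0
    for _ in range(trials):
        y = random.choice(["H", "T"])   # fair coin: Pr{H} = Pr{T} = 1/2
        x_h = 1 if y == "H" else 0      # indicator r.v. X_H = I{Y = H}
        hits += x_h
    return hits / trials                # empirical estimate of E[X_H]

print(estimate_indicator_expectation())  # should be close to 0.5
```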




H = { h : U → {0,…,m-1} } is a finite collection
of hash functions.
H is called “universal” if for each pair of distinct
keys k, ℓ ∈ U, the number of hash functions h ∈ H
for which h(k) = h(ℓ) is at most |H|/m.
Define ni = the length of list T[i], and let α = n/m denote the load factor.
Theorem:
Suppose h is selected uniformly at random from a universal family H,
and collisions are resolved by chaining. If k is not in the table,
then E[nh(k)] ≤ α. If k is in the table, then E[nh(k)] ≤ 1 + α.

Proof:
For each pair k and ℓ of distinct keys,
define Xkℓ = I{ h(k) = h(ℓ) }.
By definition, Prh{ h(k) = h(ℓ) } ≤ 1/m, and so
E[Xkℓ] ≤ 1/m.
Define Yk to be the number of keys other than k
that hash to the same slot as k, so that

    Yk = Σ{ℓ∈T, ℓ≠k} Xkℓ

    E[Yk] = Σ{ℓ∈T, ℓ≠k} E[Xkℓ] ≤ Σ{ℓ∈T, ℓ≠k} 1/m

If k ∉ T, then nh(k) = Yk and |{ ℓ : ℓ ∈ T, ℓ ≠ k }| = n,
thus E[nh(k)] = E[Yk] ≤ n/m = α.

If k ∈ T, then because k appears in T[h(k)] and the
count Yk does not include k, we have nh(k) = Yk + 1
and |{ ℓ : ℓ ∈ T, ℓ ≠ k }| = n - 1.
Thus E[nh(k)] = E[Yk] + 1 ≤ (n-1)/m + 1 = 1 + α - 1/m < 1 + α.

Corollary: Using universal hashing and collision
resolution by chaining in an initially empty table
with m slots, it takes expected time Θ(n) to handle
any sequence of n Insert, Search and Delete
operations containing O(m) Insert operations.
Proof: Since the number of Insert operations is O(m), the number of
keys stored is O(m) and the load factor is α = O(1). By the theorem,
each Search takes expected O(1) time, and each Insert and Delete
takes O(1) time. Thus the expected time for the whole sequence is
O(n), and since there are n operations it is Θ(n).
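As an illustration of chaining with a randomly chosen hash function, here is a minimal sketch (class and variable names are my own; it draws its hash function from the concrete family Hp,m constructed next). It inserts n keys into m slots and reports the average chain length seen by the stored keys, which the theorem bounds by 1 + α in expectation.

```python
import random

# A chained hash table whose hash function is supplied as a parameter, so it
# can be drawn at random from a universal family.
class ChainedHashTable:
    def __init__(self, m, hash_fn):
        self.m = m
        self.hash_fn = hash_fn
        self.slots = [[] for _ in range(m)]   # T[0..m-1], one list per slot

    def insert(self, k):
        self.slots[self.hash_fn(k)].append(k)

    def search(self, k):
        return k in self.slots[self.hash_fn(k)]

    def delete(self, k):
        self.slots[self.hash_fn(k)].remove(k)

# Demo: draw h_{a,b}(k) = ((a*k + b) mod p) mod m at random with a != 0.
p = 2**31 - 1                        # a prime larger than any key used below
m = 128
a = random.randrange(1, p)
b = random.randrange(0, p)
table = ChainedHashTable(m, lambda k: ((a * k + b) % p) % m)

keys = random.sample(range(10**6), 100)          # n = 100 keys, alpha = n/m
for k in keys:
    table.insert(k)

alpha = len(keys) / m
avg_seen = sum(len(table.slots[table.hash_fn(k)]) for k in keys) / len(keys)
print("alpha =", alpha, " avg chain length at stored keys =", avg_seen)
```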
Designing a universal class of hash functions:
Let p be a prime large enough that every key k lies in Zp, where

    Zp = { 0, 1, …, p-1 },   Zp* = { 1, 2, …, p-1 }

For any a ∈ Zp* and b ∈ Zp, define ha,b : Zp → Zm by

    ha,b(k) = ((a·k + b) mod p) mod m

    Hp,m = { ha,b : a ∈ Zp* and b ∈ Zp }
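A direct implementation of this family is short. The sketch below is illustrative (the prime 2^31 − 1 is just one convenient choice, assumed larger than every key): it draws (a, b) at random and returns the corresponding ha,b.

```python
import random

P = 2**31 - 1   # a prime, assumed larger than every key in the universe

def random_hash_function(m, p=P):
    """Pick h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from H_{p,m}."""
    a = random.randrange(1, p)      # a drawn from Z_p^* = {1, ..., p-1}
    b = random.randrange(0, p)      # b drawn from Z_p  = {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

# Usage: draw one hash function and hash a few keys into m = 10 slots.
h = random_hash_function(m=10)
print([h(k) for k in (0, 1, 2, 12345, 999_999)])
```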
Theorem:
Hp,m is universal.
Pf: Let k, ℓ be two distinct keys in Zp.
For a given ha,b, let r = (ak + b) mod p and
s = (aℓ + b) mod p.
Then r - s ≡ a(k - ℓ) (mod p).

Since p is prime and both a and k - ℓ are nonzero modulo p, the
product a(k - ℓ) is nonzero modulo p. Hence, for any ha,b ∈ Hp,m,
distinct inputs k and ℓ map to distinct values r and s modulo p.

Moreover, each of the p(p-1) possible choices of the pair (a,b)
with a ≠ 0 yields a different resulting pair (r,s) with r ≠ s,
since we can solve for a and b given r and s:

    a = ((r - s)·((k - ℓ)^(-1) mod p)) mod p
    b = (r - a·k) mod p

Since there are exactly p(p-1) possible pairs (r,s) with r ≠ s,
there is a one-to-one correspondence between pairs (a,b) with
a ≠ 0 and pairs (r,s) with r ≠ s.

Therefore, for any given pair of distinct inputs k and ℓ, if we
pick (a,b) uniformly at random from Zp* × Zp, the resulting pair
(r,s) is equally likely to be any pair of distinct values modulo p.


Thus,

    Pr[ k and ℓ collide ] = Prr,s[ r ≡ s (mod m) ]

Given r, the number of values s such that s ≠ r and s ≡ r (mod m)
is at most

    ⌈p/m⌉ - 1 ≤ ((p + m - 1)/m) - 1 = (p - 1)/m,

since the values in Zp congruent to r modulo m are
r mod m, (r mod m) + m, (r mod m) + 2m, …, of which there are at
most ⌈p/m⌉, and one of them is r itself.
Thus,

    Prr,s[ r ≡ s (mod m) ] ≤ ((p - 1)/m) / (p - 1) = 1/m.

Therefore, for any pair of distinct k, ℓ ∈ Zp,

    Pr[ ha,b(k) = ha,b(ℓ) ] ≤ 1/m,

so that Hp,m is universal.
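For a small prime the bound can be checked exhaustively. The sketch below (my own illustration; p, m, and the two keys are arbitrary choices) counts, over all p(p−1) functions in Hp,m, how often a fixed pair of distinct keys collides, and compares the fraction with 1/m.

```python
# Exhaustive check of the universality bound for one pair of distinct keys.
p, m = 101, 10          # small prime p so all p*(p-1) pairs (a, b) can be tried
k, l = 3, 58            # any two distinct keys in Z_p

collisions = sum(
    1
    for a in range(1, p)        # a in Z_p^*
    for b in range(p)           # b in Z_p
    if ((a * k + b) % p) % m == ((a * l + b) % p) % m
)
print("collision fraction:", collisions / (p * (p - 1)), " bound 1/m =", 1 / m)
```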

Perfect Hashing


Perfect hashing is suitable when the keys are static, i.e., once
stored, the keys never change (e.g., data on a CD-ROM, or the set
of reserved words in a programming language). Perfect hashing uses
O(1) memory accesses for a search in the worst case.
Thm:
If we store n keys in a hash table of size
m = n² using a hash function h randomly
chosen from a universal class of hash
functions, then the probability of there being
any collisions is < 1/2.
Proof:
Let h be chosen from a universal family. Then each pair of
distinct keys collides with probability at most 1/m, and there
are C(n,2) pairs of keys.
Let X be a r.v. that counts the number of collisions. When m = n²,

    E[X] ≤ C(n,2) · 1/m = ((n² - n)/2) · (1/n²) < 1/2.

By Markov's inequality, Pr[X ≥ t] ≤ E[X]/t; taking t = 1 gives

    Pr[X ≥ 1] ≤ E[X] < 1/2.
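The theorem is easy to observe experimentally. The sketch below (illustrative; the prime, key range, and trial count are my own choices) draws h at random from Hp,m with m = n² and records how often the n keys land without any collision; the fraction should exceed 1/2.

```python
import random

P = 2**31 - 1   # prime, assumed larger than every key

def random_hash_function(m, p=P):
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def collision_free(keys, m):
    h = random_hash_function(m)
    return len({h(k) for k in keys}) == len(keys)   # no two keys share a slot

n = 20
keys = random.sample(range(10**6), n)
trials = 2_000
successes = sum(collision_free(keys, n * n) for _ in range(trials))
print("fraction of collision-free trials:", successes / trials)   # > 1/2
```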

Thm: If we store n keys in a hash table
of size m = n using a hash function h
randomly chosen from a universal class of
hash functions, then

    E[ Σ{j=0..m-1} nj² ] < 2n,

where nj is the number of keys hashing
to slot j.

Pf:
For any nonnegative integer a,

    a² = a + 2·C(a,2).

Therefore,

    E[ Σ{j=0..m-1} nj² ] = E[ Σ{j=0..m-1} ( nj + 2·C(nj,2) ) ]
                         = Σ{j=0..m-1} E[nj] + 2·E[ Σ{j=0..m-1} C(nj,2) ]
                         = E[n] + 2·E[ Σ{j=0..m-1} C(nj,2) ]
                         = n + 2·E[ Σ{j=0..m-1} C(nj,2) ],

where Σ{j=0..m-1} C(nj,2) is the total number of pairs of keys
that collide. By universality,

    E[ Σ{j=0..m-1} C(nj,2) ] ≤ C(n,2)·(1/m) = n(n-1)/(2m) = (n-1)/2,  since m = n.

Thus

    E[ Σ{j=0..m-1} nj² ] ≤ n + 2·(n-1)/2 = 2n - 1 < 2n.

Cor: If we store n keys in a hash table of
size m = n using a hash function h
randomly chosen from a universal class
of hash functions and set the size of
each secondary hash table to mj = nj² for
j = 0,…,m-1, then the expected amount
of storage required for all secondary
hash tables in a perfect hashing scheme
is < 2n.

Pf:

    E[ Σ{j=0..m-1} mj ] = E[ Σ{j=0..m-1} nj² ] < 2n.

Testing a few randomly chosen hash functions will quickly find
one that uses a small amount of storage.


Cor: Pr[ total storage for secondary hash tables ≥ 4n ] < 1/2.
Pf: By Markov's inequality, Pr[X ≥ t] ≤ E[X]/t.
Take X = Σ{j=0..m-1} mj and t = 4n:

    Pr[ Σ{j=0..m-1} mj ≥ 4n ] ≤ E[ Σ{j=0..m-1} mj ] / (4n)
                              < 2n / (4n)
                              = 1/2.
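Putting the pieces together, here is a sketch of the two-level perfect hashing construction described above, using the family Hp,m from earlier. It is my own illustration (the names, the prime, and the retry thresholds are assumptions consistent with the corollaries, not code from the notes): the first-level function is re-drawn until total secondary storage is below 4n, and each secondary function is re-drawn until its table of size nj² is collision-free.

```python
import random

P = 2**31 - 1   # prime, assumed larger than every key

def random_hash_function(m, p=P):
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect_table(keys):
    n = len(keys)
    # First level: retry until the total secondary storage sum_j n_j^2 < 4n
    # (each attempt succeeds with probability > 1/2 by the corollary above).
    while True:
        h = random_hash_function(n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) < 4 * n:
            break
    # Second level: slot j gets a table of size m_j = n_j^2, with its hash
    # function re-drawn until it is collision-free (probability > 1/2).
    secondary = []
    for bucket in buckets:
        mj = len(bucket) ** 2
        if mj == 0:
            secondary.append((None, []))
            continue
        while True:
            hj = random_hash_function(mj)
            table = [None] * mj
            ok = True
            for k in bucket:
                slot = hj(k)
                if table[slot] is not None:
                    ok = False
                    break
                table[slot] = k
            if ok:
                secondary.append((hj, table))
                break
    return h, secondary

def perfect_search(h, secondary, k):
    hj, table = secondary[h(k)]
    return hj is not None and table[hj(k)] == k

# Usage: build a perfect table for 100 static keys and look some of them up.
keys = random.sample(range(10**6), 100)
h, secondary = build_perfect_table(keys)
print(all(perfect_search(h, secondary, k) for k in keys))   # True: every key found
print(perfect_search(h, secondary, 10**6 + 7))              # False: key not stored
```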