Universal Hashing
Download
Report
Transcript Universal Hashing
Universal Hashing
Worst case analysis
Probabilistic analysis
Need the knowledge of the distribution of the inputs
Indicator random variables
Given a sample space S and an event A, the indicator
random variable I{A} associated with event A is defined
as:
1 if A occurs
I A
0 o/w
E.g.:
Consider flipping a fair coin:
• Sample space S = { H,T }
• Define random variable Y with Pr{ Y=H } =
Pr{ Y=T }=1/2
• We can define an indicator r.v. XH associated with the
coin coming up heads, i.e. Y=H
1 if Y H
X H I Y H
0 if Y T
E X H E I Y H
1 Pr Y H 0 Pr Y T
1
Pr Y H
2
Lemma :
Given a sample space S and an event A in the
sample space S, let X A I { A}
Then E X A Pr A
Proof :
E X A E I A 1 Pr A 0 Pr A
Pr A
H={ h: U→{0,…,m-1} }, which is a finite collection
of hash functions.
H is called “universal” if for each pair of distinct
keys k, ∈ U, the number of hash functions h∈H
for which h(k)=h( ) is at most |H|/m
Define ni = the length of list T[i]
Theorem:
Suppose h is randomly selected from H, using
chaining to resolve collisions. If k is not in the table,
then E[nh(k)] ≤ α. If k is in the table, then
E[nh(k)] ≤ 1+α
Proof:
For each pair k and of distinct keys,
define Xk =I{h(k)=h( )}.
By definition, Prh{h(k)=h( )} ≤ 1/m, and so
E[Xk ] ≤ 1/m.
Define Yk to be the number of keys other than k
that hash to the same slot as k, so that
Yk Xk
T
k
1
E[Yk ] E[ Xk ]
T
T m
k
k
If k T , then nh ( k ) Yk and |{ : T , k}| n
n
thus E[nh ( k )] E[Yk ]
m
If k∈T, then because k appears in T[h(k)] and the
count Yk does not include k, we have nh(k) = Yk + 1
and |{ : T , k}| n 1
n 1
1
Thus E[nh ( k )] E[Yk ] 1
1 1 1
m
m
Corollary: Using universal hashing and collision
resolution by chaining in an initially empty table
with m slots, it takes expected time Θ(n) to handle
any sequence of n Insert, Search and Delete
operations containing O(m) Insert operations.
Proof: Since n= O(m), the load factor is O(1). By the
Thm, each Search takes O(1) time. Each of Insert
and Delete takes O(1). Thus the expected time is
Θ(n).
Designing a universal class of hash functions:
p:prime
Z p 0,1,, p 1 Z p 1,2,, p 1
For any a Z p and b Z p , define ha,b : Zp→Zm
ha ,b (k ) (( ak b) mod p ) mod m
p,m ha,b : a Z *p and b Z p
Theorem:
Hp,m is universal.
Pf: Let k, be two distinct keys in Zp.
Given ha,b, Let r= (ak +b) mod p, and
s= (a +b) mod p.
Then r-s ≡ a(k- ) mod p
For any ha,b∈Hp,m, distinct inputs k and
map to distinct r and s modulo p.
Each possible p(p-1) choices for the pair
(a,b) with a≠0 yields a different resulting
pair (r,s) with r≠s, since we can solve for
a and b given r and s:
a=((r-s)((k- )-1 mod p)) mod p
b=(r-ak) mod p
There are p(p-1) possible pairs (r,s)
with r≠s, there is a 1-1
correspondence between pairs (a,b)
with a≠0 and (r,s), r≠s.
For any given pair of inputs k and , if
we pick (a,b) uniformly at random from
Z p Z p , the resulting pair (r,s) is
equally likely to be any pair of distinct
values modulo p.
Pr[ k and collide]=Prr,s[ r≡s mod m]
Given r, the number of s such that s≠r
and s≡r (mod m) is at most
⌈p/m⌉-1≤((p+m-1)/m)-1
=(p-1)/m
∵ s, s+m, s+2m,…., ≤p
Thus,
Prr,s[r≡s mod m] ≤((p-1)/m)/(p-1)
=1/m
Therefore, for any pair of distinct
k, ∈Zp,
Pr[ha,b(k)=ha,b( )] ≤1/m,
so that Hp,m is universal.
Perfect Hashing
Good for when the keys are static; i.e. , once
stored, the keys never change, e.g. CD-ROM,
the set of reserved word in programming
languages. A perfect hashing uses O(1)
memory accesses for a search.
Thm :
If we store n keys in a hash table of size
m=n2 using a hash function h randomly
chosen from a universal class of hash
functions, then the probability of there being
any collisions is < ½ .
Proof:
Let h be chosen from an universal family. Then
each pair collides with probability 1/m , and
there are n2 pairs of keys.
Let X be a r.v. that counts the number of
collisions. When m=n2,
n 1 n2 n 1 1
E[ X ]
2
2
n
2
2 m
By Markov ' s inequality, Pr[ X t ] E[ X ] / t ,
and take t 1.
Thm: If we store n keys in a hash table
of size m=n using a hash function h
randomly chosen from universal class of
m 1
2
hash functions, then E[ j 0 n j ] 2n ,
where nj is the number of keys hashing
to slot j.
Pf:
It is clear for any nonnegative integer a,
a
a a 2
2
2
nj
E[ n j ] E[ n j 2 ]
j 0
j 0
2
m 1
m 1 n
j
E[ n j ] 2 E[ ]
j 0
j 0 2
m 1
2
m 1
m 1 n
nj
j
E[n] 2 E[ ] n 2 E[ ]
j 0 2
j 0 2
m 1
total number of pairs of
keys that collide
n 1 n(n 1) n 1
, since m n.
2m
2
2 m
m 1
n 1
2
E[ n j ] n 2
2n 1 2n.
2
j 0
Cor: If store n keys in a hash table of
size m=n using a hash function h
randomly chosen from a universal class
of hash functions and set the size of
each secondary hash table to mj=nj2 for
j=0,…,m-1, then the expected amount
of storage required for all secondary
hash tables in a perfect hashing scheme
is < 2n.
Pf:
m 1
m 1
j 0
j 0
E[ m j ] E[ n j ] 2n.
2
Testing a few randomly chosen hash functions
will soon find one using small storage.
Cor: Pr[total storage for secondary hash
tables ] 4n] < 1/2
Pf: By Markov’s inequality, Pr[X t] E[X]/t.
m 1
Take X m j and t 4n :
j 0
m 1
m 1
E[ m j ]
j 0
4n
Pr{ m j 4n}
j 0
2n 1
.
4n 2