IA-Lec14 - MCST-CS


Hash Tables:
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
The worst-case running time for insertion is O(1). The insertion
procedure is fast in part because it assumes that the element x being
inserted is not already present in the table; this assumption can be
checked if necessary (at additional cost) by performing a search
before insertion. For searching, the worst-case running time is
proportional to the length of the list; we shall analyze this operation
more closely below. Deletion of an element x can be accomplished
in O(1) time if the lists are doubly linked. (Note that CHAINED-HASH-DELETE takes as input an element x and not its key k, so we
don't have to search for x first. If the lists were singly linked, it
would not be of great help to take as input the element x rather than
the key k. We would still have to find x in the list T[h(key[x])], so
that the next link of x's predecessor could be properly set to splice x
out. In this case, deletion and searching would have essentially the
same running time.)
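The three operations above can be sketched as follows (a minimal Python sketch; the class and method names are my own, and `hash(key) % m` stands in for a real hash function):

```python
class Node:
    """Doubly linked list node holding one element."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [None] * m   # T[i] is the head of a doubly linked list

    def _h(self, key):
        return hash(key) % self.m

    def insert(self, node):
        """O(1): splice node onto the front of its chain
        (assumes node.key is not already present)."""
        i = self._h(node.key)
        node.prev, node.next = None, self.table[i]
        if self.table[i] is not None:
            self.table[i].prev = node
        self.table[i] = node

    def search(self, key):
        """O(length of chain): walk the list T[h(key)]."""
        x = self.table[self._h(key)]
        while x is not None and x.key != key:
            x = x.next
        return x

    def delete(self, node):
        """O(1) given the node itself, thanks to the prev pointer."""
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.table[self._h(node.key)] = node.next
        if node.next is not None:
            node.next.prev = node.prev
```

Note that `delete` takes the node, not the key, exactly as in the discussion above: no search is needed because the doubly linked list lets us splice the node out directly.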
Analysis of hashing with chaining:
• Given a hash table with m slots and n keys, define the
load factor α = n/m: the average number of keys per
slot.
• Assume each key is equally likely to be hashed
into any slot: simple uniform hashing (SUH).
• Thm: In a hash table in which collisions are
resolved by chaining, an unsuccessful search takes
expected time Θ(1+α) under SUH.
Proof:
Under the assumption of SUH, any un-stored key is
equally likely to hash to any of the m slots.
The expected time to search unsuccessfully for a key
k is the expected time to search to the end of list
T[h(k)], whose expected length is exactly α.
Thus, including the O(1) cost of computing h(k), the total time required is Θ(1+α).
□
• Thm: In a hash table in which collisions are resolved by
chaining, a successful search takes time Θ(1+α), on the
average, under SUH.
Proof:
Let the element being searched for be equally likely to be any of
the n elements stored in the table.
The expected time to search successfully for an element x is
proportional to one more than the number of elements that appear
before x in x's list.
Because insertion places new elements at the front of the list,
elements before x in the list were inserted after x was inserted.
We want to find the expected number of elements added to
x's list after x was added to the list.
Let x_i denote the ith element inserted into the table, for i = 1 to n, and let
k_i = key[x_i].
Define X_ij = I{ h(k_i) = h(k_j) }. Under SUH, we have
Pr{ h(k_i) = h(k_j) } = 1/m = E[X_ij].
E[(1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} X_ij)]
  = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} E[X_ij])
  = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} 1/m)
  = 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
  = 1 + (1/(nm)) (n² − n(n+1)/2)
  = 1 + (n − 1)/(2m)
  = 1 + α/2 − α/(2n).

Thus, the expected total time for a successful search is
Θ(2 + α/2 − α/(2n)) = Θ(1 + α).
□
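The quantity 1 + α/2 − α/(2n) can be checked empirically. The sketch below simulates SUH by assigning slots uniformly at random and inserting at the head of each chain, as CHAINED-HASH-INSERT does (the function name and parameters are my own):

```python
import random

def avg_successful_search_cost(n, m, trials=100):
    """Empirically average the number of elements examined in a
    successful search under chaining with simulated SUH."""
    total = 0.0
    for _ in range(trials):
        slots = [random.randrange(m) for _ in range(n)]   # slot of key i
        chains = [[] for _ in range(m)]
        for i, s in enumerate(slots):
            chains[s].insert(0, i)   # insert at the head of the chain
        # cost of searching for key i = its 1-based position in its chain
        cost = sum(chains[s].index(i) + 1 for i, s in enumerate(slots))
        total += cost / n
    return total / trials

n, m = 1000, 500                       # load factor alpha = 2
alpha = n / m
predicted = 1 + alpha / 2 - alpha / (2 * n)
print(avg_successful_search_cost(n, m), predicted)
```

With these parameters the simulated average lands close to the predicted 1 + α/2 − α/(2n) ≈ 1.999.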
What makes a good hash function?
• Ideally, the hash function satisfies the assumption of simple
uniform hashing.
• In practice, it's not possible to satisfy this assumption, since we
don't know in advance the probability distribution that keys are
drawn from, and the keys may not be drawn independently.
• Often we use heuristics, based on the domain of the keys, to create a
hash function that performs well.
Keys as natural numbers
• Hash functions assume that the keys are natural numbers.
• When they're not, we have to interpret them as natural numbers.
• Example: Interpret a character string as an integer expressed in
some radix notation. Suppose the string is CLRS:
• ASCII values: C = 67, L = 76, R = 82, S = 83.
• There are 128 basic ASCII values.
• So interpret CLRS as (67 · 128³) + (76 · 128²) + (82 · 128¹) + (83 · 128⁰) =
141,764,947.
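The radix-128 interpretation can be computed with Horner's rule; a short sketch (the function name is my own):

```python
def string_to_radix128(s):
    """Interpret an ASCII string as an integer in radix-128 notation."""
    value = 0
    for ch in s:
        value = value * 128 + ord(ch)   # Horner's rule: shift by the radix, add digit
    return value

print(string_to_radix128("CLRS"))  # 141764947
```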
The multiplication method:
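The multiplication method computes h(k) = ⌊m · (kA mod 1)⌋ for a constant 0 < A < 1; Knuth suggests A = (√5 − 1)/2. An advantage is that the value of m is not critical and can be a power of 2. A minimal sketch (the function name is my own):

```python
import math

def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    """h(k) = floor(m * (k*A mod 1)); A = (sqrt(5)-1)/2 is Knuth's suggestion."""
    frac = (k * A) % 1.0          # fractional part of k*A, in [0, 1)
    return int(m * frac)          # scale by m and take the floor

# m need not be prime here; a power of 2 works fine
print([multiplication_hash(k, 16) for k in (123456, 123457, 123458)])
```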
Universal Hashing
• H = { h : U → {0,…,m−1} }, a finite collection of
hash functions.
• H is called "universal" if for each pair of distinct
keys k, ℓ ∈ U, the number of hash functions h ∈ H for
which h(k) = h(ℓ) is at most |H|/m.
• Define n_i = the length of list T[i].
• Thm:
Suppose h is randomly selected from H, and chaining is used
to resolve collisions. If k is not in the table, then E[n_h(k)]
≤ α. If k is in the table, then
E[n_h(k)] ≤ 1 + α.
• Proof:
– For each pair k and ℓ of distinct keys,
define X_kℓ = I{h(k) = h(ℓ)}.
– By definition, Pr_h{h(k) = h(ℓ)} ≤ 1/m, and so
E[X_kℓ] ≤ 1/m.
– Define Y_k to be the number of keys other than k
that hash to the same slot as k, so that
Y_k = Σ_{ℓ∈T, ℓ≠k} X_kℓ and
E[Y_k] = Σ_{ℓ∈T, ℓ≠k} E[X_kℓ] ≤ Σ_{ℓ∈T, ℓ≠k} 1/m.
– If k ∉ T, then n_h(k) = Y_k and |{ℓ : ℓ ∈ T, ℓ ≠ k}| = n;
thus E[n_h(k)] = E[Y_k] ≤ n/m = α.
– If k ∈ T, then because k appears in T[h(k)] and the count
Y_k does not include k, we have n_h(k) = Y_k + 1
and |{ℓ : ℓ ∈ T, ℓ ≠ k}| = n − 1.
Thus E[n_h(k)] = E[Y_k] + 1 ≤ (n − 1)/m + 1 < 1 + α.
□
Designing a universal class of hash functions:
p: a prime large enough that every key k lies in 0,…,p−1.
Z_p = {0, 1, …, p−1},  Z_p* = {1, 2, …, p−1}.
For any a ∈ Z_p* and b ∈ Z_p, define
h_a,b(k) = ((ak + b) mod p) mod m,  h_a,b : Z_p → Z_m.
H_p,m = { h_a,b : a ∈ Z_p* and b ∈ Z_p }.
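Drawing a random member of H_p,m takes only a few lines; a sketch (names are my own, and p = 101 is an arbitrary prime larger than the keys being hashed):

```python
import random

def make_universal_hash(p, m):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from H_{p,m},
    with a in Z_p^* (nonzero) and b in Z_p."""
    a = random.randrange(1, p)    # a != 0
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

p, m = 101, 10
h = make_universal_hash(p, m)
print([h(k) for k in (3, 7, 50)])
```

Drawing many independent h's and counting how often two fixed distinct keys collide should give a rate of at most about 1/m, matching the universality bound proved below.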
Theorem:
H_p,m is universal.
Pf: Let k, ℓ be two distinct keys in Z_p.
Given h_a,b, let r = (ak + b) mod p and
s = (aℓ + b) mod p.
Then r − s ≡ a(k − ℓ) (mod p).
Since p is prime and a ≢ 0 (mod p), we have r ≠ s: for any
h_a,b ∈ H_p,m, distinct inputs k and ℓ map
to distinct values r and s modulo p.
Each of the p(p−1) possible choices for the pair
(a, b) with a ≠ 0 yields a different resulting
pair (r, s) with r ≠ s, since we can solve for
a and b given r and s:
a = ((r − s)((k − ℓ)⁻¹ mod p)) mod p,
b = (r − ak) mod p.
• Since there are also exactly p(p−1) possible pairs (r, s) with
r ≠ s, there is a 1-1 correspondence
between pairs (a, b) with a ≠ 0 and
pairs (r, s) with r ≠ s.
• For any given pair of inputs k and ℓ, if we
pick (a, b) uniformly at random from
Z_p* × Z_p, the resulting pair (r, s) is equally
likely to be any pair of distinct values
modulo p.
• Pr[k and ℓ collide] = Pr_{r,s}[r ≡ s (mod m)].
• Given r, the number of values s such that s ≠ r and
s ≡ r (mod m) is at most
⌈p/m⌉ − 1 ≤ ((p + m − 1)/m) − 1 = (p − 1)/m,
since only the values r, r + m, r + 2m, … below p qualify.
• Thus,
Pr_{r,s}[r ≡ s (mod m)] ≤ ((p − 1)/m)/(p − 1) = 1/m.
Therefore, for any pair of distinct
k, ℓ ∈ Z_p,
Pr[h_a,b(k) = h_a,b(ℓ)] ≤ 1/m,
so H_p,m is universal. □
• Open addressing:
– There is no list and no element stored outside
the table.
– Advantage: avoid pointers, potentially yield
fewer collisions and faster retrieval.
– h : U × {0, 1, …, m−1} → {0, 1, …, m−1}
– For every k, the probe sequence
⟨h(k, 0), h(k, 1), …, h(k, m−1)⟩
is a permutation of ⟨0, 1, …, m−1⟩.
– Deletion from an open-address hash table is
difficult.
– Thus chaining is more common when keys
must be deleted.
• Linear probing:
– h : U  0,1, , m 1~ an ordinary hash
function (auxiliary hash function).
– h  k , i    h  k   i  .mod m
• Quadratic probing:
– h(k, i) = (h′(k) + c₁i + c₂i²) mod m, where h′ is an
auxiliary hash function, and c₁ and c₂ ≠ 0 are
constants.
• Double hashing:
– h  k , i    h1  k   ih2  k   mod m ,where h1 and
h2 are auxiliary hash functions.
–   m2  probe sequences; Linear and
Quadratic have   m  probe sequences.
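A double-hashing sketch with a prime table size, using h₁(k) = k mod m and h₂(k) = 1 + (k mod (m − 1)) so that h₂(k) is never 0 (the names and the particular auxiliary hash functions are my own choices):

```python
def probe_sequence(k, m):
    """Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m for i = 0..m-1.
    With m prime and 1 <= h2(k) < m, gcd(h2(k), m) = 1, so the
    sequence visits every slot exactly once (a full permutation)."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return [(h1 + i * h2) % m for i in range(m)]

def insert(table, k):
    """Probe until an empty (None) slot is found; return its index."""
    for slot in probe_sequence(k, len(table)):
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("hash table overflow")

m = 13                         # prime table size
table = [None] * m
for k in (18, 41, 22, 44, 59, 32, 31, 73):
    insert(table, k)
print(table)
```

Because different keys can get different step sizes h₂(k), two keys with the same starting slot usually diverge after the first probe, which is why double hashing generates Θ(m²) distinct probe sequences rather than the Θ(m) of linear or quadratic probing.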
• Analysis of open-addressing hashing
α = n/m : load factor,
with n elements and m slots.
• Thm:
Given an open-address hash table with load
factor α = n/m < 1, the expected number
of probes in an unsuccessful search is at
most 1/(1 − α), assuming uniform hashing.
• Pf:
– Define the r.v. X to be the number of probes
made in an unsuccessful search.
– Define A_i: the event that there is an ith probe and it
is to an occupied slot.
– Event {X ≥ i} = A₁ ∩ A₂ ∩ … ∩ A_{i−1}.
– Pr{X ≥ i} = Pr{A₁ ∩ A₂ ∩ … ∩ A_{i−1}}
  = Pr{A₁} · Pr{A₂ | A₁} · Pr{A₃ | A₁ ∩ A₂} ⋯ Pr{A_{i−1} | A₁ ∩ A₂ ∩ … ∩ A_{i−2}}.
– Pr{A₁} = n/m.
– The prob. that there is a jth probe and it is to an
occupied slot, given that the first j−1 probes
were to occupied slots, is (n − j + 1)/(m − j + 1). Why?
– ∵ n < m, (n − j)/(m − j) ≤ n/m for all 0 ≤ j < m.
– Pr[X ≥ i] = (n/m) · ((n − 1)/(m − 1)) ⋯ ((n − i + 2)/(m − i + 2))
  ≤ (n/m)^{i−1} = α^{i−1}.
– E[X] = Σ_{i=1}^{∞} Pr[X ≥ i] ≤ Σ_{i=1}^{∞} α^{i−1} = Σ_{i=0}^{∞} α^i = 1/(1 − α).
□
Cor: Inserting an element into an open-addressing
hash table with load factor α requires at most
1/(1 − α) probes on
average, assuming uniform hashing.
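The 1/(1 − α) bound can be simulated directly: under uniform hashing, an unsuccessful search effectively probes slots in a uniformly random order (without replacement) until it hits one of the m − n empty ones. A sketch (the function name and parameters are my own):

```python
import random

def unsuccessful_probes(m, n, trials=20000):
    """Average number of probes in an unsuccessful search, simulating
    uniform hashing: probe random distinct slots until one is empty."""
    total = 0
    for _ in range(trials):
        probes = 1
        # jth probe hits an occupied slot w.p. (n-j+1)/(m-j+1),
        # given the first j-1 probes were occupied
        while random.randrange(m - probes + 1) < n - probes + 1:
            probes += 1
        total += probes
    return total / trials

m, n = 100, 50                 # load factor alpha = 0.5
alpha = n / m
print(unsuccessful_probes(m, n), 1 / (1 - alpha))
```

For α = 0.5 the simulated average comes out just under the bound 1/(1 − 0.5) = 2, consistent with the exact expectation (m + 1)/(m − n + 1).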
Thm: Given an open-address hash table with
load factor α < 1, the expected number of
probes in a successful search is at most
(1/α) ln(1/(1 − α)), assuming uniform
hashing and that each key in the table is
equally likely to be searched for.
Pf: Suppose we search for a key k.
If k was the (i+1)st key inserted into the
hash table, the expected number of probes
made in a search for k is at most
1/(1-i/m)=m/(m-i).
• Averaging over all n keys in the hash table
gives us the average number of probes in a
successful search:
(1/n) Σ_{i=0}^{n−1} m/(m − i) = (m/n) Σ_{i=0}^{n−1} 1/(m − i) = (1/α)(H_m − H_{m−n})
  = (1/α) Σ_{k=m−n+1}^{m} 1/k
  ≤ (1/α) ∫_{m−n}^{m} dx/x
  = (1/α) ln(m/(m − n))
  = (1/α) ln(1/(1 − α)).
□
Perfect Hashing:
• Perfect hashing:
good when the keys are static; i.e., once stored,
the keys never change. Examples: a CD-ROM, the set of
reserved words in a programming language.
• Thm:
If we store n keys in a hash table of size m = n²
using a hash function h randomly chosen from a
universal class of hash functions, then the
probability of there being any collisions is < ½.
• Proof:
Let h be chosen from a universal family. Then each
pair of distinct keys collides with probability at most 1/m,
and there are C(n, 2) pairs of keys.
Let X be a r.v. that counts the number of collisions.
When m = n²,
E[X] ≤ C(n, 2) · (1/m) = ((n² − n)/2) · (1/n²) < 1/2.
By Markov's inequality, Pr[X ≥ t] ≤ E[X]/t;
taking t = 1 gives Pr[X ≥ 1] < 1/2.
□
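The theorem suggests a simple construction: keep drawing h_a,b from H_p,m with m = n² until no two keys collide; each draw succeeds with probability > 1/2, so very few tries are expected. A sketch (names are my own; p = 10007 is an arbitrary prime larger than the keys):

```python
import random

def find_perfect_hash(keys, p=10_007, max_tries=50):
    """Retry random h_{a,b} from H_{p,m} with m = n^2 until it is
    injective on `keys`; each try succeeds w.p. > 1/2 by the theorem."""
    n = len(keys)
    m = n * n
    for _ in range(max_tries):
        a = random.randrange(1, p)        # a in Z_p^*
        b = random.randrange(0, p)        # b in Z_p
        h = lambda k, a=a, b=b: ((a * k + b) % p) % m
        if len({h(k) for k in keys}) == n:   # no collisions: h is perfect
            return h, m
    raise RuntimeError("unexpectedly many tries")

keys = random.sample(range(10_007), 20)
h, m = find_perfect_hash(keys)
print(m, sorted(h(k) for k in keys))
```

The price is quadratic space: 20 keys need a table of 400 slots, which motivates the two-level scheme analyzed next.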
• Thm: If we store n keys in a hash table of
size m = n using a hash function h randomly
chosen from a universal class of hash
functions, then E[Σ_{j=0}^{m−1} n_j²] < 2n, where n_j is
the number of keys hashing to slot j.
• Pf:
– It is clear that for any nonnegative integer a,
a² = a + 2·C(a, 2).
– Thus
E[Σ_{j=0}^{m−1} n_j²] = E[Σ_{j=0}^{m−1} (n_j + 2·C(n_j, 2))]
  = Σ_{j=0}^{m−1} E[n_j] + 2·E[Σ_{j=0}^{m−1} C(n_j, 2)]
  = E[n] + 2·E[Σ_{j=0}^{m−1} C(n_j, 2)]
  = n + 2·E[total number of collisions].
– By universality,
E[total number of collisions] ≤ C(n, 2)·(1/m) = n(n − 1)/(2m) = (n − 1)/2, since m = n.
– Therefore
E[Σ_{j=0}^{m−1} n_j²] ≤ n + 2·(n − 1)/2 = 2n − 1 < 2n.
□
• Cor: If we store n keys in a hash table of size
m = n using a hash function h randomly
chosen from a universal class of hash
functions, and we set the size of each
secondary hash table to m_j = n_j² for j = 0, …, m−1,
then the expected amount of storage
required for all secondary hash tables in a
perfect hashing scheme is < 2n.
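The two-level (FKS) scheme implied by this corollary can be sketched as follows: a first-level table of m = n slots, and for each slot j a collision-free secondary table of size n_j², with each level rebuilt using fresh random hash functions until its bound holds (all names below are my own):

```python
import random

def universal_hash(p, m):
    """One random h_{a,b} from H_{p,m}."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_fks(keys, p=10_007):
    """Two-level perfect hashing: level 1 has m = n slots; slot j gets a
    secondary table of size n_j^2, rebuilt until it is collision-free."""
    n = len(keys)
    while True:                                   # retry level 1 until storage is small
        h1 = universal_hash(p, n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) < 4 * n:
            break                                 # holds w.p. > 1/2 per try
    secondary = []
    for b in buckets:
        if len(b) <= 1:                           # 0 or 1 keys: store directly
            secondary.append((None, b[0] if b else None))
            continue
        mj = len(b) ** 2
        while True:                               # retry until no collisions
            h2 = universal_hash(p, mj)
            slots = [None] * mj
            if all(slots[h2(k)] is None and not slots.__setitem__(h2(k), k)
                   for k in b):
                break
        secondary.append((h2, slots))
    return h1, secondary

def lookup(h1, secondary, k):
    """Membership test in worst-case O(1) probes."""
    h2, slots = secondary[h1(k)]
    if h2 is None:
        return slots == k
    return slots[h2(k)] == k

keys = random.sample(range(10_007), 50)
h1, sec = build_fks(keys)
print(all(lookup(h1, sec, k) for k in keys))
```

Every lookup touches exactly one first-level slot and at most one secondary slot, so searches take worst-case O(1) time, while the retry loops keep total storage below 4n with only an expected constant number of rebuilds.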
• Cor: Under the same assumptions,
Pr{total storage ≥ 4n} < 1/2.
• Pf:
– By Markov's inequality, Pr{X ≥ t} ≤ E[X]/t.
– Take X = Σ_{j=0}^{m−1} m_j and t = 4n:
Pr{Σ_{j=0}^{m−1} m_j ≥ 4n} ≤ E[Σ_{j=0}^{m−1} m_j] / (4n)
  < (2n)/(4n) = 1/2.
□