IA-Lec14 - MCST-CS
Hash Tables:
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
The worst-case running time for insertion is O(1). The insertion
procedure is fast in part because it assumes that the element x being
inserted is not already present in the table; this assumption can be
checked if necessary (at additional cost) by performing a search
before insertion. For searching, the worst-case running time is
proportional to the length of the list; we shall analyze this operation
more closely below. Deletion of an element x can be accomplished
in O(1) time if the lists are doubly linked. (Note that CHAINED-HASH-DELETE takes as input an element x and not its key k, so we
don't have to search for x first. If the lists were singly linked, it
would not be of great help to take as input the element x rather than
the key k. We would still have to find x in the list T[h(key[x])], so
that the next link of x's predecessor could be properly set to splice x
out. In this case, deletion and searching would have essentially the
same running time.)
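As an illustration of the discussion above, here is a minimal chained hash table sketch in Python. The class and method names are illustrative, not the textbook's pseudocode names; since Python lists are singly indexed rather than doubly linked, this delete scans the chain instead of achieving the O(1) doubly-linked deletion described above.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]  # one chain per slot

    def _h(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        # O(1): prepend at the head without checking for duplicates,
        # matching the assumption in the text.
        self.table[self._h(key)].insert(0, (key, value))

    def search(self, key):
        # Worst case proportional to the length of the chain.
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # With a chain stored as a plain list we must scan to find key,
        # so deletion and searching have essentially the same cost here.
        chain = self.table[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
```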
Analysis of hashing with chaining:
• Given a hash table with m slots and n keys, define the load factor α = n/m: the average number of keys per slot.
• Assume each key is equally likely to be hashed into any slot: simple uniform hashing (SUH).
• Thm: In a hash table in which collisions are resolved by chaining, an unsuccessful search takes expected time Θ(1 + α) under SUH.
Proof:
Under the assumption of SUH, any key not stored in the table is equally likely to hash to any of the m slots.
The expected time to search unsuccessfully for a key k is the expected time to search to the end of list T[h(k)], whose expected length is exactly α.
Thus, the expected number of elements examined is α, and the total time required (including the time to compute h(k)) is Θ(1 + α).
□
• Thm: In a hash table in which collisions are resolved by chaining, a successful search takes time Θ(1 + α) on the average, under SUH.
Proof:
Assume the element being searched for is equally likely to be any of the n elements stored in the table.
The expected time to search successfully for an element x is proportional to one more than the number of elements examined before reaching x in x's list. Because insertions are made at the head of the list, the elements before x in the list are exactly those inserted after x was inserted.
So we want the expected number of elements added to x's list after x was added to the list.
Let x_i denote the ith element inserted into the table, for i = 1 to n, and let k_i = key[x_i].
Define X_ij = I{ h(k_i) = h(k_j) }. Under SUH, we have Pr{ h(k_i) = h(k_j) } = 1/m = E[X_ij].
The expected number of elements examined in a successful search is then

E[ (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]
= (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} E[X_ij] )
= (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} 1/m )
= 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
= 1 + (1/(nm)) ( n² − n(n+1)/2 )
= 1 + (n − 1)/(2m)
= 1 + α/2 − α/(2n).

Thus, the total time required for a successful search is Θ(2 + α/2 − α/(2n)) = Θ(1 + α). □
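The quantity 1 + α/2 − α/(2n) is easy to sanity-check empirically. The sketch below (illustrative, not from the lecture) builds random chains with head insertion and averages each stored element's 1-based position in its chain, which is exactly the number of elements examined in a successful search for it.

```python
import random

def avg_successful_search_cost(n, m, trials=200):
    """Average number of elements examined over all successful
    searches, with uniformly random slot choices and head insertion."""
    total = 0.0
    for _ in range(trials):
        slots = [random.randrange(m) for _ in range(n)]  # h(k_i) for each key
        chains = [[] for _ in range(m)]
        for i, s in enumerate(slots):
            chains[s].insert(0, i)  # insert at the head of the chain
        # searching for element i examines its (1-based) position in its chain
        total += sum(chains[s].index(i) + 1 for i, s in enumerate(slots)) / n
    return total / trials

n, m = 1000, 500                                   # load factor alpha = 2
predicted = 1 + (n / m) / 2 - (n / m) / (2 * n)    # 1 + a/2 - a/(2n) = 1.999
```

Running `avg_successful_search_cost(1000, 500)` should land very close to `predicted`.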
What makes a good hash function?
• Ideally, the hash function satisfies the assumption of simple uniform hashing.
• In practice, it's not possible to satisfy this assumption, since we don't know in advance the probability distribution that keys are drawn from, and the keys may not be drawn independently.
• We often use heuristics, based on the domain of the keys, to create a hash function that performs well.
Keys as natural numbers
• Hash functions assume that the keys are natural numbers.
• When they're not, we have to interpret them as natural numbers.
• Example: Interpret a character string as an integer expressed in some radix notation. Suppose the string is CLRS:
• ASCII values: C = 67, L = 76, R = 82, S = 83.
• There are 128 basic ASCII values.
• So interpret CLRS as (67 · 128³) + (76 · 128²) + (82 · 128¹) + (83 · 128⁰) = 141,764,947.
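The radix interpretation above can be written directly (a hypothetical helper, shown only to mirror the arithmetic):

```python
def string_to_radix_int(s, radix=128):
    """Interpret a string as an integer in the given radix,
    using each character's ordinal (ASCII) value as a digit."""
    value = 0
    for ch in s:
        value = value * radix + ord(ch)  # Horner's rule
    return value
```

`string_to_radix_int("CLRS")` reproduces the 141,764,947 computed above.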
The multiplication method:
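The formula behind this method is h(k) = ⌊m · (kA mod 1)⌋ for a constant 0 < A < 1. A minimal sketch using Knuth's suggested A = (√5 − 1)/2 ≈ 0.618; note this uses a floating-point approximation, not the textbook's fixed-point word-level implementation:

```python
import math

def hash_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: multiply the key by A, keep the
    fractional part, and scale by the table size m."""
    return int(m * ((k * A) % 1.0))
```

An advantage of this method is that the value of m is not critical; it can be, say, a power of 2.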
Universal Hashing
• H = { h : U → {0,…,m−1} }, a finite collection of hash functions.
• H is called "universal" if for each pair of distinct keys k, ℓ ∈ U, the number of hash functions h ∈ H for which h(k) = h(ℓ) is at most |H|/m.
• Define n_i = the length of list T[i].
• Thm: Suppose h is randomly selected from a universal collection H, and chaining is used to resolve collisions. If k is not in the table, then E[n_h(k)] ≤ α. If k is in the table, then E[n_h(k)] ≤ 1 + α.
• Proof:
– For each pair k and ℓ of distinct keys, define X_kℓ = I{ h(k) = h(ℓ) }.
– By the definition of universality, Pr_h{ h(k) = h(ℓ) } ≤ 1/m, and so E[X_kℓ] ≤ 1/m.
– Define Y_k to be the number of keys other than k that hash to the same slot as k, so that
Y_k = Σ_{ℓ ∈ T, ℓ ≠ k} X_kℓ.
Then
E[Y_k] = Σ_{ℓ ∈ T, ℓ ≠ k} E[X_kℓ] ≤ Σ_{ℓ ∈ T, ℓ ≠ k} 1/m = |{ℓ : ℓ ∈ T, ℓ ≠ k}| / m.
– If k ∉ T, then n_h(k) = Y_k and |{ℓ : ℓ ∈ T, ℓ ≠ k}| = n;
thus E[n_h(k)] = E[Y_k] ≤ n/m = α.
– If k ∈ T, then because k appears in T[h(k)] and the count Y_k does not include k, we have n_h(k) = Y_k + 1 and |{ℓ : ℓ ∈ T, ℓ ≠ k}| = n − 1.
Thus E[n_h(k)] = E[Y_k] + 1 ≤ (n − 1)/m + 1 < 1 + α. □
Designing a universal class of hash functions:
Let p be a prime large enough that every key k lies in {0,…,p−1}. Define
Z_p = {0, 1, …, p−1} and Z_p* = {1, 2, …, p−1}.
For any a ∈ Z_p* and b ∈ Z_p, define
h_ab(k) = ((ak + b) mod p) mod m,  so h_ab : Z_p → Z_m.
H_pm = { h_ab : a ∈ Z_p* and b ∈ Z_p }.
Theorem:
H_pm is universal.
Pf: Let k, ℓ be two distinct keys in Z_p.
Given h_ab, let r = (ak + b) mod p and s = (aℓ + b) mod p.
Then r − s ≡ a(k − ℓ) (mod p).
Since p is prime, a ≠ 0, and k ≠ ℓ, this difference is nonzero modulo p: for any h_ab ∈ H_pm, distinct inputs k and ℓ map to distinct r and s modulo p.
Each of the p(p−1) choices for the pair (a, b) with a ≠ 0 yields a different resulting pair (r, s) with r ≠ s, since we can solve for a and b given r and s:
a = ((r − s) ((k − ℓ)⁻¹ mod p)) mod p,
b = (r − ak) mod p.
• Since there are also exactly p(p−1) possible pairs (r, s) with r ≠ s, there is a one-to-one correspondence between pairs (a, b) with a ≠ 0 and pairs (r, s) with r ≠ s.
• Hence, for any given pair of distinct inputs k and ℓ, if we pick (a, b) uniformly at random from Z_p* × Z_p, the resulting pair (r, s) is equally likely to be any pair of distinct values modulo p.
• Pr[ k and ℓ collide ] = Pr_{r,s}[ r ≡ s (mod m) ].
• Given r, the number of values s such that s ≠ r and s ≡ r (mod m) is at most
⌈p/m⌉ − 1 ≤ ((p + m − 1)/m) − 1 = (p − 1)/m,
since the values in {0,…,p−1} congruent to r modulo m are r, r ± m, r ± 2m, …, of which there are at most ⌈p/m⌉ (including r itself).
• Thus,
Pr_{r,s}[ r ≡ s (mod m) ] ≤ ((p − 1)/m) / (p − 1) = 1/m.
Therefore, for any pair of distinct keys k, ℓ ∈ Z_p,
Pr[ h_ab(k) = h_ab(ℓ) ] ≤ 1/m,
so H_pm is universal. □
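Drawing h_ab from H_pm is a two-line affair. A sketch (p and m below are arbitrary illustrative values; p must be prime and larger than every possible key):

```python
import random

def make_hab(p, m):
    """Draw h_ab(k) = ((a*k + b) mod p) mod m uniformly from H_pm."""
    a = random.randrange(1, p)  # a in Z_p* = {1, ..., p-1}
    b = random.randrange(0, p)  # b in Z_p  = {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

# example: keys drawn from {0, ..., 100}, table of m = 10 slots
h = make_hab(101, 10)
```

As the proof above shows, for any fixed pair of distinct keys, at most a 1/m fraction of the (a, b) draws make them collide.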
• Open addressing:
– There are no lists and no elements stored outside the table.
– Advantage: avoids pointers, potentially yielding fewer collisions and faster retrieval.
– The hash function becomes h : U × {0, 1, …, m−1} → {0, 1, …, m−1}.
– For every key k, the probe sequence
⟨h(k, 0), h(k, 1), …, h(k, m−1)⟩
is a permutation of ⟨0, 1, …, m−1⟩.
– Deletion from an open-address hash table is difficult.
– Thus chaining is more common when keys must be deleted.
• Linear probing:
– Let h′ : U → {0, 1, …, m−1} be an ordinary hash function (the auxiliary hash function).
– h(k, i) = (h′(k) + i) mod m.
• Quadratic probing:
– h(k, i) = (h′(k) + c₁i + c₂i²) mod m, where h′ is an auxiliary hash function and c₁ and c₂ ≠ 0 are constants.
• Double hashing:
– h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₁ and h₂ are auxiliary hash functions.
– Double hashing has Θ(m²) probe sequences; linear and quadratic probing have only m probe sequences each.
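The three schemes differ only in how probe number i perturbs the auxiliary hash. A sketch (the auxiliary functions h′, h₁, h₂ below are illustrative choices, not mandated by the lecture):

```python
def linear_probe(k, i, m):
    return (k % m + i) % m                      # h'(k) = k mod m

def quadratic_probe(k, i, m, c1=1, c2=3):
    return (k % m + c1 * i + c2 * i * i) % m    # c2 != 0

def double_hash_probe(k, i, m):
    h1 = k % m
    h2 = 1 + k % (m - 1)   # h2(k) is never 0, so the sequence advances
    return (h1 + i * h2) % m

# With a prime table size, gcd(h2(k), m) = 1, so double hashing's
# probe sequence visits all m slots.
m = 11
seq = [double_hash_probe(42, i, m) for i in range(m)]
```

For this prime m, `seq` is a permutation of 0…10; linear probing's sequence is trivially a permutation, while quadratic probing needs suitably constrained c₁, c₂, m.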
• Analysis of open-addressing hashing
α = n/m : the load factor of a table with n elements and m slots.
• Thm:
Given an open-address hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 − α), assuming uniform hashing.
• Pf:
– Define the r.v. X to be the number of probes made in an unsuccessful search.
– Define A_i: the event that there is an ith probe and it is to an occupied slot.
– The event {X ≥ i} = A₁ ∩ A₂ ∩ … ∩ A_{i−1}, so
Pr{X ≥ i} = Pr{A₁} · Pr{A₂ | A₁} · Pr{A₃ | A₁ ∩ A₂} ⋯ Pr{A_{i−1} | A₁ ∩ A₂ ∩ … ∩ A_{i−2}}.
– Pr{A₁} = n/m.
– The probability that there is a jth probe and it is to an occupied slot, given that the first j−1 probes were to occupied slots, is (n − j + 1)/(m − j + 1). Why? Under uniform hashing, the jth probe is equally likely to be any of the m − j + 1 slots not yet probed, of which n − j + 1 are occupied.
– Since n < m, we have (n − j)/(m − j) ≤ n/m for all 0 ≤ j < m.
– Therefore
Pr{X ≥ i} = (n/m) · ((n−1)/(m−1)) ⋯ ((n−i+2)/(m−i+2)) ≤ (n/m)^{i−1} = α^{i−1},
and so
E[X] = Σ_{i=1}^{∞} Pr{X ≥ i} ≤ Σ_{i=1}^{∞} α^{i−1} = Σ_{i=0}^{∞} α^i = 1/(1 − α). □
Cor: Inserting an element into an open-address hash table with load factor α requires at most 1/(1 − α) probes on average, assuming uniform hashing.
Thm: Given an open-address hash table with load factor α < 1, the expected number of probes in a successful search is at most (1/α) ln(1/(1 − α)), assuming uniform hashing and that each key in the table is equally likely to be searched for.
Pf: Suppose we search for a key k.
If k was the (i+1)st key inserted into the hash table, the expected number of probes made in a search for k is at most
1/(1 − i/m) = m/(m − i).
• Averaging over all n keys in the hash table gives us the average number of probes in a successful search:

(1/n) Σ_{i=0}^{n−1} m/(m − i)
= (m/n) Σ_{i=0}^{n−1} 1/(m − i)
= (1/α) Σ_{k=m−n+1}^{m} 1/k
= (1/α) (H_m − H_{m−n})
≤ (1/α) ∫_{m−n}^{m} (1/x) dx
= (1/α) ln( m/(m − n) )
= (1/α) ln( 1/(1 − α) ). □
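The unsuccessful-search bound can be checked with a small simulation of the uniform-hashing model itself (each probe sequence is a fresh uniformly random permutation; this is a model sketch, not a real hash table):

```python
import random

def avg_unsuccessful_probes(n, m, trials=2000):
    """Average probes until an empty slot is found, when the n occupied
    slots are a random subset of m slots and the probe sequence is a
    uniformly random permutation of the slots."""
    total = 0
    for _ in range(trials):
        occupied = [True] * n + [False] * (m - n)
        random.shuffle(occupied)       # random set of occupied slots
        order = list(range(m))
        random.shuffle(order)          # the probe sequence
        probes = 0
        for slot in order:
            probes += 1
            if not occupied[slot]:     # an empty slot ends the search
                break
        total += probes
    return total / trials

n, m = 150, 200                        # load factor alpha = 0.75
bound = 1 / (1 - n / m)                # theorem's bound: 4 probes
```

The simulated average should come out just under `bound` (the exact expectation in this model is (m+1)/(m−n+1) ≈ 3.94 here).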
Perfect Hashing:
• Perfect hashing is good when the keys are static; i.e., once stored, the keys never change. E.g., a CD-ROM, or the set of reserved words in a programming language.
• Thm:
If we store n keys in a hash table of size m = n² using a hash function h randomly chosen from a universal class of hash functions, then the probability of there being any collisions is < 1/2.
• Proof:
Let h be chosen from a universal family. Then each pair of distinct keys collides with probability at most 1/m, and there are C(n, 2) pairs of keys.
Let X be a r.v. that counts the number of collisions. When m = n²,
E[X] ≤ C(n, 2) · (1/m) = ((n² − n)/2) · (1/n²) < 1/2.
By Markov's inequality, Pr[X ≥ t] ≤ E[X]/t; taking t = 1 gives Pr[X ≥ 1] ≤ E[X] < 1/2. □
• Thm: If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions, then
E[ Σ_{j=0}^{m−1} n_j² ] < 2n,
where n_j is the number of keys hashing to slot j.
• Pf:
– For any nonnegative integer a,
a² = a + 2·C(a, 2).
– Therefore
E[ Σ_{j=0}^{m−1} n_j² ]
= E[ Σ_{j=0}^{m−1} ( n_j + 2·C(n_j, 2) ) ]
= E[ Σ_{j=0}^{m−1} n_j ] + 2 E[ Σ_{j=0}^{m−1} C(n_j, 2) ]
= n + 2 E[ Σ_{j=0}^{m−1} C(n_j, 2) ],
where Σ_{j=0}^{m−1} C(n_j, 2) is exactly the total number of collisions.
– By universality, the expected total number of collisions is at most
C(n, 2) · (1/m) = n(n − 1)/(2m) = (n − 1)/2, since m = n.
– Hence
E[ Σ_{j=0}^{m−1} n_j² ] ≤ n + 2 · (n − 1)/2 = 2n − 1 < 2n. □
• Cor: If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions, and we set the size of each secondary hash table to m_j = n_j² for j = 0, …, m−1, then the expected amount of storage required for all secondary hash tables in a perfect hashing scheme is < 2n.
• Cor: Under the same assumptions as above,
Pr{ total storage ≥ 4n } < 1/2.
• Pf:
– By Markov's inequality, Pr{ X ≥ t } ≤ E[X]/t.
– Take X = Σ_{j=0}^{m−1} m_j and t = 4n:
Pr{ Σ_{j=0}^{m−1} m_j ≥ 4n } ≤ E[ Σ_{j=0}^{m−1} m_j ] / (4n) < (2n)/(4n) = 1/2. □
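The theorems above combine into the standard two-level construction. A hypothetical end-to-end sketch (function names are mine; it draws from the h_ab family defined earlier and re-draws hash functions until the level-two tables are collision-free, which by the first perfect-hashing theorem succeeds with probability > 1/2 per draw):

```python
import random

def _make_hab(p, m):
    # h_ab(k) = ((a*k + b) mod p) mod m, drawn from the universal family H_pm
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda k: ((a * k + b) % p) % m

def _collision_free(bucket, p):
    """Secondary table of size n_j**2 with no collisions; each draw of
    h_j succeeds with probability > 1/2, so O(1) expected retries."""
    mj = len(bucket) ** 2
    while True:
        hj = _make_hab(p, mj)
        table = [None] * mj
        for k in bucket:
            if table[hj(k)] is not None:
                break                  # collision: redraw hj
            table[hj(k)] = k
        else:
            return hj, table

def build_perfect_table(keys, p=1_000_003):   # p prime, > every key
    n = len(keys)
    h = _make_hab(p, n)                       # level-one function, m = n
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k)].append(k)
    secondary = [_collision_free(b, p) if b else (None, [])
                 for b in buckets]
    def lookup(k):
        hj, table = secondary[h(k)]
        return hj is not None and table[hj(k)] == k
    return lookup
```

Once built, `lookup` answers membership queries with a worst-case constant number of probes, and by the corollaries the expected total secondary storage is < 2n.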