Hash Tables: Dictionaries with Constant Access Time


Data Structures
Dana Shapira
Hash Tables
Element Uniqueness Problem
Let 0 ≤ x0, …, xn−1 < m.
Determine whether there exist i ≠ j such that xi = xj.
Solutions:
- Sorting
- Bucket sort:
#include <stdio.h>

/* T has size m; all values satisfy 0 <= x[i] < m. */
void element_uniqueness(int x[], int n, int T[], int m) {
    for (int i = 0; i < m; i++)
        T[i] = -1;                      /* -1 marks an empty bucket */
    for (int i = 0; i < n; i++) {
        if (T[x[i]] == -1)
            T[x[i]] = i;                /* remember the first index with this value */
        else {                          /* duplicate found: x[i] == x[T[x[i]]] */
            printf("(%d, %d)\n", i, T[x[i]]);
            return;
        }
    }
}
What happens when m is large, or when we are dealing with real numbers?
Hash Tables
Notations:
- U: the universe of keys, of size |U|
- K: the actual set of keys, of size n
- T: a hash table of size m

Use a hash function h: U → {0, …, m − 1}, where h(x) = i computes the slot i in array T in which element x is to be stored, for all x in U.

h(k) is computed in O(|k|) = O(1) time.

[Figure: keys x1, x2, x3, x4 from U are mapped by h to the set of array indices.]
Example
- h: U → {0, …, m − 1}
- h(x) = x mod 10 (what is m?)
- Input: 17, 62, 19, 81, 53

  slot: 0    1    2    3    4    5    6    7    8    9
  key:       81   62   53                  17        19

- Collision: x ≠ y but h(x) = h(y).
- Since m « |U|, collisions cannot be avoided.
- Solutions:
  1. Chaining
  2. Open addressing
Collision-Resolution by Chaining
Each slot T[i] holds a linked list of all keys that hash to i.

[Figure: chains for h(x) = x mod 10. Slot 1: 81; slot 2: 62 → 12; slot 3: 53; slot 7: 17 → 37 → 57; slot 9: 19.]

Insert(T, x): insert the new element x at the head of list T[h(x.key)].
Delete(T, x): delete element x from list T[h(x.key)].
Search(T, x): search the list T[h(x.key)].
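A minimal C sketch of these operations, assuming integer keys and the table size M = 10 from the running example; Delete is analogous to Search, unlinking the node it finds:

#include <stdlib.h>

#define M 10

typedef struct node { int key; struct node *next; } node;

node *T[M];                            /* each slot is the head of a linked list */

int h(int key) { return key % M; }     /* h(x) = x mod M, non-negative keys assumed */

/* Insert(T, x): new element at the head of list T[h(x)]; O(1) */
void insert(int key) {
    node *x = malloc(sizeof *x);
    x->key = key;
    x->next = T[h(key)];
    T[h(key)] = x;
}

/* Search(T, x): scan list T[h(x)]; expected Theta(1 + alpha) */
node *search(int key) {
    for (node *p = T[h(key)]; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}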
Analysis of Chaining
Simple uniform hashing:
- Any given element is equally likely to hash to any slot in the hash table.
- The slot an element hashes to is independent of where other elements hash.
Load factor:
α = n/m (number of elements stored in the hash table / number of slots in the hash table)
Analysis of Chaining
Theorem: In a hash table with chaining, under the assumption of simple uniform hashing, both successful and unsuccessful searches take expected time Θ(1 + α), where α is the hash table load factor.
Proof:
Unsuccessful search: Under the assumption of simple uniform hashing, any key k is equally likely to hash to any slot in the hash table. The expected time to search unsuccessfully for a key k is the expected time to search to the end of list T[h(k)], which has expected length α.
Expected time: 1 + α, including the time for computing h(k).
Successful search: The number of elements examined during a successful search for a key k is one more than the number of elements that appear before k in its list T[h(k)]. Since insertions happen at the head of the list, the elements before k are exactly those inserted after k; under simple uniform hashing, each of the n − i keys inserted after the i-th key lands in the same list with probability 1/m. Averaging over the n keys in the table, the expected number of elements examined is

$$\frac{1}{n}\sum_{i=1}^{n}\left(1+\frac{n-i}{m}\right) = 1+\frac{1}{nm}\sum_{i=1}^{n}(n-i) = 1+\frac{1}{nm}\cdot\frac{n(n-1)}{2} = 1+\frac{\alpha}{2}-\frac{\alpha}{2n}$$

Expected time: 1 + α/2 − α/(2n) = Θ(1 + α), including the time for computing h(k). ∎

Corollary: If m = Θ(n), then α = O(1), and Insert, Delete, and Search take expected constant time.
Designing Good Hash Functions
Example:
- Input: reals drawn uniformly at random from [0, 1)
- Hash function: h(x) = ⌊mx⌋
Often, the input distribution is unknown. Then we can use heuristics or universal hashing.
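A one-line C sketch of this case; the function name and unsigned return type are illustrative assumptions:

/* Keys drawn uniformly at random from [0,1): h(x) = floor(m*x)
   spreads them uniformly over the m slots. */
unsigned hash_real(double x, unsigned m) {
    return (unsigned)(m * x);   /* assumes 0 <= x < 1 */
}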
The Division Method
- Hash function: h(x) = x mod m
- If m = 2^k, then h(x) = the lowest k bits of x, so the hash ignores the remaining bits.
- Heuristic: choose m to be a prime number not too close to a power of 2.
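In C this is a single modulo; the table size 701, a prime not too close to a power of 2, is an illustrative choice following the heuristic:

/* Division method: h(x) = x mod m. */
unsigned hash_div(unsigned x, unsigned m) {
    return x % m;   /* e.g. m = 701 */
}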
The Multiplication Method

Hash function: h(x) = m (cx mod 1), for some 0 < c < 1



Optimal choice of c depends on input distribution.
Heuristic:
Knuth suggests the inverse of the golden ratio as a value that works well:





Example:
x=123,456, m=10,000
h(x) = 10,000 ·(123,456·0.61803… mod 1) =
= 10,000 ·(76,300.0041151… mod 1) =
= 10,000 ·0.0041151… = 41.151... = 41
10
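A direct floating-point C sketch of this formula, adequate for illustration; the fixed-point version on the next slide avoids floating-point arithmetic altogether:

#include <math.h>

/* Multiplication method with Knuth's c = (sqrt(5) - 1)/2. */
unsigned hash_mult(unsigned x, unsigned m) {
    const double c = 0.6180339887498949;
    return (unsigned)(m * fmod(c * (double)x, 1.0));
}
/* hash_mult(123456, 10000) == 41, as in the example above. */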
Efficient Implementation of the Multiplication Method
h(x) = ⌊m(cx mod 1)⌋
- Let w be the size of a machine word.
- Assume that key x fits into a machine word.
- Assume that m = 2^p.
- Restrict ourselves to values of c of the form c = s / 2^w, with 0 < s < 2^w.
- Then cx = sx / 2^w.
- sx is a number that fits into two machine words.
- h(x) = the p most significant bits of the lower word.

[Figure: multiplying the w-bit key x by the w-bit value s yields a 2w-bit product; the upper word is the integer part and the lower word the fractional part of cx. After multiplying by m = 2^p, h(x) is the top p bits of the lower word.]
Example
h(x) = m (cx mod 1)

x = 123456, p = 14, m = 214 = 16384, w = 32,
 Then sx = (76300 ⋅ 232) + 17612864
 The 14 most significant bits of 17612864 are 67; that is, h(x) = 67
x = 0000 0000 0000 0001 1110 0010 0100 0000
s = 1001 1110 0011 0111 0111 1001 1011 1001
sx = 0000 0000 0000 0001 0010 1010 0000 1100
0000 0001 0000 1100 1100 0000 0100 0000
h(x) = 00 0000 0100 0011 = 67
12
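A C sketch of this fixed-point scheme under the stated assumptions (w = 32, m = 2^p); the constant 2654435769 = 0x9E3779B9 is the s from the example:

#include <stdint.h>

/* Fixed-point multiplication method, assuming w = 32 and m = 2^p. */
static const uint32_t s = 2654435769u;   /* floor(((sqrt(5)-1)/2) * 2^32) */

uint32_t hash_mult32(uint32_t x, unsigned p) {
    uint64_t sx = (uint64_t)s * x;   /* the product fits into two 32-bit words */
    uint32_t lower = (uint32_t)sx;   /* lower word = fractional part of cx */
    return lower >> (32 - p);        /* p most significant bits of the lower word */
}
/* hash_mult32(123456, 14) == 67, matching the example. */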
Open Addressing
- All elements are stored directly in the hash table.
- The load factor α cannot exceed 1.
- If slot T[h(x)] is already occupied for a key x, we probe alternative locations until we find an empty slot.
- Searching probes slots starting at T[h(x)] until x is found or until we are sure that x is not in T.
- Instead of computing h(x), we compute h(x, i), where i is the probe number.
Linear Probing
Hash function: h(k, i) = (h'(k) + i) mod m, where h' is an auxiliary hash function.
- Benefit: easy to implement.
- Problem: primary clustering. Long runs of occupied slots build up as the table becomes fuller.
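A minimal C sketch of insertion by linear probing; the table size M, the EMPTY sentinel, and h'(k) = k mod M are illustrative assumptions:

#define M 16
#define EMPTY (-1)

int T[M];   /* initialize every slot to EMPTY before use */

/* Probe h(k, i) = (h'(k) + i) mod M until an empty slot is found. */
int insert_linear(int key) {
    for (int i = 0; i < M; i++) {
        int slot = (key % M + i) % M;
        if (T[slot] == EMPTY) { T[slot] = key; return slot; }
    }
    return -1;   /* table full */
}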
Quadratic Probing
Hash function: h(k, i) = (h'(k) + c1·i + c2·i²) mod m, where h' is an auxiliary hash function and c1, c2 are constants.
- Benefit: no more primary clustering.
- Problem: secondary clustering. Two keys x and y with h'(x) = h'(y) have the same probe sequence.
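A sketch of one common parameter choice, c1 = c2 = 1/2, which with m a power of 2 makes the probe sequence visit every slot (the offsets are the triangular numbers i(i+1)/2); the names below are illustrative:

#define M 16   /* power of 2 */

/* h(k, i) = (h'(k) + i/2 + i*i/2) mod M; i + i*i is always even. */
int probe_quadratic(int key, int i) {
    return (key % M + (i + i * i) / 2) % M;
}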
Double Hashing
Hash function: h(k, i) = (h1(k) + i·h2(k)) mod m, where h1 and h2 are two auxiliary hash functions.
- h2(k) has to be relatively prime to m; that is, gcd(h2(k), m) = 1.
- Two methods:
  1. Choose m to be a power of 2 and guarantee that h2(k) is always odd.
  2. Choose m to be a prime number and guarantee that h2(k) < m.
- Benefit: no more clustering.
- Drawback: more complicated than linear and quadratic probing.
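A sketch of the first method (m a power of 2, h2 forced odd); the particular h1 and h2 below are illustrative assumptions:

#define M 16   /* power of 2 */

int h1(int k) { return k % M; }
int h2(int k) { return (k % (M - 1)) | 1; }   /* setting the low bit makes h2 odd */

/* h(k, i) = (h1(k) + i*h2(k)) mod M; gcd(h2(k), M) = 1, so the probe
   sequence visits all M slots. */
int probe_double(int k, int i) {
    return (h1(k) + i * h2(k)) % M;
}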
Analysis of Open Addressing
- Uniform hashing: the probe sequence ⟨h(k, 0), …, h(k, m − 1)⟩ is equally likely to be any permutation of ⟨0, …, m − 1⟩.
Theorem: In an open-address hash table with load factor α < 1, the expected number of probes in an unsuccessful search is at most 1 / (1 − α), assuming uniform hashing.
Proof: Let X be the number of probes in an unsuccessful search.

$$E[X] = \sum_{i=0}^{\infty} i \Pr[X = i] = \sum_{i=0}^{\infty} i \left(\Pr[X \ge i] - \Pr[X \ge i+1]\right) = \sum_{i=1}^{\infty} \Pr[X \ge i]$$

Let A_i = "there is an i-th probe, and it accesses a non-empty slot". Then Pr[X ≥ i] = Pr[A_1 ∩ A_2 ∩ … ∩ A_{i−1}].
Analysis of Open Addressing – Cont.
Under uniform hashing, the j-th probe hits an occupied slot with probability (n − j + 1)/(m − j + 1) ≤ n/m, so

$$\Pr[X \ge i] = \frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-i+2}{m-i+2} \le \left(\frac{n}{m}\right)^{i-1} = \alpha^{i-1}$$

Hence

$$E[X] = \sum_{i=1}^{\infty} \Pr[X \ge i] \le \sum_{i=1}^{\infty} \alpha^{i-1} = \frac{1}{1-\alpha}. \qquad \blacksquare$$
Analysis of Open Addressing – Cont.
Corollary: The expected number of probes performed during an insertion into an open-address hash table with uniform hashing is at most 1 / (1 − α), since an insertion performs an unsuccessful search and then places the element into the first empty slot found.
Analysis of Open Addressing – Cont.
Theorem: Given an open-address hash table with load factor α < 1, the expected number of probes in a successful search is at most (1/α) ln (1 / (1 − α)), assuming uniform hashing and assuming that each key in the table is equally likely to be searched for.
- A successful search for an element x follows the same probe sequence as the insertion of element x.
- Consider the (i + 1)-st element x that was inserted. At the time of its insertion the load factor was i/m, so the expected number of probes performed when inserting x is at most 1 / (1 − i/m).
- Averaging over all n elements, the expected number of probes in a successful search is

$$\frac{1}{n}\sum_{i=0}^{n-1}\frac{1}{1-i/m} = \frac{m}{n}\sum_{i=0}^{n-1}\frac{1}{m-i} = \frac{1}{\alpha}\sum_{k=m-n+1}^{m}\frac{1}{k} \le \frac{1}{\alpha}\int_{m-n}^{m}\frac{dx}{x} = \frac{1}{\alpha}\ln\frac{m}{m-n} = \frac{1}{\alpha}\ln\frac{1}{1-\alpha}. \qquad \blacksquare$$
Universal Hashing
- A family H of hash functions is universal if for each pair k, l of distinct keys, there are at most |H| / m functions h in H such that h(k) = h(l).
- This means: for any two distinct keys k and l and a function h chosen uniformly at random from H, the probability of a collision is at most
  Pr[h(k) = h(l)] ≤ (|H| / m) / |H| = 1/m.
- This is the same bound as if h(k) and h(l) were chosen uniformly at random from {0, …, m − 1}.
Analysis of Universal Hashing
Theorem: For a hash function h chosen uniformly at random from a universal family H, the expected length of the list T[h(x)] is at most α if x is not in the hash table, and at most 1 + α if x is in the hash table.
Proof: Define indicator variables

$$\alpha_{xy} = \begin{cases} 1 & h(x) = h(y) \\ 0 & h(x) \ne h(y) \end{cases}$$

Since H is universal, E[α_xy] = Pr[h(x) = h(y)] ≤ 1/m.
Let Y_x be the number of keys other than x that hash to the same slot as x:

$$Y_x = \sum_{\substack{y \in T \\ y \ne x}} \alpha_{xy}$$
Analysis of Universal Hashing – Cont.

$$E[Y_x] = E\Bigl[\sum_{\substack{y \in T \\ y \ne x}} \alpha_{xy}\Bigr] = \sum_{\substack{y \in T \\ y \ne x}} E[\alpha_{xy}] \le \sum_{\substack{y \in T \\ y \ne x}} \frac{1}{m}$$

- If x is not in T, then |{y ∈ T : y ≠ x}| = n. Hence, E[Y_x] ≤ n/m = α.
- If x is in T, then |{y ∈ T : y ≠ x}| = n − 1. Hence, E[Y_x] ≤ (n − 1)/m < α. The length of list T[h(x)] is one more than Y_x, that is, at most 1 + α. ∎
Universal Family of Hash Functions
- Choose a prime p and let m = p.
- Decompose each key x into digits x = ⟨x0, …, xr⟩ such that each xi < m.
- For each vector a = ⟨a0, …, ar⟩ ∈ {0, …, m − 1}^(r+1), define the hash function

$$h_a(x) = \Bigl(\sum_{i=0}^{r} a_i x_i\Bigr) \bmod m$$

- H = {h_a : a ∈ {0, …, m − 1}^(r+1)}, so |H| = m^(r+1).
- Example: m = p = 253, U = {0, …, 2^24}, a = [248, 223, 101], x = 1025 = [0, 2, 1]:
  h_a(x) = (248·0 + 223·2 + 101·1) mod 253 = 41
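A C sketch of h_a; the 64-bit accumulator avoids overflow for moderate r, and the function name is an illustrative assumption:

/* h_a(x) = (sum of a[i]*x[i]) mod m, for keys split into r+1 digits < m. */
unsigned hash_vector(const unsigned a[], const unsigned x[], int r, unsigned m) {
    unsigned long long sum = 0;
    for (int i = 0; i <= r; i++)
        sum += (unsigned long long)a[i] * x[i];
    return (unsigned)(sum % m);
}
/* With m = 253, a = {248, 223, 101}, x = {0, 2, 1}: hash_vector(a, x, 2, 253) == 41. */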
Universal Family of Hash Functions – Cont.
Theorem: The class H is universal.
Proof: Let x = ⟨x0, …, xr⟩ and y = ⟨y0, …, yr⟩ such that x ≠ y; w.l.o.g. x0 ≠ y0.

$$h_a(x) = h_a(y) \iff \sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m} \iff a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}$$

Since m = p is prime, for every z ≢ 0 (mod p) there exists a single w such that z·w ≡ 1 (mod m). Taking z = x0 − y0 ≠ 0, for every choice of a1, …, ar there exists a single a0 for which h_a(x) = h_a(y):

$$a_0 \equiv \Bigl(-\sum_{i=1}^{r} a_i (x_i - y_i)\Bigr) (x_0 - y_0)^{-1} \pmod{m}$$

Hence h_a(x) = h_a(y) for exactly m^r of the m^(r+1) vectors a, so the fraction of functions h in H for which h(x) = h(y) is m^r / m^(r+1) = 1/m. ∎
Another Universal Family of Hash Functions
Choose a prime p so that m < p and every key is less than p.
For any 1 ≤ a < p and 0 ≤ b < p, we define a function
h_{a,b}(x) = ((ax + b) mod p) mod m.
Let H_{p,m} be the family
H_{p,m} = {h_{a,b} : 1 ≤ a < p and 0 ≤ b < p}.
Theorem: The class H_{p,m} is universal.
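A C sketch of drawing and evaluating a function from H_{p,m}; the prime p = 2^31 − 1 is an illustrative assumption that requires keys smaller than p:

#include <stdint.h>
#include <stdlib.h>

#define P 2147483647u   /* the prime 2^31 - 1; keys must be < P */

typedef struct { uint64_t a, b; } hab_t;

/* Choose h_{a,b} uniformly at random: 1 <= a < p, 0 <= b < p
   (rand() is good enough for a sketch). */
hab_t hab_random(void) {
    hab_t h = { 1 + (uint64_t)rand() % (P - 1), (uint64_t)rand() % P };
    return h;
}

/* h_{a,b}(x) = ((a*x + b) mod p) mod m; a*x < 2^62, so uint64_t cannot overflow. */
uint64_t hab(hab_t h, uint32_t x, uint64_t m) {
    return ((h.a * x + h.b) % P) % m;
}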
Summary
- Hash tables are the most efficient dictionaries if only the operations Insert, Delete, and Search have to be supported.
- If uniform hashing is used, the expected time of each of these operations is constant.
- Universal hashing is somewhat complicated, but performs well even for adversarial input distributions.
- If the input distribution is known, heuristics perform well and are much simpler than universal hashing.
- For collision resolution, chaining is the simplest method, but it requires more space than open addressing.
- Open addressing is either more complicated or suffers from clustering effects.