10_HashTables - Kutztown University


Hash Tables
CSIT 402
Data Structures II
Hashing
Goal
• Perform inserts, deletes, and finds in constant average time
Topics
• Hash table, hash function, collisions
Collision handling
• Separate chaining
• Open addressing: linear probing, quadratic probing, double hashing
Rehashing
• Load factor
Goal
• Develop a structure that will allow user to insert/delete/find records in constant average time
– structure will be a table (relatively small)
– table completely contained in memory
– implemented by an array
– capitalizes on ability to access any element of the array in constant time
Hash Table Applications
• Symbol table in compilers
• Accessing tree or graph nodes by name
– E.g., city names in Google maps
• Maintaining a transposition table in games
– Remember previous game situations and the move taken (avoid recomputation)
• Dictionary lookups
– Spelling checkers
– Natural language understanding (word sense)
• Heavily used in text processing languages
– E.g., Perl, Python, etc.
Hash Function
• Assume table (array) size is N
• Function f(x) maps any key x to an int
between 0 and N−1
• For example, assume that N=15, that key x
is a non-negative int between 0 and
MAX_INT, and hash function
f(x) = x % 15
Hash Function
Thus, since f(x) = x % 15:

x:     25   129   35   2501   47   36
f(x):  10     9    5     11    2    6

Storing the keys in the array is not a problem.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36   _   _  129   25  2501    _    _    _
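As a minimal sketch (not from the slides), the mapping above can be reproduced with a plain array and f(x) = x % 15; the names table and keys here are illustrative:

#include <iostream>
#include <vector>

int main() {
    const int N = 15;                     // table size used in the example
    std::vector<int> table(N, -1);        // -1 marks an empty slot
    int keys[] = {25, 129, 35, 2501, 47, 36};
    for (int x : keys)
        table[x % N] = x;                 // f(x) = x % 15; these keys do not collide
    for (int i = 0; i < N; ++i)
        if (table[i] != -1)
            std::cout << i << " -> " << table[i] << '\n';
    return 0;
}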
Hash Function
What happens when you try to insert x = 65 ?
f(65) = 65 % 15 = 5, but slot 5 already holds 35.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36   _   _  129   25  2501    _    _    _
                       65(?)

This is called a collision.
Handling Collisions
• Separate Chaining
• Open Addressing
– Linear Probing
– Quadratic Probing
– Double Hashing
Handling Collisions
Separate Chaining
Separate Chaining
Let each array element be the head of a chain.

Array (only non-empty chains shown):
 2 → 47
 5 → 65 → 35
 6 → 36
 9 → 129
11 → 25 → 2501

Where would you store: 29, 16, 14, 99, 127 ?
Separate Chaining
Let each array element be the head of a chain:

Array (only non-empty chains shown):
 1 → 16
 2 → 47
 5 → 65 → 35
 6 → 36
 7 → 127
 9 → 99 → 129
11 → 25 → 2501
14 → 14 → 29

Where would you store: 29, 16, 14, 99, 127 ?
Note that we use the insertAtHead() method when inserting new keys, so the most
recently inserted key appears first in its chain (see the sketch below).
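A minimal separate-chaining sketch in C++ (the class and member names are illustrative; std::list::push_front plays the role of insertAtHead()):

#include <list>
#include <vector>

class ChainedHashTable {
    int N;                                    // table size
    std::vector<std::list<int>> buckets;      // one chain per array element
public:
    explicit ChainedHashTable(int n) : N(n), buckets(n) {}
    int hash(int x) const { return x % N; }
    void insert(int x) { buckets[hash(x)].push_front(x); }   // insertAtHead
    bool find(int x) const {
        for (int k : buckets[hash(x)])        // walk the chain, comparing keys
            if (k == x) return true;
        return false;
    }
    void remove(int x) { buckets[hash(x)].remove(x); }       // easy with chaining
};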
Handling Collisions
Linear Probing
Linear Probing
Let key x be stored in element f(x) = t of the array.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36   _   _  129   25  2501    _    _    _
                       65(?)

What do you do in case of a collision?
If the hash table is not full, attempt to store key in array elements
(t+1)%N, (t+2)%N, (t+3)%N, … until you find an empty slot.
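A sketch of this probing loop, assuming an int table where -1 marks an empty slot (illustrative, not the slides' code):

#include <vector>

// Linear probing: try t, (t+1)%N, (t+2)%N, ... until an empty slot (-1) is found.
bool insertLinear(std::vector<int>& table, int x) {
    const int N = static_cast<int>(table.size());
    const int t = x % N;
    for (int i = 0; i < N; ++i) {            // at most N probes
        int slot = (t + i) % N;
        if (table[slot] == -1) { table[slot] = x; return true; }
    }
    return false;                            // table is full
}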
Linear Probing
Where do you store 65 ?  f(65) = 5.
Attempts: slot 5 (35), slot 6 (36), slot 7 (empty) → store 65 at slot 7.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36  65   _  129   25  2501    _    _    _
Linear Probing
If the hash table is not full, attempt to store key in array elements
(t+1)%N, (t+2)%N, …
f(29) = 14: slot 14 is empty, so 29 goes straight in.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36  65   _  129   25  2501    _    _   29

Where would you store: 29, 16, 14, 99, 127 ?
Linear Probing
If the hash table is not full, attempt to store key in array elements
(t+1)%N, (t+2)%N, …
f(16) = 1: slot 1 is empty, so 16 goes straight in.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _  16  47   _   _  35  36  65   _  129   25  2501    _    _   29

Where would you store: 16, 14, 99, 127 ?
Linear Probing
If the hash table is not full, attempt to store key in array elements
(t+1)%N, (t+2)%N, …
f(14) = 14: slot 14 holds 29, so probe (14+1)%15 = 0, which is empty → 14 goes to slot 0.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    14  16  47   _   _  35  36  65   _  129   25  2501    _    _   29

Where would you store: 14, 99, 127 ?
Linear Probing
If the hash table is not full, attempt to store key in array elements
(t+1)%N, (t+2)%N, …
f(99) = 9: slots 9 (129), 10 (25), and 11 (2501) are taken; slot 12 is empty → 99 goes to slot 12.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    14  16  47   _   _  35  36  65   _  129   25  2501   99    _   29

Where would you store: 99, 127 ?
Linear Probing
If the hash table is not full, attempt to store key in array elements
(t+1)%N, (t+2)%N, …
f(127) = 7: slot 7 holds 65, so probe slot 8, which is empty → 127 goes to slot 8.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    14  16  47   _   _  35  36  65 127  129   25  2501   99    _   29

Where would you store: 127 ?
Linear Probing
• Leads to the problem of clustering: elements tend to cluster in dense intervals in the array
(e.g., slots 5–8 and 9–12 in the example above).
Drawbacks of Linear Probing
• Works until the array is full, but as the number of items approaches TableSize, access time approaches O(N)
• Very prone to cluster formation (as in our example)
– If a key hashes anywhere into a cluster, finding a free cell involves going through the entire cluster – and making it grow!
– Primary clustering – clusters grow when keys hash to values close to each other
• Can have cases where the table is empty except for a few clusters
– Does not satisfy the good-hash-function criterion of distributing keys uniformly
Handling Collisions
Quadratic Probing
Quadratic Probing
Let key x be stored in element f(x) = t of the array.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36   _   _  129   25  2501    _    _    _
                       65(?)

What do you do in case of a collision?
If the hash table is not full, attempt to store key in array elements
(t+1²)%N, (t+2²)%N, (t+3²)%N, … until you find an empty slot.
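The same sketch adapted to quadratic probing (same assumptions: int keys, -1 for an empty slot); note that with an arbitrary table size the sequence t+i² is not guaranteed to visit every slot:

#include <vector>

// Quadratic probing: try (t + 1*1)%N, (t + 2*2)%N, (t + 3*3)%N, ...
bool insertQuadratic(std::vector<int>& table, int x) {
    const int N = static_cast<int>(table.size());
    const int t = x % N;
    for (int i = 0; i < N; ++i) {            // i = 0 checks slot t itself
        int slot = (t + i * i) % N;
        if (table[slot] == -1) { table[slot] = x; return true; }
    }
    return false;   // gave up: the quadratic sequence may not reach every slot
}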
Quadratic Probing
Where do you store 65 ?  f(65) = t = 5.
Attempts: t = 5 (35), t+1 = 6 (36), t+4 = 9 (129), t+9 = 14 (empty) → store 65 at slot 14.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36   _   _  129   25  2501    _    _   65
Quadratic Probing
If the hash table is not full, attempt to store key in array elements
(t+1²)%N, (t+2²)%N, …
f(29) = t = 14: slot 14 holds 65, so try (t+1)%15 = 0, which is empty → 29 goes to slot 0.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    29   _  47   _   _  35  36   _   _  129   25  2501    _    _   65

Where would you store: 29, 16, 14, 99, 127 ?
Quadratic Probing
If the hash table is not full, attempt to store key in array elements
(t+1²)%N, (t+2²)%N, …
f(16) = t = 1: slot 1 is empty, so 16 goes straight in.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    29  16  47   _   _  35  36   _   _  129   25  2501    _    _   65

Where would you store: 16, 14, 99, 127 ?
Quadratic Probing
If the hash table is not full, attempt to store key in array elements
(t+1²)%N, (t+2²)%N, …
f(14) = t = 14: slot 14 holds 65, (t+1)%15 = 0 holds 29, (t+4)%15 = 3 is empty → 14 goes to slot 3.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    29  16  47  14   _  35  36   _   _  129   25  2501    _    _   65

Where would you store: 14, 99, 127 ?
Quadratic Probing
If the hash table is not full, attempt to store key in array elements
(t+1²)%N, (t+2²)%N, …
f(99) = t = 9: slot 9 holds 129, (t+1)%15 = 10 holds 25, (t+4)%15 = 13 is empty → 99 goes to slot 13.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    29  16  47  14   _  35  36   _   _  129   25  2501    _   99   65

Where would you store: 99, 127 ?
Quadratic Probing
If the hash table is not full, attempt to store key in array elements
(t+1²)%N, (t+2²)%N, …
f(127) = t = 7: slot 7 is empty, so 127 goes straight in.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    29  16  47  14   _  35  36 127   _  129   25  2501    _   99   65

Where would you store: 127 ?
Quadratic Probing
• Tends to distribute keys better
• Alleviates problem of clustering
Handling Collisions
Double Hashing
Double Hashing
Let key x be stored in element f(x) = t of the array.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36   _   _  129   25  2501    _    _    _
                       65(?)

What do you do in case of a collision?
Define a second hash function f2(x) = d. Attempt to store key in array elements
(t+d)%N, (t+2d)%N, (t+3d)%N, … until you find an empty slot.
Double Hashing
• Collision Resolution Strategy
– Apply if hash function produces collision
• Typical second hash function
f2(x)=R − ( x % R )
where R is a prime number, R < N
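A double-hashing sketch under the same assumptions (int keys, -1 for an empty slot), with R passed in so it matches the R = 11 used in the example below:

#include <vector>

// Double hashing with f(x) = x % N and f2(x) = R - (x % R), where R is prime and R < N.
bool insertDouble(std::vector<int>& table, int x, int R) {
    const int N = static_cast<int>(table.size());
    const int t = x % N;
    if (table[t] == -1) { table[t] = x; return true; }   // no collision, f2 not needed
    const int d = R - (x % R);                           // step size from the second hash
    for (int i = 1; i <= N; ++i) {                       // bounded number of attempts
        int slot = (t + i * d) % N;
        if (table[slot] == -1) { table[slot] = x; return true; }
    }
    return false;   // probe sequence cycled without reaching an empty slot
}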
Double Hashing
Where do you store 65 ?  f(65) = t = 5.
• Collision at 5 means apply double hashing.
Let f2(x) = 11 − (x % 11)  (note: R = 11, N = 15)
f2(65) = d = 1
Attempt to store key in array elements (t+d)%N, (t+2d)%N, (t+3d)%N, …
Attempts: t = 5 (35), (t+1)%15 = 6 (36), (t+2)%15 = 7 (empty) → store 65 at slot 7.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36  65   _  129   25  2501    _    _    _
Double Hashing
If the hash table is not full, attempt to store key in array elements
(t+d)%N, (t+2d)%N, …
f2(x) = 11 − (x % 11), so f2(29) = d = 4.
But f(29) = 14 and slot 14 is empty => no collision => f2 is not needed this time.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _   _  47   _   _  35  36  65   _  129   25  2501    _    _   29

Where would you store: 29, 16, 14, 99, 127 ?
Double Hashing
If the hash table is not full, attempt to store key in array elements
(t+d)%N, (t+2d)%N, …
Let f2(x) = 11 − (x % 11), so f2(16) = d = 6.
But f(16) = 1 and slot 1 is empty => no collision.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:     _  16  47   _   _  35  36  65   _  129   25  2501    _    _   29

Where would you store: 16, 14, 99, 127 ?
Double Hashing
If the hash table is not full, attempt to store key in array elements
(t+d)%N, (t+2d)%N, …
Let f2(x) = 11 − (x % 11), so f2(14) = d = 8.
f(14) = 14 => collision (14 initially hashes to 14%15 = 14, where 29 is stored).
Attempts: (14+8)%15 = 7 (65), (14+16)%15 = 0 (empty) → store 14 at slot 0.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    14  16  47   _   _  35  36  65   _  129   25  2501    _    _   29

Where would you store: 14, 99, 127 ?
Double Hashing
If the hash table is not full, attempt to store key in array elements
(t+d)%N, (t+2d)%N, …
Let f2(x) = 11 − (x % 11), so f2(99) = d = 11.
f(99) = 9 => collision (129 is there) => first application of the secondary hash function.
Attempts: (9+11)%15 = 5 (35), (9+22)%15 = 1 (16), (9+33)%15 = 12 (empty) → store 99 at slot 12.

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    14  16  47   _   _  35  36  65   _  129   25  2501   99    _   29

Where would you store: 99, 127 ?
Double Hashing
If the hash table is not full, attempt to store key in array elements
(t+d)%N, (t+2d)%N, …
Where would you store: 127 ?
Let f2(x) = 11 − (x % 11), so f2(127) = d = 5.
f(127) = 7 => collision (65 is there).
Attempts: (7+5)%15 = 12 (99), (7+10)%15 = 2 (47), (7+15)%15 = 7 again – with d = 5 and
N = 15 the probe sequence only cycles through slots 12, 2, and 7, so it never reaches an
empty slot and 127 cannot be inserted. (This is one reason to choose N prime; see Conclusions.)

Array:
index:   0   1   2   3   4   5   6   7   8    9   10    11   12   13   14
key:    14  16  47   _   _  35  36  65   _  129   25  2501   99    _   29
Delete Element
Collision Ramifications
• Chaining
• Linear Probing
• Quadratic Probing
• Double Hashing

Deletion Issues
Chaining/Buckets
• Deletion from the linked list – no issues
Linear Probing
• Issue: removal of an element within a cluster
Deletion Issues
• The location where the element was must not be left as an ordinary "empty spot",
since that could interfere with searches and deletions (why not insertions?).
• The location must be marked in some special way so that a search can tell that
the spot used to have something in it.
[Figure: a large hash table of numeric records (slots [0] through [700], a few of them
occupied), illustrating a removed element in the middle of a probe sequence.]
Deletion Issues
General Solutions
• Each slot can be marked as empty, deleted, or occupied (see the sketch below)
• Cascade the following elements one slot back, according to the collision handling scheme
• Remove all successor elements and reinsert them
• Keep a second table of deleted items
– commonly used in search engines
• Rebuild the hash table
• Move the element at the end of the cluster to fill the slot
• Some of these are not foolproof!
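A sketch of the first option above (lazy deletion with empty/deleted/occupied marks) for a linear-probing table; the enum and names are illustrative:

#include <vector>

enum class SlotState { Empty, Occupied, Deleted };

struct Slot {
    int key = 0;
    SlotState state = SlotState::Empty;
};

// Linear-probing search that probes past Deleted slots instead of stopping at them.
bool findWithTombstones(const std::vector<Slot>& table, int x) {
    const int N = static_cast<int>(table.size());
    const int t = x % N;
    for (int i = 0; i < N; ++i) {
        const Slot& s = table[(t + i) % N];
        if (s.state == SlotState::Empty) return false;              // a true gap ends the search
        if (s.state == SlotState::Occupied && s.key == x) return true;
        // Deleted: keep probing; remove() only flips the state to Deleted
    }
    return false;
}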
Searching For a Key
Elephant in the room
• If there's a collision, how do you know when you have found your item, since several
keys mapped to the same slot?
• Compare the stored key against the key you are searching for as you walk the chain
or probe sequence.
• The key is assumed to be a unique value.
Keep in mind:
• The hash function maps sparse data into an array or similar structure
• It is applied to the key (see the sketch below)
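A small illustration of that point: the hash is computed from the key alone, and a probe stops only when the stored key compares equal. All names here are assumptions for the sketch, using linear probing as above:

#include <optional>
#include <string>
#include <vector>

struct Record {
    int key = 0;            // the hash function is applied to this key only
    std::string value;
    bool used = false;
};

// Linear-probing lookup: stop at an empty slot, otherwise compare keys until ours is found.
std::optional<std::string> findValue(const std::vector<Record>& table, int key) {
    const int N = static_cast<int>(table.size());
    const int t = key % N;
    for (int i = 0; i < N; ++i) {
        const Record& r = table[(t + i) % N];
        if (!r.used) return std::nullopt;        // empty slot: the key is not in the table
        if (r.key == key) return r.value;        // keys are unique, so this is our item
    }
    return std::nullopt;
}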
Hash Tables in C++ STL
• The SGI/pre-standard STL provides hash table containers
– hash_set
– hash_map
• (Since C++11, the standard equivalents are std::unordered_set and std::unordered_map.)
Hash Set in STL
// hash_set is the pre-standard SGI extension; includes added so the example compiles there.
#include <hash_set>
#include <cstring>
#include <iostream>
using namespace std;

struct eqstr
{
    bool operator()(const char* s1, const char* s2) const
    {
        return strcmp(s1, s2) == 0;
    }
};

// hash_set template parameters: Key, Hash fn, Key equality test
void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
            const char* word)
{
    hash_set<const char*, hash<const char*>, eqstr>::const_iterator it
        = Set.find(word);
    cout << word << ": "
         << (it != Set.end() ? "present" : "not present")
         << endl;
}

int main()
{
    hash_set<const char*, hash<const char*>, eqstr> Set;
    Set.insert("kiwi");
    lookup(Set, "kiwi");
}
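For comparison, here is the same lookup written against the standard C++11 container; this is an aside, not part of the original slides:

#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> set;     // std::hash<std::string> and operator== are used by default
    set.insert("kiwi");
    std::cout << "kiwi: "
              << (set.count("kiwi") ? "present" : "not present")
              << std::endl;
    return 0;
}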
Hash Map in STL
#include <hash_map>
#include <cstring>
#include <iostream>
using namespace std;

struct eqstr
{
    bool operator()(const char* s1, const char* s2) const
    {
        return strcmp(s1, s2) == 0;
    }
};

// hash_map template parameters: Key, Data, Hash fn, Key equality test
int main()
{
    hash_map<const char*, int, hash<const char*>, eqstr> months;
    // operator[] is internally treated like insert
    // (or overwrite if the key is already present)
    months["january"] = 31;
    months["february"] = 28;
    // …
    months["december"] = 31;
    cout << "january -> " << months["january"] << endl;
}
Simple Hashes
• It's possible to have very simple hash functions if
you are certain of your keys
• For example,
– suppose we know that the keys s will be real numbers uniformly distributed over 0 ≤ s < 1
– Then a very fast, very good hash function is
• hash(s) = floor(s·m)
• where m is the size of the table
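A one-line version of that hash, as a sketch:

#include <cmath>

// hash(s) = floor(s * m) for keys s uniformly distributed over 0 <= s < 1
int hashReal(double s, int m) {
    return static_cast<int>(std::floor(s * m));   // result lands uniformly in 0 .. m-1
}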
Nonnumerical Keys
• Many hash functions assume that the universe of keys
is the natural numbers N={0,1,…}
• Need to find a function to convert the actual key to a
natural number quickly and effectively before or
during the hash calculation
• Generally work with the ASCII character codes when
converting strings to numbers
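One common way to fold the character codes into a number before reducing modulo the table size (a sketch; the base 37 is a typical textbook choice, not prescribed here):

#include <string>

// Fold the ASCII codes of the characters into one number (base-37 polynomial),
// then reduce modulo the table size.
unsigned int hashString(const std::string& key, unsigned int tableSize) {
    unsigned int h = 0;
    for (unsigned char c : key)
        h = 37 * h + c;        // unsigned overflow simply wraps, which is fine here
    return h % tableSize;
}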
Load Factor of a Hash Table
• Let N = number of items to be stored
• Load factor λ = N/TableSize
– TableSize = 101 and N = 505, then λ = 5
– TableSize = 101 and N = 10, then λ ≈ 0.1
• Average length of a chained list = λ, and so the average time for accessing an item is
O(1) + O(λ)
– Want λ to be smaller than 1 but close to 1 if we have a good hashing function (i.e., TableSize ≈ N)
– With chaining, hashing continues to work for λ > 1
Conclusions
• Best to choose the table size N to be a prime number
• Issue of load factor λ = fraction of hash table occupied
– should rehash when the load factor is between 0.5 and 0.7
• Rehashing (see the sketch below)
– approximately double the size of the hash table (select N = nearest prime)
– redefine the hash function(s)
– rehash keys into the new table
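A rehashing sketch along those lines for a chained table (the helper names are illustrative):

#include <list>
#include <vector>

// Smallest prime >= n (simple trial division is fine at table-size scale).
int nextPrime(int n) {
    auto isPrime = [](int p) {
        if (p < 2) return false;
        for (int d = 2; d * d <= p; ++d)
            if (p % d == 0) return false;
        return true;
    };
    while (!isPrime(n)) ++n;
    return n;
}

// Rehash a chained table: roughly double its size (nearest prime) and re-insert every key.
void rehash(std::vector<std::list<int>>& buckets) {
    const int newSize = nextPrime(2 * static_cast<int>(buckets.size()));
    std::vector<std::list<int>> bigger(newSize);
    for (const auto& chain : buckets)
        for (int x : chain)
            bigger[x % newSize].push_back(x);     // the hash must be recomputed with the new size
    buckets.swap(bigger);
}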