Hashing – Part II

Download Report

Transcript Hashing – Part II

Hashing – Part II
CS 367 – Introduction to Data Structures
Collisions
• All the hashing techniques we have talked about
so far are hashing more possible keys than
array elements
– Since all keys must hash into the array, multiple keys
must hash to the same index
• Example
–
–
–
–
–
consider a single class with 25 students
records for each student are stored in a hash table
each student is assigned a 2 digit ID number
the ID number is used as the key for the hash table
while the array for records has only 25 indexes, there
are 100 of possible ID combinations
– that means every index can be hashed to by 4
different keys
Collisions
Combinations of all possible ID #’s
00 01 02 03 04 05
95 96 97 98 99
. . .
50 % 25
25 % 25
0 % 25
75 % 25
0
1
23 24
. . .
Array of student records
Collisions
• Must find a way to deal with collisions
– first, must find a way to insert a key if another
key is already occupying the index
– secondly, must be able to find a key again if it
is not in the index it initially hashes to
Open Addressing
• If a key finds another key occupying its hashed
index, the node begins probing the array for an
open index
– it then inserts itself in the first open index that is found
• To find a key, simply look at its hashed index and
check the node
– if it matches the key we’re looking for, done
– if it doesn’t match, then start probing the array looking
for the key
• Obviously, the probing function for a key must
always be the same
– otherwise you will never find it again
Open Addressing
• A formal mathematical definition of open
addressing is as follows
– if a position h(key) is occupied, then perform
the following probing sequence
norm(h(key) + p(1)), norm(h(key) + p(2)), …, norm(h(key) + p(i)), …
– the p function is some type of probing function
– the normalizing function is usually just modulo
the table size
• need to make sure that the probing function “wraps
around” and doesn’t go past the array indexes
Probing Functions
• Linear probing is the simplest type of
probing function
– p(i) = i
– in other words, if the index is currently used,
try the next index + 1, and then the index +2,
and so on
Linear Probing
h(A3) = 3
h(A5) = 5
h(B3) = 3, (h(B3) + 1) % 7 = 4
h(B5) = 5, (h(B5) + 1) % 7 = 6
h(C3) = 3, (h(C3) + 1) % 7 = 4, (h(C3) + 2) % 7 = 5, (h(C3) + 3) % 7 = 6,
(h(C3) + 4) % 7 = 0
0
C3
1
2
3
4
5
6
A3 B3 A5 B5
Hash Table
Clustering
• Clustering is a major problem with linear
probing
– it leads to long groups of filled indexes
– the longer the group, the higher the probability
another hash value will end up in the cluster
• hence making the cluster even longer
– the longer the group, the longer it takes for a
probing process to find an unused index
• Need to find a better probing function
Quadratic Function
• One idea is to probe ahead of the hashed
index, then probe behind, then ahead, …
– to make this work even better, every time the
function goes ahead, it increases the distance
by more than just one
• Quadratic Function
p(i) = h(key) + (-1)i-1((i + 1)/2)2 for i = 1, 2, …, tableSize – 1
– this function leads to the following pattern
h(key), h(key) + 1, h(key) – 1, h(key) + 4, h(key) – 4, h(key) + 9, h(key) – 9, …
– once again, all of these values should be normalized
Quadratic Function
h(A3) = 3
h(A5) = 5
h(B3) = 3, (h(B3) + 1) % 7 = 4
h(B5) = 5, (h(B5) + 1) % 7 = 6
h(C3) = 3, (h(C3) + 1) % 7 = 4, (h(C3) - 1) % 7 = 2
0
1
2
3
4
5
6
C3 A3 B3 A5 B5
Hash Table
Quadratic Function
• Notice that the cluster still exists but the
search time for C3 was drastically reduced
– normally, the clustering will not be as bad with
this equation
– problem is, keys hashed to the same spot
follow the same probing sequence
• known as secondary clustering
• One important note, the size of the array
should be an odd number
– why?
Double Hashing
• Double hashing is the best way to avoid
clustering
– in double clustering, a second hash function is
included in the algorithm
– if the primary hash function collides with
another index, the second hash function is
used to make sure this second key follows a
different route to find an open index
• assumes that if two keys hash to the same index
with one hash function, they will not hash to the
same index with the next hash function
Double Hashing
• Mathematically the sequence is
– h(key), h(key) + hs(key), h(key) + 2hs(key), …
– all of these terms should be modulo table size
• The trick is to pick a secondary hash
function that is likely to lead to different
values for two keys were the primary hash
function led to the same value
Double Hashing
• Example
– imagine hashing strings
– primary hash function adds all ASCII values
– secondary hash function uses ASCII value of
second character only
– both “amy” and “may” will hash to the same
primary value
– however, ‘m’ and ‘a’ will hash to different
values and lead to different attempts on the
second try to find an open space
Search Times
• Consider the search times for the previous
examples
– if the item is in the hashed index, only one
access to array
– if the hash table were full and the key not in
the table, would have to access every
element in the table to determine that
– in general, the more data in the table, the
longer the search time and the greater the
chance for collisions
Chaining
• Chaining is a possible solution to growing
access times and insert times
• The idea is that every element in the array
is a pointer to the head of a linked list
– all items that hash to this index are inserted at
the tail of this list
– the search time for an empty element to insert
into is eliminated
– no clustering because all items go where they
are hashed
Chaining
h(A3) = 3
h(A5) = 5
h(B3) = 3
h(B5) = 5
h(C3) = 3
0
1
2
3
4
5
A3
A5
B3
B5
C3
6
Bucket Addressing
• Bucket addressing is almost exactly like
open addressing except space for more
than one item is set aside
– each element has space for multiple keys
– basically implementing a 2-D array
• the hash function returns a row to put the data in
• the first available column in the row is used
• Still have the potential for overflowing into
other buckets
– if all columns are in use, need to use probing
to find a row with a free column
Bucket Addressing
0
1
3
A3
B3
4
C3
5
A5
0
h(A3) = 3
h(A5) = 5
h(B3) = 3
h(B5) = 5
h(C3) = 3, (h(C3) + 1) % 7 = 4
1
2
6
B5