Transcript Hashing

Hashing
General idea:
• Get a large array
• Design a hash function h which converts keys
into array indices .
• Insert (key, value) in the array at h(key)
position.
• Can also be used in external memory
(addresses instead of array indices).
Example
Suppose the size of the hash table is k.
How to map arbitrary large numbers to numbers in the range
0.....k-1?
A possible solution: division hashing.
index = largeNumber % k (remainder of division by k).
However, even this is not very practical (largeNumber is too large
in our case).
Another problem is that the same index may now be associated
with different words.
Requirements for Hash function
•
•
•
•
The following are general requirements for hash
functions:
If the hash table has size k, hash function should return
numbers in the range 0....k-1.
Hash function must be quick to calculate the value.
Hash function should minimize collisions (when several
keys are hashed to the same number).
The last requirement is satisfied if each key is equally
likely to map to any of the k indices (uniform
distribution). Usually we cannot completely avoid
collisions (as in the example with words).
Hashing Methods
•
•
•
•
Division method
Multiplication method
Folding method
Other hashing methods
Division method
• Algorithm: H(x) = x mod m + 1
Where m is some predetermined divisor integer (i.e., te
table size), x is the preconditioned key, and mod stands
for modulo. Note that adding 1 is only necessary if the
table starts at key 1 (if it starts at 0, the algorithm
simplifies to (H(x) = x mod m).
• So, in other words: given an item, divide the
preconditioned key of that item by the table size (+1).
The remainder is the hash key.
Division method
• Given a hash table with 10 buckets, what is the
hash key for 'Cat'? Since 'Cat' = 131130 when
converted to ASCII, then x = 131130.
We are given the table size (i.e., m = 10, starting
at 0 and ending at 9).
H(x) = x mod m
H(131130) = 131130 mod 10
= 0 (there is no remainder)
'Cat' is inserted into the table at address 0.
Multiplication method
• The method that is used in the slide is the
multiplication method. It multiplies all the individual
digits in the key together, and takes the remainder
after dividing the resulting number by the table size.
In functional notation, the algorithm is:
H(x) = (a * b * c * d *....) mod m
Where: m is the table size, a, b, c, d, etc. are the
individual digits of the item, and mod stands for
modulo.
Let's apply this algorithm to an example.
Example
• Given a hash table of ten buckets (0 through 9), what
is the hash key for 'Cat'? Since 'Cat' = 131130
when converted to ASCII, then x = 131130
We are given the table size (i.e., m = 10).
The constant can be any number we want, let's use
five (i.e., c = 5).
•
H(x) = (a * b * c * d *....) mod m
H(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10
= 0 mod 10
=0
'Cat' is inserted into the table at address 0.
Folding method
• Algorithm: H(x) = (a + b + c) mod m
Where a, b, and c represent the preconditioned key
broken down into three parts, m is the table size, and
mod stands for modulo.
In other words: the sum of three parts of the
preconditioned key is divided by the table size. The
remainder is the hash key.
Examples
• Fold the key 123456789 into a hash table of ten spaces
(0 through 9).
• We are given x = 123456789 and the table size (i.e., m
= 10).
Since we can break x into three parts any way we want
to, we will break it up evenly.
Thus a = 123, b = 456, and c = 789.
H(x) = (a + b + c) mod m
H(123456789) = (123+456+789) mod 10
= 1368 mod 10
=8
•
123456789 is inserted into the table at address 8.
Other hashing methods
• Now that we have shown, in detail, a few of the many
hashing methods which you can use, we will briefly discuss
the length-dependent method,
 midsquare method,
 digit-analysis and
 addition method.
Other hashing methods
• Midsquare mehod => Multiply x by itself (i.e., x2) and select a
number of digits from the middle of the result. How many
digits you select will depend on your table size and the size of
your hash key.
• Addition method =>The sum of the digits of the preconditioned
key is divided by the table size. The remainder is the hash key.
• Digit analysis method => Certain digits of the preconditioned
key are selected in a certain predetermined and consistent
pattern.
Perfect Hashing
• In exceptional circumstances, it is possible
to design a perfect hashing function which
avoids collisions.
• For example, if all items in the collection
are given in advance (e.g. a dictionary) and
a function is designed to suit this collection
and nothing else.
Handling Collisions
There are several methods of solving the problem of
collision:
• open addressing: probe array for the "next" slot which
is still empty. Different methods for finding the
"next" slot:
 linear probing
 quadratic probing
 double hashing
• separate chaining: each slot in the array contains a
reference to a bucket of items with the keys hashed to
that slot. Usually a linked list.
Open addressing
• Using the open addressing technique, after a collision, new
locations are examined until an empty one is found.
• The item to be inserted is placed here.
• If an item is deleted, then it is replaced with the word
"DELETED" .
• The sequence in which other locations are examined can be
determined in many different ways.
• For example, after a collision, new locations could be
searched from bottom to top, top to bottom, or by every 2
places.
Open addressing
•
•
•
•
•
Compute the index given the key
Go to the index
If it is free, insert the item
If it is not free, go to the next possible slot.
To find the next possible slot, generates a
"probing sequence".
Probing Sequence
• linear probing: search sequentially for the next free slot. The
simplest probing sequence is
h(k), h(k) + 1, h(k) + 2, ...
but in general We can do h(k), h(k) + c, h(k) + 2c, h(k) + 3c, ...
where c is some constant.
Problem: clustering.
If several keys are mapped to the same index, we get large clusters
of occupied cells with empty spaces in between. The larger the
cluster, the longer the search. Suppose k1, k2 and k3 are all
mapped to the same index. Then searching for/inserting (k1,v1) is
OK, inserting (k2,v2) takes an extra step, inserting (k3,v3) takes
extra two steps etc.:
Probing Sequence
K1, V1 K2, V2
K3, V3
Linear probing
• After first being located to an already occupied table
position, the item is sent sequentially down the table until an
empty space is found.
• If m is the number of possible positions in the table, then the
linear probe continues until either an empty spot is found, or
until m-1 locations have been searched.
• One thing that happens with linear probing however is
clustering, or bunching together.
• Primary clustering occurs when items are always hashed to
the same spot in the table, and the lookup requires searching
through 4 or 5 buckets before finding the empty spot.
Clustering
• A graphic might help explain this:
The opposite of clustering is uniform distribution.
Simply, this is when the hash function uniformly
distributes the items over the table.
Chaining
• A collision has occurred. The position resolved by the
resolution technique is already full, therefore the
element you wish to insert into the hash table cannot
go there without losing data. Separate chaining
simply makes a backup list of elements that would
have otherwise been inserted into a given position
had that position been empty.
Chaining
• For example, let us assume that position x in the table has
been previously filled by a hash insertion.
• Now, let us assume that the element we wish to insert into
the hash table resolves to position x as well.
• We cannot simply overwrite the element already in
position x.
• Each hash table entry requires a pointer to a linked list of
records that can hold the element data.
• This list is only used when a collision occurs.
• The element that we are inserting, E, is then appended
onto the list associated with the target position, x. The
result is a linked list associated with each table position,
each list containing only those elements that collided on
that table position with an element already in that position.
Chaining
• The worst case: all keys hashed to the same
slot. Items kept in a linked list of size N
(size of the collection), search O(N).
• Best case: hash function distributes the keys
uniformly, the size of the bucket is N/k.
Probing Sequence
• Quadratic probing - the offset is squared at every step.
Jumps around much more, but the items with the same
keys make exactly the same jumps.
• Secondary clustering: each new item with the same
hash value requires a longer probing sequence.
• Double hashing: offset is calculated by a second
hashing function h', which is different from the first
one:
h(k), h(k) + h'(k), h(k) + 2h'(k), h(k) + 3h'(k), ...
Solves the problem of clustering.
Quadratic Probing
• h(k,i) = (h(k) + c1i + c2 i2) mod m
• Only certain combination of c1, c2, and m use the entire
hash table.
• h(k1,0) = h(k2,0) implies h(k1,i) = h(k2,i). This leads to
secondary clustering.
• There are only m (<<m!) distinct probe sequences.
• Example:
h(k1,i) = (h(k) + i + i2) mod m, c1 = c2 = 1
In this example, the probe sequence is
h(k)h(k) + 2h(k) + 6h(k) + 12...
Double hashing
• Double hashing is another method of collision resolution,
but unlike the linear collision resolution, double hashing
uses a second hashing function which normally limits
multiple collisions.
• The idea is that if two values hash to the same spot in the
table, a constant can be calculated from the initial value
using the second hashing function which can then be used
to change the sequence of locations in the table, but still
have access to the entire table.
Useful rules for the table size
• Unless it is possible to map each key to a
unique index in the array, table size should be
a prime number (important if modulo division
is used; also for linear probing).
• If open addressing is used, table size should be
larger than the number of values stored.
• Optimal load factor for open addressing is 1/2
(the table twice larger than the number of
items). Load factor is the ratio of the number
of items to the size of the hash table.
Summary
• Hash tables are very useful when the size of
collection is known in advance and search,
insertion and deletion should be fast.
• If the table is large enough and the hash function is
chosen well, these operations are almost constant.
• Difficult to resize (copy into a larger array and
change the hashing function) and to iterate through
all items (array contains gaps).