Transcript Hashing

Hashing
A quick lookup strategy
Copyright © 2003-2005Curt Hill
What is a hash?
• Hashing is another name for key
transformation
• The original key is usually a
character string or other sparse key
• The result is usually a dense integer
key
Copyright © 2003-2005Curt Hill
Components
• Every hash scheme has three pieces
– A hash function
– A collection of buckets
– Collision scheme
• The hash function transforms the
key into an integer 0 – N
• The buckets are numbered 0 – N and
hold the resulting data
Copyright © 2003-2005Curt Hill
Basics
• The idea:
– Have N containers for the data
– Have a function that maps the original
key to a number in the range 0 – (N-1)
– One access retrieves the data
regardless of size
– When it works it is the fastest way to
lookup a value
Copyright © 2003-2005Curt Hill
Hash function
• Convert a sparse key into a dense
key in the form of an integer
• The input is the real key which is a
sparse key
– Sparse keys use very few of very many
possible key values
• Typical key is a character string
while the result needs an integer
Copyright © 2003-2005Curt Hill
Typical hash function
• Input is a character string
• Output is an integer in range 0 - N
• Sum the ordinal value of each
character of the string
• Divide the result by N and take the
remainder
– This guarantees the right range
• Number theory tells us to make N
prime
Copyright © 2003-2005Curt Hill
Example values with N = 256
•
•
•
•
•
•
•
•
“Abcdef” returns 53
“Hi there” returns 233
“ABCDEF” returns 149
“FEDCBA” returns 149
“A character string” returns 197
“A big character string” returns 23
“Zoology” returns 243
See the pattern yet?
Copyright © 2003-2005Curt Hill
Hash functions
• Computing the function is easy
• Two problems
– One result produced by two different
keys – called a collision
– The order of the original key is
scrambled in the result
• Very good for an equality test and
very bad for a range test
Copyright © 2003-2005Curt Hill
Is it that simple?
• Not usually
• The previous scheme maps all
words with exactly the same
characters to the same integer value
• Many variation
– Do a different computation on even and
odd characters so a re-arrangment will
produce different values
Copyright © 2003-2005Curt Hill
Collisions
• Two keys giving the same integer
– Assume each bucket may only hold one
record
• Prevent hashing from the optimal
search technique it could be
• Almost every hashing scheme needs
a collision strategy
• There have been many variations
Copyright © 2003-2005Curt Hill
Collision Strategies
• Linear probing
– Add one to index until right one is found
• Quadratic probing
– Square the index when the collision
occurs
• Rehash
– Secondary hash function
• Chaining
– Link duplicates in an overflow area
Copyright © 2003-2005Curt Hill
Collision Strategies
Linear
0
Quadratic
Chaining
0
6 y
0
2 e
1
1d
1
1 d
1
2
2 e
2
2 e
2
2 t
3
2 t
3
3
6 y
4
4
2 t
4
5
5 c
5
5 c
5
6
6 m
6
6 m
6
7
6 y
7
7
Copyright © 2003-2005Curt Hill
1 d
5 c
6 m
Free Space
• As the free space decreases the
collisions increase
• The best plan is to have about twice
as many buckets as will be needed
Copyright © 2003-2005Curt Hill
Effect of load using linear
probing
Load factor Number of probes
10%
1.06
25%
1.17
50%
1.5
75%
2.5
90%
5.5
95%
10.5
Copyright © 2003-2005Curt Hill
Dynamics
• A hash table is much harder to both grow
and shrink
• Increasing or decreasing the number of
buckets requires:
– Allocate a new table (larger or smaller)
– Rehashing every item in old into the new
• Deleting a record (without resizing) is
also a problem because of collisions
Copyright © 2003-2005Curt Hill
Deletion of a key
• The problem is collisions
• If the key has no collisions just mark
the bucket as empty
• If it has collisions
– All of them need to be rehashed
– Finding them depends on the collision
strategy
Copyright © 2003-2005Curt Hill
Two more terms
• A perfect hash is one that has no
collisions
– Only occurs in situations with two
conditions:
– Keys are fixed and known in advance
– The hash function is tailored to these
keys
• A minimal hash has no empty
buckets
Copyright © 2003-2005Curt Hill
Data Skew
• A good hash function spreads the
keys uniformly among the integer
range
• How well does this work when the
data is not uniformly distributed?
• For example consider
– Names - There are many more Smiths
than Garnjobsts
– Numbers – there are many more 101
courses than most other numbers
• Can the hash function still give a
good distribution?
Copyright © 2003-2005Curt Hill
Summary
• Hashing works best when it works
well
– Stable index
– Good distribution from the hash
function
• It is much harder to make dynamic
than trees
Copyright © 2003-2005Curt Hill