Hashing

Transcript Hashing

Hashing
Chapter 20
Hash Table
• A hash table is a data structure that allows
fast find, insert, and delete operations
(most of the time).
• The simplest way of implementation a hash
table is an array.
Example: Suppose we want to store integers
in the range of 0- 65535.
Create any array “a” of size 65536 with
indices in the range of 0-65535 and
initialize the array with all zeros.
insert (i) -a[i]++
find (i)
-Is a[i] > 0 ?
remove (i) -if ( a[i] > 0) a[i] --
• If our keys are 8-letter alphabetic words,
there are 268 or about 200 billion possible
keys [about 200 ‘gig’ of keys].
• Only a small fraction of these keys will
actually occur
• Conceptually, a very large array, with very
few cells occupied.
• We need a better way
• Allow many of the different possible keys
which can occur to be mapped to the same
location under the action of the index
function.
• A hash function takes a key and maps it to
some index (possibly a smaller index) in the
array.
• A collision occurs when the hash function
maps two actual keys to the same index.
Hash table operations
• Given a hash function hash(key) which
returns an integer: The simplistic approach
is as follows
– insert(key): A[hash(key)] = object to insert
– find(key):
is the object at A[hash(key)] ?
– remove(key): remove object at A[hash(key)]
• But what happens on collisions?
Choosing a hash function
• Desirable properties of a good hash function
– quick and easy to compute
– uniformly distributes the keys over the range of
indices
– minimizes collisions
Methods of building a hash
function
• Truncation
– ignore part of the key and use the remaining
part as the index
• Folding
– Partition the key into several parts and combine
these parts to obtain the index
• Modular arithmetic
– Convert the key to an integer and mod by the
table size
Example
• Given 8-digit integers and a table of size
1000
– Truncation
• e.g. -- use the 4th, 7th and 8th digits to form the
index: hash(62538194) = 394
– Folding
• e.g. -- break into groups of 3, 3, and 2 digits, add the
parts and truncate if necessary: hash(62538194) =
(625+381+94) mod 1000 = 1100 mod 1000 = 100
Example cont’d
– Modular arithmetic
• e.g. Simply mod by the table size: hash(62538194)
= 62538194 mod 1000 = 194
• It seems best to have a table size which is a prime
number for modular arithmetic, so a table size of
997 or 1009 would perform a little better.
• A combination of these techniques may be
even better
Collision Resolution
• Open addressing
– The table is an array which holds at most one
object per index -- contiguous storage
• Chaining
– The table is an array of chains, all elements on
a chain have the same index [these chains are
sometimes called “buckets”] -- dynamic storage
Open Addressing
• Linear probing
– This is the simplest method of collision
resolution
– Start with the hash index and perform a linear
search for the desired key or an empty location
– The table is considered circular, the search
wraps around from the last index to the first
Open Addressing
• Clustering
– The major drawback of linear probing is that
when the table becomes about half full, these is
a tendency toward clustering
– Clustering occurs when records start to appear
to as long strings of adjacent positions, which
may have several different hash values
– Linear searches for empty locations become
longer and longer
An Example
Insert the items 67, 89, 17, 20, 90, 19 into
an empty hash table using an array of size
10 and using the following hash function
hash (key) = key mod 10.
Use linear probing to handle collisions.
Open Addressing
• Other techniques of collision resolution
– Rehashing
• use a second hash function to find an alternative
position
– Quadratic Probing
• if hash(key) = h, probe at locations h+1, h+4, h+9,
h+16, etc. [i.e., locations h+i2 for i = 1,2,3,4,…]
– Random Probing
• use a seeded pseudo-random number generator to
obtain the increment
Open Addressing
• Deletions
– deletions with open addressing is awkward.
(why?)
– lazy deletion is the preferred means – that is,
making items as deleted rather than physically
removing them from the table.
Chaining
• Advantages to linked storage
– with a good hash function, the linked lists will
be short
– clustering is not a problem -- records with
different keys are on different chains
– The size of the table is of less concern
– Deletions are easy and efficient
– [The chains could be binary search trees or
other structures]
Load Factor
• The load factor of a hash table is the ratio of
the number of items in the table to size of
the hash table
–
–
–
–
–
n - the number of items in the table
t - the size of the hash table
the load factor  = n/t
 = 0 indicates an empty table
 = 0.5 indicates a table half full
Load Factor
• In open addressing,  may never exceed 1,
and in practice,  > 0.5 will begin to cause
clustering problems.
• In chaining, there is no limit to the size of .
Linear Probing
Theorem: The average number of cells examined
in an insertion using linear probing is
[1 + 1/(1 – k)^2]/2 where k is the load factor.
Theorem: The average number of cells examined
in a successful search is approximately
[1 + 1/(1 – k)]/2 where k is the load factor.
Quadratic Probing
Note that in linear probing, each probe tries a
different cell. Does quadratic probing
guarantees that, when a cell is tried, we
have not already tried it during the course of
the current access? Does quadratic probing
guarantees that, when we are inserting x and
the table is not full, x will be inserted?
Quadratic Probing
• Theorem: If quadratic probing is used and
the table size is prime, then a new element
can always be inserted if the table is at least
half empty. Furthermore, in the course of
the insertion, no cell is probed twice.
Hash Table Vs. BST
Insert and find operations can be implemented
using a BST with average insert/find time of
O(logn). However, a BST is generally a more
powerful data structure than a hash table as it
can easily support routines that require order,
for example, finding the smallest/largest
element.
Hash Table VS. BST
If the input is sorted, a BST will perform
poorly. Although balanced trees can be used
to avoid the O (n) time insert/find, they are
quite expensive to implement. Hence, if no
ordering information is required and there is
any suspicion that the input might be sorted,
hashing is the data structure of choice.
Applications of Hash Tables
Hash tables are used in implementing
• Symbol Tables
• Game Programs
• Spelling Checkers
HW: Problems 20.1-20.6 on page 710

Hashing

Transcript Hashing

Directory