Transcript Document

Introduction to Data Structure
CHAPTER 8
HASHING
8.1 Symbol Table Abstract Data Type
8.2 Static Hashing
8.3 Dynamic Hashing
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-1
Contents
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Basic Concepts
Arrays
Stacks and Queues
Linked Lists
Trees
Graph
Chapter 7 Sorting
Chapter 8 Hashing
Chapter 9 Heap Structures
Chapter 10 Search Structures
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-2
8.1 Symbol Table

Definition

A symbol table is a set of name-attribute pairs

E.g.,
• thesaurus (name: word, attribute: a list of synonyms)
• data dictionary in database management

General operations performed on a symbol table

determine if a particular name is in the table

retrieve the attribute of that name

modify the attributes of that name

insert a new name and its attributes

delete a name and its attributes

ADT of Symbol Table, p. 396
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Name
Attribute
Chapter 8 Hashing
8-3
Symbol Table (cont.)

Data structure for symbol table

Binary search tree
• Worst case complexities for the operations are O(n)
• Best case … are O(log n)

Hashing technique
• Can provide a good performance for search, insert, and delete
• Instead of using comparisons to perform search, hashing relies on a
formula called the hash function

Two types of hashing

static hashing

dynamic hashing
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-4
8.2 Static Hashing


Hash table

A fixed-size table used to store identifiers.

The memory available to maintain the symbol table (hash table) is
assumed to be sequential.

The hash table ht consists of b buckets and each bucket contains s
records.

An arithmetic function f(x) used to compute the address of location (0
through b-1) of an identifier, x
E.g.,

If the identifier is 6 alpha-numerical long, with the first one being a
letter. Then, the total of distinct identifiers are
T
 26 * 36
i
 1.6 *109
0i 5
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-5
Static Hashing (cont.)

Terminology





Identifier density: the identifier density of a hash table is the ratio n/T,
where
n: the number of identifiers in the table
T: the total number of possible identifiers
Loading density: the loading density or loading factor is
α= n/(sb)
Two identifiers, i1, and i2, are said to be synonyms with respect to f if
f(i1) = f(i2)
Hash overflow: an overflow occurs when a new identifier i is mapped or
hashed by f into a full bucket
Hash collision: a collision occurs when two non-identical identifiers are
hashed into the same bucket
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-6
Hash table: Example




A hash table ht with
b = 26 buckets and s
=2
# identifiers, n = 10
Loading factor a =
10/52 = 0.19
Hashing function
f(x) = the first
character of x
Slot 0
Slot 1
acos
atan
2
char
ceil
3
define
4
exp
5
float
0
1
floor
6
25
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-7
Hashing Functions

Definition


A hash function, f, transforms an identifier, x, into a bucket
address in the hash table.
Characteristics of ideal hashing functions

Easy to compute and result in very few collisions

Should provide a mechanism to handle overflow

Should not have bias on the use of the hash table

i.e., a random x has an equal chance of hashing into any of
the b buckets
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-8
Hash Function: Mid Square

Mid-Square function, fm



Since the middle bits of the square usually depend on all
the characters in the identifier, different identifiers are
expected to result in different hash addresses with high
probability.
The number of bits dependents on the table size


squaring the identifier and then using an appropriate number of
bits from the middle of the square to obtain the bucket address
e.g., if r bits are used, the range of values is 2r
Example

Let x = ‘A’ = 65, x2 = 4225 = 1000010000001
r = 5,  f(x) = 8
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-9
Hash Function: Division

Division function, fD

An identifier x is divided by some number M and the remainder is
used as the hash address for x.
fD(x) = x % M

The bucket addresses are in the range of 0 through M-1.

Example
Let x = ‘A’ = 65, M = 16. Then fD(x) = 65 % 16 = 1
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-10
Hash Function: Division (cont.)

Bad choice of M

If M is a power of 2, then fD(x) depends only on the least
significant bits of x.
=> identifiers with the same suffix would result in collisions
•
Example
x1 = ‘BA’,

x2 = ‘CA’
If M is divisible by two, the odd keys are mapped to odd buckets
and even keys are mapped to even buckets.
=> the hash table is biased

Good choice of M

M a prime number
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-11
Hash Function: Folding


Concept

The identifier x is partitioned into several parts, all but the last
being of the same length.

All partitions are added together to obtain the hash address for x.
Two methods

Method 1: Shift folding
• different partitions are added together to get f(x)

Method 2: Folding at the boundaries
• identifier is folded at the partition boundaries, and digits
•
falling into the same position are added together to obtain f(x)
similar to reversing every other partition and then adding
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-12
Hash Function: Folding (cont.)


Example

x = 123|203|241|112|20 are partitioned into three decimal digits long

x1 = 123, x2 = 203, x3 = 241, x4 = 112, x5 = 20
Shift folding:
5
f ( x )   xi  123  203  241  112  20  699
i 1

Folding at the boundaries
 x2 = 302, x4 = 211
5
f ( x )   xi  123  302  241  211  20  897
i 1
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-13
Digit Analysis


This method is useful when a static file where all the
identifiers in the table are known in advance.
Concept
Each identifier x is transformed into a number using some radix r.
 The digits of each identifier are examined.
 Digits having most skewed distributions are deleted.
 Enough digits are deleted so that the number of remaining digits is
ASCII
small enough to give an address in the range of the hash table.
Example
 x = ‘data’ ====> (1009711697)10
 Frequency: 0:2, 1: 3, 6:1, 7:2, 9:2
 Average frequency = 2, the most skewed digits: 1 and 6
  f(x) = (009797)10


Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-14
Overflow Handling

There are two ways to handle overflow

Open addressing
• Linear probing
• Quadratic probing
• Rehashing
• Random hashing
•…

Chaining
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-15
Open Addressing




Assumes the hash table is an array.
The hash table is initialized so that each slot contains the
null identifier.
When a new identifier is hashed into a full bucket, find the
closest unfilled bucket.
This is called linear probing or linear open addressing
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-16
Open Addressing (cont.)

Example
0
A
Assume 26-bucket table with one slot per bucket
and the following identifiers: GA, D, A, G, L, A2,
A1, A3, A4, Z, ZA, E.
1
A2
2
A1
3
D

Let the hash function f(x) = first character of x.
4
A3

When entering G, G collides with GA and is
entered at ht[7] instead of ht[6].
5
A4
6
GA
7
G
8
ZA
9
E

25
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-17
Linear Probing

When linear probing is used to handle overflows, a hash
table search for identifier x proceeds as follows:


compute f(x)
examine identifiers at positions ht[f(x)], ht[f(x)+1], …, ht[f(x)+j],
until one of the following condition happens:
• ht[f(x)+j]=x; in this case x is found
• ht[f(x)+j] is null; x is not in the table
• We return to the starting position f(x); the table is full and x is
not in the table
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-18
Linear Probing (cont.)

Problem with linear probing



Example



Identifiers tend to cluster together. Also adjacent clusters tend to
coalesce.
This increases search time.
To find ZA, you need to examine ht[25], ht[0], …, ht[8] (total of
10 comparisons).
To retrieve each identifier once, a total of 39 buckets are
examined (average 3.25 bucket per identifier).
The expected average number of identifier comparisons, p, to
look up an identifier is approximately (2 -α)/(2-2α), whereαis
the loading density.


α=12/26=.47 and p = 1.5.
Even though the average number of probes is small, the worst
case can be quite large.
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
0
A
1
A2
2
A1
3
D
4
A3
5
A4
6
GA
7
G
8
ZA
9
E
25
Chapter 8 Hashing
Z
8-19
Quadratic Probing



A quadratic probing scheme improves the growth of
clusters.
Concept

A quadratic function of i is used as the increment when
searching through buckets.

Perform search by examining bucket f(x), (f(x)+i2) % b, (f(x)-i2)
% b for 1 ≤ i ≤ (b-1)/2.
When b is a prime number of the form 4j+3, for j an
integer, the quadratic search examine every bucket in the
table.
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-20
Chaining



Motivation

Linear probing perform poorly because the search for an identifier
involves comparisons with identifiers that have different hash
values.

Unnecessary comparisons can be avoided if all the synonyms are
put in the same list, where one list per bucket.
As the size of the list is unknown before hand, it is best to use linked
chain.
Each chain has a head node. Head nodes are stored sequentially.
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-21
Chaining: Example
ht
0
1
2
3
4
5
6
7
8
9
10
11
25
A4
A3
A1
A2
A 0
0
0
D 0
0
E 0
G
GA 0
0
0
0
0
L 0
ZA
Z 0
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-22
Chaining (cont.)

The expected number of identifier comparisons can be
shown to be ~ 1 +α/2, where αis the loading density n/b
(b=number of head nodes).
 For α=0.5, it’s 1.25. And if α=1, then it’s about 1.5.

Another advantage of this scheme is that only the b head
nodes must be sequential and reserved at the beginning.
The scheme only allocates other nodes when they are
needed. This could reduce overall space requirement for
some load densities, despite of links.

Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin
Chapter 8 Hashing
8-23