Ch11. Skip Lists and Hashing
Overview
Overview(1)
O(1) access to files
Variation of the relative file
The record number for a record is not arbitrary; rather, it is obtained by applying a hashing function H to the primary key: H(key)
The record numbers generated should be uniformly and randomly distributed, with 0 <= H(key) < N
Overview(2)
A hash function is like a black box that produces an address every time you drop in a key
All parts of the key should be used by the hashing function H, so that many records with similar keys do not all hash to the same location
Given two random keys X and Y and N slots, the probability that H(X) = H(Y) is 1/N; in this case, X and Y are called synonyms and a collision occurs
11.1 Introduction
Hash function h(k) : transforms a key K into an address
Hash vs. other indexes :
Sequential search : O(N)
Binary search : O(log2 N)
B (B+) tree index : O(logk N), where k is the number of records in an index node
Hash : O(1)
A Simple Hashing Scheme(1)
Name     ASCII code for first two letters   Product           Home address
BALL     66 65                              66 x 65 = 4,290   4,290
LOWELL   76 79                              76 x 79 = 6,004   6,004
TREE     84 82                              84 x 82 = 6,888   6,888
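A minimal C sketch of the scheme in this table; the function name simple_hash is an illustrative assumption:

```c
#include <stdio.h>

/* Multiply the ASCII codes of the first two letters of the key,
   exactly as in the table above. */
static long simple_hash(const char *key)
{
    return (long)key[0] * (long)key[1];
}

int main(void)
{
    printf("%ld\n", simple_hash("BALL"));   /* 66 x 65 = 4290 */
    printf("%ld\n", simple_hash("LOWELL")); /* 76 x 79 = 6004 */
    printf("%ld\n", simple_hash("TREE"));   /* 84 x 82 = 6888 */
    return 0;
}
```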
A Simple Hashing Scheme(2)
Key K = LOWELL is fed to the hash routine h(K), which produces address 4; the record LOWELL... is therefore stored at address 4, LOWELL's home address.
Idea behind Hash-based Files
All records with hash key h are stored in node h
Primary blocks of data-level nodes are stored sequentially
The contents of the root node can be expressed by a simple function: address of the data-level node for a record with primary key k = address of node 0 + H(k)
In the literature on hash-based files, the primary blocks of data-level nodes are called buckets
e.g. Hash-based File
[Figure: a hash-based file; a root node maps hash values 0 through 7 to sequentially stored data-level nodes (buckets), each holding the keys that hash to it, e.g. keys 70, 15, 50 in bucket 0 and 30, 51 in bucket 1]
Hashing(1)
Hashing functions :
Consider a primary key consisting of a string of 12 letters and a file with 100,000 slots
Since 26^12 >> 10^5, synonyms (collisions) are inevitable!
Hashing(2)
Possible means of hashing
First, because 12 characters = 12 bytes = 3 (32-bit) words, partition the key into 3 words and fold them together: either
(a) add modulo 2^32, or
(b) combine using exclusive OR, or
(c) invent your own method
Next, let R = V mod N, where
R = record number
V = value obtained above
N = number of slots
so that 0 <= R < N
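A minimal C sketch of this word-folding scheme, assuming the key is blank-padded to exactly 12 bytes; the name fold_hash is illustrative, and N = 100,000 follows the example above. XOR is shown, with add-modulo-2^32 as the commented alternative:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N 100000  /* number of slots */

static uint32_t fold_hash(const char key[12])
{
    uint32_t w[3];
    memcpy(w, key, 12);                  /* partition into 3 (32-bit) words */
    uint32_t v = w[0] ^ w[1] ^ w[2];     /* (b) combine using exclusive OR */
    /* uint32_t v = w[0] + w[1] + w[2];     (a) add modulo 2^32 */
    return v % N;                        /* R = V mod N, so 0 <= R < N */
}

int main(void)
{
    char key[12];
    memset(key, ' ', sizeof key);        /* pad the key with blanks */
    memcpy(key, "LOWELL", 6);
    printf("R = %u\n", fold_hash(key));
    return 0;
}
```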
Hashing(3)
If M = number of records, N = number of available slots, and
P(k) = probability of k records hashing to the same slot, then

P(k) = C(M,k) (1/N)^k (1 - 1/N)^(M-k) ≈ f^k e^(-f) / k!

where f = M/N is the loading factor
As f -> 1, we know that p(0) -> 1/e and p(1) -> 1/e
The other (1 - 1/e) of the records must hash into (1 - 2/e) of the slots, for an average of about 2.4 records per slot
So many synonyms!!
Collision
Situation in which a record is hashed to an address
that does not have sufficient room to store the
record
Perfect hashing : impossible!
Different keys, same hash value
(different records, same address)
Solutions
Spread out the records
Use extra memory
Put more than one record at a single address
11.2 A Simple Hashing Algorithm
Step 1. Represent the key in numerical form
If the key is a string : take the ASCII code
If the key is a number : nothing to be done
e.g. LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32
(L O W E L L followed by 6 blanks)
Step 2. Fold and Add
Fold :
76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
Add the parts into one integer
(Suppose we use a 15-bit integer representation; 32,767 is the limit)
7679 + 8769 + 7676 + 3232 + 3232 + 3232 = 33,820 > 32,767 (overflow!)
Largest addend : 9090 ('ZZ')
Largest allowable intermediate result : 32,767 - 9090 = 23,677
Ensure no intermediate sum exceeds this by taking each partial sum mod 19,937 (a prime below 23,677) :
( 7679 + 8769 ) mod 19937 = 16448
( 16448 + 7676 ) mod 19937 = 4187
( 4187 + 3232 ) mod 19937 = 7419
( 7419 + 3232 ) mod 19937 = 10651
( 10651 + 3232 ) mod 19937 = 13883
Step 3. Divide by the size of the address space
a = s mod n
a : home address
s : the sum produced in step 2
n : the number of addresses in the file
e.g. a = 13883 mod 100 = 83 (s = 13883, n = 100)
A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than nonprimes do
So we choose a prime number as close as possible to the desired size of the address space (e.g., for a file of 75 records, a good choice for n is 101; the file will then be 74.3 percent full)
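Putting the three steps together, a C sketch of the whole algorithm; the function name hash_key is an assumption, while the blank padding, the fold into two-character pairs, the intermediate mod 19,937, and the final mod n all follow the slides:

```c
#include <stdio.h>
#include <string.h>

static int hash_key(const char *key, int n)
{
    /* Step 1: numerical form, padded with blanks to 12 characters */
    char padded[12];
    size_t len = strlen(key);
    memset(padded, ' ', sizeof padded);
    memcpy(padded, key, len < 12 ? len : 12);

    /* Step 2: fold into two-character pairs and add, keeping every
       intermediate sum below the 15-bit limit via mod 19,937 */
    long sum = 0;
    for (int i = 0; i < 12; i += 2) {
        long pair = padded[i] * 100L + padded[i + 1];  /* "LO" -> 7679 */
        sum = (sum + pair) % 19937;
    }

    /* Step 3: divide by the size of the address space */
    return (int)(sum % n);
}

int main(void)
{
    printf("%d\n", hash_key("LOWELL", 100));  /* prints 83, as above */
    return 0;
}
```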
11.3 Hashing Functions and Record Distributions
Distributing records among addresses : three cases, for seven records A through G and ten addresses
Best : each record hashes to an address of its own, so there are no synonyms at all
Worst : all seven records hash to the same address
Acceptable : the records are spread over the addresses with only a few synonyms
Some other hashing methods
Better-than-random
Examine keys for a pattern
Fold parts of the key
Divide the key by a number
When the better-than-random methods do not work : randomize!
Square the key and take the middle (see the sketch below)
Radix transformation
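A minimal sketch of the "square the key and take the middle" (mid-square) idea, assuming a numeric key; which middle digits to keep is an illustrative choice:

```c
#include <stdio.h>

/* Square the key, drop the low digits, keep four middle digits,
   then fold the result into the n available addresses. */
static int mid_square(long key, int n)
{
    long sq = key * key;                /* e.g. 4731 * 4731 = 22,382,361 */
    long middle = (sq / 100) % 10000;   /* -> 3823 */
    return (int)(middle % n);
}

int main(void)
{
    printf("%d\n", mid_square(4731, 100));   /* prints 23 */
    return 0;
}
```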
11.4 How Much Extra Memory Should Be Used?
Packing density = (# of records) / (# of available spaces) = r / N
The more densely the records are packed, the more likely a collision will occur
Poisson Distribution
p(x) = ( (r/N)^x e^(-r/N) ) / x!   (Poisson distribution)
N = the number of available addresses
r = the number of records to be stored
x = the number of records assigned to a given address
p(x) : the probability that a given address will have x records assigned to it after the hashing function has been applied to all r records
(i.e., the probability that x records collide at that address)
Predicting Collisions for Different Packing Densities
# of addresses with no record assigned : N x p(0)
# of addresses with exactly one record assigned : N x p(1)
# of addresses with two or more records assigned : N x [p(2) + p(3) + p(4) + ...]
# of overflow records : 1 x N p(2) + 2 x N p(3) + 3 x N p(4) + ...
Percentage of overflow records : [1 x N p(2) + 2 x N p(3) + 3 x N p(4) + ...] / r x 100
The larger the address space, the fewer the overflow records (N addresses, r records, packing density = r/N) :

Packing density (%)   Synonyms as % of records
 10                    4.8
 20                    9.4
 30                   13.6
 40                   17.6
 50                   21.4
 60                   24.8
 70                   28.1
 80                   31.2
 90                   34.1
100                   36.8
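This table can be reproduced from the Poisson model above: the number of overflow records is N x [1 x p(2) + 2 x p(3) + ...], which has the closed form N x (f - 1 + p(0)) with f = r/N. A minimal C sketch (the output matches the table to within rounding):

```c
#include <math.h>
#include <stdio.h>

/* p(x) = f^x e^(-f) / x! for a Poisson distribution with mean f */
static double poisson(double f, int x)
{
    double p = exp(-f);
    for (int i = 1; i <= x; i++)
        p *= f / i;               /* multiply in f^x / x! term by term */
    return p;
}

int main(void)
{
    for (int d = 10; d <= 100; d += 10) {
        double f = d / 100.0;     /* packing density r/N */
        /* overflow records as a fraction of r: N(f - 1 + p(0)) / r */
        double overflow = (f - 1.0 + poisson(f, 0)) / f;
        printf("%3d%%  %4.1f\n", d, 100.0 * overflow);
    }
    return 0;
}
```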
11.5 Collision Resolution by Progressive Overflow
Progressive overflow (= linear probing)
To insert a new record :
1. Use the home address if it is empty
2. Otherwise, search the following addresses in sequence until an empty one is found
3. If the end of the address space is reached, wrap around to the beginning
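A minimal in-memory sketch of progressive overflow; the table size, the fixed-length string records, and the hash function used to compute home addresses are all illustrative assumptions:

```c
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 11

static char table[TABLE_SIZE][16];       /* empty slot = empty string */

static int hash(const char *key)         /* illustrative string hash */
{
    unsigned h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return (int)(h % TABLE_SIZE);
}

/* Take the home address if empty; otherwise search the following
   addresses in sequence, wrapping around at the end of the table. */
static int insert(const char *key)
{
    int home = hash(key);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int addr = (home + i) % TABLE_SIZE;
        if (table[addr][0] == '\0') {
            strncpy(table[addr], key, 15);
            return addr;
        }
    }
    return -1;                           /* no more space anywhere */
}

/* Search from the home address until the key or an open slot is met. */
static int search(const char *key)
{
    int home = hash(key);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int addr = (home + i) % TABLE_SIZE;
        if (table[addr][0] == '\0') return -1;    /* open address: absent */
        if (strcmp(table[addr], key) == 0) return addr;
    }
    return -1;                           /* worst case: file full, key absent */
}

int main(void)
{
    insert("Novak"); insert("Rosen"); insert("Jasper"); insert("York");
    printf("York stored at address %d\n", search("York"));
    return 0;
}
```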
Progressive Overflow(Cont’d)
Key York is fed to the hash routine, which produces home address 6. Address 6 (Rosen...) is busy, as are the 2nd try, address 7 (Jasper...), and the 3rd try, address 8 (Morely...). The 4th try, address 9, is open, so address 9 becomes York's actual address. (Address 5 holds Novak...)
Progressive Overflow(Cont'd)
Key Blue hashes to home address 99, the last address in the file. Address 99 (Jello...) is busy, so the search wraps around to address 0 and continues from the top of the file.
Progressive Overflow(Cont'd)
To search for a record with hash value k :
start from home address k and look at successive records until the record is found or an open (empty) address is encountered
Worst case : the record does not exist and the file is full
Progressive Overflow(Cont'd)
Search length : # of accesses required to retrieve a record (from secondary memory)

Actual address   Record      Home address   Search length
20               Adams...    20             1
21               Bates...    21             1
22               Cole...     21             2
23               Dean...     22             2
24               Evans...    20             5

Average search length = total search length / total # of records = (1+1+2+2+5)/5 = 2.2
Progressive Overflow(Cont'd)
With a perfect hashing function, the average search length is 1
Average search lengths of no greater than 2.0 are generally considered acceptable
[Figure: average search length (1 to 5) versus packing density (20% to 100%); the curve rises slowly at low densities, then climbs steeply as the packing density approaches 100%]
11.6 Storing More Than One Record per Address : Buckets
Bucket : a block of records sharing the same home address

Key     Home address
Green   30
Hall    30
Jenks   32
King    33
Land    33
Marks   33
Nutt    33

Bucket address   Bucket contents
30               Green...  Hall...
31               (empty)
32               Jenks...
33               King...  Land...  Marks...   (Nutt... is an overflow record)
Effects of Buckets on Performance
N : # of addresses
b : # of records that fit in a bucket
bN : # of available locations for records
Packing density = r / bN
# of overflow records : N x [1 x p(b+1) + 2 x p(b+2) + 3 x p(b+3) + ...]
As the bucket size gets larger, performance continues to improve, as the sketch below shows
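A C sketch of this formula; the 100-term cutoff for the Poisson tail is an assumption (the tail is negligible here). Holding the packing density r/bN at 85% while b grows shows the improvement:

```c
#include <math.h>
#include <stdio.h>

/* Expected overflow records, as a percentage of the r records, for
   N buckets of size b: 100/r x N x sum over k>b of (k - b) p(k),
   where p is Poisson with mean f = r/N. */
static double overflow_percent(double r, double N, int b)
{
    double f = r / N;
    double p = exp(-f);               /* p(0) */
    double over = 0.0;
    for (int k = 1; k <= 100; k++) {
        p *= f / k;                   /* now p = p(k) */
        if (k > b)
            over += (k - b) * p;
    }
    return 100.0 * N * over / r;
}

int main(void)
{
    for (int b = 1; b <= 10; b++) {
        double r = 0.85 * b * 1000;   /* packing density r/bN = 0.85 */
        printf("b = %2d : %4.1f%% overflow\n", b, overflow_percent(r, 1000, b));
    }
    return 0;
}
```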
Bucket Implementation
Each bucket starts with a collision counter (<= bucket size) recording how many of its slots are in use :

0   /////  /////  /////  /////  /////          An empty bucket
2   JONES  ARNSWORTH  /////  /////  /////      Two entries
5   JONES  ARNSWORTH  STOCKTON  BRICE  TROOP   A full bucket
Bucket Implementation(Cont'd)
Initializing and loading
Create the empty space (empty buckets) first
Use the hash value to find the bucket in which to store each record
If the home bucket is full, continue to look at successive buckets
Problems arise when :
No empty space exists
Duplicate keys occur
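A minimal sketch of loading a bucket file; the bucket count, bucket size, and record format are illustrative assumptions:

```c
#include <stdio.h>
#include <string.h>

#define NBUCKETS    4
#define BUCKET_SIZE 3

struct bucket {
    int  count;                    /* collision counter: slots in use */
    char recs[BUCKET_SIZE][16];
};

static struct bucket file[NBUCKETS];   /* the initialized empty space */

/* Store the record in its home bucket if there is room; otherwise
   continue to look at successive buckets, wrapping around. */
static int store(const char *key, int home)
{
    for (int i = 0; i < NBUCKETS; i++) {
        struct bucket *b = &file[(home + i) % NBUCKETS];
        if (b->count < BUCKET_SIZE) {
            strncpy(b->recs[b->count++], key, 15);
            return (home + i) % NBUCKETS;
        }
    }
    return -1;                     /* no empty space exists anywhere */
}

int main(void)
{
    printf("%d\n", store("King", 3));
    printf("%d\n", store("Land", 3));
    printf("%d\n", store("Marks", 3));
    printf("%d\n", store("Nutt", 3));  /* bucket 3 full: wraps to bucket 0 */
    return 0;
}
```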
11.7 Making Deletions
The slot freed by a deletion hinders later searches : a search that hits the freed (empty) slot stops too early and misses records stored beyond it
Use tombstones to mark deleted slots, and reuse the freed slots for later insertions
Record    Home address   Actual address
Adams     5              5
Jones     6              6
Morris    6              7
Smith     5              8

After deleting Morris :
5  Adams...
6  Jones...
7  ######    <- a tombstone for Morris
8  Smith...
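A minimal sketch of deletion with tombstones; the table size and record format are illustrative, home addresses are passed in directly for brevity, and the ###### marker follows the figure above:

```c
#include <stdio.h>
#include <string.h>

#define SIZE 11

static char slots[SIZE][16];             /* empty slot = empty string */
static const char TOMBSTONE[] = "######";

/* Probe from the home address; an open slot ends the search, but a
   tombstone does not, so records stored beyond it stay reachable. */
static int find(const char *key, int home)
{
    for (int i = 0; i < SIZE; i++) {
        int a = (home + i) % SIZE;
        if (slots[a][0] == '\0') return -1;
        if (strcmp(slots[a], key) == 0) return a;
    }
    return -1;
}

static void del(const char *key, int home)
{
    int a = find(key, home);
    if (a >= 0)
        strcpy(slots[a], TOMBSTONE);     /* mark the slot, don't empty it */
}

int main(void)
{
    strcpy(slots[5], "Adams"); strcpy(slots[6], "Jones");
    strcpy(slots[7], "Morris"); strcpy(slots[8], "Smith");
    del("Morris", 6);
    /* Smith (home address 5) is still found: the probe passes the
       tombstone at address 7 instead of stopping there. */
    printf("Smith found at address %d\n", find("Smith", 5));
    return 0;
}
```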
11.8 Other Collision Resolution Techniques
Double hashing : avoid clustering by using a second hash function for overflow records
Chained progressive overflow : each home address contains a pointer to the next record with the same home address
Chaining with a separate overflow area : move all overflow records to a separate overflow area
Scatter tables : the hash file contains only pointers to records (like indexing)
Overflow File
When building the file, if a collision occurs, place the new synonym into a separate area of the file called the overflow section
This method is not recommended :
if the load factor is high, either this overflow section itself overflows, or it is organized sequentially and performance suffers;
if the load factor is low, much space is wasted
Linear Probing(1)
When a synonym is identified, search forward from the address
given by the hash function (the natural address) until an empty
slot is located, and store this record there
This is an example of open addressing (examining a predictable
sequence of slots for an empty one)
key  :  A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash :  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5

Inserting the keys in this sequence into a 19-slot memory space with linear probing gives:

addr :  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
        S  A  A  C  A  E  E  G  H  I  X  E  L  M  N  -  P  -  R
Rehashing(1)
Another form of open addressing
In linear probing, when a synonym occurs, the address is incremented by 1 and the next location searched
In rehashing, a second hash function determines the displacement : D = (FOLD(key) mod P) + 1, where P < N is another prime number
This method has the advantage of avoiding congestion, because synonyms under the first hash function likely use different displacements D, and so examine different sequences of slots
Rehashing(where P=3)(2)
key  :  A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash :  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
D    :  2  2  3  2  1  1  3  1  3  2  3  1  2  2  2  1  3

(taking FOLD(key) to be the key's alphabetic value, consistent with the hash row, so D = (key mod 3) + 1; each synonym probes with its own step instead of the fixed step 1)

Inserting the keys in this sequence into the 19-slot memory space gives:

addr :  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
        S  A  -  A  C  E  X  G  H  I  -  E  L  A  N  M  P  E  R
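A C sketch of rehashing on this example; it assumes FOLD(key) is the key's alphabetic value (consistent with the hash row above), so the displacement is D = (key mod 3) + 1:

```c
#include <stdio.h>

#define N 19   /* table size, as in the example */
#define P 3    /* the second prime, P < N */

static int table[N];                 /* 0 = empty slot */

/* The first hash gives the home address; the second hash gives the
   probe step D, so synonyms follow different probe sequences. */
static int insert(int key)
{
    int addr = key % N;
    int d = key % P + 1;             /* displacement D, between 1 and P */
    while (table[addr] != 0)
        addr = (addr + d) % N;       /* probe with step d, wrapping */
    table[addr] = key;
    return addr;
}

int main(void)
{
    /* A S E A R C H I N G E X A M P L E as alphabetic values */
    int keys[] = {1, 19, 5, 1, 18, 3, 8, 9, 14, 7, 5, 24, 1, 13, 16, 12, 5};
    for (int i = 0; i < 17; i++) {
        int k = keys[i];
        printf("%c -> %d\n", 'A' + k - 1, insert(k));
    }
    return 0;
}
```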
Chaining without Replacement(1)
Uses pointers to build linked lists within the file;
each linked list contains one set of synonyms;
the head of the list is the record at the natural address of these synonyms
When a record is added whose natural address is occupied, it is added to the list whose head is at that natural address
Linear probing is usually used to find where to put a new synonym, although rehashing could be used as well
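A minimal sketch of chaining without replacement; the slot array, the per-record next link, and numeric home addresses are illustrative assumptions. Run on the R1...R5 example of the next slide, it shows the lists coalescing:

```c
#include <stdio.h>
#include <string.h>

#define SIZE 11

static struct { char key[16]; int next; } slot[SIZE];

static void init(void)
{
    for (int i = 0; i < SIZE; i++) slot[i].next = -1;
}

/* Place a new synonym by linear probing, then append it to the list
   whose head is at the record's natural (home) address. */
static void insert(const char *key, int home)
{
    if (slot[home].key[0] == '\0') {         /* natural address is free */
        strncpy(slot[home].key, key, 15);
        return;
    }
    int s = (home + 1) % SIZE;               /* linear probe for a slot */
    while (slot[s].key[0] != '\0')
        s = (s + 1) % SIZE;
    strncpy(slot[s].key, key, 15);
    int tail = home;                         /* append to the chain */
    while (slot[tail].next != -1)
        tail = slot[tail].next;
    slot[tail].next = s;
}

int main(void)
{
    init();
    insert("R1", 3); insert("R2", 3);        /* natural address i = 3 */
    insert("R3", 4);                         /* natural address i+1 = 4 */
    insert("R4", 3); insert("R5", 4);
    for (int a = 3; a != -1; a = slot[a].next)
        printf("%s ", slot[a].key);          /* prints: R1 R2 R3 R4 R5 */
    printf("\n");                            /* the two lists coalesced */
    return 0;
}
```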
Chaining without Replacement(2)
A problem is that linked lists can coalesce:
Suppose that H(R1) = H(R2) = H(R4) = i and H(R3) =
H(R5) = i+1, and records are added in the order R1,
R2, R3, R4, R5.
Then the lists for natural addresses i and i+1
coalesce. Periodic reorganization shortens such lists.
Let FWD and BWD be forward and backward pointers
along these chains (doubly-linked)
Chaining without Replacement(3)
[Figure: addresses i and i+1 with chain-head markers H; after inserting R1 through R5 the chain starting at i runs R1 -> R2 -> R3 -> R4 -> R5, illustrating how the two lists coalesce]
Chaining with Replacement(4)
Eliminates the deletion problem that made abandonment necessary
Further reduces search lengths
When inserting a new record,
if the slot at its natural address is occupied by a record for which that slot is not the natural address,
then that record is relocated so the new record may be placed at its natural address
Synonym chains can never coalesce,
so a record can be deleted even if it is the head of a chain,
simply by moving the second record on the chain to its natural address (ABANDON thus is no longer necessary)
Chaining with Replacement(5)
After R1, R2 are added : chain R1 -> R2, headed at address i
After R3 hashes to i+1 : R2 (occupying i+1, which is not its natural address) is relocated, and R3 takes its natural address i+1
Finally, 2 chains : R1 - R2 - R4 and R3 - R5
Patterns of Record Access
Pareto Principle (80/20 rule of thumb) : 80% of the accesses are performed on 20% of the records!
The concept of "the vital few and the trivial many" :
20% of the fishermen catch 80% of the fish
20% of the burglars steal 80% of the loot
If we know the patterns of record access ahead of time, we can do many intelligent and effective things!
Sometimes we can know or guess the access patterns
Very useful hints for file systems or DBMSs
Intelligent placement of records :
faster accesses
fewer collisions