Hashed files

Transcript Hashed files

ECE 569
Database System Engineering
Spring 2003
Yanyong Zhang www.ece.rutgers.edu/~yyzhang
Course URL www.ece.rutgers.edu/~yyzhang/spring03
ECE569 Lecture 04-2.1
Spring 2003
Access Paths

Associative access can be realized via scan.

The class of algorithms and data structures
designed for translating attribute values into TID,
or into other types of internal addresses of tuples
having those attribute values, is called access
paths.

Depending on what kind of selection predicate is
to be supported, the techniques for associative
access vary greatly.
ECE569 Lecture 04-2.2
Spring 2003
Content addressability techniques


Primary key access.

A tuple of a relation must be retrieved efficiently via the
value of its primary (unique) key(s). e.g., key-sequenced
files and hased files.

Point query vs. range query
Secondary key access


A set of tuples are produced
Multi-table access

Tuple access is often based on relationships between
different tuples.
ECE569 Lecture 04-2.3
Spring 2003
Operations on files


Assumptions

n = number of records in file

R = number of records that can fit in block
Lookup – Given a key find corresponding record


Insertion – add record to file (allows duplicates)


On average, n / (2R) block accesses.
Read last block; it may need to allocate a new block.
Approximately, requires 2 accesses
Deletion – delete record

look up record n / (2R)

Write back to disk (1 access)

Reorganize (unpinned) – move tuple from last page to utilize
space (2 disk accesses)
ECE569 Lecture 04-2.4
Spring 2003
Hashed Files

File is divided into B buckets

Hash function h maps elements of the key space to
range [0, B)

Key space is large and unevenly distributed
- SSNs as character strings
- Each character takes on at most 10 of the possible 256
values


Hash function h must map key values evenly among a
relatively small number of values.
A bucket directory is an array of B pointers to the
allocated buckets.

Small enough to fit entirely in memory

Buckets are allocated only as they are needed.
ECE569 Lecture 04-2.5
Spring 2003
Hashed files
Hash Table
0
1
B-2
B-1
ECE569 Lecture 04-2.6

.
.K
su
ch
.
th
.
at
.
h(
K)
=
1
K
su
ch
th
at
h(
Buckets
Tuples with key K b
such that h(K b) = 1
Tuples with key K b
such that h(K b) = B -2
2
Tuples with key K b
such that h(K b) = B -1
Spring 2003
Range of
Potential
Key Values
(the shaded
areas denote
used key
values)
ECE569 Lecture 04-2.7
FOLDING
Range of positive integers
Hash-based associative access
HASHING
tuple
address
space
Spring 2003
Folding

Convert arbitrary data types to a positive integer h can be
applied to.

Reduce number of bits so that arithmetic is efficient.

Example: Key is “Keefe” and 16803

Key value is the concatenation of byte representation of
individual fields

Folded value of key is
0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3

Partition result into words and combine using XOR
0x4b 0x65 0x65 0x66
0x65 0x0
0x0
0x0
0x41 0xa3 0x0
0x0

0x6f 0xc6 0x65 0x66 = 1875273062
ECE569 Lecture 04-2.8
Spring 2003
Hashing

goal of hasing

How to choose hash function if all the key values
are uniformly distributed?

The critical issue is to produce 1:1 mapping

Collision: different inputs are mapped to the same
output.

The criteria of a good hash function is to keep the
collision as small as possible.
ECE569 Lecture 04-2.9
Spring 2003
Static Hashing

Input: folded key values

Output: bytes (relative to the beginning of the file),
blocks ??


Bytes are not good because of the varying tuple size.

A block/page is called a bucket.
H: {0 … 232-1} -> {0, B-1}

Continuous allocation

Fixed size: B pages are allocated at file creation time.

Insert
- Determine the bucket
- Check the bucket ( collision may happen)
ECE569 Lecture 04-2.10
Spring 2003
How to find a good hash function

Division / remainder (Congruential hashing)


H(Kb) = kb mod B where kb is folded key value and B
is the number of buckets.
Nth power

Compute kbN, and from the resulting bit string (n x 31
bits) take log2B bits from the middle.

Base transformation

Polynomial division

Numerical analysis

encryption
ECE569 Lecture 04-2.11
Spring 2003
Performance

Assumption




Perfect hash function (tuples are uniformly distributed
over B buckets)
Lookup

½  n/R  1/B
To finish first match

n/R  1/B
If tuple does not exist
Insertion

n/R  1/B + 1
Test for duplicates

1
Otherwise
Deletion

 ½  n/R  1/B delete first match
ECE569 Lecture 04-2.12
Spring 2003
Collision

Two keys collide if they hash to same value

A bucket with room for R tuples can accommodate
R – 1 collisions before it overflows

Internal resolution: Place overflow blocks in another
bucket
- (h(K) + 1) mod B
open addressing
- (h2(h1(K))
multiple hashing
ECE569 Lecture 04-2.13
Spring 2003
Collision - continued

External resolution: Allocation overflow block, link to
overflow chain
buckets
Overflow
pages
ECE569 Lecture 04-2.14
Spring 2003
Discussion

How do you limit the number of pages accessed
when retrieving a tuple, for both external and
internal resolution?
ECE569 Lecture 04-2.15
Spring 2003
How to locate a tuple in a page?

Sequential search

Page directory

hash
ECE569 Lecture 04-2.16
Spring 2003
Extendible Hashing

The number of buckets can grow/shrink.

An intermediate data structure translates the hash
results into page addresses. This data structure
needs to be as compact as possible.

Hashes into an array of pointer to buckets (directory).

The array is small enough to be kept in memory.
ECE569 Lecture 04-2.17
Spring 2003
Directory Growth
• To adapt to dynamically varying size of hash filemodify directory size
• Assume a hash function h(Kb) that produces a bit
string s.
• The directory is of size 2d. d is called the global
depth and is initially 0.
• Use least significant d bits of s to determine
bucket to access
• Each bucket has a corresponding local depth in
the range [0, d]
ECE569 Lecture 04-2.18
Spring 2003
Example
Insert
Each
0x13, 0x10, 0x07, 0x00, 0x1f
page can contain no more than 2 tuples
0x13,0x10
0
0
0
0x10,0x00
1
0
1
d=1
d=0
0x10,0x00
1
00
0
01
1
10
d=1
0x13,0x10
0x13,0x07
1
2
11
d=2
0x13,0x07
ECE569 Lecture 04-2.19
2
Spring 2003
Example – insert 0x1f
0x10,0x00
1
000
001
010
2
011
100
101
0x13
3
0x07,0x1f
3
110
111
d=3
ECE569 Lecture 04-2.20
Spring 2003
Performance

2 steps for retrieving a tuple

If we can keep the directory in memory, each
retrieval is one page access

Assuming 4 bytes per entry, 4KB pages, 1GB hash
files, and we want to keep the entire directory in
memory, what is the minimum buffer size?
ECE569 Lecture 04-2.21
Spring 2003
Discussion

How easy is it to keep the directory in the
memory?

How do we reduce the structure when the file
shrinks?
ECE569 Lecture 04-2.22
Spring 2003