CSE 331. Computer Organization
Download
Report
Transcript CSE 331. Computer Organization
ECE 569
Database System Engineering
Fall 2004
Yanyong Zhang www.ece.rutgers.edu/~yyzhang
Course URL www.ece.rutgers.edu/~yyzhang/fall04
ECE569 Lecture 04-2.1
Fall 2004
Associative access
The system is not asked to retrieve tuples based
on information about their storage location; rather,
it has to find all tuples the attribute values of which
fulfill certain conditions – associative access.
Associative access can be realized by sequential
scanning, which happens for complicated queries.
select
R.x, S.y,
from
R,S
where
R.k = S.f and R.b < 12;
But for simple selection predicates, this is very
slow (even for an in-memory database)
ECE569 Lecture 04-2.2
Fall 2004
Access Path
The class of algorithms and data structures
designed for translating attribute values into TID,
or into other types of internal addresses of tuples
having those attribute values, is called access
paths.
Depending on what kind of selection predicate is
to be supported, the techniques for associative
access vary greatly.
ECE569 Lecture 04-2.3
Fall 2004
Content addressability techniques
Primary key access.
A tuple of a relation must be retrieved efficiently via the
value of its primary (unique) key(s). e.g., key-sequenced
files and hased files.
Point query vs. range query
Secondary key access
A set of tuples are produced
Multi-table access
Tuple access is often based on relationships between
different tuples. E.g., all orders placed by a given
customer have to be found.
ECE569 Lecture 04-2.4
Fall 2004
Associative access path techniques
Hashing (key transformation)
Using the primary key value as a parameter to a function,
which returns the storage location of the tuple.
Key comparison
Maintaining a dynamic search structure on the set of
values in the key attribute. These values can be
organized into tables, lists, trees, and so on
e.g, B+ tree
ECE569 Lecture 04-2.5
Fall 2004
Operations on files (heap files)
Assumptions
n = number of records in file
R = number of records that can fit in block
Lookup – Given a key find corresponding record
Insertion – add record to file (allows duplicates)
On average, n / (2R) block accesses.
Read last block; it may need to allocate a new block.
Approximately, requires 2 accesses
Deletion – delete record
look up record n / (2R)
Write back to disk (1 access)
Reorganize (unpinned) – move tuple from last page to utilize
space (2 disk accesses)
ECE569 Lecture 04-2.6
Fall 2004
Hashed Files
File is divided into B buckets
Hash function h maps elements of the key space to
range [0, B)
Key space is large and unevenly distributed
- SSNs as character strings
- Each character takes on at most 10 of the possible 256
values
Hash function h must map key values evenly among a
relatively small number of values.
ECE569 Lecture 04-2.7
Fall 2004
Range of
Potential
Key Values
(the shaded
areas denote
used key
values)
ECE569 Lecture 04-2.8
FOLDING
Range of positive integers
Hash-based associative access
HASHING
tuple
address
space
Fall 2004
Folding
Convert arbitrary data types to a positive integer h can be
applied to.
Reduce number of bits so that arithmetic is efficient.
Example: Key is “Keefe” and 16803
Key value is the concatenation of byte representation of
individual fields
Folded value of key is
0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3
Partition result into words and combine using XOR
0x4b 0x65 0x65 0x66
0x65 0x0
0x0
0x0
0x41 0xa3 0x0
0x0
0x6f 0xc6 0x65 0x66 = 1875273062
ECE569 Lecture 04-2.9
Fall 2004
Hashing
goal of hashing
How to choose hash function if all the key values
are uniformly distributed?
The critical issue is to produce 1:1 mapping
Collision: different inputs are mapped to the same
output.
The criteria of a good hash function is to keep the
collision as small as possible.
ECE569 Lecture 04-2.10
Fall 2004
Static Hashing
Input: folded key values
Output: bytes (relative to the beginning of the file),
blocks ??
Bytes are not good because of the varying tuple size.
A block/page is called a bucket.
H: {0 … 232-1} -> {0, B-1}
Continuous allocation
Fixed size: B pages are allocated at file creation time.
Insert
- Determine the bucket
- Check the bucket ( collision may happen)
ECE569 Lecture 04-2.11
Fall 2004
How to find a good hash function
Division / remainder (Congruential hashing)
H(Kb) = kb mod B where kb is folded key value and B
is the number of buckets.
Nth power
Compute kbN, and from the resulting bit string (n x 31
bits) take log2B bits from the middle.
Base transformation
Polynomial division
Numerical analysis
encryption
ECE569 Lecture 04-2.12
Fall 2004
Performance
Assumption
Perfect hash function (tuples are uniformly distributed
over B buckets)
Lookup
½ n/R 1/B
To finish first match
n/R 1/B
If tuple does not exist
Insertion
n/R 1/B + 1
Test for duplicates
1
Otherwise
Deletion
½ n/R 1/B delete first match
ECE569 Lecture 04-2.13
Fall 2004
Collision
Two keys collide if they hash to same value
A bucket with room for R tuples can accommodate
R – 1 collisions before it overflows
Internal resolution: Place overflow blocks in another
bucket
- (h(K) + 1) mod B
linear probing
- (h2(h1(K))
multiple hashing
ECE569 Lecture 04-2.14
Fall 2004
Collision - continued
External resolution: Allocation overflow block, link to
overflow chain
buckets
Overflow
pages
ECE569 Lecture 04-2.15
Fall 2004
Discussion
What are the disadvantages of static hashing?
How do you limit the number of pages accessed
when retrieving a tuple, for both external and
internal resolution?
ECE569 Lecture 04-2.16
Fall 2004
Trie
The buckets will dynamically grow/shrink/balance
Fundamental: trie
A
0
A
0
0
0
0
B
1
ECE569 Lecture 04-2.17
B
C
(b)
(a)
000
001
010
011
100
101
110
111
1
1
C
1
1
D
A
D
B
C
(c)
Fall 2004
Dynamic hashing function
We need a hash function whose range of values
can change dynamically
One such hash function can be constructed using
a series of functions hi(k), i = 0, 1, …., such that for
any k, either hi(k) = hi-1(k) or hi(k) = hi-1(k)+2i-1.
Choose H(k), which maps the key space into
random bit patterns of length m, for m sufficiently
large. Then hi(k) may be defined as the integers
formed by the last i bits of H(k).
ECE569 Lecture 04-2.18
Fall 2004
Extendible Hashing
The number of buckets can grow/shrink.
An intermediate data structure translates the hash
results into page addresses. This data structure
needs to be as compact as possible.
Hashes into an array of pointer to buckets (directory).
The array is small enough to be kept in memory.
ECE569 Lecture 04-2.19
Fall 2004
Directory Growth
• To adapt to dynamically varying size of hash filemodify directory size
• Assume a hash function h(Kb) that produces a bit
string s.
• The directory is of size 2d. d is called the global
depth and is initially 0.
• Use least significant d bits of s to determine
bucket to access
• Each bucket has a corresponding local depth in
the range [0, d] which indicates the difference
between all the records in this bucket
ECE569 Lecture 04-2.20
Fall 2004
Example
Insert
Each
0x13, 0x10, 0x07, 0x00, 0x1f
page can contain no more than 2 tuples
0x13,0x10
0
0
0
0x10,0x00
1
0
1
d=1
d=0
0x10,0x00
1
00
0
01
1
10
d=1
0x13,0x10
0x13,0x07
1
2
11
d=2
0x13,0x07
ECE569 Lecture 04-2.21
2
Fall 2004
Example – insert 0x1f
0x10,0x00
1
000
local degree =
001
global degree –
010
log2(# of arrows pointing to
this bucket)
2
011
100
101
0x13
3
0x07,0x1f
3
110
111
d=3
ECE569 Lecture 04-2.22
Fall 2004
Performance
2 steps for retrieving a tuple
If we can keep the directory in memory, each
retrieval is one page access
Assuming 4 bytes per entry, 4KB pages, 1GB hash
files, and we want to keep the entire directory in
memory, what is the minimum buffer size?
ECE569 Lecture 04-2.23
Fall 2004
Discussion
How easy is it to keep the directory in the
memory?
How do we reduce the structure when the file
shrinks?
How do you make the directory small, and increase
space utilization? (deferred splitting)
ECE569 Lecture 04-2.24
Fall 2004
Linear hashing
(a)
a
b
c
d
d=2
00 01 10 11
w
(b)
a
b
000 01
c
d
10 11
d=2
100
x
(c)
a
b
c
d
000 001 10 11
w
d=2
100 101
y
x
(c)
a
b
c
d
000 001 010 11
d=2
ECE569 Lecture 04-2.25
y
w
100 101 110
(d)
a
b
c
d
w
x
000 001 010 011 100 101 110 111
d=3
Fall 2004
Addressing in linear hashing
Which hash function should be used?
d is the degree, p is the address of the next page to split
The algorithm is as follows:
begin
if (hd(k) >= p) then page:= hd(k)
else
page := hd+1(k)
if necessary, chase the overflow chain
end
ECE569 Lecture 04-2.26
Fall 2004
Discuss
Where to store the overflow?
ECE569 Lecture 04-2.27
Fall 2004