Search Key

Transcript Search Key

Access Methods
This is a modified version of Prof. Hector Garcia Molina’s
slides. All copy rights belong to the original author.
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Basic Concepts
search key pointer
Value

?
record
value
Search Key - set of attributes used to look up
records in a file.
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Index Evaluation Metrics

Access types supported efficiently. E.g.,





Point query: find “Tom”
Range query: find students whose age is between 20-40
Access time
Update time
Space overhead
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Ordered Indices

In an ordered index, index entries are stored sorted
on the search key value. E.g., author catalog in
library.
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
same order
Primary index
Also called clustering
index
•The search
key of a
primary
index is
usually but
not
necessarily
the primary
key.
1/14/2005
10
20
10
30
50
70
30
40
90
110
130
150
50
60
70
80
90
100
170
190
210
230
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Search key
different order
Secondary index:
non-clustering index.
10
20
30
40
30
50
50
60
70
...
80
40
20
70
100
10
90
60
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Search key
Dense Index
Dense Index:
contains index records
for every search-key
values.
10
20
10
20
30
40
30
40
50
60
70
80
50
60
70
80
90
100
90
100
110
120
1/14/2005
Sequential File
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Sparse Index
Sparse Index:
contains index records
for only some searchkey values.
Applicable when
records are
sequentially
ordered on
search-key
1/14/2005
Sequential File
10
20
10
30
50
70
30
40
90
110
130
150
50
60
70
80
90
100
170
190
210
230
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Sequence
field
Secondary indexes
• Sparse index
30
50
30
20
80
100
20
70
80
40
90
...
100
10
does not make sense!
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
90
60
Multilevel Index
Sequential File
Sparse 2nd level
10
90
170
250
330
410
490
570
10
30
50
70
90
110
130
150
170
190
210
230
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
10
20
30
40
50
60
70
80
90
100
Multilevel Index
Sequence
field
Secondary indexes
10
50
90
...
sparse
high
level


1/14/2005
10
20
30
40
30
50
50
60
70
...
80
40
20
70
100
10
Lowest level is dense
Other levels are sparse
Yan Huang - CSCI5330 Database
Implementation – Access Methods
90
60
Conventional indexes
Advantage:
- Simple
- Index is sequential file good for scans
Disadvantage:
- Inserts expensive
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Outline:


Conventional indexes
B+-Tree
 NEXT
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods

NEXT: Another type of index


Give up on sequentiality of index
Try to get “balance”
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
B+Tree Example
n=4
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
180
200
150
156
179
120
130
100
101
110
30
35
3
5
11
30
120
150
180
100
Root
to keys
< 57
to keys
95
81
57
Sample non-leaf
to keys
57 k<81
81k<95
Key is moved (not copied) from lower level
non-leaf node to upper level non-leaf node
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
to keys
95
Sample leaf node:
From non-leaf node
57
81
95
To record
with key 57
To record
with key 81
To record
with key 85
to next leaf
in sequence
Key is copied (not moved) from leaf
1/14/2005
Yan Huang - CSCI5330 Database
node
to non-leaf nodeImplementation
– Access Methods
Leaf:
30
35
n=4
30 35
Non-leaf:
30
30
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Size of nodes:
n pointers
n-1 keys
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Don’t want nodes to be too empty

Use at least
Root
: 2 pointers
Non-leaf: n/2 pointers
Leaf
1/14/2005
: (n-1)/2 keys
Yan Huang - CSCI5330 Database
Implementation – Access Methods
n=4
Full node
min. node
30
30
35
Leaf
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
counts even if null
120
150
180
3
5
11
Non-leaf
B+tree rules
tree of order n
(1) All leaves at same lowest level
(balanced tree)
(2) Pointers in leaves point to records

except for “sequence pointer”
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
(3) Number of pointers/keys for B+tree
Max Max Min
ptrs keys ptrsdata
Non-leaf
(non-root)
Leaf
(non-root)
Root
1/14/2005
Min
keys
n
n-1
n/2
n/2- 1
n
n-1
(n-1)/2
(n-1)/2
n
n-1
2
1
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Insert into B+tree
(a) simple case

space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
n=4
1/14/2005
30
31
32
3
5
11
30
100
(a) Insert key = 32
Yan Huang - CSCI5330 Database
Implementation – Access Methods
n=4
1/14/2005
30
30
31
3
57
11
3
5
7
100
(b) Insert key = 7
Yan Huang - CSCI5330 Database
Implementation – Access Methods
n=4
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
180
200
180
160
179
150
156
179
120
150
180
160
100
(c) Insert key = 160
n=4
(d) New root, insert 45
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
40
45
30
32
40
20
25
10
12
1
2
3
10
20
30
40
30
new root
Deletion from B+tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
(b) Coalesce with sibling
n=5
Delete 50
1/14/2005
40
50
10
20
30
40
10
40
100

Yan Huang - CSCI5330 Database
Implementation – Access Methods
(c) Redistribute keys
n=5
Delete 50
1/14/2005
35
40
50
10
20
30
35
10
40 35
100

Yan Huang - CSCI5330 Database
Implementation – Access Methods
(d) Non-leaf coalesce
n=5
Delete 37
25

1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
40
45
30
37
30
40
25
26
30
20
22
10
14
1
3
10
20
25
40
new root
B+tree deletions in practice
–
Often, coalescing is not implemented

Too hard and not worth it!
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Index Definition in SQL

Create an index
create index <index-name> on <relation-name>
(<attribute-list>)
E.g.: create index gindex on country(gdp);

To drop an index
drop index <index-name>
E.g.: drop index gindex;
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Multi-key Index
Motivation: Find records where
DEPT = “Toy” AND SAL > 50k
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Strategy I:


Use one index, say Dept.
Get all Dept = “Toy” records
and check their salary
I1
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Strategy II:

Use 2 Indexes; Manipulate Pointers
Toy
1/14/2005
Sal
> 50k
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Strategy III:

Multiple Key Index
One idea:
I2
I1
1/14/2005
I3
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example
Art
Sales
Toy
Dept
Index
10k
15k
17k
21k
12k
15k
15k
19k
Salary
Index
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example
Record
Name=Joe
DEPT=Sales
SAL=15k
For which queries is this index good?
Find RECs Dept = “Sales”
Find RECs Dept = “Sales”
Find RECs Dept = “Sales”
Find RECs SAL = 20k
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
SAL=20k
SAL > 20k
Interesting application:
Geographic Data
y
DATA:
<X1,Y1, Attributes>
<X2,Y2, Attributes>
x
...

1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Queries:



What city is at <Xi,Yi>?
What is within 5 miles from <Xi,Yi>?
Which is closest point to <Xi,Yi>?
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example
40
10
25
i
h
30
20
20
15 35
20
10
e
n
o
j
k
m
10
5
h i
f
15
15
j k
g
l
1/14/2005
m
d e
c
b
f
l
c
g
20
a b
• Search points near f
• Search points near b
n o
Yan Huang - CSCI5330 Database
Implementation – Access Methods
a
d
Queries




Find points with Yi > 20
Find points with Xi < 5
Find points “close” to i = <12,38>
Find points “close” to b = <7,24>
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods

Many types of geographic index
structures have been suggested


1/14/2005
Quad Trees
R Trees
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Two more types of multi key indexes


Grid
Bitmap index
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Grid Index
Key 2
X1 X2
……
Xn
V1
V2
Key 1
Vn
To records with key1=V3, key2=X2
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
CLAIM

Can quickly find records with




key 1 = Vi  Key 2 = Xj
key 1 = Vi
key 2 = Xj
And also ranges….

E.g., key 1  Vi  key 2 < Xj
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
 But there is a catch with Grid Indexes!
V2
V3
X1
X2
X3
X4
Like
Array...
V1
X1
X2
X3
X4
How is Grid Index stored on disk?
X1
X2
X3
X4

Problem:
 Need regularity so we can compute
of <Vi,Xj> entry
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
position
Solution: Use Indirection
V1
V2
V3
V4
Buckets
X1 X2 X3
----------
*Grid only
contains
pointers to
buckets
Buckets
---1/14/2005
---Yan Huang - CSCI5330 Database
Implementation – Access Methods
With indirection:


Grid can be regular without wasting space
We do have price of indirection
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Can also index grid on value ranges
Salary
Grid
1
2
50K-
3
8
0-20K
20K-50K
1
Linear Scale
1/14/2005
2
3
Toy Sales Personnel
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Grid files
+ Good for multiple-key search
- Space, management overhead
(nothing is free)
- Need partitioning ranges that evenly split keys
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example Grid File for account
Divide branch-name into
non-uniform intervals
?
Branch-name <Central and
10k<=balance<50k
two attributes
as search key
Divide balance into nonuniform intervals
What about Central<=branch-name<Townsend and 50k<=balance?
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example Grid File for account
Bj
Bk
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Grid Files (Cont.)

Linear scales must be chosen to uniformly distribute
records across cells.


Periodic re-organization to increase grid size will
help.


Otherwise there will be too many overflow buckets.
But reorganization can be very expensive.
Space overhead of grid array can be high.
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Bitmap Indices

Another index could be used for multiple valued search
keys
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Bitmap Indices (Cont.)
The income-level value of record 3 is L1
Bitmap(size = table size)
Unique values
of gender
Unique values
of income-level
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Bitmap Indices (Cont.)

Some properties of bitmap indices



Number of bitmaps for each attribute?
Size of each bitmap?
When is the bitmap matrix sparse and what attributes are good
for bitmap indices?
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Bitmap Indices (Cont.)

Bitmap indices generally very small compared with
relation size




E.g. if record is 100 bytes, space for a single bitmap is 1/800 of
space used by relation.
If number of distinct attribute values is 8, bitmap is only 1% of
relation size
What about insertion?
Deletion?
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Bitmap Indices Queries
Sample query: Males with income level L1
10010 AND 10100 = 10000
even faster!
What about the number of males with income level L1?
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Bitmap Indices Queries

Queries are answered using bitmap operations



1/14/2005
Intersection (and)
Union (or)
Complementation (not)
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Hashing
key  h(key)
<key>
.
.
.
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Buckets
(typically 1
disk block)
Two alternatives
.
.
.
(1) key  h(key)
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
records
.
.
.
Two alternatives
(2) key  h(key)
key 1
Index

Alt (2) for “secondary” search key
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
record
Example hash function



Key = ‘x1 x2 … xn’ n byte character string
Have b buckets
h: add x1 + x2 + ….. xn

1/14/2005
compute sum modulo b
Yan Huang - CSCI5330 Database
Implementation – Access Methods
 This may not be best function …
Good hash
function:
1/14/2005
 Expected number of
keys/bucket is the
same for all buckets
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Within a bucket:

Do we keep keys sorted?

Yes, if CPU time critical
& Inserts/Deletes not too frequent
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Next: example to illustrate
inserts, overflows, deletes
h(K)
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
EXAMPLE 2 records/bucket
INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
0
d
1
a
c
2
b
3
h(e) = 1
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
e
EXAMPLE: deletion
Delete:
e
f
c
0
a
1
b
c d
e
2
3
1/14/2005
f
g
d
maybe move
“g” up
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Rule of thumb:



Try to keep space utilization
between 50% and 80%
Utilization =
# keys used
total # keys that fit
If < 50%, wasting space
If > 80%, overflows significant
depends on how good hash
function is & on # keys/bucket
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
How do we cope with growth?


Overflows and reorganizations
Dynamic hashing


1/14/2005
Extensible
Linear
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Extensible hashing: two ideas
(a) Use i of b bits output by hash function
b
h(K) 
00110101
use i  grows over time….
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
(b) Use directory
h(K)[i ]
.
.
.
to bucket
.
.
.
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example: h(k) is 4 bits; 2 keys/bucket
i=
1
1
0001
i=
1/14/2005
00
01
10
1 2
1001
1010 1100
Insert 1010
2
11
1 2
1100
Yan Huang - CSCI5330 Database
Implementation – Access Methods
New directory
Example continued
i= 2
00
01
10
11
Insert:
0111
2
0000
0001
1 2
0001 0111
0111
2
1001
1010
2
1100
0000
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example continued
0000 2
i= 2
0001
00
0111 2
01
10
11
Insert:
1001
1/14/2005
1001 3
1001
1010 1001 2 3
1010
1100 2
Yan Huang - CSCI5330 Database
Implementation – Access Methods
i=3
000
001
010
011
100
101
110
111
Extensible hashing: deletion


1/14/2005
No merging of blocks
Merge blocks
and cut directory if possible
(Reverse insert procedure)
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Deletion example:

Run thru insert example in reverse!
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Summary
Extensible hashing
+ Can handle growing files
- with less wasted space
- with no full reorganizations
- Indirection
(Not bad if directory in memory)
Directory doubles in size
-
(Now it fits, now it does not)
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Linear hashing

Another dynamic hashing scheme
Two ideas:
b
(a) Use i low order bits of hash
01110101
grows
(b) File grows linearly
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
i
Example b=4 bits,
i =2, 2 keys/bucket
• insert 0101
0101
0000
1010
00
• can have overflow chains!
Future
growth
buckets
0101
1111
01
10
11
m = 01 (max used block)
Rule
1/14/2005
If h(k)[i ]  m, then
look at bucket h(k)[i ]
else, look at bucket h(k)[i ] - 2i -1
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example b=4 bits,
i =2, 2 keys/bucket
0101
0000
1010
0101
0101
1111
00
01
• insert 0101
1010
10
1111
11
m = 01 (max used block)
10
11
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Future
growth
buckets
Example Continued: How to grow beyond this?
i=2 3
0000
0 00
100
0101
0101
001
101
1010
010
110
0 11
111
m = 11 (max used block)
100
101
1/14/2005
0101
0101
1111
Yan Huang - CSCI5330 Database
Implementation – Access Methods
100
101
...
 When do we expand file?


Keep track of:
# used slots
total # of slots
If U > threshold then increase m
(and maybe i )
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
=U
Summary
Linear Hashing
+
Can handle growing files
- with less wasted space
- with no full reorganizations
+ No indirection like extensible hashing
-
Can still have overflow chains
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Example: BAD CASE
Very full
Very empty
1/14/2005
Need to move
m here…
Would waste
space...
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Summary
Hashing
- How it works
- Dynamic hashing
- Extensible
- Linear
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Indexing vs Hashing

Hashing good for probes given key
e.g., SELECT …
FROM R
WHERE R.A = 5
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods
Indexing vs Hashing

INDEXING good for
Range Searches:
e.g., SELECT
FROM R
WHERE R.A > 5
1/14/2005
Yan Huang - CSCI5330 Database
Implementation – Access Methods