Why Not Store Everything in Main Memory? Why use disks?

Download Report

Transcript Why Not Store Everything in Main Memory? Why use disks?

7. Indexes
Heap files allow record retrieval:
by specifying the Record IDentifier, RID, or
by scanning all records sequentially.
Sometimes, retrieval of records by specifying the values in
one or more fields is needed (semantic search or value-based
querying), e.g.,
Find all students in the CS dept; or
Find students with gpa > 3;
Indexes are auxilliary files (separate from the data files they index)
that enable answering these value-based queries efficiently.
An index contains a “search key attribute”, k, and “data entries”, k*,
which lead us to the records containing the search key value.
the k*s can be actual records,
or pointers to the records
or pointers to pointers to the records (indirect pointers).
In these notes we will always assume the second alternative
Section 7
#1
Index Classification
• Primary Indexes vs. Secondary Indexes:
– If the search key is, or at least contains, the clustered primary key,
then the index is called a primary index, else it is called a
secondary index.
• Clustered Indexes vs. Unclustered Indexes:
– If closeness of key values implies closeness of the data records
containing those key values, the index is called a clustered index,
else it is an unclustered index. (recall, since physical disk storage
is not completely linear, "close" means
• 1st: close on one disk track
• 2nd: on differnt tracks but on the same cylinder
• 3rd: on the different cylinders but on close cylinders.
–
A file can be clustered on at most 1 attribute (search key) but that
–
The cost of retrieving data records through an index varies greatly
based on whether index is clustered or not!
attribute may be composite (made up of multiple attributes).
Section 7
#2
Index Classification continued
key
Anchor records of each page
•
If there is at least one index entry
per existing attribute value, then
it is called dense, else sparse or
non-dense.
Name, age, bonus
Ashby, 25, 3000
22
Basu,
33, 4003
25
Bristow, 30, 2007
30
Ashby
33
Cass
Cass,
Smith
Daniels, 22, 6003
Jones,
50, 5004
40
40, 6003
44
•
•
•
•
Every sparse index must be
clustered! Sparse indexes are
smaller.
We show a sparse index on the
Name attribute.
The key values (name in this
example) that do occur in a sparse
index are called ANCHOR values
(always the first key value that
occurs on a page).
In this example the anchor names
are Ashby, Cass and Smith. Basu,
Bristow, Daniels, Jones and Tracy
44
Smith, 44, 3000
50
Tracy, 44, 5004
Sparse Index
on
Name
Dense Index
on
Age
Data File
Section 7
#3
Primary Index
PRIMARY INDEX: I(k, k*)
k  ordered or clustered "key" field values from an ordered or clustered field of file with
the uniqueness property (individual value occurrences are "unique" i.e., each value can occur at most once.)
k* = pointer to page (sematic pointer - namely page number) containing the first record
with key value, k.
Example: Assume the blocking factor (bfr) is 2, which means there is room for 2 records per page.
STUDENT
|S#|SNAME |LCODE |
|17|BAID |NY2091|
|25|CLAY |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD |FL6321|
|57|BROWN |NY2092|
|83|THOM |ND3450|
page 1
page 2
Primary Index on S#
|S#|pg
|17| 1
|32| 2
|57| 3
page 3
Section 7
#4
Clustering Index
is like a primary index except that the attribute need not be a key
- but the file must be clustered on the attribute, k
- and the pointer for any k is the 1st page with that k-value
ENROLL2
|S#|C#|GRADE
|17|6 | 96 |
|25|6 | 76 |
|32|6 | 62 |
|38|6 | 98 |
|32|6 | 91 |
|25|7 | 68 |
|32|8 | 89 |
|17|9 | 95 |
page 1
page 2
page 3
page 4
|C#|pg| Dense Clustering_Index on C#
|6 | 1|
|7 | 3|
|8 | 4|
|9 | 4|
|C#|pg| Non-dense Clustering_Index on C#
|6 | 1| (indexing new anchor records only)
|8 | 4|
There's no more search overhead with this 2nd
type of non-dense clustering index, but
How can you know which page has C#=7?
(search pages sequentially starting at
page=1)
How can you know which page has C#=9?
(search pages sequentially starting at
page=4)
Section 7
#5
Secondary Index
These indexes are the same as Clustering Indexes except,
- the file need not be clustered on k,
- k* points to the page or record containing k
- every record must be indexed (all secondary indexes are dense). Why?
Option1: If there are multiple occurences of k, use multiple index entries for that k.
S#|C#|GRADE ENROLL (unclustered C#)
32|8 | 89 | page 1
25|6 | 76 |
32|6 | 62 | page 2
25|8 | 86 |
38|6 | 98 | page 3
32|7 | 91 |
17|5 | 96 | page 4
25|7 | 68 |
17|8 | 95 |
page 5
C#|pg Secondary_Index, Option1 on C#
5 | 4
6 | 1
6 | 2
6 | 3
7 | 3
7 | 4
8 | 1
8 | 2
8 | 4
Option2: Use repeating groups of
pointers (requires a variable
length page attribute)
|C#|page
|5 | 4
|6 | 1,2,3
|7 | 3,4
|8 | 1,2,4
Option3: Use 1 index entry for each value, with
1 pointer to a linked list of record pointers.
(1 level of indirection)
|S#|C#|GRADE
ENROLL (unclustered C#)
|32|8 | 89 |
|25|6 | 76 |
|32|6 | 62 |
|25|8 | 86 |
|38|6 | 98 |
|32|7 | 91 |
|17|5 | 96 |
|25|7 | 68 |
|17|8 | 95 |
|C#|
|5 |
|6 |
|7 |
|8 |
page
-->|4|
-->|1|->|2|->|3|
-->|3|->|4|
-->|1|->|2|->|4|
Section 7
#6
Multi-level Index
(made up of an index on an index)
For any index, since it is a file clustered on the key, k, it can have a primary or
clustering index on it. (constituting the second level of the multilevel index).
STUDENT
|S#|SNAME |LCODE |
|17|BAID |NY2091|
|25|CLAY |NJ5101|
|32|THAISZ|NJ5102|
|38|GOOD |FL6321|
|57|BROWN |NY2092|
|83|THOM |ND3450|
|91|PARK |MN7334|
|94|SIVA |OR1123|
|S#|pg|
|17| 1|
|32| 2|
|57| 3|
|91| 4|
page 1 of STUDENT
page 4
(of index file) S#-index (nondense, primary)
page 1 of First Level Index
page 2
2nd_LEVEL (a second level, nondense index)
|S#|pg|
|17| 1|
|57| 2|
Section 7
#7
Tree-structured or Multilevel indexing techniques:
ISAM:
(a variation of multilevel Secondary indexing) static structure;
B+ tree: dynamic structure which adjusts gracefully under insert and delete.
ISAM
index entry
1 index every record by <k, k*>.
K*
0
K
1
K*
1
K 2
K m K*m
K*
2
This Index file may still be quite large, but we can apply the idea repeatedly!
* Leaf pages contain data entries, <k,k*>.
* Non-Leaf or Inodes contain k values only
Non-leaf (inode
Leaf
Pages
Overflow
page
Primary pages
Section 7
#8
Example ISAM Tree
Blocking factor is 2 (each node has 2 k
entries)
In any internal node or inode (non-leaf) add
a ptr for key_values < first k-value
Root
40
20
10,10*
15,15*
20,20*
33
27,27*
51
33,33*
37,37*
40,40*
46,46*
51,51*
63
55,55*
63,63*
Section 7
97,97*
#9
Insert k=23
Insert k=48
Insert k=41
Insert k=42
40
Index
Pages
Primary
Leaf
Pages
Overflow
Pages
20
10,10*
15,15*
20,20*
33
27,27*
51
33,33*
Need overflow page
23,23*
37,37*
40,40*
46,46*
51,51*
63
55,55*
63,63*
Need overflow page
48,48* 41,41*
Need overflow page
42,42*
Section 7
# 10
97,97*
40
Deleting 42
Deleting
51
Deleting 97
10,10*
20
15,15*
20,20*
23,23*
33
27,27*
51
33,33*
37,37*
40,40*
46,46*
51,51*
63
55,55*
63,63*
48,48* 41,41*
42,42*
* Note that 51 appears in index levels, but not in leaf!
Section 7
# 11
97,97*
B+ Tree: The Most Widely Used Index
• keeps tree height-balanced.
• Minimum 50% occupancy (except for root). Each
node contains m entries, where d  m  2d.
– d is called the degree or order of the index.
• Supports equality and range-searches efficiently.
Index Entries
(“Direct search set or index set”)
Data Entries
("Sequence set")
Section 7
# 12
Example B+ Tree (d=2)
Search begins at root, key comparisons direct it to a leaf (similar to ISAM except use  comparisons to keys
and take left pointer iff < lowest key).
Search for 5
Search for 15
Search for all data entries  24
Root
13
2*
3*
5*
5*
7*
14* 16*
17
24
19* 20* 22*
30
Leaves are doubly linked for
fast sequential <, , , > search
24* 27* 29*
33* 34* 38* 39*
15 is not in the file!
Section 7
# 13
Example B+ Tree (contd.)
• Search for all data entries < 23
• (note, this is the reason for the double linkage).
Root
13
2*
3*
5*
7*
14* 16*
17
24
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
Section 7
# 14
Inserting a Data Entry into a B+ Tree
• Find correct leaf L.
• Put data entry in L.
–
–
If L has enough space, done!
Else, must split L (into L and a new node L2)
• Redistribute entries, copy up (promote) middle key.
• middle value which was promoted and is now the anchor key for L2).
• This can happen recursively (e.g., if there is no space for the promoted middle
value in the inode to which it is promoted)
–
To split inode, redistribute entries evenly, but push up (promote) middle
key.
•
•
•
So promote means Copy up at leaf;
Move up at inode.
Splits “grow” tree
only a root split increases height.
– Only tree growth possible: wider or 1 level taller at top.
Section 7
# 15
Entry to be inserted in parent node.
(Note that 17 is moved up and only
appears once in the index. Contrast
this with a leaf split.)
17
No room for 5, so split and move 17 up.
Inserting 8*
5
5
13
13
17
3*
24 30
24
2*
5*30 7*
No room for 8, so split.
2*
3*
5*2*
7*
3*
8*
5*
7*
14* 16*
19* 20* 22*
24* 27* 29*
33* 34* 38* 39*
5 is promoted to parent node.
(Note that 5 is
s copied up and
continues to appear in the new leaf node, L2, as its anchor value.)
Observe how minimum occupancy is guaranteed in both leaf and index pg splits.
• Note difference between copy-up (leaf) and move-up (inode)
Section 7
# 16
Root
13
2*
3*
5*
7*
17
24
19* 20* 22*
14* 16*
30
24* 27* 29*
33* 34* 38* 39*
B+ Tree Before Inserting 8*
After Inserting 8*
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
Note height_increase, balance and occupancy maintenance.
Section 7
# 17
Deleting a Data Entry from a B+ Tree
• Start at root, find leaf L where entry belongs.
• Remove the entry.
–
If L is at least half-full, done!
–
If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent node
with same parent as L).
• If re-distribution fails, merge L and a sibling.
• Merge could propagate to root, and therefore
decreasing height.
Section 7
# 18
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
19* 20* 22*
30
33* 34* 38* 39*
24* 27* 29*
Example Tree After Inserting 8*
Root
Then Deleting 19*, 20*
17
5
2*
•
•
3*
27
13
5*
7* 8*
14* 16*
22* 24*
30
27* 29*
33* 34* 38* 39*
Deleting 19* is easy.
Deleting 20* is done with re-distribution of 24* (and
revision of anchor value (from 24 to 27) in inode.
Section 7
# 19
17
... And Then Deleting 24*
5
2*
3*
27
13
5*
7* 8*
22* 24*
14* 16*
30
27* 29*
33* 34* 38* 39*
• Must merge.
27
17
5
2*
3*
30
13
5*
7* 8*
22* 27* 29*
14* 16*
33* 34* 38* 39*
• Observe `toss’ of index entry, 27, now that inode is below min occupancy so
merge it with its sibling
• and index entry, 17 can be `pulled down’ (sibling merge, followed by pull-down)
Root
5
2*
3*
5*
7*
8*
13
14* 16*
17
30
22* 27* 29*
33* 34* 38* 39*
Section 7
# 20
Multidimensional Index
Multidimensional data almost always requires multidimensional indexing for effective
access.
One dimensional indexes assume a single search column, attribute (search key) which
can be a composite column or key.
Data structures, that support queries into multidimensional data specifically, fall in two
categories:
1. Hash-table-like (e.g., Grid files and Partitioned Hash Functions)
2. Tree-like, eg,
multi-key indexes,
kd-trees,
quad-trees (for sets of points);
R-trees (for sets of regions as well as sets of points) ),
Predicate-trees (P-trees) for vertical compressed, representations of data
Section 7
# 21
Hash-like Structures for Multidimensional
e.g., Data Grid Files
Partition the POINTS in the space into a grid. In each dimension "grid lines" partition space into stripes.
Points that fall directly on a grid line belong to the stripe above or to the right of it.
Example: 12 customer(age,salary) 2-dimensional data records.
(age,salary): (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)
If vertical grid lines are drawn at age=40, age=65, horizontal at SAL=90K, SAL=224K
Grid hash function
age
range
0-39
40-55
sal
range
0-89K
0-89K
points
The points are hashed by ranges:
400K
(24,60)
380K
(46,60)
360K
(50,80)
340K
40-55 90-223K (50,100)
320K
(50,120)
300K
56-99 90-223K (70,100)
280K
(84,140)
260K
0-39 224-400K (30,260)
240K
(26,400)
220K
40-55 224-400K (44,360)
200K
(50,280)
180K
56-99 224-400K (60,260)
160K
140K
Inserting into Grid files: If there is room, 120K
100K
insert, else (two methods)
80K
1. add overflow block and chain it to the
60K
primary block, or
40K
2. reorganize the structure by adding or
20K
moving grid lines (similar to dynamic
0K
hashing)
0
10
20
30
40
*
*
*
*
*
*
**
**
*
A problem with Grid files is that the number
of buckets grows exponentially with
dimension and the grid may become sparse.
56
40
50
*
60
70
80
90
Section 7
100  AGE
# 22
Hash-like Structures for Multidimensional
e.g., Partitioned hash Files
is a sequence of hash functions, h=(h1,...hn) such that hi produces the ith segment of bits in the
hash key, that is, h(a) is the concatenation of bit subsequences, h1(a)h2(a)..hn(a).
Example: The data file is CUSTOMER(AGE,SAL) consisting again of (24,60) (46,60) (50,80)
(50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)
Use 2 hash functions and 3 bits, the 1st bit is for age with hash function, mod2(tens_digit of age)
and the last 2 bits are for salary with hash function, mod4(hundreds_digit of sal)
The lookup table is:
Partitioined hash function
key
points
000
001
010
011
100
101
110
111
(24,060) (46,060) (26,400)
(84,140)
(60,260)
(44,360)
(50,080)
(50,100) (50,120) (70,100)
(30,260) (50,280)
Section 7
# 23
Tree-like Structures for Multidimensional
e.g., Multi-key Index
Assume several attributes representing "dimensions" of the data points (data cube tuples)
- uses a multi-level index, e.g., suppose there are 2 attributes:
Provides a second level of Indexes on 2nd attribute to all tuples with same 1st attribute value
/|-->
/ |-->
Index on
.--> < |-->
1st attr /
\ |..
/|/
\|-->
/ |
/ |
/|-->
/
|
/ |-->
--> <
|----> < |-->
\
|
\ |..
\ |\
\|-->
\ | \
\| \
/|-->
\ \
/ |-->
\ `>< |-->
\
\ |..
\
\|-->
\
`-> . . .
indexes
on 2nd
attr
Take the (age, salary) points again (24,60) (24,260) (24,400)
(50,80) (50,100) (50,120) (50,280) (60,100) (60,260) (84,140)
. - - - - - - - - - - ->
/
.- - - - - - ->
/
/
.- - - ->
___/_________/______/
.--> |_60_|_260_|_400_____|
/
.- - - - - - -- - - - - ->
age
/
____/________________
24----' .-> |_80_|_100|_120_|_280_|- - - -->
50-------'
\
`- - - - - - - -->
60.
_______
`- - - - - - - - - - ->
84 `-->|100|260|- - - - - - - - - - - - - -->
\
`- - - - - - - - - - - - - - - - -->
\
_____
`- >|_140_|- - - - - - - - - - - - - - -->
Section 7
(24,060)
(24,060)
(24,400)
(50,080)
(50,280)
(50,120)
(50,100)
(60,260)
(60,100)
(84,140)
# 24
Tree-like Structures for Multidimensional
e.g., k dimensional (kd tree) Index
Interior nodes have (Attribute, Value, LowPointer, HighPontr)
- Value is a value which splits data points
- The example below will show (a, V, down, up) with pointers going down for LowPointer and up
for HighPointer. (goes up on greater or equal actually).
- Attributes used for different levels are different and ROTATE among the dimensions (round
robin).
- The leaves are blocks of records (assume data blocks hold 2 records, i.e., the blocking factor,
bfr, is 2).
- to search: decide along the tree until you reach a leaf (going up on greater or equal)
- to insert: decide along the tree until you reach the proper leaf if there is room there, insert; else
split the block and divide its contents according to the appropriate attribute (next one in the
rotation).
Example: (insert into kd-tree in this order using age first then salart, sal):
age,sal (50,80) (84,140) (30,260) (44,360) (50,120) (70,100) (24,60) (26,400) (50,280) (46,60)
(60,260) (50,100)
insert the first 2 pairs (no tree yet, since just 1 leaf block):
50, 80
84, 140
age sal
30, 260
(leaf is full so split it and divide the contents by sal=150)
30,260
sal
/
{ ,150} <
\
50,80
84,140
Section 7
# 25
Tree-like Structures for Multidimensional
e.g., k dimensional (kd tree) Index continued
30,260
sal
/
{ ,150} <
\
50,80
84,140
age sal
44,360 (leaf is not full so insert)
30,260
44,360
sal
/
{ ,150} <
\
50,80
84,140
age sal
50,120 (leaf is full so split, divide contents by age=55)
30,260
44,360
sal
/
{ ,150} <
\
84,140
\age
/
{ 55, } <
\
50,80
50,120
Section 7
# 26
Tree-like Structures for Multidimensional
e.g., k dimensional (kd tree) Index continued
age sal
50,120
30,260
44,360
sal
/
{ ,150} <
\
84,140
\age
/
{ 55, } <
\
50,80
50,120
age sal
70,100 (leaf is not full so insert)
30,260
44,360
sal
/
{ ,150} <
84,140
\
70,100
\age
/
{ 55, } <
\
50,80
50,120
age sal
24,060 (leaf is full so split, divide by sal=75)
30,260
44,360
sal
/
{ ,150} <
84,140
\
70,100
\age
/
{ 55, } <
\
50,080
\
50,120
\
sal /
{ ,75)<
\
24,060
Section 7
# 27
Tree-like Structures for Multidimensional
e.g., k dimensional (kd tree) Index continued
30,260
44,360
sal
/
{ ,150} <
\
84,140
70,100
\age
/
{ 55, } <
\
50,080
50,120
\
\
sal /
{ ,75)<
\
24,060
age sal
50,280 (leaf full split, div by sal=300)
44,360
/
sal
(300, }<
age
/
\
{ 28, }<
30,260
/
\
50,280
/
26,400
age sal
26,400 (leaf full split, div by age=28)
30,260
44,360
/
age
/
{ 28, }<
/
\
/
sal
/
{ ,150} <
\
26,400
/
sal
/
{ ,150} <
\
84,140
70,100
\age
/
{ 55, } <
\
50,080
50,120
\
\
84,140
70,100
\age
/
{ 55, } <
\
50,080
\
50,120
\
sal /
{ ,75)<
\
24,060
sal /
{ ,75)<
\
24,060
Section 7
# 28
Tree-like Structures for Multidimensional
e.g., k dimensional (kd tree) Index continued
44,360
sal
/
(300, }<
age
/
\
{ 28, }<
30,260
/
\
50,280
/
26,400
/
sal
/
{ ,150} <
\
age sal
46,060 (leaf not full so insert)
84,140
70,100
\age
/
{ 55, } <
\
44,360
/
sal
(300, }<
age
/
\
{ 28, }<
30,260
/
\
50,280
/
26,400
50,080
50,120
\
\
sal /
{ ,75)<
\
/
24,060
sal
/
{ ,150} <
\
84,140
70,100
\age
/
{ 55, } <
\
50,080
\
50,120
\
sal /
{ ,75)<
\
24,060
46,060
Section 7
# 29
Tree-like Structures for Multidimensional
e.g., k dimensional (kd tree) Index continued
44,360
/
sal
(300, }<
age
/
\
{ 28, }<
30,260
/
\
50,280
/
26,400
/
sal
/
{ ,150} <
\
age sal
60,260 (leaf full split by age=40)
84,140
70,100
\age
/
{ 55, } <
\
50,080
50,120
\
\
sal /
{ ,75)<
\
24,060
46,060
44,360
sal
/
(300, }<
age
/
\
30,260
{ 28, }<
\ age
/
/
\
{ 40, }<
/
26,400
\
/
50,280
sal
/
60,260
{ ,150} <
84,140
\
70,100
\age
/
{ 55, } <
\
50,080
\
50,120
\
sal /
{ ,75)<
\
24,060
46,060
Section 7
# 30
44,360
/
Tree-like Structures for Multidim
e.g., k dim (kd tree) Index continued
sal
(300, }<
age
/
\
30,260
{ 28, }<
\ age
/
/
\
{ 40, }<
/
26,400
\
/
50,280
sal
/
60,260
{ ,150} <
84,140
\
70,100
\age
/
{ 55, } <
\
50,080
\
50,120
age sal
\
sal /
50,100 (full split age=50 full again split sal=90)
{ ,75)<
\
44,360
24,060
sal
/
46,060
(300, }<
age
/
\
30,260
{ 28, }<
\ age
/
/
\
{ 40, }<
/
26,400
\
/
50,280
sal
/
60,260
{ ,150} <
84,140
50,120
\
70,100
50,100
\age
/
sal
/
{ 55, } <
{ , 90 }<
\
age
/
\
\
{ 50, }<
50,080
\
sal /
\
{ ,75)<
\
24,060
46,060
Section 7 # 31
Tree-like Structures for Multidimensional datasets e.g., Quad tree indexes
- Interior nodes (Inodes) correspond to rectangulars in 2-D (more generally, they can be constructed to
represent hypercubes higher dimensional space)
- If the number of points in the rectangle fits in a block, it's a leaf, else the rectangle is treated as interior node
with children corresponding to its 4 quadrants.
- to insert into the quad treee index: search to find the proper leaf; if there's room, insert; else split node into
4 quadrants, divide contents appropriately.
Example: Build the Quad-tree index as it would develop, assuming (age,sal) arrive in this order:
age,sal (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)
Insert (24,60) (46,60)
The only leaf node is:
age sal
24,060
46,060
400K
380K
360K
340K
320K
300K
280K
260K
240K
220K
200K
180K
160K
140K
120K
100K
80K
60K
40K
20K
0K
*
0
10
20
*
30
40
50
60
70
80
90
100  AGE
Section 7
# 32
Tree-like Structures for Multidim datasets
e.g., Quad tree indexes
insert
age sal
50, 080
24,060
46,060
(leaf full split (e.g., at age=50 and sal=200) divide contents by quadrant
.-NW
/
/---NE
age,sal /
{50,200} <
\
\---SW 24,060
\
46,060
\
`SE 50,080
400K
380K
360K
340K
320K
300K
280K
260K
240K
220K
200K
180K
160K
140K
120K
100K
80K
60K
40K
20K
0K
*
0
10
20
*
30
40
*
50
60
70
80
90
100  AGE
Section 7
# 33
Tree-like Structures for Multidim datasets e.g., Quad tree indexes
insert
50, 120 full split SE at 75,100
(not full insert
.-NW
.-NW
/
/
/---NE
/---NE
age,sal /
age,sal /
{50,200} <
{50,200} <
.-NW 50,100
\
\
/
50,120
\---SW 24,060
\---SW 24,060
/---NE
\
46,060
\
46,060
/
\
\
<
`SE(75,100)<
`SE 50,080
\
50,100
\---SW 50,080
\
\
`SE
insert
50, 100
.-NW
/
/---NE
/
age,sal
{50,200} <
\
\---SW 24,060
\
46,060
\
`SE 50,080
400K
380K
360K
340K
320K
300K
280K
260K
240K
220K
200K
180K
160K
140K
120K
100K
80K
60K
40K
20K
0K
ETC.
*
0
10
20
*
30
40
**
*
50
60
70
80
90
100  AGE
Section 7
# 34
Tree-like Structures for Multidim datasets: Region tree (Rtree) indexes
- inodes of an R-tree correspond to interior regions, (which can be overlapping) (usually regions are
rectangles, tho, not necessarily)
- R-tree regions have subregions that represent the contents of their children
- And the subregions need not cover the region they subdivide (but all data must be within a subregion)
Example, Consider the spatial image:
Assume a leaf can hold 6 regions (bfr=6)
Example: Consider the spatial image:
and that the 6 regions or objects above are
together on 1 leaf block, whose region
100________________________________________________________
is shown as the outer red rectangle
|
|
|
| Thus the R-tree has a root and 1 leaf:
|
|
|
| ( (0,0), (100,90) ) (corners of outer
|
.---------.
|
red region)
|
|
|
|
|
| school |
|
|
|_________|
|
|
| road1 road2 house1 school house2 pipeline
|
|
(a full leaf with 6 objects)
|---------------------------.
|
|
road1
|
.-------.
|
|---------------------------|
|house2 |
|
|
|r |
|_______|
|
|
.------.________ |o_|_____________________________|
|
|house1|________ |a_|________pipeline_____________|
|
|______|
|d |
|
|
|2 |
|
|
| |
|
|
| |
|
0 `-------------------------------------------------------'
Section 7 # 35
0
100
Rtree indexes cont.
(0,0), (100,90)
road1 |(60,50)
road2 | house1(20,20),
school | house2
| pipeline
(0,0),
(100,80)
POP
100________________________________________________________
|
|
|
|
|
.---------.
|
|
|
|
| school |
|
|_________|
|
|
POP
|
|
|
|
|
|
|
|
|
|
Split the full leaf putting 4 objects in 1
new leaf and 3 in the other
(minimize overlap and split ~evenly)
|---------------------------.
|
|
road1
|
.-------.
|
|---------------------------|
|house2 |
|
|
|r |
|_______|
|
|
.------.________ |o_|_____________________________|
|
|house1|________ |a_|________pipeline_____________|
Now suppose a local cellular phone
|
|______|
|d |
|
company adds a POP as shown.
|
|2 |
|
|
| |
|
|
| |
|
0 `-------------------------------------------------------'
0
100
Section 7 # 36
Rtree indexes cont.
(0,0), (100,90)
(80,50)
(0,0), (60,50)
(20,20), (100,80)
road1 | road2 | house1 house2 house3
school | house2 | pipeline
POP
Note that house2 is in both regions.
100________________________________________________________
|
|
|
|
|
.---------.
|
|
|
|
| school |
|
|_________|
|
|
POP
|
|
|
|
|
|
|
|
|
|
Since house3 is not in either region (and
both have room) we must decide to
expand one of them.
If we pick the green, expanding it to
(0,20), (100,80) we add 1600 units2
If we pick the purple, expanding it to
((0,0), (80,50) we add 1000 units2
so to minimize we pick the purple.
|---------------------------.
|
|
road1
|
.-------.
|
|---------------------------|
|house2 |
|
|
|r |
|_______|
|
|
.------.________ |o_|_____________________________|
|
|house1|________ |a_|________pipeline_____________|
Now
|
|______|
|d |
|
|
|2 |
|
|
| |
|
house3
|
| |
|
0 `-------------------------------------------------------'
0
100
suppose we insert house3
Section 7
# 37
Binary Radix Tree Index (AKA a trie) an additional index structure
(e.g., used in IBM AS/400 systems)
Similar to B-tree, except
- only the common parts of key values are embedded in inodes
- a single bit is used to make the navigation direction decision at each level (0 for up
and 1 for down). (zero-based bit positions are used)
Example: (in this example, the tree structure is being built left-to-right)
Starting with an empty structure,
INSERT
JAY | LA | 25 | STAR
nam_trie_INDEX
name
part RRN
b3 JAY
1
J<
ON
(assigned RRN=1 to it)
CUSTOMER FILE
RRN nam loc age
job
1 | JAY | LA | 25 | STAR
2 | JON | LA | 45 | HOOD
2
INSERT
JON | LA | 45 | HOOD
(assigned RRN=2
1st letters are teh same (J) so the common pat is embedded in the root
2nd letters: A and O, bit 3 (zero-based count) is 1st difference (and makes the decision)
0123 4567 bit positions
DBCDIC for A=1100 0001
EBCDIC for O=1101 0110
Section 7
# 38
Binary Radix Tree Index (AKA a trie) cont.
nam_trie_INDEX
CUSTOMER FILE
RRN nam loc age
b2 N 3
b3
b3 A<
JAY
1
Y 1
J<
J
ON
2
job
1 | JAY | LA | 25 | STAR
2 | JON | LA | 45 | HOOD
3 | JAN | RO | 93 | DOC
<
ON 2
INSERT
JAN | RO | 93 | DOC
(assigned RRN=3
0123 4567 bit positions
DBCDIC for N=1101 0101
EBCDIC for Y=1110 0000
Section 7
# 39
Binary Radix Tree Index (AKA a trie) cont.
nam_trie_INDEX
b2
b2 N 3
b3 A<
Y 1
J
<
INSERT
CUSTOMER FILE
RRN nam loc age
job
1 | JAY | LA | 25 | STAR
2 | JON | LA | 45 | HOOD
3 | JAN | RO | 93 | DOC
4 | SUE | RO | 16 | PROG
<
ON 2
SUE 2
SUE | RO | 16 | PROG
(assigned RRN=4
0123 4567 bit positions
DBCDIC for J= 1101 0001
EBCDIC for Y= 1110 0010
Section 7
# 40
Thank
you.
Section 7