Transcript B + -Tree

Multilevel Index:
+
B -Tree
Session 6
Data Management for Decision Support
Outline of Lecture
•
•
•
•
•
Search Trees
Structure of B+-Tree
Lookup in B+-Trees
Insertion into B+-Tree
Deletion from B+-Tree
Search Trees
• Tree that is used to guide the search for a record
given the value of one of the records fields
– e.g., idea similar to multi-level index structure
• Interior nodes contain search key values Ki and
pointers to other tree nodes Pj
• Leaf nodes containing data (or pointers to data)
• In a search tree of order n, every node has at most
n-1 search values K and n pointers P
Search Trees
• Tree data structure on which two constraints must
hold at all times
– On order of search keys in each node =
K1 < K2 < … < Kn-1
– On values of the search keys in the subtrees pointed at by
Pi
P1 K1 P2 K2 ... Pn-1Kn-1 Pn
tree pointer
x
x
x
X < K1
K1  X < K2
Kn-1  X
B-Tree Data Structure
• Most general index structure used in commercial
systems is the B-tree
• General family of data structures
– Particular variant most often used is known as B+-tree
• B-trees automatically maintain as many levels of
index as is appropriate for size of file being indexed
• B-trees manage space on blocks they use so that
every block is between half used and completely full
– No overflow blocks are ever needed
• Focus the discussion on B+-Tree
– Other type, B-tree, discussed in the handout
B+-Tree
• A B+-Tree is a search tree that has additional
constraints that ensure that the tree is always
balanced
• Every path from root to leaf is of the same length
B+-Tree Overview
• Leaf nodes have entry for every value of the search
field, along with data pointer to the record (or block
that contains record)
– Leaf nodes linked together
• Order n of tree determines size of nodes (n fixed)
• Interior nodes: n tree pointers, at least n/2 must be
used
• Leaf nodes: one pointer points to next leaf block, of
remaining n pointers, at least n/2 point to data records
or key values, all leaf nodes on the same level
180
200
120
150
180
30
100
Root
150
156
179
120
130
100
101
110
30
35
3
5
11
+B Tree
Example
Order n=4
to keys
< 57
to keys
57 k< 81
95
81
57
Sample Non-Leaf
to keys
81 k< 95
to keys
 95
To record
with key 95
To record
with key 81
To record
with key 57
95
81
57
Sample Leaf Node
From non-leaf node
to next leaf
in sequence
Size of Nodes Revisited
Size of nodes:
n pointers
n-1 keys
(n fixed)
Note:
• n= 4 is atypically small!
• Usually pick largest n such n pointers and n-1 keys fit on
one block
• If we assume block size is 4,096 bytes, integers are 4 bytes,
pointers are 8 bytes, no header information
• Find largest n such that 4(n-1)+8(n)  4096
• That value is 341
Don’t Want Nodes to be Too Empty
• Use at least
Non-leaf:
n/2 pointers
Leaf:
n/2
pointers to data
Sample Tree Nodes
Leaf
counts even if null
30
30
35
min. node
120
150
180
Non-leaf
Full node
3
5
11
n=4
B+-Tree Rules
Tree of order n:
(1) All leaves at same lowest level
(balanced tree)
(2) Pointers in leaves point to records
except for “sequence pointer”
B+-Tree Rules Continued
(3) Number of pointers/keys for B+-tree
Max Max Min
ptrs keys ptrsdata
Non-leaf
(non-root)
Leaf
(non-root)
Root
Min
keys
n
n-1
n/2
n/2 - 1
n
n-1
n/2
n/2
n
n-1
1
1
Applications of B+-Trees
(1) Search key of B+-Tree is primary key for data file,
index is dense (data file may or may not be sorted by
pk):
– There is one key-pointer pair in a leaf for every record of the
data file
(2) Data file is sorted by its primary key:
– B+-Tree is a sparse index with one key-pointer pair at a leaf
for each block of the file
(3) Data file is sorted by an attribute that is not a key
(search key for B+-Tree):
– For each value K that appears in the data file, there is one keypointer pair at the leaf. Pointer goes to the first of the records
that have K as their sort-key value.
Applications of B+-Trees
• Additional B+-Tree variants
• E.g., allow multiple occurrences of search key at the
leaves
– Need to change slightly the definition of what search
keys mean
– Figure out modified B+-Tree structure as homework
Lookup in B+-Trees
• Assume no duplicate keys at leaves
• Find record with search-key value K
BASIS: If we are at a leaf, look among keys there. If
the ith leaf is K, then ith pointer will take us to
desired record
INDUCTION: Interior node with keys K1, K2, …, Kn.
There is only one child that could lead to a leaf with
key K. If K< K1, first child. If K1  K < K2, it is
the second child and so on. Recursively apply search
procedure at this child.
180
200
150
156
179
120
130
100
101
110
30
35
3
5
11
120
150
180
30
100
Lookup Example
Find record with search key K = 160
Range Queries
• Range queries have a term in WHERE-clause that
compares the search key with a value or values
– Using comparison operators other than ‘=‘ or ‘<>’
– E.g., SELECT * FROM R WHERE R.k > 40;
• To find all keys in range [a,b], do lookup to find key a
• Whether or not it exists, lead to leaf where a could be
– Search leaf for keys that are a or greater
– Each key has associated pointer to one of the records
whose key is in desired range
• Use pointer to neighboring leaf and keep examining
records until
– Find key higher than b; stop
– Reach end of leaf; go to next leaf and repeat process
Range Query Example
180
200
150
156
179
120
130
100
101
110
30
35
3
5
11
30
120
150
180
100
Find record with search key 37  K  160
Inside the Numbers
• Search path through B+-Tree is log  n-1/2 (K) , K
search key values in file
• If we assume block size is 4,096 bytes, integers are 4
bytes, pointers are 8 bytes, n =341
• With n=341, if we have 1 million search-key values,
lookup requires  log170(1,000,000)  = 3 blocks to be
accessed
– Compared to  log2(1,000,000)  = 20 blocks for search of binary
tree if every node were stored on different block
Efficiency of B+-Trees
• Recall, 340 key-pointer pairs could fit in one block
• Suppose avg. block has occupancy midway between
min. (170) and max. (340), 255
• With a root, 255 children and 2552 = 65025 leaves,
we have 2553 or about 16.6 million pointers to
records
– Can accommodate file with up to 16.6 million tuples by
3-level B+-tree
Insert into B+-Tree
(a) simple case
– space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
Insert is Recursive
• Find place for new key in appropriate leaf if there is
room
• If no room, split leaf into two and divide keys so
that each node is half full (or just over half full)
• Splitting a node at one level appears to parent node
as if new key-pointer pair needs to be inserted at
that higher level; proceed if room, o.w. split parent,
and continue up the tree
• Exception: when inserting into root and there is no
room, split root into two and insert new root; recall
that it is ok for root to have one key and two
pointers
Insert
30
31
32
3
5
11
30
100
(a) Insert key = 32
n=4
Insert
(a) Insert key = 7
30
30
31
3
57
11
3
5
7
100
n=4
180
200
160
179
150
156
179
180
120
150
180
160
100
(c) Insert key = 160
Insert
n=4
Insert
40
45
40
30
32
40
20
25
10
12
1
2
3
10
20
30
new root
30
(d) New root, insert 45
Deletion from B+-Tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
Delete
n=5
(b) Coalesce with sibling
40
50
10
20
30
40
10
40
100
– Delete 50
Delete
(c) Redistribute keys
35
40
50
10
20
30
35
10
40 35
100
– Delete 50
n=5