Transcript What is a B

B-Trees
• Disk Storage
• What is a multiway tree?
• What is a B-tree?
• Why B-trees?
• Insertion in a B-tree
• Deleting from a B-tree
Disk Storage
• In this lecture you will study two external data storage structures:
M-way and B-trees.
• Data is stored on disk (i.e., secondary memory) in blocks.
• A block is the smallest amount of data that can be accessed on a
disk.
• Each block has a fixed number of bytes.
• Each block may hold many data records.
999000
Record 1
data
999004
999004
Record 2
data
empty
Record 3
empty
empty
empty
999000
Record 4
What is a Multiway (M-way) search tree?
• A multiway (or M-way) search tree of order m is a finite set of keys
and references. Either the set is empty or:
1. Each non-leaf node consists of distinct keys ki and of at least 2 and
at most m non-empty M-way subtrees T0, T1, . . . , Tm-1. The number
of keys is one less than the number of non-empty subtrees.
k1
k2
k3
...
km-2
km-1
T0
T1
T2
Tm-2
Tm-1
key < k1
k1 < key < k2
k2 < key < k3
km-2 < key < km-1
key > km-1
2. The keys and subtrees of a non-leaf node are ordered as:
T0, k1, T1, k2, T2, . . . , km-1, Tm-1 such that:
• All keys in subtree T0 are less than k1.
• All keys in subtree Ti , 1 <= i <= m - 2, are greater than ki but
less than ki+1.
• All keys in subtree Tm-1 are greater than km-1
What is an (M-way) search tree?(cont’d)
3. Each leaf node consists of distinct keys pi and at least 2 and at most
m non-null references to data blocks B0, B1, B2, . . . , Bm-1. The
number of keys is one less than the number of non-null references.
4. The keys in a non-leaf node and the referenced blocks are ordered
as: B0, p1, B1, p2, B2, . . . , pm-1, Bm-1 such that:
• All data in block B0 correspond to keys that are less than p1.
• All data in block Bi , 1 <= i <= m - 2, correspond to pi and to
keys that are greater than pi but less than pi+1.
• All data in block Bm - 1 correspond to pm-1 and to keys that are
greater than pm-1.
Multiway (M-way) search tree Examples
• A multiway search tree of order 3:
10
4
1
2
7
4
26
14
7
10
20
32
14
20
26
8
17
24
28
9
19
32
Multiway search tree Examples (cont’d)
• Example:A multiway search tree of order 3:
Note: The leaf nodes in an M-way tree are not necessarily at the same level.
• Example:A multiway search tree of order 5:
What is a B-Tree?
• A B-tree of order m (or branching factor m), where m > 2, is either an
empty tree or a multiway search tree with the following properties:
 The root is either a leaf or it has at least two non-empty subtrees and
at most m non-empty subtrees.
 Each non-leaf node, other than the root, has at least m / 2 nonempty subtrees and at most m non-empty subtrees. (Note: x is the
lowest integer > x ).
 The number of keys in each non-leaf node is one less than the
number of non-empty subtrees for that node.
 Each leaf node has at least m / 2 and at most m references to data
blocks.
 The number of keys in each leaf node is one less than the number of
non-null references in that node.
 All leaf nodes are at the same level; that is the tree is perfectly
balanced.
 A data block referenced by by a leaf node may store at least L / 2
and at most L data records, where L is a fixed constant, L  1.
B-Tree Examples
Example: A B-tree with m = 4 and L = 3
Example:
Choosing M and L for a B-Tree
• A good value for L , the number of records in a block, is the largest
integer satisfying the inequality:
L <= BlockSize / RecordSize
• A good value for the branching factor, m, is the largest integer
satisfying the inequality:
ReferenceSize * m + KeySize * (m - 1) <= BlockSize
• Example: A B-tree stores student records. Each student record is
1024 bytes. The key for a record is a student ID that is 36 bytes. If a
block holds 4096 bytes and a reference requires 32 bytes. Find
suitable values for m and L.
L <= 4096 / 1024 => L = 4
32m + 36(m - 1) <= 4096
=>
m <= 59.7
=> m = 58
Why B-Trees?
• Accessing data in secondary memory is much slower than
accessing data in primary memory
• Thus, when data is too large to fit in primary memory, minimizing the
number of disk accesses is important.
• Many algorithms and data structures that are efficient for
manipulating data in primary memory are not efficient for
manipulating large data in secondary memory because they do not
minimize the number of disk accesses.
• For example, AVL trees are not suitable for representing huge tables
residing in secondary memory.
• The height of an AVL tree increases, and hence the number of disk
accesses required to access a particular record increases, as the
number of records increases.
Why B-Trees? (cont’d)
•
B-trees are suitable for representing huge tables residing in
secondary memory because:
1. With a large branching factor m, the height of a B-tree is low
resulting in fewer disk accesses.
2. The branching factor can be chosen such that a node reference
corresponds to a block of secondary memory.
3. B-trees usually keep related records on the same block. Again
this results in fewer disk accesses.
Why B-Trees? (comparison with AVL Trees)
• The height h of a B-tree of order m, with a total of n keys, satisfies
the inequality:
h <= 1 + log m / 2 ((n + 1) / 2)
• If m = 300 and n = 16,000,000 then h <= 3.31 .
• Thus, in the worst case finding a key in such a B-tree requires 3 disk
accesses.
• The average number of comparisons for an AVL tree with n keys is
log n + 0.25 where n is large.
• If n = 16,000,000 the average number of comparisons is 17.
• Thus, in the average case, finding a key in such an AVL tree
requires 17 disk accesses.
Insertion in B-Trees
•
•
•
In insertion or deletion, a B-tree undergoes changes that must
maintain:
 Its height balance.
 Its leaves to be at the same level.
 Each of its nodes, except the root, to be at least half full (i.e., to
contain a minimum of m / 2  - 1 keys, where m is the order of
the tree).
A node of a B-tree of order m is said to underflow if, after deletion, it
contains  m / 2  - 2 keys.
There are 2 insertion cases:
1. The leaf in which the insertion is to be done does not overflow.
2. The leaf overflows:
(a) with an odd number of keys,
(b) with an even number of keys.
Insertion: case 1
• The leaf does not overflow:
B-tree of order 5
Insert 511
Insertion: case 2a
• The leaf overflows with an odd numbers of keys.
B-tree of order 5
Insert 490
Insertion: case 2a (cont’d)
Insertion: case 2b
The leaf overflows with an even number of keys.
Left bias
insertion of 102
B-tree of order 4
Right bias
insertion of 102
Deletion
• If the key to be deleted is not in a leaf, swap it with either its
successor or predecessor (each will be in a leaf).
• Delete the key from the leaf.
• There are four cases:
1. The leaf does not underflow.
B-tree of order 4
Delete 140
Deletion (cont’d)
2. The leaf underflows and the adjacent right sibling has at least m / 2 
keys.
B-tree of order 5
Delete 113
Deletion (cont’d)
3. The leaf underflows and the adjacent left sibling has at least m / 2
keys.
B-tree of order 5
Delete 135
Deletion (cont’d)
4. The leaf underflows and each adjacent sibling has m / 2 - 1 keys.
merge node, sibling and the
separating key x
If the parent of the merged node underflows, the merging process
propagates upward. In the limit, a root with one key is deleted and the
height decreases by one.
Deletion (cont’d)
B-tree of order 5
Delete 412
The parent of the merged node does not underflow. The merging process
does not propagate upward.
Deletion (cont’d)
B-tree of order 5
Delete D