Multiway Search Trees

Download Report

Transcript Multiway Search Trees

Multiway
Search Trees
• Data may not fit into main memory
• BigO model no longer applies for performance
analysis
• Cutting disk access by half will cut running time
by half
Multiway
Search Trees
• Very large database -- assume 10 million records
– key = 32 bytes
– record = 32 bytes
• Performance
– Unbalanced binary search tree
• depth is O(n) in worst case (10 million disk accesses)
• in average case, 32 disk accesses
– 5 sec. processing time, assuming 160 msec per accesss
– Random search tree
• for nodes that are very deep, 100 disk accesses required (16
sec.)
•
•
More branching => less height
Need multiple children per node
M-ary
trees non-binary trees used as search trees,
• General
especially in external storage
• Each node can contain one or more keys
• Order n tree: each node contains at most n subtrees
and contains at most n-1keys
• Ex: order 3
M-ary
trees
• If nodes are not full, space is wasted
• Advantage of using external storage
– I/O is slowest operation
– once data is read into memory, search cost is minimal
– time efficiency is improved at the expense of external
space utilization
– size of each node is maximized (order 200 not
uncommon)
• Question that must be addressed: how and where
are records kept
– store records in tree nodes?
– associate pointers with keys to records and store pointers
in nodes?
– store records in leaves?
• Variations of multiway search trees are used to
BTree
• In general, a B-tree of order m is an m-way search tree that is
s empty or is of height > 1 and satisfies the following
either
properties:
– The root node is a leaf or it has at least 2 children
– All nodes other than the root node and leaves have between m/2 and
m children.
– All leaf nodes are at the same level
• An m-way search tree, T, is a tree in which all nodes are of
degree < m.
• A non empty T has the following properties:
– Interior nodes are of the type: P0, (K1,P1),(K2,P2), ...,(Kn,Pn ) where
the Pi , 0  i  n are pointers to the subtrees of T and the Ki , 1i  n
are key values; and 1  n  m
– Ki < Ki+1 , 1  i < n (i.e. keys are ordered)
– All key values in the subtree Pi are less than Ki+1 and greater than Ki-1
, 0 < i < n.
– All key values in the subtree Pn are greater than Kn and those in P0 are
less than K1.
2-3
ATree
specialization of m-way search trees is the 2-3 tree (B-tree of order 3)
•
• Definition: A 2-3 tree is a B-tree with the following properties:
– A node has room to store at most two keys
– Each internal node has either two children (if it contains one key) or three
children (if it contains two keys)
– All leaves are on the same level, so the tree is always height balanced
– For every node, values of all descendants in the left subtree are less than the
value of the first key; values in the center subtree are greater than or equal to
the value of the first key.
– If two keys exists, values of all descendant nodes in the center subtree are less
than the value of the second key; values in the right subtree are greater or
equal the value of the second key
• Example:
• Each node must have room for 2 keys and three pointers
• The 2-3 tree is seldom implemented
B+
• Tree
B+ tree is the most commonly implemented variant
– B+-trees store records only in the leaf nodes
– internal nodes store key values; keys guide the search to the correct
leaf node
– a leaf node in a B+-tree of order m can store more or less than m
records
• depends on the size of a record vs the size of a key
– leaf pages must maintain at least 50% utilization (i.e., remain 1/2
full)
– leaf nodes are linked to allow sequential access
• Example B+-tree of order m= 3 with maximum page size L
= 5:
Operations on
B-Trees
• Find
– Starting at root, do a tree walk; branch according to
value of keys stored in each node.
• Insertion
– Always add the new element to a leaf node
– If the leaf node is full, the node must be split; a split
may propagate splits all the way to the root
– General rules for insertion in a B-tree of order m
• If node contains m keys, node is split into two nodes with
(m+1)/2 and (m+1)/2, respectively.
• If parent node has m children, it is split in same manner.
• Moving up tree, nodes are split if necessary until a parent with
less than m children is found.
• If root is split, a new root with two children is created.
Insertion, M
= 3, L = 5
• Case 1: No split
– Find insertion point by following appropriate branches to leaf node where
new key belongs.
– Ex. insert 48
• Case 2: Overflow in leaf node
– Create a new node. Distribute keys between the two nodes, two to each
node.
– Move smallest key of new node up to parent node.
– Ex. Insert key 107 in (b)
• Tree is now one level deeper
•
Insertion
[continued]
Case 3: Overflow in leaf node and in index node.
– Create a new leaf node and insert new key as in Case 2.
– If parent node is full, create new index node.
– Place smallest index key in old node, largest index key in new node, and
move middle index key up to parent.
• Ex. insert key 95 in (d)
• Only way the tree can gain height: a node split propagates up the tree
and splits the root
B*
Tree
• B* trees differ from B+ trees in insertion algorithm
s
• Alternate method if leaf node overflows
–
–
–
–
Redistribute keys between siblings and update parent node.
If sibling is also full, split the two pages into three
Index nodes handled the same way
Method costs more in coding, but uses less space; nodes tend to be at
least 2/3 full
Deletion
Algorithm
• Locate leaf containing record to be deleted
• Remove the record if leaf node is more than half
full
• Otherwise (underflow)
– if sibling is more than half full, borrow a key or keys
from sibling so that each node has the same number of
records
– If no sibling has enough keys to spare, merge the current
node's keys into a sibling and remove node from the tree
• Update keys in parent nodes--if necessary, up to root
• If root must be deleted due to merging, tree loses one level
• When collapsing a pair of leaf nodes together, the
right node should be removed, since the left node
is pointed to by its left neighbor in the linked list of
leaf nodes
Deletion: Example,
= 5,
• SimpleL
deletion:
deleteM
75 = 3
Before Deletion
After Deletion
•
Deletion:
Example
Deletion
by barrowing: delete 81
Before Deletion
After Deletion
•
Deletion:
Example
Deletion
by merging : delete 83
Before Deletion
After Deletion
Analysis of
B-Trees
• Asymptotic
cost of search, insertion and deletion from B•
•
•
•
•
•
trees, B*-trees and B+-trees is (log n), where n=number
of records.
The maximum depth of the tree is log m/2 n
Only half the available pointers are used at all levels
At each node on path, O(log m) work is done to determine
which branch to take (using binary search)
Insert or Delete can require O(m) work to update all node
information.
Hence, worst case asymptotic cost for insert/delete is
O(m logm n) = O((m (log2n/log2 m) = (log n)
Find operation takes O(log n) time, but finding next record
is only O(1) (involves accessing one additional leaf at
most)
The Efficiency
of B-Trees
• Consider a B+-tree of order 100 and leaf nodes
with up to 100 records
• A 1-level B+-tree can have at most 100 records
• A 2-level B+-tree must have at least 100 records (2
leaves w/50 records each) and at most 10,000
records (100 leaves w/100 records each)
• A 3-level B+-tree must have at least 5000 records
(2 second-level nodes w/ 50 children containing 50
records each) and at most 1 million records (100
second level nodes with 100 full children each)
• A 4-level B+-tree must have at least 250k records
and at most 100 million records. (not many
databases bigger than this)
Search Tree Summary
and Comparisons
• Binary search trees are efficient index structures
– Balanced search trees provide means to search, insert,
and delete entries using at most O(log n) time.
– Data can be accessed in order by traversing the tree
– Have important dynamic insertion and deletion
capabilities
• B-trees compared with other trees
– Like BST, finding an element requires searching only a
single path between root & leaf
– Like AVL, insertion & deletion algorithms guarantee
that longest path between root and leaf is O(log n)
– Unlike AVL, height always of uniform depth
– Unlike Splay, never allowed to get deep
•
Example: M = 5, L =
Before
5,InsertInsert43,85,
97
75, 89, 110
12, 32, 34, 38,40
43, 54, 59, 71, 73
75, 78, 82, 84, 86
89, 92, 94, 96, 99
110, 112, 113, 120, 121
† After Insert 85
89
43, 75, 84
12, 32, 34, 38,40
43, 54, 59, 71, 73
110
75, 78, 82
84, 85, 86
89, 92, 94, 96, 99
110, 112, 113, 120, 121
Design
Problem
• Design
an efficient B+ tree data structure for the following
database:
– 10 million records
•
•
key = 32 bytes
record = 256 bytes
– Disk:
•
•
Block/page size = 4096 bytes
Every disk access returns one block of data.
• Goal:
– Place data and keys in nodes such that every disk access returns at
least one complete node of the tree
• Leaf Nodes:
– Every leaf node should be contained in one disk block,
•
256 bytes/record and 8192 bytes/block and yields 32 records per
block, therefore L = 32.
• Every internal node contains (m-1) keys and m pointers.
– Since each key is 32 bytes, and assuming a pointer takes 4 bytes,
each nodes will consist of 32(m-1) + 4m bytes. At least one node per