Generalized Search Trees

Download Report

Transcript Generalized Search Trees

Generalized Search Trees
J.M Hellerstein, J.F. Naughton and A. Pfeffer, “Generalized
Search Trees for Database Systems,” Proc. 21st Int’l Conf.
On VLDB, Sep. 1995
Presented By Ihab Ilyas
Topics
Motivation.
Database Search Trees.
Generalized Search Tree.
Properties.
Methods.
Applications.
Motivation
New applications (Multimedia, CAD tools,
document libraries…etc.)
New Data types
Extending search trees to
maximum flexibility
Before GiST
Specialized Search Trees
Example: Spatial Search Trees ( R-Trees)
 Problem: New Applications implies new tree
structure from scratch

Search Trees For Extensible Data Types
Example: Extending B+ to index any ordinal
data
 Problem: Extending data but not the set of
queries supported.

GiST
A third direction for extending search trees
Extensible both in data types supported and
in the queries applied on this data.
Allows new data types to be indexed in a
manner that supports the queries natural to
the data type.
GiST (Cont.)
Unifies previously disparate structures
for currently common data types.

Examples: B+ and R trees can be
implemented as extensions to GiST. Single
code base for indexing multiple dissimilar
applications.
Database Search Trees
Canonical rough picture of database search tree
Key1 Key2 ….
Internal Nodes
Leaf nodes (Linked List)
Search Trees (cont.)
Search Key: A search key may be arbitrary
predicate that holds for each datum below
the key.
Search Tree: A hierarchy of categorizations,
in which each categorization holds for all
data stored under it in the hierarchy.
Generalized Search Tree
Definition: A GiST is a balanced multi-way
tree of variable fan-out between kM and M
Where k is the fill factor.
2
1
k
M
2
With the exception of the root node that can
have fan-out from 2 to M.
GiST (Cont.)
Leaf nodes: (p,ptr)
p: Predicate used as a search key.
 ptr: the identifier of some tuple of the database.

Non-leaf nodes: (p,ptr)
p: Predicate used as a search key.
 ptr: Pointer to another tree node.

Properties
Every node contain between kM and M
unless it is the root.
For each index entry (p,ptr) in a leaf node, p
holds for the tuple
For each index entry (p,ptr) in a non-leaf
node, p is true when instantiated with the
values of any tuple reachable from ptr.
All leaves appear on the same level.
Note on Properties
p holds for p1,p2
…. (p,ptr) …..
p’ holds for p1,p2
…. (p’,ptr’) …..
…. (p1,ptr1) …..
p’  p Not Required
…. (p2,ptr2)
The ability of orthogonal
classification.. Recall R-Tree
GiST Methods
Key Methods: the methods the user can
specify to configure the GiST. The methods
encapsulate the structure and behavior of
the object class used for keys in the tree.
Tree Methods: Provided by the GiST, and
may invoke the required key methods.
Key Methods
E is an entry of the form (p,ptr) , q is a query, P a set of entries
Consistent(E,q): False if p^q guaranteed
unsatisfiable, true otherwise.
Union(P): returns predicate r that holds for all
predicates in P
Compress(E): returns (p’,ptr).
Decompress(E): returns (r,ptr) where pr. This a
lossy compression as we do not require p r
Key Methods (Cont.)
Penalty(E1,E2): returns domain specific penalty
for inserting E2 into the subtree rooted at E1.
Typically the penalty metric is representation of
the increase of size from E1.p to Union(E1,E2).
PickSplit(P): M+1 entries, splits P into two sets of
entries P1,P2, each of the size kM. The choice of
the minimum fill factor is controlled here.
Tree Methods
Search: Controlled by the Consistent
Method.
Insert: Controlled by the Penalty and
PickSplit.
Delete: Controlled by the Consistent
Example
R
(p,ptr)
Penalty = m
(p,ptr)
New (q,ptr)
(p,ptr)
Penalty = n
(p,ptr)
m<n
(p,ptr)
New (q,ptr)
Penalty =i
(p,ptr)
Penalty = j
(p,ptr)
(p,ptr)
(q,ptr)
(p,ptr)
j<i
(p,ptr)(p,ptr)
Full.. Then split according
to PickSplit
(p,ptr)
(p,ptr)
Applications
GiST Over Z (B+ Trees)
GiST Over Polygons in R2 (R Trees)
B+ Trees Using GiST
p here is on the form Contains([x ,y ),v)
Consistent(E,q) returns true if
p
p
If q= Contains([xq,yq),v): (xp<yq)^(yp>xq)
 If q= Equal (xq,v): xp xq <yp

Union(P) returns
[Min(x1,x2,…,xn),MAX(y1,y2,….,yn)).
B+ Trees Using GiST (Cont.)
Penalty(E,F)



If E is the leftmost pointer on its node, returns
MAX(y2-y1,0)
If E is the rightmost pointer on its node, returns
MAX(x1-x2,0)
Otherwise, returns MAX(y2-y1,0)+MAX(x1-x2,0)
 P 


 2 
PickSplit(P) let the first
entries in order
to go to the left node and the remaining in
the right node.
B+ Trees Using GiST (Cont.)
Compress(E) if E is the leftmost key on a non-leaf
node return 0 bytes otherwise, returns E.p.x
Decompress(E)


if E is the leftmost key on a non-leaf node let x= -
otherwise let x=E.p.x
If E is the rightmost key on a non-leaf node let y= . If
E is other entry in a non-leaf node, let y = the value
stored in the next key. Otherwise, let y = x+1
R - Trees Using GiST
The key here is in the form (xul,yul,xlr,ylr)
Query predicates are:

Contains ((xul1,yul1,xlr1,ylr1), (xul2,yul2,xlr2,ylr2))
Returns true if (xul1 xul2) ^( yul1 yul2) ^ ( xlr1 xlr2) ^ ( ylr1 ylr2)

Overlaps ((xul1,yul1,xlr1,ylr1), (xul2,yul2,xlr2,ylr2))
Returns true if (xul1 xlr2) ^( yul1 ylr2) ^ ( xul2 xlr1) ^ ( ylr1 yul2)

Equal ((xul1,yul1,xlr1,ylr1), (xul2,yul2,xlr2,ylr2))
Returns true if (xul1= xul2) ^( yul1= yul2) ^ ( xlr1= xlr2) ^ ( ylr1= ylr2)
R – Trees Using GiST(Cont.)
Consistent(E,q)
p contains (xul1,yul1,xlr1,ylr1), and q is either
Contains, Overlap or Equal (xul2,yul2,xlr2,ylr2)
 Returns true if Overlaps ((xul1,yul1,xlr1,ylr1),
(xul2,yul2,xlr2,ylr2))

Union(P) returns coordinates of the
maximum bounding rectangles of all
rectangles in P.
R – Trees Using GiST (Cont.)
Penalty(E,F)
 Compute
q= Union(E,F) and return
area(q) – area(E.p)
PickSplit(P)
 Variety
of algorithms are provided to best
split the entries in a over-full node.
R – Trees Using GiST (Cont.)
Compress(E)

Form the bounding rectangle of E.p
Decompress(E)

The identity function