GiST, Concluded, and
Query Execution
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
September 16, 2008
Content on hashing and sorting courtesy Ramakrishnan & Gehrke
Generalizing B+ Trees and R Trees: GiST
Question: can we create an index toolkit that can be
customized for any kind of index we’d like?
GiST doesn’t quite get there, but it is a nice framework for
understanding the concepts
Observation: some common aspects of the B+ Tree and the R Tree
can be drawn out:
Height balanced – requires re-organization
Data on leaves; intermediate nodes help focus on a child
High fan-out
GiST is actually used in PostgreSQL
Similarities and Differences
Recall differences we saw between B+ Trees and R Trees:
B+ Tree:
one-dimensional
full ordering among data
R Tree:
many-dimensional
no complete ordering among the data
means that intermediate nodes’ children may fit in more than one
place
also, intermediate node’s bounding box may have lots of space
that’s not occupied by child nodes (relates to “curse of
dimensionality”)
needs to rely on heuristics about where to insert a node
may need to search many possible paths along the tree
Search: Do We Explore a Path?
Each subtree node can be tested –
Consistent(node, predicate)
Returns FALSE only if the node and its children cannot contain
tuples matching the predicate (i.e., it may return false positives)
What does this correspond to in a B+ Tree? An R Tree?
What additional work needs to be done if Consistent() returns
true?
Insertion – Simple Case
Union(entrySet) returns predicate
Returns a predicate (e.g., bounding box) that holds over a
set (e.g., of children) – basically a disjunction
Compress(entry) returns entry’
Simplifies the predicate of the entry, potentially increasing
the chance of false positives
Examples: simplify polygon; truncate string
Decompress(entry) returns entry’
The inverse of the above – may have false positives
Penalty(entry1, entry2)
Gives the cost of inserting entry2 into entry1
Example: how much does it expand bounding rectangle?
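To make these key methods concrete, here is a minimal Python sketch (not from the original slides) of what they might look like for 2-D bounding-box keys, R-Tree style; the function names, the (xmin, ymin, xmax, ymax) box representation, and the grid-snapping Compress are all illustrative assumptions.

import math

# Illustrative sketch: GiST key methods specialized to 2-D bounding-box keys.
# Boxes are (xmin, ymin, xmax, ymax) tuples; all names here are hypothetical.

def consistent(box, query_box):
    """FALSE only if nothing under `box` can match the query (false positives allowed)."""
    bx1, by1, bx2, by2 = box
    qx1, qy1, qx2, qy2 = query_box
    return not (bx2 < qx1 or qx2 < bx1 or by2 < qy1 or qy2 < by1)   # do the boxes overlap?

def union(boxes):
    """Smallest bounding box covering a set of child boxes (the 'disjunction')."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def compress(box, grid=1.0):
    """Simplify the predicate, possibly admitting more false positives
    (here: snap the coordinates outward to a coarse grid)."""
    x1, y1, x2, y2 = box
    return (math.floor(x1 / grid) * grid, math.floor(y1 / grid) * grid,
            math.ceil(x2 / grid) * grid, math.ceil(y2 / grid) * grid)

def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def penalty(parent_box, new_box):
    """Cost of inserting new_box below parent_box: how much the bounding box grows."""
    return area(union([parent_box, new_box])) - area(parent_box)

# e.g. penalty((0, 0, 2, 2), (1, 1, 3, 3)) == 5.0 (union area 9 minus original area 4)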
Insertion – Splitting
PickSplit(entrySet) returns <entrySet1, entrySet2>
Splits the set of children in an intermediate node into two
sets
Typically split according to a “badness metric”
Basic Routines I
How do we search for a node set satisfying a
predicate?
Search(node, predicate)
For every nonleaf, if Consistent, call recursively on
children; return union of result sets from children
For leaf, if Consistent, return singleton set, else empty set
Ordered domains: FindMin(root, predicate), Next(root,
predicate, current)
Used to do a linear scan of the data at the leaves, if ordered
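A rough Python sketch of this Search routine (again, not the slides' own code); the Node structure, with is_leaf and entries of (predicate, child-or-record) pairs, and the consistent test are assumed for illustration.

from dataclasses import dataclass

# Assumed node structure: internal nodes hold (predicate, child_node) entries,
# leaves hold (predicate, record) entries.

@dataclass
class Node:
    is_leaf: bool
    entries: list          # [(predicate, child_node_or_record), ...]

def search(node, predicate, consistent):
    """Return all leaf records whose entries are Consistent with `predicate`."""
    results = []
    for entry_pred, payload in node.entries:
        if not consistent(entry_pred, predicate):
            continue                                   # safe to prune: no false negatives
        if node.is_leaf:
            results.append(payload)                    # may still be a false positive to re-check
        else:
            results.extend(search(payload, predicate, consistent))   # recurse into the child
    return results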
Basic Routines II
How do we insert a node?
Insert(node, new, level)
L = ChooseSubtree(node, new, level)
If room in L, insert new as a child, else invoke Split(node, L, new)
AdjustKeys(node, L)
ChooseSubtree(node, new, level) returns node at level
Recursively descend tree, minimizing Penalty
If at desired level, return node
Among child entries in node, find one with minimal Penalty, return
ChooseSubtree on that child
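A sketch in the same spirit of ChooseSubtree and a split-free Insert; it reuses the assumed Node structure and the penalty() key method from the sketches above, and for simplicity measures `level` as depth from the root.

# Sketch of ChooseSubtree / Insert (splitting omitted); assumes the Node structure
# and the penalty() key method from the sketches above.

def choose_subtree(node, new_pred, level, penalty, depth=0):
    """Descend the tree, at each step entering the child with minimal Penalty."""
    if depth == level or node.is_leaf:
        return node
    best_pred, best_child = min(node.entries, key=lambda e: penalty(e[0], new_pred))
    return choose_subtree(best_child, new_pred, level, penalty, depth + 1)

def insert(root, new_pred, payload, level, penalty, capacity=4):
    target = choose_subtree(root, new_pred, level, penalty)
    if len(target.entries) < capacity:
        target.entries.append((new_pred, payload))
    else:
        raise NotImplementedError("node full: a real GiST would invoke Split/PickSplit here")
    # A full implementation would now call AdjustKeys to re-tighten the ancestors' predicates.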
Helper Functions
Split(root, node, new)
Invoke PickSplit on union of children of node and new
Put one partition into node and create a new node’ with the remainder
Install an entry for node’ into Parent(node); if there is insufficient space,
split the parent in the same way
Modify the predicate describing node
AdjustKeys(root, node)
If node = root, or predicate referring to node is correct, return
Otherwise,
modify predicate for node to contain the union of all keys of node
recursively call AdjustKeys(root, parent(node))
Query Execution
Takes a “physical plan” and attempts to use it to produce
query answers
Some considerations in building execution engines:
Efficiency – minimize copying, comparisons
Scheduling – make standard code-paths fast
Data layout – how to optimize cache behavior, buffer
management, distributed execution, etc.
Execution System Architectures
Central vs. distributed vs. parallel vs. mediator
Data partitioning – vertical vs. horizontal
Monet model – binary relations
Distributed – data placement
One operation at a time – INGRES
Pipelined
Iterator-driven
Dataflow-driven
Hybrid approaches
Execution Strategy Issues
Granularity & parallelism:
Pipelining vs. blocking
Materialization
[Figure: example physical plan with a Join on PressRel.Symbol = Clients.Symbol and a Join on PressRel.Symbol = EastCoast.CoSymbol over Scans of PressRel, Clients, and EastCoast, plus a Select Client = “Atkins” and a Project of CoSymbol.]
Iterator-Based Query Execution
Execution begins at root
open, next, close
Propagate calls to children
May call multiple child nexts
“Synchronous pipelining”
Minimize copies
Efficient scheduling & resource usage
Can you think of alternatives and their benefits?
[Figure: the same example physical plan as on the previous slide, driven by iterator calls from the root.]
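To make the open/next/close protocol concrete, here is a minimal Python sketch (illustrative, not an actual engine's code) of two iterators, a table scan and a selection, pulled from the root; the table contents and predicate are made up.

# Minimal sketch of the iterator (open / next / close) protocol.

class Scan:
    def __init__(self, table):            # `table`: a list of tuples standing in for a relation
        self.table = table
    def open(self):
        self.pos = 0
    def next(self):                        # returns one tuple, or None when exhausted
        if self.pos >= len(self.table):
            return None
        t = self.table[self.pos]
        self.pos += 1
        return t
    def close(self):
        pass

class Select:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()                  # calls propagate down the plan
    def next(self):
        while True:
            t = self.child.next()          # may pull many child tuples per output tuple
            if t is None or self.pred(t):
                return t
    def close(self):
        self.child.close()

# Driving the plan from the root ("synchronous pipelining"):
plan = Select(Scan([("Atkins", "IBM"), ("Jones", "HP")]), lambda t: t[0] == "Atkins")
plan.open()
while (t := plan.next()) is not None:
    print(t)                               # ('Atkins', 'IBM')
plan.close()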
The Simplest Method
Iteration over tables
Sequential scan
Nested loops join
What’s the cost? What tricks might we use to speed it up?
Optimizations:
Double-buffering
Overlap I/O and computation
Prefetch a page into a shadow block while CPU processes different block
Requires second buffer to prefetch into
Switch to that when the CPU is finished with the alternate buffer
Alternate the direction of reads in file scan
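As an illustration of the simplest strategy, here is a small Python sketch of a block nested loops join over lists of tuples; the "pages" are fixed-size chunks standing in for real buffer pages, and the relations and predicate are made up.

# Sketch of a block nested loops join over lists of tuples.
# "Pages" are just fixed-size chunks; a real system would do buffered I/O
# (and could overlap it with computation via double-buffering).

def pages(relation, page_size):
    for i in range(0, len(relation), page_size):
        yield relation[i:i + page_size]

def block_nested_loops_join(outer, inner, join_pred, page_size=2):
    out = []
    for outer_page in pages(outer, page_size):        # one pass over the outer relation
        for inner_page in pages(inner, page_size):    # full scan of the inner per outer page
            for r in outer_page:
                for s in inner_page:
                    if join_pred(r, s):
                        out.append(r + s)
    return out

# Example: join on the first attribute of each tuple.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(block_nested_loops_join(R, S, lambda r, s: r[0] == s[0]))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]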
Speeding Operations over Data
Three general data organization techniques:
Indexing
Associative lookup & synopses
Sorting
Hashing
Indices
GiST and B+ Trees
Alternatives for storage:
<key, record>; <key, rid>;
<key, {rids}>
Clustered vs. unclustered
Bitmapped index – bit position for
each value in the domain
Requires a domain with discrete
values (not necessarily ordinal)
Booleans; enumerations; range-bounded integers
Low-update data
Efficient only for AND, OR expressions between different
predicates
Usefulness of Indices
Where are these structures most useful?
Sargable predicates
Covering indices
In many cases, only help with part of the story
Filter part of the answer set, but we still need further
computation
e.g., AND or OR of two predicates
General rule of thumb:
Unclustered index only useful if selectivity is < 10-20%
Impact of flash memory (SSDs)?
Sorting – External Binary Sort
Divide and conquer: sort into
subfiles and merge
Each pass: we read & write every
page
[Figure: a 7-page example input file (3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2) sorted by repeated 2-way merging: Pass 0 produces 1-page runs, Pass 1 produces 2-page runs, Pass 2 produces 4-page runs, and Pass 3 produces the fully sorted file.]
If N pages in the file, we need
⌈log2 N⌉ + 1
passes to sort the data, yielding a
cost of:
2N(⌈log2 N⌉ + 1)
General External Merge Sort
How can we utilize more than 3 buffer pages?
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B
pages each
Pass 1, 2, …, etc.: merge B-1 runs at a time
[Figure: B main-memory buffers: B-1 input buffers (INPUT 1 … INPUT B-1) stream runs in from disk and are merged into a single OUTPUT buffer that is written back to disk.]
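A simplified Python sketch of the B-buffer sort: pass 0 sorts chunks of B pages, and later passes merge B-1 runs at a time; here items stand in for records and runs are kept in memory rather than spilled to disk files.

import heapq

# Simplified external merge sort: runs are in-memory lists standing in for run files.

def external_merge_sort(items, B, page_size=1):
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    run_size = B * page_size
    # Pass 0: read B pages at a time and sort, producing ceil(N/B) runs.
    runs = [sorted(items[i:i + run_size]) for i in range(0, len(items), run_size)]
    # Passes 1, 2, ...: repeatedly merge B-1 runs into one longer run.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + (B - 1)]))
                for i in range(0, len(runs), B - 1)]
    return runs[0] if runs else []

print(external_merge_sort([9, 4, 8, 7, 5, 6, 3, 1, 2, 6, 3, 4, 2], B=3))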
Cost of External Merge Sort
Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉
Cost = 2N * (# of passes)
With 5 buffer pages, to sort a 108-page file:
Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last
run is only 3 pages)
Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each
(final run only uses 8 pages)
Pass 2: ⌈6/4⌉ = 2 sorted runs, 80 pages and 28 pages
Pass 3: Sorted file of 108 pages
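A quick Python check of the formula against the example above (just the arithmetic; the function name is my own).

import math

def external_sort_cost(N, B):
    """Passes and total I/O (2N pages read + written per pass) for external merge sort."""
    passes = 1 + math.ceil(math.log(math.ceil(N / B), B - 1))
    return passes, 2 * N * passes

print(external_sort_cost(108, 5))    # (4, 864): 4 passes over the 108-page file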
Applicability of Sort Techniques
Join
Intersection
Aggregation
Duplicate removal as an instance of aggregation
XML nesting as an instance of aggregation
Merge Join
Requires data sorted by join attributes
Merge and join sorted files, reading sequentially
a block at a time
Maintain two file pointers
While tuple at R < tuple at S, advance R (and vice versa)
While tuples match, output all possible pairings
Maintain a “last in sequence” pointer
Preserves sorted order of “outer” relation
Cost: b(R) + b(S) plus sort costs, if necessary, plus buffer
In practice, approximately linear: roughly 3 (b(R) + b(S))
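A hedged Python sketch of the merge phase: both inputs are lists of tuples already sorted on the join key (taken here to be the first attribute), and the inner pointer is backed up over runs of equal keys in the spirit of the "last in sequence" pointer above.

# Sketch of merge join: R and S are lists of tuples, both sorted on key(t) = t[0].

def merge_join(R, S, key=lambda t: t[0]):
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            # Keys match: output all pairings for this run of equal keys.
            k = key(R[i])
            j_start = j                        # "last in sequence" marker on the inner
            while i < len(R) and key(R[i]) == k:
                j = j_start
                while j < len(S) and key(S[j]) == k:
                    out.append(R[i] + S[j])
                    j += 1
                i += 1
    return out

R = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
S = [(2, "x"), (2, "y"), (3, "z")]
print(merge_join(R, S))
# [(2, 'b', 2, 'x'), (2, 'b', 2, 'y'), (2, 'c', 2, 'x'), (2, 'c', 2, 'y')]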
Hashing
Several types of hashing:
Static hashing
Extendible hashing
Consistent hashing (used in P2P; we’ll see later)
Static Hashing
Fixed number of buckets (and pages); overflow when
necessary
h(k) mod N = bucket to which data entry with key k belongs
Downside: long overflow chains
[Figure: h(key) mod N maps each key to one of N primary bucket pages (0 … N-1), with overflow pages chained off full buckets.]
Extendible Hashing
If a bucket becomes full, split it in half
Use directory of pointers to buckets, double the directory,
splitting just the bucket that overflowed
Directory much smaller than file, so doubling it is much
cheaper
Only one page of data entries is split
Trick lies in how hash function is adjusted!
Example
Directory is an array of size 4 (global depth = 2)
To find r’s bucket, take the last ‘global depth’ # bits of h(r); we
denote r by h(r)
If h(r) = 5 = binary 101, it is in the bucket pointed to by
directory entry 01
[Figure: directory entries 00, 01, 10, 11 point to Buckets A, B, C, and D, each with local depth 2. Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*.]
Insert: If the bucket is full, split it (allocate a new page, re-distribute)
If necessary, double the directory. (As we will see, splitting a
bucket does not always require doubling; we can tell by
comparing global depth with local depth for the split bucket.)
Insert h(r)=20 (Causes Doubling)
[Figure: before the insert, the directory has global depth 2 and Bucket A (local depth 2) holds 32*, 16*, 4*, 12*. Inserting 20* splits Bucket A: Bucket A keeps 32*, 16* and its ‘split image’ Bucket A2 gets 4*, 12*, 20*, both now with local depth 3; the directory doubles to global depth 3 (entries 000 … 111), while Buckets B, C, and D keep local depth 2.]
Points to Note
20 = binary 10100
Last 2 bits (00) tell us r belongs in A or A2
Last 3 bits needed to tell which
Global depth of directory: Max # of bits needed to tell which bucket
an entry belongs to
Local depth of a bucket: # of bits used to determine if an entry
belongs to this bucket
When does bucket split cause directory doubling?
Before insert, local depth of bucket = global depth
Insert causes local depth to become > global depth; directory is
doubled by copying it over and `fixing’ pointer to split image page
(Use of least significant bits enables efficient doubling via copying of
directory!)
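To tie the pieces together, here is a compact Python sketch of extendible hashing insert using the least-significant bits of the hash, as above; the bucket capacity, the dict-based buckets, and the use of Python's built-in hash() are illustrative choices, and (as the next slide notes) many entries with the same hash value would still overflow this sketch.

# Sketch of extendible hashing with least-significant-bit directory addressing.

class ExtendibleHash:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 1
        # Start with two buckets, addressed by the last bit of the hash.
        self.directory = [{"depth": 1, "keys": []}, {"depth": 1, "keys": []}]

    def _bucket(self, key):
        return self.directory[hash(key) & ((1 << self.global_depth) - 1)]

    def insert(self, key):
        bucket = self._bucket(key)
        if len(bucket["keys"]) < self.capacity:
            bucket["keys"].append(key)
            return
        if bucket["depth"] == self.global_depth:
            # No spare bit to distinguish a split image: double the directory by copying it.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the overflowing bucket into itself and its "split image".
        bucket["depth"] += 1
        image = {"depth": bucket["depth"], "keys": []}
        high_bit = 1 << (bucket["depth"] - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = image          # re-point half the entries at the image
        overflow = bucket["keys"] + [key]
        bucket["keys"] = []
        for k in overflow:                         # redistribute (may split again)
            self.insert(k)

h = ExtendibleHash(bucket_capacity=2)
for k in [4, 12, 32, 16, 1, 5, 21, 13]:
    h.insert(k)
print(h.global_depth, len(h.directory))            # directory has 2**global_depth entries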
Comments on Extendible Hashing
If directory fits in memory, equality search answered with one
disk access; else two
Directory grows in spurts, and, if the distribution of hash values is
skewed, directory can grow large
Multiple entries with same hash value cause problems!
Delete:
If removal of data entry makes bucket empty, can be merged with
‘split image’
If each directory element points to same bucket as its split image, can
halve directory
Relevance of Hashing Techniques
Hash indices use extendible hashing
Uses of static hashing:
Aggregation
Intersection
Joins
Hash Join
Read the entire inner relation into a hash table (join
attributes as key)
For each tuple from the outer relation, look up in the hash
table & join
Not fully pipelined
[Figure: tuples stream from the inner relation into the hash table; each outer tuple then probes the table to produce joined tuples.]
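A minimal Python sketch of this build-then-probe hash join; relations are lists of tuples and the join key is assumed to be the first attribute.

from collections import defaultdict

# Classic hash join sketch: build a hash table on the inner, probe with the outer.

def hash_join(outer, inner, key=lambda t: t[0]):
    table = defaultdict(list)
    for s in inner:                       # build: the entire inner relation goes into memory
        table[key(s)].append(s)
    out = []
    for r in outer:                       # probe: one lookup per outer tuple
        for s in table.get(key(r), []):
            out.append(r + s)
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(hash_join(R, S))                    # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]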
Running out of Memory
Prevention: First partition the data by value into memory-sized groups
Partition both relations in the same way, write
to files
Recursively join the partitions
Resolution: Similar, but do it when the hash table fills
Split hash table into files along bucket
boundaries
Partition remaining data in same way
Recursively join partitions with diff. hash fn!
Hybrid hash join: flush “lazily” a few buckets at a time
Cost: <= 3 * (b(R) + b(S))
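A sketch of the "prevention" strategy in Python: partition both inputs with the same hash function, then join each pair of partitions in memory. The partition lists stand in for spill files, and the partition count is an illustrative parameter.

from collections import defaultdict

# Sketch of partitioned (Grace-style) hash join: partition both inputs with the
# same hash function, then join matching partitions one pair at a time in memory.

def partitioned_hash_join(outer, inner, key=lambda t: t[0], n_partitions=4):
    def partition(rel):
        parts = [[] for _ in range(n_partitions)]
        for t in rel:
            parts[hash(key(t)) % n_partitions].append(t)   # would be a spill file on disk
        return parts

    out = []
    for r_part, s_part in zip(partition(outer), partition(inner)):
        # Join one pair of partitions with an in-memory hash join. If a partition
        # were still too big to fit, we would re-partition it recursively with a
        # different hash function.
        table = defaultdict(list)
        for s in s_part:
            table[key(s)].append(s)
        for r in r_part:
            for s in table.get(key(r), []):
                out.append(r + s)
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(partitioned_hash_join(R, S))        # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]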
The Duality of Hash and Sort
Different means of partitioning and merging data when
comparisons are necessary:
Sorting breaks on a physical rule (memory size) and combines
on a logical step (the merge)
Hashing breaks on a logical rule (hash value) and combines
using a physical step (concatenation)
When larger-than-memory sorting is necessary and multiple
operators use the same key, we can make all the operators work
on the same in-memory portion of the data at the same time
Can we do this with hashing? Hash teams (Graefe)
What If I Want to Distribute Query
Processing?
Where do I put the data in the first place (or do I
have a choice)?
How do we get data from point A to point B?
What about delays?
What about “binding pattern” restrictions?
Looks kind of like an index join with a sargable predicate
Pipelined Hash Join Useful for Joining
Web Sources
Two hash tables
As a tuple comes in, add
to the appropriate side &
join with opposite table
Fully pipelined, adaptive
to source data rates
Can handle overflow as
with hash join
Needs more memory
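A Python sketch of this symmetric (pipelined) hash join: one hash table per input, insert each arriving tuple into its own side and probe the other. The interleaved (side, tuple) stream is a stand-in for tuples arriving asynchronously from two sources.

from collections import defaultdict

# Sketch of a symmetric / pipelined hash join: two hash tables, one per input.
# Tuples may arrive in any interleaving; results are produced as soon as possible.

def symmetric_hash_join(stream, key=lambda t: t[0]):
    """`stream` yields (side, tuple) pairs with side in {'L', 'R'}; yields joined tuples."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, t in stream:
        other = "R" if side == "L" else "L"
        tables[side][key(t)].append(t)                    # remember t for future arrivals
        for match in tables[other].get(key(t), []):       # probe the opposite table now
            yield (t + match) if side == "L" else (match + t)

arrivals = [("L", (2, "b")), ("R", (2, "x")), ("R", (3, "y")), ("L", (3, "c"))]
print(list(symmetric_hash_join(arrivals)))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]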
The Semi-Join/Dependent Join
Take attributes from left and feed to the
right source as input/filter
Important in data integration
Simple method:
for each tuple from left
send to right source
get data back, join
More complex:
Hash “cache” of attributes & mappings
Don’t send attribute already seen
Bloom joins (use bit-vectors to reduce traffic)
[Figure: Join A.x = B.y, with x values from A fed as bindings to source B.]
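A simple Python sketch of the dependent join with the hash-cache optimization: each distinct binding value from the left is sent to the right source at most once. `fetch_right` is a made-up stand-in for a call to the remote source; a Bloom join would instead ship a bit-vector of bindings to reduce traffic.

# Sketch of a dependent (bind) join with a cache of already-sent binding values.

def dependent_join(left, fetch_right, key=lambda t: t[0]):
    """For each left tuple, feed key(t) to the right source; cache answers per binding."""
    cache = {}                                    # binding value -> matching right tuples
    out = []
    for r in left:
        x = key(r)
        if x not in cache:                        # don't send a binding we've already seen
            cache[x] = fetch_right(x)             # one "call" to the remote source per value
        for s in cache[x]:
            out.append(r + s)
    return out

# Stand-in for a remote source B that answers "give me tuples with B.y = value".
B = [(2, "x"), (3, "y"), (3, "z")]
def fetch_right(value):
    return [s for s in B if s[0] == value]

A = [(2, "b"), (3, "c"), (3, "d")]
print(dependent_join(A, fetch_right))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y'), (3, 'c', 3, 'z'), (3, 'd', 3, 'y'), (3, 'd', 3, 'z')]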
Wrap-Up
Query execution is all about engineering for efficiency
O(1) and O(lg n) algorithms wherever possible
Avoid looking at or copying data wherever possible
Note that handling larger-than-memory data is of paramount importance
(Should that be so in today’s world?)
As we’ve seen it so far, it’s all about pipelining things through as
fast as possible
But may also need to consider other axes:
Adaptivity/flexibility – may sometimes need this
Information flow – to the optimizer, the runtime system
Upcoming Readings and Talks
For Thursday:
Read Chaudhuri survey as an overview
Read and review Selinger et al. paper
For Tuesday:
Read Volcano and Starburst papers
Write one review contrasting the two on the major issues
Especially: how do they handle search, comparison of costs?
Note that I’ll be giving a talk in the Dept. Research
Seminar, Levine 101, next…