GiST, Concluded, and
Query Execution
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
September 16, 2008
Content on hashing and sorting courtesy Ramakrishnan & Gehrke
Generalizing B+ Trees and R Trees: GiST
Question: can we create an index toolkit that can be
customized for any kind of index we’d like?
GiST doesn’t quite get there, but it is a nice framework for
understanding the concepts
Observation: some common aspects of the B+ Tree and the R Tree
can be drawn out:
Height balanced – requires re-organization
Data on leaves; intermediate nodes help focus on a child
High fan-out
GiST is actually used in PostgreSQL
Similarities and Differences
Recall differences we saw between B+ Trees and R Trees:
B+ Tree:
one-dimensional
full ordering among data
R Tree:
many-dimensional
no complete ordering among the data
means that intermediate nodes’ children may fit in more than one
place
also, intermediate node’s bounding box may have lots of space
that’s not occupied by child nodes (relates to “curse of
dimensionality”)
needs to rely on heuristics about where to insert a node
may need to search many possible paths along the tree
Search: Do We Explore a Path?
Each subtree node can be tested –
Consistent(node, predicate)
Returns FALSE only if the node and its children cannot contain
tuples matching the predicate (i.e., it may return false positives)
What does this correspond to in a B+ Tree? An R Tree?
What additional work needs to be done if Consistent() returns
true?
Insertion – Simple Case
Union(entrySet) returns predicate
Returns a predicate (e.g., bounding box) that holds over a
set (e.g., of children) – basically a disjunction
Compress(entry) returns entry’
Simplifies the predicate of the entry, potentially increasing
the chance of false positives
Examples: simplify polygon; truncate string
Decompress(entry) returns entry’
The inverse of the above – may have false positives
Penalty(entry1, entry2)
Gives the cost of inserting entry2 into entry1
Example: how much does it expand bounding rectangle?
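To make these key methods concrete, here is a minimal Python sketch (not from the original slides) of what they might look like for 2-D bounding-box keys, R-Tree style; the function names, the (xmin, ymin, xmax, ymax) box representation, and the grid-snapping Compress are all illustrative assumptions.

import math

# Illustrative sketch: GiST key methods specialized to 2-D bounding-box keys.
# Boxes are (xmin, ymin, xmax, ymax) tuples; all names here are hypothetical.

def consistent(box, query_box):
    """FALSE only if nothing under `box` can match the query (false positives allowed)."""
    bx1, by1, bx2, by2 = box
    qx1, qy1, qx2, qy2 = query_box
    return not (bx2 < qx1 or qx2 < bx1 or by2 < qy1 or qy2 < by1)   # do the boxes overlap?

def union(boxes):
    """Smallest bounding box covering a set of child boxes (the 'disjunction')."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def compress(box, grid=1.0):
    """Simplify the predicate, possibly admitting more false positives
    (here: snap the coordinates outward to a coarse grid)."""
    x1, y1, x2, y2 = box
    return (math.floor(x1 / grid) * grid, math.floor(y1 / grid) * grid,
            math.ceil(x2 / grid) * grid, math.ceil(y2 / grid) * grid)

def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def penalty(parent_box, new_box):
    """Cost of inserting new_box below parent_box: how much the bounding box grows."""
    return area(union([parent_box, new_box])) - area(parent_box)

# e.g. penalty((0, 0, 2, 2), (1, 1, 3, 3)) == 5.0 (union area 9 minus original area 4)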
Insertion – Splitting
PickSplit(entrySet) returns <entrySet1, entrySet2>
Splits the set of children in an intermediate node into two
sets
Typically split according to a “badness metric”
Basic Routines I
How do we search for a node set satisfying a
predicate?
Search(node, predicate)
For every nonleaf, if Consistent, call recursively on
children; return union of result sets from children
For leaf, if Consistent, return singleton set, else empty set
Ordered domains: FindMin(root, predicate), Next(root,
predicate, current)
Used to do a linear scan of the data at the leaves, if ordered
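A rough Python sketch of this Search routine (again, not the slides' own code); the Node structure, with is_leaf and entries of (predicate, child-or-record) pairs, and the consistent test are assumed for illustration.

from dataclasses import dataclass

# Assumed node structure: internal nodes hold (predicate, child_node) entries,
# leaves hold (predicate, record) entries.

@dataclass
class Node:
    is_leaf: bool
    entries: list          # [(predicate, child_node_or_record), ...]

def search(node, predicate, consistent):
    """Return all leaf records whose entries are Consistent with `predicate`."""
    results = []
    for entry_pred, payload in node.entries:
        if not consistent(entry_pred, predicate):
            continue                                   # safe to prune: no false negatives
        if node.is_leaf:
            results.append(payload)                    # may still be a false positive to re-check
        else:
            results.extend(search(payload, predicate, consistent))   # recurse into the child
    return results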
Basic Routines II
How do we insert a node?
Insert(node, new, level)
L = ChooseSubtree(node, new, level)
If room in L, insert new as a child, else invoke Split(node, L, new)
AdjustKeys(node, L)
ChooseSubtree(node, new, level) returns node at level
Recursively descend tree, minimizing Penalty
If at desired level, return node
Among child entries in node, find one with minimal Penalty, return
ChooseSubtree on that child
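A sketch in the same spirit of ChooseSubtree and a split-free Insert; it reuses the assumed Node structure and the penalty() key method from the sketches above, and for simplicity measures `level` as depth from the root.

# Sketch of ChooseSubtree / Insert (splitting omitted); assumes the Node structure
# and the penalty() key method from the sketches above.

def choose_subtree(node, new_pred, level, penalty, depth=0):
    """Descend the tree, at each step entering the child with minimal Penalty."""
    if depth == level or node.is_leaf:
        return node
    best_pred, best_child = min(node.entries, key=lambda e: penalty(e[0], new_pred))
    return choose_subtree(best_child, new_pred, level, penalty, depth + 1)

def insert(root, new_pred, payload, level, penalty, capacity=4):
    target = choose_subtree(root, new_pred, level, penalty)
    if len(target.entries) < capacity:
        target.entries.append((new_pred, payload))
    else:
        raise NotImplementedError("node full: a real GiST would invoke Split/PickSplit here")
    # A full implementation would now call AdjustKeys to re-tighten the ancestors' predicates.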
Helper Functions
Split(root, node, new)
Invoke PickSplit on union of children of node and new
Put one partition into node and create a new node’ with the remainder
Install an entry for node’ into Parent(node); if there is insufficient space,
split the parent in the same way
Modify the predicate describing node
AdjustKeys(root, node)
If node = root, or predicate referring to node is correct, return
Otherwise,
modify predicate for node to contain the union of all keys of node
recursively call AdjustKeys(root, parent(node))
Query Execution
Takes a “physical plan” and attempts to use it to produce
query answers
Some considerations in building execution engines:
Efficiency – minimize copying, comparisons
Scheduling – make standard code-paths fast
Data layout – how to optimize cache behavior, buffer
management, distributed execution, etc.
Execution System Architectures
Central vs. distributed vs. parallel vs. mediator
Data partitioning – vertical vs. horizontal
Monet model – binary relations
Distributed – data placement
One operation at a time – INGRES
Pipelined
Iterator-driven
Dataflow-driven
Hybrid approaches
Execution Strategy Issues
Granularity & parallelism:
Pipelining vs. blocking
Materialization
[Figure: example physical plan with a Join on PressRel.Symbol = Clients.Symbol and a Join on PressRel.Symbol = EastCoast.CoSymbol over Scans of PressRel, Clients, and EastCoast, plus a Select Client = “Atkins” and a Project of CoSymbol.]
Iterator-Based Query Execution
Execution begins at root
open, next, close
Propagate calls to children
May call multiple child nexts
“Synchronous pipelining”
Minimize copies
Efficient scheduling & resource usage
Can you think of alternatives and their benefits?
[Figure: the same example physical plan as on the previous slide, driven by iterator calls from the root.]
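To make the open/next/close protocol concrete, here is a minimal Python sketch (illustrative, not an actual engine's code) of two iterators, a table scan and a selection, pulled from the root; the table contents and predicate are made up.

# Minimal sketch of the iterator (open / next / close) protocol.

class Scan:
    def __init__(self, table):            # `table`: a list of tuples standing in for a relation
        self.table = table
    def open(self):
        self.pos = 0
    def next(self):                        # returns one tuple, or None when exhausted
        if self.pos >= len(self.table):
            return None
        t = self.table[self.pos]
        self.pos += 1
        return t
    def close(self):
        pass

class Select:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()                  # calls propagate down the plan
    def next(self):
        while True:
            t = self.child.next()          # may pull many child tuples per output tuple
            if t is None or self.pred(t):
                return t
    def close(self):
        self.child.close()

# Driving the plan from the root ("synchronous pipelining"):
plan = Select(Scan([("Atkins", "IBM"), ("Jones", "HP")]), lambda t: t[0] == "Atkins")
plan.open()
while (t := plan.next()) is not None:
    print(t)                               # ('Atkins', 'IBM')
plan.close()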
The Simplest Method
Iteration over tables
Sequential scan
Nested loops join
What’s the cost? What tricks might we use to speed it up?
Optimizations:
Double-buffering
Overlap I/O and computation
Prefetch a page into a shadow block while CPU processes different block
Requires second buffer to prefetch into
Switch to that when the CPU is finished with the alternate buffer
Alternate the direction of reads in file scan
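As an illustration of the simplest strategy, here is a small Python sketch of a block nested loops join over lists of tuples; the "pages" are fixed-size chunks standing in for real buffer pages, and the relations and predicate are made up.

# Sketch of a block nested loops join over lists of tuples.
# "Pages" are just fixed-size chunks; a real system would do buffered I/O
# (and could overlap it with computation via double-buffering).

def pages(relation, page_size):
    for i in range(0, len(relation), page_size):
        yield relation[i:i + page_size]

def block_nested_loops_join(outer, inner, join_pred, page_size=2):
    out = []
    for outer_page in pages(outer, page_size):        # one pass over the outer relation
        for inner_page in pages(inner, page_size):    # full scan of the inner per outer page
            for r in outer_page:
                for s in inner_page:
                    if join_pred(r, s):
                        out.append(r + s)
    return out

# Example: join on the first attribute of each tuple.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(block_nested_loops_join(R, S, lambda r, s: r[0] == s[0]))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]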
Speeding Operations over Data
Three general data organization techniques:
Indexing
Associative lookup & synopses
Sorting
Hashing
Indices
GiST and B+ Trees
Alternatives for storage:
<key, record>; <key, rid>;
<key, {rids}>
Clustered vs. unclustered
Bitmapped index – bit position for
each value in the domain
Requires a domain with discrete
values (not necessarily ordinal)
Booleans; enumerations; range-bounded integers
Low-update data
Efficient only for AND, OR expressions between different
predicates
Usefulness of Indices
Where are these structures most useful?
Sargable predicates
Covering indices
In many cases, only help with part of the story
Filter part of the answer set, but we still need further
computation
e.g., AND or OR of two predicates
General rule of thumb:
Unclustered index only useful if selectivity is < 10-20%
Impact of flash memory (SSDs)?
Sorting – External Binary Sort
Divide and conquer: sort into
subfiles and merge
Each pass: we read & write every
page
[Figure: a 7-page example input file (3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2) sorted by repeated 2-way merging: Pass 0 produces 1-page runs, Pass 1 produces 2-page runs, Pass 2 produces 4-page runs, and Pass 3 produces the fully sorted file.]
If N pages in the file, we need
⌈log2 N⌉ + 1
passes to sort the data, yielding a
cost of:
2N(⌈log2 N⌉ + 1)
General External Merge Sort
How can we utilize more than 3 buffer pages?
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B
pages each
Pass 1, 2, …, etc.: merge B-1 runs at a time
[Figure: B main-memory buffers: B-1 input buffers (INPUT 1 … INPUT B-1) stream runs in from disk and are merged into a single OUTPUT buffer that is written back to disk.]
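A simplified Python sketch of the B-buffer sort: pass 0 sorts chunks of B pages, and later passes merge B-1 runs at a time; here items stand in for records and runs are kept in memory rather than spilled to disk files.

import heapq

# Simplified external merge sort: runs are in-memory lists standing in for run files.

def external_merge_sort(items, B, page_size=1):
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    run_size = B * page_size
    # Pass 0: read B pages at a time and sort, producing ceil(N/B) runs.
    runs = [sorted(items[i:i + run_size]) for i in range(0, len(items), run_size)]
    # Passes 1, 2, ...: repeatedly merge B-1 runs into one longer run.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + (B - 1)]))
                for i in range(0, len(runs), B - 1)]
    return runs[0] if runs else []

print(external_merge_sort([9, 4, 8, 7, 5, 6, 3, 1, 2, 6, 3, 4, 2], B=3))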
Cost of External Merge Sort
Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉
Cost = 2N * (# of passes)
With 5 buffer pages, to sort a 108-page file:
Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last
run is only 3 pages)
Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each
(final run only uses 8 pages)
Pass 2: ⌈6/4⌉ = 2 sorted runs, 80 pages and 28 pages
Pass 3: Sorted file of 108 pages
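A quick Python check of the formula against the example above (just the arithmetic; the function name is my own).

import math

def external_sort_cost(N, B):
    """Passes and total I/O (2N pages read + written per pass) for external merge sort."""
    passes = 1 + math.ceil(math.log(math.ceil(N / B), B - 1))
    return passes, 2 * N * passes

print(external_sort_cost(108, 5))    # (4, 864): 4 passes over the 108-page file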
Applicability of Sort Techniques
Join
Intersection
Aggregation
Duplicate removal as an instance of aggregation
XML nesting as an instance of aggregation
Merge Join
Requires data sorted by join attributes
Merge and join sorted files, reading sequentially
a block at a time
Maintain two file pointers
While tuple at R < tuple at S, advance R (and vice versa)
While tuples match, output all possible pairings
Maintain a “last in sequence” pointer
Preserves sorted order of “outer” relation
Cost: b(R) + b(S) plus sort costs, if necessary, plus buffer
In practice, approximately linear: roughly 3 (b(R) + b(S))
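A hedged Python sketch of the merge phase: both inputs are lists of tuples already sorted on the join key (taken here to be the first attribute), and the inner pointer is backed up over runs of equal keys in the spirit of the "last in sequence" pointer above.

# Sketch of merge join: R and S are lists of tuples, both sorted on key(t) = t[0].

def merge_join(R, S, key=lambda t: t[0]):
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            # Keys match: output all pairings for this run of equal keys.
            k = key(R[i])
            j_start = j                        # "last in sequence" marker on the inner
            while i < len(R) and key(R[i]) == k:
                j = j_start
                while j < len(S) and key(S[j]) == k:
                    out.append(R[i] + S[j])
                    j += 1
                i += 1
    return out

R = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
S = [(2, "x"), (2, "y"), (3, "z")]
print(merge_join(R, S))
# [(2, 'b', 2, 'x'), (2, 'b', 2, 'y'), (2, 'c', 2, 'x'), (2, 'c', 2, 'y')]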
Hashing
Several types of hashing:
Static hashing
Extendible hashing
Consistent hashing (used in P2P; we’ll see later)
Static Hashing
Fixed number of buckets (and pages); overflow when
necessary
h(k) mod N = bucket to which data entry with key k belongs
Downside: long overflow chains
[Figure: h(key) mod N maps each key to one of N primary bucket pages (0 … N-1), with overflow pages chained off full buckets.]
Extendible Hashing
If a bucket becomes full, split it in half
Use directory of pointers to buckets, double the directory,
splitting just the bucket that overflowed
Directory much smaller than file, so doubling it is much
cheaper
Only one page of data entries is split
Trick lies in how hash function is adjusted!
Example
Directory is an array of size 4 (global depth = 2)
To find r’s bucket, take the last ‘global depth’ # bits of h(r); we
denote r by h(r)
If h(r) = 5 = binary 101, it is in the bucket pointed to by
directory entry 01
[Figure: directory entries 00, 01, 10, 11 point to Buckets A, B, C, and D, each with local depth 2. Bucket A: 4*, 12*, 32*, 16*; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*.]
Insert: If the bucket is full, split it (allocate a new page, re-distribute)
If necessary, double the directory. (As we will see, splitting a
bucket does not always require doubling; we can tell by
comparing global depth with local depth for the split bucket.)
Insert h(r)=20 (Causes Doubling)
[Figure: before the insert, the directory has global depth 2 and Bucket A (local depth 2) holds 32*, 16*, 4*, 12*. Inserting 20* splits Bucket A: Bucket A keeps 32*, 16* and its ‘split image’ Bucket A2 gets 4*, 12*, 20*, both now with local depth 3; the directory doubles to global depth 3 (entries 000 … 111), while Buckets B, C, and D keep local depth 2.]
Points to Note
20 = binary 10100
Last 2 bits (00) tell us r belongs in A or A2
Last 3 bits needed to tell which
Global depth of directory: Max # of bits needed to tell which bucket
an entry belongs to
Local depth of a bucket: # of bits used to determine if an entry
belongs to this bucket
When does bucket split cause directory doubling?
Before insert, local depth of bucket = global depth
Insert causes local depth to become > global depth; directory is
doubled by copying it over and `fixing’ pointer to split image page
(Use of least significant bits enables efficient doubling via copying of
directory!)
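To tie the pieces together, here is a compact Python sketch of extendible hashing insert using the least-significant bits of the hash, as above; the bucket capacity, the dict-based buckets, and the use of Python's built-in hash() are illustrative choices, and (as the next slide notes) many entries with the same hash value would still overflow this sketch.

# Sketch of extendible hashing with least-significant-bit directory addressing.

class ExtendibleHash:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 1
        # Start with two buckets, addressed by the last bit of the hash.
        self.directory = [{"depth": 1, "keys": []}, {"depth": 1, "keys": []}]

    def _bucket(self, key):
        return self.directory[hash(key) & ((1 << self.global_depth) - 1)]

    def insert(self, key):
        bucket = self._bucket(key)
        if len(bucket["keys"]) < self.capacity:
            bucket["keys"].append(key)
            return
        if bucket["depth"] == self.global_depth:
            # No spare bit to distinguish a split image: double the directory by copying it.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the overflowing bucket into itself and its "split image".
        bucket["depth"] += 1
        image = {"depth": bucket["depth"], "keys": []}
        high_bit = 1 << (bucket["depth"] - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = image          # re-point half the entries at the image
        overflow = bucket["keys"] + [key]
        bucket["keys"] = []
        for k in overflow:                         # redistribute (may split again)
            self.insert(k)

h = ExtendibleHash(bucket_capacity=2)
for k in [4, 12, 32, 16, 1, 5, 21, 13]:
    h.insert(k)
print(h.global_depth, len(h.directory))            # directory has 2**global_depth entries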
Comments on Extendible Hashing
If directory fits in memory, equality search answered with one
disk access; else two
Directory grows in spurts, and, if the distribution of hash values is
skewed, directory can grow large
Multiple entries with same hash value cause problems!
Delete:
If removal of data entry makes bucket empty, can be merged with
‘split image’
If each directory element points to same bucket as its split image, can
halve directory
Relevance of Hashing Techniques
Hash indices use extendible hashing
Uses of static hashing:
Aggregation
Intersection
Joins
Hash Join
Read the entire inner relation into a hash table (join
attributes as key)
For each tuple from the outer relation, look up in the hash
table & join
Not fully pipelined
[Figure: tuples stream from the inner relation into the hash table; each outer tuple then probes the table to produce joined tuples.]
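A minimal Python sketch of this build-then-probe hash join; relations are lists of tuples and the join key is assumed to be the first attribute.

from collections import defaultdict

# Classic hash join sketch: build a hash table on the inner, probe with the outer.

def hash_join(outer, inner, key=lambda t: t[0]):
    table = defaultdict(list)
    for s in inner:                       # build: the entire inner relation goes into memory
        table[key(s)].append(s)
    out = []
    for r in outer:                       # probe: one lookup per outer tuple
        for s in table.get(key(r), []):
            out.append(r + s)
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(hash_join(R, S))                    # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]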
Running out of Memory
Prevention: First partition the data by value into memory-sized groups
Partition both relations in the same way, write
to files
Recursively join the partitions
Resolution: Similar, but do it when the hash table fills
Split hash table into files along bucket
boundaries
Partition remaining data in same way
Recursively join partitions with diff. hash fn!
Hybrid hash join: flush “lazily” a few buckets at a time
Cost: <= 3 * (b(R) + b(S))
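A sketch of the "prevention" strategy in Python: partition both inputs with the same hash function, then join each pair of partitions in memory. The partition lists stand in for spill files, and the partition count is an illustrative parameter.

from collections import defaultdict

# Sketch of partitioned (Grace-style) hash join: partition both inputs with the
# same hash function, then join matching partitions one pair at a time in memory.

def partitioned_hash_join(outer, inner, key=lambda t: t[0], n_partitions=4):
    def partition(rel):
        parts = [[] for _ in range(n_partitions)]
        for t in rel:
            parts[hash(key(t)) % n_partitions].append(t)   # would be a spill file on disk
        return parts

    out = []
    for r_part, s_part in zip(partition(outer), partition(inner)):
        # Join one pair of partitions with an in-memory hash join. If a partition
        # were still too big to fit, we would re-partition it recursively with a
        # different hash function.
        table = defaultdict(list)
        for s in s_part:
            table[key(s)].append(s)
        for r in r_part:
            for s in table.get(key(r), []):
                out.append(r + s)
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(partitioned_hash_join(R, S))        # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]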
The Duality of Hash and Sort
Different means of partitioning and merging data when
comparisons are necessary:
Sorting breaks on a physical rule (memory size) and combines
on a logical step (the merge)
Hashing breaks on a logical rule (hash value) and combines
using a physical step (concatenation)
When larger-than-memory sorting is necessary and multiple
operators use the same key, we can make all the operators work
on the same in-memory portion of the data at the same time
Can we do this with hashing? Hash teams (Graefe)
What If I Want to Distribute Query
Processing?
Where do I put the data in the first place (or do I
have a choice)?
How do we get data from point A to point B?
What about delays?
What about “binding pattern” restrictions?
Looks kind of like an index join with a sargable predicate
Pipelined Hash Join Useful for Joining
Web Sources
Two hash tables
As a tuple comes in, add
to the appropriate side &
join with opposite table
Fully pipelined, adaptive
to source data rates
Can handle overflow as
with hash join
Needs more memory
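A Python sketch of this symmetric (pipelined) hash join: one hash table per input, insert each arriving tuple into its own side and probe the other. The interleaved (side, tuple) stream is a stand-in for tuples arriving asynchronously from two sources.

from collections import defaultdict

# Sketch of a symmetric / pipelined hash join: two hash tables, one per input.
# Tuples may arrive in any interleaving; results are produced as soon as possible.

def symmetric_hash_join(stream, key=lambda t: t[0]):
    """`stream` yields (side, tuple) pairs with side in {'L', 'R'}; yields joined tuples."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, t in stream:
        other = "R" if side == "L" else "L"
        tables[side][key(t)].append(t)                    # remember t for future arrivals
        for match in tables[other].get(key(t), []):       # probe the opposite table now
            yield (t + match) if side == "L" else (match + t)

arrivals = [("L", (2, "b")), ("R", (2, "x")), ("R", (3, "y")), ("L", (3, "c"))]
print(list(symmetric_hash_join(arrivals)))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]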
The Semi-Join/Dependent Join
Take attributes from left and feed to the
right source as input/filter
Important in data integration
Simple method:
for each tuple from left
send to right source
get data back, join
More complex:
Hash “cache” of attributes & mappings
Don’t send attribute already seen
Bloom joins (use bit-vectors to reduce traffic)
[Figure: Join A.x = B.y, with x values from A fed as bindings to source B.]
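A simple Python sketch of the dependent join with the hash-cache optimization: each distinct binding value from the left is sent to the right source at most once. `fetch_right` is a made-up stand-in for a call to the remote source; a Bloom join would instead ship a bit-vector of bindings to reduce traffic.

# Sketch of a dependent (bind) join with a cache of already-sent binding values.

def dependent_join(left, fetch_right, key=lambda t: t[0]):
    """For each left tuple, feed key(t) to the right source; cache answers per binding."""
    cache = {}                                    # binding value -> matching right tuples
    out = []
    for r in left:
        x = key(r)
        if x not in cache:                        # don't send a binding we've already seen
            cache[x] = fetch_right(x)             # one "call" to the remote source per value
        for s in cache[x]:
            out.append(r + s)
    return out

# Stand-in for a remote source B that answers "give me tuples with B.y = value".
B = [(2, "x"), (3, "y"), (3, "z")]
def fetch_right(value):
    return [s for s in B if s[0] == value]

A = [(2, "b"), (3, "c"), (3, "d")]
print(dependent_join(A, fetch_right))
# [(2, 'b', 2, 'x'), (3, 'c', 3, 'y'), (3, 'c', 3, 'z'), (3, 'd', 3, 'y'), (3, 'd', 3, 'z')]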
Wrap-Up
Query execution is all about engineering for efficiency
O(1) and O(lg n) algorithms wherever possible
Avoid looking at or copying data wherever possible
Note that handling larger-than-memory data is of paramount importance
(Should that be so in today’s world?)
As we’ve seen it so far, it’s all about pipelining things through as
fast as possible
But may also need to consider other axes:
Adaptivity/flexibility – may sometimes need this
Information flow – to the optimizer, the runtime system
Upcoming Readings and Talks
For Thursday:
Read Chaudhuri survey as an overview
Read and review Selinger et al. paper
For Tuesday:
Read Volcano and Starburst papers
Write one review contrasting the two on the major issues
Especially: how do they handle search, comparison of costs?
Note that I’ll be giving a talk in the Dept. Research
Seminar, Levine 101, next…