Join Synopses for Approximate Query Answering

Download Report

Transcript Join Synopses for Approximate Query Answering

Join Synopses for
Approximate Query Answering
Swarup Acharya
Phillip B. Gibbons
Viswanath Poosala
Sridhar Ramaswamy
Purpose
• Provide good approximate answers for join queries.
• Make use of join synopsis.
• Allocate space for join synopses.
• Maintain join synopses.
Why?????
• Reduce overhead for large DB’s.
• Approximate answer is best.
• Reduce access to base relation.
• Fast response time.
Proposed
• Make use of pre-computed samples of DJ’s. (Join
Synopses).
• For queries with foreign key joins, provide good
approximate join aggregates using small number of join
synposes.
• Allocation of space among the sets.
• If workload characteristics is available.
• Provide confidence bounds.
Aqua System
• Goal- improve response time for queries by reducing
access to base relation.
• Maintains smaller sized statistical summaries –
“SYNOPSES”
• Provides confidence bounds.
• It has 3 components
1:Statistics Collection
2:Query Rewriting
3:Maintenance.
• It sits on top of a DBMS.
Problem with Joins
• Natural set of synopses would be random
samples from each of the base relation.
• Non-Uniform Result Sample.for the join of base samples to be a
uniform random sample of original
relation the probability must equal.
• Small join result sizes.
• Using samples on base relations is not feasible.
Join Synopses
• Pre-compute samples of join results.
• Compute samples of the results of a small
set of distinguishing joins.
• Can obtain random samples for all
possible joins in the schema.
• This scheme is for foreign key joins.
• Model the DB as a graph.
Join Synopses
• There is a 1-1 correspondence between a tuple in a
relation ‘r’ & a tuple in the output of any foreign key join
involving ‘r’ & any of its descendants in the graph.
• A sample Sr of a relation ‘r’ can be used to produce
another relation J(Sr) called a join synopsis of ‘r’. (
provides random samples).
• The subgraph of G on the ‘k’ nodes in any k-way foreign
key join must be a connected subgraph with a single root
node.
Join Synopses
• For each node u in G, corresponding to a relation r1, define J(u) to
be the output of the maximum foreign key join r1xr2x..xrk with
source r1.
• Let Su be a uniform random sample of r1.
• The join synopsis J(Su) is the output of Suxr2xr3…..xrk.
• J(Su) is a uniform random sample of J(u) with |Su| tuples.
• Thus we can extract from our synopsis a uniform random sample of
the output of any k-way foreign key join.
• From 1 join synopsis for a node whose foreign key join has k
relations, we can extract a URS of the output of between k-1 &
pow(2,k-1)-1 distinct foreign ey joins.
Allocation
• Allocate space among various join synopses when certain
properties of query workload are known.
• Identify heuristics for the common case when such properties are
not known.
• Let ‘S’ be a set of queries with selects, aggregates, group by’s &
foreign key joins.
• For each relation Ri, find fraction Fi of queries in S for which Ri is
the source relation in a foreign key join.
• Select join synopsis sizes so as to minimize the average relative
error.
• It is known that the error bounds are inversely proportional to
sqrt(n).(n- number of tuples in join sample).
Allocation
• The average relative error bound over the
queries is proportional to sum(Fi/sqrt(ni)).
• In the absence of query work load information
heuristic strategies can be used.
• There are 3• EqJoin
• CubeJoin
• PropJoin
Maintenance of Join Synopses
• Need to maintain synchronization when base relation is
updated.
• If a tuple is added do this.
– Let Pu be current probability for including a newly
arrived tuple for relation u in the random sample Su.
– Let uxr2xr3x….xrk be the max foreign key join with
source u.
– We add Tp (new tuple) to Su with probability Pu.
– If Tp is added to Su, we add to J(Su) the tuple
Tpxr2xr3x….rk.
Maintenance of Join Synopses
• On delete of a tuple Tp from u, do this
• Find if Tp is in Su. If there, then delete it from Su &
remove the tuple in J(Su) corresponding to Tp.
• If sample becomes too small , repopulate the
sample by rescanning relation u.
•This algorithm performs look ups to base relation with
probability Pu.
•We never update join synopses for any ancestors of u.
Experimental Evaluation
• Test bed – TPC-D decision support benchmark. DB of
around 300 MB.
• Machine – 296MHz UltraSPARC-II, 256 MB of memory,
Solaris 5.6.
• Query used is based on Q5 & an aggregate computed
on join of Lineitem, Customer, Order, Supplier, Nation,
Region.
• The query used is
Join Synopsis Accuracy
Join Synopsis Maintenance
Conclusion
• Focus on computing approximate answers to
aggregates computed on multi-way joins.
• For DB’s with schema’s that involve only foreign
key joins, join synopses has been proposed as s
solution.
• Join synopses can be maintained effectively
during updates.