Join Synopses for Approximate Query Answering

Download Report

Transcript Join Synopses for Approximate Query Answering

Join Synopses for Approximate
Query Answering
Swarup Acharya, Philip B. Gibbons,
Viswanath Poosala, Sridhar Ramaswamy
By Vladimir Gamaley
Abstract
• In large data environments its difficult to
provide fast and reliable answers.
• In this paper we demonstrate the difficulties
in the traditional approach and propose the
new technics for evaluation and
maintenance of Join Synopses.
Introduction
• Tradition query processing approach (exact
answer though minimal time use)
• Not always an exact answer is needed
• Sometimes appropriate answer is enough
Introduction (continued)
• Schemes for providing approximate
answers that rely on basic relations alone
suffer from serious disadvantages.
• Use of precomputed small sets of
distinguished joins.
Introduction (continued)
•
•
•
•
•
Careful allocation of place
Allocation heuristics
Providing approximate bounds
Join synopses maintenance
Experimental study results
AQUA System
• The goal of AQUA system is to improve
response times by avoiding accesses to the
original data.
• Maintenance of small synopses of various
samples and histograms.
AQUA System (Components)
•Statistics Collection
•Query Rewriting
•Maintenance
AQUA System (Architecture)
Queries
Answers
Bounds
AQUA
Data Warehouse
New
Data
Aqua
Synopses
Problems with joins
Uniform random samples provide:
• Non uniform result samples
• Small join results sizes
Problems with joins (example)
R.X
a
a
b
b
a1
a2
S.X
a
b1
Base probability for tuple to be selected 1/r
a1 and a2 - 1/r3
a1 and b1 - 1/r4
for k way foreign join - 1/rk
b
Join Synopses
Foreign Key Join:
A two way join r1 r2 is a foreign key join if the join attribute is
a foreign key in r1 (a key in r2).
For k > 2, a k-way foreign join if there is an ordering r1,r2..rk
and for j = 1,2,.. K, si-1 ri is a 2-way foreign join where
si-1 is a relation obtained joining r1, r2, … ri-1
Join synopses
TPC-D scheme
L
O
PS
P
C
N
R
S
Join synopses (continued)
Lemma 1:
The subgraph of G on the k nodes in any k - way foreign key
join must be a connected graph with a single root node
Lemma 2:
There is a 1-1 correspondence between tuples in a relation r1
and tuple in any k-way foreign key join.
Join Synopses (continued)
Join Synopses:
For each node u in G, corresponding to a relation r1, define
J(u) to be the output of the maximum foreign key join r1,r2..rk
with source k1. Let Su be a uniform random sample of r1.
Define a join synopses J(Su) to be the output of join Su, r2, ..rk.
The join synopses for scheme consists of join synopses for all
u’s.
Join Synopses (continued)
Theorem:
Let r1,r2…rk, k>=3 be an arbitrary k-way foreign
join, with source relation r1. Let u be the node in G
corresponding to r1 and let Su be a uniform random
sample of r1. Let A be the set of attributes in r1,r1…rk
Then:
1. J(Su) is a uniform random sample of J(u) with |Su|
tuples
2. Join r1, r2…rk is a projection of J(u) on the
attributes in r1, r2…rk
Join Synopses (continued)
Lemma:
From a single join synopses for a node whose maximum
foreign key has k relations we can extract uniform random
sample of between k-1 to 2k-1 -1 distinct foreign key joins
Lemma:
For any node u whose maximum foreign key join is a k-way
join, number of tuples in its renormalized join synopsis
J(Su) is at most k|Su|
Space allocation strategies
ni - numbr of tuples allocated to the join
fi- fraction of queries for which the join is a relation or the
source of foreign key join

i
fi
ni
Theorem:
ni = N’ fi 
 si 
2/3
si - size of join tuple


2 / 3 1/ 3
N’ = N/  fj sj 
 j

Space allocation strategies
Heuristics
EqJoin: Equally between relations
CubeJoin: In proportion to the cube root of their
join synopses tuple size
PropJoin: In proportion to their join synopses size.
Maintenance of Join Synopses
Adding a tuple
Deleting a tuple
Experiments Results
TestBed:
TPC-D decision benchmark. Scale factor 0.3 (database of
about 300 megabytes).
296MHz UltraSparc-II. I/O - 5 MB/sec
Experiment 1
accuracy - summary size
EquiBase, PropBase - produce answers only when
the summary size exceeds 1.5% of the database.
EquiJoin, PropJoin - good results even for 0.1% of
the database.
Experiment 2
execution timing
Actual execution time - 122 seconds.
The response time increases with the summary size.
Query using Join Synopses needs in two orders less time!
Related Work
Approximate query answering
Statistical techniques
Conclusions
Schemes based on join synopses provide better answer than
those, based only on the basic relations samples.
Approximate answering is becoming extremely important in
new application of data warehouses.
However, there are still more problems: group-bys, ranks
etc...