Transcript ETH
Symbolic Query Processing
Eric Lo
ETH Zurich
a joint work with Carsten Binnig (U of Heidelberg), Donald Kossmann (ETH Zurich),
Tamer Ozsu (U of Waterloo) and Peter Hass (IBM Almaden Research Center)
© ETH Zürich
Symbolic Query Processing
Treat all data as symbols (think of variables)
E.g., a1 represents any value under the domain of
attribute a
Table R and S are called symbolic relations
ETH Zurich
2
Background – Symbolic Execution 1/3
Borrow the concept from symbolic execution
A well known program verification technique
Represent values of program variables with
symbolic values instead of concrete data
Manipulate expressions based on those
symbolic values
ETH Zurich
3
Background – Symbolic Execution 2/3
Find a test case for path 1236
Symbolic execution – start:
1. minsalary = read_input();
1. minsalary = ben
2. bensalary = minsalary + 2000;
2. bensalary = ben + 2000;
3. if (bensalary < 80000)
3. bensalary = ben + 2000
4.
output “no kidding!”;
Symbolic execution – end
5. else
6.
and !(bensalary < 80000);- ( )
output “that’s right”;
Instantiate ():
ben = 90000 expected input
“that’s right” expected output
ETH Zurich
4
Background – Symbolic Execution 3/3
Has been research for > 20 years
Still have many limitations
E.g., cannot handle highly complex software
However, many large software vendors still put
hope on this technique for program verifications
E.g., Microsoft Research
No progress on database applications
involve an external database and SQL
ETH Zurich
5
SQP Applications
Extend program verification and symbolic
execution techniques to support database
applications
For DBMS testing focus of today
ETH Zurich
6
Symbolic Query Processing
Query manipulates data
according to different needs
R
b=c
b1
S
Want the join results to have
one tuple? set c1=b1
Want the join results to have:
four tuples
Zipf distribution (t1 joins more, t2
joins less)?
ETH Zurich
7
DBMS Testing
To test a DBMS, we generate a lot of test
databases and execute a lot of test queries
DBMS vendors are looking for a way to control
the intermediate results of a test query such that
we can test an individual component of a DBMS
under a particular test case
ETH Zurich
8
DBMS Testing Example
Test the accuracy of a cardinality
estimation component of a query
optimizer under
a multi-way hash join query
a two-way join query with
aggregation
If we can make sure executing the
test query on the test database
expected answer
gives
ETH Zurich
9
DBMS Testing
The test query is given
Physical join ordering can
be fixed (by testers)
Evaluation algorithm (e.g.,
using hash-join) can be
fixed too
However, the size of the
intermediate results
cannot be fixed easily
ETH Zurich
10
DBMS Testing Problem
Guarantee that executing a test query on a test
database can obtain the desired intermediate query
results (e.g.,. output cardinality, data distribution)
ETH Zurich
11
DBMS Testing Problem
A test case T is:
a parametric query Qp
with a set of constraints C on each
intermediate result
A good test database D means
Qp (D) satisfies C
- if the set of parameters p is properly
instantiated
Test case T
D covers test case T
ETH Zurich
12
Trial-and-error
Generate Database 3, 2, and 1
Using traditional database generators
such as IBM Test DB generator, MSR
DB generator, etc
Search for parameters
T2 is never covered
The database generation process
does not care about the test queries
ETH Zurich
13
Latest approach – Finding query
parameters
MSR realized this problem [TKDE06]
Given the test database + the test query
Qp, search parameter values for p such that
Qp(D) (almost) fit the cardinality
requirements defined on the test case
It is a NP-hard problem
Same as the previous approach, T2 is never
covered
ETH Zurich
14
QAGen – Query Aware test database
Generator
Based on symbolic query
processing
We can control the output size
of each intermediate query
result (and even more)
ETH Zurich
15
QAGen – Generate a query-aware test
database for each test case
ETH Zurich
16
QAGen overview
ETH Zurich
17
QAGen overview – Query Analyzer
Analyzer the query and assign
the knob to an operator
A knob is a parameter of an
operator to control the output
(e.g., output cardinality,
distribution)
A knob for an operator is not
always available for tuning
ETH Zurich
18
QAGen overview – Query Analyzer
A knob for an operator is not always available for tuning
join distribution?
Yes
ETH Zurich
join distribution?
No
19
QAGen overview – Query Analyzer
The available knob(s) for an operator depends on its input characteristics
Definition: pre-grouping data
Definition: non pre-grouping data
ETH Zurich
20
QAGen overview – Query Analyzer
ETH Zurich
21
Symbolic Query Engine and
Symbolic Database
ETH Zurich
22
Symbolic Query Engine and
Symbolic Database (SDB)
σa>p
An SQL operator:
Add predicates to a symbol
Replace a symbol with another other
>p
<=p
symbol (e.g., joining)
E.g., SELECT a FROM R
WHERE a > p;
1 output
ETH Zurich
23
Symbolic Query Engine and
Symbolic Database (SDB)
How to physically store the
symbolic data?
>p
<=p
Options:
Implement a native symbolic database
Use relational database
s
- How to represent “a1 > p”?
- Stores all predicates that are associated
with a symbol s in a separate relation called
Pred.
a1 a1>p
a2 a2<=p
PTable
PTable
ETH Zurich
24
Data Instantiator
ETH Zurich
25
Data Instantiation
• Data instantiator uses a constraint solver:
• Input: a (propositional) constraint (e.g., A + B > 50)
• Output: any concrete values for the constraint (e.g., A=99,
B=12)
ETH Zurich
26
Symbolic Query Engine
ETH Zurich
27
Symbolic Query Engine
Iterator-based
open(), getNext(), close()
No naughty user
ETH Zurich
Contradicting knob values
28
SQP – Table operator
Fill up the table with
symbols
ETH Zurich
29
SQP – σ operator
ETH Zurich
30
SQP –
operator (with FK constraint)
Action: join key replacement
ETH Zurich
31
SQP –
operator (with FK constraint)
Action: join key replacement
ETH Zurich
32
SQP –
operator (with FK constraint)
When the input of the join is
pre-grouped, the world has
changed
It sometimes happen, e.g.,
2-way join
Base tables A, B and C with
foreign key relationships
A B, B C
ETH Zurich
33
SQP –
operator (with FK constraint)
Do not support join distribution (the knob is disabled by
the analyzer)
Controlling the output cardinality is a subset-sum problem
(weakly NP-hard)
Subset-sum has a
pseudo-polynomial time exact
solution using
dynamic programming
ETH Zurich
34
SQP –
operator (with FK constraint)
Blocking
During open()
Materialize Table S in a temporary relation
SELECT COUNT(k)
From S
GROUP BY k
Solve the subset-sum
ETH Zurich
35
SQP – χ operator
Action 1:
Aggregation attribute replacement
• o_date3 o_date1
• o_date4 o_date2
1st output group (o_date1)
2nd output group (o_date2)
ETH Zurich
36
SQP – χ operator
Action 2 (base case version):
- Adding aggregation constraints to PTable, base case:
<l_price1, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7>
<l_price2, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7>
<l_price3, aggsum1 = ‘’> <l_price4, aggsum1=‘’>
<l_price7, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7>
<l_price5, aggsum2 = l_price5+ l_price6 + l_price8>
<l_price6, aggsum2 = l_price5+ l_price6 + l_price8>
<l_price8, aggsum2 = ‘’>
ETH Zurich
37
SQP – χ operator
Action 2 (optimized version):
- A constraint solver call is exponential to the size of predicates
- Adding 2 aggregation constraints to PTable:
<l_price1, aggsum1 = l_price1 x 5>
<l_price5, aggsum2 = l_price5 x 3>
and do l_price replacement
ETH Zurich
38
Data Instantiation
ETH Zurich
39
Data Instantiation
Use a constraint solver to instantiate the symbolic
database
for each symbolic relation r
for each tuple t
for each symbol s
load the related predicates P
instantiate P
cache P
ETH Zurich
40
Experiment 1 – Operator Performance
Study the performance (and scalability) of
Individual operator during SQP
The data instantiation phase
Use TPC-dbgen to generate 3 TPCH-DB
10M, 100M, 1G
Q8(TPCH-DB) to collect the intermediate results
R for each operator
QAGen(Q8, R) Q8 query aware database
ETH Zurich
41
Experiments –
TPC-H Query 8
ETH Zurich
42
Experiment 1 – TPC-H Query 8
ETH Zurich
43
Experiment 2 – Effects of knob values
Use TPCH Q8
6 sets of knob values
TPCH-Uniform, TPCH-Zipf
Min-Uniform, Min-Zipf
Max-Uniform, Max-Zipf
ETH Zurich
44
Experiment 2 – Effects of knob values
ETH Zurich
45
Experiment 3 – System Scalability
ETH Zurich
46
Related Work, Future Work, Conclusions
Reverse Query Processing (ICDE07)
Given the result R, the query Q, reversely process Q
to generate D for function testing database
applications, view maintenance, debugging SQL
Multiple SQL statements (to ACM TSE journal)
ETH Zurich
47
ETH Zurich
48
Current approach 2 – Stochastically
generate many test queries
Based on a given test database,
RAGS/QGen generates many
valid SQL queries to test the
system
No guarantee that T1 can be
covered
Same as the previous approach,
T2 is never covered
ETH Zurich
49
QAGen overview – Query Analyzer
Each knob combination
(e.g., output cardinality + join
distribution) for an operator
may have different ways to
implement it
The output is an knobannotated execution plan
ETH Zurich
50