Transcript ETH

Symbolic Query Processing
Eric Lo
ETH Zurich
a joint work with Carsten Binnig (U of Heidelberg), Donald Kossmann (ETH Zurich),
Tamer Ozsu (U of Waterloo) and Peter Hass (IBM Almaden Research Center)
© ETH Zürich
Symbolic Query Processing
 Treat all data as symbols (think of variables)
 E.g., a1 represents any value under the domain of
attribute a
 Table R and S are called symbolic relations
ETH Zurich
2
Background – Symbolic Execution 1/3
 Borrow the concept from symbolic execution
 A well known program verification technique
 Represent values of program variables with
symbolic values instead of concrete data
 Manipulate expressions based on those
symbolic values
ETH Zurich
3
Background – Symbolic Execution 2/3
Find a test case for path 1236
Symbolic execution – start:
1. minsalary = read_input();
1. minsalary = ben
2. bensalary = minsalary + 2000;
2. bensalary = ben + 2000;
3. if (bensalary < 80000)
3. bensalary = ben + 2000
4.
output “no kidding!”;
Symbolic execution – end
5. else
6.
and !(bensalary < 80000);- (  )
output “that’s right”;
Instantiate ():
ben = 90000  expected input
“that’s right”  expected output
ETH Zurich
4
Background – Symbolic Execution 3/3
 Has been research for > 20 years
 Still have many limitations

E.g., cannot handle highly complex software
 However, many large software vendors still put
hope on this technique for program verifications

E.g., Microsoft Research
 No progress on database applications 
involve an external database and SQL
ETH Zurich
5
SQP Applications
 Extend program verification and symbolic
execution techniques to support database
applications
 For DBMS testing  focus of today
ETH Zurich
6
Symbolic Query Processing
 Query manipulates data
according to different needs
 R
b=c
b1
S
 Want the join results to have
one tuple? set c1=b1
 Want the join results to have:

four tuples

Zipf distribution (t1 joins more, t2
joins less)?
ETH Zurich
7
DBMS Testing
 To test a DBMS, we generate a lot of test
databases and execute a lot of test queries
 DBMS vendors are looking for a way to control
the intermediate results of a test query such that
we can test an individual component of a DBMS
under a particular test case
ETH Zurich
8
DBMS Testing Example
 Test the accuracy of a cardinality
estimation component of a query
optimizer under

a multi-way hash join query

a two-way join query with
aggregation
 If we can make sure executing the
test query on the test database
expected answer
gives
ETH Zurich
9
DBMS Testing
 The test query is given
 Physical join ordering can
be fixed (by testers)
 Evaluation algorithm (e.g.,
using hash-join) can be
fixed too
 However, the size of the
intermediate results
cannot be fixed easily
ETH Zurich
10
DBMS Testing Problem
Guarantee that executing a test query on a test
database can obtain the desired intermediate query
results (e.g.,. output cardinality, data distribution)
ETH Zurich
11
DBMS Testing Problem
 A test case T is:

a parametric query Qp

with a set of constraints C on each
intermediate result
 A good test database D means

Qp (D) satisfies C
- if the set of parameters p is properly
instantiated

Test case T
D covers test case T
ETH Zurich
12
Trial-and-error
 Generate Database 3, 2, and 1

Using traditional database generators
such as IBM Test DB generator, MSR
DB generator, etc

Search for parameters
 T2 is never covered
 The database generation process
does not care about the test queries
ETH Zurich
13
Latest approach – Finding query
parameters
 MSR realized this problem [TKDE06]
 Given the test database + the test query
Qp, search parameter values for p such that
Qp(D) (almost) fit the cardinality
requirements defined on the test case
 It is a NP-hard problem
 Same as the previous approach, T2 is never
covered
ETH Zurich
14
QAGen – Query Aware test database
Generator
 Based on symbolic query
processing
 We can control the output size
of each intermediate query
result (and even more)
ETH Zurich
15
QAGen – Generate a query-aware test
database for each test case
ETH Zurich
16
QAGen overview
ETH Zurich
17
QAGen overview – Query Analyzer
 Analyzer the query and assign
the knob to an operator
 A knob is a parameter of an
operator to control the output
(e.g., output cardinality,
distribution)
 A knob for an operator is not
always available for tuning
ETH Zurich
18
QAGen overview – Query Analyzer
A knob for an operator is not always available for tuning
join distribution?
Yes
ETH Zurich
join distribution?
No
19
QAGen overview – Query Analyzer
The available knob(s) for an operator depends on its input characteristics
Definition: pre-grouping data
Definition: non pre-grouping data
ETH Zurich
20
QAGen overview – Query Analyzer
ETH Zurich
21
Symbolic Query Engine and
Symbolic Database
ETH Zurich
22
Symbolic Query Engine and
Symbolic Database (SDB)
σa>p
 An SQL operator:

Add predicates to a symbol

Replace a symbol with another other
>p
<=p
symbol (e.g., joining)
 E.g., SELECT a FROM R
WHERE a > p;
 1 output
ETH Zurich
23
Symbolic Query Engine and
Symbolic Database (SDB)
 How to physically store the
symbolic data?
>p
<=p
 Options:

Implement a native symbolic database

Use relational database
s
- How to represent “a1 > p”?
- Stores all predicates that are associated
with a symbol s in a separate relation called
Pred.
a1 a1>p
a2 a2<=p
PTable
PTable
ETH Zurich
24
Data Instantiator
ETH Zurich
25
Data Instantiation
• Data instantiator uses a constraint solver:
• Input: a (propositional) constraint (e.g., A + B > 50)
• Output: any concrete values for the constraint (e.g., A=99,
B=12)
ETH Zurich
26
Symbolic Query Engine
ETH Zurich
27
Symbolic Query Engine
 Iterator-based

open(), getNext(), close()
 No naughty user

ETH Zurich
Contradicting knob values
28
SQP – Table operator
 Fill up the table with
symbols
ETH Zurich
29
SQP – σ operator
ETH Zurich
30
SQP –
operator (with FK constraint)
Action: join key replacement
ETH Zurich
31
SQP –
operator (with FK constraint)
Action: join key replacement
ETH Zurich
32
SQP –
operator (with FK constraint)
 When the input of the join is
pre-grouped, the world has
changed
 It sometimes happen, e.g.,

2-way join
 Base tables A, B and C with
foreign key relationships
 A  B, B C
ETH Zurich
33
SQP –
operator (with FK constraint)
 Do not support join distribution (the knob is disabled by
the analyzer)
 Controlling the output cardinality is a subset-sum problem
(weakly NP-hard)
 Subset-sum has a
pseudo-polynomial time exact
solution using
dynamic programming
ETH Zurich
34
SQP –
operator (with FK constraint)
 Blocking
 During open()

Materialize Table S in a temporary relation

SELECT COUNT(k)
From S
GROUP BY k

Solve the subset-sum
ETH Zurich
35
SQP – χ operator
Action 1:
Aggregation attribute replacement
• o_date3  o_date1
• o_date4  o_date2
1st output group (o_date1)
2nd output group (o_date2)
ETH Zurich
36
SQP – χ operator
Action 2 (base case version):
- Adding aggregation constraints to PTable, base case:
<l_price1, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7>
<l_price2, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7>
<l_price3, aggsum1 = ‘’> <l_price4, aggsum1=‘’>
<l_price7, aggsum1 = l_price1+ l_price2 + l_price3+l_price4 + l_price7>
<l_price5, aggsum2 = l_price5+ l_price6 + l_price8>
<l_price6, aggsum2 = l_price5+ l_price6 + l_price8>
<l_price8, aggsum2 = ‘’>
ETH Zurich
37
SQP – χ operator
Action 2 (optimized version):
- A constraint solver call is exponential to the size of predicates
- Adding 2 aggregation constraints to PTable:
<l_price1, aggsum1 = l_price1 x 5>
<l_price5, aggsum2 = l_price5 x 3>
and do l_price replacement
ETH Zurich
38
Data Instantiation
ETH Zurich
39
Data Instantiation
 Use a constraint solver to instantiate the symbolic
database
 for each symbolic relation r
for each tuple t
for each symbol s
load the related predicates P
instantiate P
cache P
ETH Zurich
40
Experiment 1 – Operator Performance
 Study the performance (and scalability) of


Individual operator during SQP
The data instantiation phase
 Use TPC-dbgen to generate 3 TPCH-DB

10M, 100M, 1G
 Q8(TPCH-DB) to collect the intermediate results
R for each operator
 QAGen(Q8, R)  Q8 query aware database
ETH Zurich
41
Experiments –
TPC-H Query 8
ETH Zurich
42
Experiment 1 – TPC-H Query 8
ETH Zurich
43
Experiment 2 – Effects of knob values
 Use TPCH Q8
 6 sets of knob values



TPCH-Uniform, TPCH-Zipf
Min-Uniform, Min-Zipf
Max-Uniform, Max-Zipf
ETH Zurich
44
Experiment 2 – Effects of knob values
ETH Zurich
45
Experiment 3 – System Scalability
ETH Zurich
46
Related Work, Future Work, Conclusions
 Reverse Query Processing (ICDE07)

Given the result R, the query Q, reversely process Q
to generate D  for function testing database
applications, view maintenance, debugging SQL
 Multiple SQL statements (to ACM TSE journal)
ETH Zurich
47
ETH Zurich
48
Current approach 2 – Stochastically
generate many test queries
 Based on a given test database,
RAGS/QGen generates many
valid SQL queries to test the
system
 No guarantee that T1 can be
covered
 Same as the previous approach,
T2 is never covered
ETH Zurich
49
QAGen overview – Query Analyzer
 Each knob combination
(e.g., output cardinality + join
distribution) for an operator
may have different ways to
implement it
 The output is an knobannotated execution plan
ETH Zurich
50