Transcript ppt slides
An Annotation Management System
for Relational Databases
Laura Chiticariu
University of California, Santa Cruz
Joint work with
Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya
Annotation Management System
a1
A system that is able to propagate meta-data along
with the data as the data is being moved around
Main motivation
To trace the provenance and flow of data
a2
a3
transformation
b2
b1
a1 a2
transformation
b3
b1 b2 b3
a1 a2
transformation step: a query, an ETL rule, etc.
Many other uses
2
Our Vision
Serves fine French Cuisine
in elegant setting. Formal
attire.
NYRestaurants
Cost
Restaurant
Peacock Alley
Bull & Bear
Pacifica
Soho Kitchen & Bar
Extensive wine list!
Type
Zip
$$$
$$$
French 10022
Seafood 10022
$
$
Chinese 10013
American 10022
Yummy chicken curry!!
Cheap Restaurants
All Restaurants
Restaurant
Peacock Alley
Bull & Bear
Pacifica
Soho Kitchen & Bar
Cost
Type
$$$
$$$
French
Seafood
$
$
Chinese
American
Restaurant
Pacifica
Soho Kitchen & Bar
Cost
$
$
Type
Chinese
American
3
Other Applications
Keep information that cannot be otherwise
stored in the current database design
Highlight wrong data
Erroneous data may be copied around but the
comment that it is wrong goes along with it
Security and quality metric
Annotate security or quality levels of data items
4
Some Related Work
Idea is not new though propagation of annotations
was never explicitly stated as provenance-based:
Wang & Madnick [VLDB 90],
Lee, Bressan & Madnick [WIDM 98],
Bernstein & Bergstraesser [IEEE Data Eng. 99]
Superimposed Information. Maier and Delcambre
[WebDB 99]
Annotations of Web documents
Annotations on genomic sequences
Why-Provenance
Cui, Widom, & Wiener [CWW00]
5
Outline
pSQL queries
Semantics
CUSTOM propagation scheme
DEFAULT propagation scheme
DEFAULT-ALL propagation scheme
Implementation
System architecture
Experimental results
6
pSQL – an extension of SQL
A pSQL fragment:
SELECT DISTINCT
FROM
WHERE
PROPAGATE
|
|
selectlist
fromlist
wherelist
DEFAULT
DEFAULT-ALL
r1.A1 TO B1, …, rn.An TO Bn
A pSQL query is a union of pSQL fragments
7
The CUSTOM Scheme
R
Propagate annotations
according to
user specification
A
1
2
3
SELECT DISTINCT A
FROM R r
PROPAGATE r.A TO A
UNION
SELECT DISTINCT A
FROM R r
PROPAGATE r.B TO A
a
b
B
2
c
3
5
h
Result1
a
1
2
b
Result
3
Result2
c
1
annotation
UNION
1
2
3
a c
b
h
2
3
h
8
The DEFAULT Scheme
Propagate annotations
according to
where data is copied from
A
R
B
1
a
2
2
b
3
3
SELECT DISTINCT B
FROM R r
PROPAGATE DEFAULT r.B TO B
UNION
SELECT DISTINCT B
FROM S s
PROPAGATE DEFAULT s.B TO B
5
A
S
c
2
g
5
3
f
6
4
e
4
h
d
B
Result
2
c
g
3
f
4
e
5
h
9
natural semantics for tracing the provenance of data
Annotation Propagation under
the DEFAULT Scheme
R
A
1
B
2
a
S
B
2
C
b
3
Q1:
SELECT DISTINCT r.A, r.B, s.C
FROM R r, S s
Ans1
WHERE r.B =a s.B
PROPAGATE DEFAULT
1
versus
Q2:
SELECT DISTINCT *
FROM R NATURAL JOIN S
PROPAGATE DEFAULT
Ans2
1
2
a
3
equivalent queries,
but different
annotated output
2
a b
3
10
The DEFAULT-ALL scheme
Propagate annotations according to where data is
copied from according to all equivalent formulations
of the given query
User Query Q:
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE DEFAULT-ALL
(*)
the SQL query
corresponding to Q
Compute the results of Q on a database D – idea:
E(Q) denotes the set of all queries that are equivalent to Q
(more precisely, (*)).
Execute each query in E(Q) on the database D under the 11
DEFAULT scheme, then combine the results under a.
Computing the results of a
DEFAULT-ALL query
Question:
Given a pSQL query Q with DEFAULT-ALL
propagation scheme and a database D, can
we compute the result of Q(D)?
Problem:
There are infinitely many queries in E(Q). It is
therefore impossible to execute every query in
E(Q) in order to obtain the result of Q(D).
Solution: Compute a finite basis of E(Q) first
12
A Query Basis of Q
A query basis of Q, denoted as B(Q), is a finite set
of pSQL queries (with default propagation scheme)
such that:
Ua q(D)
qB(Q)
=a
Ua q(D)
qE(Q)
Given B(Q), we can execute each query in B(Q) and
combine the results to obtain the result of Q(D).
Question: Given Q, does B(Q) always exist and how
can we compute B(Q)?
13
Generating a Query Basis of Q
Given R(A,B) and S(B,C)
User query Q:
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE DEFAULT-ALL
The representative
query propagates
annotations according
to where data is copied
from or equivalently
copied from.
Representative Query Q0 : Ans(x,y,z) :- R(x,y), S(y,z).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B
Propagations under the
default propagation scheme
Additional propagation
due to the equality
14
r.B = s.B
Generating a Query Basis of Q
Auxiliary Queries:
Q1:
Ans(x,y,z) :- R(x,y), S(y,z), R(x,w).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, R r’
WHERE r.B = s.B, r’.A = r.A
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, r’.A TO A
Q2:
Ans(x,y,z) :- R(x,y), S(y,z), S(w,z).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, S s’
WHERE r.B = s.B, s’.C = s.C
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, s’.C TO C
15
Generating a Query Basis of Q
Auxiliary Queries:
Q3:
Ans(x,y,z) :- R(x,y), S(y,z), R(w,y).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, R r’
WHERE r.B = s.B, r’.B = r.B
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, r’.B TO B
Q4:
Ans(x,y,z) :- R(x,y), S(y,z), S(y,w).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, S s’
WHERE r.B = s.B, s’.B = s.B
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, s’.B TO B
16
Correctness of the Algorithm
For the example, a query basis of Q
consists of Q0, Q1, Q2, Q3, and Q4.
Theorem:
Given a pSQL query Q with DEFAULT-ALL propagation
scheme, the algorithm generates a query basis of Q.
Proof Idea:
Every query in B(Q) is an equivalent query of Q
Every equivalent query of Q is annotation-contained in
Ua q(D)
qB(Q)
17
Outline
pSQL queries
Semantics
CUSTOM propagation scheme
DEFAULT propagation scheme
DEFAULT-ALL propagation scheme
Implementation
System architecture
Experimental results
18
System Architecture
USER pSQL
query
Translator
SQL
query
RDBMS
sorted
tuples
Postprocessor
final
result
Translator Module
Input: a pSQL query Q
Output: an SQL query Q’ written against the naïve storage
scheme
Q’ is sent to the RDBMS and executed
Postprocessor Module
Input: sorted tuples (returned by the RDBMS)
Output: An annotated set of tuples.
Annotations for the same output location are collected together
19
Duplicate tuples are removed
A Naïve Storage Scheme
For every attribute of every relation there is an
additional attribute for storing the annotations
R
A’
B
B’
2
1
a
2
c
4
1
d
2
-
3
b
4
-
B
a d
1
R’
A
A
c
b
3
Conceivably, there are other possible storage schemes
20
The Translator module
pSQL query
default-all scheme
Generate a
Query Basis
set of pSQL queries
with custom scheme
pSQL query
default scheme
Translate
default pSQL
to custom pSQL
pSQL query
custom scheme
Translate
custom pSQL
to SQL
SQL query
default pSQL query
custom pSQL query
SELECT DISTINCT r.A AS A,
r.B AS B
FROM R r
PROPAGATE DEFAULT
SELECT DISTINCT r.A AS A,
r.B AS B
FROM R r
PROPAGATE r.A TO A, r.B TO B
21
Experiments
Goals
compare the performance of pSQL queries under
different propagation schemes (DEFAULT,
DEFAULT-ALL, or no propagation scheme)
compare the performance of pSQL queries when
the number of annotations in a database is varied
22
Experimental setup
Implemented on top of Oracle 9i
Datasets
100MB, 500MB, 1GB TPCH database
Unannotated database on original schema
30%, 60%, 100% annotations on naïve schema
buffer size: 256Mb
Test queries
SPJ queries
Varied the number of joins (0 to 4 joins)
Varied the number of selected attributes (1,3 or 5 attributes)
23
100MB dataset – 100% annotated
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
1000
seconds (log scale)
100
10
1
0.1
0.01
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated DB
pSQL DEFAULT
pSQL DEFAULT-ALL
Qi(j) denotes a query with i joins and j output attributes.
24
500MB dataset – 100% annotated
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
100000
seconds (log scale)
10000
1000
100
10
1
0.1
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated DB
pSQL DEFAULT
pSQL DEFAULT-ALL
Qi(j) denotes a query with i joins and j output attributes.
25
1GB dataset – 100% annotated
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
100000
seconds (log scale)
10000
1000
100
10
1
0.1
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated DB
pSQL DEFAULT
pSQL DEFAULT-ALL
Qi(j) denotes a query with i joins and j output attributes.
26
100MB dataset annotated in various degrees
pSQL Default on 30%, 60% 100% annotated DBs
1000
seconds (log scale)
100
10
1
0.1
0.01
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated BD
30% Annotated DB
60% Annotated DBt
100% Annotated DB
Qi(j) denotes a query with i joins and j output attributes.
27
Contributions
an annotation management system
pSQL query language for propagation annotations
for carrying annotations along as data is being transformed
based on provenance
CUSTOM – user defined
DEFAULT – where data was copied from?
DEFAULT-ALL – invariant under equivalent queries
Generate-Query-Basis algorithm
an initial implementation
28
Future work
Performance of our annotation management
system on other storage schemes
pSQL extensions
Aggregates
Bag Queries
29
END
30
The CUSTOM Scheme - Example
R
A
1
SELECT DISTINCT B
FROM R r
PROPAGATE r.A TO B,
r.B TO B
2
B
a
2
b
c
3
3
5
h
Result
2
3
5
a c
b
h
31
Terminology
A location is a triple (R, t, A)
R
A
1
B
2
a
The annotation “a”
is attached to the
location (R,(1,2),B)
Definition:
A query Q1 is annotation contained in a query Q2 if:
•
•
Q1 Q2
for every database D, the set of annotations attached to every
output location in Q1(D) is a subset of the set of annotations
associated with the same location in the output of Q2(D).
32
In a More Concise Notation
Ans(x,y,z) :- R(x,y), S(y’,z), y = y’.
{ x ! 1, y ! 2,
Ans(x,y,z) :- R(x,y), S(y,z).
{ x ! 1, y ! 2,
a
y’ ! 2,
b
z!3}
a b
z!3}
Annotations of values that reside in different source locations but are
bound to the same variable are unioned together.
Ans(y) :- R(x,y).
Ans(y) :- S(y,z).
a b
Ans(2
).
Annotations that belong to the same output location are unioned
together.
33
Containment vs.
annotation-containment
R
A
B
C
1
a
2
3
1
b
4
5
1
8
4
8
9
5
c
d
Q1
Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z).
Ans1
1
a b
Q2
Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z).
5
c
Ans2
Q1 Q2 but…
Q1 a Q2 and Q2 a Q1
1
b
5
c d
34
Translating a CUSTOM pSQL to SQL
custom pSQL query:
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE s.B TO B, s.C TO C, r.B TO B
SQL query:
SELECT DISTINCT *
FROM ( Q1 UNION Q2 ) t
ORDER BY t.A, t.B, t.C
Q1:
SELECT r.A,
s.B,
s.C,
FROM R r, S
WHERE r.B =
NULL,
s.B’,
s.C’
s
s.B
Q2:
SELECT r.A,
s.B,
s.C,
FROM R r, S
WHERE r.B =
NULL,
r.B’,
NULL
s
s.B
35