Transcript ppt slides

An Annotation Management System
for Relational Databases
Laura Chiticariu
University of California, Santa Cruz
Joint work with
Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya
Annotation Management System


a1
A system that is able to propagate meta-data along
with the data as the data is being moved around
Main motivation
 To trace the provenance and flow of data
a2
a3
transformation
b2
b1
a1 a2
transformation
b3
b1 b2 b3
a1 a2
transformation step: a query, an ETL rule, etc.

Many other uses
2
Our Vision
Serves fine French Cuisine
in elegant setting. Formal
attire.
NYRestaurants
Cost
Restaurant
Peacock Alley
Bull & Bear
Pacifica
Soho Kitchen & Bar
Extensive wine list!
Type
Zip
$$$
$$$
French 10022
Seafood 10022
$
$
Chinese 10013
American 10022
Yummy chicken curry!!
Cheap Restaurants
All Restaurants
Restaurant
Peacock Alley
Bull & Bear
Pacifica
Soho Kitchen & Bar
Cost
Type
$$$
$$$
French
Seafood
$
$
Chinese
American
Restaurant
Pacifica
Soho Kitchen & Bar
Cost
$
$
Type
Chinese
American
3
Other Applications

Keep information that cannot be otherwise
stored in the current database design

Highlight wrong data


Erroneous data may be copied around but the
comment that it is wrong goes along with it
Security and quality metric

Annotate security or quality levels of data items
4
Some Related Work

Idea is not new though propagation of annotations
was never explicitly stated as provenance-based:







Wang & Madnick [VLDB 90],
Lee, Bressan & Madnick [WIDM 98],
Bernstein & Bergstraesser [IEEE Data Eng. 99]
Superimposed Information. Maier and Delcambre
[WebDB 99]
Annotations of Web documents
Annotations on genomic sequences
Why-Provenance

Cui, Widom, & Wiener [CWW00]
5
Outline

pSQL queries

Semantics




CUSTOM propagation scheme
DEFAULT propagation scheme
DEFAULT-ALL propagation scheme
Implementation


System architecture
Experimental results
6
pSQL – an extension of SQL

A pSQL fragment:
SELECT DISTINCT
FROM
WHERE
PROPAGATE
|
|

selectlist
fromlist
wherelist
DEFAULT
DEFAULT-ALL
r1.A1 TO B1, …, rn.An TO Bn
A pSQL query is a union of pSQL fragments
7
The CUSTOM Scheme

R
Propagate annotations
according to
user specification
A
1
2
3
SELECT DISTINCT A
FROM R r
PROPAGATE r.A TO A
UNION
SELECT DISTINCT A
FROM R r
PROPAGATE r.B TO A
a
b
B
2
c
3
5
h
Result1
a
1
2
b
Result
3
Result2
c
1
annotation
UNION
1
2
3
a c
b
h
2
3
h
8
The DEFAULT Scheme

Propagate annotations
according to
where data is copied from
A
R
B
1
a
2
2
b
3
3
SELECT DISTINCT B
FROM R r
PROPAGATE DEFAULT r.B TO B
UNION
SELECT DISTINCT B
FROM S s
PROPAGATE DEFAULT s.B TO B
5
A
S
c
2
g
5
3
f
6
4
e
4
h
d
B
Result
2
c
g
3
f
4
e
5
h
9
natural semantics for tracing the provenance of data
Annotation Propagation under
the DEFAULT Scheme
R
A
1
B
2
a
S
B
2
C
b
3
Q1:
SELECT DISTINCT r.A, r.B, s.C
FROM R r, S s
Ans1
WHERE r.B =a s.B
PROPAGATE DEFAULT
1
versus
Q2:
SELECT DISTINCT *
FROM R NATURAL JOIN S
PROPAGATE DEFAULT
Ans2
1
2
a
3
equivalent queries,
but different
annotated output
2
a b
3
10
The DEFAULT-ALL scheme


Propagate annotations according to where data is
copied from according to all equivalent formulations
of the given query
User Query Q:
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE DEFAULT-ALL

(*)
the SQL query
corresponding to Q
Compute the results of Q on a database D – idea:


E(Q) denotes the set of all queries that are equivalent to Q
(more precisely, (*)).
Execute each query in E(Q) on the database D under the 11
DEFAULT scheme, then combine the results under a.
Computing the results of a
DEFAULT-ALL query

Question:
Given a pSQL query Q with DEFAULT-ALL
propagation scheme and a database D, can
we compute the result of Q(D)?

Problem:
There are infinitely many queries in E(Q). It is
therefore impossible to execute every query in
E(Q) in order to obtain the result of Q(D).

Solution: Compute a finite basis of E(Q) first
12
A Query Basis of Q

A query basis of Q, denoted as B(Q), is a finite set
of pSQL queries (with default propagation scheme)
such that:
Ua q(D)
qB(Q)
=a
Ua q(D)
qE(Q)

Given B(Q), we can execute each query in B(Q) and
combine the results to obtain the result of Q(D).

Question: Given Q, does B(Q) always exist and how
can we compute B(Q)?
13
Generating a Query Basis of Q


Given R(A,B) and S(B,C)
User query Q:
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE DEFAULT-ALL

The representative
query propagates
annotations according
to where data is copied
from or equivalently
copied from.
Representative Query Q0 : Ans(x,y,z) :- R(x,y), S(y,z).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B
Propagations under the
default propagation scheme
Additional propagation
due to the equality
14
r.B = s.B
Generating a Query Basis of Q
Auxiliary Queries:
Q1:

Ans(x,y,z) :- R(x,y), S(y,z), R(x,w).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, R r’
WHERE r.B = s.B, r’.A = r.A
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, r’.A TO A
Q2:
Ans(x,y,z) :- R(x,y), S(y,z), S(w,z).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, S s’
WHERE r.B = s.B, s’.C = s.C
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, s’.C TO C
15
Generating a Query Basis of Q
Auxiliary Queries:
Q3:

Ans(x,y,z) :- R(x,y), S(y,z), R(w,y).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, R r’
WHERE r.B = s.B, r’.B = r.B
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, r’.B TO B
Q4:
Ans(x,y,z) :- R(x,y), S(y,z), S(y,w).
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s, S s’
WHERE r.B = s.B, s’.B = s.B
PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B, s’.B TO B
16
Correctness of the Algorithm

For the example, a query basis of Q
consists of Q0, Q1, Q2, Q3, and Q4.
Theorem:
Given a pSQL query Q with DEFAULT-ALL propagation
scheme, the algorithm generates a query basis of Q.
Proof Idea:

Every query in B(Q) is an equivalent query of Q

Every equivalent query of Q is annotation-contained in
Ua q(D)
qB(Q)
17
Outline

pSQL queries

Semantics




CUSTOM propagation scheme
DEFAULT propagation scheme
DEFAULT-ALL propagation scheme
Implementation


System architecture
Experimental results
18
System Architecture
USER pSQL
query

Translator
SQL
query
RDBMS
sorted
tuples
Postprocessor
final
result
Translator Module


Input: a pSQL query Q
Output: an SQL query Q’ written against the naïve storage
scheme

Q’ is sent to the RDBMS and executed

Postprocessor Module


Input: sorted tuples (returned by the RDBMS)
Output: An annotated set of tuples.
 Annotations for the same output location are collected together
19
 Duplicate tuples are removed
A Naïve Storage Scheme

For every attribute of every relation there is an
additional attribute for storing the annotations
R
A’
B
B’
2
1
a
2
c
4
1
d
2
-
3
b
4
-
B
a d
1
R’
A
A
c
b
3

Conceivably, there are other possible storage schemes
20
The Translator module
pSQL query
default-all scheme
Generate a
Query Basis
set of pSQL queries
with custom scheme
pSQL query
default scheme
Translate
default pSQL
to custom pSQL
pSQL query
custom scheme
Translate
custom pSQL
to SQL
SQL query
default pSQL query
custom pSQL query
SELECT DISTINCT r.A AS A,
r.B AS B
FROM R r
PROPAGATE DEFAULT
SELECT DISTINCT r.A AS A,
r.B AS B
FROM R r
PROPAGATE r.A TO A, r.B TO B
21
Experiments

Goals

compare the performance of pSQL queries under
different propagation schemes (DEFAULT,
DEFAULT-ALL, or no propagation scheme)

compare the performance of pSQL queries when
the number of annotations in a database is varied
22
Experimental setup


Implemented on top of Oracle 9i
Datasets


100MB, 500MB, 1GB TPCH database
 Unannotated database on original schema
 30%, 60%, 100% annotations on naïve schema
 buffer size: 256Mb
Test queries



SPJ queries
Varied the number of joins (0 to 4 joins)
Varied the number of selected attributes (1,3 or 5 attributes)
23
100MB dataset – 100% annotated
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
1000
seconds (log scale)
100
10
1
0.1
0.01
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated DB
pSQL DEFAULT
pSQL DEFAULT-ALL
Qi(j) denotes a query with i joins and j output attributes.
24
500MB dataset – 100% annotated
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
100000
seconds (log scale)
10000
1000
100
10
1
0.1
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated DB
pSQL DEFAULT
pSQL DEFAULT-ALL
Qi(j) denotes a query with i joins and j output attributes.
25
1GB dataset – 100% annotated
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
100000
seconds (log scale)
10000
1000
100
10
1
0.1
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated DB
pSQL DEFAULT
pSQL DEFAULT-ALL
Qi(j) denotes a query with i joins and j output attributes.
26
100MB dataset annotated in various degrees
pSQL Default on 30%, 60% 100% annotated DBs
1000
seconds (log scale)
100
10
1
0.1
0.01
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
SQL on Unannotated BD
30% Annotated DB
60% Annotated DBt
100% Annotated DB
Qi(j) denotes a query with i joins and j output attributes.
27
Contributions

an annotation management system


pSQL query language for propagation annotations




for carrying annotations along as data is being transformed
based on provenance
CUSTOM – user defined
DEFAULT – where data was copied from?
DEFAULT-ALL – invariant under equivalent queries
 Generate-Query-Basis algorithm
an initial implementation
28
Future work

Performance of our annotation management
system on other storage schemes

pSQL extensions


Aggregates
Bag Queries
29
END
30
The CUSTOM Scheme - Example
R
A
1
SELECT DISTINCT B
FROM R r
PROPAGATE r.A TO B,
r.B TO B
2
B
a
2
b
c
3
3
5
h
Result
2
3
5
a c
b
h
31
Terminology

A location is a triple (R, t, A)
R
A
1

B
2
a
The annotation “a”
is attached to the
location (R,(1,2),B)
Definition:
A query Q1 is annotation contained in a query Q2 if:
•
•
Q1  Q2
for every database D, the set of annotations attached to every
output location in Q1(D) is a subset of the set of annotations
associated with the same location in the output of Q2(D).
32
In a More Concise Notation
Ans(x,y,z) :- R(x,y), S(y’,z), y = y’.
{ x ! 1, y ! 2,
Ans(x,y,z) :- R(x,y), S(y,z).
{ x ! 1, y ! 2,

a
y’ ! 2,
b
z!3}
a b
z!3}
Annotations of values that reside in different source locations but are
bound to the same variable are unioned together.
Ans(y) :- R(x,y).
Ans(y) :- S(y,z).
a b
Ans(2

).
Annotations that belong to the same output location are unioned
together.
33
Containment vs.
annotation-containment
R
A
B
C
1
a
2
3
1
b
4
5
1
8
4
8
9
5
c
d
Q1
Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z).
Ans1
1
a b
Q2
Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z).
5
c
Ans2
Q1  Q2 but…
Q1 a Q2 and Q2 a Q1
1
b
5
c d
34
Translating a CUSTOM pSQL to SQL
custom pSQL query:
SELECT DISTINCT r.A, s.B, s.C
FROM R r, S s
WHERE r.B = s.B
PROPAGATE s.B TO B, s.C TO C, r.B TO B
SQL query:
SELECT DISTINCT *
FROM ( Q1 UNION Q2 ) t
ORDER BY t.A, t.B, t.C
Q1:
SELECT r.A,
s.B,
s.C,
FROM R r, S
WHERE r.B =
NULL,
s.B’,
s.C’
s
s.B
Q2:
SELECT r.A,
s.B,
s.C,
FROM R r, S
WHERE r.B =
NULL,
r.B’,
NULL
s
s.B
35