chicago-prov

Computing Provenance and
Annotations of Derived Data
Wang-Chiew Tan
UC Santa Cruz
Provenance of data
• When you see some data on the Web, do you
know
– where it came from?
– why it is there?
• This information (provenance) is typically lost
in the process of
copying/transcribing/transforming databases
• Loss of provenance is an acute problem in
some scientific databases
Complex interdependencies
(Example from scientific databases)
Various problems:
• Trace provenance of data
• Propagate annotations
[Figure: flow of data among GERD, TRRD, EpoDB, BEAD, Swissprot, GAIA, EMBL, GenBank, DDBJ, and Transfac]
Two kinds of provenance
NYHotels (Source table)
Hotel            Zip    Rating
Waldorf Astoria  10022  4.5
Holiday Inn DT   10013  4.0
NYRestaurants (Source table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022
JOIN, PROJECT
View
Hotel            Rating  Restaurant          Cost
Waldorf Astoria  4.5     Peacock Alley       $$$
Waldorf Astoria  4.5     Bull & Bear         $$$
Waldorf Astoria  4.5     Soho Kitchen & Bar  $
Holiday Inn DT   4.0     Pacifica            $
Why? (Why-provenance)
Where? (Where-provenance)
SDSS - Sloan Digital Sky Server
SELECT Specobj.z, photoobj.g, photoobj.r
FROM Specobj, photoobj
WHERE Specobj.objid = photoobj.objid
  AND Specobj.specclass = 3
  AND Specobj.zconf > .95
Compute provenance
• Question: Suppose a database is created by
a query. Can we compute the why and where
provenance of an element?
• Answer: Computing provenance (both why
and where) is NP-hard in general.
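For one fixed, small query, both kinds of provenance can be recorded while the view is being built. The following is a minimal sketch for the hotel/restaurant view from the earlier slide, not a general algorithm. It assumes that a source tuple is identified by its position in the table, and that a view field's where-provenance is the source location it was copied from; the function name `view_with_provenance` is hypothetical.

```python
from itertools import product

# Source tables from the earlier slide, as lists of dicts.
NY_HOTELS = [
    {"Hotel": "Waldorf Astoria", "Zip": "10022", "Rating": 4.5},
    {"Hotel": "Holiday Inn DT", "Zip": "10013", "Rating": 4.0},
]
NY_RESTAURANTS = [
    {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
]

# Which source relation each view attribute is copied from.
VIEW_ATTRS = {
    "Hotel": "NYHotels",
    "Rating": "NYHotels",
    "Restaurant": "NYRestaurants",
    "Cost": "NYRestaurants",
}


def view_with_provenance():
    """Join on Zip, project, and record provenance for every view row."""
    rows = []
    for h_idx, r_idx in product(range(len(NY_HOTELS)), range(len(NY_RESTAURANTS))):
        hotel, restaurant = NY_HOTELS[h_idx], NY_RESTAURANTS[r_idx]
        if hotel["Zip"] != restaurant["Zip"]:
            continue
        src = {"NYHotels": (h_idx, hotel), "NYRestaurants": (r_idx, restaurant)}
        values = {a: src[rel][1][a] for a, rel in VIEW_ATTRS.items()}
        # Why-provenance: the source tuples that together witness this row.
        # (Assumption: tuples are identified by their index in the table.)
        why = {("NYHotels", h_idx), ("NYRestaurants", r_idx)}
        # Where-provenance: the source location (R, t, A) each field was copied from.
        where = {a: (rel, src[rel][0], a) for a, rel in VIEW_ATTRS.items()}
        rows.append({"values": values, "why": why, "where": where})
    return rows


rows = view_with_provenance()
```

In this view every row has exactly one witness, because the projection keeps the hotel and restaurant names and so never merges two join results.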
Annotations
• Adds value to data
– knowledge sharing: annotations can be read & reviewed by
independent parties
• Annotations are loosely structured
– Annotations on data at various levels of granularity, annotations on
annotations
• Source Data:
– proprietary
– fixed schema
• A system that overlays annotations on existing data
• Useful tool for scientific databases
• Annotations should spread back to the source and forward to
other databases
Propagating annotations
NYRestaurants (Source Table)
Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022
Annotations:
– “Serves fine French Cuisine in elegant setting. Jackets required.”
– “Extensive wine list!”
– “Yummy chicken curry!!”
All Restaurants (View 1)
Restaurant          Cost  Type
Peacock Alley       $$$   French
Bull & Bear         $$$   Seafood
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American
Cheap Restaurants (View 2)
Restaurant          Cost  Type
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American
Location and Propagation Rules
• A location is a triple (R, t, A), where R is a relation name,
t is a tuple in R, and A is an attribute in the schema of R
• Propagation Rules:
– Select
– Project
– Join
– Union
[Figure: one diagram per rule, over relations R, R1, and R2 with attributes A1, A2, A3]
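One way to read the four rules is as "annotations travel with the values they are attached to". The sketch below works under that assumption, since the slide's diagrams are not reproduced here: each field is a (value, annotations) pair, rows with equal values collapse under set semantics and pool their annotations, and a natural join merges the annotations of the shared attributes from both sides. The helper names (`cell`, `select`, `project`, `join`, `union`) and the placement of the sample annotation on Pacifica's Type field are hypothetical.

```python
def cell(value, *notes):
    """A field: its value plus the set of annotations attached to it."""
    return (value, set(notes))


def _merge(rows):
    """Set semantics: rows with equal values collapse and pool their annotations."""
    merged = {}
    for row in rows:
        key = tuple(sorted((a, v) for a, (v, _) in row.items()))
        if key not in merged:
            merged[key] = {a: (v, set(n)) for a, (v, n) in row.items()}
        else:
            for a, (_, n) in row.items():
                merged[key][a][1].update(n)
    return list(merged.values())


def select(rows, pred):
    """Select: a kept tuple keeps the annotations on all of its fields."""
    return _merge(r for r in rows if pred({a: v for a, (v, _) in r.items()}))


def project(rows, attrs):
    """Project: annotations survive only on the attributes that are kept."""
    return _merge({a: r[a] for a in attrs} for r in rows)


def join(left, right):
    """Natural join: shared attributes receive annotations from both sides."""
    out = []
    for l in left:
        for r in right:
            if all(l[a][0] == r[a][0] for a in l.keys() & r.keys()):
                row = {a: (v, set(n)) for a, (v, n) in l.items()}
                for a, (v, n) in r.items():
                    if a in row:
                        row[a][1].update(n)
                    else:
                        row[a] = (v, set(n))
                out.append(row)
    return _merge(out)


def union(r1, r2):
    """Union: a tuple present in both inputs pools annotations from both."""
    return _merge(list(r1) + list(r2))


# Hypothetical placement: the annotation sits on Pacifica's Type field.
restaurants = [
    {"Restaurant": cell("Peacock Alley"), "Cost": cell("$$$"),
     "Type": cell("French"), "Zip": cell("10022")},
    {"Restaurant": cell("Pacifica"), "Cost": cell("$"),
     "Type": cell("Chinese", "Yummy chicken curry!!"), "Zip": cell("10013")},
    {"Restaurant": cell("Soho Kitchen & Bar"), "Cost": cell("$"),
     "Type": cell("American"), "Zip": cell("10022")},
]
cheap = project(select(restaurants, lambda t: t["Cost"] == "$"),
                ["Restaurant", "Cost", "Type"])
```

Here `cheap` plays the role of the Cheap Restaurants view: the annotation on Pacifica's Type field is carried through the selection and the projection, while anything attached to a Zip field would be dropped.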
Computing annotation propagation
Model: Source (relational database) → Query → View (result of the
query applied to the source)
• Question: Suppose a database is created by a query
over some source data. Can we compute how to
propagate an annotation on a data element back to
the source with minimum side effects?
• Answer: Computing the minimum side-effect
annotation is NP-hard in general
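To make "side effect" concrete, the sketch below does a brute-force search on a tiny example. It does not address the hardness of the general problem. The relations R(A, B) and S(B, C), their contents, and the view V = project[A, C](R join S) are all hypothetical, as are the function names. A side effect is counted as any view location other than the target that also receives the annotation.

```python
# Hypothetical source relations R(A, B) and S(B, C).
R = [{"A": "a", "B": "b1"}, {"A": "a", "B": "b2"}]
S = [{"B": "b1", "C": "c"}, {"B": "b2", "C": "c"}, {"B": "b2", "C": "d"}]


def reach_map(r_rows, s_rows):
    """For V = project[A, C](R join S), map each source location (R, t, A)
    to the set of view locations it propagates to."""
    reach = {}
    for i, r in enumerate(r_rows):
        for j, s in enumerate(s_rows):
            if r["B"] != s["B"]:
                continue
            view_row = (r["A"], s["C"])
            reach.setdefault(("R", i, "A"), set()).add((view_row, "A"))
            reach.setdefault(("S", j, "C"), set()).add((view_row, "C"))
    return reach


def best_placement(reach, target):
    """Try every source location that reaches the target view location and
    keep one with the fewest side effects (ties broken by location order)."""
    candidates = {
        src: dests - {target} for src, dests in reach.items() if target in dests
    }
    if not candidates:
        return None
    src = min(candidates, key=lambda s: (len(candidates[s]), s))
    return src, candidates[src]


reach = reach_map(R, S)
# Annotate the A field of the view row ("a", "c").
placement = best_placement(reach, (("a", "c"), "A"))
```

In this example the view row ("a", "c") can be reached from either tuple of R. Annotating the second tuple would also annotate the row ("a", "d"), so the search prefers the first tuple, which has no side effects.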
Related Work on Annotations
(not exhaustive!)
• Superimposed Information (D. Maier, L. Delcambre [WebDB’99])
– data “placed over” existing information
e.g. bookmark files, schema of a database
• Annotation Systems
– Annotea (W3C)
• annotate web pages
– Multivalent Browser (R. Wilensky, T. A. Phelps, UC Berkeley DL Project)
• annotate PDF files, HTML, etc.
– BioDAS (Distributed Annotation System) (L. Stein et al.)
• annotate genome sequences
• No one has formally studied the annotation placement problem
Provenance and Annotations
• Where-provenance & annotation placement
– Where should the annotation be placed in the source in order
to propagate the annotation to view data d?
– Annotate the source data at one of the source locations in the
where-provenance of d
• Provenance & Archiving
– Trace a piece of data to its correct source version
• Why-provenance & view deletion
– Which source data should be deleted in order to delete view
data d?
– A combination of source data that together “disables” every
witness for d
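The view-deletion idea can be illustrated with a brute-force search over sets of source tuples. This is a sketch under stated assumptions: each witness is given as a set of source tuple identifiers, the witnesses shown are hypothetical, and choosing a smallest disabling set is an objective picked here for illustration (the slide only requires that every witness be disabled).

```python
from itertools import combinations


def smallest_deletion(witnesses):
    """Return a smallest set of source tuples that intersects every witness,
    or None if some witness cannot be disabled (it is empty)."""
    universe = sorted(set().union(*witnesses))
    for size in range(len(universe) + 1):
        for combo in combinations(universe, size):
            chosen = set(combo)
            if all(chosen & w for w in witnesses):
                return chosen
    return None


# Hypothetical view tuple d with two witnesses, each a pair of source tuples.
witnesses_of_d = [
    {("R", 0), ("S", 0)},
    {("R", 1), ("S", 1)},
]
to_delete = smallest_deletion(witnesses_of_d)
```

Because the two witnesses share no source tuple, at least one tuple from each has to go, so any answer contains two tuples. The search is exponential in the number of source tuples involved and is only meant for small examples.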
How do we attach annotations to data?
• Relational tables: Identify a particular attribute A
of a particular tuple t of a particular relation R:
(R, t, A)
• Tree-like data: Need a canonical path to the
data element
Lots more to do!
• Further study on provenance for queries that involve
negation, aggregates
select sum(sal)
from Employee
where sal > 50K
• Handle “irregular” annotations and annotations on tree-like data.
• How about databases which are manually constructed
and annotated?
– Organize data with keys
• Use of constraints and special cases to derive efficient
algorithms for propagating annotations back
• Language specific issues
Inconsistencies in
“annotation-aware” language(s)
• The same query in different languages, but different annotation behavior
Emp
Name  Sal  Dept
Joe   50K  Marketing
Department
Dept       Manager
Marketing  Jane
Relational Algebra:
Emp JOIN Department
[Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
SQL:
SELECT e.Name, e.Sal, e.Dept, d.Manager
FROM Emp e, Department d
WHERE e.Dept = d.Dept
[Name:"Joe", Sal:50K, Dept:"Marketing", Manager:"Jane"]
• Equivalent queries in the same language, but different annotation behavior
Q1 = SELECT e.Name, e.Sal
     FROM Emp e
     WHERE e.Sal = '50K'
[Name:"Joe", Sal:50K]
Q2 = SELECT e.Name, '50K' AS Sal
     FROM Emp e
     WHERE e.Sal = '50K'
[Name:"Joe", Sal:50K]
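The difference between Q1 and Q2 can be sketched as follows, assuming a copy-based reading in which an annotation follows a value only when the query copies that value from the source. Under that assumption Q1 copies e.Sal and keeps its annotation, while Q2 writes a constant and loses it. The annotation text and the function names are hypothetical.

```python
# Each field is (value, annotations). The annotation on Sal is hypothetical.
EMP = [
    {"Name": ("Joe", set()), "Sal": ("50K", {"salary verified"}),
     "Dept": ("Marketing", set())},
]


def q1(emp):
    """SELECT e.Name, e.Sal FROM Emp e WHERE e.Sal = '50K'
    Sal is copied from the source, so its annotations come along."""
    return [{"Name": e["Name"], "Sal": e["Sal"]}
            for e in emp if e["Sal"][0] == "50K"]


def q2(emp):
    """SELECT e.Name, '50K' AS Sal FROM Emp e WHERE e.Sal = '50K'
    Sal is a constant in the query, so it carries no source annotations."""
    return [{"Name": e["Name"], "Sal": ("50K", set())}
            for e in emp if e["Sal"][0] == "50K"]


def values(rows):
    """Strip annotations, leaving only the plain query answer."""
    return [{a: v for a, (v, _) in row.items()} for row in rows]


same_answer = values(q1(EMP)) == values(q2(EMP))
same_annotations = q1(EMP) == q2(EMP)
```

With this data `same_answer` is true and `same_annotations` is false: the two queries return the same values but differ in what they carry on the Sal field.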
Do we need an “annotation-aware” QL?
• Relational algebra suggests a natural set of propagation rules
• SQL suggests another natural propagation rule
– based on variable bindings
• Question: Can we extend/design the query language(s) so
that
– Equivalent queries have the same annotation behavior
– Translation of a query from one language (e.g. SQL) into another
(e.g. relational algebra) yields the same annotation behavior
• Perhaps a more fundamental question...
– Should a query language be “annotation-aware” ?
– Perhaps we should have language constructs to allow the user to
explicitly control annotation propagation?
End