Update Exchange with Mappings and Provenance

Download Report

Transcript Update Exchange with Mappings and Provenance

Update Exchange with
Mappings and Provenance
Todd J. Green Grigoris Karvounarakis
Zachary G. Ives Val Tannen
University of Pennsylvania
VLDB 2007 Vienna, Austria
September 26, 2007
Adoption of data integration tools
• Structured information is pervasive in the
Internet age, as is the need to access and
integrate it…
– Need to collect, transform, aggregate information
– Need to import related data into an existing
database instance
• … but after many years of research, few users
of data integration tools
• Why?
2
Not because the problem is too hard!
• People are doing it anyway! (Just without help
from DB research)
– e.g., bioinformatics
• Ad-hoc solutions (Perl scripts) developed for
specific domains
– e.g., at Penn, a large staff of programmers maintains
the Genomics Unified Schema (GUS)
• Point-to-point exchange between peers /
collaborating sites
• To be adopted, data integration tools need to
offer significant additional value...
3
Needs unmet by data integration tools
• Previous data integration tools do not offer:
– Complete local control of data
• Decide which data is import / integrated
• Ability to modify any data, even data from elsewhere!
– Support for different points of view
• Disagreements about data, mappings, schemas...
• Which sources are trusted / distrusted
– Tracking of data provenance
– Support for incremental updates
• Changes to data, mappings, schemas...
• Our system, ORCHESTRA, addresses these needs
4
Requirements for ORCHESTRA, a Collaborative
Data Sharing System (CDSS) [Ives+05]
• Give peers full control using local instance
∆B+/−
• Support different needs / perspectives
• Relate peers by mappings and trust
policies
Peer B
• Support update exchange
• Maintain data provenance
∆C+/−
Peer C
Peer A
DBMS
Queries, edits
∆A+/−
PUBLISH
5
How ORCHESTRA addresses
CDSS requirements
From one peer’s perspective:
+
Updates from
other peers
∆Pother
m
σ
r
Translate
Apply trust
through
policies
Resolve
mappings with
using
conflicts
provenance provenance
1
2
∆Pc
Local
curation
∆Pf
−
Apply final
updates to
peer
Produce
candidate
updates
[TaylorIves06]
3
Contributions of this paper
6
Roadmap
• Update exchange in a CDSS:
– Schema mappings
– Tracking of data provenance
– Incremental propagation of updates
– Provenance-based trust policies
– Local curation via insertions / deletions
• Prototype implementation
• Experimental evaluation
7
Mappings and updates
• CDSS setting: set of peers; set of declarative mappings (tgds)
m1
B
m3
∙
G
m2
(𝑚1 ) 𝐺 𝑖, 𝑐, 𝑛 → 𝐵 𝑖, 𝑛
(𝑚2 ) 𝐺 𝑖, 𝑐, 𝑛 → 𝑈(𝑛, 𝑐)
(𝑚3 ) 𝐵 𝑖, 𝑐 ∧ 𝑈 𝑛, 𝑐 → 𝐵 𝑖, 𝑛
U
• Given: setting, base data, updates
• Goal: local instance at each peer cf. data exchange paradigm
[Fagin+03]
– Universal solution yields the “certain answers” to queries
– Can be computed using the chase
• Our contribution: how to do it incrementally, with provenance...
8
Incremental insertion
+
+
m2
B
G
(3, 5, 2)
m1
(1, 3, 3)
m1
(3, 5)
(3, 2)
(1, 3)
m3
m3
U
(2, 5)
(3, 3)
+
m2
(𝑚1 ) 𝐺 𝑖, 𝑐,` 𝑛 → 𝐵 𝑖, 𝑛
(𝑚2 ) 𝐺 𝑖, 𝑐, 𝑛 → 𝑈(𝑛, 𝑐)
(𝑚3 ) 𝐵 𝑖, 𝑐 ∧ 𝑈 𝑛, 𝑐 → 𝐵 𝑖, 𝑛
+
This graph represents the
provenance information
that ORCHESTRA maintains
9
Incremental deletion
+
+
+
m2
B
G
(3, 5, 2)
m1
(1, 3, 3)
m1
(3, 5)
(3, 2)
(1, 3)
+
• Step 1: Use provenance graph to find
derived tuples which can also be deleted
• Step 2: Test other affected tuples for
derivability, and delete any not derivable
• Step 3: Repeat
m3
m3
m2
U
(2, 5)
(3, 3)
+
10
Other approaches to
incremental deletion
• Many strategies (both research and commercial) for
incremental deletions...
• ... but we have to support recursion
– Mappings can have cycles
– Count-based algorithms don’t work (infinite counts)
• Incremental maintenance for recursive datalog
programs – DRed [GuptaMumick95]
– DRed (“delete and re-derive”) computes superset of
deletions, then corrects if needed
– We use provenance to compute exact set of deletions
11
Trust policies (not every update
should be propagated)
• Updates can be filtered automatically based on
provenance and content
– “Peer A distrusts any tuple U(i,n) if the data came
from Peer B and n ≥ 3, and trusts any tuple from
Peer C”
– “Peer A distrusts any tuple U(i,n) that came from
mapping m4 if n ≠ 2”
• Local curation: user can also manually
accept/reject updates, or introduce new ones...
12
Local curations
• Extra tables for local insertions and deletions:
+
(Mappings,
trust policies,
etc.)
Local
curation
∆Pc
Candidate
updates
−
∆Pf
Final
updates
• Contribution: conforms to data exchange paradigm by using
internal mappings with local insertions/deletions:
(𝑚2 ) 𝐺 𝑥, 𝑦, 𝑧 → 𝑈(𝑥, 𝑦)
↦
(𝑚2′ )
𝐺 𝑓 𝑥, 𝑦, 𝑧 → 𝑈 𝑐 (𝑥, 𝑦)
(𝑚𝑈 + ) 𝑈 + 𝑥, 𝑦 → 𝑈𝑓 (𝑥, 𝑦)
(𝑚𝑈 − ) 𝑈 𝑐 𝑥, 𝑦 ∧ ¬𝑈 − 𝑥, 𝑦 → 𝑈𝑓 (𝑥, 𝑦)
13
Prototype implementation
•
•
•
•
Middleware layer on top of relational DBMS
Mappings converted to datalog rules (as in Clio)
Separate tables for provenance info
Engine option 1: based on commercial DBMS (DB2)
– Datalog fixpoints in Java and SQL (only linear recursion in DB2)
– Labeled nulls supported via encoding scheme
• Engine option 2: using in-house query engine (Tukwila)
– BerkeleyDB for auxiliary storage and indexes
– Custom operators for fixpoints, built-in labeled nulls
• 30,000 lines of Java and C++ code
14
Experimental evaluation
• DB2-based and Tukwila-based implementations
• Workload typical of bioinformatics setting (at most 10s of
peers, GBs of data)
• Synthetic update workload sampled from SWISS-PROT
biological data set
– Randomly-generated schemas and mappings
• Dual Xeon 5150 server, 8 GB RAM (2 GB for DB)
• Variables: number of peers, complexity of mappings, volume
of data, type of data, size of updates
• Measured: time to join system, time to propagate updates,
size of updated database
15
Time to propagate deletions (sec)
Incremental deletion algorithm yields
significant speedup
DRed
Incremental
Non-incremental
Parameters: 5 peers, full acyclic mappings, string data, 1 GB database
16
Time to propagate insertions (sec)
Time (sec)
System scales to realistic #s of peers
80
1% insertions (DB2)
10% insertions (DB2)
10% insertions (DB2)
60
40
1% insertions (Tukwila)
1% insertions (DB2)
10% insertions (Tukwila)
10% insertions (Tukwila)
20
1% insertions (Tukwila)
0
2
5
10
20
Number of peers
Parameters: full acyclic mappings, integer data, up to 1 GB database
17
Contributions
• Orchestra innovatively performs update
exchange (not just mediated/federated query
answering)
• Tracks data provenance across a network of
schema mappings
• Supports provenance-based trust policies
• Features algorithms for incremental
propagation of updates
• Solutions have been validated by experimental
prototype for typical bioinformatics settings
18
Related work
• Peer data management systems Piazza
[Halevy+03, 04], Hyperion [Kementsietsidis+04],
[Bernstein+02], [Calvanese+04], ...
• Data exchange [Haas+99, Miller+00, Popa+02,
Fagin+03], peer data exchange [Fuxman+05]
• Provenance / lineage [CuiWidom01],
[Buneman+01], Trio [Widom+05], Spider
[ChiticariuTan06], [Green+07], ...
• Incremental maintenance [GuptaMumick95], …
19
CDSS as a research platform:
promising future directions
• Ranking-based trust with provenance
– Numeric weights and “accumulation of evidence”
• More expressive mappings
– e.g., “looking inside” attributes using regular expressions
• Compact representations of provenance
• Mixing virtual and materialized peers
– Related to view selection problem
• Supporting key dependencies / egds
– Deletion propagation becomes challenging
• Incorporating probabilistic mappings / data
20
Ongoing work at Penn
• Deploying ORCHESTRA in the real world
– Pilot project with Penn Center for Bioinformatics
• Bidirectional mappings
– Propagating updates in both directions
• Mapping evolution problem
– Handling updates to mappings (not just data)
• Fully distributed implementation
– Using P2P database engine
21
Bioinformatics mappings example
(𝑚1 )
(𝑚2 )
(𝑚3 )
(𝑚4 )
𝐺
𝐺
𝐺
𝐵
𝑖, 𝑐, 𝑛 → 𝐵 𝑖, 𝑛
𝑖, 𝑐, 𝑛 → 𝑈(𝑛, 𝑐)
𝑖, 𝑐, 𝑛 → ∃𝑐 𝑈 𝑛, 𝑐
𝑖, 𝑐 ∧ 𝑈 𝑛, 𝑐 → 𝐵 𝑖, 𝑛
23
Delta rules for insertions
• As in DRed [GuptaMumick95]:
(𝑚4 )
𝐵 𝑖, 𝑐 ∧ 𝑈 𝑛, 𝑐 → 𝐵 𝑖, 𝑛
(𝑚4 ′)
(𝑚4 ′′)
𝐵+(𝑖, 𝑐) ∧ 𝑈 𝑛, 𝑐 → 𝐵+ 𝑖, 𝑛
𝐵𝜈 𝑖, 𝑐 ∧ 𝑈 + 𝑛, 𝑐 → 𝐵+ 𝑖, 𝑛
(𝑚𝐵 + )
(𝑚𝐵 )
𝐵+ 𝑖, 𝑐 → 𝐵𝜈 𝑖, 𝑐
𝐵 𝑖, 𝑐 → 𝐵𝜈 𝑖, 𝑐
24