Measuring Referential Integrity in Distributed Databases
Dhara Shah
Introduction
Distributed database: multiple databases
residing at different locations that
communicate over the Internet.
Referential integrity can be violated when similar
content arrives from different sources.
Goal: Identify referential integrity problems to
detect and avoid inconsistency or
incompleteness.
A promising alternative for detecting and fixing data
quality issues in scientific databases.
Assumptions
Same tables but different content.
Rows may have null values for primary key.
Metadata has been integrated before.
Content may be inconsistent due to both local
and global issues.
Updates are broadcast
independently and asynchronously.
Column Metrics
Metrics are measured on a scale of [0, 1], with 1
being optimal.
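$\mathrm{lrcom}(T_i.K) = |T_i \bowtie_{K} T_j| \,/\, |T_i|$
$\mathrm{grcom}(T_i.K) = |T_i \bowtie_{K} T_j| \,/\, |T_i|$
$\mathrm{lrcon}(T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$
$\mathrm{grcon}(T_i.K, T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$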
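The column metrics above can be illustrated with a small Python sketch. It assumes that K in the referenced table Tj is a unique, non-null key (so the join size equals the number of matching rows in Ti) and that F is a column copied from Tj into Ti. The function name `join_fraction` and the toy tables are hypothetical, not taken from the paper.

```python
def join_fraction(referencing, referenced, columns):
    """Fraction of referencing rows that match a referenced row on all columns.

    Assumes the referenced key is unique and non-null, so counting matching
    rows gives the same result as counting the rows of the join.
    """
    if not referencing:
        return 1.0  # assumption: an empty table has no violations
    keys = {tuple(row[c] for c in columns) for row in referenced}
    matched = sum(
        1 for row in referencing
        if tuple(row[c] for c in columns) in keys
    )
    return matched / len(referencing)


# Hypothetical toy data: Tj (referenced) and Ti (referencing).
cities = [{"K": 1, "F": "Houston"}, {"K": 2, "F": "Austin"}]
orders = [
    {"K": 1, "F": "Houston"},  # key and copied value both agree
    {"K": 2, "F": "Dallas"},   # key resolves, copied value disagrees
    {"K": 3, "F": "Austin"},   # dangling reference
    {"K": None, "F": None},    # null foreign key
]

rcom = join_fraction(orders, cities, ["K"])       # completeness on K
rcon = join_fraction(orders, cities, ["K", "F"])  # consistency on K, F
print(rcom, rcon)  # 0.5 0.25
```

Two of the four rows resolve on K, and only one of them also agrees on F, so consistency can never exceed completeness under these assumptions.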
Table Metrics
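$\mathrm{gcur}(T_i) = |D_1.T_i \cap D_2.T_i \cap \cdots \cap D_n.T_i| \,/\, |D_1.T_i \cup D_2.T_i \cup \cdots \cup D_n.T_i|$
$\mathrm{grcom}(T_i) = \sum_{j=1}^{k} |T_i|\,\mathrm{grcom}(T_i.K_j) \,/\, (k\,|T_i|)$
$\mathrm{grcon}(T_i) = \sum_{j=1}^{f} |T_i|\,\mathrm{grcon}(T_i.F_j) \,/\, (f\,|T_i|)$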
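As a rough illustration of the table metrics, the sketch below computes gcur as the number of rows common to every copy of a table divided by the number of rows in the union of all copies. It also shows that, because |Ti| appears in both the numerator and the denominator, the table-level completeness and consistency reduce to the plain mean of the column metrics. Representing rows as tuples in Python sets, and the names `gcur` and `table_metric`, are assumptions made for this example.

```python
def gcur(copies):
    """copies: one set of rows (tuples) per database holding the table."""
    union = set().union(*copies)
    if not union:
        return 1.0  # assumption: empty copies are trivially identical
    common = set.intersection(*copies)
    return len(common) / len(union)


def table_metric(column_metrics):
    """Mean of the per-column metrics; |Ti| cancels in the slide's formula."""
    return sum(column_metrics) / len(column_metrics)


# Hypothetical copies of one table at three sites.
d1 = {(1, "a"), (2, "b"), (3, "c")}
d2 = {(1, "a"), (2, "b")}
d3 = {(1, "a"), (2, "x")}

print(gcur([d1, d2, d3]))        # 0.25: one shared row out of four distinct rows
print(table_metric([0.5, 1.0]))  # 0.75
```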
Database Metrics
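$\mathrm{lrcom}(D_i) = \sum_{j=1}^{m} |T_j|\,\mathrm{lrcom}(T_j) \,/\, \sum_{j=1}^{m} |T_j|$
$\mathrm{lrcon}(D_i) = \sum_{j=1}^{m} |T_j|\,\mathrm{lrcon}(T_j) \,/\, \sum_{j=1}^{m} |T_j|$
$\mathrm{grcom}(D) = \sum_{j=1}^{m} |T_j|\,\mathrm{grcom}(T_j) \,/\, \sum_{j=1}^{m} |T_j|$
$\mathrm{grcon}(D) = \sum_{j=1}^{m} |T_j|\,\mathrm{grcon}(T_j) \,/\, \sum_{j=1}^{m} |T_j|$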
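The database metrics are averages of the table metrics weighted by table size, so large tables dominate the score. A minimal sketch, with a hypothetical helper name and made-up row counts:

```python
def database_metric(tables):
    """tables: list of (row_count, table_metric) pairs for one database."""
    total_rows = sum(rows for rows, _ in tables)
    if total_rows == 0:
        return 1.0  # assumption: an empty database has no violations
    return sum(rows * metric for rows, metric in tables) / total_rows


# Hypothetical example: a small clean table and a large dirty one.
score = database_metric([(100, 0.75), (300, 0.25)])
print(score)  # 0.375, pulled toward the larger table's 0.25
```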
Query Optimization
Local metrics in a single database
Aggregate by grouping on each FK before joining, for tables
with several FKs.
Create a secondary index on each FK.
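The two local optimizations can be sketched with Python's built-in `sqlite3` module. The schema (`city`, `sale`), the index name, and the data are hypothetical, and the query is only one possible way to aggregate by the FK first so that the join touches one row per distinct FK value instead of every referencing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, city_id INTEGER);
-- Secondary index on the foreign key column.
CREATE INDEX sale_city_idx ON sale(city_id);
INSERT INTO city VALUES (1, 'Houston'), (2, 'Austin');
INSERT INTO sale VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, NULL);
""")

# Group by the FK first, then join the (smaller) grouped result to the
# referenced table. Unmatched groups, including NULL, contribute 0.
query = """
SELECT SUM(CASE WHEN c.city_id IS NOT NULL THEN g.cnt ELSE 0 END) * 1.0
       / SUM(g.cnt)
FROM (SELECT city_id, COUNT(*) AS cnt FROM sale GROUP BY city_id) AS g
LEFT JOIN city AS c ON c.city_id = g.city_id
"""
(completeness,) = conn.execute(query).fetchone()
print(completeness)  # 0.6: three of five sale rows reference an existing city
```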
Global metrics in distributed database
Transfer n-1 copies to a central site
Compute metrics at one site and then
update them incrementally
Compute metrics for each pair of tables linked by
an FK
The smaller table is transferred when a join is required
between two tables at different sites
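The last rule, shipping the smaller table when a join spans two sites, can be sketched as a simple planning function. Using row count as a stand-in for transfer cost is an assumption of this example, as are the function name and the site and table names.

```python
def plan_transfer(table_a, table_b):
    """Decide which table to ship for a cross-site join.

    Each argument is a (site, table_name, row_count) tuple.
    Returns (table_to_ship, destination_site). Row count is used as a
    proxy for transfer cost, which is an assumption of this sketch.
    """
    smaller, larger = sorted([table_a, table_b], key=lambda t: t[2])
    if smaller[0] == larger[0]:
        return None, larger[0]  # same site: nothing needs to move
    return smaller[1], larger[0]


# Hypothetical example: the referenced table is much smaller.
plan = plan_transfer(("site1", "sale", 1_000_000), ("site2", "city", 5_000))
print(plan)  # ('city', 'site1')
```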
Applications
Applications with Scientific Databases
Central database: needs a fast connection and
should be available at all times
Local database: flexible and faster, but may have
more referential errors
Program:
Uses a logical data model (LDM) to calculate
metrics.
Has a graphical user interface with a list that explains
why errors happened.
Conclusion
Related work:
MOCHA: middleware system to integrate
distributed data sources.
Metrics that measure absolute and relative
error with respect to referential integrity.
Measures completeness and consistency.
Raises new issues such as distributed query
optimization.
Citation
Authors: Carlos Ordonez, Javier Garcia-Garcia, Zhibo Chen
Title: Measuring Referential Integrity in Distributed Databases
Name of Journal: CIMS '07 Proceedings of the ACM first
workshop on CyberInfrastructure: information management in
eScience
Publication Date: November 2007
Page Range: 61-66