Measuring Referential Integrity in Distributed Databases

Dhara Shah
Introduction
- Distributed database: multiple databases residing at different locations that communicate over the Internet.
- Referential integrity can be violated when similar content comes from different sources.
- Goal: identify referential integrity problems in order to detect and avoid inconsistency or incompleteness.
- A promising alternative for detecting and fixing data quality issues in scientific databases.
Assumptions
- Sites hold the same tables but with different content.
- Rows may have null values for the primary key.
- Metadata has already been integrated.
- Content may be inconsistent due to both local and global issues.
- Updates are broadcast independently and asynchronously.
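Column Metrics
- Metrics are measured on a scale of [0, 1], with 1 being optimal.
- $\mathrm{lrcom}(T_i.K) = |T_i \bowtie_K T_j| \,/\, |T_i|$
- $\mathrm{grcom}(T_i.K) = |T_i \bowtie_K T_j| \,/\, |T_i|$
- $\mathrm{lrcon}(T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$
- $\mathrm{grcon}(T_i.K, T_i.F) = |T_i \bowtie_{K,F} T_j| \,/\, |T_i|$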
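As a rough illustration of the column metrics, the sketch below reads each one as a join count divided by the size of the referencing table. It assumes the operator lost from the formulas is a join on K (or on K and F) and that K is unique in the referenced table. The tables `emp` and `city`, their columns, and the helper `column_metrics` are hypothetical names chosen for this example, not taken from the paper.

```python
import sqlite3

def column_metrics(conn, ti, tj, k, f):
    """Return (completeness, consistency) of ti.k and ti.f with respect to tj.

    Assumes k is unique in tj, so both join counts are at most |ti|.
    Table and column names are interpolated directly: trusted input only.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {ti}").fetchone()[0]
    if total == 0:
        return 1.0, 1.0  # convention chosen here for an empty table
    on_k = conn.execute(
        f"SELECT COUNT(*) FROM {ti} a JOIN {tj} b ON a.{k} = b.{k}"
    ).fetchone()[0]
    on_kf = conn.execute(
        f"SELECT COUNT(*) FROM {ti} a JOIN {tj} b "
        f"ON a.{k} = b.{k} AND a.{f} = b.{f}"
    ).fetchone()[0]
    return on_k / total, on_kf / total

# Hypothetical sample data: city is referenced, emp is referencing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (city_id INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, city_id INTEGER, city_name TEXT);
    INSERT INTO city VALUES (1, 'Houston'), (2, 'Austin');
    INSERT INTO emp VALUES
        (10, 1, 'Houston'),  -- valid key, matching copied column
        (11, 2, 'Dallas'),   -- valid key, inconsistent copied column
        (12, 3, 'Austin'),   -- dangling key
        (13, NULL, NULL);    -- null key
""")
completeness, consistency = column_metrics(conn, "emp", "city", "city_id", "city_name")
print(completeness, consistency)
```

With this sample data, two of the four `emp` rows join on the key and only one also agrees on the copied column, so both values fall below the optimum of 1.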
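Table Metrics
- $\mathrm{gcur}(T_i) = \dfrac{|D_1.T_i \cap D_2.T_i \cap \cdots \cap D_n.T_i|}{|D_1.T_i \cup D_2.T_i \cup \cdots \cup D_n.T_i|}$
- $\mathrm{grcom}(T_i) = \dfrac{\sum_{j=1}^{k} |T_i| \, \mathrm{grcom}(T_i.K_j)}{k \, |T_i|}$
- $\mathrm{grcon}(T_i) = \dfrac{\sum_{j=1}^{f} |T_i| \, \mathrm{grcon}(T_i.F_j)}{f \, |T_i|}$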
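The table metrics above can be sketched in a few lines, assuming each copy of a table is available as a set of rows and the per-column scores have already been computed. The function names `gcur` and `table_metric` and the sample rows are illustrative assumptions; the value returned for empty inputs is a convention chosen here, not something the slides specify.

```python
def gcur(copies):
    """Rows present in every copy divided by rows present in at least one copy."""
    copies = [set(c) for c in copies]
    union = set().union(*copies)
    if not union:
        return 1.0  # assumed convention: no rows anywhere counts as fully current
    return len(set.intersection(*copies)) / len(union)

def table_metric(column_scores, n_rows):
    """Combine per-column scores as in the slide formula:
    sum of |Ti| * score over the columns, divided by (number of columns * |Ti|)."""
    k = len(column_scores)
    if k == 0 or n_rows == 0:
        return 1.0  # assumed convention for a table with no keys or no rows
    return sum(n_rows * s for s in column_scores) / (k * n_rows)

# Hypothetical copies of one table held at three sites.
d1 = {(1, "a"), (2, "b"), (3, "c")}
d2 = {(2, "b"), (3, "c"), (4, "d")}
d3 = {(2, "b"), (3, "c")}
currency = gcur([d1, d2, d3])

# Two foreign keys with column scores 0.5 and 1.0 in a 4-row table.
table_score = table_metric([0.5, 1.0], 4)
print(currency, table_score)
```

Because every column term is weighted by the same |Ti|, the table formula as transcribed reduces to the plain mean of the column scores.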
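Database Metrics
- $\mathrm{lrcom}(D_i) = \dfrac{\sum_{j=1}^{m} |T_j| \, \mathrm{lrcom}(T_j)}{\sum_{j} |T_j|}$
- $\mathrm{lrcon}(D_i) = \dfrac{\sum_{j=1}^{m} |T_j| \, \mathrm{lrcon}(T_j)}{\sum_{j} |T_j|}$
- $\mathrm{grcom}(D) = \dfrac{\sum_{j=1}^{m} |T_j| \, \mathrm{grcom}(T_j)}{\sum_{j} |T_j|}$
- $\mathrm{grcon}(D) = \dfrac{\sum_{j=1}^{m} |T_j| \, \mathrm{grcon}(T_j)}{\sum_{j} |T_j|}$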
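Each database metric is an average of table scores weighted by table size. A minimal sketch, assuming the table scores and row counts are already known; `database_metric` and the sample numbers are hypothetical.

```python
def database_metric(tables):
    """tables: iterable of (row_count, table_score) pairs.

    Returns the row-count-weighted average of the table scores.
    """
    tables = list(tables)
    total_rows = sum(n for n, _ in tables)
    if total_rows == 0:
        return 1.0  # assumed convention for an empty database
    return sum(n * score for n, score in tables) / total_rows

# A 4-row table scoring 0.75 and a 12-row table scoring 0.5.
db_score = database_metric([(4, 0.75), (12, 0.5)])
print(db_score)
```

The larger table dominates the result, which is the point of weighting by |Tj|: a small, clean table cannot mask errors in a large one.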
Query Optimization
- Local metrics in a single database
  - For tables with several FKs, aggregate by grouping on each FK before performing the joins.
  - Create a secondary index on each FK.
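The two local optimizations can be illustrated with SQLite. This is a sketch under assumptions: the `emp` and `dept` tables are hypothetical, only one FK is shown for brevity, and the actual speedup depends on the DBMS and the data. It checks that grouping by the FK first and then joining yields the same matched-row count as joining every row directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dept_id INTEGER PRIMARY KEY);
    CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, dept_id INTEGER);
    INSERT INTO dept VALUES (1), (2);
    INSERT INTO emp VALUES (10, 1), (11, 1), (12, 2), (13, 3), (14, NULL);
    -- Secondary index on the foreign key column
    CREATE INDEX idx_emp_dept ON emp (dept_id);
""")

# Baseline: join every referencing row to the referenced table.
direct = conn.execute(
    "SELECT COUNT(*) FROM emp e JOIN dept d ON e.dept_id = d.dept_id"
).fetchone()[0]

# Aggregate first: one row per distinct FK value, then join the small result.
pre_aggregated = conn.execute("""
    SELECT COALESCE(SUM(g.cnt), 0)
    FROM (SELECT dept_id, COUNT(*) AS cnt FROM emp GROUP BY dept_id) g
    JOIN dept d ON g.dept_id = d.dept_id
""").fetchone()[0]

total = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
print(direct, pre_aggregated, pre_aggregated / total)
```

The pre-aggregated form joins one row per distinct FK value instead of one row per referencing row, which matters when a FK column has few distinct values relative to the table size.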
- Global metrics in a distributed database
  - Transfer n-1 copies to a central site.
  - Compute metrics at one site and then update them incrementally.
  - Compute metrics for each pair of tables linked by a FK.
  - When a join involves two tables at different sites, transfer the smaller table.
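The rule of shipping the smaller table for a cross-site join can be sketched as follows. Sites are modeled as plain dictionaries, and `table_to_ship`, `completeness`, and the `orders`/`customers` data are hypothetical; a real system would move rows over the network and would likely also weigh row width, not just row count.

```python
def table_to_ship(row_counts):
    """Return the name of the table with the fewest rows."""
    return min(row_counts, key=row_counts.get)

def completeness(fk_values, referenced_keys):
    """Fraction of referencing rows whose FK matches a referenced key."""
    if not fk_values:
        return 1.0  # assumed convention for an empty referencing table
    keys = set(referenced_keys)
    matched = sum(1 for v in fk_values if v is not None and v in keys)
    return matched / len(fk_values)

# Hypothetical sites: orders (FK values only) at site 1, customer keys at site 2.
site_1 = {"orders": [1, 2, 2, 5, None]}
site_2 = {"customers": [1, 2, 3]}

row_counts = {
    "orders": len(site_1["orders"]),
    "customers": len(site_2["customers"]),
}
shipped = table_to_ship(row_counts)

# Copy the smaller table to the other site and evaluate the metric there.
if shipped == "customers":
    site_1["customers"] = list(site_2["customers"])
    join_site = site_1
else:
    site_2["orders"] = list(site_1["orders"])
    join_site = site_2

score = completeness(join_site["orders"], join_site["customers"])
print(shipped, score)
```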
Applications
- Applications with scientific databases
  - Central database: needs a fast connection and must be available at all times.
  - Local database: flexible and faster, but may have more referential errors.
- Program:
  - Uses a logical data model (LDM) to calculate metrics.
  - Has a graphical user interface with a list that explains why errors happened.
Conclusion
- Related work:
  - MOCHA: a middleware system for integrating distributed data sources.
- The metrics measure absolute and relative error with respect to referential integrity.
- They measure completeness and consistency.
- The approach raises new issues such as distributed query optimization.
Citation

Authors: Carlos Ordonez, Javier Garcia-Garcia, Zhibo Chen

Title: Measuring Referential Integrity in Distributed Databases

Name of Journal: CIMS '07: Proceedings of the ACM First Workshop on CyberInfrastructure: Information Management in eScience

Publication Date: November 2007

Page Range: 61-66