Schema Summary A

Download Report

Transcript Schema Summary A

Schema Summarization
cong Yu
Department of EECS
University of Michigan
[email protected]
H. V. Jagadish
Department of EECS
University of Michigan
[email protected]
VLDB 2006
Example :ER Schema
Example: Hierarchical and
Relational schema
Schema
element
structural
link
SetOf type
attribute
value links
Introduction
Endeavors and contributions in this paper

Consider both schema structure and data distribution

define a schema summary.

introduce summary importance and summary
coverage as desirable properties for a good schema
summary.

develop algorithms to automatically generate schema
summaries based on these two goals.

propose a Query Discovery Cost for objectively
evaluating the quality of schema summary
Example: Schema Summary
Example: Schema Summary
Full
schema
summary
Expanded
schema
summary
Abstract
element
Abstract
link
Value
link
Schema Summary

Definition 1: Schema
A schema is defined to be a labelled directed graph, SG =< E, S, V, r >

Definition 2: Schema Summary
A summary of schema SG =< E, S, V, r > is defined to be a labelled directed
graph, SS =< E’, S’, V’, M, AE, AL, r >
 r’ = r;
 E’ ⊆ E, S’ ⊆ S, V’ ⊆ V;
 M is a finite set of correspondences between original schema elements and
abstract elements, each (e e’ ) ∈ M, where e ∈ E and e’ ∈ AE, means that e is
directly or indirectly represented by e’ in the summary;
 AE is a finite set of abstract elements;
 AL is a finite set of abstract links;
Schema Summary Quality

Summary Importance : selected elements should be more
representative of the schema; (connectivity&&cardinality)

Summary Coverage : selected elements should be broadly
distributed throughout the schema (closeness).
Summary Element Importance
 Formula 1 (Schema Element Importance).
 the relative cardinality RC(e1 → e2) is calculated from the
database as the average number of e2 data nodes connected to
each e1 data node.
 r is the number of iterations.
 p is a tuning parameter called neighborhood factor.

 is set to the cardinality of the element in the database.
Summary Importance
Definition 3 (Summary Importance).
 the denominator of the fraction for importance is really the sum of
all element cardinalities in the database.
Summary Element Coverage
Formula 2 (Schema Element Affinity).
Formula 3 (Summary Element Coverage).
Summary Coverage
Definition 4 :Summary Coverage
 the denominator equals to the total cardinality of all the schema
elements.
Efficient Schema Summarization
Annotating Schema Graph
Output: SG
Maximizing Summary Importance
Output: E, the set of K most important elements
Maximizing Summary Coverage
Output : E, the set of K elements with highest summary coverage among
mutually non-dominant elements;
DS, the set of element pairs with dominance relationship.
Balancing Importance and Coverage
Output: E, the set of K elements for the balanced summary.
EVALUATING SCHEMA SUMMARY
Datasets
A suitable dataset must:
1) come with both data and schema information;
2) be complex enough for its summaries to be meaningful, and
3) have a set of known queries associated with it.
we chose two benchmarks, one XML and one relational, and one real dataset
1)
XMark ( XML benchmark dataset )
2)
TPC-H (relational benchmark dataset )
3)
MiMI (real word scientific dataset )
EVALUATING SCHEMA SUMMARY
 comparison with Expert Summaries
Agreement between automatic summaries and expert summaries on XMark and MiMI
datasets. User agreement measures the percentage of elements all three experts agree on.
EVALUATING SCHEMA SUMMARY
 Query Discovery Cost
Query Discovery: models how an ordinary user, defined as one with little
schema knowledge about the database, formulates a concrete query
specification given the information need she has in mind.
EVALUATING SCHEMA SUMMARY
Summary Benefits for Query Discovery: Impact of
Summary Size
EVALUATING SCHEMA SUMMARY
Summary Benefits for Query Discovery: Impact of
Schema Structure and Data Distribution
EVALUATING SCHEMA SUMMARY
Summary Benefits for Query Discovery: Impact
of Balancing Importance and Coverage
EVALUATING SCHEMA SUMMARY
Summary Benefits for Query Discovery: Impact of
Data Evolution
Agreement between summaries on different versions of MiMI
dataset. Current version is Jan 2006.
EVALUATING SCHEMA SUMMARY
Summary Benefits for Query Discovery: Comparison
with ER Model Abstraction
Comparing against ER model abstraction techniques on MiMI.
Summaries are of size 10.