Transcript Slides

Schema Summarization
Cong Yu and H. V. Jagadish
University of Michigan, Ann Arbor
VLDB 2006, Seoul, Korea
September 13th, 2006
Many Databases Are Complex
Source
Type
Number of
Elements*
BioWarehouse
Relational
382
Reactome
Relational
679
MAGE-ML
XML
1581
MiMI
XML
267
*Number of elements
= #tables + #columns (relational)
= #elements + #attributes (XML)
2 / 39
Reactome Schema
3 / 39
What’s the Problem ?
• Why are complex schemas difficult to
deal with ?
 For data integration administrators (DIAs):
Difficult to grasp the major topics of a
complex schema
 For ordinary users: Difficult to identify the
small subset of relevant schema elements
• Can we avoid them ?
 Probably not: scientific databases are in fact
getting more and more complex – MiMI is
an example
4 / 39
Existing Approaches
• Ignore the schema
 Keyword-based search over relational and
XML databases
• Guess the schema
 Schema-Free XQuery, FleXPath, etc.
• Limitations:
 Provide imprecise (and sometimes incorrect)
answers
 No help in understanding the schema (and
the database) itself
5 / 39
Our Approach
• Summarize the schema
 Represent the original complex schema
with a simpler schema, i.e., a summary
of the original schema
 Help users explore the schema via the
summary


Illustrates the main topics of the database
Filters away irrelevant parts of the schema
Challenge: how to create a good summary ?
6 / 39
Talk Outline
• Motivation
• Background Definitions
• Desiderata of Schema Summary
• Efficient Schema Summarization
• Evaluation
• Conclusion and Related Work
7 / 39
Schema
• A labeled, directed graph
• Nodes:
warehouse
state*
authors
store*
@name
author*
contact
book*
@id @name
@name
isbn
title
@address
author*
price
 Relational: table and
column
 Hierarchical: element and
attribute
• Links:
 Structural links:
parent/child constraints
 Value links: inclusion
constraints (key / foreign
key)
8 / 39
Schema Summary
• A schema itself, but:
warehouse
state*
authors
@name
store*
author*
author*
contact
book*
book*
@id @name
@name
isbn
title
@address
author*
price
 Fewer number of
elements  Simpler
 Contains abstract
elements and links
• Abstract element:
 Represents a group of
original elements
• Abstract link:
 Connects at least one
abstract element
9 / 39
Talk Outline
• Motivation
• Background Definitions
• Desiderata of Schema Summary
• Efficient Schema Summarization
• Evaluation
• Conclusion and Related Work
10 / 39
What Makes a Good Schema Summary ?
warehouse
warehouse
warehouse
state*
authors
store*
@name
store*
author*
contact
book*
author*
book*
@id @name
book*
@name
isbn
title
@address
price
author*
• Which one should be the summary ?
11 / 39
What Information Do We Need ?
• Schema summary is not only a summary of
the “schema,” but also in fact a summary of
the “database” !
warehouse
state*
schema structure
and
data distribution
authors
store*
@name
author*
contact
book*
@id @name
@name
isbn
title
@address
price
author*
12 / 39
Desired Properties of Schema Summary
• Small enough (in terms of number of elements)
•
•
to comprehend – Summary Complexity
Show elements in which users are more likely
to be interested – Summary Importance
Show elements that represent the entire
database well – Summary Coverage
• Importance and Coverage calculation will need
to consider both schema structure and data
distribution
13 / 39
Intuition Behind Importance
warehouse
• Not all schema
•
elements are created
equal !
First Observation:
state*
authors
store*
@name
author*
 more links, more
important - schema
contact
book*
@id @name
• Second Observation:
@name
isbn
title
@address
 more popular, more
important - data
author*
14 / 39
price
Compute Summary Importance
• Schema Element Importance
I  p I
r
e
r 1
e
 (1  p )   We j e  I
j
r 1
ej
We j e 
RC e j e

k
RC e j ek
• W: Neighbor Weight – the percentage of ej’s
•
information flows into e, estimated using
relative cardinalities
Summary Importance
Summary Element Importance
RSS 
Total Importance
15 / 39
Intuition Behind Coverage
warehouse
• Important ≠
Inclusion in the
summary
state*
authors
 Elements can be too
“close” to each other
• Two basic notions
store*
@name
author*
contact
book*
@id @name
@name
 Element Affinity
 Element Coverage
isbn
title
@address
author*
17 / 39
price
Intuition Behind Coverage, cont’d
• Element Affinity:
warehouse
 less hops, higher
affinity
 higher relative
cardinality, lower
affinity
state*
authors
store*
@name
author*
contact
• Element Coverage:
 Element Affinity
 Neighbor Weight
book*
@id @name
@name
isbn
title
@address
author*
18 / 39
price
Compute Summary Coverage
• Schema element affinity from ea to eb
pathi
ea eb
A
1

ni
ni
(
j 2
1
RC e j1 e j
)
Aea eb  max ( A
pathi
ea eb
i
)
• Schema element coverage of eb by ea
ni
path
i
C

Card

max
(
C
Cepath

(
A

W
)
e
e e )
 e j1 e j e j e j1 e e
a  eb
i
a
j 2
b
b
i
a
• Summary Coverage
Combined Summary Element Coverage
CSS 
Total Coverage
19 / 39
b
What makes a good schema summary ?
data
distribution
schema
structure
summary
importance
20 / 39
summary
coverage
Talk Outline
• Motivation
• Background Definitions
• Desiderata of Schema Summary
• Efficient Schema Summarization
• Evaluation
• Conclusion and Related Work
21 / 39
Overview
Schema
K
Database
(1) Annotating
Schema Graph
(2.1) Calculating
Importance
List L of
elements sorted
by Importance
(Computing statistics)
(2.2) Calculating
Coverage
(Algorithms MaxImportance and
MaxCoverage)
Set of K elements with
high coverage;
Set S of Coverage
Domination Pairs
(3) Determine K
summary elements
(4) Cluster Original
Schema Elements
(Algorithm BalanceSummary)
Balanced
Summary of Size K
22 / 39
Algorithm MaxImportance
• MaxImportance generates a summary of
a given size k, maximizing summary
importance
Compute steadystate element
importance values
Sort and pick top-k
important elements
• Complexity: O(N2 + NlogN)
* Convergence is proved in [MGR02].
23 / 39
Compute assignments
of remaining elements
Algorithm MaxCoverage
• MaxCoverage generates a summary of a
given size k, maximizing summary
coverage in a heuristic way
Compute coverage
dominance (bottom
up with A/D pairs)
Eliminate elements
being dominated;
Compute summary
coverage for all
element set of size-k
Pick the set with
highest coverage
• Complexity: O(kN2nk)
* See paper for details on coverage dominance
24 / 39
Generate Balanced Summary
• No single optimal criteria to balance the
two desired properties
• A heuristic approach:
 Pick elements in the order of their
importance
 Ignore elements that are dominated by
elements already in the summary
• Works well in practice
27 / 39
Talk Outline
• Motivation
• Background Definitions
• Desiderata of Schema Summary
• Efficient Schema Summarization
• Evaluation
• Conclusion and Related Work
28 / 39
Evaluation Strategies
• Observation
 Comparing automatic summaries with
summaries generated by human experts
 In general, automatic summaries agree well
with human (~ 80%)
• An objective evaluation framework
 Models schema exploration based query
behavior
 Query discovery cost: the number of extra
elements visited in order to construct a
correct query from a query intention
29 / 39
Query Discovery Cost Example
• Query Intention: Retrieve ISBN of all books
• Query: for $b in doc()/state/store/book return $b/isbn
warehouse
warehouse
Cost = 5
Cost = 3
state*
state*
authors
author*
contact
store*
@name
store*
@name
book*
contact
book*
author*
book*
@id @name
@name
@name
isbn
title
@address
price
isbn
title
@address
author*
author*
30 / 39
price
Data Sets
XMark
TPC-H
MiMI
# schema
elements
327
70
155
# data
nodes (000)
1,573
12,500
7,055
# queries
20
22
52
Avg. query
size
3.65
13.4
3.35
31 / 39
Summary Benefits
XMark
TPC-H
MiMI
# summary
elements
10
5
10
# schema
elements
327
70
155
Avg. cost w/o
summary
11.90
18.41
10.38
Avg. cost with
summary
6.65
12.05
3.90
savings %
44.1%
34.5%
62.4%
# rounds to
converge
160
45
426
32 / 39
Contributions of Schema
Structure and Data Distribution
33 / 39
Impact of Balancing Importance
and Coverage
XMark TPC-H MiMI
*
Avg. cost w/o
summary
11.90
18.41
10.38
Avg. cost with
balanced summary
6.65
12.05
3.90
Avg. cost with
importance-only
8.35
(32.4%*)
12.36
(4.6%)
5.56
(25.6%)
Avg. cost with
coverage-only
10.20
(67.6%)
12.18
(2.0%)
5.78
(29.0%)
Percentage in parenthesis shows the reduction in savings
34 / 39
Talk Outline
• Motivation
• Background Definitions
• Desiderata of Schema Summary
• Efficient Schema Summarization
• Evaluation
• Conclusion and Related Work
35 / 39
Related Work
• First study on summarizing schemas
• Related to ER model abstraction
• Limitations of ER model abstraction
 Does not reflect the data distribution
 ER models may not be available and may be out-ofdate
 For most database schemas, structure or value
links are semantics-free, ER model abstraction
methods are ineffective in this case (tagging those
links involve significant amount of manual effort)
36 / 39
Related Work, cont’d
• Summary element importance
calculation is partially inspired by
PageRank
• Summary element affinity calculation
(used in summary coverage) is partially
inspired by similar measurements in
social network analysis
37 / 39
Conclusions and Contributions
• Introduced concept of schema summary
• Defined summary importance and
summary coverage as desiderata of
schema summary
• Emphasized both schema structure and
data distribution as essential features
for importance and coverage calculation
• Designed and implemented efficient
schema summarization algorithms
• An objective evaluation framework
38 / 39
Questions ?
39 / 39