Data-Driven Understanding and Refinement of Schema Mappings

DATA-DRIVEN UNDERSTANDING
AND REFINEMENT OF SCHEMA
MAPPINGS
Data Integration and Service
Computing
ITCS 6010
INTRODUCTION
• USER
– Finding the correct mappings for an application is difficult
– Schema mappings are complex, and the subtleties involved are hard to
communicate effectively
– Understanding the source data is difficult, hence the need for a facility
for schema and data exploration
– Mappings are complex, and the differences between alternative
mappings are subtle
– Reasoning about complex non-associative operators is hard
– Data volumes are growing, along with the need to integrate data from multiple sources
– Integration requires mappings between these schemas
– But some issues still need to be addressed
ILLUSTRATIONS
“The ultimate goal of schema mapping is not building correct
queries but extracting the correct data from the source to populate
the target schema”
• The user is expected to
• Have a thorough understanding of the data
• Debug complex SQL queries or procedural transformations
• Clio makes this easier
ILLUSTRATIONS
Source: Ling Ling Yan, Renée J. Miller, Laura M. Haas, and Ronald Fagin. 2001. Data-driven understanding and refinement of schema mappings.
SIGMOD Rec. 30, 2 (May 2001), 485-496. DOI=10.1145/376284.375729 http://doi.acm.org/10.1145/376284.37572
MAPPINGS
• A mapping is a query on the source schema that produces
a subset of a target relation
• Mapping involves three main activities
• Determining correspondences
• Data linking
• Data trimming
• Let A be a set of attributes
• A relation on schema S is a named, finite set of tuples on S
• t[A] ∈ dom(A) is the value of tuple t on attribute A
Assumption:
• Relations in the source database do not contain any tuple
that is null on any attribute
• A predicate P over schema S maps tuples on S to true or
false
– Join predicate
– Selection predicate
• A predicate is strong if it evaluates to false for every tuple
that is null on all attributes in S
• A join predicate is a strong predicate
• A selection predicate is not required to be strong
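The strong-predicate condition can be illustrated with a minimal Python sketch. It assumes tuples are represented as dicts with None standing for null. The attribute names (Kids.pid, Parents.id, Parents.salary) and the helper names are hypothetical, chosen only for illustration.

```python
# Sketch only: tuples are dicts, None stands for null (an assumption).

def all_null(schema):
    """Build the tuple on `schema` that is null on every attribute."""
    return {attr: None for attr in schema}

def join_pred(t):
    # Equality join: any comparison involving null evaluates to false.
    a, b = t["Kids.pid"], t["Parents.id"]
    return a is not None and b is not None and a == b

def select_pred(t):
    # A selection predicate that deliberately accepts nulls, so it is not strong.
    return t["Parents.salary"] is None or t["Parents.salary"] > 0

def rejects_all_null(pred, schema):
    """Check the defining condition of a strong predicate on the all-null tuple."""
    return pred(all_null(schema)) is False

join_is_strong = rejects_all_null(join_pred, ["Kids.pid", "Parents.id"])
select_is_strong = rejects_all_null(select_pred, ["Parents.salary"])
```

Here the join predicate passes the check while the selection predicate does not, matching the two bullets above.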
CORRESPONDENCE TO TARGET
• Specifies which attributes appear in the target relation and how their values are computed
• E.g.: Kids.FamilyIncome = parents.salary + parents2.salary (ref
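A value correspondence such as the FamilyIncome example can be sketched as a plain function from source values to a target value. This is an illustration, not Clio's implementation: the dict representation, the function name family_income, and the choice to propagate nulls (as SQL arithmetic does) are assumptions.

```python
# Sketch only: source tuples are dicts, None stands for null (an assumption).

def family_income(parent1, parent2):
    """Kids.FamilyIncome = parents.salary + parents2.salary."""
    s1, s2 = parent1.get("salary"), parent2.get("salary")
    if s1 is None or s2 is None:
        # Assumed behaviour: a null operand yields a null result, as in SQL.
        return None
    return s1 + s2

income = family_income({"salary": 50000}, {"salary": 42000})
missing = family_income({"salary": 50000}, {"salary": None})
```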
DATA LINKING
DATA TRIMMING
• Not all tuples produced by the query graph G are semantically
meaningful
• Data associations in some categories may be too
incomplete to include
• The user can decide to exclude categories that have
incomplete coverage
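Trimming by category can be sketched as follows. The sketch assumes that a data association is a dict keyed by "Relation.attribute", that None stands for null, and that a category is identified by the set of relations contributing at least one non-null value. The function names and sample data are hypothetical.

```python
# Sketch only: representation and category definition are assumptions.

def category(association, relations):
    """Relations that contribute at least one non-null value to the association."""
    return frozenset(
        rel for rel in relations
        if any(v is not None
               for k, v in association.items()
               if k.startswith(rel + "."))
    )

def trim(associations, relations, excluded):
    """Drop every association whose category the user chose to exclude."""
    return [a for a in associations if category(a, relations) not in excluded]

relations = ["Kids", "Parents"]
associations = [
    {"Kids.name": "Ann", "Parents.salary": 50000},  # covers both relations
    {"Kids.name": "Bob", "Parents.salary": None},   # covers Kids only
    {"Kids.name": None, "Parents.salary": 42000},   # covers Parents only
]
# Exclude associations that carry parent data without a matching kid.
kept = trim(associations, relations, excluded={frozenset({"Parents"})})
```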
MAPPING DEFINITION
• A mapping defines the relationship between
a target relation and a set of source
relations, using three main
components:
– A query graph G
– A set V of value correspondences
– Two sets of filters, C_S and C_T, defining conditions
that source and target tuples should satisfy
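The components of a mapping can be sketched as a small Python data structure. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, the data associations of the query graph are taken as already computed, and the order of evaluation (source filters, then correspondences, then target filters) is one plausible reading rather than the paper's formal semantics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

@dataclass
class Mapping:
    graph_nodes: List[str]                     # G: source relations
    graph_edges: List[Tuple[str, str, str]]    # G: (relation, relation, join predicate)
    correspondences: Dict[str, Callable[[Row], object]]  # V: target attribute -> value
    source_filters: List[Callable[[Row], bool]] = field(default_factory=list)  # C_S
    target_filters: List[Callable[[Row], bool]] = field(default_factory=list)  # C_T

    def apply(self, associations: List[Row]) -> List[Row]:
        """Turn data associations (assumed precomputed from G) into target tuples."""
        out = []
        for a in associations:
            if not all(f(a) for f in self.source_filters):
                continue
            t = {attr: fn(a) for attr, fn in self.correspondences.items()}
            if all(f(t) for f in self.target_filters):
                out.append(t)
        return out

m = Mapping(
    graph_nodes=["Kids", "Parents"],
    graph_edges=[("Kids", "Parents", "Kids.pid = Parents.id")],
    correspondences={
        "name": lambda a: a["Kids.name"],
        "income": lambda a: a["Parents.salary"],
    },
    source_filters=[lambda a: a["Kids.name"] is not None],
    target_filters=[lambda t: t["income"] is not None],
)
rows = m.apply([
    {"Kids.name": "Ann", "Parents.salary": 50000},
    {"Kids.name": None, "Parents.salary": 42000},   # rejected by C_S
    {"Kids.name": "Bob", "Parents.salary": None},   # rejected by C_T
])
```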
MAPPING EXAMPLES
• A positive example shows how source tuples are combined and contribute
successfully to the target relation
• A negative example shows source tuples that are combined
correctly but fail to contribute to the target relation
MAPPING OPERATORS
• Correspondence Operators
Permit users to change the value correspondences.
• Data Trimming Operators
Modify the source and target filters of a mapping.
They do not change the query graph of a mapping.
• Data Linking Operators
Directly change the query graph of a mapping.
They are of two types:
• Data Walk
• Data Chase
DATA WALK
• In a data walk, the user knows where the missing data
resides in the source or more specifically what source
relation(s) contain this data.
• A data walk makes use of Clio’s knowledge of the source
schema (which is gathered from schema and constraint
definitions and from mining the source data, views, stored
queries and metadata).
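One way to picture a data walk is as path enumeration over a graph of known relationships between source relations. The sketch below is an assumption-laden illustration, not Clio's algorithm: the schema knowledge is reduced to undirected edges, the relation names are hypothetical, and the max_edges bound is an added parameter to keep the search small.

```python
from collections import deque

def data_walk(schema_edges, start, goal, max_edges=3):
    """Enumerate simple join paths from `start` to `goal` with at most `max_edges` edges."""
    adj = {}
    for a, b in schema_edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue
        for nxt in adj.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no repeated relation)
                queue.append(path + [nxt])
    return paths

# Hypothetical schema knowledge: pairs of relations known to be joinable.
edges = [("Kids", "Parents"), ("Parents", "Phones"),
         ("Kids", "Schools"), ("Schools", "Phones")]
paths = data_walk(edges, "Kids", "Phones")
```

Each returned path is a candidate extension of the query graph; the user would pick among them by looking at the example data each one produces.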
DATA CHASE
• In a data chase, the user does not know where the
missing data resides. The chase permits the user to
explore the source data incrementally to locate the
desired data.
• The user may not know which relations to include in the
extended query graph.
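A data chase can be pictured as looking up where a chosen value occurs elsewhere in the source. The sketch below assumes the source database is a dict from relation name to a list of row dicts; the function name and the sample data are hypothetical, and a real system would use indexes rather than a full scan.

```python
def data_chase(database, value):
    """Return the sorted (relation, attribute) pairs in which `value` occurs."""
    hits = set()
    for relation, rows in database.items():
        for row in rows:
            for attr, v in row.items():
                if v == value:
                    hits.add((relation, attr))
    return sorted(hits)

# Hypothetical source data.
database = {
    "Kids": [{"name": "Ann", "pid": 7}],
    "Parents": [{"id": 7, "salary": 50000}],
    "Phones": [{"owner": 7, "number": "555-0100"}],
}
occurrences = data_chase(database, 7)
```

Starting from the value 7 in Kids.pid, the user would see that it also appears in Parents.id and Phones.owner, and could extend the query graph along either occurrence.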
CLIO FOR LARGE MAPPINGS
• Clio must manage and manipulate multiple (possible) mappings
while the user explores the data, creates new
correspondences and extends the query graph.
• The more complex the relationship between source and target,
the more (possible) mappings we must handle.
• Large schemas are a source of complexity. Large
volumes of data need to be transformed.
• With unfamiliar data sources, the amount of data itself might be
an obstacle for mapping.
CLIO MAPPING FRAMEWORK
• Clio provides
• Target Viewer
• Gives a “What You See Is What You Get” flavor to the mapping.
• Source Viewer
• Serves as a palette from which users can choose the relations with
which they want to work or explicitly select an edge to follow.
• Provides a visualization of the query graph being constructed.
• A set of workspaces, each associated with a single mapping
alternative.
COMPLEX MAPPINGS
• Many of the single-target mappings created will have a great deal
of overlap, differing only in a few correspondences or a
small portion of the query graph.
• The decisions made in creating one mapping can be
stored and made available to the user in order to reduce the
burden and overhead of re-creating the bulk of each
mapping from scratch.
CLIO FOR COMPLEX MAPPINGS
• Clio automatically computes both possible mappings and
the user can accept one or several, adding filters as
needed.
• Clio’s rich framework supports the user in specifying
complex target mappings.
SUMMARY
• The paper presents a new framework that uses examples drawn
from source data to illustrate complex schema mappings.
• It provides formal definitions of mappings, mapping
examples and mapping operators, and shows how they
can be used to help a user understand the data and
develop mappings.
QUESTIONS?