Integrating Multiple Data Sources
using a Standardized XML Dictionary
Ramon Lawrence
University of Manitoba
[email protected]
Supervisor: Dr. Ken Barker
TRLabs - Winnipeg
Outline
Introduction, Motivation, and Background
Integration architecture components
Integration architecture
Example integration
Applications to the WWW
Future work and conclusions
Demonstration of Unity
Introduction
Integration of data is required when accessing multiple databases within an organization or on the WWW.
Our focus is automatically combining database schemas using schema integration.
Schema integration requires knowledge of data semantics and use of metadata.
Motivation
Organizations have several database systems which must interoperate.
Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form.
Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized.
Background
Schema integration involves combining diverse database schemas into an integrated view by resolving conflicts.
Schema conflicts include naming, structural, and semantic conflicts.
Schema integration is required for database interoperability, but it is currently a manual process.
MDBS Architecture
Global Transaction Manager (GTM)
•processes global transactions
•ensures information in all LDBSs is consistent
•submits subtransactions to the GTSs for each LDBS
Global Transaction Servers (GTSs)
•one for each LDBS
•converts subtransactions from the GTM into a form usable by the LDBS and vice versa
Local Database Systems (LDBSs)
•databases combined into MDBS
•unchanged as still process local transactions
[Figure: global transactions enter the GTM, which passes subtransactions to one GTS per LDBS; each LDBS also continues to receive local transactions.]
Previous Work
Research systems:
•integrating systems by logical rules (Sheth)
•defining global dictionaries (Castano)
•Carnot Project using the Cyc knowledge base
Industrial systems and standards:
•Metadata Interchange Specification (MDIS)
•XML, BizTalk, E-commerce portals
Architecture Objective
The objective of our architecture is to provide a system for automatically integrating diverse relational schemas into a multidatabase.
Desirable properties:
•individual mappings - information sources integrated one-at-a-time and independently
•global view constructed for query transparency
•handles schema conflicts - including semantic, structural, and naming conflicts
•automated global integration - global view constructed efficiently and automatically
The Idea
The major idea is that schema conflicts can be resolved if we:
•eliminate all naming conflicts
•define a language capable of determining schema equivalence and performing transformations
With these two properties, schema conflicts can be resolved automatically at the global level.
Architecture Components:
The Global Dictionary
A global dictionary (GD) provides standardized terms to capture data semantics.
•Hierarchy of terms related by IS-A or Has-A links
•Contains base set of common database concepts, but new concepts can be added
A GD term is a single, unambiguous semantic definition.
•Several GD entries for a single English word are required if the word has multiple definitions.
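As a rough illustration of the hierarchy described on this slide, a global dictionary could be modelled as a set of terms with IS-A and Has-A links. This is only a minimal sketch: the class names, the `related` helper, the sample terms, and their definitions are all hypothetical and are not taken from the Unity prototype. Terms are keyed by a unique name here, so an English word with several senses would need one distinctly named entry per sense.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One GD entry: a single, unambiguous semantic definition."""
    name: str
    definition: str
    is_a: Optional[str] = None                       # parent term (IS-A link)
    has_a: List[str] = field(default_factory=list)   # component terms (Has-A links)


class GlobalDictionary:
    """Hypothetical in-memory dictionary of standardized terms."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        # New concepts can be added, but each name must stay unambiguous.
        if term.name in self.terms:
            raise ValueError(f"duplicate term: {term.name}")
        self.terms[term.name] = term

    def ancestors(self, name: str) -> List[str]:
        """Return the chain of IS-A parents of a term, nearest first."""
        chain = []
        parent = self.terms[name].is_a
        while parent is not None:
            chain.append(parent)
            parent = self.terms[parent].is_a
        return chain

    def related(self, a: str, b: str) -> bool:
        """True if the terms are identical or linked through IS-A."""
        return a == b or a in self.ancestors(b) or b in self.ancestors(a)


# Sample terms (assumed for illustration only).
gd = GlobalDictionary()
gd.add(Term("Customer", "a party that buys goods or services"))
gd.add(Term("Claimant", "a customer who files a claim", is_a="Customer"))
gd.add(Term("Payment", "an amount paid"))
gd.add(Term("Claim", "a request for payment", has_a=["Payment"]))
```

With these sample entries, `gd.related("Customer", "Claimant")` holds because of the assumed IS-A link, while `gd.related("Claim", "Payment")` does not, since Has-A links are stored but not used for matching in this sketch.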
Architecture Components:
Using the Global Dictionary
GD terms are used to build semantic names to describe the semantics of schema elements.
Semantic names have the form:
•semantic name = "[" CT [[;CT] | [,CT]] "]" CN
•CT = context term, CN = concept name
•each CT and CN is a single term from the GD
Semantic names are included in specifications describing a data source.
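The semantic-name form above can be read mechanically. The following is a minimal parsing sketch, not the parser used by the authors: the function and type names are hypothetical, the two separators (";" and ",") are treated alike because the slide does not spell out how they differ, and the concept name is treated as optional because the example X-Specs later in the deck use names such as `[Claim]` without one.

```python
import re
from typing import List, NamedTuple, Optional


class SemanticName(NamedTuple):
    context: List[str]       # context terms (CT), in written order
    concept: Optional[str]   # concept name (CN), or None if absent


# "[" CT {; or , CT} "]" followed by an optional concept name
_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text: str) -> SemanticName:
    """Split '[CT;CT,...] CN' into its context terms and concept name."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    # Assumption: ';' and ',' both simply separate context terms here.
    context = [term.strip() for term in re.split(r"[;,]", match.group(1))]
    if any(not term for term in context):
        raise ValueError(f"empty context term in: {text!r}")
    concept = match.group(2).strip() or None
    return SemanticName(context, concept)


payment_amount = parse_semantic_name("[Claim;Payment] Amount")
claim_table = parse_semantic_name("[Claim]")
```

A real implementation would also check that every CT and CN is a term in the global dictionary, which this sketch leaves out.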
Architecture Components:
X-Specs
Database metadata and semantic names are combined into specifications called X-Specs:
•stored and transmitted using XML
•contain information on a relational schema organized into database, table, and field levels
•store semantic names to describe and integrate schema elements
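To make the three levels concrete, here is a sketch of how an X-Spec-like document might be produced with Python's standard `xml.etree.ElementTree`. The element and attribute names (`XSpec`, `Database`, `Table`, `Field`, `name`, `semanticName`) are assumptions chosen for illustration; the slides do not show the actual X-Spec schema. The sample data is the ABC claims table from the integration example later in the deck.

```python
import xml.etree.ElementTree as ET


def build_xspec(database, tables):
    """Build an XML tree with database, table, and field levels.

    `tables` is a list of (table_name, table_semantic_name, fields),
    where `fields` is a list of (field_name, field_semantic_name).
    """
    root = ET.Element("XSpec")
    db = ET.SubElement(root, "Database", name=database)
    for table_name, table_sem, fields in tables:
        table = ET.SubElement(db, "Table", name=table_name, semanticName=table_sem)
        for field_name, field_sem in fields:
            ET.SubElement(table, "Field", name=field_name, semanticName=field_sem)
    return root


abc_xspec = build_xspec(
    "ABC",
    [
        (
            "Claims_tb",
            "[Claim]",
            [
                ("claim_id", "[Claim] Id"),
                ("claimant", "[Claim;Claimant] Name"),
                ("net_amount", "[Claim] Amount"),
                ("paid_amount", "[Claim;Payment] Amount"),
            ],
        )
    ],
)

# Serialize to text so the specification can be stored or transmitted.
xml_text = ET.tostring(abc_xspec, encoding="unicode")
```

Keeping the system name and the semantic name side by side on every element is what would later let an integrator match concepts by semantic name while still knowing where the data physically lives.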
Architecture Components:
Integrating X-Specs
Each database to be integrated is described using an X-Spec.
Identical concepts in different databases are identified by similar semantic names.
Concepts with identical (or hierarchically related) semantic names are combined regardless of their physical representation in the individual databases.
Integration Architecture
Our integration architecture consists of two separate phases:
•capture process: X-Specs are constructed for each data source independently
•integration process: X-Specs are combined using the integration algorithm which matches semantic names using the global dictionary
Integration Architecture:
The Capture Process
Capture process involves:
•automatically extracting the schema information and metadata using a specification editor
•assigning semantic names to each schema element (tables and fields) to capture their semantics
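The two capture steps can be sketched as follows. This is not the specification editor itself: SQLite (via Python's standard `sqlite3` module) stands in for the relational source, and the function names and the `"table.field"` key convention are hypothetical. The semantic names are supplied by hand, as a DBA would after looking up terms in the global dictionary.

```python
import sqlite3


def extract_schema(conn):
    """Return {table: [(column, declared_type), ...]} for every user table."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


def annotate(schema, names):
    """Pair each table and field with a semantic name (None if unassigned)."""
    rows = []
    for table, columns in schema.items():
        rows.append(("Table", table, names.get(table)))
        for column, _declared_type in columns:
            rows.append(("Field", column, names.get(f"{table}.{column}")))
    return rows


# Step 1: automatic extraction from a sample in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Claims_tb "
    "(claim_id INTEGER, claimant TEXT, net_amount REAL, paid_amount REAL)"
)
schema = extract_schema(conn)

# Step 2: semantic names assigned by the DBA (taken from the ABC example).
rows = annotate(
    schema,
    {
        "Claims_tb": "[Claim]",
        "Claims_tb.claim_id": "[Claim] Id",
        "Claims_tb.claimant": "[Claim;Claimant] Name",
        "Claims_tb.net_amount": "[Claim] Amount",
        "Claims_tb.paid_amount": "[Claim;Payment] Amount",
    },
)
conn.close()
```

Only the first step is automatic; the second needs human knowledge of what each table and field means, which is why the capture process is done once per data source.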
Integration Architecture:
The Capture Process
[Figure: the capture process. A relational schema goes through automatic extraction into the specification editor, where the DBA looks up terms in the global dictionary; the output is an X-Spec.]
Integration Architecture:
The Integration Process
Integration process involves:
•automatically identifying identical concepts by matching semantic names
•constructing a global view of database concepts consisting of a hierarchy of concept terms
•resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
Integration Architecture:
The Integration Process
[Figure: the integration process. Clients connect to the integration site, which holds an X-Spec for each RDBMS and sends subtransactions to the RDBMSs.]
Integration Architecture Benefits
The benefits of the two-phase architecture are:
•Dynamic integration: schemas integrated as needed
•X-Specs are constructed only once and independently of each other
•Automatic conflict resolution by integrating based on semantic name rather than physical structure
•Users are isolated from system names and organization by querying through a global view using semantic names for concepts
Integration Example
Two claims databases to be integrated:
•ABC Company: Claims_tb(claim_id, claimant, net_amount, paid_amount)
•XYZ Company: T_claims(id, customer, claim_amt), T_payments(cid, pid, amount)
The first step is to construct an X-Spec for each database.
Integration Example:
ABC Database X-Spec
Type    System Name    Semantic Name
Table   Claims_tb      [Claim]
Field   Claim_id       [Claim] Id
Field   Claimant       [Claim;Claimant] Name
Field   Net_amount     [Claim] Amount
Field   Paid_amount    [Claim;Payment] Amount
Integration Example:
XYZ Database X-Spec
Type    System Name    Semantic Name
Table   T_claims       [Claim]
Field   id             [Claim] Id
Field   customer       [Claim;Customer] Name
Field   claim_amt      [Claim] Amount
Table   T_payments     [Claim;Payment]
Field   cid            [Claim] Id
Field   pid            [Claim;Payment] Id
Field   amount         [Claim;Payment] Amount
Integration Example:
Integrated View
Global view after integration:
[Claim]
•Id
•Net amount
[Customer]
•name
[Payment]
•id
•amount
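The grouping in this global view can be approximated by merging the semantic names of the ABC and XYZ X-Specs. The sketch below is a simplification and not the authors' integration algorithm: the `IS_A` table (Claimant IS-A Customer) is an assumed excerpt of the global dictionary, specific terms are replaced by their most general ancestor, and concepts are grouped under their full context rather than arranged into a true hierarchy.

```python
import re

# Semantic names taken from the two X-Specs in the example.
ABC = ["[Claim]", "[Claim] Id", "[Claim;Claimant] Name",
       "[Claim] Amount", "[Claim;Payment] Amount"]
XYZ = ["[Claim]", "[Claim] Id", "[Claim;Customer] Name", "[Claim] Amount",
       "[Claim;Payment]", "[Claim] Id", "[Claim;Payment] Id",
       "[Claim;Payment] Amount"]

# Assumed IS-A links from the global dictionary (child -> parent).
IS_A = {"Claimant": "Customer"}


def generalize(term):
    """Follow IS-A links up to the most general term."""
    while term in IS_A:
        term = IS_A[term]
    return term


def parse(name):
    """Return (context tuple, concept name or None) for a semantic name."""
    match = re.match(r"^\[([^\]]+)\]\s*(.*)$", name)
    if match is None:
        raise ValueError(f"not a semantic name: {name!r}")
    context = tuple(generalize(t.strip())
                    for t in re.split(r"[;,]", match.group(1)))
    return context, (match.group(2).strip() or None)


def integrate(*xspecs):
    """Merge semantic names into {context: [concept names]}."""
    view = {}
    for xspec in xspecs:
        for name in xspec:
            context, concept = parse(name)
            concepts = view.setdefault(context, [])
            if concept is not None and concept not in concepts:
                concepts.append(concept)
    return view


view = integrate(ABC, XYZ)
for context, concepts in view.items():
    print("[" + ";".join(context) + "]", ", ".join(concepts))
```

Because matching is done on semantic names only, the payment stored as a field in ABC and as a table in XYZ ends up under the same context, and the claimant and customer names collapse into one concept under the assumed IS-A link.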
Integration Example:
Discussion
Important points:
•system and field names are not presented to the user, who queries based on semantic names
•database structure is not shown to the user
•different physical representations for the same concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database)
•hierarchically related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary
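One way to picture how a user can query by semantic name without seeing system names is a lookup from each semantic name to its physical location in every database. The following is a hypothetical sketch only: the slides note that the query mechanism is still in progress, so the `LOCATIONS` table, the `subqueries` function, and the generated SQL are illustrative assumptions. The table and field names come from the ABC and XYZ schemas in the example, and joins (for instance between T_claims and T_payments) are ignored.

```python
# Semantic name -> {database: (table, field)}; a hand-built excerpt.
LOCATIONS = {
    "[Claim] Id": {
        "ABC": ("Claims_tb", "claim_id"),
        "XYZ": ("T_claims", "id"),
    },
    "[Claim] Amount": {
        "ABC": ("Claims_tb", "net_amount"),
        "XYZ": ("T_claims", "claim_amt"),
    },
    "[Claim;Payment] Amount": {
        "ABC": ("Claims_tb", "paid_amount"),   # payment stored as a field
        "XYZ": ("T_payments", "amount"),       # payment stored as a table
    },
}


def subqueries(semantic_name):
    """Translate one semantic name into a per-database SQL subquery."""
    if semantic_name not in LOCATIONS:
        raise KeyError(f"unknown semantic name: {semantic_name}")
    return {
        database: f"SELECT {field} FROM {table}"
        for database, (table, field) in LOCATIONS[semantic_name].items()
    }


payment_queries = subqueries("[Claim;Payment] Amount")
```

The user asks for `[Claim;Payment] Amount` once; the different physical representations in ABC and XYZ only show up in the generated subqueries.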
Applications to the WWW
Integrating diverse data sources is involved in constructing a data warehouse and other operational systems.
The WWW is a diverse organization of databases which users access.
Automatically integrating web data sources by a browser or portal reduces query complexity and the work of integrating results for the user.
Conclusions
Automatic integration of database schemas is possible by using a global dictionary of terms and constructing semantic names for schema elements.
Integration of data sources has applications to the WWW and construction of data warehouses.
Future Work
The integration architecture is evolving with standards on XML and captures metadata information in XML documents.
The system is being tested on sample problems, and a query mechanism is a work in progress.
We are refining a prototype of the system called Unity.