Integrating Multiple Data Sources using a Standardized XML

Download Report

Transcript Integrating Multiple Data Sources using a Standardized XML

Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integrating Multiple Data Sources
using a Standardized XML Dictionary
Ramon Lawrence
University of Manitoba
[email protected]
Supervisor: Dr. Ken Barker
TRLabs - Winnipeg
Page 1
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Outline
 Introduction,
Motivation, and Background
 Integration architecture components
 Integration architecture
 Example integration
 Applications to the WWW
 Future work and conclusions
 Demonstration of Unity
Page 2
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Introduction
 Integration
of data is required when accessing
multiple databases within an organization or on
the WWW.
 Our focus is automatically combining
database schema using schema integration.
 Schema integration requires knowledge of data
semantics and use of metadata.
Page 3
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Motivation
 Organizations
have several database systems
which must interoperate.
 Users often access multiple Web databases
whose knowledge must be integrated and
presented in a useful form.
 Data warehouses and OLAP systems require
data semantics to be understood and data to
be cleansed and summarized.
Page 4
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Background
 Schema
integration involves combining
diverse database schema into an integrated
view by resolving conflicts.
 Schema conflicts include naming, structural,
and semantic conflicts.
 Schema integration is required for database
interoperability, but it is currently a manual
process.
Page 5
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
MDBS Architecture
Global Transactions
Global Transaction Manager (GTM)
•processes global transactions
•insures information in all LDBSs is
consistent
•submits subtransactions to the
GTSs for each LDBS
GTM
subtransactions
GTS GTS
LDBS LDBS
GTS
Global Transaction Servers (GTSs)
GTS
•one for each LDBS
•converts subtransactions from the
GTM into a form usable by the
LDBS and vice versa
LDBS LDBS Local Database Systems (LDBSs)
Local Transactions
•databases combined into MDBS
•unchanged as still process local
transactions
Page 6
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Previous Work
 Research
systems:
 integrating
systems by logical rules (Sheth)
 defining global dictionaries (Castano)
 Carnot Project using the Cyc knowledge base
 Industrial
systems and standards:
 Metadata
Interchange Specification (MDIS)
 XML, BizTalk, E-commerce portals
Page 7
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Architecture Objective
 The
objective of our architecture is to provide a
system for automatically integrating diverse
relational schemas into a multidatabase
 Desirable properties:
 individual
mappings - information sources integrated
one-at-a-time and independently
 global view constructed for query transparency
 handles schema conflicts - including semantic,
structural, and naming conflicts
 automated global integration - global view
constructed efficiently and automatically
Page 8
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
The Idea
 The
major idea is that schema conflicts can be
resolved if we:
 eliminate
all naming conflicts
 define a language capable of determining schema
equivalence and performing transformations
 With
these two properties, schema conflicts
can be resolved automatically at the global
level
Page 9
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Architecture Components:
The Global Dictionary
A
global dictionary (GD) provides
standardized terms to capture data
semantics.
 Hierarchy
of terms related by IS-A or Has-A links
 Contains base set of common database
concepts, but new concepts can be added
A
GD term is a single, unambiguous
semantic definition.
 Several
GD entries for a single English word are
required if the word has multiple definitions.
Page 10
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Architecture Components:
Using the Global Dictionary
 GD
terms are used to build semantic names
to describe the semantics of schema
elements.
 Semantic names have the form:
 semantic
name = “[“CT [[;CT] | [,CT]] “]” CN
 CT = context term, CN = concept name
 each CT and CN is a single term from the GD
 Semantic
names are included in
specifications describing a data source.
Page 11
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Architecture Components:
X-Specs
 Database
metadata and semantic names are
combined into specifications called X-Specs:
 stored
and transmitted using XML
 contains information on a relational schema
 organized into database, table, and field levels
 stores semantic names to describe and integrate
schema elements
Page 12
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Architecture Components:
Integrating X-Specs
 Each
database to be integrated is described
using a X-Spec.
 Identical concepts in different databases are
identified by similar semantic names.
 Concepts with identical (or hierarchially
related) semantic names are combined
regardless of their physical representation in
the individual databases.
Page 13
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Architecture
 Our
integration architecture consists of two
separate phases:
 capture
process: X-Specs are constructed for each
data source independently
 integration process: X-Specs are combined using the
integration algorithm which matches semantic
names using the global dictionary
Page 14
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Architecture:
The Capture Process
 Capture
process involves:
 automatically
extracting the schema information
and metadata using a specification editor
 assigning semantic names to each schema
element (tables and fields) to capture their
semantics
Page 15
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Architecture:
The Capture Process
Relational
Schema
Automatic
Extraction
Specification
Editor
Global
Dictionary
X-Spec
DBA Lookup
of terms
Page 16
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Architecture:
The Integration Process
 Integration
process involves:
 automatically
identifying identical concepts by
matching semantic names
 constructing a global view of database concepts
consisting of a hierarchy of concept terms
 resolving structural differences during query
generation and submission (e.g. a concept may
be represented as a table in one database and a
field (attribute) in another)
Page 17
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Architecture:
The Integration Process
Client
………….
Client
Integration Site
Subtransactions
X-Spec
X-Spec
RDBMS …….. RDBMS
Page 18
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Architecture Benefits
 The
benefits of the two phase architecture
are:
 Dynamic
integration: schemas integrated as
needed
 X-Specs are constructed only once and
independent of each other
 Automatic conflict resolution by integrating based
on semantic name rather than physical structure
 Users are isolated from system names and
organization by querying through a global view
using semantic names for concepts
Page 19
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Example
 Two
claims databases to be integrated:
 ABC
Company: Claims_tb(claim_id, claimant,
net_amount, paid_amount)
 XYZ Company: T_claims(id, customer, claim_amt),
T_payments(cid, pid, amount)
 First
step is to construct X-Specs for each
database.
Page 20
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Example:
ABC Database X-Spec
Type
System Name
Semantic Name
Table
Field
Claims_tb
Claim_id
[Claim]
[Claim] Id
Field
Field
Field
Claimant
Net_amount
Paid_amount
[Claim;Claimant] Name
[Claim] Amount
[Claim;Payment] Amount
Page 21
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Integration Example:
XYZ Database X-Spec
Type
System Name
Table T_claims
Field id
Field customer
Field claim_amt
Table
Field
Field
Field
T_payments
cid
pid
amount
Semantic Name
[Claim]
[Claim] Id
[Claim;Customer] Name
[Claim] Amount
[Claim;Payment]
[Claim] Id
[Claim;Payment] Id
[Claim;Payment] Amount
Page 22
Integration Example:
Integrated View
 Global
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
view after integration:
 [Claim]
 Id
 Net amount
 [Customer]

name
 [Payment]


id
amount
Page 23
Integration Example:
Discussion
 Important
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
points:
 system
and field names are not presented to the
user who queries based on semantic names
 database structure is not shown to the user
 different physical representations for the same
concept are combined (e.g. payment (attribute) in
ABC with payment table in XYZ database)
 hierarchially related concepts (customer vs.
claimant) are combined based on their IS-A
relationship in the global dictionary
Page 24
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Applications to the WWW
 Integrating
diverse data sources is involved in
constructing a data warehouse and other
operational systems.
 The WWW is a diverse organizations of
databases which users access.
 Automatically integrating web data sources by
a browser or portal reduces query complexity
and integration of results for the user.
Page 25
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Conclusions
 Automatic
integration of database schema is
possible by using a global dictionary of terms
and constructing semantic names for schema
elements.
 Integration of data sources has applications to
the WWW and construction of data
warehouses.
Page 26
Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence
Future Work
 The
integration architecture is evolving with
standards on XML and captures metadata
information in XML documents.
 The system is being tested on sample
problems, and a query mechanism is work-inprogress.
 We are refining a prototype of the system
called Unity.
Page 27