Transcript Slide 1

Research Topics in Computing
Data Modelling for Data Schema
Integration
1 March 2005
David George
Modelling & Data Integration
Key Elements of today’s Presentation
David George

Key Drivers for Data Integration

Dimensions and Issues in Integration

Three Integration Approaches
Data Integration
2
Drivers for Data Integration
David George
Data Integration
3
Drivers for Data Integration



Organisations evolving as global entities with
distributed data.
Systems characterised by mix of legacy and new
databases and applications.
Organisational change :



David George
(1)
Organic growth – size and diversity.
Business re-engineering.
Corporate mergers and acquisitions.
Data Integration
4
Drivers for Data Integration


Organisations evolved as collections of distinct,
autonomous departments with disconnected systems
e.g. in financial services.
Trends in Business Intelligence initiatives:




David George
(2)
Decision-making support.
Customer segmentation.
Marketing strategies.
Development of distributed or multidatabase
systems.
Data Integration
5
Dimensions and Issues in
Integration
David George
Data Integration
6
Architecture & Design Issues

Multidatabase systems can be classified in two ways:

Homogeneous systems – local databases having same
techniques and language.


Heterogeneous systems – local databases demonstrating
diverse data models and language.
Key Dimensions in systems heterogeneity


David George
System heterogeneity – hardware, OS, DBMS
Semantic heterogeneity - models and data
Data Integration
7
Why Heterogeneity/Conflict?
Conceptualisation by
Representation by
Design >>
>>>>
Real World
Database World
(representation)
Conceptual World
<< Check
<<<<
Denotation of
Interpretation of
Translating conceptualisations of the real world
into database world representations
David George
Data Integration
8
Research Work Conceptualised

Books Model (a)
The data of interest is about Books, their
Publishers and adopting Universities.

Publications Model (b)
The data of interest is about
Publications and their Types
David George
Data Integration
9
Books
Name
Title
Publisher
Name
Published by
Address
Book
Adopted by
City
University
Refer to
Name
Topics
---------------------------------------------------
Publications
Title
Publication
Code
Publisher
contains
Word
Keywords
David George
Data Integration
Code
Research Area
10
A
Name
Title
Publisher
Name
Published by
Address
Book
Adopted by
City
University
Refer to
Name
Topics
B
--------------------------------------------------Title
Publisher
Name
Published by
Publication
contains
Word
Name
Keywords
Topics
David George
Code
Data Integration
Code
Research Area
11
Books and Publications Integrated
Name
Publisher
Name
Title
Address
Published by
Published by
Book
Adopted by
City
University
Refer to
Name
Publication
contains
Topics
Code
Research Area
Title
David George
Code
Data Integration
12
Semantic Heterogeneity/Conflict
Structural Conflicts
 Generalisation versus Specialisation Conflicts.
 Entity versus attributes.
 Naming conflicts.
Attribute (Domain) Conflicts
 Data Type conflicts.
 Measure and Scale conflicts.
 Integrity, Presence & Absence.
 Data Values
David George
Data Integration
13
Semantic Heterogeneity/Conflict

Generalisation/Specialisation Conflicts.
Person
Gender
(i.e. Structural)
Person
Gender
Female

Male
Naming conflicts.

Synonyms
e.g.
Customer

Homonyms
e.g.
Market (Products) vs
vs
Data Integration
Client
Market (Customers)
14
Semantic Heterogeneity/Conflict

Data Type (representation) conflicts.



-
26254006 (integer or string)
No vs Name (integer or string)
Measure and Scale etc conflicts.





David George
Student
Student
Dimension Measure Scale
Precision Date
-
volume vs weight
light years vs miles
miles vs kilometres
1:100 versus A:E
dd/mm/yyyy vs mm-dd-yy ???
Data Integration
15
Semantic Heterogeneity/Conflict

Integrity Constraints e.g.
Age Range <21 vs Age >18
 Referential conflict 1:1 vs 1:M
(e.g. 1 invoice for 1/ M orders)


Presence/Absence.



Data Values

David George
No null, nulls – e.g. optional
No corresponding attribute
Same items different values
Data Integration
16
Integration Approaches
David George
Data Integration
17
Integration Approaches



David George
Federated Database (Multidatabase) Systems.
Data Warehouse (Materialised in house)
Systems.
Mediators (Virtual integration) Systems.
Data Integration
18
Federated Database Systems
David George
Data Integration
19
Federated Databases
(1)
FDBMS
Component DB 1
(Centralised DBMS)
Comp DB 1
David George
Component DB 2
(Distributed DBMS)
Comp DB 2.1
Component DB n
(another FDBMS)
Comp DB 2.2
Data Integration
20
Federated Databases

A Class of heterogeneous databases that:




Consist of both new and old systems.
Previously existed in their own stand-alone
(autonomous) environments.
Integration is a consequence of distribution.
Organisation can adopt different architectures i.e. the
way databases are mapped together:


David George
(2)
Loosely Coupled integrations.
Tightly Coupled integrations.
Data Integration
21
Federated Databases

Tightly Coupled Federations


David George
(3)
Federation administrator determines schema view for all
component systems in the federation.
Negotiates export schemas (tables and attributes) from
federation participants who control exports of local schemas.

Local schema exports integrated as a federated schema.

Less autonomy at federation user level for view creation.
Data Integration
22
Federated Databases

Loosely Coupled Federations

The federated component databases have a greater degree
of autonomy.

No central schema view is imposed on users.

Federated user is effectively an administrator creating views.

David George
(4)
User employs a MDB Query Language (v TC schema
integration).
Data Integration
23
Federated Databases





David George
(5)
Sharing is made explicit by allowing export schemas
from the local or component database.
The export schemas are imported to the federation to
represent the shareable federated database.
Each source can call on others for information.
FDBMSs differ from homogeneous Distributed DBMSs
– they use the same data model and DBMS.
DDBMSs sharing is therefore implicit.
Data Integration
24
Data Warehousing Systems
David George
Data Integration
25
Data Warehousing
Network
Internet
Local Schema
Local Schema
Web
Sourc
e
O/RDB
(1)
Local
Operational
--------------------------------------------------Wrapper
Integration
&
Storage
R2
Wrapper
Data
Extraction
Global Schema
R3
Repository
Warehouse
Decision Support
& Mining
User Query
David George
Data Integration
26
Data Warehousing


Represents the physical separation of operational and
decision support environments.
Operational data provides the raw material for:




David George
(2)
Decision support systems.
Data-mining (DM).
E.g. identifying trends or characteristics.
DM = process of “non-trivial extraction of implicit,
previously unknown, and potentially useful
information”.
Data Integration
27
Data Warehousing
(3)

Warehouse integrates multiple, heterogeneous data
sources - e.g. Relational DBs, flat files.

Data is pre-fetched into a central or intermediate
warehouse repository by mediation process.

Data is “cleaned” and data integration techniques
applied e.g. filtered, joined or aggregated.


David George
Data may be transformed to conform to the
warehouse schema.
Provides consistency in naming conventions, data
structures, attributes, etc.
Data Integration
28
Data Warehousing


(4)
Data then stored (materialised) in warehouse
repository – possibly in separate data marts.
Result is a repository of synthesised data for
management decision-making.

Queries are made over the repository’s global
schema.

Information is independent from the source data.

Data extraction tends to be periodically.
David George
Data Integration
29
Mediator (+Wrapper) Systems
David George
Data Integration
30
Mediator Systems
(1)
Network
Internet
Local Schema
Local Schema
Web
Sourc
e
O/RDB
Data
Sources
Wrapper
Wrapper
Mediated Schema
Query1
Query Translation
Query 2
Mediator
Integration
System
User Query
David George
Data Integration
31
Mediator Systems
(2)

Global schema created and mapped to the source
schemas.

User makes queries over global, mediated schema.

Mappings can be either:
 Global-as-view (GAV).
 Local-as-view (LAV).

Mediator translates global schema query and
reformulates it into sub-queries of local schemas.

Wrappers execute and return.
David George
Data Integration
32
Mediator Systems




David George
(3)
Wrappers standardise how source information is
described and accessed (i.e. they translate or
adapt).
Query answers are returned to the user on
demand – after sources are interrogated.
Thus data is always up-to-date (v. Warehousing).
Mediators integrate information view, without
integrating the source data.
Data Integration
33
Mediator Systems




David George
(4)
Results in a homogeneous information source using
views - based on the mediated (global) schema.
Integration is virtual i.e. retrieved by the mediator
but not stored in any central repository.
Differs from Warehousing Queries – where made to
materialised data.
In short – provides virtual source schema integration
via schema mapping and integrated view.
Data Integration
34
Comparisons
David George
Data Integration
35
Federation versus
Warehousing & Mediation


David George
Federation represents a more “static” approach –
using agreed couplings to allow view creation.
Warehousing and Mediation addresses integration in
a more “dynamic” way – using extraction,
transformation and integration processes.
Data Integration
36
Warehousing vs. Mediation

Warehouse:



Heterogeneous data is integrated in advance and stored inhouse for direct query and analysis.
Mediation:



David George
Update-driven: i.e. in warehouse repository
Wrapper and Mediator layer on top of source DBs.
Query-driven: Query to mediated schema then translated
into queries appropriate to sources.
Results integrated into a global answer set.
Data Integration
37
Summary
David George
Data Integration
38
Summary

Drivers for Data Integration



Integration Issues



Different Conceptual Model representations.
Resulting Semantic Heterogeneities.
Integration Approaches


David George
Organisational change.
Business Intelligence and Strategies.
Federated Systems.
Data Warehousing and Mediator Systems.
Data Integration
39
Next step ……
David George
Data Integration
40
Research Resources
Reference Material



Journals
Books
Presentation slides
UCLAN Website

Internal:
http://janus/dgeorge/integration/journals.asp

External:
http://www.janus.computing.uclan.ac.uk/dgeorge/integration/journals.asp
David George
Data Integration
41