Data Integration - Lab of Web and Mobile Data Management


Data Integration: Achievements and Perspectives in the Last Ten Years
AiJing
Outline






Motivation & Background
Best Paper: Information Manifold
Building on the Foundation
Data Integration Industry
Future Challenges
Conclusion
Motivation & Background

Data integration is a pervasive challenge
faced in applications that need to query across
multiple autonomous and heterogeneous data
sources.

Data integration is crucial in large enterprises
that own a multitude of data sources.

Data integration also enables better cooperation among agencies,
each with its own data sources.
[Figure: data integration over enterprise databases, legacy databases, and services and applications]
Ten-Year Best Paper
"Querying Heterogeneous Information Sources Using Source Descriptions", VLDB 1996
Alon Halevy
then a principal member of technical staff at AT&T Bell Laboratories, and later at AT&T Laboratories.
• Main idea: the Information Manifold
• Impact: led to tremendous progress on data integration and to quite a few commercial data integration products.
The Information Manifold

An implemented data integration system

Goal: provide a uniform query interface to a
heterogeneous collection of Web data sources

Main contribution: the way it described the contents of
the data sources it knew about.

IM contains declarative descriptions of the contents and
capabilities of the information sources. (Source
Description)
An example of a complex query
Find reviews of movies directed by Woody Allen that are playing in my area.
Answering it requires joining three web sites:
1. a movie site containing actor and director information (IMDB)
2. movie playing sources (e.g., 777film.com)
3. movie review sites (e.g., a newspaper)
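
The three-way join above can be sketched in a few lines. This is only an illustration: the sample rows, the field names, and the function `reviews_for_director_in_area` are assumptions made for the example, not data or interfaces of IMDB, 777film.com, or any newspaper site.

```python
# Hypothetical rows standing in for the three web sources.
MOVIE_SITE = [  # movie site: who directed what
    {"title": "Movie A", "director": "Woody Allen"},
    {"title": "Movie B", "director": "Someone Else"},
]
SHOWTIMES_SITE = [  # movie playing source: what plays where
    {"title": "Movie A", "area": "Seattle"},
    {"title": "Movie B", "area": "Seattle"},
]
REVIEW_SITE = [  # review site: one review per title
    {"title": "Movie A", "review": "A sharp comedy."},
    {"title": "Movie B", "review": "Forgettable."},
]


def reviews_for_director_in_area(director, area):
    """Join the three sources on the movie title."""
    directed = {m["title"] for m in MOVIE_SITE if m["director"] == director}
    playing = {s["title"] for s in SHOWTIMES_SITE if s["area"] == area}
    wanted = directed & playing
    return [(r["title"], r["review"]) for r in REVIEW_SITE if r["title"] in wanted]
```

A data integration system hides this join behind one query over the mediated schema, so the user never names the three sites.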
[Figure: system architecture. Labels: design time, run time; mediated schema; semantic mappings; query reformulation; optimization & execution; one wrapper per data source]
Semantic Mappings
Mediated schema:
  CD: ASIN, Title, Genre, …
  Artist: ASIN, name, …
Mapping logic
Information sources:
  CDs: Album, ASIN, Price, DiscountPrice, Studio
  CDCategories: ASIN, Category
  Artists: ASIN, ArtistName, GroupName
  Books: Title, ISBN, Price, DiscountPrice, Edition
  BookCategories: ISBN, Category
  Authors: ISBN, FirstName, LastName
Global-as-View (GAV)
(Previous approaches)
Mapping: each relation of the mediated schema is defined as a view over the sources.
Mediated schema:
  CD: ASIN, Title, Genre, …
  Artist: ASIN, name, …
Sources: R1, R2, R3, R4, R5
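
As a rough illustration of GAV, the sketch below defines the mediated relation CD as a view over two sources and answers a mediated-schema query by unfolding that view. The view definition, the sample rows, and the function names are assumptions made for the example.

```python
# Hypothetical source contents; column names follow the CDs and CDCategories sources.
CDS = [
    {"Album": "Album X", "ASIN": "A1", "Price": 15.0},
    {"Album": "Album Y", "ASIN": "A2", "Price": 12.0},
]
CD_CATEGORIES = [
    {"ASIN": "A1", "Category": "Jazz"},
    {"ASIN": "A2", "Category": "Rock"},
]


def cd_view():
    """Assumed GAV definition: CD(ASIN, Title, Genre) = CDs JOIN CDCategories on ASIN.

    For brevity this keeps one category per ASIN.
    """
    genre_by_asin = {c["ASIN"]: c["Category"] for c in CD_CATEGORIES}
    for row in CDS:
        if row["ASIN"] in genre_by_asin:
            yield {
                "ASIN": row["ASIN"],
                "Title": row["Album"],
                "Genre": genre_by_asin[row["ASIN"]],
            }


def titles_in_genre(genre):
    """A query over the mediated schema, answered by unfolding the view."""
    return [cd["Title"] for cd in cd_view() if cd["Genre"] == genre]
```

Reformulation is simple here because the view is written in terms of the sources, but bringing in another CD source means editing `cd_view` itself.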
Local-as-View (LAV)
Mapping: each source is described as a view over the mediated schema.
Mediated schema:
  CD: ASIN, Title, Genre, Year
  Artist: ASIN, Name, …
Sources: R1, R2, R3, R4, R5, each described by its own view over the mediated schema
Benefits of LAV

Describing information sources became easier:
a data integration system could accommodate new sources easily.

The descriptions of the information sources could be more precise:
describing precise constraints on the contents of the sources became easier.
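
Both benefits can be illustrated with a small sketch. Each source gets its own description stating which CD tuples of the mediated schema it may contain; the class `SourceDescription`, its two constraint fields, and the sample sources are hypothetical and far simpler than real LAV view definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceDescription:
    """LAV-style description: the source holds CD tuples satisfying the constraints."""
    name: str
    genre: Optional[str] = None     # None means "any genre"
    min_year: Optional[int] = None  # None means "no lower bound on Year"


def relevant_sources(descriptions, genre, year):
    """Names of the sources whose description does not rule out
    CD tuples with the given Genre and Year."""
    hits = []
    for d in descriptions:
        if d.genre is not None and d.genre != genre:
            continue
        if d.min_year is not None and year < d.min_year:
            continue
        hits.append(d.name)
    return hits


descriptions = [
    SourceDescription("R1", genre="Jazz"),
    SourceDescription("R2", min_year=2000),
]
# Adding a source is one new description; R1 and R2 are left untouched.
descriptions.append(SourceDescription("R3", genre="Rock", min_year=1990))
```

The constraints let the system skip sources that cannot contribute to a query, which is what "more precise" descriptions buy.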
Query reformulation
A query posed over the mediated schema, e.g. CD(A,T,G) over CD: ASIN, Title, Genre, …,
is reformulated into a set of queries on the data sources:
  CDs: Album, ASIN, Price, DiscountPrice, Studio
  CDCategories: ASIN, Category
  Artists: ASIN, ArtistName, GroupName
  Books: Title, ISBN, Price, DiscountPrice, Edition
  BookCategories: ISBN, Category
  Authors: ISBN, FirstName, LastName
Query Answering in LAV =
Answering queries using views (AQUV)

A problem that was earlier considered in the context of query optimization:
given a set of views V1,…,Vn and a query Q,
can we answer Q using only the answers to V1,…,Vn?
AQUV

Earlier uses: query optimization and supporting physical data independence

AQUV for data integration:
• Not necessarily an equivalent rewriting
• Find a maximally contained rewriting
Main AQUV algorithms:
• Bucket
• Inverse rules
• MiniCon
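
The difference between a contained and an equivalent rewriting can be checked mechanically. The sketch below is not the Bucket, inverse-rules, or MiniCon algorithm; it is only the classical containment-mapping test for conjunctive queries without comparison predicates, which such algorithms rely on. The query encoding (a head tuple plus a list of atoms, with capitalized strings as variables) and the example view are assumptions made for the illustration.

```python
def is_var(term):
    """Convention for this sketch: capitalized strings are variables."""
    return isinstance(term, str) and term[:1].isupper()


def extend(mapping, src, dst):
    """Extend mapping so that the terms in src map to dst position by position.
    Returns the extended mapping, or None if that is impossible."""
    if len(src) != len(dst):
        return None
    new = dict(mapping)
    for s, d in zip(src, dst):
        if is_var(s):
            if new.setdefault(s, d) != d:
                return None
        elif s != d:  # a constant must map to itself
            return None
    return new


def contained_in(q1, q2):
    """True if q1 is contained in q2, i.e. a containment mapping from q2 to q1 exists."""
    (head1, body1), (head2, body2) = q1, q2
    start = extend({}, head2, head1)
    if start is None:
        return False

    def search(i, mapping):
        if i == len(body2):
            return True
        pred, args = body2[i]
        for pred1, args1 in body1:
            if pred1 == pred:
                nxt = extend(mapping, args, args1)
                if nxt is not None and search(i + 1, nxt):
                    return True
        return False

    return search(0, start)


# Query over the mediated schema: q(T) :- CD(A, T, G)
query = (("T",), [("CD", ("A", "T", "G"))])
# Hypothetical source view v1(A, T) :- CD(A, T, jazz); the rewriting q'(T) :- v1(A, T)
# expands to q'(T) :- CD(A, T, jazz).
expansion = (("T",), [("CD", ("A", "T", "jazz"))])
```

Here `contained_in(expansion, query)` holds while `contained_in(query, expansion)` does not: the source only returns jazz CDs, so the rewriting is contained in the query but not equivalent to it.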

Building on the Foundation






Generating Schema mappings
Adaptive query processing
XML
Model management
Peer-to-Peer Data Management
The Role of Artificial Intelligence
Generating Schema Mappings
An observation:

Who is going to write all these LAV/GAV formulas
(the semantic mappings between the sources
and the mediated schema)?
1. Creating the source descriptions
2. Writing the semantic mappings

This was the main bottleneck.
Techniques for Schema Mapping



Semi-automatically generating schema mappings
Goal: create tools that speed up the creation of
the mappings and reduce the amount of human
effort involved.
Compare schema elements based on:
• Linguistic similarities
• Overlaps in data values or data types

Schema mapping tasks are often repetitive.
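
A minimal sketch of such a comparison follows, combining a linguistic signal (string similarity of element names, via Python's `difflib`) with the overlap in data values (Jaccard). The equal weighting, the function names, and the sample columns are assumptions for illustration; the output is a suggestion for a human to confirm, not a finished mapping.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Linguistic signal: similarity of the two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Data signal: Jaccard overlap of the two sets of sample values."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_score(name_a, values_a, name_b, values_b, name_weight=0.5):
    """Weighted combination of both signals (the weight is an arbitrary choice)."""
    return (name_weight * name_similarity(name_a, name_b)
            + (1 - name_weight) * value_overlap(values_a, values_b))


def suggest_matches(source_cols, mediated_cols):
    """For each source column, propose the best-scoring mediated element."""
    return {
        name: max(mediated_cols,
                  key=lambda m: match_score(name, values, m, mediated_cols[m]))
        for name, values in source_cols.items()
    }


# Hypothetical sample values for two source columns and two mediated elements.
source = {
    "ArtistName": ["Artist One", "Artist Two"],
    "Category": ["Jazz", "Rock"],
}
mediated = {
    "name": ["Artist One", "Artist Two", "Artist Three"],
    "Genre": ["Jazz", "Rock", "Blues"],
}
```

Note that `Category` and `Genre` share few letters; it is the value overlap that pairs them, which is why tools combine several kinds of evidence.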
A Machine Learning Approach
Mediated schema


Map multiple schemas in the same domain to the
same mediated schema.
Learn from previous experience:

use the manually created schema mappings as training data
generalize from them to predict mappings between
unseen schemas.
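
The idea can be sketched with a deliberately simple learner: it counts which mediated element each name token was mapped to in past, manually created mappings, then lets the tokens of an unseen attribute vote. This token-voting scheme, the training pairs, and the function names are assumptions made for the example; real systems combine richer learners.

```python
import re
from collections import Counter, defaultdict


def tokens(attribute_name):
    """Split names such as 'DiscountPrice' or 'artist_name' into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", attribute_name)
    return [p.lower() for p in parts]


def train(examples):
    """examples: (source attribute, mediated element) pairs taken from
    manually created mappings."""
    votes = defaultdict(Counter)
    for attribute, mediated in examples:
        for tok in tokens(attribute):
            votes[tok][mediated] += 1
    return votes


def predict(votes, attribute):
    """Most-voted mediated element for an unseen attribute, or None if no token is known."""
    tally = Counter()
    for tok in tokens(attribute):
        tally.update(votes.get(tok, {}))
    return tally.most_common(1)[0][0] if tally else None


# Hypothetical training data from previously mapped schemas.
past_mappings = [
    ("ArtistName", "Artist.name"),
    ("GroupName", "Artist.name"),
    ("Album", "CD.Title"),
    ("AlbumTitle", "CD.Title"),
    ("Category", "CD.Genre"),
]
model = train(past_mappings)
```

With this model, `predict(model, "artist_name")` proposes `Artist.name` for a schema the learner has never seen, and returns `None` when it has no evidence.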
Adaptive query processing

An observation:

Once we have mappings, how can we execute
queries?
Traditional plan-then-execute doesn’t work.
Root cause: the dynamic nature of data integration
contexts
Adaptive query processing


In a data integration system,
the context is very dynamic and the optimizer
has much less information than in the traditional
setting.
Two results:

the optimizer can’t decide on a good plan
a plan may be arbitrarily bad.
Solution: dynamically adjust the query plan
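
A toy version of adjusting a plan during execution: the function below applies a set of filter predicates and, after every batch of rows, re-orders them so that the predicate that has rejected the most rows so far runs first. It is a simplified stand-in for adaptive query processing, not an algorithm from the talk; the batch size, the statistics kept, and the names are assumptions.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Return (rows passing every predicate, final predicate order).

    The result is the same for any order; only the amount of work changes.
    """
    stats = [{"seen": 0, "passed": 0} for _ in predicates]
    order = list(range(len(predicates)))
    out = []

    def pass_rate(i):
        seen = stats[i]["seen"]
        return stats[i]["passed"] / seen if seen else 1.0

    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            for i in order:
                stats[i]["seen"] += 1
                if predicates[i](row):
                    stats[i]["passed"] += 1
                else:
                    break  # row rejected, skip the remaining predicates
            else:
                out.append(row)
        # Re-plan between batches: lowest observed pass rate goes first.
        order.sort(key=pass_rate)
    return out, order


rows = list(range(1000))
predicates = [lambda x: x % 2 == 0, lambda x: x % 100 == 0]
result, final_order = adaptive_filter(rows, predicates)
```

The plan starts with the order given and switches once the first batch shows that the second predicate is far more selective. The observed pass rates are conditional on the predicates that ran earlier, so they are rough estimates only.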
XML characteristics for data integration

XML offered a common syntactic format for
sharing data among data sources.
This fueled the desire for data integration, since it
appeared as if data could actually be shared.
Integration systems were built using XML as the
underlying data model and XML query
languages (XQuery).
Model Management

Goal: provide an algebra for manipulating
schemas and mappings

With such an algebra:

complex operations on data sources become
simple sequences of operators in the algebra
Some of the operators in Model Management:

create & compose mappings, merge & diff models
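
To give a feel for these operators, here is a heavily simplified sketch in which a model is a flat set of element names and a mapping is a dictionary between element names. Real model management operators work on structured schemas and richer mapping languages; the representations, function bodies, and sample schemas below are assumptions for illustration only.

```python
def compose(m1, m2):
    """If m1 maps a -> b and m2 maps b -> c, the composition maps a -> c."""
    return {a: m2[b] for a, b in m1.items() if b in m2}


def merge(model_a, model_b, mapping):
    """Union of two models, where elements of model_b that the mapping
    (a -> b) matches are represented by their model_a counterpart."""
    matched_b = set(mapping.values())
    return set(model_a) | (set(model_b) - matched_b)


def diff(model, mapping):
    """Elements of the model that do not participate in the mapping."""
    return set(model) - set(mapping)


# Hypothetical models and mappings.
source = {"Album", "ASIN", "Studio"}
mediated = {"Title", "Id", "Genre"}
source_to_mediated = {"Album": "Title", "ASIN": "Id"}
mediated_to_export = {"Title": "name", "Id": "key"}

source_to_export = compose(source_to_mediated, mediated_to_export)
unmapped = diff(source, source_to_mediated)
combined = merge(source, mediated, source_to_mediated)
```

A task such as "map the source to the export format and report what is left over" then becomes a short sequence of operator calls rather than hand-written code.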
Peer Data Management Systems
[Figure: peers Stanford, Berkeley, UW (Wisconsin), UW (Washington), UW (Waterloo), DBLP, and CiteSeer connected by LAV/GLAV mappings; a query Q is reformulated into queries Q1 to Q6 on other peers]
Two Additional Benefits

A P2P architecture offers a truly distributed
mechanism for sharing data.

Every data source only needs to provide semantic mappings to a set
of neighbors.
Complex integrations emerge as the system follows semantic paths.
A P2P architecture is more appropriate than a single
mediated schema in data sharing contexts.

There is never a single global mediated schema.
Data sharing occurs in local neighborhoods of the network.
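
Following semantic paths can be sketched as a graph traversal. Each pair of neighboring peers has its own small attribute mapping, and an attribute asked about at one peer is translated hop by hop. The peer names come from the figure, but the mappings, attribute names, and the function `reformulate` are hypothetical, and real peer mappings (LAV, GLAV) are far more expressive than attribute renaming.

```python
from collections import deque

# Hypothetical pairwise mappings: (peer, neighbor) -> {attribute at peer: attribute at neighbor}
MAPPINGS = {
    ("UW (Washington)", "Stanford"): {"title": "paper_title"},
    ("Stanford", "DBLP"): {"paper_title": "name"},
    ("UW (Washington)", "Berkeley"): {"author": "writer"},
}


def reformulate(attribute, start_peer):
    """Follow chains of pairwise mappings breadth-first.

    Returns {peer: name of the attribute at that peer} for every peer
    reachable through mappings that cover the attribute.
    """
    reached = {start_peer: attribute}
    queue = deque([start_peer])
    while queue:
        peer = queue.popleft()
        for (src, dst), attr_map in MAPPINGS.items():
            if src == peer and dst not in reached and reached[peer] in attr_map:
                reached[dst] = attr_map[reached[peer]]
                queue.append(dst)
    return reached
```

No peer maps directly from UW (Washington) to DBLP, yet the query reaches DBLP through Stanford; no global mediated schema is involved.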
The Role of Artificial Intelligence

Description Logics describe relationships between
data sources:

data sources need to be represented declaratively
the mediated schema of IM was based on the Classic
Description Logic
Description Logics offered more flexible
mechanisms for representing a mediated schema
Recent work: combine the expressive power of
Description Logics with the ability to manage large
amounts of data.
The Data Integration Industry




Late 1990s: commercialization
Enterprise Information Integration (EII):
integrating data from multiple sources without having to
first load all the data into a central warehouse
Drivers of the development of the EII industry:
• Technologies from research labs had matured enough
• The needs of data management
• XML
Inappropriate alternatives:
data warehousing solutions, ad-hoc solutions
A data integration scenario
[Figure: steps of an EII application. Choose the data sources that will participate in the application; build a mediated (virtual) schema and semantic mappings; at query time, reformulate a query posed over the virtual schema into a query over the data sources, and execute it with an engine that creates plans that span multiple data sources.]
Other EII Products

Products based on the XML data model and XQuery
Challenge: the research on integration for XML was
only in its infancy

Products for customer-relationship management
Challenge: how to provide the customer-facing
worker a global view of a customer whose data
resides in multiple sources, and track information
from multiple sources in real time.
Future Challenges

Factors that make data integration challenging:

• Social: data integration is fundamentally about
getting people to collaborate and share data.
• Complexity of integration

Data integration has been referred to as a
problem as hard as AI, maybe even harder!

Our goal: create tools that facilitate data
integration in a variety of scenarios.
Several Specific Challenges

Dataspaces: Pay-as-you-go data management

Uncertainty and lineage

Reusing human attention
Dataspaces


database system: create the schema first!
data integration system: create the semantic
mappings first!
fundamental shortcoming: long setup time!

Dataspaces: the idea of pay-as-you-go data
management
Pay-as-you-go


offer some services immediately without any
setup time, and improve the services as more
investment is made into creating semantic
relationships.
A dataspace should offer keyword search
over any data in any source with no setup
time.
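
Keyword search with no setup time can be sketched as an inverted index built directly over whatever records the sources expose, without any schema or semantic mapping. The source names, the records, and the functions `build_index` and `search` are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(sources):
    """sources: {source name: list of records (dicts)}.

    Indexes every word of every value; no schema or mapping is required.
    """
    index = defaultdict(set)
    for source, records in sources.items():
        for position, record in enumerate(records):
            for value in record.values():
                for word in re.findall(r"\w+", str(value).lower()):
                    index[word].add((source, position))
    return index


def search(index, query):
    """(source, record position) pairs whose record contains every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Hypothetical sources with unrelated schemas.
sources = {
    "crm": [{"name": "Ada Park", "city": "Oslo"}],
    "orders": [
        {"customer": "Ada Park", "item": "laptop"},
        {"customer": "Bo Lind", "item": "phone"},
    ],
}
index = build_index(sources)
```

The search finds "Ada Park" in both sources even though nobody has stated that `name` and `customer` mean the same thing; that mapping is the kind of investment that can be added later to improve the service.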
Pay-as-you-go Data Management
Dataspaces: Franklin, Halevy, Maier [see PODS 2006]
[Figure: benefit versus investment (time, cost), with one curve for dataspaces and one for data integration solutions]
Uncertain data & data lineage

A necessity in data integration systems

The system should introspect about the certainty of the data

When it cannot automatically determine the certainty, it should refer
the user to the lineage of the data

Web search engines provide URLs along with their
search results, so users can consider the URLs in
the decision of which results to explore further.
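
In the same spirit as URLs next to search results, an integration system can carry the lineage of each answer along with it. The sketch below records which sources asserted each fact and returns conflicting values together with their lineage, leaving the judgment to the user. The source names, the facts, and the ranking by number of supporting sources are assumptions made for the example.

```python
from collections import defaultdict


def integrate_with_lineage(sources):
    """sources: {source name: iterable of (key, value) facts}.

    Returns {(key, value): sorted list of the sources that asserted the fact}.
    """
    lineage = defaultdict(set)
    for source, facts in sources.items():
        for fact in facts:
            lineage[fact].add(source)
    return {fact: sorted(names) for fact, names in lineage.items()}


def answers(lineage, key):
    """All values reported for key, most corroborated first, each with its lineage."""
    rows = [(value, srcs) for (k, value), srcs in lineage.items() if k == key]
    return sorted(rows, key=lambda r: (-len(r[1]), r[0]))


# Hypothetical sources that disagree about one person's city.
facts_by_source = {
    "hr_db": [("alice", "Seattle")],
    "crm": [("alice", "Seattle"), ("bob", "Boston")],
    "legacy": [("alice", "Portland")],
}
lineage = integrate_with_lineage(facts_by_source)
```

The system does not decide which city is right; it shows that two sources support one value and one legacy source supports the other.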
Reusing human attention




Goal: achieving tighter semantic integration among
data sources
Any user operation on the data sources
gives a semantic clue about the data or
about relationships between data sources
Systems that leverage these semantic clues
can obtain semantic integration much faster
An area for additional research and development
Conclusion
Not so long ago, data integration was a nice feature and an area
for intellectual curiosity; today it is a necessity.

Today’s economy further emphasizes the need
for data integration solutions.

Thomas Friedman: The World is Flat.
A Framework for Deep Web Integration
[Figure: three modules over Deep Web databases (Web DBs), reached through an integrated interface.
Interface Integration Module: WDB discovery, interface schema extraction, WDB clustering, interface integration.
Query Process Module: WDB selection, query translation, query submission.
Result Process Module: results extraction, results annotation, data merging (into an RDB).
Legend: developed issue, developing issue, undeveloped issue, our focuses.]
Q&A