Data/Information Integration/aggregation
Download
Report
Transcript Data/Information Integration/aggregation
Information Integration
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
1
Services
Source Trust
Webpages
Structured
data
Sensors
(streaming
Data)
Source Fusion/
Query Planning
Mediator
Executor
Answers
Monitor
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
2
Services
•
•
•
•
Source Trust
User queries refer to the
mediated schema.
Data is stored in the
sources in a local schema.
Content descriptions
provide the semantic
mappings between the
different schemas.
Mediator uses the
descriptions to translate
user queries into queries
on the sources.
Ontologies;
Source/Service
Descriptions
Webpages
Structured
data
Sensors
(streaming
Data)
Source Fusion/
Query Planning
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
DWIM
Executor
Answers
Kambhampati & Knoblock
Probing
Queries
Needs to handle
Source/network
Interruptions,
Runtime uncertainity,
replanning
Information Integration on the Web (MA-1)
Monitor
3
Who is dying to have it?
(Applications)
• WWW:
– Comparison shopping
– Portals integrating data from multiple sources
– B2B, electronic marketplaces
• Science and culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky.
– Culture: uniform access to all cultural databases
produced by countries in Europe provinces in Canada
• Enterprise data integration
– An average company has 49 different databases and
spends 35% of its IT dollars on integration efforts
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
4
Why isn’t this just
• Search engines do text-based retrieval of URLS
– Works reasonably well for single document texts, or for finding sites
based on single document text
• Cannot integrate information from multiple documents
• Cannot do effective “query relaxation” or generalization
• Cannot link documents and databases
• The aim of Information integration is to support
query processing over structured and semistructured sources as well as services.
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
9
Is it like
Expedia/Travelocity/Orbitz…
• Surpringly, NO!
• The online travel sites don’t quite need to do data integration; they just
use SABRE
– SABRE was started in the 60’s as a joint project between American
Airlines and IBM
– It is the de facto database for most of the airline industry (who voluntarily
enter their data into it)
• There are very few airlines that buck the SABRE trend—SouthWest airlines is
one (which is why many online sites don’t bother with South West)
• So, online travel sites really are talking to a single database (not
multiple data sources)…
– To be sure, online travel sources do have to solve a hard problem. Finding
an optimal fare (even at a given time) is basically computationally
intractable (not to mention the issues brought in by fluctuating fares). So
don’t be so hard on yourself
• Check out http://www.maa.org/devlin/devlin_09_02.html
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
10
Why isn’t this just
Databases
Distributed Databases
• No common schema
– Sources with heterogeneous schemas (and ontologies)
– Semi-structured sources
• Legacy Sources
– Not relational-complete
– Variety of access/process limitations Query
(SQL)
Database Manager
(DBMS)
Database
• Autonomous sources
-Storage mgmt
(relational)
-Query processing
Answer
– No central administration
-View management
(relation)
-(Transaction processing)
– Uncontrolled source content overlap
• Unpredictable run-time behavior
– Makes query execution hard
• Predominantly “Read-only”
– Could be a blessing—less worry about transaction management
– (although the push now is to also support transactions on web)
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
11
Are we talking
“comparison shopping” agents?
• Certainly closer to the aims of
these
• But:
• Wider focus
• Consider larger range of
databases
• Consider services
• Implies more challenges
• “warehousing” may not
work
• Manual source
characterization/
integration won’t scale-up
Kambhampati & Knoblock
Junglee
Netbot
Information Integration on the Web (MA-1)
DealPilot.Com
12
11/7
Search
Engine
Reported
Size
Page Depth
Google
8.1 billion
101K
MSN
5.0 billion
150K
Yahoo
4.2 billion
(estimate)
500K
Ask Jeeves
2.5 billion
101K+
--Project part C due 11/30
--Homework 4 due 11/14
--Google guy talk on 11/18 (TBA)
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
14
Services
•
•
•
•
Source Trust
User queries refer to the
mediated schema.
Data is stored in the
sources in a local schema.
Content descriptions
provide the semantic
mappings between the
different schemas.
Mediator uses the
descriptions to translate
user queries into queries
on the sources.
Ontologies;
Source/Service
Descriptions
Webpages
Structured
data
Sensors
(streaming
Data)
Source Fusion/
Query Planning
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
DWIM
Executor
Answers
Kambhampati & Knoblock
Probing
Queries
Needs to handle
Source/network
Interruptions,
Runtime uncertainity,
replanning
Information Integration on the Web (MA-1)
Monitor
15
Information Integration
Linkage
• Discovering information
sources (e.g. deep web
modeling, schema
learning, …)
• Gathering data (e.g.,
wrapper learning &
information extraction,
federated search, …)
Queries
• Querying integrated
• Cleaning data
(e.g., de-duping
and linking
records) to form
a single [virtual]
database
information sources (e.g.
queries to views, execution of
web-based queries, …)
• Data mining & analyzing
integrated information (e.g.,
collaborative
filtering/classification learning
using extracted data, …)
Different “Integration” scenarios
•
“Data Aggregation” (Vertical)
•
– All sources export (parts of a)
single relation
• No need for joins etc
• Could be warehouse or virtual
–
E.g. BibFinder, Junglee,
Employeds etc
– Challenges: Schema mapping; data
overlap
•
“Data Linking” (Horizontal)
– Joins over multiple relations stored
in multiple DB
• E.g. Softjoins in WHIRL
• Ted Kennedy episode
– Challenges: record linkage over text
fields (object mapping); query
reformulation
Kambhampati & Knoblock
•
•
“Collection Selection”
– All sources export text
documents
– E.g. meta-crawler etc.
Challenges: Similarity definition;
relevance handling
All together (vertical & horizontal)
– Many interesting research
issues
– ..but few actual fielded systems
Information Integration on the Web (MA-1)
17
Dimensions to Consider
•
•
•
•
How many sources are we accessing?
How autonomous are they?
Can we get meta-data about sources?
Is the data structured?
– Discussion about soft-joins. See slide next
• Supporting just queries or also updates?
• Requirements: accuracy, completeness,
performance, handling inconsistencies.
• Closed world assumption vs. open world?
– See slide next
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
18
Soft Joins..WHIRL [Cohen]
We can extend the notion of Joins to
“Similarity Joins” where similarity is
measured in terms of vector similarity
over the text attributes. So, the join tuples
are output n a ranked form—with the rank
proportional to the similarity
Neat idea… but does have some
implementation difficulties
Most tuples in the cross-product will
have non-zero similarities. So, need
query processing that will somehow
just produce highly ranked tuples
Also other similarity/distance metrics
may be used
E.g. Edit distance
19
20
Local Completeness Information
• If sources are incomplete, we need to look at each one of them.
• Often, sources are locally complete.
• Movie(title, director, year) complete for years after 1960, or for
American directors.
• Question: given a set of local completeness statements, is a query Q’ a
complete answer to Q?
Problems:
1. Sources may not be
interested in giving these!Advertised description
Need to learn
hard to learn!
2. Even if sources are willing to
give, there may not be any
“big enough” LCWs
Saying “I definitely have the car
with vehicle ID XXX is useless
Kambhampati & Knoblock
True source contents
Guarantees
(LCW; Inter-source comparisons)
Information Integration on the Web (MA-1)
21
Source Descriptions
Kambhampati & Knoblock
Services
Source Trust
Ontologies;
Source/Service
Descriptions
Probing
Queries
Webpages
Structured
data
Sensors
(streaming
Data)
od
e
Pr
efe
re
nc
e/U
til
ity
M
e
Qu
Information Integration on the Web (MA-1)
Ca
rce
l
ry
So
u
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
lls
Source Fusion/
Query Planning
Executor
Answers
Needs to handle
Source/network
Interruptions,
Runtime uncertainty,
replanning
tics
atis
g St
atin
Upd
ing
nn
pla ts
Re ques
Re
• Contains all meta-information about the
sources:
– Logical source contents (books, new
cars).
– Source capabilities (can answer SQL
queries)
– Source completeness (has all books).
– Physical properties of source and
network.
– Statistics about the data (like in an
RDBMS)
– Source reliability
– Mirror sources
– Update frequency.
Monitor
23
Source Access
• How do we get the “tuples”?
– Many sources give “unstructured” output
• Some inherently unstructured; while others
“englishify” their database-style output
– Need to (un)Wrap the output from the sources
to get tuples
– “Wrapper building”/Information Extraction
– Can be done manually/semi-manually
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
24
Source Fusion/Query Planning
Services
Source Trust
Ontologies;
Source/Service
Descriptions
Webpages
Structured
data
Sensors
(streaming
Data)
fer
en
ce
/U
til
ity
M
od
all
ce
C
el
y
er
Qu
So
ur
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
s
Source Fusion/
Query Planning
Pr
e
Executor
Answers
Kambhampati & Knoblock
Probing
Queries
Information Integration on the Web (MA-1)
Needs to handle
Source/network
Interruptions,
Runtime uncertainty,
replanning
tics
atis
g St
atin
Upd
ing
nn
pla ts
Re ques
Re
• Accepts user query and generates a plan
for accessing sources to answer the query
– Needs to handle tradeoffs between cost
and coverage
– Needs to handle source access
limitations
– Needs to reason about the source
quality/reputation
Monitor
25
Monitoring/Execution
Services
Source Trust
Ontologies;
Source/Service
Descriptions
Webpages
Structured
data
Sensors
(streaming
Data)
od
e
Ca
rce
l
ry
So
u
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
lls
Source Fusion/
Query Planning
fer
en
ce
/U
til
ity
M
e
Qu
Pr
e
Executor
Answers
Kambhampati & Knoblock
Probing
Queries
Information Integration on the Web (MA-1)
Needs to handle
Source/network
Interruptions,
Runtime uncertainty,
replanning
tics
atis
g St
atin
Upd
ing
nn
pla ts
Re ques
Re
• Takes the query plan and executes it on the
sources
– Needs to handle source latency
– Needs to handle transient/short-term
network outages
– Needs to handle source access
limitations
– May need to re-schedule or re-plan
Monitor
26
Models for Integration
Overview
Ser vi ces
Source Trust
Ont ol ogies;
Source /Servic e
Descripti ons
Probing
Queries
Webpages
Struc tured
data
Sensors
(streaming
Data)
M
ity
Pr
efe
r
Nee ds t o handle
Source/ net work
Int errupt ions,
Runt ime unce rt ainit y,
repl anning
Ca
rc e
Sou
tics
Executor
Answers
ing
lann s
R ep ue st
R eq
en
ce
/U
til
y
er
Qu
od
el
Ne eds t o handl e:
Multi pl e obje cti ves,
Se rvice composit ion,
Source qua lit y & ove rl ap
ll s
Source Fusion/
Query Planning
is
Stat
ing
Motivation for Information Integration [Rao]
Accessing Information Sources [Craig]
Models for Integration [Rao]
Query Planning & Optimization [Rao]
Plan Execution [Craig]
Standards for Integration/Mediation [Rao]
Ontology & Data Integration [Craig]
Future Directions [Craig]
at
Upd
•
•
•
•
•
•
•
•
Monitor
Modified from Alon Halevy’s slides
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
29
Solutions for small-scale
integration
•
Ontologies;
Source/Service
Descriptions
Kambhampati & Knoblock
Sensors
(streaming
Data)
od
el
M
it y
til
en
ce
/U
fer
Pr
e
Executor
Answers
Needs to handle
Source/network
Interruptions,
Runtime uncertainity,
replanning
ur
ce
So
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
Ca
lls
Source Fusion/
Query Planning
e
Qu
User queries
Webpages
Structured
data
ry
Mostly ad-hoc programming:
create a special solution for every
case; pay consultants a lot of
money.
Data warehousing: load all the data
periodically into a warehouse.
– 6-18 months lead time
– Separates operational DBMS
from decision support DBMS.
(not only a solution to data
integration).
– Performance is good; data may
not be fresh.
– Need to clean, scrub you data.
Probing
Queries
s
tistic
g Sta
atin
Upd
ing
nn
pla ts
Re ques
Re
•
Services
Source Trust
Monitor
OLAP / Decision support/
Data cubes/ data mining
Relational database (warehouse)
Data extraction
programs
Data
source
Information Integration on the Web (MA-1)
Data cleaning/
scrubbing
Data
source
Data
source
30
Services
• Leave the data in the sources.
• When a query comes in:
– Determine the relevant
sources to the query
– Break down the query into
sub-queries for the sources.
– Get the answers from the
sources, and combine them
appropriately.
• Data is fresh. Approach scalable
• Issues:
– Relating Sources & Mediator
– Reformulating the query
– Efficient planning & execution
Kambhampati & Knoblock
Ontologies;
Source/Service
Descriptions
Probing
Queries
Webpages
Structured
data
Sensors
(streaming
Data)
Ca
lls
Source Fusion/
Query Planning
ur
So
el
y
ce
Needs to handle:
Multiple objectives,
Service composition,
Source quality & overlap
fer
en
ce
/U
til
ity
M
od
er
Qu
Pr
e
Executor
Answers
Needs to handle
Source/network
Interruptions,
Runtime uncertainity,
replanning
s
tistic
g Sta
atin
Upd
ing
nn
pla ts
Re ques
Re
The Virtual Integration
Architecture
Source Trust
Monitor
User queries
Mediated schema
Mediator:
Which data
model?
Reformulation engine
optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
Garlic [IBM], Hermes[UMD];Tsimmis,
InfoMaster[Stanford]; DISCO[INRIA];
Information Manifold [AT&T];
SIMS/Ariadne[USC];Emerac/Havasu[ASU]
Information Integration on the Web (MA-1)
32
Desiderata for Relating
Source-Mediator Schemas
• Expressive power: distinguish
between sources with closely
related data. Hence, be able to
prune access to irrelevant
sources.
• Easy addition: make it easy to
add new data sources.
• Reformulation: be able to
reformulate a user query into a
query on the sources efficiently
and effectively.
• Nonlossy: be able to handle all
queries that can be answered by
directly accessing the sources
Kambhampati & Knoblock
User queries
Mediated schema
Mediator:
Reformulation engine
optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
Reformulation
• Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
• Find:
– A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers to Q given the
sources.
Information Integration on the Web (MA-1)
35
11/9
--Google Mashups—what type of
integration are they?
--Mturk.comcollaborative computing the
capitalist way
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
36
Information Integration
Data Integration
Text/Data Integration
Service integration
Collection Selection
Data aggregation
(vertical integration)
Data Linking
(horizontal integration)
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
38
Extremes of automation in Information
Integration
• Fully automated II (blue
sky for now)
–
–
–
–
–
–
–
• Fully hand-coded II
– Decide on the only query
Get a query from the user
you want to support
(most interesting
on the mediator schema
– Write a (java)script that
action is
Go “discover” relevant data
supports the query by
“in between”)
sources
accessing specific (predetermined) sources, piping
Figure out their “schemas”
Map the schemas on to the E.g. We may start with results (through known
APIs) to specific other
known sources and
mediator schema
sources
Reformulate the user query their known schemas,
• Examples include Google
do hand-mapping
into data source queries
Map Mashups
and
support
automated
Optimize and execute the
reformulation and
queries
optimization
Return the answers
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
39
Differences minor for data aggregation…
Approaches for relating source &
Mediator Schemas
“View” Refresher
• Global-as-view (GAV):
express the mediated
schema relations as a set
of views over the data
source relations
• Local-as-view (LAV):
express the source
relations as views over
the mediated schema.
• Can be combined…?
Kambhampati & Knoblock
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = “Seattle” AND
Person.name = Purchase.buyer
We can later use the views:
Virtual vs
Materialized
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name AND
Product.category = “shoes”
Information Integration on the Web (MA-1)
40
Global-as-View
Mediated schema:
Express mediator schema
relations as views over
Movie(title, dir, year, genre),
source relations
Schedule(cinema, title, time).
Create View Movie AS
select * from S1 [S1(title,dir,year,genre)]
union
select * from S2 [S2(title, dir,year,genre)]
union
[S3(title,dir), S4(title,year,genre)]
select S3.title, S3.dir, S4.year, S4.genre
from S3, S4
where S3.title=S4.title
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
42
Global-as-View
Mediated schema:
Express mediator schema
relations as views over
Movie(title, dir, year, genre),
source relations
Schedule(cinema, title, time).
Create View Movie AS
select * from S1 [S1(title,dir,year,genre)]
union
select * from S2 [S2(title, dir,year,genre)]
union
[S3(title,dir), S4(title,year,genre)]
select S3.title, S3.dir, S4.year, S4.genre
from S3, S4
Mediator schema relations are
where S3.title=S4.title
Virtual views on source relations
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
43
Local-as-View: example 1
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Express source schema
relations as views over
mediator relations
Create Source S1 AS
select * from Movie
S1(title,dir,year,genre)
Create Source S3 AS
select title, dir from Movie
S3(title,dir)
Create Source S5 AS
select title, dir, year
S5(title,dir,year), year >1960
from Movie
where year > 1960 AND genre=“Comedy”
Sources are “materialized views” of
mediator schema
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
46
GAV vs. LAV
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS
Create Source S4
select NULL, NULL, NULL, genre
select cinema, genre
from S4
from Movie m, Schedule s
Create View Schedule AS
where m.title=s.title
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are playing
comedies?
Now if we want to find which cinemas are playing
comedies, there is hope!
Lossy mediation
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
49
GAV
vs.
• Not modular
– Addition of new sources
changes the mediated
schema
• Can be awkward to write
mediated schema without loss
of information
• Query reformulation easy
– reduces to view unfolding
(polynomial)
– Can build hierarchies of
mediated schemas
• Best when
– Few, stable, data sources
– well-known to the mediator
(e.g. corporate integration)
• Garlic, TSIMMIS,
HERMES
Kambhampati & Knoblock
LAV
• Modular--adding new sources is
easy
• Very flexible--power of the
entire query language available
to describe sources
• Reformulation is hard
– Involves answering queries
only using views (can be
intractable—see below)
• Best when
– Many, relatively unknown
data sources
– possibility of
addition/deletion of sources
• Information Manifold,
InfoMaster, Emerac,
Havasu
Information Integration on the Web (MA-1)
50