Information Integration
Information & Data Integration
Combining information from multiple
autonomous sources
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
The end-game: 3 options
• Have an in-class final exam
– 5/8 2:30pm is the designated time
• Have a take-home exam
• Make the final homework into a take-home
– ..and have a mandatory discussion class on 5/8 2:30pm
Also, note the change in demo schedule
Today’s Agenda
• Discuss SemTag/Seeker
• Start lecture on Information Integration
Information Integration
• Combining information from multiple autonomous
information sources
– And answering queries using the combined information
• Many variations depending on
– The type of information sources (text? data? a combination?)
• Data vs. Information integration
• Horizontal vs. Vertical integration
– The level of eagerness of the integration
• Ad hoc vs. Pre-mediated integration
– Pre-mediation itself can use warehouse vs. online approaches
– Generality of the solution
• Mashup vs. Mediated
Information Integration as making the database repository of the web..
Linkage
• Discovering information sources (e.g. deep web modeling, schema learning, …)
• Gathering data (e.g., wrapper learning & information extraction, federated search, …)
• Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database
Queries
• Querying integrated information sources (e.g. queries to views, execution of web-based queries, …)
• Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …)
[Architecture figure: webpages, structured data, sensors (streaming data), and services, with source trust annotations, feed a mediator made up of Source Fusion/Query Planning, an Executor, and a Monitor, which returns answers.]
Who is dying to have it?
(Applications)
• WWW:
– Comparison shopping
– Portals integrating data from multiple sources
– B2B, electronic marketplaces
• Science and culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky.
– Culture: uniform access to all cultural databases produced by countries in Europe and provinces in Canada
• Enterprise data integration
– An average company has 49 different databases and
spends 35% of its IT dollars on integration efforts
Is it like
Expedia/Travelocity/Orbitz…
• Surprisingly, NO!
• The online travel sites don’t quite need to do data integration; they just
use SABRE
– SABRE was started in the 60’s as a joint project between American
Airlines and IBM
– It is the de facto database for most of the airline industry (who voluntarily
enter their data into it)
• There are very few airlines that buck the SABRE trend; Southwest Airlines is one (which is why many online sites don't bother with Southwest)
• So, online travel sites really are talking to a single database (not
multiple data sources)…
– To be sure, online travel sources do have to solve a hard problem. Finding
an optimal fare (even at a given time) is basically computationally
intractable (not to mention the issues brought in by fluctuating fares). So
don’t be so hard on yourself
• Check out http://www.maa.org/devlin/devlin_09_02.html
Are we talking
“comparison shopping” agents?
• Certainly closer to the aims of these
• But:
  – Wider focus
    • Consider a larger range of databases
    • Consider services
  – Implies more challenges
    • "Warehousing" may not work
    • Manual source characterization/integration won't scale up
Examples: Junglee, Netbot, DealPilot.com
4/26
Information Integration – 2
Focus on Data Integration
Information Integration
• Text Integration → Collection Selection
• Soft-Joins (bridging text and data)
• Data Integration → Data aggregation (vertical integration); Data Linking (horizontal integration)
Different "Integration" scenarios
• "Data Aggregation" (Vertical)
  – All sources export (parts of a) single relation
    • No need for joins etc.
    • Could be warehouse or virtual
  – E.g. BibFinder, Junglee, Employeds etc.
  – Challenges: schema mapping; data overlap
• "Data Linking" (Horizontal)
  – Joins over multiple relations stored in multiple DBs
    • E.g. soft-joins in WHIRL
    • Ted Kennedy episode
  – Challenges: record linkage over text fields (object mapping); query reformulation
• "Collection Selection"
  – All sources export text documents
  – E.g. meta-crawler etc.
  – Challenges: similarity definition; relevance handling
• All together (vertical & horizontal)
  – Many interesting research issues
  – ..but few actual fielded systems
Collection Selection
Collection Selection/Meta Search
Introduction
• Metasearch Engine
  – A system that provides unified access to multiple existing search engines.
• Metasearch Engine Components
– Database Selector
• Identifying potentially useful
databases for each user query
– Document Selector
• Identifying potentially useful documents returned from the selected databases
– Query Dispatcher and Result Merger
• Ranking the selected documents
Collection Selection
[Diagram: a query flows through Collection Selection, Query Execution against the chosen collections (e.g. WSJ, WP, FT, CNN, NYT), and Results Merging.]
Evaluating collection selection
• Let c1..cj be the collections that are chosen to be accessed
for the query Q. Let d1…dk be the top documents returned
from these collections.
• We compare these results to the results that would have
been returned from a central union database
– Ground Truth: The ranking of documents that the retrieval
technique (say vector space or jaccard similarity) would have
retrieved from a central union database that is the union of all the
collections
• Compare the precision of the documents returned by accessing only the selected collections against this ground truth (see the sketch below)
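The comparison above can be made concrete with a small sketch. This is not from the slides: the collection contents and the plain term-overlap similarity (standing in for whatever retrieval function the central union database would use) are assumptions for illustration only.

def similarity(query, doc):
    """Jaccard term overlap between query and document (stand-in retrieval function)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def top_k(docs, query, k):
    return sorted(docs, key=lambda d: similarity(query, d), reverse=True)[:k]

collections = {                       # toy collections c1..c3
    "c1": ["data integration on the web", "schema mapping survey"],
    "c2": ["soccer world cup results", "olympic records"],
    "c3": ["query reformulation using views", "information integration tutorial"],
}
query, k = "information integration", 3

# Ground truth: top-k from the central union of all collections.
union = [d for docs in collections.values() for d in docs]
truth = set(top_k(union, query, k))

# Results obtained by accessing only the selected collections.
selected = ["c1", "c3"]                                  # output of collection selection
retrieved = top_k([d for c in selected for d in collections[c]], query, k)

precision = len([d for d in retrieved if d in truth]) / len(retrieved)
print(f"precision@{k} relative to the central union baseline: {precision:.2f}")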
General Scheme & Challenges
• Get a representative of each of the databases
– Representative is a sample of files from the database
• Challenge: get an unbiased sample when you can only access the
database through queries.
• Compare the query to the representatives to judge the
relevance of a database
– Coarse approach: Convert the representative files into a single file
(super-document). Take the (vector) similarity between the query
and the super document of a database to judge that database’s
relevance
– Finer approach: Keep the representative as a mini-database. Union
the mini-databases to get a central mini-database. Apply the query
to the central mini-database and find the top k answers from it.
Decide the relevance of each database based on which of the
answers came from which database’s representative
• You can use an estimate of the size of the database too
– What about overlap between
collections? (See ROSCO paper)
Uniform Probing for Content
Summary Construction
• Automatic extraction of document frequency
statistics from uncooperative databases
– [Callan and Connell TOIS 2001],[Callan et al. SIGMOD 1999]
• Main Ideas
– Pick a word and send it as a query to database D
• RandomSampling-OtherResource(RS-Ord): from a dictionary
• RandomSampling-LearnedResource(RS-Lrd): from retrieved documents
– Retrieve the top-k documents returned
– If the total number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning
– k = 4, T = 300
– Compute the sample document frequency for each word that appeared in a retrieved document
  (a sketch of this sampling loop appears below)
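A rough sketch of the sampling loop described above (not the authors' code). The search interface, the toy corpus, and the helper names are assumptions; only the control flow (probe with one word, keep the top k documents, stop after T sampled documents, count sample document frequencies) follows the slide, and for brevity the RS-Ord and RS-Lrd options for choosing probe words are blended into one loop.

import random
from collections import Counter

def query_based_sample(search, seed_dictionary, k=4, T=300):
    """`search(word)` is assumed to return a ranked list of document strings."""
    sample, vocabulary = [], list(seed_dictionary)      # RS-Ord: first probes from a dictionary
    while len(sample) < T and vocabulary:
        word = random.choice(vocabulary)                # pick a probe word
        docs = search(word)[:k]                         # keep only the top-k results
        sample.extend(docs)
        for d in docs:                                  # RS-Lrd flavor: later probe words are
            vocabulary.extend(d.lower().split())        # drawn from retrieved documents too
    df = Counter()
    for d in set(sample):                               # sample document frequency per word
        df.update(set(d.lower().split()))
    return sample, df

# Toy usage with an in-memory "database" standing in for an uncooperative text source.
corpus = ["information integration on the web", "query based sampling of text databases",
          "schema mapping and record linkage", "collection selection for metasearch"]
fake_search = lambda w: [d for d in corpus if w in d.split()]
sample, df = query_based_sample(fake_search, ["information", "databases", "selection"], T=10)
print(df.most_common(5))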
CORI Net Approach
(Representative as a super document)
• Representative Statistics
– The document frequency for each term and each database
– The database frequency for each term
• Main Ideas
– Visualize the representative of a database as a super document, and the set of all representatives as a database of super documents
– Document frequency becomes term frequency in the super document, and
database frequency becomes document frequency in the super database
– Ranking scores can be computed using a similarity function such as the cosine function (a sketch follows below)
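A minimal sketch of the super-document idea: each database's representative is collapsed into one bag of words, database frequency plays the role of document frequency, and databases are ranked by cosine similarity against the query. The collection names and the simple tf-idf-style weighting are assumptions for illustration; CORI's actual ranking formula is a tuned belief function rather than plain cosine.

import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Each representative (a set of sampled documents) becomes one super document.
representatives = {
    "news-db":  ["election results and polls", "sports scores and results"],
    "cs-db":    ["information integration on the web", "query planning for mediators"],
    "movie-db": ["movie reviews and showtimes", "box office results"],
}
super_docs = {db: Counter(" ".join(docs).lower().split()) for db, docs in representatives.items()}

# Database frequency of a term = number of super documents containing it.
n = len(super_docs)
dbf = Counter(t for sd in super_docs.values() for t in set(sd))

def weight(counts):
    return {t: (tf * math.log(n / dbf[t]) if dbf[t] else 0.0) for t, tf in counts.items()}

query = Counter("information integration".split())
scores = {db: cosine(weight(query), weight(sd)) for db, sd in super_docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))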
ReDDE Approach
(Representative as a mini-collection)
• Use the representatives as mini collections
• Construct a union-representative that is the union of the
mini-collections (such that each document keeps
information on which collection it was sampled from)
• Send the query first to union-collection, get the top-k
ranked results
– See which of the results in the top-k came from which mini-collection. The collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
– Scale each mini-collection's contribution by the estimated size of the actual collection (relative to the sample size)
  (a sketch follows below)
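A sketch of the ReDDE-style ranking described above. The retrieval score, the sample contents, and the collection-size estimates are placeholders; the point is only the bookkeeping: run the query on the union of samples, see which samples contributed to the top-k, and scale by estimated collection size over sample size.

from collections import Counter

def score(query, doc):
    """Toy retrieval score: number of query terms appearing in the document."""
    return len(set(query.split()) & set(doc.split()))

samples = {   # mini-collections: (sampled docs, estimated size of the real collection)
    "db1": (["information integration tutorial", "schema mapping notes"], 10_000),
    "db2": (["soccer results", "tennis rankings"], 50_000),
    "db3": (["query planning for data integration", "wrapper learning"], 5_000),
}
query, k = "data integration", 2

# Union of the mini-collections, remembering where each document was sampled from.
union = [(doc, db) for db, (docs, _) in samples.items() for doc in docs]
top = sorted(union, key=lambda pair: score(query, pair[0]), reverse=True)[:k]

# Count contributions to the top-k, then scale by estimated-size / sample-size.
contrib = Counter(db for _, db in top)
ranking = {db: contrib[db] * (size / len(docs)) for db, (docs, size) in samples.items()}
print(sorted(ranking.items(), key=lambda kv: -kv[1]))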
Data Integration
Models for Integration
Overview:
• Motivation for Information Integration [Rao]
• Accessing Information Sources [Craig]
• Models for Integration [Rao]
• Query Planning & Optimization [Rao]
• Plan Execution [Craig]
• Standards for Integration/Mediation [Rao]
• Ontology & Data Integration [Craig]
• Future Directions [Craig]
[Architecture figure (modified from Alon Halevy's slides): webpages, structured data, sensors (streaming data), and services feed the mediator; ontologies, source/service descriptions, source statistics, and a query/preference/utility model drive Source Fusion/Query Planning, which needs to handle multiple objectives, service composition, and source quality & overlap; the Executor needs to handle source/network interruptions, runtime uncertainty, and replanning; a Monitor updates statistics and issues replanning requests; probing queries go out to the sources, answers come back, and source trust is tracked.]
Solutions for small-scale integration
• Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
• Data warehousing: load all the data periodically into a warehouse. (A toy sketch follows below.)
  – 6-18 months lead time
  – Separates operational DBMS from decision support DBMS (not only a solution to data integration).
  – Performance is good; data may not be fresh.
  – Need to clean, scrub your data.
[Warehouse diagram: data sources feed data extraction programs and data cleaning/scrubbing into a relational database (the warehouse), which supports OLAP, decision support, data cubes, and data mining.]
[Mediator architecture diagram repeated from the overview slide.]
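As a toy illustration of the warehousing route (purely hypothetical table and source names, with sqlite in-memory databases standing in for the operational sources and the warehouse):

import sqlite3

# Two "operational" sources with slightly different schemas/conventions.
src_a, src_b = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
src_a.execute("CREATE TABLE books (title TEXT, price_usd REAL)")
src_a.executemany("INSERT INTO books VALUES (?, ?)", [("AI", 30.0), ("Databases", 45.0)])
src_b.execute("CREATE TABLE inventory (name TEXT, price_cents INTEGER)")
src_b.executemany("INSERT INTO inventory VALUES (?, ?)", [(" ai ", 3000), ("Web Mining", 2500)])

# The warehouse: one cleaned, unified relation, reloaded periodically.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE book_prices (title TEXT, price_usd REAL, source TEXT)")

def clean(title):                      # scrubbing step: normalize the merge key
    return title.strip().title()

for title, price in src_a.execute("SELECT title, price_usd FROM books"):
    wh.execute("INSERT INTO book_prices VALUES (?, ?, ?)", (clean(title), price, "A"))
for name, cents in src_b.execute("SELECT name, price_cents FROM inventory"):
    wh.execute("INSERT INTO book_prices VALUES (?, ?, ?)", (clean(name), cents / 100.0, "B"))

# Decision-support queries now run against the warehouse, not the live sources.
print(wh.execute("SELECT title, MIN(price_usd) FROM book_prices GROUP BY title").fetchall())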
The Virtual Integration Architecture
• Leave the data in the sources.
• When a query comes in:
  – Determine the relevant sources to the query
  – Break down the query into sub-queries for the sources.
  – Get the answers from the sources, and combine them appropriately.
• Data is fresh. Approach scalable.
• Issues:
  – Relating Sources & Mediator
  – Reformulating the query
  – Efficient planning & execution
  (a minimal mediator sketch follows below)
[Diagram: user queries are posed against a mediated schema; the mediator (reformulation engine, optimizer, execution engine, data source catalog; which data model?) accesses the data sources through wrappers. The overview mediator architecture figure is repeated as well.]
Systems: Garlic [IBM], Hermes [UMD], Tsimmis and InfoMaster [Stanford], DISCO [INRIA], Information Manifold [AT&T], SIMS/Ariadne [USC], Emerac/Havasu [ASU]
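The query-time loop of the virtual approach can be caricatured in a few lines. Everything here (the source catalog, the capability test, the per-source query functions) is made up for illustration; a real mediator reformulates against source descriptions rather than using a simple attribute test, and it would also join/merge the answers.

# Hypothetical virtual-integration loop: select relevant sources, send sub-queries,
# combine fresh answers at query time (no warehouse).

CATALOG = {
    "movie-db":  {"answers": {"genre"},  "query": lambda q: [("Annie Hall", "Comedy")]},
    "cinema-db": {"answers": {"cinema"}, "query": lambda q: [("Valley Art", "Annie Hall")]},
    "sports-db": {"answers": {"score"},  "query": lambda q: []},
}

def mediate(query, needed_attributes):
    relevant = [s for s, d in CATALOG.items() if d["answers"] & needed_attributes]
    answers = {}
    for s in relevant:                       # sub-query each relevant source
        answers[s] = CATALOG[s]["query"](query)
    return answers                           # combination/join step omitted in this sketch

print(mediate("comedies playing tonight", {"genre", "cinema"}))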
Desiderata for Relating
Source-Mediator Schemas
• Expressive power: distinguish
between sources with closely
related data. Hence, be able to
prune access to irrelevant
sources.
• Easy addition: make it easy to
add new data sources.
• Reformulation: be able to
reformulate a user query into a
query on the sources efficiently
and effectively.
• Nonlossy: be able to handle all
queries that can be answered by
directly accessing the sources
[Mediated-schema/wrapper diagram repeated from the previous slide.]
Reformulation
• Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
• Find:
– A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers to Q given the
sources.
Why isn't this just Distributed Databases?
• No common schema
  – Sources with heterogeneous schemas (and ontologies)
  – Semi-structured sources
• Legacy sources
  – Not relational-complete
  – Variety of access/process limitations
• Autonomous sources
  – No central administration
  – Uncontrolled source content overlap
• Unpredictable run-time behavior
  – Makes query execution hard
• Predominantly "read-only"
  – Could be a blessing: less worry about transaction management
  – (although the push now is to also support transactions on the web)
[Diagram of a conventional DBMS: a query (SQL) goes to the database manager (storage management, query processing, view management, transaction processing) over a relational database, which returns an answer (a relation).]
Differences minor for data aggregation…
Approaches for relating source & Mediator Schemas
• Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations
• Local-as-view (LAV): express the source relations as views over the mediated schema.
• Can be combined…?

"View" Refresher (views can be virtual or materialized):
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = "Seattle" AND
      Person.name = Purchase.buyer

We can later use the views:
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name AND
      Product.category = "shoes"
Global-as-View
Express mediator schema relations as views over source relations.
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

Create View Movie AS
  select * from S1                              [S1(title,dir,year,genre)]
  union
  select * from S2                              [S2(title,dir,year,genre)]
  union
  select S3.title, S3.dir, S4.year, S4.genre    [S3(title,dir), S4(title,year,genre)]
  from S3, S4
  where S3.title=S4.title

Mediator schema relations are virtual views on source relations.
(a toy unfolding sketch follows below)
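Since GAV reformulation reduces to view unfolding, a toy sketch can make that concrete. The representation (queries as lists of atoms) and the function are illustrative assumptions; the GAV definition itself is the one from the slide, with attribute renaming and selection conditions ignored.

from itertools import product

# GAV definition from the slide: Movie is the union of S1, S2, and (S3 join S4).
GAV = {
    "Movie": [
        [("S1", ("title", "dir", "year", "genre"))],
        [("S2", ("title", "dir", "year", "genre"))],
        [("S3", ("title", "dir")), ("S4", ("title", "year", "genre"))],
    ],
}

def unfold(query):
    """Replace every mediated-schema atom by one of its source-level definitions.
    The result is a union of source-only conjunctive queries (polynomial work)."""
    options = []
    for predicate, args in query:
        options.append(GAV.get(predicate, [[(predicate, args)]]))
    return [[atom for conjunct in choice for atom in conjunct] for choice in product(*options)]

# A query over the mediated Movie relation unfolds into three source-level queries.
for q in unfold([("Movie", ("title", "dir", "year", "genre"))]):
    print(q)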
Local-as-View: example 1
Express source schema relations as views over mediator relations.
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

Create Source S1 AS                 [S1(title,dir,year,genre)]
  select * from Movie

Create Source S3 AS                 [S3(title,dir)]
  select title, dir from Movie

Create Source S5 AS                 [S5(title,dir,year), year > 1960]
  select title, dir, year
  from Movie
  where year > 1960 AND genre = "Comedy"

Sources are "materialized views" of the mediator schema.
GAV vs. LAV
Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).
Source S4: S4(cinema, genre)

GAV:
Create View Movie AS
  select NULL, NULL, NULL, genre
  from S4
Create View Schedule AS
  select cinema, NULL, NULL
  from S4.
But what if we want to find which cinemas are playing comedies? Lossy mediation.

LAV:
Create Source S4 AS
  select cinema, genre
  from Movie m, Schedule s
  where m.title=s.title
Now if we want to find which cinemas are playing comedies, there is hope!
GAV vs. LAV

GAV
• Not modular
  – Addition of new sources changes the mediated schema
• Can be awkward to write mediated schema without loss of information
• Query reformulation easy
  – reduces to view unfolding (polynomial)
  – Can build hierarchies of mediated schemas
• Best when
  – Few, stable data sources
  – well-known to the mediator (e.g. corporate integration)
• Garlic, TSIMMIS, HERMES

LAV
• Modular: adding new sources is easy
• Very flexible: power of the entire query language available to describe sources
• Reformulation is hard
  – Involves answering queries using only views (can be intractable; see below)
• Best when
  – Many, relatively unknown data sources
  – possibility of addition/deletion of sources
• Information Manifold, InfoMaster, Emerac, Havasu
Extremes of Laziness in Data Integration
• Fully query-time II (blue sky for now)
  – Get a query from the user on the mediator schema
  – Go "discover" relevant data sources
  – Figure out their "schemas"
  – Map the schemas on to the mediator schema
  – Reformulate the user query into data source queries
  – Optimize and execute the queries
  – Return the answers
• Fully pre-fixed II
  – Decide on the only query you want to support
  – Write a (java)script that supports the query by accessing specific (predetermined) sources, piping results (through known APIs) to specific other sources
  – Examples include Google Map Mashups
(The most interesting action is "in between": e.g. we may start with known sources and their known schemas, do hand-mapping, and support automated reformulation and optimization.)
• User queries refer to the mediated schema.
• Data is stored in the sources in a local schema.
• Content descriptions provide the semantic mappings between the different schemas.
• Mediator uses the descriptions to translate user queries into queries on the sources.
[Mediator architecture diagram repeated from the overview slide, here annotated with "DWIM".]
Dimensions to Consider
• How many sources are we accessing?
• How autonomous are they?
• Can we get meta-data about sources?
• Is the data structured?
  – Discussion about soft-joins; see the next slide
• Supporting just queries or also updates?
• Requirements: accuracy, completeness,
performance, handling inconsistencies.
• Closed world assumption vs. open world?
– See the next slide
Soft Joins..WHIRL [Cohen]
We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with rank proportional to the similarity (a sketch follows below).
Neat idea… but it does have some implementation difficulties:
• Most tuples in the cross-product will have non-zero similarities, so we need query processing that somehow produces only the highly ranked tuples.
• Other similarity/distance metrics may also be used, e.g. edit distance.
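A tiny soft-join sketch in the spirit of WHIRL (not Cohen's implementation): score every pair by cosine similarity over the text fields and output the pairs in ranked order. The relation contents are made up, and a real system would avoid materializing the full cross-product.

import math
from collections import Counter

def vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two relations whose "join key" is free text that never matches exactly.
reviews  = [("Star Wars A New Hope", "8.5"), ("The Empire Strikes Back", "9.0")]
listings = [("star wars episode IV a new hope", "Valley Art 7pm"),
            ("empire strikes back", "Camelview 9pm")]

# Soft join: rank all pairs by text similarity instead of requiring equality.
pairs = [(r, l, cosine(vector(r[0]), vector(l[0]))) for r in reviews for l in listings]
for r, l, sim in sorted(pairs, key=lambda x: -x[2]):
    if sim > 0:                                   # output tuples in ranked order
        print(f"{sim:.2f}  {r[0]!r} ~ {l[0]!r}")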
Source Descriptions
• Contains all meta-information about the sources:
  – Logical source contents (books, new cars).
  – Source capabilities (can answer SQL queries).
  – Source completeness (has all books).
  – Physical properties of source and network.
  – Statistics about the data (like in an RDBMS).
  – Source reliability.
  – Mirror sources.
  – Update frequency.
  (a sketch of such a description record follows below)
[Mediator architecture diagram repeated from the overview slide.]
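The kinds of meta-information listed above could be captured in something as simple as a record per source. This is just an illustrative data structure with invented field names and values, not a standard format.

from dataclasses import dataclass, field

@dataclass
class SourceDescription:
    name: str
    contents: str                      # logical contents, e.g. "used books", "new cars"
    capabilities: set = field(default_factory=set)    # e.g. {"keyword search"} or "full SQL"
    complete: bool = False             # does it have *all* books?
    avg_latency_ms: float = 0.0        # physical properties of source and network
    statistics: dict = field(default_factory=dict)    # e.g. cardinalities, term frequencies
    reliability: float = 1.0           # fraction of time the source is up and correct
    mirrors: list = field(default_factory=list)
    update_frequency: str = "daily"

bookstore = SourceDescription(
    name="bookstore-1", contents="new and used books",
    capabilities={"keyword search", "select by author"},
    complete=False, avg_latency_ms=250.0, reliability=0.98,
    mirrors=["bookstore-1-mirror"], update_frequency="hourly")
print(bookstore)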
Source Access
• How do we get the “tuples”?
– Many sources give “unstructured” output
• Some inherently unstructured; while others
“englishify” their database-style output
– Need to (un)Wrap the output from the sources
to get tuples
– “Wrapper building”/Information Extraction
– Can be done manually or semi-automatically (a toy wrapper sketch follows below)
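A toy wrapper for an "englishified" listing, to show what un-wrapping into tuples means. The page text and the regular expression are invented for this example; real wrapper induction would learn such patterns rather than hard-code them.

import re

# Imagine a source that renders its database output as English sentences.
page = """Annie Hall, directed by Woody Allen, was released in 1977.
Blade Runner, directed by Ridley Scott, was released in 1982."""

# Hand-written extraction pattern: title, director, year.
pattern = re.compile(r"(?P<title>.+?), directed by (?P<dir>.+?), was released in (?P<year>\d{4})\.")

tuples = [(m["title"], m["dir"], int(m["year"])) for m in pattern.finditer(page)]
print(tuples)   # [('Annie Hall', 'Woody Allen', 1977), ('Blade Runner', 'Ridley Scott', 1982)]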
Source Fusion/Query Planning
• Accepts user query and generates a plan for accessing sources to answer the query
  – Needs to handle tradeoffs between cost and coverage (see the sketch below)
  – Needs to handle source access limitations
  – Needs to reason about the source quality/reputation
[Mediator architecture diagram repeated from the overview slide.]
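One way to picture the cost/coverage tradeoff is a greedy plan that keeps adding the source with the best marginal coverage per unit cost until a coverage target is met. The statistics and the independent-overlap assumption are made up for illustration; real planners use richer coverage/overlap models.

# Hypothetical per-source statistics: coverage = fraction of answer tuples the source
# is expected to return for this query, cost = expected access cost (e.g. latency).
sources = {"db1": (0.60, 2.0), "db2": (0.50, 1.0), "db3": (0.20, 0.5)}

def greedy_plan(sources, target_coverage=0.9):
    plan, covered = [], 0.0
    remaining = dict(sources)
    while covered < target_coverage and remaining:
        # Marginal coverage assumes sources overlap independently (a strong assumption).
        best = max(remaining, key=lambda s: (1 - covered) * remaining[s][0] / remaining[s][1])
        cov, _ = remaining.pop(best)
        covered += (1 - covered) * cov
        plan.append((best, round(covered, 3)))
    return plan

print(greedy_plan(sources))   # sources in access order with cumulative expected coverage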
Monitoring/Execution
• Takes the query plan and executes it on the sources
  – Needs to handle source latency
  – Needs to handle transient/short-term network outages
  – Needs to handle source access limitations
  – May need to re-schedule or re-plan
  (a sketch of a retrying executor follows below)
[Mediator architecture diagram repeated from the overview slide.]
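A skeletal executor illustrating the latency/outage handling mentioned above: call a source with a crude timeout check, retry transient failures a couple of times with backoff, and give up (i.e., hand the problem back to the planner/monitor) when the source stays down. The source function, limits, and failure model are placeholders.

import random, time

def call_with_retries(source_fn, retries=2, timeout_s=1.0, backoff_s=0.1):
    """Bounded retries with backoff; returns None on failure, which a real monitor
    would turn into a re-scheduling or re-planning request."""
    for attempt in range(retries + 1):
        start = time.time()
        try:
            result = source_fn()
            if time.time() - start > timeout_s:      # crude latency check
                raise TimeoutError("source too slow")
            return result
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
            time.sleep(backoff_s * (attempt + 1))
    return None

def flaky_source():                                  # stands in for a remote source call
    if random.random() < 0.5:
        raise ConnectionError("transient network outage")
    return [("Annie Hall", "Valley Art", "7pm")]

answers = call_with_retries(flaky_source)
print(answers if answers is not None else "giving up: ask the planner to re-plan")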