Slides - VLDB 2009
Download
Report
Transcript Slides - VLDB 2009
Data Integration
for the Relational Web
Michael J. Cafarella, Alon Halevy,
Nodira Khoussainova
Work done while at Google, Inc.
Presenter:
Michael J. Cafarella, University of Michigan
VLDB
August 27, 2009
Web Challenge
Try to create a database of all
“VLDB program committee members”
2
Data Integration for Web
Can we combine tables to create new
data sources?
Existing mashup, data integration
tools ignore realities of Web data
3
A lot of useful data is not in XML
User cannot know all sources in advance
Transient integrations
Data semantics semi-tied to src page
Octopus
Crawl Web
Extract Tables
Inst.
Year
Serge
Inria
1996
Michel
… Gren
1996
Anton..
… Pisa
1996
Serge
Inria
2005
Obtain Database
Our
system
uses data from:work, e.g.,
Lots
of table/list-extraction
Octopus
Our test system has over 200M src tables
4
Integrate Tables
Name
[VLDB09,
“Answering Table Augmentation…”, Gupta & Sarawagi]
WebTables
[WebDB08,
Cafarella et al]data…”, Michelson & Knoblock]
[JAIR08,“Uncovering…”,
“Creating relational
[VLDB08, “WebTables: Exploring…”, Cafarella et al]
[WWW07, “Towards domain-independent…”, Gatterbauer et al]
Harvesting Relational Tables from Lists
[WWW02,
“A machine
Wang
& Hu]
[VLDB09, “Harvesting
Relational learning
Tables from based…”,
Lists…”, Elmeleegy
et al]
Outline
Introduction
Data Sources
Octopus Operators
5
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
Outline
Introduction
Data Sources
Octopus Operators
6
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
7
List Extraction
8
List Extraction
9
What’s Opera Doc
Warner
1957
Duck Amuck
Warner
1953
The Band Concert
Disney
1935
Duck Dodgers…
Warner
1953
One Froggy Evening
Warner
1956
…
…
…
Outline
Introduction
Data Sources
Octopus Operators
10
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
Octopus
Provides “workbench” of data integration
operators to build target database
Three original operators
11
Most operators are not correct/incorrect, but
high/low quality
Some prosaic operators: project, select, …
SEARCH
CONTEXT
EXTEND
Under covers, each operator recovers
different aspect of implicit GLAV src desc.
Operator #1 - SEARCH
SEARCH(“VLDB program committee
members”)
serge abiteboul inria
michael adiba
…grenoble
antonio albano
…pisa
…
…
serge abiteboul inria
anastassia ail… carnegie…
12
gustavo alonso
etz zurich
…
…
Operator #2 - CONTEXT
Recover relevant data
serge abiteboul inria
CONTEXT()
michael adiba
…grenoble
antonio albano
…pisa
…
…
serge abiteboul inria
CONTEXT()
13
anastassia ail… carnegie…
gustavo alonso
etz zurich
…
…
Operator #2 - CONTEXT
Recover relevant data
CONTEXT()
CONTEXT()
14
serge abiteboul
inria
1996
michael adiba
…grenoble 1996
antonio albano
…pisa
1996
…
…
…
serge abiteboul inria
2005
anastassia ail… carnegie…
2005
gustavo alonso
etz zurich 2005
…
…
…
Prosaic Operator - Union
Combine datasets
serge abiteboul
Union()
15
inria
1996
michael adiba
…grenoble 1996
serge abiteboul inria
1996
antonio albano
…pisa
1996
michael adiba
…grenoble
1996
…
…
…
antonio albano …pisa
1996
serge abiteboul inria
2005
anastassia ail…
serge abiteboul
gustavo alonso
anastassia ail…
…
gustavo alonso
carnegie…
2005
inria
2005
etz zurich
2005
carnegie… 2005
…
…
etz zurich 2005
…
…
…
Operator #3 - EXTEND
Add column to data
Similar to “join” but join target is a topic
“publications”
EXTEND( “publications”,
col=0)
16
serge
inria
1996
1996 abiteboul
“Large Scale
P2P Dist…”
serge abiteboul
inria
michael adiba
adiba
…grenoble
1996
…grenoble michael
1996 “Exploiting
bitemporal…”
antonio albano
…pisa
antonio
albano Example
…pisa
1996
1996 “Another
of a…”
serge abiteboul
inria
serge
inria
2005
2005 abiteboul
“Large Scale
P2P Dist…”
anastassia ail…
ail… carnegie…
2005
carnegie… anastassia
2005 “Efficient
Use of the…”
gustavo alonso
alonso
2005
etz zurichgustavo
2005 “A
Dynamicetz
andzurich
Flexible…”
…
…
……
…
…
Straightforward Sequence
1.
2.
SEARCH(“VLDB program committee members”)
CONTEXT
serge abiteboul inria
michael adiba
…grenoble
antonio albano
…pisa
…
…
2.
CONTEXT
serge abiteboul inria
anastassia ail… carnegie…
gustavo alonso
etz zurich
…
…
17
Straightforward Sequence
2.
union
CONTEXT
serge abiteboul
inria
michael adiba
…grenoble 1996
antonio albano
…pisa
1996
…
…
…
2.
1996
CONTEXT
serge abiteboul inria
2005
anastassia ail… carnegie…
2005
gustavo alonso
etz zurich 2005
…
…
18
…
Straightforward Sequence
3.
union
EXTEND
serge abiteboul inria
1996 “Large Scale P2P Dist…”
michael adiba
…grenoble
1996 “Exploiting bitemporal…”
antonio albano
…pisa
1996 “Another Example of a…”
serge abiteboul inria
2005 “Large Scale P2P Dist…”
anastassia ail… carnegie…
2005 “Efficient Use of the…”
gustavo alonso
etz zurich 2005 “A Dynamic and Flexible…”
…
…
19
…
• User integrated data sources with 4 operations
• No wrappers; data was never intended for reuse
• User never visited source web pages
Outline
Introduction
Data Sources
Octopus Operators
20
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
Experiments
21
~50 queries, suggested and evaluated
by Amazon Mechanical Turk
SEARCH Algorithms - Ranking
SimpleRank - search engine ranking
SCPRank - symmetric conditional
probability between query, table data
Similar to Pointwise Mutual Information
[Lopes, DaSilva, 1999], multiword units
# docs containing a,b
p(a,b)
# total docs
2
22
p(query,cell)
scp(query,cell)
p(query) p(cell)
SEARCH Algorithms - Ranking
23
Top-2
Top-5
Top-10
SimpleRank
27%
51%
73%
SCPRank
47%
64%
81%
SEARCH Algorithms - Ranking
Top-2
Top-5
Top-10
SimpleRank
27%
51%
73%
SCPRank
47%
64%
81%
24
See paper for clustering results
CONTEXT Algorithms
Input: table and source page
Output: data values to add to table
SignificantTerms sorts terms in source
page by “importance” (tf-idf)
25
Related View Partners
26
Looks for different “views” of same data
CONTEXT Experiments
27
EXTEND Algorithms
Input: src table, src column, dst topic
EXTEND(t, col=0, “publications”)
JoinTest:
Tests a single table for join-compatibility
Rank all tables by relevance to query topic
Select tables that are joinable to query column
MultiJoin
Finds a join-target tuple for each src tuple
28
“City mayors”: yes
“VLDB publications”: no
“City mayors”: maybe
“VLDB publications”: yes
For each cell in src column, perform topic search
Cluster resulting tables, rank by column coverage
EXTEND Early Experiments
JoinTest
Join
Column
3
of 7 source
Topic Query
tables
countries
universities
60% of source tuples
Usstates
governors
Single extension for
each extended tuple
MultiJoin
Us cities
mayors
Film
titles
All 7
source tables characters
UKpolitical
MP
33% parties
of source tuples
Baseball
teams
players
Avg 45.5 extensions for each extended
Musical
bands
albums
tuple
29
113 NYC mayors
12 albums by Led Zeppelin
Related Work
Octopus relies on info extraction work
Substantial work in data integration
Mashup Tools
Yahoo! Pipes
Marmite - [Wong and Hong, 2007]
Karma - [Tuchinda, et al., 2007]
CIMPLE - [DeRose, et al., 2007]
Potter’s Wheel -
[Raman and Hellerstein, 2001]
30
Octopus Contributions
Basic operators that enable Web data
integration with very small user burden
Realistic and useful implementations for
all three operators
Future work:
31
Efficient large-scale implementation