Slides - VLDB 2009

Download Report

Transcript Slides - VLDB 2009

Data Integration
for the Relational Web
Michael J. Cafarella, Alon Halevy,
Nodira Khoussainova
Work done while at Google, Inc.
Presenter:
Michael J. Cafarella, University of Michigan
VLDB
August 27, 2009
Web Challenge

Try to create a database of all
“VLDB program committee members”
2
Data Integration for Web
Can we combine tables to create new
data sources?
Existing mashup, data integration
tools ignore realities of Web data






3
A lot of useful data is not in XML
User cannot know all sources in advance
Transient integrations
Data semantics semi-tied to src page
Octopus
Crawl Web
Extract Tables
Inst.
Year
Serge
Inria
1996
Michel
… Gren
1996
Anton..
… Pisa
1996
Serge
Inria
2005
Obtain Database


Our
system
uses data from:work, e.g.,
Lots
of table/list-extraction
Octopus

Our test system has over 200M src tables






4
Integrate Tables
Name
[VLDB09,
“Answering Table Augmentation…”, Gupta & Sarawagi]
WebTables
[WebDB08,
Cafarella et al]data…”, Michelson & Knoblock]
[JAIR08,“Uncovering…”,
“Creating relational
[VLDB08, “WebTables: Exploring…”, Cafarella et al]
[WWW07, “Towards domain-independent…”, Gatterbauer et al]
Harvesting Relational Tables from Lists
[WWW02,
“A machine
Wang
& Hu]
[VLDB09, “Harvesting
Relational learning
Tables from based…”,
Lists…”, Elmeleegy
et al]
Outline



Introduction
Data Sources
Octopus Operators





5
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
Outline



Introduction
Data Sources
Octopus Operators





6
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
7
List Extraction
8
List Extraction
9
What’s Opera Doc
Warner
1957
Duck Amuck
Warner
1953
The Band Concert
Disney
1935
Duck Dodgers…
Warner
1953
One Froggy Evening
Warner
1956
…
…
…
Outline



Introduction
Data Sources
Octopus Operators





10
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
Octopus

Provides “workbench” of data integration
operators to build target database



Three original operators




11
Most operators are not correct/incorrect, but
high/low quality
Some prosaic operators: project, select, …
SEARCH
CONTEXT
EXTEND
Under covers, each operator recovers
different aspect of implicit GLAV src desc.
Operator #1 - SEARCH

SEARCH(“VLDB program committee
members”)
serge abiteboul inria
michael adiba
…grenoble
antonio albano
…pisa
…
…
serge abiteboul inria
anastassia ail… carnegie…
12
gustavo alonso
etz zurich
…
…
Operator #2 - CONTEXT

Recover relevant data
serge abiteboul inria
CONTEXT()
michael adiba
…grenoble
antonio albano
…pisa
…
…
serge abiteboul inria
CONTEXT()
13
anastassia ail… carnegie…
gustavo alonso
etz zurich
…
…
Operator #2 - CONTEXT

Recover relevant data
CONTEXT()
CONTEXT()
14
serge abiteboul
inria
1996
michael adiba
…grenoble 1996
antonio albano
…pisa
1996
…
…
…
serge abiteboul inria
2005
anastassia ail… carnegie…
2005
gustavo alonso
etz zurich 2005
…
…
…
Prosaic Operator - Union

Combine datasets
serge abiteboul
Union()
15
inria
1996
michael adiba
…grenoble 1996
serge abiteboul inria
1996
antonio albano
…pisa
1996
michael adiba
…grenoble
1996
…
…
…
antonio albano …pisa
1996
serge abiteboul inria
2005
anastassia ail…
serge abiteboul
gustavo alonso
anastassia ail…
…
gustavo alonso
carnegie…
2005
inria
2005
etz zurich
2005
carnegie… 2005
…
…
etz zurich 2005
…
…
…
Operator #3 - EXTEND


Add column to data
Similar to “join” but join target is a topic
“publications”
EXTEND( “publications”,
col=0)
16
serge
inria
1996
1996 abiteboul
“Large Scale
P2P Dist…”
serge abiteboul
inria
michael adiba
adiba
…grenoble
1996
…grenoble michael
1996 “Exploiting
bitemporal…”
antonio albano
…pisa
antonio
albano Example
…pisa
1996
1996 “Another
of a…”
serge abiteboul
inria
serge
inria
2005
2005 abiteboul
“Large Scale
P2P Dist…”
anastassia ail…
ail… carnegie…
2005
carnegie… anastassia
2005 “Efficient
Use of the…”
gustavo alonso
alonso
2005
etz zurichgustavo
2005 “A
Dynamicetz
andzurich
Flexible…”
…
…
……
…
…
Straightforward Sequence
1.
2.
SEARCH(“VLDB program committee members”)
CONTEXT
serge abiteboul inria
michael adiba
…grenoble
antonio albano
…pisa
…
…
2.
CONTEXT
serge abiteboul inria
anastassia ail… carnegie…
gustavo alonso
etz zurich
…
…
17
Straightforward Sequence
2.
union
CONTEXT
serge abiteboul
inria
michael adiba
…grenoble 1996
antonio albano
…pisa
1996
…
…
…
2.
1996
CONTEXT
serge abiteboul inria
2005
anastassia ail… carnegie…
2005
gustavo alonso
etz zurich 2005
…
…
18
…
Straightforward Sequence
3.
union
EXTEND
serge abiteboul inria
1996 “Large Scale P2P Dist…”
michael adiba
…grenoble
1996 “Exploiting bitemporal…”
antonio albano
…pisa
1996 “Another Example of a…”
serge abiteboul inria
2005 “Large Scale P2P Dist…”
anastassia ail… carnegie…
2005 “Efficient Use of the…”
gustavo alonso
etz zurich 2005 “A Dynamic and Flexible…”
…
…
19
…
• User integrated data sources with 4 operations
• No wrappers; data was never intended for reuse
• User never visited source web pages
Outline



Introduction
Data Sources
Octopus Operators





20
SEARCH
CONTEXT
EXTEND
Algorithms & Experiments
Conclusions
Experiments

21
~50 queries, suggested and evaluated
by Amazon Mechanical Turk
SEARCH Algorithms - Ranking


SimpleRank - search engine ranking
SCPRank - symmetric conditional
probability between query, table data


Similar to Pointwise Mutual Information
[Lopes, DaSilva, 1999], multiword units
# docs  containing  a,b
p(a,b) 
# total docs
2

22
p(query,cell)
scp(query,cell) 
p(query) p(cell)
SEARCH Algorithms - Ranking
23
Top-2
Top-5
Top-10
SimpleRank
27%
51%
73%
SCPRank
47%
64%
81%
SEARCH Algorithms - Ranking
Top-2
Top-5
Top-10
SimpleRank
27%
51%
73%
SCPRank
47%
64%
81%

24
See paper for clustering results
CONTEXT Algorithms

Input: table and source page
Output: data values to add to table

SignificantTerms sorts terms in source

page by “importance” (tf-idf)
25
Related View Partners

26
Looks for different “views” of same data
CONTEXT Experiments
27
EXTEND Algorithms

Input: src table, src column, dst topic


EXTEND(t, col=0, “publications”)
JoinTest:

Tests a single table for join-compatibility





Rank all tables by relevance to query topic
Select tables that are joinable to query column
MultiJoin

Finds a join-target tuple for each src tuple




28
“City mayors”: yes
“VLDB publications”: no
“City mayors”: maybe
“VLDB publications”: yes
For each cell in src column, perform topic search
Cluster resulting tables, rank by column coverage
EXTEND Early Experiments

JoinTest
Join
Column
 3
of 7 source
Topic Query
tables
countries
universities
 60% of source tuples
Usstates
governors
Single extension for
each extended tuple
MultiJoin
Us cities
mayors
Film
titles
 All 7
source tables characters
UKpolitical
MP
33% parties
of source tuples
Baseball
teams
players
 Avg 45.5 extensions for each extended
Musical
bands
albums
tuple


29
113 NYC mayors
12 albums by Led Zeppelin
Related Work



Octopus relies on info extraction work
Substantial work in data integration
Mashup Tools





Yahoo! Pipes
Marmite - [Wong and Hong, 2007]
Karma - [Tuchinda, et al., 2007]
CIMPLE - [DeRose, et al., 2007]
Potter’s Wheel -
[Raman and Hellerstein, 2001]
30
Octopus Contributions



Basic operators that enable Web data
integration with very small user burden
Realistic and useful implementations for
all three operators
Future work:

31
Efficient large-scale implementation