context() - VideoLectures.NET

The Web’s Many Models
Michael J. Cafarella
University of Michigan
AKBC
May 19, 2010
Web Information Extraction

• Much recent research in information extractors that operate over Web pages
  • Snowball (Agichtein and Gravano, 2001)
  • TextRunner (Banko et al, 2007)
  • Yago (Suchanek et al, 2007)
  • WebTables (Cafarella et al, 2008)
  • DBPedia, ExDB, Freebase (make use of IE data)
• Web crawl + domain-independent IE should allow comprehensive Web KBs with:
  • Very high, “web-style” recall
  • “More-expressive-than-search” query processing
• But where is it?
Web Information Extraction

• Omnivore
  • “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.
  • Suggested remedies for data ingestion, user interaction
  • This talk says why ideas in that paper might already be out of date, gives alternative ideas
• If there are mistakes here, then you have a chance to save me years of work!
Outline

• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion
Parallel Extraction

• Previous hypothesis
  • Many data models for interesting data, e.g., relational tables and E/R graphs, etc.
  • Should build large integration infrastructure to consume many extraction streams
Database Construction (1)

• Start with a single large Web crawl
Database Construction (2)

• Each of k extractors emits output that:
  • Has an extractor-dependent model
  • Has an extractor-and-Web-page-dependent schema
Database Construction (3)

• For each extractor output, unfold into common entity-relation model
Database Construction (4)

• Unify results
Database Construction (5)

• Emit final database
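
The five construction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Omnivore's implementation: the two extractors, the page dictionaries, the rule that a table's first column names the entity, and the exact-name unification are all hypothetical stand-ins.

```python
from collections import defaultdict

def table_extractor(page):
    """Hypothetical extractor whose output model is a relational table."""
    return {"model": "table",
            "header": page.get("header", ()),
            "rows": page.get("rows", [])}

def text_extractor(page):
    """Hypothetical extractor whose output model is a list of E/R assertions."""
    return {"model": "triples", "triples": page.get("facts", [])}

def unfold(output):
    """Step 3: unfold one extractor output into (entity, relation, value) triples."""
    if output["model"] == "table":
        header = output["header"]
        for row in output["rows"]:
            entity = row[0]  # assumption: the first column names the entity
            for attribute, value in zip(header[1:], row[1:]):
                yield (entity, attribute, value)
    elif output["model"] == "triples":
        yield from output["triples"]
    else:
        raise ValueError(f"unknown extractor model: {output['model']}")

def unify(triples):
    """Step 4: merge triples by normalized entity name (a stand-in for real reconciliation)."""
    database = defaultdict(lambda: defaultdict(set))
    for entity, relation, value in triples:
        database[entity.strip().lower()][relation].add(value)
    return database

def build_database(crawl, extractors):
    """Steps 1-5: run every extractor over every crawled page, unfold, unify, emit."""
    triples = []
    for page in crawl:
        for extractor in extractors:
            triples.extend(unfold(extractor(page)))
    return unify(triples)
```

Even in this toy form, the unification step carries most of the difficulty: exact name matching is the simplest possible reconciliation and would not survive real Web data.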
Potential Problems

• Pressing problems:
  • Recall
  • Simple intra-source reconciliation
  • Time
• Tables, entities probably OK for now
  • Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well
• One possible different direction: the Data-Centric Web
  • Addresses recall only
The Data-Centric Web
Data-Centric Lists

• Lists of Data-Centric Entities give hints:
  • About what the target entity contains
  • That all members of set are DCEs, or not
  • That members of set belong to a class or type (e.g., program committee members)
Build the Data-Centric Web

1. Download the Web
2. Train classifiers to detect DCEs, DCLs
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attr/val extractors on DCEs

• Yields E/R dataset, for insertion into DBPedia, YAGO, etc
• In progress now… with student Ashwin Balakrishnan, entity detector >95% acc.
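
A minimal sketch of the five steps, to make the data flow concrete. Everything here is assumed for illustration: the page dictionaries, the two heuristic "classifiers" (the talk uses trained ones), the majority-vote fix-up rule, and the regular-expression attribute extractor are hypothetical and far simpler than a real system.

```python
import re

# Matches one "attribute: value" line of page text.
ATTR_LINE = re.compile(r"^\s*([\w ]+?):\s*(.+?)\s*$", flags=re.M)

def extract_attr_values(page):
    """Step 5 stand-in: pull 'attribute: value' lines out of the page text."""
    return dict(ATTR_LINE.findall(page["text"]))

def looks_like_dce(page):
    """Toy Data-Centric Entity test: at least two attribute/value lines."""
    return len(extract_attr_values(page)) >= 2

def looks_like_dcl(page):
    """Toy Data-Centric List test: the page links to at least three others."""
    return len(page["links"]) >= 3

def build_data_centric_web(pages):
    by_url = {p["url"]: p for p in pages}                        # step 1
    dce = {u for u, p in by_url.items() if looks_like_dce(p)}    # step 2
    dcl = {u for u, p in by_url.items() if looks_like_dcl(p)}
    kept = dce | dcl                                             # step 3
    fixed = set(dce)
    for list_url in sorted(dcl):                                 # step 4
        members = [m for m in by_url[list_url]["links"] if m in kept]
        if not members:
            continue
        votes = sum(m in dce for m in members)
        # Assumed rule: a list is homogeneous, so the majority label wins.
        if 2 * votes > len(members):
            fixed.update(members)
        else:
            fixed.difference_update(members)
    return {u: extract_attr_values(by_url[u]) for u in sorted(fixed)}  # step 5
```

The fix-up encodes the hint from the Data-Centric Lists slide that all members of a list are DCEs, or none are; a page the entity test missed can be recovered when most of its list neighbors passed.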
Research Question 1

• How many useful entities…
  • Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
  • AND are otherwise well-described enough online that IE can recover an entity-centric view?
• Put differently:
  • Does every entity worth extracting already have a homepage on the Web?
Research Question 2

• Does a single real-world entity have more than one “authoritative” URL?
  • Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job
Outline

• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion
Model Generation for Output

• Previous hypothesis
  • Many different user applications built against single back-end database
  • Difficult task is translating from back-end data model to the application’s data model
Query Processing (1)

• Query arrives at system
Query Processing (2)

• Entity-relation database processor yields entity results
Query Processing (3)

• Query Renderer chooses appropriate output schema
Query Processing (4)

• User corrections are logged and fed into later iterations of db construction
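
The four query-processing stages can be illustrated with a small sketch. The in-memory database layout, the substring matching, the "shared attributes" schema rule, and all function names are assumptions made for this example; the talk does not specify how the renderer actually picks a schema.

```python
def process_query(query, db):
    """Stage 2: return entities whose name or attribute values mention the query."""
    q = query.lower()
    return {
        entity: attrs
        for entity, attrs in db.items()
        if q in entity.lower() or any(q in str(v).lower() for v in attrs.values())
    }

def render(results):
    """Stage 3: choose an output schema; here, the attributes every result shares."""
    if not results:
        return [], []
    shared = set.intersection(*(set(attrs) for attrs in results.values()))
    schema = ["entity"] + sorted(shared)
    rows = [[entity] + [attrs[c] for c in schema[1:]]
            for entity, attrs in sorted(results.items())]
    return schema, rows

corrections_log = []

def log_correction(entity, attribute, new_value):
    """Stage 4: record a user correction for a later database-construction pass."""
    corrections_log.append({"entity": entity, "attribute": attribute, "value": new_value})
```

Restricting the schema to shared attributes is only one possible choice; it yields a rectangular table at the cost of hiding attributes that some results lack.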
Potential Problems

• Many plausible front-end applications, none yet totally compelling and novel
  • Ad- and search-driven ones not novel
  • Freebase, Wolfram Alpha not compelling
  • Raw input to learners: useful, not an end-user application
• Need to explore possible applications rather than build multi-app infrastructure
• One possible different direction: data integration as user primitive
Data Integration as UI

• Can we combine tables to create new data sources?
• Many existing “mashup” tools, which ignore realities of Web data
  • A lot of useful data is not in XML
  • User cannot know all sources in advance
  • Transient integrations
  • Dirty data
Interaction Challenge

• Try to create a database of all “VLDB program committee members”
Octopus

• Provides “workbench” of data integration operators to build target database
  • Most operators are not correct/incorrect, but high/low quality (like search)
  • Also, prosaic traditional operators
  • Originally ran on WebTable data
  • [VLDB 2009, Cafarella, Khoussainova, Halevy]
Walkthrough - Operator #1

• SEARCH(“VLDB program committee members”)

serge abiteboul    inria
michael adiba      …grenoble
antonio albano     …pisa
…                  …

serge abiteboul    inria
anastassia ail…    carnegie…
gustavo alonso     etz zurich
…                  …
Walkthrough - Operator #2

• Recover relevant data

CONTEXT()
serge abiteboul    inria
michael adiba      …grenoble
antonio albano     …pisa
…                  …

CONTEXT()
serge abiteboul    inria
anastassia ail…    carnegie…
gustavo alonso     etz zurich
…                  …
Walkthrough - Operator #2

• Recover relevant data

CONTEXT()
serge abiteboul    inria         1996
michael adiba      …grenoble     1996
antonio albano     …pisa         1996
…                  …             …

CONTEXT()
serge abiteboul    inria         2005
anastassia ail…    carnegie…     2005
gustavo alonso     etz zurich    2005
…                  …             …
Walkthrough - Union

• Combine datasets

Union()
serge abiteboul    inria         1996
michael adiba      …grenoble     1996
antonio albano     …pisa         1996
serge abiteboul    inria         2005
anastassia ail…    carnegie…     2005
gustavo alonso     etz zurich    2005
…                  …             …
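
As a rough illustration of the Union step, the sketch below concatenates tables that already share a schema and drops exact duplicate rows. Representing tables as lists of tuples and using set-style duplicate removal are assumptions of this example; the slides do not say how Octopus handles duplicates or mismatched schemas.

```python
def union(*tables):
    """Combine tables with the same columns, keeping the first copy of each row."""
    seen = set()
    combined = []
    for table in tables:
        for row in table:
            if row not in seen:
                seen.add(row)
                combined.append(row)
    return combined

# Rows taken from the walkthrough; elided values are kept as they appear on the slide.
pc_1996 = [("serge abiteboul", "inria", "1996"),
           ("michael adiba", "…grenoble", "1996")]
pc_2005 = [("serge abiteboul", "inria", "2005"),
           ("gustavo alonso", "etz zurich", "2005")]
combined = union(pc_1996, pc_2005)
```

Because CONTEXT() added the year column first, the two rows for the same person stay distinct after the union.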
Walkthrough - Operator #3

• Add column to data
• Similar to “join” but join target is a topic

EXTEND(“publications”, col=0)
serge abiteboul    inria         1996    “Large Scale P2P Dist…”
michael adiba      …grenoble     1996    “Exploiting bitemporal…”
antonio albano     …pisa         1996    “Another Example of a…”
serge abiteboul    inria         2005    “Large Scale P2P Dist…”
anastassia ail…    carnegie…     2005    “Efficient Use of the…”
gustavo alonso     etz zurich    2005    “A Dynamic and Flexible…”
…                  …             …       …

• User has integrated data sources with little effort
• No wrappers; data was never intended for reuse
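
To show the shape of EXTEND, here is a hedged sketch in which the "topic" selects among candidate lookup tables and the value in column `col` is the join key. The `sources` structure (pairs of a topic label and a key-to-value mapping) is a hypothetical stand-in for the Web tables the real operator would have to find and rank, and exact key matching is a simplification.

```python
def extend(table, topic, col, sources):
    """Append one column: for each row, look up row[col] in a source about `topic`."""
    extended = []
    for row in table:
        key = row[col]
        new_value = None  # stays None when no source covers this key
        for source_topic, mapping in sources:
            if source_topic == topic and key in mapping:
                new_value = mapping[key]
                break
        extended.append(row + (new_value,))
    return extended

# Values copied from the walkthrough, including the slide's elisions.
table = [("serge abiteboul", "inria", "1996"),
         ("gustavo alonso", "etz zurich", "2005")]
sources = [("publications", {"serge abiteboul": "Large Scale P2P Dist…"})]
result = extend(table, "publications", 0, sources)
```

Unlike a relational join, the caller names a topic rather than a specific table, so which source supplies each value is decided by the operator.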
CONTEXT Algorithms

• Input: table and source page
• Output: data values to add to table
• SignificantTerms sorts terms in source page by “importance” (tf-idf)
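
A small sketch of ranking a source page's terms by tf-idf, in the spirit of SignificantTerms. The whitespace tokenizer, the smoothed idf formula, the background corpus, and the function name are assumptions of this example, not details given in the talk.

```python
import math
from collections import Counter

def significant_terms(source_page, corpus, k=3):
    """Return the k terms of source_page with the highest tf-idf score."""
    tf = Counter(source_page.lower().split())
    n_docs = len(corpus)
    doc_terms = [set(doc.lower().split()) for doc in corpus]

    def idf(term):
        # Smoothed inverse document frequency over the background corpus.
        df = sum(term in terms for terms in doc_terms)
        return math.log((1 + n_docs) / (1 + df)) + 1

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return ranked[:k]

page = "vldb 1996 program committee vldb 1996"
background = ["program committee members",
              "the program of the conference",
              "committee members list"]
top = significant_terms(page, background, k=2)
```

Terms that are frequent on the source page but rare elsewhere, such as a conference year, rise to the top, which is what makes them candidates for a new column value.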
Related View Partners

• Looks for different “views” of same data
CONTEXT Experiments
Data Integration as UI

• Compelling for db researchers, but will large numbers of people use it?
Conclusion

• Automatic Web KBs rapidly progressing
  • Recall still not good enough for many tasks, but progress is rapid
  • Not clear what those tasks should be, and progress is much slower
    • Difficult to predict what’s useful
    • Sometimes difficult to write a “new app” paper
• Omnivore’s approach not wrong, but did not directly address these problems