Building Data Integration
Systems for the Web
Alon Halevy
Google
NSF Information Integration Workshop
April 22, 2010
Without (too much) Loss of Generality
Web

Enterprise, Science projects, …
Information integration ≅ data management
A Few Principles
• Data management “in situ”
– Data meaning is derived from its context
– Manipulate data in its natural location
• Pay-as-you-go data management
– Provide services before modeling is done
– Data can be about any domain
• Collaboration should be built in
– Query answering is only the first step
Alex Labrinidis
@via Facebook
Structured Data & The Web
• Discover: hard to find structured data via search engines
• Publish: requires infrastructure; concerns about losing control
• Extract: data is embedded in web pages, behind forms
• Manage, Analyze, Combine: hard to query, visualize, combine data across organizations
Outline
• Surfacing the Deep Web
• Searching tables on the surface Web
• Fusion Tables: a platform for data
management on the Web.
What is the Deep Web?
• Deep = not accessible through general purpose search
engines
– Major gap in the coverage of search engines.
used cars
store locations
radio stations
patents
recipes
Tree Search
Amish quilts
Parking tickets in India
Horses
Solution Constraints
• Can’t design a solution that requires
domain engineering
– (unless you can make money in that
domain!)
• Boundaries between domains are fuzzy
• Solution needs to be integrated into
general web search
– Can’t assume special query syntax
Surfacing the Deep Web
[Madhavan et al. VLDB 2008]
• Surfacing:
– Find high-quality forms
– Guess good queries to submit
– Put the resulting HTML pages in the index
• ~3M sites, 50 languages, 700 domains.
• 1,000 queries per second get results from the
deep web.
• 400K forms served per day, 800K per week
• Impact mostly on the long and heavy tail of
queries
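The "guess good queries to submit" step can be pictured with a small sketch. This is only an illustration under stated assumptions, not the method of [Madhavan et al. VLDB 2008]: it assumes a GET form, and the function name `surfacing_urls`, the form address, and the candidate values are all hypothetical. It enumerates combinations of candidate input values and turns each into a URL whose result page could be fetched and indexed.

```python
from itertools import product
from urllib.parse import urlencode


def surfacing_urls(action, inputs, limit=100):
    """Enumerate GET-form submission URLs (hypothetical helper).

    action: the form's action URL.
    inputs: dict mapping each form input name to a list of candidate values.
    limit:  cap on the number of URLs, since the cross product grows quickly.
    """
    names = sorted(inputs)  # fixed order so the output is deterministic
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        query = urlencode(dict(zip(names, combo)))
        urls.append(action + "?" + query)
        if len(urls) >= limit:
            break
    return urls


# Hypothetical used-car form with two inputs.
candidate_urls = surfacing_urls(
    "http://example.com/search",
    {"make": ["honda", "ford"], "zip": ["94043"]},
)
```

The cap matters in practice: a form with several inputs yields a cross product far larger than anything worth indexing, which is why choosing good values is the hard part.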
Deep Web: The Future
• Still an opportunity to go deeper into the
deep web:
– E.g., map the user query into a form
submission.
• Key challenge: given a keyword query,
map it to forms in any domain
• Understanding the meaning of forms is
still hard (e.g. content, geo constraints).
Outline
• Surfacing the Deep Web
• Searching tables on the surface Web
• Fusion Tables: a platform for data
management on the Web.
Bad table
Vertical Tables
Sub-Header Rows
Winners of the Boston Marathon
(but that’s nowhere in the table)
Schema OK, but context is subtle
(year = 2006)
WebTables: Exploring the Relational Web
[Cafarella et al., VLDB 2008, WebDB 08]
• In a corpus of 14B raw tables, we estimate
154M are “good” relations
– Single-table databases; Schema = attr labels + types
– Largest corpus of databases & schemas we know of
• The WebTables system:
– Recovers good relations from crawl and enables search
– Builds novel apps on the recovered data
(Web-scale) Schema Collection
With 2.6 million schemas you can do some
very interesting things.
Synonym discovery
Input context: synonym pairs
• name: e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
• instructor: course-title|title, day|days, course|course-#, course-name|course-title
• elected: candidate|name, presiding-officer|speaker
• ab: k|so, h|hits, avg|ba, name|player
• sqft: bath|baths, list|list-price, bed|beds, price|rent
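One intuition behind pairs like these is that two synonymous attribute names tend to appear alongside the same other attributes, yet rarely appear together in one schema. The sketch below is a simplified illustration of that intuition, not the scoring function used by WebTables. The function name `synonym_candidates`, the Jaccard overlap score, and the toy schemas are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(schemas, context):
    """Score attribute pairs as synonym candidates (illustrative heuristic).

    Only schemas containing `context` are considered. A pair qualifies if the
    two attributes never share a schema; its score is the Jaccard overlap of
    the other attributes each one appears with.
    """
    relevant = [set(s) for s in schemas if context in s]
    neighbors = defaultdict(set)  # attribute -> attributes seen alongside it
    cooccur = set()               # unordered pairs seen in the same schema
    for schema in relevant:
        others = schema - {context}
        for attr in others:
            neighbors[attr] |= others - {attr}
        cooccur.update(frozenset(p) for p in combinations(sorted(others), 2))

    scored = []
    for a, b in combinations(sorted(neighbors), 2):
        if frozenset((a, b)) in cooccur:
            continue  # attributes used together are unlikely to be synonyms
        union = neighbors[a] | neighbors[b]
        overlap = neighbors[a] & neighbors[b]
        if overlap:
            scored.append((len(overlap) / len(union), a, b))
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


# Toy schemas, invented for illustration.
toy_schemas = [
    {"name", "email", "phone"},
    {"name", "e-mail", "phone"},
    {"name", "email", "telephone"},
]
pairs = [(a, b) for _, a, b in synonym_candidates(toy_schemas, "name")]
```

On the toy input the surviving pairs are e-mail/email and phone/telephone, because each pair shares neighbors without ever co-occurring.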
“KR”-Based Table Search
[Wu, Madhavan, Miao, Pasca, Shen]
• Ideally, we describe every table:
– Class of entities it contains
– Properties being modeled
– Context, quality, …
• Use Web-extracted knowledge bases
– Extract isa-hierarchy using patterns:
– “cities such as Paris and London”
– “chemical elements including hydrogen
and oxygen”
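The two example phrases above can be matched with a simple textual pattern. The following is a minimal sketch, assuming single-word instances and class names of at most two words; the regular expression and the name `extract_isa` are illustrative and are not the extractor used to build the knowledge base described in the talk.

```python
import re

# "<class> such as|including <inst>, <inst> and <inst>"
# Assumption: the class is the one or two words right before the cue phrase,
# which is naive (a real system would find the noun phrase properly).
ISA_PATTERN = re.compile(
    r"(?P<cls>\w+(?: \w+)?) (?:such as|including) "
    r"(?P<insts>\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)


def extract_isa(text):
    """Return (instance, class) pairs found by the pattern above."""
    pairs = []
    for match in ISA_PATTERN.finditer(text):
        cls = match.group("cls")
        # Split the instance list on ", " and on "and"/"or" separators.
        for inst in re.split(r",? (?:and|or) |, ", match.group("insts")):
            pairs.append((inst, cls))
    return pairs


city_pairs = extract_isa("cities such as Paris and London")
element_pairs = extract_isa("chemical elements including hydrogen and oxygen")
```

Because the class capture is purely positional, a sentence such as "We visited cities such as Paris" would yield the noisy class "visited cities", one reason pattern-based extraction needs aggregation over many occurrences.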
Step 1: Find “Subject” of Table
Not always the leftmost (or the first non-numeric) column
Step 2: Associate Classes with Subject
Chemical elements
Most of the time, the class labels are not in the attribute names
Leveraging Web-extracted
Ontologies
• Given a query, e.g., (country, GDP)
– Rank tables about countries that have
GDP somewhere in the schema.
– Very high precision (~90%)
• Next challenge: understand binary
properties and binary relationships.
• Domain specialization:
– System should improve if given ontologies
in a particular domain.
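A query such as (country, GDP) can be read as "tables whose subject column holds countries and whose schema mentions GDP". The sketch below illustrates that reading only. The table records, the per-class confidence scores, and the name `search_tables` are hypothetical, and the actual ranking function is not described on the slide.

```python
def search_tables(tables, cls, prop):
    """Rank tables for a (class, property) query (illustrative only).

    Each table is a dict with:
      "id":      an identifier,
      "classes": dict of class label -> confidence for the subject column
                 (assumed to come from a web-extracted is-a hierarchy),
      "schema":  list of attribute names.
    """
    hits = []
    for table in tables:
        class_score = table["classes"].get(cls, 0.0)
        has_prop = any(prop.lower() in attr.lower() for attr in table["schema"])
        if class_score > 0 and has_prop:
            hits.append((class_score, table["id"]))
    # Highest class confidence first; ties broken by id for determinism.
    return [tid for _, tid in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Invented tables and scores, for illustration.
toy_tables = [
    {"id": "t1", "classes": {"country": 0.9}, "schema": ["Country", "GDP (USD)"]},
    {"id": "t2", "classes": {"country": 0.6}, "schema": ["nation", "gdp"]},
    {"id": "t3", "classes": {"city": 0.8}, "schema": ["city", "GDP"]},
    {"id": "t4", "classes": {"country": 0.95}, "schema": ["country", "population"]},
]
ranked = search_tables(toy_tables, "country", "GDP")
```

Note that t2 is found even though "country" never appears in its schema, which is the point of attaching class labels to the subject column rather than relying on attribute names.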
Combine Search, Extraction,
Cleaning and Integration
[Cafarella, Khoussainova, H., VLDB 2009]
• Try to create a database of all
“VLDB program committee members”
Outline
• Surfacing the Deep Web
• Searching tables on the surface Web
• Fusion Tables: a platform for data
management on the Web.
Data Management for the Web Era
• Integrate seamlessly with the Web:
– Search, maps, …
• Easy to use:
– Much broader user base, pay-as-you-go
– Very simple data integration
• Provide incentives for sharing data
• Facilitate collaboration
Fusion Tables – our current attempt
[Madhavan, Gonzalez, Langen, Shapley, Shen]
Incentive
We store and leverage a large collection of tables.
Incentive, Pay-..-Go
Coffee Production
Coffee Consumption
Seamless integration with other web tools
Toilet heat map…
Database functionality on map
Collaboration
Table Search
Show up in
search results!
Data Integration
Merged Table
Carries attribution from both base tables. Owners maintain
control of their own data.
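Conceptually, a merged table is a join on a shared key column that remembers where each side came from. The following is a minimal sketch of that idea, not the Fusion Tables API: the function `merge_tables`, the table layout, the owner names, and the numeric values are all made up for illustration.

```python
def merge_tables(left, right, key):
    """Join two tables on `key`, keeping attribution (illustrative only).

    Each table is a dict with "owner" (who maintains it) and "rows"
    (a list of dicts). Only keys present in both tables are kept.
    """
    index = {}
    for row in right["rows"]:
        index.setdefault(row[key], []).append(row)

    merged_rows = []
    for left_row in left["rows"]:
        for right_row in index.get(left_row[key], []):
            merged_rows.append({**left_row, **right_row})

    # The merged view records both owners; the base tables are not modified.
    return {"rows": merged_rows, "attribution": [left["owner"], right["owner"]]}


# Placeholder data; the numbers are not real statistics.
production = {
    "owner": "owner_a",
    "rows": [
        {"country": "Brazil", "production": 1},
        {"country": "Kenya", "production": 2},
    ],
}
consumption = {
    "owner": "owner_b",
    "rows": [{"country": "Brazil", "consumption": 3}],
}
merged = merge_tables(production, consumption, "country")
```

Building the result as a new structure, rather than editing either input, mirrors the slide's point that owners keep control of their own data.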
Fine Grained Discussions
Example Uses of Fusion Tables
• Tracking potholes in Spain
• Displaying bike routes (MTBGuru)
• State of California statistics
• Government data from data.gov
• Data about voting locations in the USA
• Brazilian beaches
• Chicago homicides
• Most requested pop songs by year
Conclusions
• Information integration “in situ”
– Blur the boundary between structured and
unstructured data
• Combine search, extraction, cleaning
and integration into a single experience
• Pay-as-you-go: introduce complexity as
needed
– Serve enterprises without IT depth
• OpenII – an open-source platform for
information integration.
References
• Fusion Tables:
– tables.googlelabs.com
– SIGMOD, SoCC, 2010
• Deep-web crawling:
– [Madhavan et al., VLDB 08]
• WebTables:
– [Cafarella et al., VLDB 08]
• Octopus:
– [Cafarella et al., VLDB 09]
– [Elmeleegy et al., VLDB 09]