Tables to Linked Data
Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi
University of Maryland, Baltimore County
http://ebiquity.umbc.edu/paper/html/id/474/
0
Age of Big Data
• Availability of massive amounts of data is driving
many technical advances on the Web and off
• Extracting linked data from text and tables will help
• Databases & spreadsheets are obvious table sources,
but many are in documents and Web pages, too
• A recent Google study found over 14B HTML tables
M. Cafarella, A. Halevy, D. Wang, E. Wu, Y. Zhang, WebTables:
exploring the power of tables on the Web, VLDB, 2008.
• Only about one in 100 had high-quality relational data,
but these could be reliably identified by an ML-trained
classifier, resulting in ~150M tables
1
Problem: given a table of data
2
Goal: Generate linked data
@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dbpo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .

dbp:Boston <http://dbpedia.org/ontology/PopulatedPlace/leaderName> dbp:Thomas_Menino ;
    cyc:partOf dbp:Massachusetts ;
    dbpo:populationTotal "610000"^^xsd:integer .
dbp:New_York_City …
...
• Use classes, properties and instances from a linked
data collection, e.g. DBpedia + Cyc + Geonames + ...
• Confirm existing facts and discover new ones
• Create new entities as needed
• Create new relations when possible (harder)
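As a rough illustration of the goal, the sketch below serializes one interpreted table row as Turtle using plain string formatting. The helper name `to_turtle` and the `PREFIXES` dictionary are hypothetical; the prefixes and the Boston facts are copied from the slide. The leaderName property is written as a full IRI because its local name contains a slash.

```python
# Prefixes copied from the slide; the helper itself is a hypothetical sketch.
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
    "dbpo": "http://dbpedia.org/ontology/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "cyc": "http://www.cyc.com/2004/06/04/cyc#",
}


def to_turtle(subject, predicate_objects):
    """Serialize one subject and its (predicate, object) pairs as Turtle text."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    lines.append("")
    # Statements about the same subject are separated by ';' and closed by '.'
    body = " ;\n    ".join(f"{pred} {obj}" for pred, obj in predicate_objects)
    lines.append(f"{subject} {body} .")
    return "\n".join(lines)


boston = to_turtle(
    "dbp:Boston",
    [
        ("<http://dbpedia.org/ontology/PopulatedPlace/leaderName>", "dbp:Thomas_Menino"),
        ("cyc:partOf", "dbp:Massachusetts"),
        ("dbpo:populationTotal", '"610000"^^xsd:integer'),
    ],
)
print(boston)
```

A real system would more likely build the graph with an RDF library; string formatting is used here only to keep the sketch dependency-free.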
3
What data do we want?
• Find relationships between columns, e.g. dbpo:largestCity
• Link cell values to entities, e.g. dbp:Boston, dbp:Massachusetts
4
What evidence can we find?
• Column one’s type is populated place, or is it US
city, or a reference to an NBA team?
5
What do we want to extract?
• Column one’s type is populated place, or is it US city,
or a reference to an NBA team?
• Column two’s type is person (or politician?) but is
‘mayor’ a type or a relation and if the latter, to what?
• Rows give important evidence too: Menino has a
stronger connection to Boston than Massachusetts
• Both cities and states have populations, …
5
A Web of Evidence
• Table: Column headers, cell values, column
position, column adjacency
• Language: headers have meaning, synonyms, …
• Ontologies: capitalOf is a 1:1 relation between
a GPE region and a city
• Significance: PageRank-like metrics bias linking
• Facts: the linked data KB asserts Boston is in MA and
that Boston’s population is close to 610K
• Graph analysis: PMI between Boston & Menino
is much higher than for Massachusetts
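The graph-analysis evidence can be made concrete with a small pointwise mutual information (PMI) calculation. This is a minimal sketch that assumes PMI is estimated from co-occurrence counts (for example, pages that mention or link to both entities); the slides do not say how the counts are obtained, and every number below is invented for illustration.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts, not taken from any real knowledge base.
TOTAL_PAGES = 1_000_000
pmi_boston_menino = pmi(count_xy=400, count_x=5_000, count_y=500, total=TOTAL_PAGES)
pmi_mass_menino = pmi(count_xy=100, count_x=20_000, count_y=500, total=TOTAL_PAGES)

print(pmi_boston_menino > pmi_mass_menino)
```

With counts shaped like these, the Boston/Menino pair scores higher than the Massachusetts/Menino pair, which is the kind of row-level signal the bullet describes.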
6
Approach
1. Input: table headers and rows
2. Query the knowledge base
3. Predict a class for each column
4. Re-query the knowledge base using the new evidence
5. Link each cell value to an entity using the new results obtained
6. Identify relationships between columns
7. Output: linked data
7
Wikitology
• A hybrid KB of structured &
unstructured information
extracted from Wikipedia
• Augmented with knowledge
from DBpedia, Freebase, Yago
and WordNet
• The interface is via a specialized
IR index
• Good for systems that need to
do a combination of reasoning
over text, graphs and RDF data
8
Querying the Knowledge Base
• For every cell in the table, query Wikitology with:
cell value + column header + row content
• E.g. Baltimore + City + MD + S.Dixon + 640,000
• Wikitology returns the top N entities, their types
and PageRank (we use N = 5), e.g.
1. Baltimore_Maryland
2. Baltimore_County
3. John_Baltimore
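The query construction can be sketched as follows. The function name `build_query` and the "State" and "Population" headers are assumptions made for the example; only the "City" header and the Baltimore row values come from the slide. How the terms are actually submitted to the Wikitology index is not shown here.

```python
def build_query(table, headers, row_idx, col_idx):
    """Combine the cell value, its column header and the rest of the row."""
    row = table[row_idx]
    cell = row[col_idx]
    # Row content: every other value in the same row, in table order.
    row_context = [value for i, value in enumerate(row) if i != col_idx]
    return [cell, headers[col_idx], *row_context]


# Headers other than "City" are hypothetical.
headers = ["City", "State", "Mayor", "Population"]
table = [["Baltimore", "MD", "S.Dixon", "640,000"]]

query = build_query(table, headers, row_idx=0, col_idx=0)
print(" + ".join(query))
```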
9
Predicting Classes for Columns
• Set of classes per column
• Score the classes
• Choose the top class from each of the four
vocabularies: DBpedia, Freebase, WordNet and Yago
• Score = w × (1 / R) + (1 - w) × PageRank
R: entity’s rank
E.g. [Baltimore, dbpedia:Area] = 0.89
• Select the class that maximizes its sum of score over
the entire column, e.g.
[Baltimore, dbpedia:Area] + [Boston, dbpedia:Area]
+ [New York, dbpedia:Area] = 2.85
• Column: City
Candidate classes: dbpedia-owl:Place, dbpedia-owl:Area,
yago:AmericanConductors, yago:LivingPeople,
dbpedia-owl:PopulatedPlace, dbpedia-owl:Band,
dbpedia-owl:Organisation, ...
Top classes chosen:
Dbpedia:PopulatedPlace
...
Wordnet:City
...
Freebase:Location
Yago:CitiesinUnitedStates
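The scoring and column-level selection might be sketched like this. Several details are assumptions rather than anything stated on the slide: the weight w = 0.5, PageRank normalized to [0, 1], taking the best-scoring entity when several candidates for one cell share a class, and identifying a vocabulary by the prefix before the colon. All candidate data is invented.

```python
from collections import defaultdict


def class_score(rank, page_rank, w=0.5):
    """Score = w * (1 / R) + (1 - w) * PageRank, where R is the entity's rank."""
    return w * (1.0 / rank) + (1.0 - w) * page_rank


def column_class_totals(column_results, w=0.5):
    """Sum per-cell class scores over a whole column.

    column_results holds one list per cell; each list contains
    (rank, page_rank, types) tuples for that cell's top-N entities.
    """
    totals = defaultdict(float)
    for cell_results in column_results:
        best_for_cell = {}
        for rank, page_rank, types in cell_results:
            score = class_score(rank, page_rank, w)
            for cls in types:
                # Assumption: one score per [cell, class], the best among candidates.
                best_for_cell[cls] = max(best_for_cell.get(cls, 0.0), score)
        for cls, score in best_for_cell.items():
            totals[cls] += score
    return dict(totals)


def top_class_per_vocabulary(totals):
    """Pick the highest-total class within each vocabulary prefix."""
    best = {}
    for cls, total in totals.items():
        vocab = cls.split(":", 1)[0]
        if vocab not in best or total > totals[best[vocab]]:
            best[vocab] = cls
    return best


# Hypothetical results for a two-cell "City" column.
column_results = [
    [(1, 0.8, {"dbpedia-owl:PopulatedPlace", "dbpedia-owl:Place"}),
     (2, 0.3, {"dbpedia-owl:Band"})],
    [(1, 0.9, {"dbpedia-owl:PopulatedPlace"}),
     (3, 0.2, {"yago:LivingPeople"})],
]
totals = column_class_totals(column_results)
chosen = top_class_per_vocabulary(totals)
print(chosen)
```

Summing over the column lets a class that is plausible for every cell beat one that scores well for only a single ambiguous cell.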
10
Linking table cell to entities
• Once the classes are predicted, we re-query the
knowledge base with this new evidence
• Along with the original query, we also include the
predicted types
• We pick the highest ranking entity which matches
the predicted type from the new results
• For every cell in the table, the query to Wikitology is:
cell value + column header + row content +
predicted column type
• Wikitology returns the top N entities and their
types (we use N = 5)
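The selection step can be sketched as a simple filter over the ranked results. The function name `link_cell`, the candidate ordering and the types attached to each candidate are hypothetical; returning `None` when nothing matches is also an assumption, loosely motivated by the earlier goal of creating new entities as needed.

```python
def link_cell(candidates, predicted_types):
    """Return the highest-ranked candidate whose types include a predicted type.

    candidates is a best-first list of (entity, types) pairs.
    """
    for entity, types in candidates:
        if predicted_types & set(types):
            return entity
    return None  # no candidate matches the predicted column type


# Hypothetical re-query results for the cell "Baltimore".
candidates = [
    ("John_Baltimore", {"yago:LivingPeople"}),
    ("Baltimore_Maryland", {"dbpedia-owl:PopulatedPlace"}),
    ("Baltimore_County", {"dbpedia-owl:PopulatedPlace"}),
]
linked = link_cell(candidates, {"dbpedia-owl:PopulatedPlace"})
unlinked = link_cell(candidates, {"dbpedia-owl:Band"})
print(linked, unlinked)
```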
11
Preliminary results: entity linking
• In a preliminary evaluation, we
used 5 Google Squared tables
comprising 23 columns and 39
rows, comparing our results
with human judgments
• The next evaluation will be on selected
tables from the Google collection of >2500 involving 6
domains: bibliography, car,
course, country, movie, people
Task                                     Accuracy
Class prediction for columns: DBpedia    85.7%
Class prediction for columns: Freebase   90.5%
Class prediction for columns: WordNet    71.4%
Class prediction for columns: Yago       71.4%
Entity linking                           76.6%
12
Ongoing and Future work
• Identifying relationships between columns
• Modules for common special cases, e.g.
numbers, acronyms, phone numbers, stock
symbols, email addresses, URLs, etc.
• Replace heuristics by machine learning
techniques for combining evidence and
clustering
• Strategy for dealing with errors
13
Conclusion
• There’s lots of data stored in tables: in spreadsheets, databases, Web pages and documents
• In some cases we can interpret them and
generate a linked data representation
• In others we can at least link some cell values
to LOD entities
• This can help contribute data to the Web in a
form that is easy for machines to understand
and use
14