Linked Open Data - UMBC ebiquity research group


The Semantic Web:
there and back again
Tim Finin
University of Maryland, Baltimore County
Joint work with Lushan Han, Varish Mulwad, Anupam Joshi
http://ebiq.org/r/353
LOD 123: Making the
semantic web easier to use
Semantic Web: then and now
• Ten years ago we developed complex ontologies used to
  encode and reason over small datasets of 1000s of facts
• Recently the focus has shifted to using simple ontologies
  and minimal reasoning over very large datasets of 100s of
  millions of facts
• Major companies are moving: Google Knowledge Graph,
  Facebook Open Graph, Microsoft Satori, Apple Siri KB,
  IBM Watson KB
⇒ Linked open data, or “Things, not strings”
Linked Open Data (LOD)
• Linked data is just RDF data, lots of it, with a small schema
• RDF data is a graph of triples (subject, predicate, object)
  – URI URI String: dbr:Barack_Obama dbo:spouse “Michelle Obama”
  – URI URI URI: dbr:Barack_Obama dbo:spouse dbr:Michelle_Obama
• Best linked data practice prefers the 2nd pattern, using
  nodes rather than strings for “entities”
  – Things, not strings!
• Linked open data is just linked data that is freely accessible
  on the Web along with its ontologies
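A minimal sketch of the two triple patterns above, written with the Python
rdflib library (our choice for illustration, not something prescribed by the
slides):

# Build both triple patterns with rdflib and print them as Turtle.
from rdflib import Graph, Literal, Namespace

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
# String object: easy to produce, but the value is opaque to machines.
g.add((DBR.Barack_Obama, DBO.spouse, Literal("Michelle Obama")))
# URI object (preferred): the object is a node other facts can attach to.
g.add((DBR.Barack_Obama, DBO.spouse, DBR.Michelle_Obama))
print(g.serialize(format="turtle"))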
Semantic Web (~1997: Semantic Web beginning)
• Use Semantic Web technology to publish shared data & knowledge
• Semantic web technologies allow machines to share data and
  knowledge using common web languages and protocols
Semantic Web => Linked Open Data
Use Semantic Web technology to publish shared data & knowledge;
data is interlinked to support integration and fusion of knowledge
[LOD cloud diagrams over time]
• 2007: LOD beginning
• 2008: LOD growing
• 2009: … and growing
• 2010: … growing faster; LOD is the new Cyc: a common source of
  background knowledge
• 2011: 31B facts in 295 datasets interlinked by 504M assertions
  on ckan.net
Exploiting LOD not (yet) Easy
• Publishing or using LOD data has
inherent difficulties for the potential user
– It’s difficult to explore LOD data and to query it for
answers
– It’s challenging to publish data using appropriate
LOD vocabularies & link it to existing data
• Problem: O(10^4) schema terms, O(10^11) instances
• I’ll describe two ongoing research projects that
are addressing these problems
GoRelations:
Intuitive Query System
for Linked Data
Research with Lushan Han
http://ebiq.org/j/93
DBpedia is the Stereotypical LOD
• DBpedia is an important example of Linked Open Data
  – Extracts structured data from infoboxes in Wikipedia
  – Stores it in RDF using custom ontologies and Yago terms
• The major integration point for the entire LOD cloud
• Explorable as HTML, but harder to query in SPARQL
DBpedia
[Screenshot: browsing DBpedia’s Mark Twain page]
Why it’s hard to query LOD
• Querying DBpedia asks a lot of the user
  – Understand the RDF model
  – Master SPARQL, a formal query language
  – Understand ontology terms: 320 classes & 1600 properties!
  – Know instance URIs (>2M entities!)
  – Handle term heterogeneity (Place vs. PopulatedPlace)
• Querying large LOD sets is overwhelming
• Natural language query systems are still a research goal
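For example, even a simple question like “Where were the authors of books
born?” requires something like the query below, shown via the Python
SPARQLWrapper client (the endpoint, prefixes, and property choices are
illustrative assumptions, not fixed by the slides):

# Illustrative SPARQL against the public DBpedia endpoint using SPARQLWrapper.
# The user must already know that dbo:Book, dbo:author and dbo:birthPlace exist.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?author ?place WHERE {
        ?book   a dbo:Book ;
                dbo:author ?author .
        ?author dbo:birthPlace ?place .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["author"]["value"], "-", row["place"]["value"])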
Goal
• Let users with a basic understanding of RDF
query DBpedia and other LOD collections
– Explore what data is in the system
– Get answers to questions
– Create SPARQL queries for reuse or adaptation
• Desiderata
– Easy to learn and to use
– Good accuracy (e.g., precision and recall)
– Fast
Key Idea
Structured keyword queries reduce
problem complexity:
– User enters a simple graph, and
– Annotates the nodes and arcs with
words and phrases
Structured Keyword Queries
• Nodes are entities and links are binary relations
• Entities are described by two unrestricted terms:
  a name or value and a type or concept
• Output nodes are marked with ?
• A compromise between a natural language Q&A
  system and a formal query language
  – Users provide the compositional structure of the question
  – They are free to use their own terms to annotate the structure
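To make this concrete, a structured keyword query such as “places where the
authors of books were born” could be held in memory as a small annotated
graph; the field names below are hypothetical, not the GoRelations format:

# Hypothetical in-memory form of a structured keyword query: three annotated
# nodes and two labeled edges; the output node is the one marked with "?".
query = {
    "nodes": {
        "n1": {"concept": "Place",  "name": None, "output": True},   # ?Place
        "n2": {"concept": "Author", "name": None, "output": False},
        "n3": {"concept": "Book",   "name": None, "output": False},
    },
    "edges": [
        ("n2", "born in", "n1"),   # Author --born in--> Place
        ("n2", "wrote",   "n3"),   # Author --wrote-->   Book
    ],
}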
Translation – Step One
finding semantically similar ontology terms
For each graph concept/relation, generate the k most semantically
similar ontology classes/properties
Lexical similarity metric based on distributional similarity, LSA, and WordNet
Semantic similarity: http://bit.ly/SEMSIM
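The published metric combines distributional similarity, LSA, and WordNet;
the sketch below shows only a WordNet-based stand-in, using NLTK (an
assumption made for illustration, not the actual system code):

# WordNet path similarity via NLTK as a stand-in for the lexical similarity
# metric; the real metric also uses distributional similarity and LSA.
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def wordnet_similarity(word1, word2):
    """Maximum path similarity over all synset pairs (0.0 if none found)."""
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

# Rank candidate ontology classes for the user concept "author"
candidates = ["Writer", "Person", "Place"]
print(sorted(candidates, key=lambda c: wordnet_similarity("author", c.lower()),
             reverse=True))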
Semantic Textual Similarity task
• 2013 Joint Conference on Lexical and Computational Semantics (*SEM)
• Do two sentences have the same meaning? (scored 0…5)
1: “The woman is playing the violin” vs. “The young lady enjoys
listening to the guitar”
4: "In May 2010, the troops attempted to invade Kabul” vs.
"The US army invaded Kabul on May 7th last year, 2010"
• 2012: 35 teams, 88 runs, 2013: 36 teams, 89 runs
• 2250 sentence pairs from four domains
• Our three runs ranked #1, #2 and #4

Dataset                  PairingWords   Galactus      Saiyan
Headlines (750 pairs)    0.7642 (3)     0.7428 (7)    0.7838 (1)
OnWN (561 pairs)         0.7529 (5)     0.7053 (12)   0.5593 (36)
FNWN (189 pairs)         0.5444 (3)     0.5818 (1)    0.5815 (2)
SMT (750 pairs)          0.3804 (8)     0.3705 (11)   0.3563 (16)
Weighted mean            0.6181 (1)     0.5927 (2)    0.5683 (4)
(score, with rank among all submitted runs in parentheses)
Translation – Step Two
disambiguation algorithm
• Assemble the best interpretation using statistics of the data
• Use pointwise mutual information (PMI) between RDF terms
  in the LOD collection
  – Measures the degree to which two RDF terms co-occur in the
    knowledge base
• In a good interpretation, the ontology terms associate with one
  another the way their corresponding user terms connect in the query
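The PMI itself is just a log-ratio of co-occurrence to independent occurrence;
a sketch follows (the counting functions over the triple store are assumed and
not shown):

# Pointwise mutual information between two RDF terms, from co-occurrence counts.
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ); -inf if x and y never co-occur."""
    if count_xy == 0:
        return float("-inf")
    return math.log((count_xy / total) / ((count_x / total) * (count_y / total)))

# e.g. pmi(cooccur("dbo:Writer", "dbo:birthPlace"),
#          count("dbo:Writer"), count("dbo:birthPlace"), total_subjects)
# where cooccur / count / total_subjects are hypothetical KB statistics.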
Translation – Step Two
disambiguation algorithm
Three aspects are combined to derive an overall goodness
measure for each candidate interpretation:
• Joint disambiguation
• Resolving direction
• Link reasonableness
Translation result
Concepts: Place => Place, Author => Writer, Book => Book
Properties: born in => birthPlace, wrote => author (inverse direction)
SPARQL Generation
The translation of a semantic graph query to SPARQL is
straightforward given the mappings
Concepts
• Place => Place
• Author => Writer
• Book => Book
Relations
• born in => birthPlace
• wrote => author
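Given those mappings, the generated query would look roughly like the
following (a sketch assuming the standard DBpedia dbo: namespace; the actual
generator's output may differ in detail):

# Roughly the SPARQL produced for "places where the authors of books were born".
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?place WHERE {
    ?place  a dbo:Place .
    ?author a dbo:Writer ;
            dbo:birthPlace ?place .   # 'born in' => dbo:birthPlace
    ?book   a dbo:Book ;
            dbo:author ?author .      # 'wrote'   => dbo:author (inverse direction)
}
"""
print(query)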
Evaluation
• 33 test questions from the 2011 Workshop on Question Answering
  over Linked Data that are answerable using DBpedia
• Three human subjects unfamiliar with DBpedia translated
the test questions into semantic graph queries
• Compared with two top natural language QA systems:
PowerAqua and True Knowledge
Current work
• The baseline system works well for DBpedia; we’re testing
  a second use case now
• Current work
– Better entity matching
– Relaxing the need for type information
– A better Web interface with user feedback & advice
• See http://ebiq.org/93 for more information &
try our alpha version at http://ebiq.org/GOR
Generating Linked Data
by Inferring the
Semantics of Tables
Research with Varish Mulwad
http://ebiq.org/j/96
Goal: Table => LOD*

Name             Team          Position        Height
Michael Jordan   Chicago       Shooting guard  1.98
Allen Iverson    Philadelphia  Point guard     1.83
Yao Ming         Houston       Center          2.29
Tim Duncan       San Antonio   Power forward   2.11
(Height = player height in meters)

Example annotations:
• Team column class: http://dbpedia.org/class/yago/NationalBasketballAssociationTeams
• Name-Team relation: dbprop:team
• Cell “Allen Iverson”: http://dbpedia.org/resource/Allen_Iverson

* DBpedia
Goal: Table => LOD*

Name             Team          Position        Height
Michael Jordan   Chicago       Shooting guard  1.98
Allen Iverson    Philadelphia  Point guard     1.83
Yao Ming         Houston       Center          2.29
Tim Duncan       San Antonio   Power forward   2.11

RDF Linked Data:
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbo:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbo:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
All this in a completely automated way
* DBpedia
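Emitting that output is mechanical once the interpretation is fixed; a minimal
sketch with the Python rdflib library (our library choice) producing the same
triples:

# Emit the table interpretation above as RDF and print it as Turtle.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

DBPEDIA = Namespace("http://dbpedia.org/resource/")
DBO     = Namespace("http://dbpedia.org/ontology/")
YAGO    = Namespace("http://dbpedia.org/class/yago/")

g = Graph()
g.add((DBO.BasketballPlayer, RDFS.label, Literal("Name", lang="en")))
g.add((YAGO.NationalBasketballAssociationTeams, RDFS.label, Literal("Team", lang="en")))
g.add((DBPEDIA.Michael_Jordan, RDFS.label, Literal("Michael Jordan", lang="en")))
g.add((DBPEDIA.Michael_Jordan, RDF.type, DBO.BasketballPlayer))
g.add((DBPEDIA.Chicago_Bulls, RDFS.label, Literal("Chicago Bulls", lang="en")))
g.add((DBPEDIA.Chicago_Bulls, RDF.type, YAGO.NationalBasketballAssociationTeams))
print(g.serialize(format="turtle"))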
Tables are everywhere!! … yet …
• The web has 154 million high-quality relational tables
• Fewer than 1% of the 400K tables at data.gov have rich
  semantic schemas
A Domain Independent Framework
1. Pre-processing modules: sampling, acronym detection
2. Query and generate initial mappings
3. Joint inference/assignment
4. Generate linked RDF
5. Verify (optional)
6. Store in a knowledge base & publish as LOD
Query and Rank
• Query the knowledge base with each cell string (e.g. “Chicago”,
  “Boston”, “Allen Iverson”) and rank the candidates by
  Rank(String Similarity, Popularity)
• Possible entities for “Chicago”:
  1. Chicago_Bulls
  2. Chicago
  3. Judy_Chicago
• The knowledge base can be replaced by domain-specific or
  other LOD knowledge bases
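A sketch of the ranking idea, combining string similarity with a popularity
prior; the weighting scheme and the popularity scores are assumptions, not the
system's exact formula:

# Rank candidate entities for a cell string by a weighted mix of string
# similarity and popularity (both the weights and scores are illustrative).
from difflib import SequenceMatcher

def rank_candidates(mention, popularity, alpha=0.7):
    """popularity: dict mapping candidate label -> popularity score in [0, 1]."""
    def score(label):
        string_sim = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        return alpha * string_sim + (1 - alpha) * popularity[label]
    return sorted(popularity, key=score, reverse=True)

# e.g. rank_candidates("Chicago", {"Chicago_Bulls": 0.6, "Chicago": 0.9,
#                                  "Judy_Chicago": 0.2})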
Generating candidate ‘types’ for Columns
Team column cells: Chicago, Philadelphia, Houston, San Antonio
• “Chicago” => candidate entities 1. Chicago_Bulls  2. Chicago  3. Judy_Chicago
  => candidate classes {dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist,
     yago:LivingPeople, yago:NationalBasketballAssociationTeams}
• “Philadelphia” => candidate classes {dbpedia-owl:Place,
  dbpedia-owl:PopulatedPlace, dbpedia-owl:Film,
  yago:NationalBasketballAssociationTeams, …}
• “Houston”, “San Antonio” => {…}
Candidate classes for the column (union over the cells’ candidate entities):
dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople,
yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace,
dbpedia-owl:Film, …
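A sketch of how the candidate classes for a column might be collected, with
each row contributing the classes of its candidate entities; get_classes() is
a hypothetical KB lookup that returns a set of class names:

# Collect candidate column classes: each row votes once for every class that
# appears among its cell's candidate entities.
from collections import Counter

def candidate_column_classes(rows_of_candidates, get_classes):
    """rows_of_candidates: for each row, a list of candidate entity URIs."""
    votes = Counter()
    for candidates in rows_of_candidates:
        classes_seen = set()
        for entity in candidates:
            classes_seen.update(get_classes(entity))  # e.g. {"dbpedia-owl:Place", ...}
        votes.update(classes_seen)
    return votes.most_common()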
Joint Inference / Assignment
A graphical model for tables: joint inference over the evidence in a table
[Diagram: variable nodes C1, C2, C3 for the column headers (e.g. Team) and
R11 … R33 for the cell values (e.g. Chicago, Philadelphia, Houston,
San Antonio)]
Parameterized Graphical Model
[Factor graph: variable nodes for the column headers (C1, C2, C3) and the
row values (R11 … R33); factor nodes (ψ) capture
• the affinity between column headers and row values,
• the interaction between row values, and
• the interaction between column headers]
Standard message passing
Joint Assignment: graphical models exploit conditional independences
P(C1, C2, C3, R11, R12, R13, R21, R22, R23, R31, R32, R33)
factors into pieces such as
P(C1, R11, R12, R13), P(C2, R21, R22, R23), P(C3, R31, R32, R33), …
Still … [diagram: the factor over C1 and R11, R12, R13]
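One way to see why the factorization helps: each column's header and cell
assignments can be scored together, independently of the other columns. The
sketch below shows only that per-column piece (row-level and header-header
factors are omitted, and the affinity function is assumed):

# Score one column's factor in isolation: pick the header class and cell
# entities that maximize a local affinity score (a simplification of the model).
from itertools import product

def best_column_assignment(header_candidates, cell_candidates, affinity):
    """cell_candidates: one list of candidate entities per cell in the column."""
    best, best_score = None, float("-inf")
    for header in header_candidates:
        for cells in product(*cell_candidates):
            score = sum(affinity(header, cell) for cell in cells)
            if score > best_score:
                best, best_score = (header, cells), score
    return best, best_score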
Semantic message passing
[Diagram: the variable nodes exchange “change” / “no change” messages]
• Row-value nodes and their current assignments: R11:[Michael_I_Jordan],
  R12:[Yao_Ming], R13:[Allen_Iverson], R21:[Chicago_Bulls], …,
  R31:[Shooting_Guard]
• Column-header nodes: C1:[BasketballPlayer], C2:[NBATeam],
  C3:[BasketballPositions]
• Most messages are “no change”; R11 receives a “change” message because
  Michael_I_Jordan does not fit the BasketballPlayer column
Inference – Example
The column header C1:[Name] is assigned “BasketballPlayer” based on the
row values (Michael_I_Jordan, Yao_Ming, Allen_Iverson) and sends messages
back to the row-value nodes:
• R11:[Michael_I_Jordan] – “Change”: its candidate ranking
  1. Michael_I_Jordan (Professor) … 3. Michael_Jordan (BasketballPlayer)
  must be revised
• R12:[Allen_Iverson] – “No Change”
• R13:[Yao_Ming] – “No Change”
Column header – row value agreement
Current row entities: [Michael_I_Jordan, Allen_Iverson, Yao_Ming]

Step 1: Majority voting over the classes of the row entities
• Michael_I_Jordan votes for LivingPeople, AI_Researchers (+1 each)
• Allen_Iverson and Yao_Ming vote for BasketballPlayer, Athlete,
  LivingPeople (+1 each); other candidate classes include
  GeoPopulatedPlace, WomenArtist, City, PopulatedPlace, ArtWork, Film
• Resulting ranking: 1. LivingPeople  2. BasketballPlayer
  3. GeoPopulatedPlace …

Step 2: Choose the top class
• Yago tie-breaker / re-ordering: choose the more ‘descriptive’ class,
  e.g. BasketballPlayer is better than LivingPeople
• ClassGranularityScore = 1 - […]
• Top Yago class: BasketballPlayer; top DBpedia class: Athlete
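A sketch of the voting step on the example rows (class sets taken from the
slide; the granularity tie-break is only noted in a comment because its exact
formula is not spelled out here):

# Majority voting for a column's class: each row's entity votes for its classes.
from collections import Counter

def vote_column_class(row_entity_classes):
    """row_entity_classes: one set of candidate classes per row entity."""
    votes = Counter()
    for classes in row_entity_classes:
        votes.update(classes)   # +1 per class per row
    return votes.most_common()

rows = [
    {"LivingPeople", "AI_Researchers"},                # Michael_I_Jordan
    {"BasketballPlayer", "Athlete", "LivingPeople"},   # Allen_Iverson
    {"BasketballPlayer", "Athlete", "LivingPeople"},   # Yao_Ming
]
print(vote_column_class(rows))
# LivingPeople gets the most votes; a granularity score would then prefer
# the more descriptive BasketballPlayer.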
Column header – row value agreement
• topClassScore = numberOfVotes / numberOfRows
• Compute topScoreYago & topScoreDBpedia
  – If (topScoreYago || topScoreDBpedia) >= Threshold: update the
    column header
  – If both scores < Threshold: annotation = “No-Annotation”
• Check for alignment: is Athlete a sub/superclass of BasketballPlayer?
  => Column header annotation = BasketballPlayer, Athlete
• Send messages back to the row values:
  – Michael_I_Jordan (LivingPeople, AI_Researchers): “Change”
  – Allen_Iverson, Yao_Ming (BasketballPlayer, Athlete, LivingPeople):
    “No Change”
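The update rule above can be sketched as follows; is_aligned() stands in for
the sub/superclass check, and the final fallback branch is an assumption:

# Decide the column-header annotation from the top Yago and DBpedia classes.
def update_column_header(top_yago, top_dbpedia, n_rows, threshold, is_aligned):
    """top_yago / top_dbpedia are (class, numberOfVotes) pairs."""
    score_yago = top_yago[1] / n_rows        # topClassScore = votes / rows
    score_dbp  = top_dbpedia[1] / n_rows
    if score_yago < threshold and score_dbp < threshold:
        return "No-Annotation"
    if is_aligned(top_yago[0], top_dbpedia[0]):   # sub/superclass of each other?
        return [top_yago[0], top_dbpedia[0]]      # e.g. BasketballPlayer, Athlete
    # Fallback (an assumption): keep whichever vocabulary scored higher.
    return [top_yago[0]] if score_yago >= score_dbp else [top_dbpedia[0]]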
Update row value entity annotations
[Diagram: factor ψ3] After a “CHANGE” message, the candidate entities for
R11 (“Michael Jordan”) are re-ranked against the column’s entity class
(BasketballPlayer or Athlete):
1. Michael_I_Jordan (LivingPeople, AI_Researchers)
2. …
3. Michael_Jordan (BasketballPlayer, Athlete)
=> Michael_Jordan is now selected for R11
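A sketch of that re-ranking step, keeping only the candidates whose classes
overlap the column annotation; get_classes() is again a hypothetical KB
lookup returning a set of class names:

# Re-rank a cell's candidate entities after a "change" message by filtering on
# the column's class annotation (falls back to the original list if none match).
def rerank_cell(candidates, column_classes, get_classes):
    compatible = [e for e in candidates if get_classes(e) & set(column_classes)]
    return compatible or candidates

# e.g. rerank_cell(["Michael_I_Jordan", "Michael_Jordan"],
#                  {"BasketballPlayer", "Athlete"}, get_classes)
# now prefers Michael_Jordan for the "Michael Jordan" cell.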
Evaluation
• Dataset of 80 tables (Wikipedia tables; part of
larger dataset released by IIT-Bombay)
• Evaluated Column Header Annotation
Accuracy
– How good was the mapping of Team to
  NationalBasketballAssociationTeams?
• Evaluated Entity Linking Accuracy
– Mapping Michael Jordan to Michael_Jordan
Column Header Annotation Accuracy
• The system produced a ranked list of Yago & DBpedia classes
• Human judges evaluated each class
• For precision, judges scored each class:
  – 1 if the class was accurate
  – 0.5 if the class was OK, but not the best (e.g., Place vs. City)
  – 0 if it was incorrect
• For recall, the score was 1 if accurate/correct, 0 if incorrect
• Judgments: 522 accurate, 259 okay, 422 incorrect
Top-k classes, F-measure: Column Header Annotations

System            F-measure
Yago, rank 1      0.69
Yago, rank 2      0.49
Yago, rank 3      0.44
DBpedia, rank 1   0.68
DBpedia, rank 2   0.63
DBpedia, rank 3   0.61
GOOG              0.67
IIT-B             0.56
Entity linking accuracy
Example: “Allen Iverson” => http://dbpedia.org/resource/Allen_Iverson

Correctly linked entities      3022
Incorrectly linked entities     959
Total entities                 3981
Other Challenges
• Using table captions and other text in associated documents
  to provide context
• Size of some data.gov tables (> 400K rows!)
makes using full graphical model impractical
– Sample table and run model on the subset
• Achieving acceptable accuracy may require
human input
– 100% accuracy unattainable automatically
– How best to let humans offer advice and/or
correct interpretations?
Final Conclusions
• Linked data is great for sharing structured and
  semi-structured data
– Backed by machine-understandable semantics
– Uses successful Web languages and protocols
• Generating and exploring linked data resources
is challenging
– Schemas are too large, too many URIs
• New tools mapping tables to linked data and
translating structured natural language queries
reduce the barriers
http://ebiq.org/