
Schema Free Querying of Semantic Data
Lushan Han
Advisor: Dr. Tim Finin
May 23, 2014
Road Map
• Introduction
• Related Work
• SFQ Interface
• Schema Network and Association Models
• Query Interpretation
• Evaluation
• Conclusion
Part 1. Introduction
Semantic Data
• A network of entities, annotated with types and interlinked with properties
• The amount of semantic data is steadily increasing
• Examples:
  • RDF semantic data
  • LOD
  • DBpedia
  • Freebase
Objectives
• Develop schema-free query interfaces
  • Work with "semantic data" in many forms, e.g., RDF, Freebase, RDBMS
  • Allow casual users to freely query semantic data without learning its schema
  • Queries should be in the user's conceptual world
• Two existing interfaces:
  • Natural Language Interface (NLI)
  • Keyword Interface
• Three hard problems
P1. No Practical Interface
• Natural language interface
  • NLP techniques are still not reliable enough to parse the full relational structure out of natural language questions
    (e.g., "Who was the author of the Adventures of Tom Sawyer and where was he born?")
• Keyword interface
  • Ambiguity and limited expressiveness
    (e.g., "president children spouse")
SFQ Interface
• Still in the user's conceptual world
• Makes the implicit structure of NL questions explicit
  (e.g., "Who was the author of the Adventures of Tom Sawyer and where was he born?")
P2. Semantic Heterogeneity Problem
• Many different ways to express (model) the same meaning
• Vocabulary and structure mismatches between the user's query and the machine's representation
• Existing methods:
  • Labor-intensive and ad hoc methods
    • Domain-specific syntactic or semantic grammars
    • Mapping lexicons (mapping rules)
    • Templates
  • A thesaurus (e.g., WordNet) is insufficient
P2. Examples
P2. More Examples
A Purely Computational Approach
• Lexical semantic similarity measures
  • Capture flexible semantics
• Statistical association measures
  • Carry out disambiguation
• A novel "overall semantic similarity" or fitness metric that combines
  • lexical semantic similarity measures
  • statistical association measures
  • structure features
• Context-sensitive mapping algorithms
P3. Heterogeneous or Unknown Schema
• Hard to reach consensus on a schema for the world
  • Open-domain semantic data has a heterogeneous or even unknown schema (e.g., Semantic Web data, DBpedia)
• Traditional NLI systems are difficult to apply
• Some modern systems
  • Do not produce formal queries (e.g., SQL or SPARQL)
  • Directly search the entity network for matchings
    • Computationally expensive and ad hoc in nature
The Schema Network
• Learn a schema statistically from the entity network by exploiting co-occurrences
  • The schema itself is also represented as a network
• Map the user's query onto the schema network instead of the entity network
  • Much more scalable
  • Produces formal queries
  • Enables joint disambiguation and a context-sensitive mapping algorithm
Thesis Statement
We can develop an effective and efficient algorithm to map a casual user's schema-free query into a formal knowledge base query language that overcomes vocabulary and structure mismatch problems by exploiting lexical semantic similarity measures, association degree measures and structural features.
Contributions
• An intuitive SFQ interface that avoids the problem of extracting relational structure from NL queries
• Novel algorithms mapping SFQ queries to KB queries that address both vocabulary and structure mismatches
• A novel approach to handling heterogeneous or unknown schemas by building a schema from an entity network
• A definition of the probability of observing a path in a schema network, and two novel statistical association models
• An improved PMI metric and new semantic text similarity measures and algorithms
Part 2. Related Work
Natural Language Interface to Database (NLIDB) Systems
• Early systems in the 70s (e.g., LUNAR and LADDER)
  • Domain-specific syntactic or semantic grammars
  • Heavily customized to a particular application
• Later systems in the 80s and 90s (e.g., TEAM, ASK, MASQUE)
  • More general parsers
  • Require human-crafted lexicons, mapping rules and domain knowledge to interpret the parse tree
  • Allow knowledge engineers or end users to enrich lexicons and add new mapping rules through an interactive interface
  • More portable than early systems
Recent NLI Systems
• PRECISE (DB). NL parsing: tokenizer to get a collection of tokens. Vocabulary mismatch: lexicon. Structure mismatch: bipartite matching. Automatic: yes. Limitations: restricted domains.
• SCISSOR (KB). NL parsing: semantic parser. Vocabulary and structure mismatch: machine learning. Automatic: no. Limitations: very restricted domains; manually annotated training data.
• NaLIX (XML). NL parsing: dependency parser to get adjacent tokens. Vocabulary mismatch: lexicon. Structure mismatch: adjacency matching. Automatic: no. Limitations: very restricted domains.
• ORAKEL (RDF). NL parsing: syntactic parser to get a logical lambda-calculus query. Vocabulary mismatch: lexicon. Structure mismatch: lexicon. Automatic: no. Limitations: very restricted domains.
• FREyA (RDF). NL parsing: syntactic parser to get a collection of terms. Vocabulary mismatch: lexicon. Structure mismatch: no. Automatic: no. Limitations: restricted domains.
• Aqualog (RDF). NL parsing: shallow parsing and pattern rules to get relations. Vocabulary mismatch: lexicon. Structure mismatch: no. Automatic: yes. Limitations: restricted domains; simple NL questions.
• PANTO (RDF). NL parsing: syntactic parser and a head-driven algorithm to get relations. Vocabulary mismatch: lexicon. Structure mismatch: lexicon. Automatic: yes. Limitations: restricted domains.
• PowerAqua (RDF). NL parsing: shallow parsing and pattern rules to get relations. Vocabulary mismatch: lexicon. Structure mismatch: partial matching. Automatic: yes. Limitations: simple NL questions; directly matches into the entity network; does not produce formal queries.
• Treo (RDF). NL parsing: syntactic parser to get an ordered list of terms. Vocabulary mismatch: semantic similarity. Automatic: yes. Limitations: directly matches into the entity network; does not produce formal queries; queries must be a single path; produces no exact answers but triple paths that may contain the answers.
• True Knowledge (Evi) (KB). NL parsing: 1,200 templates. Vocabulary mismatch: a very large repository of query rephrasings. Automatic: yes. Limitations: extremely laborious.
Part 3. SFQ Interface
SFQ Examples
1. Where was the author of the Adventures of Tom Sawyer born?
2. Give me authors in the CIKM conference
3. A more complicated one
Default Relations
• The relation name can be left out
• A stop-word list filters out relation names built from words like in, of, has, from, belong, part of, locate, etc. (a minimal filtering sketch follows)
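A minimal Python sketch of that filtering, assuming relation labels arrive as plain strings; the word list copies the slide, and the helper name is ours:

    # Words that signal a "default" (contentless) relation name.
    STOP_WORDS = {"in", "of", "has", "from", "belong", "part", "locate"}

    def is_default_relation(label: str) -> bool:
        """Treat a relation label as default when every word in it is a
        stop word (an empty label is trivially default)."""
        return all(w in STOP_WORDS for w in label.lower().split())

    print(is_default_relation("part of"))   # True  -> default relation
    print(is_default_relation("born in"))   # False -> 'born' is contentful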
Envisioned Web Interface
Output (1)
Output (2)
Part 4. Schema Network and Association Models
Instance Data (ABox)
• Two datasets
  • The relation dataset (all relations between instances)
  • The type dataset (all type definitions for instances)
• Integrate all RDF datatypes into five types that are familiar to users
  • ˆNumber, ˆDate, ˆYear, ˆText and ˆLiteral
  • ˆLiteral is the super type of the other four
• We use DBpedia for examples in the following slides
Automatically Enrich the Set of Types
• Automatically deduce types from relations
  • Infer attribute types from datatype properties
    e.g., <Beijing>, population, "20693000" => ˆPopulation
  • Infer classes from object properties
    e.g., <Zelig>, director, <Woody Allen> => ˜Director
Counting Co-occurrence
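The counting itself is shown only as a figure on this slide; below is a rough Python sketch of the idea, assuming the relation and type datasets are available as simple in-memory collections (all names and toy triples are ours):

    from collections import Counter
    from itertools import product

    # Toy ABox: relation triples plus an instance -> types map.
    relations = [("Zelig", "director", "Woody_Allen"),
                 ("Annie_Hall", "director", "Woody_Allen")]
    types = {"Zelig": {"Film"},
             "Annie_Hall": {"Film"},
             "Woody_Allen": {"Person"}}

    # Count how often each (subject class, property, object class)
    # combination co-occurs in the relation dataset.
    cooccurrence = Counter()
    for s, p, o in relations:
        for cs, co in product(types.get(s, ()), types.get(o, ())):
            cooccurrence[(cs, p, co)] += 1

    print(cooccurrence[("Film", "director", "Person")])  # 2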
The Schema Network
• A statistical meta-description of the underlying entity network, and a network itself
The Schema Path
• A path on the schema network is called a schema path (Examples 1 and 2 appear as figures)
• A schema path P represents a composite relation
The Schema Path Probability
• Measures the reasonableness of a path
• The probability of "observing" a path on the schema network
  • (A1) Select the starting node c0 of the path randomly from all the nodes in the schema network
  • (A2) Observe the path in a random walk starting from c0
Compute Transition Probability
• Each transition probability in the random walk lies in the range 0 ≤ p ≤ 1 (the formula appears as a figure; a toy computation follows)
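A rough Python sketch of the path probability under (A1) and (A2), using the co-occurrence counts built earlier; modeling each transition probability as a simple count ratio is our assumption, and return edges are ignored for brevity:

    from collections import Counter

    # cooccurrence[(cs, p, co)] = count, as on the previous slides.
    cooccurrence = Counter({("Film", "director", "Person"): 2,
                            ("Person", "birthPlace", "City"): 5,
                            ("Film", "starring", "Person"): 3})

    out_total = Counter()                  # total outgoing count per class
    for (cs, p, co), n in cooccurrence.items():
        out_total[cs] += n

    classes = {c for (cs, _, co) in cooccurrence for c in (cs, co)}

    def path_probability(path):
        """P(path): (A1) pick c0 uniformly at random, then (A2) take a
        random walk whose transitions are count ratios (our assumption)."""
        prob = 1.0 / len(classes)
        for cs, p, co in path:
            prob *= cooccurrence[(cs, p, co)] / out_total[cs]
        return prob

    # P(Film -director-> Person -birthPlace-> City) = 1/3 * 2/5 * 5/5
    print(path_probability([("Film", "director", "Person"),
                            ("Person", "birthPlace", "City")]))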
A Property of the Schema Path
• A schema path P and its return path P′ represent the same relation
• Given a schema path P and its return path P′, we have P(P) = P(P′)
Schema Path Model
• Stores and indexes all the schema paths with a length no larger than a given threshold, together with their probabilities
• The only supported function returns all the schema paths, and their probabilities, between two given classes
• Kept in memory for fast computation
Schema Path Model Optimization
Concept Path
• Group all the edges with the same direction between two nodes into a single edge
• By analogy to the schema path, we define the concept path probability
• Concept path frequency
Concept Association Knowledge (CAK) Model
• Pairwise associations
  • (i) direct association between classes and properties
  • (ii) indirect association between two classes
• PMI measure
• Our improved PMI measure (standard PMI is sketched after this list)
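As a reference point, here is a minimal Python sketch of standard PMI over class and property co-occurrence counts; the improved PMI* of this work is a variant whose exact form is defined in the dissertation, so only standard PMI is shown (counts are illustrative):

    import math
    from collections import Counter

    # Joint counts of (class, property) observations from the ABox (toy data).
    joint = Counter({("Person", "birthPlace"): 50,
                     ("Person", "spouse"): 20,
                     ("City", "population"): 30})

    total = sum(joint.values())
    class_count, prop_count = Counter(), Counter()
    for (c, p), n in joint.items():
        class_count[c] += n
        prop_count[p] += n

    def pmi(c, p):
        """Standard pointwise mutual information:
        log( P(c, p) / (P(c) * P(p)) )."""
        return math.log((joint[(c, p)] / total) /
                        ((class_count[c] / total) * (prop_count[p] / total)))

    print(round(pmi("Person", "birthPlace"), 3))  # ~0.357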
Concept Association Knowledge (CAK) Model
• Direct association between a directed class and a property p
• Indirect association between two directed classes
CAK Examples
PMI* vs PMI
• The most associated properties for "Person" in DBpedia, ranked by PMI* and by PMI (shown as a figure)
Part 5. Query Interpretation
SFQ Interpretation
Two-Phase Mapping Algorithm
Generating Candidates via Lexical Semantic Similarity
Disambiguation via Optimization
Concept Mapping Optimization Problem
A Joint Disambiguation Example
Time Complexity of the Concept Mapping Algorithm
• A straightforward concept mapping algorithm searches the full space of combined class and property choices
• After exploiting locality, the optimal mapping choice for a property can be determined locally once the two classes it links are fixed (see the sketch below)
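A toy Python sketch of the locality idea (all data structures, names and scores below are ours, not the dissertation's): enumerate assignments for the classes only, and resolve each property by a local argmax once its two classes are fixed.

    from itertools import product

    # Candidate ontology classes per query class term (toy data).
    class_candidates = {"writer": ["Writer", "Person"],
                        "book": ["Book", "WrittenWork"]}
    # Candidate properties for the single relation writer -> book.
    prop_candidates = ["author", "notableWork"]

    def score(cs, p, co):
        """Stand-in for the fitness combining similarity and association."""
        table = {("Writer", "notableWork", "Book"): 0.9,
                 ("Person", "author", "Book"): 0.6}
        return table.get((cs, p, co), 0.1)

    best = None
    terms = list(class_candidates)
    for cs, co in product(*(class_candidates[t] for t in terms)):
        # Locality: with (cs, co) fixed, the best property is a local
        # argmax, so properties do not multiply the search space.
        p = max(prop_candidates, key=lambda q: score(cs, q, co))
        if best is None or score(cs, p, co) > best[0]:
            best = (score(cs, p, co), cs, p, co)

    print(best)  # (0.9, 'Writer', 'notableWork', 'Book')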
Relation Mapping Optimization Problem
• H*: the set of top k3 concept mapping hypotheses
• The reduced mapping space for the SFQ (shown as a figure)
• The optimization problem (shown as a figure)
Computing the Fitness of a Mapping σ on a Relation r
• Two features and one parameter β
  • The joint lexical semantic similarity between r and the schema path P
  • The schema path frequency of P
  • The parameter β adjusts the relative importance of the two features (a toy combination is sketched below)
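The dissertation defines the exact combination; purely as an illustration of the role of β, here is a toy Python stand-in that weights log path frequency against similarity (our assumption, not the author's formula):

    import math

    def fitness(similarity: float, path_frequency: float,
                beta: float) -> float:
        """Toy fitness: beta = 0 ignores path frequency entirely;
        larger beta trusts frequent schema paths more."""
        return similarity * math.log(1.0 + path_frequency) ** beta

    print(fitness(0.8, 120.0, 0.0))  # 0.8: similarity only
    print(fitness(0.8, 120.0, 1.0))  # similarity scaled by log-frequency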
Align Terms in P to Terms in r
• The relation r
• The path
  • C = <c0, c1, …, cl-1, cl>
  • P = <p0, p1, …, pl-1>
• We already know the subject and object terms of r are paired with c0 and cl
• We ignore all the intermediate classes c1, …, cl-1
  • The semantics of c1, …, cl-1 overlaps with that of p0, p1, …, pl-1
  • The fewer terms we align, the less likely errors can occur
• The only unaligned terms are the relation term of r and p0, p1, …, pl-1
Semantic Stretch and Heterogeneous Alignments
Cutting Function and Cutting Objective Function
• Each cutting defines a function, referred to as ω
  • ω is the product of the similarities of the pairs in the minimum pair set that covers every term of r and every property of P
Cutting Optimization Problem and a Greedy Algorithm
• Cutting optimization
  • The cutting space Ω has a size of … with a total running time of … (the bounds appear as figures)
• SmartCutter, a greedy algorithm that runs in … (sketched below)
  • First find the property in P that is the most similar to the relation term and assume it is in the predicate region
  • Stretch to the left until we meet a property u that is more similar to the subject term, and stretch to the right until meeting a property v that is more similar to the object term
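A rough Python sketch of SmartCutter as described above, assuming a term similarity function sim(a, b) is supplied (the crude character-overlap sim here is ours):

    def smart_cutter(subj, rel, obj, props, sim):
        """Greedy cutting: seed the predicate region at the property most
        similar to the relation term, then stretch it left/right until a
        property is more similar to the subject/object term instead."""
        seed = max(range(len(props)), key=lambda i: sim(props[i], rel))
        lo = seed
        while lo > 0 and sim(props[lo - 1], rel) >= sim(props[lo - 1], subj):
            lo -= 1                            # stretch left
        hi = seed
        while (hi < len(props) - 1 and
               sim(props[hi + 1], rel) >= sim(props[hi + 1], obj)):
            hi += 1                            # stretch right
        # (subject region, predicate region, object region)
        return props[:lo], props[lo:hi + 1], props[hi + 1:]

    def toy_sim(a, b):                         # crude stand-in similarity
        return len(set(a) & set(b)) / max(len(set(a)), len(set(b)))

    print(smart_cutter("book", "author", "person",
                       ["work", "creator", "residence"], toy_sim))
    # (['work'], ['creator'], ['residence'])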
Joint Lexical Semantic Similarity
• The raw product of pair similarities tends to be biased towards small l, i.e., short paths
  • α is a parameter in the range [0, 1] used to compensate for this bias
• High similarities in the subject and object regions but low similarities in the predicate region can still yield a fairly high joint similarity
• The joint lexical semantic similarity between r and P (the formula appears as a figure)
Deal with the Default Relation
• Combine the two component scores (the formula appears as a figure)
• θ is a parameter in the range [0, ∞)
Formal Query Generation and Entity Matching
Part 6. Evaluation
Evaluation Settings
• Two very different datasets
  • DBLP+: DBLP augmented with data from CiteSeerX and ArnetMiner (narrow domain)
  • DBpedia: structured data from Wikipedia (open domain)
• Three similarity measures
  • LSA semantic similarity (purely statistical)
  • Hybrid semantic similarity (LSA + WordNet)
  • String similarity (bigrams + Dice coefficient; sketched after this list)
• Performance metrics
  • Mean Reciprocal Rank (how high the first correct interpretation ranks in the top-10 list)
  • Mean Precision and Recall (evaluate the answers produced by the SPARQL queries)
• Test environment
  • PC with a 2.33GHz Intel Core2 CPU and 8GB memory
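A minimal Python sketch of the bigram-based Dice string similarity named above (the implementation details are our assumptions):

    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def dice_similarity(a: str, b: str) -> float:
        """Dice coefficient over character bigrams:
        2 * |A & B| / (|A| + |B|)."""
        A, B = bigrams(a), bigrams(b)
        if not A and not B:
            return 1.0
        return 2.0 * len(A & B) / (len(A) + len(B))

    print(round(dice_similarity("author", "authors"), 3))  # 0.909
    print(round(dice_similarity("author", "writer"), 3))   # 0.0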
DBLP+ Dataset Statistics
Degree of Connectivity
• 18 × 18 class pairs result from pairing every class with every class
• (Figures: distribution of connectivity degree over class pairs when distance = 1 and when distance ≤ 3.)
DBLP+ Query Set
• 64 test questions
  • 31 Direct Single (DS) questions (e.g., "Give me author x of the paper y")
  • 15 Indirect Single (IS) questions (e.g., "Show person x who cites the person y")
  • 8 Direct Multiple (DM) questions (e.g., "List person x who published the book y with ISBN z")
  • 10 Indirect Multiple (IM) questions (e.g., "List the institutions u of the author y with whom the person x from the organization z has co-authored")
• Rephrased to 220 SFQ queries
  • For example, "Give me author x of the paper y" is rephrased to seven SFQ queries
Resolving Parameters
• Use sufficiently large numbers for k1, k2 and k3
  • k1 = 10 (the size of the class candidate list)
  • k2 = 20 (the size of the property candidate list)
  • k3 = 40 (the number of top hypotheses returned by the concept mapping phase)
• Resolving α, β, γ and θ
  • First tune α and γ while fixing β = 0 and θ = 1
  • Next, tune β while still fixing θ = 1
  • Last, tune θ
Results of Tuning Parameters
• Top-10 coverage of the 220 SFQ queries
  • hybrid: 99.5%
  • LSA: 98.2%
  • string: 56.4%
Cross-Validation
• Using all the queries: (results shown as a figure)
DBpedia Dataset Statistics
Degree of Connectivity
• 249 × 249 class pairs result from pairing every class with every class
• (Figures: connectivity degree among the 249 classes when distance = 1 and when distance ≤ 2.)
DBpedia Query Set
• 2011 QALD (Question Answering over Linked Data) workshop
  • 50 training and 50 test questions on DBpedia 3.6
  • Ground truth answers
• 33 questions from the 50 QALD test questions can be answered using only the DBpedia ontology
  • We modified 7 questions due to unsupported operations and 1 question due to a data issue, but preserved the relational structure and vocabulary of the questions
  • 27 DS questions (e.g., "Which river does the Brooklyn Bridge cross?")
  • 6 DM questions that contain two relations (e.g., "Give me the official websites of actors of the television show Charmed.")
• Three graduate students who were unfamiliar with DBpedia independently translated them into 99 SFQ queries
Results
• Top-10 coverage of the 99 SFQ queries
  • hybrid: 88.9%
  • LSA: 82.8%
  • string: 51.5%
• The coverage has an upper limit of 91.9%
  • 5 test cases fail due to ambiguity: the translators' interpretations changed the questions
  • 3 test cases fail due to an incorrect property name
Generating SPARQL Queries and Answers
• Use a non-empty strategy to automatically generate answers for an SFQ query (a sketch follows)
  • Run the SPARQL query generated for the best interpretation of the SFQ query
  • If an empty result is returned, go to the next interpretation, and so on
• Results on the 99 SFQ queries (33 NL questions) using the parameters learned on the DBLP+ dataset (shown as a figure)
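A minimal Python sketch of the non-empty strategy against a public endpoint, assuming the ranked interpretations have already been rendered to SPARQL strings (the example query and endpoint wiring are ours; requires the SPARQLWrapper package):

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")

    def first_nonempty(interpretations):
        """Run the SPARQL for each interpretation, best first, and return
        the bindings of the first query with a non-empty result."""
        for query in interpretations:
            endpoint.setQuery(query)
            endpoint.setReturnFormat(JSON)
            rows = endpoint.query().convert()["results"]["bindings"]
            if rows:
                return rows
        return []

    # Ranked interpretations of "Which river does the Brooklyn Bridge
    # cross?" (illustrative query text).
    ranked = [
        "SELECT ?r WHERE { <http://dbpedia.org/resource/Brooklyn_Bridge> "
        "<http://dbpedia.org/ontology/crosses> ?r }",
    ]
    for row in first_nonempty(ranked):
        print(row["r"]["value"])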
Compare with QALD Systems
• Compare with two QALD systems on 30 test questions
  • Systems: our system (hybrid), our system (LSA), FREyA, PowerAqua (results shown as a figure)
• Three questions are excluded because we made them easier by dropping the aggregation functions
• Among the 30 questions, PowerAqua modified 8 questions and FREyA modified 4 questions
Compare with QALD Systems
• Compare with two QALD systems on 6 two-relation questions
  • Systems: our system (hybrid), our system (LSA), FREyA, PowerAqua (results shown as a figure)
• Our differences
  • Both PowerAqua and FREyA use lexicons
  • FREyA depends heavily on user interaction to perform mappings
  • PowerAqua and FREyA tuned their systems on the 50 training questions
  • Both PowerAqua and FREyA use the TBox data of the DBpedia Ontology
Compare with Online Systems
• Compare with two online systems on 33 test questions
  • Systems: our system (hybrid), our system (LSA), True Knowledge, PowerAqua (results shown as a figure)
• Both the True Knowledge and PowerAqua online systems include DBpedia data as part of their knowledge bases
Running Time Comparison
• QALD reported systems
  • FREyA: 36 seconds per question
  • PowerAqua: N/A
• Online systems
  • True Knowledge: a few seconds
  • PowerAqua: 143.7 seconds
• Our systems
  • Hybrid: 0.721 seconds
  • LSA: 0.766 seconds
Conclusion and Future Work
• The schema-free structured query approach allows people to query semantic data without mastering formal query languages or acquiring detailed knowledge of the classes
• Our system uses statistical data about lexical semantics and about the semantic data itself to generate the most appropriate formal queries from a user's intuitive query
• Our evaluation showed that the approach is both effective and efficient on two very different, large datasets
• Our next step is to make the approach easier to apply to new RDF data collections and to the larger LOD cloud, and to develop the envisioned web interface
Contributions
• The SFQ interface, which works around the unsolved problem of parsing full relational structure out of natural language queries
• Novel context-sensitive and fully computation-based mapping algorithms that address both vocabulary and structure mismatch problems
• A novel approach to building a schema network from the entity network to deal with heterogeneous or unknown schemas
• A definition of the probability of observing a path on the schema network, and two novel statistical association models
• An improvement to a popular statistical association measure, PMI
• State-of-the-art and novel semantic similarity measures
End
Thank you!!!
Questions?