On the Efficiency and Effectiveness of Federated Semantic Data

Download Report

Transcript On the Efficiency and Effectiveness of Federated Semantic Data

On the Efficiency and Effectiveness of
Federated Semantic Data Management
ANAPSID An Adaptive Approach.
Maria-Esther Vidal
[email protected]
Universidad Simón Bolívar, Venezuela
Joint work with Maribel Acosta, Simón Castillo, Gabriela Montoya, Guillermo
Palma
CREDIBLE 2013
1
Agenda
1
Introduction
2
ANAPSID
3
Query Decomposition Problem
4
5
Empirical Study.
Conclusions and Future Directions
2
1
INTRODUCTIONSCIENTIFIC QUESTIONS
3
Motivation
Federations of Endpoints
ARQ
SPARQL-DQP
ANAPSID
CREDIBLE 2013
4
Motivation
Federations of Endpoints
Data
Fragmentation
Replication
DYNAMICITY
ARQ
Query Complexity
SPARQL-DQP
ANAPSID
Endpoint Configuration
NETWORK LATENCY
Platform
Data Correlations
Distribution
CREDIBLE 2013
5
Motivation
Properties of Federations of Endpoints
Data
Fragmentation
Replication
DYNAMICITY
Query Complexity
NETWORK LATENCY
Data Correlations
Benchmarks need to be able to measure characteristics of
Distribution
Federated SPARQL Query Processing Engines
Endpoint Configuration
Platform
Problems have to be formal analyzed to understand the
characteristics of Federated SPARQL Query Processing Engines
CREDIBLE 2013
6
SP2B-10M
SW Dog Food
CREDIBLE 2013
7
Accessing SPARQL Endpoints
q0:
{?d
?e
?e
“Kegg compound identifiers and among their drugs, those
that have a substrate that is an enzyme.”
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
Triple Bound to
rdf:type bio2rdf-kegg:Enzyme}
General Predicate
CREDIBLE 2013
8
Accessing SPARQL Endpoints
q0:
{?d
?e
?e
“Kegg compound identifiers and among their drugs, those
that have a substrate that is an enzyme.”
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
Triple Bound to
rdf:type bio2rdf-kegg:Enzyme}
General Predicate
q1:
{?d
?e
?e
?d
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
rdf:type bio2rdf-kegg:Enzyme.
owl:sameAs ?d1}
Triples Bound to
General Predicate
CREDIBLE 2013
9
Accessing SPARQL Endpoints
q0:
{?d
?e
?e
“Kegg compound identifiers and among their drugs, those
that have a substrate that is an enzyme.”
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
Triple Bound to
rdf:type bio2rdf-kegg:Enzyme}
General Predicate
q1:
{?d
?e
?e
?d
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
rdf:type bio2rdf-kegg:Enzyme.
owl:sameAs ?d1}
Triples Bound to
General Predicate
q2: Select * WHERE
{?d drugbank:keggCompoundId ?c.
?e bio2rdf-kegg:xSubstrate ?c. }
?e rdf:type bio2rdf-kegg:Enzyme.
?d owl:sameAs ?d1.
?d1 rdf:type dbpedia-owl:Drug}
CREDIBLE 2013
Triples Bound to
General Predicate
10
Accessing SPARQL Endpoints
q0:
{?d
?e
?e
“Kegg compound identifiers and among their drugs, those
that have a substrate that is an enzyme.”
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
Triple Bound to
rdf:type bio2rdf-kegg:Enzyme}
General Predicate
q1:
{?d
?e
?e
?d
Select * WHERE
drugbank:keggCompoundId ?c.
bio2rdf-kegg:xSubstrate ?c. }
rdf:type bio2rdf-kegg:Enzyme.
owl:sameAs ?d1}
q2: Select * WHERE
{?d drugbank:keggCompoundId ?c.
?e bio2rdf-kegg:xSubstrate ?c. }
?e rdf:type bio2rdf-kegg:Enzyme.
?d owl:sameAs ?d1.
?d1 rdf:type dbpedia-owl:Drug}
Triples Bound to
General Predicate
Triples Bound to
General Predicate
q3: Select * WHERE
{?d drugbank:keggCompoundId ?c. ?e bio2rdf-kegg:xSubstrate ?c. }
?e rdf:type bio2rdf-kegg:Enzyme.
Triples Bound to
?d owl:sameAs ?d1.?d1 rdf:type
dbpedia-owl:Drug.
CREDIBLE 2013
11
General Predicate
?d1 rdf:type ?t1. ?d1 rdfs:label ?d2}
R1) Is this observed behavior due to problems with existing engines? Or
R2) Is this behavior caused by the properties of the queries?
CREDIBLE 2013
12
q0:
?e rdf:type bio2rdfkegg:Enzyme
CREDIBLE 2013
13
q1:
?e rdf:type bio2rdfkegg:Enzyme
?d owl:sameAs ?d1
CREDIBLE 2013
14
q2:
q2:
?e rdf:type bio2rdfkegg:Enzyme
?d owl:sameAs ?d1
?d1 rdf:type dbpediaowl:Drug
CREDIBLE 2013
15
q3:
?e rdf:type bio2rdfkegg:Enzyme
?d owl:sameAs ?d1
?d1 rdf:type dbpediaowl:Drug
?d1 rdf:type ?t1.
?d1 rdfs:label ?d2
What are the minimal number of endpoints that need to be contacted to
ensure certain properties of the answer? For example, completeness, minimal
CREDIBLE 2013
data transfer.
16
Data Fragmentation
Fragment 1
Fragment 2
Fragment 3
Empty Intersection between fragments
CREDIBLE 2013
17
Horizontal Fragmentation
Fragment 1
Fragment 2
Fragment 3
CREDIBLE 2013
Each fragment may
contain triples of many
predicates.
Triples of one predicate
can belong to different
fragments.
It impacts on the
completeness of a
query answer.
18
Vertical Fragmentation
Fragment 1
Fragment 2
Each fragment
contains all the triples
of at least one predicate
in the original dataset.
It impacts on query
performance.
Fragment 3
CREDIBLE 2013
19
Data Fragmentation and
Replication
Effects
Triples of three predicates were stored in fragments.
– skos:subject, owl:sameAs, nytimes:latest_use
Vertical Partitioning
Without Replication
(Three fragments)
E1
E3
E2
Vertical Partitioning
With Replication
(Three fragments)
E2
E1
Horizontal Partitioning
Without Replication
(Two fragments)
Horizontal Partitioning
With Replication
(Two fragments)
20
E1
E1
E3
E4
E2
E2
CREDIBLE 2013
E3
E4
Life Science Data
Linked Data
Cross Domain
FedBench
Queries
CREDIBLE 2013
21
Partitioning and Data Distribution
LD10 (Perfect Network)
Partitioning and Data Distribution
LD10 (Perfect Network)
✖
✔
✖
✔
R3: Is the observed behavior caused by a limitation of existing engines?
R4: Is this caused by the way the
data2013
is partitioned and distributed?
CREDIBLE
22
Data Distribution
• All FedBench datasets distributed in one
endpoint versus distributed in different
endpoints. CD1 (Perfect Network)
✔
✖
R5: Is the observed behavior caused by a limitation of existing engines?
R6: Is this caused by the way the data is distributed?
CREDIBLE 2013
23
Endpoint Dimension-Network Latency
Gamma Distribution
Perfect Network
(No Delay)
Fast Network
Gamma (α=1,β=0.3)
Medium-Fast Network
Gamma (α=3,β=1.0)
Medium-Slow
Gamma (α=3,β=1.5)
Slow
Gamma (α=5,β=2.0)
CREDIBLE 2013
24
✔
✖
✖
✖
✖
✖
✖
✖
✖
✖
R7: Is the observed behavior caused by a limitation of existing engines?
R8: Is this caused by the shape of the query and the type of delays?
CREDIBLE 2013
25
Challenges
 Formalization of Query decomposition to
understand the characteristics and behavior of existing
engines.
 Define Query decomposition and
optimization techniques to produce efficient query
plans.
Efficiency means minimize:
size of data transferred from the endpoints.
total execution time.
26
Challenges
Adaptive query processing techniques for
SPARQL endpoints able to:
 Hide delays.
 Opportunistically produce results even if one of the
endpoints gets blocked.
Adaptivity means:
Capacity to adapt to changing conditions of the
sources.
27
2
ANAPSID
28
ANAPSID Architecture
SPARQL 1.0
Query
ANAPSID
Endpoint Descriptions:
Dataset Predicates
Query
Decomposer
SPARQL 1.1
A
Bushy Query
Adaptive
Query
Execution
29
ANAPSID Architecture
SPARQL 1.0
Query
ANAPSID
Query
Decomposer
Endpoint Descriptions:
Dataset Predicates
• Query Decomposer
rewrites SPARQL 1.0
queries into SPARQL 1.1
Bushy Queries of smallsize sub-queries.
SPARQL 1.1
Bushy Query
Adaptive
Query
Execution
30
ANAPSID Architecture
SPARQL 1.0
Query
ANAPSID
Query
Decomposer
SPARQL 1.1
Bushy Query
Endpoint Descriptions:
Dataset Predicates
• Query Decomposer
rewrites SPARQL 1.0
queries into SPARQL 1.1
Bushy Queries of smallsize sub-queries.
– Endpoints are contacted
to retrieve the dataset
predicates.
– Heuristics-based
approaches are
considered.
Adaptive
Query
Execution
31
ANAPSID Architecture
SPARQL 1.0
Query
ANAPSID
Query
Decomposer
SPARQL 1.1
Bushy Query
Endpoint Descriptions:
Dataset Predicates
• Query Decomposer
rewrites SPARQL 1.0
queries into SPARQL 1.1
Bushy Queries of smallsize sub-queries.
– Endpoints are contacted
to retrieve the dataset
predicates.
– Heuristics-based
approaches are
considered.
Adaptive
Query
Execution
32
ANAPSID Architecture
SPARQL 1.0
Query
ANAPSID
Query
Decomposer
SPARQL 1.1
Bushy Query
Adaptive
Query
Execution
Endpoint Descriptions:
Dataset Predicates
Schema
Alignments
• Query Decomposer
rewrites SPARQL 1.0
queries into SPARQL 1.1
Bushy Queries of smallsize sub-queries.
– Endpoints are contacted
to retrieve the dataset
predicates.
– Heuristics-based
approaches are
considered.
• Physical Query
Operators adapt query
executions to endpoint
data transfers and
endpoint availability.
33
Adaptive Query
Execution
Inter-Operator
Intra-Operator
Level
Level
Query engine is able to adapt
its schedule when operators get
Blocked.
Physical Operators are able
to adapt their schedules to
Data Availability.
Intra- and Inter-Operator Adaptation
Operators may
continue
generating
answers even if
one of the sources
gets blocked!
5
3
The plan can be
adapted on-the-fly
when one
operators gets
blocked!
1
A first answer can be produced very fast even if one
of the sources gets blocked!
CREDIBLE 2013
35
Intra- and Inter-Operator Adaptation
Operators may
continue
generating
answers even if
one of the sources
gets blocked!
5
3
The plan can be
adapted on-the-fly
when one
operators gets
blocked!
The engine keeps producing answers even if source
A is not available.
CREDIBLE 2013
36
Intra- and Inter-Operator Adaptation
Operators may
continue
generating
answers even if
one of the sources
gets blocked!
5
The plan can be
adapted on-the-fly
when one
operators gets
blocked!
The engine keeps producing answers even if source
A is not available.
CREDIBLE 2013
37
Intra- and Inter-Operator Adaptation
Operators may
continue
generating
answers even if
one of the sources
gets blocked!
5
The plan can be
adapted on-the-fly
when one
operators gets
blocked!
The engine keeps producing answers even if source
A is not available.
CREDIBLE 2013
38
Inter-Operator Adaptation
5
The plan is evaluated until
the timeout is reached.
The answer will be possibly
incomplete!.
CREDIBLE 2013
39
3
QUERY DECOMPOSITION
PROBLEM
42
Vertex Coloring Problem
 Assigns a color to
every vertex in a
graph.
 Such that adjacent
vertices are colored
with different colors.
 The number of
colors is minimized.
Deciding if a coloring of the graph is minimal or not is a NP-hard Problem
DSATUR an approximation that can find optimal solutions depending on the
shape of the input graph!
Mapping of the Query
Decomposition Problem into Vertex
Coloringto Problem
 Nodes corresponds
triple patterns in the
query.
 Connect two nodes:
 It is not possible to perform a JOIN between the
two triple patterns.
 The triple patterns cannot be answered by the
same endpoint.
Fed-DSATUR extends DSATUR to identify colorings that meet the former
conditions!
Formalization allows us to demonstrate:
Complexity of the problem.
Optimality of the query decomposition.
Optimality of the data transfer.
Completeness of the answer.
CREDIBLE 2013
44
Query Decomposition as the Vertex
Coloring Problem
q3: Select * WHERE {
?d drugbank:keggCompoundId ?c.
?e bio2rdf-kegg:xSubstrate ?c.
?e rdf:type bio2rdf-kegg:Enzyme.
?d owl:sameAs ?d1.
?d1 rdf:type dbpedia-owl:Drug.
?d1 rdf:type ?t1.
?d1 rdfs:label ?d2}
Four Sub-Queries
“Drugs that possibly target Leukemia”
SELECT DISTINCT ?drug1
WHERE {
?drug1 drugbank:possibleDiseaseTarget
diseasome:673 .
?drug1 drugbank:target ?o.
?o drugbank:genbankIdGene ?g.
?o drugbank:locus ?l.
?o drugbank:molecularWeight ?mw.
?o drugbank:hprdId ?hp.
?o drugbank:swissprotName ?sn.
?o drugbank:proteinSequence ?ps.
?o drugbank:generalReference ?gr.
?drug drugbank:target?o.
?drug drugbank:synonym?o1 .
OPTIONAL {
?drug owl:sameAs ?drug5 .
?drug5 rdf:type dbcategory:Drug .
?drug drugbank:keggCompoundId ?cpd .
?enzyme kegg:xSubstrate ?cpd .
?enzyme rdf:type kegg:Enzyme .
?reaction kegg:xEnzyme ?enzyme .
?reaction kegg:equation ?equation .
} }
CREDIBLE 2013
46
“Drugs that possibly target Leukemia”
SELECT DISTINCT ?drug1
WHERE {
?drug1 drugbank:possibleDiseaseTarget
diseasome:673 .
?drug1 drugbank:target ?o.
?o drugbank:genbankIdGene ?g.
?o drugbank:locus ?l.
?o drugbank:molecularWeight ?mw.
?o drugbank:hprdId ?hp.
?o drugbank:swissprotName ?sn.
?o drugbank:proteinSequence ?ps.
?o drugbank:generalReference ?gr.
?drug drugbank:target?o.
?drug drugbank:synonym?o1 .
OPTIONAL {
?drug owl:sameAs ?drug5 .
?drug5 rdf:type dbcategory:Drug .
?drug drugbank:keggCompoundId ?cpd .
?enzyme kegg:xSubstrate ?cpd .
?enzyme rdf:type kegg:Enzyme .
?reaction kegg:xEnzyme ?enzyme .
?reaction kegg:equation ?equation .
} }
CREDIBLE 2013
47
“Drugs that possibly target Leukemia”
SELECT DISTINCT ?drug1
WHERE {
?drug1 drugbank:possibleDiseaseTarget
diseasome:673 .
?drug1 drugbank:target ?o.
?o drugbank:genbankIdGene ?g.
?o drugbank:locus ?l.
?o drugbank:molecularWeight ?mw.
?o drugbank:hprdId ?hp.
?o drugbank:swissprotName ?sn.
?o drugbank:proteinSequence ?ps.
?o drugbank:generalReference ?gr.
?drug drugbank:target?o.
?drug drugbank:synonym?o1 .
OPTIONAL {
?drug owl:sameAs ?drug5 .
?drug5 rdf:type dbcategory:Drug .
?drug drugbank:keggCompoundId ?cpd .
?enzyme kegg:xSubstrate ?cpd .
?enzyme rdf:type kegg:Enzyme .
?reaction kegg:xEnzyme ?enzyme .
?reaction kegg:equation ?equation .
} }
CREDIBLE 2013
48
4
EMPIRICAL STUDY
49
Experimental Study 1
FedBench Benchmark

Dataset
File
Size
Nine Virtuoso endpoints
 Timeout set up to 240 secs, or
71,000 tuples.

Queries
 25 FedBech queries
CD, LD, LSD domains
3-6 Triple patterns
 Ten additional complex queries
6-47 triple patterns

Metrics
 Throughput
 Percentage of the Answer (PA)
Federated Engines:
 FedX
 SPLENDID
 ANAPSID
50
Experimental Study 1
Fed1
Fed2
26 SPARQL Endpoints
9 SPARQL Endpoints
Life Science Data
CREDIBLE 2013
52
Linked Data
Cross Domain
Life Science Data
CREDIBLE 2013
53
Linked Data
Cross Domain
C9 and C1 are complex queries that none of the engines was able to execute in less
than 30 minutes!
CREDIBLE 2013
54
C9 and C1 are complex queries that none of the engines was able to execute in less
than 30 minutes!
CREDIBLE 2013
55
http://silurian.thalassa.cbm.usb.ve/
Complex Query C9
Exclusive
Groups
Endpoints
FedDSATUR
Vertex
Coloring
Graph
Experimental Study 1
Fed 1 No Delays
Log 2 (Throughput= Answer Size/Elapse Time)
Ground Truth was computed in an integrated version of the FedBench collections.
Experimental Study 1
Fed 1 No Delays
Existing Engines are
competitive in queries
comprise of small number
of triple patterns
ANAPSID query
optimization techniques
may overcome existing
Federated engines in
complex queries.
Experimental Study 1
Fed 2 No Delays
Existing Engines are
competitive in queries
comprise of small number
of triple patterns
ANAPSID query
optimization techniques
may overcome existing
Federated engines in
complex queries.
Experimental study 2
Python proxies
Contact endpoints

Simulate Network Latency
Gamma1:
of response latency
seconds.
Gamma2:
fast network with a gamma distribution
resulting in an average latency of 0.3
simulates a
medium fast network with gamma
distribution of response latency resulting in an average latency of
simulates a
Timeout 300 secs

Experimental Environment

Linux Mint machine.

Intel Pentium Core2 Duo 3.0
GHz.

8GB RAM.
3 seconds.
Gamma3:
slow network
with a gamma

distribution of response latency resulting in an average latency of
Federated SPARQL Engines
4.5 seconds per message.

ANAPSID [Acosta et al]

FedX [Schwarte et al]
simulates a

Warm Cache

SPLENDID [Gorlitz et al]

Virtuoso SPARQL endpoints
Goal: Evaluate
performance of the proposed query decomposition
Techniques and the Adaptive Query Engine.
Compare them with respect to state-of-the-art engines.
60
Experimental study 2
LSD DOMAIN
GAMMA1
GAMMA2
GAMMA3
ANAPSID Time First Answer
111.03
111.59
164.11
ANAPSID Time Whole Answer
237.42
266.44
262.65
FEDX
39.71
131.11
151.53
SPLENDID
227.52
272.7
285.47
Statistics seem not to improve the performance.
Adaptive approach may:
exhibit a similar performance even in presence of network delays.
produce the first answer in less than half of the time of whole answer
is produced.
Execution time of non-adaptive approaches may be considerable affected
in presence of slow networks.
61
5
CONCLUSIONS AND FUTURE
WORK
62
Conclusions

Criteria to group
triple patterns in order to
reduce the number of tuples transferred from
the endpoints.

Mapping of query
Coloring Problem:



decomposition to Vertex
Fed-DSATUR
Optimization techniques to produce bushy
trees.
Empirical Evaluation suggests that
proposed techniques support
successful execution of queries.
63
Future Directions
 Analysis of families of SPARQL queries that
correspond to optimal solutions of Fed-DSATUR.
 Extend Fed-DSATUR to integrate different type of sources,
e.g., crowdsource, network of sensors, updatable data, etc.
 Definition of benchmarks to consider diverse families of
SPARQL queries, e.g.,
 Complex queries as C9 and C1.
 Empirical evaluation of the performance and quality
of different families of SPARQL queries .
64
QUESTIONS?
MANY THANKS
Query Decomposition
Exact Star-Shaped Groups
 Groups of pattern combinations according to exactly
one common variable.
?o
(7) ?o
(8) ?o
(9) ?o
(10)?o
(6)
<http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#UMLS_CUI> ?CUI .
<http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#RxNorm_CUI> ?RN .
<http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#SNOMED_CID> ?SM .
<http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#MeSH_DUI> ?Ms .
<http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#MeSH_Name> ?Mn.
66
Exact Star-Shaped Groups and satellites
 Groups of one exact star-shaped group and
triples patterns
t={s p o}
 Share one variable with a triple in the exact star
 The rest of the triple components are constants.
(3) ?I1 drugbank:interactionDrug2 ?drug1 .
(4) ?I1 drugbank:interactionDrug1 ?drug .
(1) ?drug1 drugbank:drugCategory drugbankres:antibiotics .
67
Star-Shaped Groups and satellites
SELECT DISTINCT ?drug ?diseaseName WHERE {
(1) ?drug1 drugbank:drugCategory drugbankres:antibiotics .
(2) ?drug2 drugbank:drugCategory drugbankres:antiviralAgents .
(3) ?I1 drugbank:interactionDrug2 ?drug1 .
(4) ?I1 drugbank:interactionDrug1 ?drug .
(5) ?I2 drugbank:interactionDrug2 ?drug2 .
(6) ?I2 drugbank:interactionDrug1 ?drug .
(7) ?drug diseasome:possibleTargetDisease ?disease.
(8) ?disease diseasome:diseaseName ?diseaseName
}
(3) ?I1 drugbank:interactionDrug2 ?drug1 .
(4) ?I1 drugbank:interactionDrug1 ?drug .
(1) ?drug1 drugbank:drugCategory drugbankres:antibiotics .
(5) ?I2 drugbank:interactionDrug2 ?drug2 .
(6) ?I2 drugbank:interactionDrug1 ?drug .
(2) ?drug2 drugbank:drugCategory drugbankres:antiviralAgents.
(7) ?drug diseasome:possibleTargetDisease ?disease.
(8) ?disease diseasome:diseaseName ?diseaseName
68
 Queries
 Five complex queries
 Between 6 and 19 triple patterns.
 Decomposed into up to 7 sub-queries.
 Different SPARQL Operators.
 General Predicates: rdf:type, owl:sameAs, rdfs:seeAlso
 Virtuoso Endpoints
 Timeouts 240 secs.
 71,000 tuples
CREDIBLE 2013
69
Query Optimization
 Greedy-based solution to identify bushy trees:
 To combine star-shaped groups with satellites.
 Endpoints are contacted to retrieve number of answers produced
by each sub-query.
 Number of Cartesian Products and the height of tree are
minimized.
70