PPT - University of Maryland
Download
Report
Transcript PPT - University of Maryland
GRIN: A Graph Based
RDF Index
Octavian Udrea1
Andrea Pugliese2
V. S. Subrahmanian1
1University
of Maryland College Park
2Università di Calabria
Motivation
Plenty of large RDF datasets:
TAP, GovTrack, ChefMoz, CIA World Factbook
Many many more (see rdfdata.org)
Query languages: RDQL, RQL, SPARQL
DB systems: Jena, Sesame, RDFBroker
Indexing?
Based on relational database indexes
Has to be rooted in the characteristics of the
query language
2
Contributions
Lightweight mechanism for indexing large
RDF datasets
GRIN: Graph-based RDF INdex
Query answer algorithms for SPARQL-like
queries
Evaluation on two real-world datasets: TAP
(Stanford) and ChefMoz (chefmoz.org)
3
Outline
RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
4
RDF graph example (ChefMoz)
5
RDF query example
6
Query example in SPARQL
X
SELECT ?v1 ?v2 ?v3
WHERE {
{(?v1 attire ?v3) . (?v1 cuisine Italian)}
{(?v2 attire ?v3) . (?v2 cuisine Italian) . (?v2 location Norfolk)}
{(Norfolk locatedIn NE/USA)}
}
FROM ChefMoz
7
Native RDF systems: Jena2
Stores RDF as (subject, property, value) in a
relational table
Indexes on each of the three attributes
Translates SPARQL/RDQL into SQL
X
6 self-joins
8
Native RDF systems: Sesame
Broekstra et al., ISWC 2002
The Sesame SAIL API improves on Jena:
Supports RDF Schema inference
Separates RDFS from the triple table
Supports database schema generation based on
the underlying RDF schema of a dataset
The problem of too many joins remains
9
Native RDF systems: RDFBroker
Sintek et al., ESWC 2006
The database schema is built based on
signatures – the set of properties used on a
resource
Reduces the number of joins between tables
10
The human perspective
11
The human perspective
12
The human perspective
13
The human perspective
14
The human perspective
15
Outline
RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
16
GRIN intuition
Resources “closer” in the RDF graph are
more likely to be part of the same answer
Hence they should appear on the same page
GRIN will group resources in circles around
selected center resources
Query evaluation:
Find the smallest circle that contains the answer
Evaluate query only on resources in that circle
17
The GRIN Index structure
GRIN is a binary tree in which:
Leaf nodes are sets of resources (and the
associated triples)
Inner nodes are circles consisting of a center
resource and a radius
Each node is fully contained in its parent
Distance metric: shortest path distance in the
undirected graph
18
Building the index: clustering
19
Building the index: clustering
20
Building the index: clustering
21
Building the index: clustering
22
Building the index: clustering
Standard k-medoids clustering (Kaufman &
Rousseeuw, 1987)
R
log
M
How many clusters? k 2
2
R is the set of resources
M is the maximum number of resources per page
Average link gives the best performance for
the inter-cluster distance
23
Building the index: the tree
24
Building the index: the tree
25
Building the index: the tree
26
Outline
RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
27
Queries to constraints
Extract constraints from the query:
d(?v1, Italian) ≤ 1
d(?v2, Norfolk) ≤ 1
d(?v3, Italian) ≤ 2
…and so on
28
Query evaluation
1.
2.
3.
Goal: identify the smallest circle that is
guaranteed to contain an answer to the
query
Perform a depth-first traversal
For each index node, evaluate the
constraints
If the constraints guarantee an answer,
perform subgraph matching
29
Query evaluation
30
Evaluating constraints
Constraints:
d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1,
d(?v3, Italian) ≤ 2
Question: is ?v1 in the circle (Grivanti, 3)?
d(Grivanti,?v1) ≤ d(Grivanti, Italian) +
d(?v1, Italian)
≤1+1=2
?v1 must be in the circle (Grivanti, 3)
31
Evaluating constraints
Question: is ?v3 in (Grivanti, 3)?
d(Grivanti, ?v3) ≤ d(Grivanti, Italian) +
d(Italian, ?v3) ≤
1+2=3
?v3 must be in (Grivanti, 3)
Similarly, ?v2 is in the same circle
32
Subgraph matching
Perform subgraph matching on the resources
in the circles guaranteed to contain an
answer
Algorithm by Cordella et. al, IEEE PAMI 26(10),
2006
Worst-time complexity of O(N!)
Where N is the maximum number of nodes in
either graph
In practice, GRIN makes N very small
33
Outline
RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
34
Experimental framework
Comparison between GRIN, Sesame, Jena2
and RDFBroker (in-memory)
Index build time
Memory consumption at query time
Query time
Two real-world datasets:
TAP (Stanford): datasets between 1.5MB and
300MB
ChefMoz (chefmoz.org): 220 MB
35
Index build time
Build time
200000
180000
GRIN
RDFBroker
Sesame
Jena
160000
Time [ms]
140000
120000
100000
80000
60000
40000
20000
0
2
4
16
32
64
128
256
300
Data size [MB]
36
Memory consumption
Memory consumption
30
GRIN
RDFBroker
Sesame
Jena
25
Memory [MB]
20
15
10
5
0
1
5
10
15
20
25
Number of variables
37
Query time
Query time
25000
GRIN
RDFBroker
Sesame
Jena
Time [ms]
20000
15000
10000
5000
0
1
5
10
15
20
25
Number of variables
38
Average degree of a query node
Dependence of query time on the avg deg of a query node
9000
GRIN
8000
RDFBroker
Query time [ms]
7000
6000
5000
4000
3000
2000
1000
0
1
1.2
1.5
1.8
2
2.2
2.5
2.8
3
3.2
3.5
Average degree of a node
39
Conclusions
Method for indexing large RDF graphs adapted
to the characteristics of RDF queries
Avoids expensive join operations
Gives better query times than Jena2, Sesame
and RDFBroker
Current and future work:
Disk-based index
Analysis of overlap and coverage
40