PPT - University of Maryland

Transcript PPT - University of Maryland

GRIN: A Graph Based
RDF Index
Octavian Udrea1
Andrea Pugliese2
V. S. Subrahmanian1
1University
of Maryland College Park
2Università di Calabria
Motivation

Plenty of large RDF datasets:





TAP, GovTrack, ChefMoz, CIA World Factbook
Many many more (see rdfdata.org)
Query languages: RDQL, RQL, SPARQL
DB systems: Jena, Sesame, RDFBroker
Indexing?


Based on relational database indexes
Has to be rooted in the characteristics of the
query language
2
Contributions

Lightweight mechanism for indexing large
RDF datasets



GRIN: Graph-based RDF INdex
Query answer algorithms for SPARQL-like
queries
Evaluation on two real-world datasets: TAP
(Stanford) and ChefMoz (chefmoz.org)
3
Outline




RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
4
RDF graph example (ChefMoz)
5
RDF query example
6
Query example in SPARQL
X
SELECT ?v1 ?v2 ?v3
WHERE {
{(?v1 attire ?v3) . (?v1 cuisine Italian)}
{(?v2 attire ?v3) . (?v2 cuisine Italian) . (?v2 location Norfolk)}
{(Norfolk locatedIn NE/USA)}
}
FROM ChefMoz
7
Native RDF systems: Jena2



Stores RDF as (subject, property, value) in a
relational table
Indexes on each of the three attributes
Translates SPARQL/RDQL into SQL
X
6 self-joins
8
Native RDF systems: Sesame


Broekstra et al., ISWC 2002
The Sesame SAIL API improves on Jena:




Supports RDF Schema inference
Separates RDFS from the triple table
Supports database schema generation based on
the underlying RDF schema of a dataset
The problem of too many joins remains
9
Native RDF systems: RDFBroker



Sintek et al., ESWC 2006
The database schema is built based on
signatures – the set of properties used on a
resource
Reduces the number of joins between tables
10
The human perspective
11
The human perspective
12
The human perspective
13
The human perspective
14
The human perspective
15
Outline




RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
16
GRIN intuition

Resources “closer” in the RDF graph are
more likely to be part of the same answer



Hence they should appear on the same page
GRIN will group resources in circles around
selected center resources
Query evaluation:


Find the smallest circle that contains the answer
Evaluate query only on resources in that circle
17
The GRIN Index structure

GRIN is a binary tree in which:




Leaf nodes are sets of resources (and the
associated triples)
Inner nodes are circles consisting of a center
resource and a radius
Each node is fully contained in its parent
Distance metric: shortest path distance in the
undirected graph
18
Building the index: clustering
19
Building the index: clustering
20
Building the index: clustering
21
Building the index: clustering
22
Building the index: clustering

Standard k-medoids clustering (Kaufman &
Rousseeuw, 1987)

R 
 log

M

How many clusters? k  2 
2




R is the set of resources
M is the maximum number of resources per page
Average link gives the best performance for
the inter-cluster distance
23
Building the index: the tree
24
Building the index: the tree
25
Building the index: the tree
26
Outline




RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
27
Queries to constraints

Extract constraints from the query:




d(?v1, Italian) ≤ 1
d(?v2, Norfolk) ≤ 1
d(?v3, Italian) ≤ 2
…and so on
28
Query evaluation

1.
2.
3.
Goal: identify the smallest circle that is
guaranteed to contain an answer to the
query
Perform a depth-first traversal
For each index node, evaluate the
constraints
If the constraints guarantee an answer,
perform subgraph matching
29
Query evaluation
30
Evaluating constraints

Constraints:


d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1,
d(?v3, Italian) ≤ 2
Question: is ?v1 in the circle (Grivanti, 3)?


d(Grivanti,?v1) ≤ d(Grivanti, Italian) +
d(?v1, Italian)
≤1+1=2
?v1 must be in the circle (Grivanti, 3)
31
Evaluating constraints

Question: is ?v3 in (Grivanti, 3)?



d(Grivanti, ?v3) ≤ d(Grivanti, Italian) +
d(Italian, ?v3) ≤
1+2=3
?v3 must be in (Grivanti, 3)
Similarly, ?v2 is in the same circle
32
Subgraph matching

Perform subgraph matching on the resources
in the circles guaranteed to contain an
answer


Algorithm by Cordella et. al, IEEE PAMI 26(10),
2006
Worst-time complexity of O(N!)


Where N is the maximum number of nodes in
either graph
In practice, GRIN makes N very small
33
Outline




RDF data and queries
The GRIN Index structure
Answering queries
Experimental evaluation
34
Experimental framework

Comparison between GRIN, Sesame, Jena2
and RDFBroker (in-memory)




Index build time
Memory consumption at query time
Query time
Two real-world datasets:


TAP (Stanford): datasets between 1.5MB and
300MB
ChefMoz (chefmoz.org): 220 MB
35
Index build time
Build time
200000
180000
GRIN
RDFBroker
Sesame
Jena
160000
Time [ms]
140000
120000
100000
80000
60000
40000
20000
0
2
4
16
32
64
128
256
300
Data size [MB]
36
Memory consumption
Memory consumption
30
GRIN
RDFBroker
Sesame
Jena
25
Memory [MB]
20
15
10
5
0
1
5
10
15
20
25
Number of variables
37
Query time
Query time
25000
GRIN
RDFBroker
Sesame
Jena
Time [ms]
20000
15000
10000
5000
0
1
5
10
15
20
25
Number of variables
38
Average degree of a query node
Dependence of query time on the avg deg of a query node
9000
GRIN
8000
RDFBroker
Query time [ms]
7000
6000
5000
4000
3000
2000
1000
0
1
1.2
1.5
1.8
2
2.2
2.5
2.8
3
3.2
3.5
Average degree of a node
39
Conclusions




Method for indexing large RDF graphs adapted
to the characteristics of RDF queries
Avoids expensive join operations
Gives better query times than Jena2, Sesame
and RDFBroker
Current and future work:


Disk-based index
Analysis of overlap and coverage
40

PPT - University of Maryland

Transcript PPT - University of Maryland

Directory