RDF Store – Evaluations and Issues


RDF Triple Stores
Nipun Bhatia
Department of Computer Science, Stanford University
Contents
Introduction
Different Architectures
• Implications
An Example: Jena SDB
Evaluations
• Evaluations using LUBM/DBpedia
Open Research Issues
Which RDF store to choose for a particular application?
Possible system diagram for Phenotype Annotations
Introduction
What is an RDF store?
A system to provide a mechanism for persistent storage and access of RDF graphs.
Potential application areas:
Plenty! Backend for Protege, BioPortal, Phenotype Annotations.
Different Architectures
Based on their implementation, RDF stores can be divided into 3 broad categories: in-memory, native, and non-native non-memory.
In-memory: the RDF graph is stored as triples in main memory. E.g. storing an RDF graph using the Jena API or the Sesame API.
Native: persistent storage systems with their own database implementation. E.g. Sesame Native, Virtuoso, AllegroGraph, Oracle 11g.
Non-native non-memory: persistent storage systems set up to run on third-party databases. E.g. Jena SDB.
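
To make the in-memory category concrete, here is a minimal Python sketch of a graph held as a set of triples with wildcard pattern matching. The class name, the subject index, and the `ex:` data are assumptions for illustration; this is not the Jena or Sesame API.

```python
from collections import defaultdict


class InMemoryGraph:
    """Toy in-memory RDF graph: a set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()
        # Secondary index from subject to its triples, to avoid full scans.
        self.by_subject = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self.triples.add(triple)
        self.by_subject[s].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        candidates = self.by_subject.get(s, set()) if s is not None else self.triples
        return sorted(
            t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)
        )


graph = InMemoryGraph()
graph.add("ex:alice", "rdf:type", "ex:Student")
graph.add("ex:alice", "ex:memberOf", "ex:Dept0")
graph.add("ex:bob", "rdf:type", "ex:Professor")

students = graph.match(p="rdf:type", o="ex:Student")
about_alice = graph.match(s="ex:alice")
```

Everything lives in process memory, so nothing survives a restart. That is the gap the native and non-native persistent stores fill.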
Implications
Scalability
Different query languages supported to varying degrees.
• Sesame – SeRQL, Oracle 11g – own query language.
Different levels of inferencing.
• Sesame supports RDFS inference, AllegroGraph – RDFS++, Oracle 11g – RDFS++, OWL Prime
Lack of interoperability and portability.
• More pronounced in native stores.
Jena SDB
SDB is essentially a Java loader.
Multiple stores supported: MySQL, PostgreSQL, Oracle, DB2.
Takes incoming triples and breaks them down into components ready for the database.
Multiple layouts.
Integration with the Joseki server.
SPARQL supported.
(Non) Interest Declaration: I was previously an intern at HP Labs with the Jena team.
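
The idea of breaking incoming triples into components for a relational backend can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in for the third-party database. The two-table schema (`nodes`, `triples`) and the helper names are assumptions for illustration, not SDB's actual layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, lex TEXT UNIQUE NOT NULL);
    CREATE TABLE triples (
        s INTEGER NOT NULL REFERENCES nodes(id),
        p INTEGER NOT NULL REFERENCES nodes(id),
        o INTEGER NOT NULL REFERENCES nodes(id),
        PRIMARY KEY (s, p, o)
    );
    """
)


def node_id(term):
    """Return the id of an RDF term, inserting it into nodes if it is new."""
    conn.execute("INSERT OR IGNORE INTO nodes (lex) VALUES (?)", (term,))
    return conn.execute("SELECT id FROM nodes WHERE lex = ?", (term,)).fetchone()[0]


def load(triples):
    """Break each incoming triple into node ids and store the id row."""
    for s, p, o in triples:
        conn.execute(
            "INSERT OR IGNORE INTO triples (s, p, o) VALUES (?, ?, ?)",
            (node_id(s), node_id(p), node_id(o)),
        )
    conn.commit()


load([
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:bob", "rdf:type", "ex:Student"),
])

node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
triple_count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```

Each distinct term is stored once and triples refer to it by id, so repeated terms such as `rdf:type` do not bloat the triple table. The cost is extra joins at query time.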
Evaluations
Third party evaluations for Sesame, Jena SDB, Virtuoso
Oracle 11g company evaluations
Methodology
• LUBM – Lehigh University Benchmark
• DBpedia
• Multiple queries
• Load times
Evaluations
DBpedia – Database of structured information extracted from Wikipedia. Information about places, persons, music albums and films [2]
LUBM – Synthetically generated RDF data containing universities, departments, students etc. [1]
Dataset size:
• DataSet 1: 15,472,624 triples; 2.1 GB
• DataSet 2: LUBM 50 – 2.75 million & LUBM 1000 – 55.09 million
• 3 queries
Loading Time – DataSet 1
Results – Query 1
Simple select query – 2 variables
Query 2
Unconstrained select query – only the predicate was specified.
Query 3
Complex query – uses a filter
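
The slides do not reproduce the benchmark queries themselves. The sketch below only illustrates the three query shapes named above, using made-up data and hypothetical SPARQL shown in comments. Each query is evaluated with a plain Python pattern matcher, not a SPARQL engine.

```python
# Toy data; all names and values are made up for illustration.
TRIPLES = [
    ("ex:Berlin", "ex:population", 3400000),
    ("ex:Berlin", "ex:country", "ex:Germany"),
    ("ex:Paris", "ex:population", 2100000),
    ("ex:Paris", "ex:country", "ex:France"),
    ("ex:Lyon", "ex:population", 470000),
]


def match(s=None, p=None, o=None):
    """Yield triples matching a pattern; None plays the role of a variable."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


# Shape 1: simple select with two variables.
# SELECT ?p ?o WHERE { ex:Berlin ?p ?o }
q1 = [(p, o) for _, p, o in match(s="ex:Berlin")]

# Shape 2: unconstrained select, only the predicate is fixed.
# SELECT ?s ?o WHERE { ?s ex:population ?o }
q2 = [(s, o) for s, _, o in match(p="ex:population")]

# Shape 3: select with a filter on a bound value.
# SELECT ?s WHERE { ?s ex:population ?n . FILTER(?n > 1000000) }
q3 = [s for s, _, n in match(p="ex:population") if n > 1000000]
```

The second shape touches every triple with the given predicate, which is why it tends to stress a store's indexing and result streaming more than the first.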
Oracle 11g – DataSet 2
Ontology (size)            RDFS triples   RDFS time   OWL Prime triples   OWL Prime time
LUBM 50 (6.8 million)      2.75 M         12.14 min   3.05 M              8.01 min
LUBM 1000 (133.6 M)        55.09 M        7h 19m      65.25 M             7h 12m
Observations
Native stores perform better than systems using third-party stores.
• Optimizations are possible
Each of the systems uses different database layouts.
• Virtuoso – OGPS, POGS, PSOG, SOPG
• SDB – SPO, GSPO
Hashing on SDB performs very poorly.
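
Why the layout matters can be sketched with sorted indexes. An index ordered as, say, POGS answers a query quickly when its leading columns (here P, then O) are bound. The orderings are the ones listed on the slide. The selection logic, function names, and data are assumptions for illustration, not how Virtuoso or SDB actually plan queries.

```python
from bisect import bisect_left

QUADS = [  # (graph, subject, predicate, object); made-up data
    ("g1", "alice", "knows", "bob"),
    ("g1", "bob", "knows", "carol"),
    ("g2", "alice", "likes", "tea"),
]

POSITION = {"G": 0, "S": 1, "P": 2, "O": 3}


def build_index(order):
    """Sort quads by a column order such as 'POGS'."""
    cols = [POSITION[c] for c in order]
    return sorted(tuple(q[c] for c in cols) for q in QUADS)


def usable_prefix(order, bound):
    """Count how many leading index columns are bound in the query."""
    n = 0
    for c in order:
        if c not in bound:
            break
        n += 1
    return n


def lookup(orders, **bound):
    """Pick the index with the longest bound prefix and range-scan it."""
    order = max(orders, key=lambda o: usable_prefix(o, bound))
    n = usable_prefix(order, bound)
    prefix = tuple(bound[c] for c in order[:n])
    index = build_index(order)  # a real store would keep this prebuilt
    start = bisect_left(index, prefix)
    rows = []
    for row in index[start:]:
        if row[:n] != prefix:
            break
        rows.append(row)
    # Bound columns outside the prefix are checked row by row.
    rest = [(i, bound[c]) for i, c in enumerate(order) if c in bound and i >= n]
    return order, [r for r in rows if all(r[i] == v for i, v in rest)]


VIRTUOSO_ORDERS = ["OGPS", "POGS", "PSOG", "SOPG"]
order, rows = lookup(VIRTUOSO_ORDERS, P="knows")
```

With only SPO and GSPO available, the same predicate-only query has no usable prefix and falls back to a full scan. This is one plausible reading of how layout choice affects query times.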
Open Research Issues
Inferencing [4]
Present common implementations:
• Make a number of small queries to propagate the effects of rule firing.
• Each of these queries creates an interaction with the database.
• Not very efficient.
Approaches
• Snapshot the contents of the database-backed model into RAM for the duration of processing by the inference engine.
• Perform inferencing in-stream.
  • Precompute the inference closure of the ontology, analyze the incoming data streams, and add triples to them based on that closure.
  • Assumes rigid separation of the RDF data (A-box) and the ontology data (T-box).
  • Even this may not work for very large ontologies, e.g. biomedical ontologies.
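
The in-stream approach can be sketched for a single rule, subclass inheritance of `rdf:type`. The closure of the T-box is computed once, then each incoming A-box triple is expanded as it passes through, with no database round trips. The function names and the class hierarchy are hypothetical, and a real pipeline would cover more rules and remove duplicates.

```python
def subclass_closure(subclass_of):
    """Precompute, for every class, all of its (transitive) superclasses."""
    closure = {}

    def supers(cls, seen):
        for parent in subclass_of.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                supers(parent, seen)
        return seen

    for cls in subclass_of:
        closure[cls] = supers(cls, set())
    return closure


def infer_in_stream(triples, closure):
    """Pass triples through, adding rdf:type triples implied by the closure."""
    for s, p, o in triples:
        yield (s, p, o)
        if p == "rdf:type":
            for parent in sorted(closure.get(o, ())):
                yield (s, "rdf:type", parent)


# T-box: hypothetical class hierarchy (rdfs:subClassOf edges).
TBOX = {
    "ex:GradStudent": ["ex:Student"],
    "ex:Student": ["ex:Person"],
}
# A-box: incoming instance data.
ABOX = [
    ("ex:alice", "rdf:type", "ex:GradStudent"),
    ("ex:alice", "ex:name", "Alice"),
]

closure = subclass_closure(TBOX)
output = list(infer_in_stream(ABOX, closure))
```

The closure must fit in memory and must not change while data streams in. This matches the two caveats above about T-box/A-box separation and very large ontologies.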
Open Research Issues
Query Optimization
• Third-party stores undo any optimization done at the API level.
• The better performance of native stores points in that direction.
• Some work exists on optimizing SPARQL queries for in-memory stores.
Which RDF store to choose for an app?
Frequency of loads that the application would perform.
Single scaling factor and linear load times.
Level of inferencing.
Support for which query language. W3C recommendations.
Special system needs. E.g. AllegroGraph needs a 64-bit processor.
Phenotype Annotations
[System diagram. Labeled components: Jena API, Inferencing, Jena Model, SDB, MySQL / Virtuoso, and the set of ontologies required for Phenotype Annotations, e.g. PATO, Fly etc.]
Phenotype Annotations
[System diagram. Labeled components: Jena API, Jena Model, SDB.]
References
[1] http://esw.w3.org/topic/RdfStoreBenchmarking
[2] http://www4.wiwiss.fu-berlin.de/benchmarks-200801/
[3] Kurt Rohloff et al. An Evaluation of Triple-Store Technologies for Large Data Stores. Comparing Sesame, Jena and AllegroGraph. 2007.
[4] N. Bhatia, A. Seaborne. Ingestion pipeline for RDF.