Building an efficient RDF store over a Relational Database
Bishwaranjan Bhattacharjee
IBM T.J. Watson Research Center
March 2013
© 2013 IBM Corporation
Related SIGMOD 2013 Research Track Paper:
Building an efficient RDF store over a relational database
Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis,
Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea,
Bishwaranjan Bhattacharjee
Executive Summary
• New mechanism to store RDF data in relational systems
• Developed for DB2 LUW with support for SPARQL*
• Other possibilities beyond RDF
* SPARQL Protocol and RDF Query Language
Brief introduction to RDF
Biological data
Financial applications
Government
Watson (Jeopardy Champ)
Social media
RDF data has a variable schema and is sparse; a dataset can have thousands of entities and predicates.
Sample SPARQL query
What are all the country capitals in Africa?
PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
RDF data management
Options: relational RDF storage or native RDF storage
Relational RDF storage (triple table):
Subject         | Predicate    | Object
----------------+--------------+---------------
Wolfgang Mozart | sonOf        | Leopold Mozart
Wolfgang Mozart | placeOfBirth | Salzburg
Wolfgang Mozart | DoB          | 1756
Wolfgang Mozart | category     | musician
Pros: Transaction support, compression, scalability, security, …
Cons: Inefficient query processing
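
One common way to see the query-processing cost of a plain triple table is that every triple pattern in a SPARQL query turns into another self-join. The sketch below runs the Africa-capitals query from the earlier slide against an in-memory SQLite triple table. The table name `triples`, the function name, and the sample rows are assumptions made for illustration; they are not taken from the slides or from DB2.

```python
import sqlite3


def capitals_in_africa(rows):
    """Run the Africa-capitals query over a plain (subject, predicate, object) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    # One table alias per triple pattern: four patterns need three self-joins.
    sql = """
        SELECT t1.object AS capital, t3.object AS country
        FROM triples t1
        JOIN triples t2 ON t2.subject = t1.subject
        JOIN triples t3 ON t3.subject = t2.object
        JOIN triples t4 ON t4.subject = t3.subject
        WHERE t1.predicate = 'cityname'
          AND t2.predicate = 'isCapitalOf'
          AND t3.predicate = 'countryname'
          AND t4.predicate = 'isInContinent'
          AND t4.object = 'Africa'
        ORDER BY capital
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical sample data, not from the slides.
SAMPLE = [
    ("city1", "cityname", "Nairobi"),
    ("city1", "isCapitalOf", "country1"),
    ("country1", "countryname", "Kenya"),
    ("country1", "isInContinent", "Africa"),
    ("city2", "cityname", "Vienna"),
    ("city2", "isCapitalOf", "country2"),
    ("country2", "countryname", "Austria"),
    ("country2", "isInContinent", "Europe"),
    ("city3", "cityname", "Cairo"),
    ("city3", "isCapitalOf", "country3"),
    ("country3", "countryname", "Egypt"),
    ("country3", "isInContinent", "Africa"),
]

print(capitals_in_africa(SAMPLE))
```

The number of joins grows with the number of triple patterns, which is the kind of cost a wider, subject-oriented layout is meant to reduce.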
Our System Architecture
Primary Hash Table
Column names in the relational table (index on Subject):
Subject  | Predicate 1 | Value/Ref 1 | … | Predicate N  | Value/Ref N | Bitmap
---------+-------------+-------------+---+--------------+-------------+--------
SubjectA | hasName     | name Z      | … | hasComposed  | 32162567    | 000…001
SubjectB | hasName     | name Y      | … | lifeStory    | ABCDEFG     | 000…000
SubjectA | category    | musicians   | … | placeOfBirth | place X     | 000…000
Hashtable for the predicates connected to a subject.
Insertion:
When a triple is inserted, its predicate is hashed to a position in the hashtable. If that position is occupied, multiple hash functions are tried to find an empty location.
Retrieval:
If the predicate is unknown, scan all predicate columns; otherwise, hash the predicate to retrieve its column.
Value/Ref:
If a subject, predicate pair has a single value (more than 81% of pairs in DBpedia), the value is stored in the hashtable (e.g., hasName).
Otherwise, if there are multiple values, a reference to a secondary hash table is stored (e.g., hasComposed and 32162567).
Bitmap:
Specifies whether the value/ref column for each predicate contains a value or a reference.
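
The insertion rules above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the DB2 implementation: the column count, the two salted SHA-256 hash functions, the class and method names, and the choice to start another row for the same subject when every candidate column is taken are all hypothetical.

```python
import hashlib

NUM_COLUMNS = 4              # hypothetical number of (predicate, value/ref) column pairs
HASH_SALTS = ("h1:", "h2:")  # two hypothetical hash functions, distinguished by salt


def candidate_columns(predicate):
    """Column positions a predicate may occupy, one per hash function, in probe order."""
    cols = []
    for salt in HASH_SALTS:
        digest = hashlib.sha256((salt + predicate).encode("utf-8")).digest()
        col = int.from_bytes(digest[:4], "big") % NUM_COLUMNS
        if col not in cols:
            cols.append(col)
    return cols


class PrimaryHashTable:
    def __init__(self):
        self.rows = []       # each row: subject, preds, vals, bitmap
        self.secondary = {}  # reference id -> list of values
        self._next_ref = 1

    def _rows_for(self, subject):
        return [r for r in self.rows if r["subject"] == subject]

    def insert(self, subject, predicate, value):
        cols = candidate_columns(predicate)
        rows = self._rows_for(subject)
        # Case 1: the (subject, predicate) pair already exists, so it is multi-valued.
        for row in rows:
            for c in cols:
                if row["preds"][c] == predicate:
                    if row["bitmap"] & (1 << c):
                        # Column already holds a reference: append to the secondary list.
                        self.secondary[row["vals"][c]].append(value)
                    else:
                        # Column holds a plain value: move it behind a new reference.
                        ref = self._next_ref
                        self._next_ref += 1
                        self.secondary[ref] = [row["vals"][c], value]
                        row["vals"][c] = ref
                        row["bitmap"] |= 1 << c
                    return
        # Case 2: first value for this pair, so look for an empty candidate column.
        for row in rows:
            for c in cols:
                if row["preds"][c] is None:
                    row["preds"][c], row["vals"][c] = predicate, value
                    return
        # Case 3: every candidate column is taken, so start another row for the subject.
        row = {"subject": subject, "preds": [None] * NUM_COLUMNS,
               "vals": [None] * NUM_COLUMNS, "bitmap": 0}
        row["preds"][cols[0]], row["vals"][cols[0]] = predicate, value
        self.rows.append(row)

    def values(self, subject, predicate):
        """All values stored for a (subject, predicate) pair."""
        for row in self._rows_for(subject):
            for c in candidate_columns(predicate):
                if row["preds"][c] == predicate:
                    if row["bitmap"] & (1 << c):
                        return list(self.secondary[row["vals"][c]])
                    return [row["vals"][c]]
        return []


table = PrimaryHashTable()
table.insert("SubjectA", "hasName", "name Z")
table.insert("SubjectA", "hasComposed", "composition1")
table.insert("SubjectA", "hasComposed", "composition2")
print(table.values("SubjectA", "hasName"))
print(table.values("SubjectA", "hasComposed"))
```

The bitmap is what lets a reader of the row tell a stored value apart from a reference id, since both live in the same value/ref column.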
Reverse Primary Hash Table
Column names in the relational table (index on Object):
Object    | Predicate 1  | Subject 1 | … | Predicate M  | Subject M | Bitmap
----------+--------------+-----------+---+--------------+-----------+--------
musicians | category     | 3256874   | … | Total number | 100000    | 100…000
place x   | placeOfBirth | 5238765   | … | population   | 500000    | 100…000
name z    | hasName      | SubjectA  | … |              |           | 000…000
Hashtable for the predicates connected to an object.
Insertion:
When a triple is inserted, its predicate is hashed to a position in the hashtable. If that position is occupied, multiple hash functions are tried to find an empty location.
Retrieval:
If the predicate is unknown, scan all columns; otherwise, hash the predicate to retrieve its column.
Value/Ref:
If there is a single value for an object, the value is stored in the hashtable (e.g., population).
Otherwise, if there are multiple values, a reference to a secondary hash table is stored (e.g., 3256874).
Bitmap:
Specifies whether the value/ref column for each predicate contains a value or a reference.
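
The two retrieval cases can be illustrated with a small Python sketch over a hand-built reverse table. Everything here is an assumption for illustration: the three-column layout, the `CANDIDATE_COLUMNS` mapping (a stand-in for the hash functions, which the slides do not specify), and the abbreviated secondary lists.

```python
# Hypothetical candidate columns per predicate (stand-in for the hash functions).
CANDIDATE_COLUMNS = {
    "category": [0, 2],
    "placeOfBirth": [0, 1],
    "hasName": [1, 2],
}

# Reverse primary rows: one (predicate, subject-or-reference) cell per column.
# Bit i of the bitmap is set when column i holds a reference, not a subject.
REVERSE_ROWS = [
    {"object": "musicians", "cols": [("category", 3256874), None, None], "bitmap": 0b001},
    {"object": "place x", "cols": [("placeOfBirth", 5238765), None, None], "bitmap": 0b001},
    {"object": "name z", "cols": [None, ("hasName", "SubjectA"), None], "bitmap": 0b000},
]

# Secondary table, abbreviated: reference id -> subjects.
SECONDARY = {
    3256874: ["Subject A", "Subject B"],
    5238765: ["Subject A", "Subject B", "Subject Z"],
}


def subjects_for(obj, predicate=None):
    """Return (predicate, subject) pairs for an object.

    Known predicate: probe only its candidate columns.
    Unknown predicate: scan every column of the matching rows.
    """
    results = []
    for row in REVERSE_ROWS:
        if row["object"] != obj:
            continue
        if predicate is None:
            columns = range(len(row["cols"]))
        else:
            columns = CANDIDATE_COLUMNS.get(predicate, [])
        for c in columns:
            cell = row["cols"][c]
            if cell is None:
                continue
            pred, stored = cell
            if predicate is not None and pred != predicate:
                continue
            if row["bitmap"] & (1 << c):
                # Reference: expand it through the secondary table.
                results.extend((pred, s) for s in SECONDARY[stored])
            else:
                results.append((pred, stored))
    return results


print(subjects_for("name z", "hasName"))
print(subjects_for("musicians"))
```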
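Secondary Hash Table
Column names in the relational table (index on Reference):
Reference | Value 1      | Value 2      | …            | Value (K-1)  | Value K
----------+--------------+--------------+--------------+--------------+--------
3256874   | Subject A    | Subject B    | …            |              |
5238765   | Subject A    | Subject B    | Subject Z    | …            |
32162567  | composition1 | composition2 | compositionk | compositionz | …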
During query processing, based on the reference id, the values are attached to a subject or object
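
A minimal sketch of how values might be attached through the reference id at query time, using SQLite in place of DB2. The table and column names are hypothetical, the primary table is reduced to a single predicate/value column pair with an explicit `is_ref` flag standing in for the bitmap, and the secondary table is stored as one row per value rather than in the wide layout shown on the slide.

```python
import sqlite3


def build_demo():
    """Create a tiny primary/secondary pair in memory (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE primary_hash (subject TEXT, pred TEXT, val TEXT, is_ref INTEGER);
        CREATE TABLE secondary_hash (ref TEXT, value TEXT);
    """)
    conn.executemany("INSERT INTO primary_hash VALUES (?, ?, ?, ?)", [
        ("SubjectA", "hasName", "name Z", 0),        # single value stored inline
        ("SubjectA", "hasComposed", "32162567", 1),  # reference to the secondary table
    ])
    conn.executemany("INSERT INTO secondary_hash VALUES (?, ?)", [
        ("32162567", "composition1"),
        ("32162567", "composition2"),
    ])
    return conn


def values_for(conn, subject, predicate):
    """Return inline values as-is and expand references through the secondary table."""
    sql = """
        SELECT COALESCE(s.value, p.val) AS attached
        FROM primary_hash p
        LEFT JOIN secondary_hash s
               ON p.is_ref = 1 AND s.ref = p.val
        WHERE p.subject = ? AND p.pred = ?
        ORDER BY attached
    """
    return [row[0] for row in conn.execute(sql, (subject, predicate))]


conn = build_demo()
print(values_for(conn, "SubjectA", "hasName"))
print(values_for(conn, "SubjectA", "hasComposed"))
```

The join is only taken for rows flagged as references, so single-valued predicates are answered from the primary table alone.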
Graph Coloring To Assign Columns To Predicates
Column names in the relational table (index on Subject):
Subject  | Predicate 1 | Value/Ref 1 | … | Predicate N | Value/Ref N | Bitmap
---------+-------------+-------------+---+-------------+-------------+--------
SubjectA | SSN         | 123456      | … | hasComposed | 32162567    | 000…001
SubjectB | Revenue     | 236090      | … | Headquarter | Armonk      | 000…000
SubjectC | Population  | 50000       | … | Mayor       | John Smith  | 000…000
SubjectA: Details about a person
SubjectB: Details about a company
SubjectC: Details about a city
Intelligent data layout
Given triples:
a P b
a Q c
a R d
b Q e
b L f
c S g
c T h
Find ‘predicate sets’ (the predicates that occur together on one subject): PQR, QL, ST
Build a graph whose nodes are the predicates (P, Q, R, L, S, T), with edges connecting predicates that appear in the same predicate set.
Color the graph using graph coloring; each color is an assignment of a predicate to a column. Notice that for the 6 distinct predicates in these 7 triples, we use only 3 colors.
Uses Floyd-Warshall greedy algorithm. Number of colors <= number of columns
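
The layout steps can be tried out on the seven example triples with the short Python sketch below. It uses a simple largest-degree-first greedy coloring as a stand-in, and it is not claimed to be the algorithm named on the slide. The function names are hypothetical.

```python
from itertools import combinations


def predicate_sets(triples):
    """Group predicates by subject: one predicate set per subject."""
    sets = {}
    for subject, predicate, _obj in triples:
        sets.setdefault(subject, set()).add(predicate)
    return list(sets.values())


def interference_graph(pred_sets):
    """Connect every pair of predicates that co-occur in some predicate set."""
    graph = {}
    for pset in pred_sets:
        for p in pset:
            graph.setdefault(p, set())
        for a, b in combinations(sorted(pset), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Largest-degree-first greedy coloring; each color is a column index."""
    order = sorted(graph, key=lambda p: (-len(graph[p]), p))
    color = {}
    for p in order:
        used = {color[n] for n in graph[p] if n in color}
        c = 0
        while c in used:
            c += 1
        color[p] = c
    return color


TRIPLES = [
    ("a", "P", "b"), ("a", "Q", "c"), ("a", "R", "d"),
    ("b", "Q", "e"), ("b", "L", "f"),
    ("c", "S", "g"), ("c", "T", "h"),
]

graph = interference_graph(predicate_sets(TRIPLES))
columns = greedy_coloring(graph)
print(columns)
print("columns used:", len(set(columns.values())))
```

Predicates that never co-occur on a subject, such as L and P here, can share a column, which is how six predicates fit into three columns.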
Query Optimization
SPARQL -> SPARQL to SQL Optimization (uses extra statistics) -> SQL -> SQL Optimization by QRW (uses RDBMS optimizer statistics) -> Optimized SQL
Overall Stack Architecture
Jena TDB: Jena API -> Jena Query Engine -> Jena Native Store
Jena SDB: Jena API -> Jena Query Engine -> DB2/Oracle/MySQL
DB2 noSQL Graph Store: Jena API -> DB2 SPARQL to SQL Query Engine -> DB2
Java APIs and HTTP based SPARQL querying
Scalability
[Figure: ESE Server and pureScale configurations; labels: member, CPUs, DRAM, BP, CA/CF, GBP]
Key Research Innovations
• Flexible schema uses hashing techniques to ‘bind’ predicates to column values. Overloads a single column to hold multiple predicates.
• Intelligent compilation of SPARQL to SQL based on an estimate of the cost of accessing each triple.
• Schema customization for a dataset (like a re-org capability) to exploit correlations between predicate co-occurrences, minimizing storage and maximizing indexing capabilities.
• Workload analysis that can advise which predicates to index.
Internal use case of RDF
An IBM product has an RDF repository of objects
Previously experimented with various stores
  Gave up due to performance problems
Currently using Jena TDB
  Open-source, Java-based RDF repository
  Performs better than the previous stores tried
  Scalability and handling of updates are a concern
Some comparisons
• Dataset: 60M triples.
• SPARQL query workload: 29 queries, some very complex (including more than 100 unions of graph patterns).
• Schema layout using the reorganization facility, plus predicate indexes as determined by re-org.
• Query workload issued 5 times to Jena TDB and DB2 noSQL Graph Store in a randomized order, in a single-user environment.
• Average performance for the 29-query workload: more than 4X better than Jena TDB.
Other comparisons
On SP2Bench, LUBM, DBpedia
With Virtuoso, Jena, Sesame, RDF-3X
Conclusion
• New mechanism to store RDF data in relational systems
• Provides significant performance improvement
  - Compared to conventional triple store approaches on an RDBMS
  - Compared to Jena TDB
• For more details, please attend the SIGMOD paper presentation