Transcript Slide 1

Scalable Semantic Web Data Management Using Vertical Partitioning
Daniel Abadi (2, 1), Adam Marcus (2), Samuel Madden (2), and Kate Hollenbach (2)
(1) Yale University, (2) MIT
July 21, 2015

RDF Data Is Proliferating

- Semantic Web vision: make Web machine-readable
- RDF is the data model behind Semantic Web
- Increasing amount of data published using RDF
  - Swoogle indexes 2,271,350 Semantic Web documents
  - Biologists seem sold on Semantic Web
    - Integrated data from Swiss-Prot, TrEMBL, and PIR protein databases available in RDF (500 million statements)

7/21/2015
Daniel Abadi - Yale

DBFacebook: A New Social Networking Application

[Figure: RDF data model, drawn as a graph. PersonID1 (name "Mike Stonebraker") and PersonID2 (name "David DeWitt") are both of type Person and are linked by "knows" edges. PersonID1 likes "Things found in nature (streams, sequoias, auroras)" and dislikes "Elastic/Velcro/Anything 'One-size-fits-all'"; PersonID2 dislikes "Double blind reviewing". PersonID1 is authorOf Pub101 and Pub102; PersonID2 is authorOf Pub102 and Pub103. Pub101 has title "The Design of Postgres" and venue SIGMOD; Pub102 has title "Implementation Techniques for Main Memory Database Systems" and venue SIGMOD; Pub103 has title "GAMMA – A High Performance Dataflow Database Machine" and venue VLDB.]

DBFacebook: A New Social Networking Application

[Figure: the same graph with namespace-qualified names. Resources become URIs (http://DBFaceBook.com/PersonID1, http://DBFaceBook.com/PersonID2, http://DBFaceBook.com/Pub101, http://DBFaceBook.com/Pub102, http://DBFaceBook.com/Pub103); the class is foaf:Person; the properties are rdf:type, foaf:name, foaf:knows, dbfb:likes, dbfb:dislikes, dbfb:authorOf, dbfb:title, and dbfb:venue; the venues are dbfb:SIGMOD and dbfb:VLDB.]

RDF Data Management

- Early projects built their own RDF stores
- Trend now towards storing in RDBMSs
- Paper examines 3 approaches for storing RDF data in an RDBMS …

DBFacebook RDF Graph

[Figure: the DBFacebook RDF graph from the earlier "DBFacebook: A New Social Networking Application" slide, repeated.]

Approach 1: Triple Stores

Subject    Property  Object
PersonID1  type      Person
PersonID1  name      “Mike Stonebraker”
PersonID1  likes     “Things found in nature (streams, sequoias, auroras)”
PersonID1  dislikes  “Elastic/Velcro/Anything ‘One-size-fits-all’”
PersonID1  authorOf  Pub101
PersonID1  authorOf  Pub102
PersonID2  type      Person
PersonID2  name      “David DeWitt”
PersonID2  dislikes  “Double blind reviewing”
PersonID2  authorOf  Pub102
PersonID2  authorOf  Pub103
Pub101     title     “The Design of Postgres”
Pub101     venue     SIGMOD
Pub102     title     “Implementation Techniques for Main Memory Databases”
Pub102     venue     SIGMOD
Pub103     title     “GAMMA – A High Performance Dataflow Database”
Pub103     venue     VLDB

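A triple store of this shape can be tried out with any relational engine. The following is a minimal sketch, assuming an in-memory SQLite database and the column names subject, property, and object (taken from the table header on the slide). The helper name build_triple_store is hypothetical, and only a subset of the rows is loaded.

```python
import sqlite3

# A subset of the rows from the slide's triples table.
TRIPLES = [
    ("PersonID1", "type", "Person"),
    ("PersonID1", "name", "Mike Stonebraker"),
    ("PersonID1", "authorOf", "Pub101"),
    ("PersonID1", "authorOf", "Pub102"),
    ("PersonID2", "type", "Person"),
    ("PersonID2", "name", "David DeWitt"),
    ("PersonID2", "authorOf", "Pub102"),
    ("PersonID2", "authorOf", "Pub103"),
    ("Pub101", "title", "The Design of Postgres"),
    ("Pub101", "venue", "SIGMOD"),
]


def build_triple_store(triples):
    """Create an in-memory three-column triples table and load the rows.

    Hypothetical helper for illustration; not taken from the slides.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


conn = build_triple_store(TRIPLES)

# Every fact about a resource is one row, so fetching a resource
# means scanning (or indexing) the single large table by subject.
rows = conn.execute(
    "SELECT property, object FROM triples "
    "WHERE subject = 'PersonID1' ORDER BY rowid"
).fetchall()
print(rows)
```

Because every statement lands in the same table, the schema never changes when new properties appear, which is the main appeal of this layout.
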
DBFacebook RDF Graph

[Figure: the DBFacebook RDF graph from the earlier "DBFacebook: A New Social Networking Application" slide, repeated.]

Approach 2: Property Tables

Subject    name              likes                                                dislikes
PersonID1  Mike Stonebraker  Things found in nature (streams, sequoias, auroras)  Elastic/Velcro/Anything ‘One-size-fits-all’
PersonID2  David DeWitt      NULL                                                 Double Blind Reviewing

Subject  title                                                  venue
Pub101   “The Design of Postgres”                               SIGMOD
Pub102   “Implementation Techniques for Main Memory Databases”  SIGMOD
Pub103   “GAMMA – A High Performance Dataflow Database”         VLDB

DBFacebook RDF Graph

[Figure: the DBFacebook RDF graph from the earlier "DBFacebook: A New Social Networking Application" slide, repeated.]

Approach 3: One-table-per-property

name
Subject    Object
PersonID1  Mike Stonebraker
PersonID2  David DeWitt

likes
Subject    Object
PersonID1  Things found in nature (streams, sequoias, auroras)

dislikes
Subject    Object
PersonID1  Elastic/Velcro/Anything ‘One-size-fits-all’
PersonID2  Double Blind Reviewing

authorOf
Subject    Object
PersonID1  Pub101
PersonID1  Pub102
PersonID2  Pub102
PersonID2  Pub103

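The one-table-per-property layout can be derived mechanically from a list of triples. The sketch below is an illustration under assumptions: the function name vertically_partition is hypothetical, tables are modelled as plain Python lists, and each table is sorted by subject (then object) so that later joins can proceed by merging.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Split (subject, property, object) triples into one two-column
    (subject, object) table per property, each sorted by subject."""
    tables = defaultdict(list)
    for subject, prop, obj in triples:
        tables[prop].append((subject, obj))
    for rows in tables.values():
        rows.sort()
    return dict(tables)


triples = [
    ("PersonID1", "name", "Mike Stonebraker"),
    ("PersonID2", "authorOf", "Pub103"),
    ("PersonID1", "likes", "Things found in nature (streams, sequoias, auroras)"),
    ("PersonID1", "authorOf", "Pub102"),
    ("PersonID2", "name", "David DeWitt"),
    ("PersonID2", "authorOf", "Pub102"),
    ("PersonID1", "authorOf", "Pub101"),
]

tables = vertically_partition(triples)
for prop, rows in tables.items():
    print(prop, rows)
```

Note that PersonID2 simply has no row in the likes table, and the two publications per author appear as two rows in authorOf; no NULLs or extra columns are needed.
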
Paper Contributions

- Explores advantages/disadvantages of these approaches
  - Triple stores are the dominant choice
  - Property Tables implemented by Jena and Oracle
  - We propose the one-table-per-property approach
- Shows how a column-store can be extended to implement the one-table-per-property approach
- Introduces benchmark for evaluating RDF stores

Results Synopsis

- Triple-store really slow on benchmark with 50M triples
- Property-tables and one-table-per-property approaches are a factor of 3 faster
- One-table-per-property with column-store yields another factor of 10

Querying RDF Data

- SPARQL is the dominant language
- Examples:

SELECT ?name
WHERE { ?x type Person .
        ?x name ?name }

SELECT ?likes ?dislikes
WHERE { ?x title "Implementation Techniques for Main Memory Databases" .
        ?y authorOf ?x .
        ?y likes ?likes .
        ?y dislikes ?dislikes }

Translation to SQL over triples is easy

[Table: the Subject / Property / Object triples table from the "Approach 1: Triple Stores" slide, repeated.]

SPARQL → SQL (over triple store)

Query 1 SPARQL:
SELECT ?name
WHERE { ?x type Person .
        ?x name ?name }

Query 1 SQL:
SELECT B.object
FROM triples AS A, triples AS B
WHERE A.subject = B.subject
  AND A.property = 'type'
  AND A.object = 'Person'
  AND B.property = 'name'

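Query 1 can be checked end to end against a small in-memory database. This is a sketch under assumptions: SQLite stands in for the RDBMS, the table is named triples with columns subject, property, and object, and only a few sample rows from the example data are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("PersonID1", "type", "Person"),
        ("PersonID1", "name", "Mike Stonebraker"),
        ("PersonID2", "type", "Person"),
        ("PersonID2", "name", "David DeWitt"),
        ("Pub101", "title", "The Design of Postgres"),
    ],
)

# Self-join: alias A finds subjects of type Person,
# alias B fetches the name of the same subject.
QUERY_1 = """
SELECT B.object
FROM triples AS A, triples AS B
WHERE A.subject = B.subject
  AND A.property = 'type'
  AND A.object = 'Person'
  AND B.property = 'name'
"""

names = sorted(row[0] for row in conn.execute(QUERY_1))
print(names)
```

Each triple pattern in the SPARQL query becomes one alias of the triples table, and each shared variable (?x here) becomes a join condition.
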
SPARQL → SQL (over triple store)

Query 2 SPARQL:
SELECT ?likes ?dislikes
WHERE { ?x title "Implementation Techniques for Main Memory Databases" .
        ?y authorOf ?x .
        ?y likes ?likes .
        ?y dislikes ?dislikes }

Query 2 SQL:
SELECT C.object, D.object
FROM triples AS A, triples AS B, triples AS C, triples AS D
WHERE A.subject = B.object
  AND A.property = 'title'
  AND A.object = 'Implementation Techniques for Main Memory Databases'
  AND B.property = 'authorOf'
  AND B.subject = C.subject
  AND C.property = 'likes'
  AND C.subject = D.subject
  AND D.property = 'dislikes'

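Query 2 needs a four-way self-join, which can be run the same way. Again a sketch under assumptions: SQLite as the engine, a triples table with columns subject, property, and object, and a subset of the example data. The title is passed as a bound parameter rather than inlined.

```python
import sqlite3

TITLE = "Implementation Techniques for Main Memory Databases"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("PersonID1", "likes", "Things found in nature (streams, sequoias, auroras)"),
        ("PersonID1", "dislikes", "Elastic/Velcro/Anything 'One-size-fits-all'"),
        ("PersonID1", "authorOf", "Pub101"),
        ("PersonID1", "authorOf", "Pub102"),
        ("PersonID2", "dislikes", "Double blind reviewing"),
        ("PersonID2", "authorOf", "Pub102"),
        ("PersonID2", "authorOf", "Pub103"),
        ("Pub101", "title", "The Design of Postgres"),
        ("Pub102", "title", TITLE),
    ],
)

# A: the publication with the given title.
# B: authorOf triples pointing at it (subject-object join with A).
# C, D: likes and dislikes of that author (subject-subject joins).
QUERY_2 = """
SELECT C.object, D.object
FROM triples AS A, triples AS B, triples AS C, triples AS D
WHERE A.subject = B.object
  AND A.property = 'title'
  AND A.object = ?
  AND B.property = 'authorOf'
  AND B.subject = C.subject
  AND C.property = 'likes'
  AND C.subject = D.subject
  AND D.property = 'dislikes'
"""

result = conn.execute(QUERY_2, (TITLE,)).fetchall()
print(result)
```

Both people authored Pub102, but only PersonID1 has a likes triple, so the inner joins return a single row; PersonID2 drops out.
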
Triple Stores

- Accessing multiple properties for a resource requires subject-subject joins
- Path expressions require subject-object joins
- Can improve performance by:
  - Indexing each column
  - Dictionary encoding string data
- Ultimately: Do not scale

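Dictionary encoding replaces each string with a small integer so that joins compare fixed-width IDs rather than long strings. The following is a minimal sketch; the function name dictionary_encode and the choice of a single shared dictionary for subjects, properties, and objects are assumptions made for illustration.

```python
def dictionary_encode(triples):
    """Replace every string in the triples with a small integer ID.

    Returns the encoded triples and the string -> ID dictionary.
    IDs are handed out in order of first appearance.
    """
    ids = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID only for unseen strings.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    return encoded, ids


triples = [
    ("PersonID1", "type", "Person"),
    ("PersonID1", "name", "Mike Stonebraker"),
    ("PersonID2", "type", "Person"),
]

encoded, ids = dictionary_encode(triples)
decode = {number: term for term, number in ids.items()}

print(encoded)
print([tuple(decode[n] for n in row) for row in encoded])
```

The encoded table stores only integers; the dictionary is consulted once at the end of a query to turn result IDs back into strings.
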
Property Tables Can Reduce Joins

Subject    name              likes                                                dislikes
PersonID1  Mike Stonebraker  Things found in nature (streams, sequoias, auroras)  Elastic/Velcro/Anything ‘One-size-fits-all’
PersonID2  David DeWitt      NULL                                                 Double Blind Reviewing

Left-over triples
Subject    Property  Object
PersonID1  authorOf  Pub101
PersonID1  authorOf  Pub102
PersonID2  authorOf  Pub102
PersonID2  authorOf  Pub103
…          …         …

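The split between a property table and left-over triples can be sketched as follows. This is an illustration under assumptions: the function name build_property_table is hypothetical, the chosen columns are treated as single-valued (a second value for the same subject would overwrite the first), and missing values are represented by None in place of NULL.

```python
def build_property_table(triples, columns):
    """Pivot the chosen single-valued properties into one wide row per
    subject; every other triple goes to the left-over list."""
    table = {}
    leftover = []
    for subject, prop, obj in triples:
        if prop in columns:
            # New rows start with None (NULL) in every column.
            row = table.setdefault(subject, dict.fromkeys(columns))
            row[prop] = obj
        else:
            leftover.append((subject, prop, obj))
    return table, leftover


triples = [
    ("PersonID1", "name", "Mike Stonebraker"),
    ("PersonID1", "likes", "Things found in nature (streams, sequoias, auroras)"),
    ("PersonID1", "dislikes", "Elastic/Velcro/Anything 'One-size-fits-all'"),
    ("PersonID1", "authorOf", "Pub101"),
    ("PersonID1", "authorOf", "Pub102"),
    ("PersonID2", "name", "David DeWitt"),
    ("PersonID2", "dislikes", "Double blind reviewing"),
    ("PersonID2", "authorOf", "Pub102"),
    ("PersonID2", "authorOf", "Pub103"),
]

table, leftover = build_property_table(triples, ("name", "likes", "dislikes"))
print(table["PersonID2"])
print(leftover)
```

Reading name, likes, and dislikes for one person is now a single row lookup with no join, while the multi-valued authorOf property stays in triple form.
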
Property Tables

- Complex to design
  - If narrow: reduces nulls, increases unions/joins
  - If wide: reduces unions/joins, increases nulls
- Implemented in Jena and Oracle
  - But main representation of data is still triples

Table-Per-Property Approach

[Tables: the four one-table-per-property tables (name, likes, dislikes, authorOf) from the "Approach 3: One-table-per-property" slide, repeated.]

+ Nulls not stored
+ Easy to handle multi-valued attributes
+ Only need to read relevant properties
- Still need joins (but they are linear merge joins)

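The merge joins mentioned above are linear because each property table can be kept sorted by subject, so two tables are joined in one forward pass over each. A minimal sketch, assuming tables are Python lists of (subject, object) pairs already sorted by subject; the function name merge_join is hypothetical.

```python
def merge_join(left, right):
    """Join two (subject, object) lists on subject.

    Both inputs must be sorted by subject. Each list is scanned once;
    runs of equal subjects are paired so multi-valued properties work.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_subject, right_subject = left[i][0], right[j][0]
        if left_subject < right_subject:
            i += 1
        elif left_subject > right_subject:
            j += 1
        else:
            # Find the run of rows with this subject on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_subject:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_subject:
                j_end += 1
            for _, left_obj in left[i:i_end]:
                for _, right_obj in right[j:j_end]:
                    out.append((left_subject, left_obj, right_obj))
            i, j = i_end, j_end
    return out


name = [("PersonID1", "Mike Stonebraker"), ("PersonID2", "David DeWitt")]
author_of = [
    ("PersonID1", "Pub101"),
    ("PersonID1", "Pub102"),
    ("PersonID2", "Pub102"),
    ("PersonID2", "Pub103"),
]

joined = merge_join(name, author_of)
print(joined)
```

No hashing or sorting happens at query time; the cost is proportional to the sizes of the two tables plus the output.
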
Materialized Paths

[Figure: the DBFacebook RDF graph, repeated, with added "authorOf:title" edges that connect each person directly to the titles of the publications they authored.]

Accelerating Path Expressions

- Materialize Common Paths
  - Improved property table performance by 18-38%
  - Improved one-table-per-property performance by 75-84%
  - Use automatic database designer (e.g., C-Store/Vertica) to decide what to materialize

authorOf:title
Subject    Object
PersonID1  The Design of Postgres
PersonID1  Implementation Techniques for Main Memory Database Systems
PersonID2  Implementation Techniques for Main Memory Database Systems
PersonID2  GAMMA – A High Performance Dataflow Database Machine

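Materializing a path means computing the subject-object join once and storing the result as if it were another property table. The sketch below is an assumption-laden illustration: the function name materialize_path is hypothetical and tables are plain lists of (subject, object) pairs.

```python
def materialize_path(first, second):
    """Precompute the two-hop path first:second as (subject, object) rows.

    The object of each row in `first` is matched against the subject of
    rows in `second` (a subject-object join), done once ahead of time.
    """
    second_by_subject = {}
    for subject, obj in second:
        second_by_subject.setdefault(subject, []).append(obj)
    return [
        (subject, target)
        for subject, middle in first
        for target in second_by_subject.get(middle, [])
    ]


author_of = [
    ("PersonID1", "Pub101"),
    ("PersonID1", "Pub102"),
    ("PersonID2", "Pub102"),
    ("PersonID2", "Pub103"),
]
title = [
    ("Pub101", "The Design of Postgres"),
    ("Pub102", "Implementation Techniques for Main Memory Database Systems"),
    ("Pub103", "GAMMA – A High Performance Dataflow Database Machine"),
]

author_of_title = materialize_path(author_of, title)
for row in author_of_title:
    print(row)
```

A query that asks for the titles of a person's publications can then read the authorOf:title table directly instead of joining authorOf with title each time.
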
One-table-per-property → Column-Store

- Can think of one-table-per-property as vertically partitioning a super-wide property table
- Column-store is a natural storage layer to use for vertical partitioning
- Advantages:
  - Tuple headers stored separately
  - Column-oriented data compression
  - Do not necessarily have to store the subject column
  - Carefully optimized merge-join code

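The slide does not say which compression schemes are used. As one assumed example of column-oriented compression, run-length encoding works well on a subject column that is sorted, because equal subjects sit next to each other. The function name run_length_encode is hypothetical.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Subject column of the authorOf table, sorted by subject.
subjects = ["PersonID1", "PersonID1", "PersonID2", "PersonID2"]

runs = run_length_encode(subjects)
print(runs)
```

Storing each column separately is what makes this kind of scheme applicable: a column holds values of one type, often with long runs, whereas a row mixes unrelated attributes.
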
Library Benchmark

- Data
  - Real Library Data (50 million RDF triples)
  - Data acquired from a variety of diverse sources (some quite unstructured)
- Queries
  - Automatically generated from the Longwell RDF browser
- Details in paper …

Results
Conclusions and Future Work

- Experimented with storing RDF data using different schemas in an RDBMS (both row- and column-oriented)
- Future work: build a fully-functional RDF database
  - Extracts and loads RDF data from structured, semi-structured, and unstructured data sources
  - Translates SPARQL to queries over vertical schema
  - Performs reasoning inside the DB
  - Use with biology research
- Excited about this work? Then …

Come To Yale!