
SW-STORE: A VERTICALLY PARTITIONED DBMS FOR SEMANTIC WEB DATA MANAGEMENT
Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate
Hollenbach. 2009. The VLDB Journal.
Group 4
Surabhi Mithal 4282643
Nipun Garg 4282567
http://www-users.cs.umn.edu/~smithal/
OUTLINE
Introduction to Semantic Web
 Motivation
 Problem Statement
 Challenges
 Major Contributions
 Related Work
 Key Concepts
 Assumptions
 Validation Methodology
 Results
 Improvements

INTRODUCTION TO SEMANTIC WEB : AN EXAMPLE
A simplified bookstore data (dataset “A”)
ID                 Author    Title               Publisher    Year
ISBN 0006511409X   id_xyz    The Glass Palace    id_qpr       2000

ID        Name             Homepage
id_xyz    Ghosh, Amitav    http://www.amitavghosh.com

ID        Publisher’s name    City
id_qpr    Harper Collins      London
Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/
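EXAMPLE CONT : GRAPH REPRESENTATION
[Figure: dataset “A” as a graph. The book node http://…isbn/000651409X carries the title “The Glass Palace”, the year 2000, and the publisher Harper Collins (London). Its a:author edge leads to an author node with a:name “Ghosh, Amitav” and a:homepage http://www.amitavghosh.com.]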
ANOTHER BOOKSTORE DATA (DATASET “F”)
      A                     B                        C             D
1     ID                    Titre                    Traducteur    Original
2     ISBN 2020286682       Le Palais des Miroirs    $A12$         ISBN 0-00-6511409X

6     ID                    Auteur
7     ISBN 0-006511409-X    $A11$

10    Nom
11    Ghosh, Amitav
12    Besse, Christianne
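EXAMPLE CONT : GRAPH REPRESENTATION
[Figure: dataset “F” as a graph. The node http://…isbn/2020386682 has the title “Le palais des miroirs” and an f:traducteur edge to a node with f:nom “Besse, Christianne”. The node http://…isbn/000651409X has an f:auteur edge to a node with f:nom “Ghosh, Amitav”.]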
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the graphs of dataset “A” and dataset “F” shown side by side. Both contain a node with the URI http://…isbn/000651409X.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the same two graphs, with the node http://…isbn/000651409X that appears in both marked “SAME URI”.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the merged graph. The two graphs are joined at the shared node http://…isbn/000651409X, so the f:original edge from http://…isbn/2020386682 now reaches the node that carries the title “The Glass Palace”, a:author, a:name and a:homepage.]
User of data “F” can now ask queries like:
“give me the title of the original”
MOTIVATION
Integration and sharing of data across different
applications and organizations.
 The Semantic Web logical data model is called the
“Resource Description Framework” (RDF).
 Semantic Web data raises scalability and performance
issues because of the nature of the data. Current data
management solutions for RDF scale poorly.

PROBLEM STATEMENT



Input : RDF data in the form of triples
<subject, property, object>
e.g. <The Glass Palace, hasAuthor, Amitav Ghosh>
Output : Efficient storage system for RDF data.
Objective : Improve query performance for complex real-world
queries.
CHALLENGES
Example query: find all authors of books whose title
contains the word “Transaction”.
On a triple store (a single subject, property, object table)
this needs a 5-way self-join!
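The self-join can be made concrete with a small sketch. It assumes a single three-column triples table in an in-memory SQLite database; the table name, the column names (subj, prop, obj), the property labels (type, title, hasAuthor, name) and the sample rows are all hypothetical and are not taken from the paper. Each of the five triple patterns in the query needs its own reference to the same table.

```python
import sqlite3

# Hypothetical triples; all names and property labels are illustrative only.
TRIPLES = [
    ("book1", "type", "Book"),
    ("book1", "title", "Transaction Basics"),
    ("book1", "hasAuthor", "person1"),
    ("book2", "type", "Book"),
    ("book2", "title", "The Glass Palace"),
    ("book2", "hasAuthor", "person2"),
    ("person1", "type", "Author"),
    ("person1", "name", "Example, Alice"),
    ("person2", "type", "Author"),
    ("person2", "name", "Ghosh, Amitav"),
]

# Five references to the same table: one per triple pattern in the query.
FIVE_WAY_SELF_JOIN = """
SELECT n.obj
FROM triples AS t
JOIN triples AS b ON b.subj = t.subj
JOIN triples AS h ON h.subj = t.subj
JOIN triples AS a ON a.subj = h.obj
JOIN triples AS n ON n.subj = h.obj
WHERE t.prop = 'title' AND t.obj LIKE '%Transaction%'
  AND b.prop = 'type' AND b.obj = 'Book'
  AND h.prop = 'hasAuthor'
  AND a.prop = 'type' AND a.obj = 'Author'
  AND n.prop = 'name'
ORDER BY n.obj
"""


def authors_of_transaction_books(rows):
    """Load the triples into an in-memory table and run the self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    names = [row[0] for row in conn.execute(FIVE_WAY_SELF_JOIN)]
    conn.close()
    return names
```

Calling authors_of_transaction_books(TRIPLES) returns the names of the authors whose books match. Every extra condition in such a query adds another pass over the one large table, which is the cost the slide points at.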
MAJOR CONTRIBUTIONS AND NOVELTY
Introduction of vertical partitioning of RDF data,
combined with a column-oriented database, to improve
performance and increase simplicity.
 A performance evaluation of the new and
existing techniques on a real-world example.
 SW-Store, a new column-oriented database based on
the above approach, is proposed.

RELATED WORK – PROPERTY TABLES
HP LABORATORIES - JENA

Property Clustered Tables and Property Class Tables

Approach 1 (property clustered tables): a clustering algorithm groups properties that tend to be defined together for the same subjects into one table.

Approach 2 (property class tables): creates clusters based on the subject’s type.

Limitations:
 Accuracy of the clustering algorithms.
 NULLs in the data.
 Multivalued attributes.
SAMPLE DATABASE
Too many NULLs
Source: SW-Store: a vertically partitioned DBMS for Semantic Web data management
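A minimal sketch of why a wide property table fills up with NULLs, and why multivalued attributes do not fit. The function names, the property list and the sample triples are hypothetical; None stands in for SQL NULL, and the single wide table is a simplification of the clustered tables described above.

```python
# Hypothetical sample data; property names and values are illustrative only.
TRIPLES = [
    ("ID1", "type", "BookType"),
    ("ID1", "title", "XYZ"),
    ("ID1", "author", "Author One"),
    ("ID1", "author", "Author Two"),   # second value for the same property
    ("ID2", "type", "CDType"),
    ("ID2", "title", "ABC"),
    ("ID2", "artist", "Artist One"),
    ("ID3", "type", "BookType"),
    ("ID3", "title", "MNO"),
    ("ID3", "language", "English"),
]
PROPERTIES = ["type", "title", "author", "artist", "language"]


def build_property_table(triples, properties):
    """One row per subject, one column per property; gaps stay None (NULL).

    Values that cannot be placed (a second value for a single-valued
    column, or an unknown property) are returned separately.
    """
    table = {}
    leftover = []
    for subj, prop, obj in triples:
        row = table.setdefault(subj, dict.fromkeys(properties))
        if prop in row and row[prop] is None:
            row[prop] = obj
        else:
            leftover.append((subj, prop, obj))
    return table, leftover


def null_fraction(table):
    """Share of cells in the property table that are None."""
    cells = [value for row in table.values() for value in row.values()]
    return sum(value is None for value in cells) / len(cells)
```

With this sample data, 6 of the 15 cells stay empty, and the second author of ID1 has no column to go into, which mirrors the NULL and multivalued-attribute limitations listed for property tables.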
KEY CONCEPTS:
VERTICAL PARTITIONING AND COLUMN ORIENTED STORE


RDF data is vertically partitioned, and the partitions are then
stored in a column-oriented database.
 Vertical partitioning: one two-column (subject, object) table for each property. Advantages:
 Effective handling of multivalued attributes.
 Elimination of NULLs.
 Fewer unions are needed than with property tables.
 Column-oriented storage. Advantages:
 No bandwidth is wasted, because projections happen before the data is pulled into main
memory.
 Record headers are stored in separate columns, which reduces tuple width and
allows a different compression technique for each column.
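The partitioning step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names, property labels and sample triples are hypothetical, and plain Python lists stand in for database tables. Each property gets its own list of (subject, object) pairs, kept sorted by subject.

```python
from collections import defaultdict

# Hypothetical sample data; property names and values are illustrative only.
TRIPLES = [
    ("ID1", "type", "BookType"),
    ("ID1", "title", "XYZ"),
    ("ID1", "author", "Author Two"),
    ("ID1", "author", "Author One"),
    ("ID2", "type", "CDType"),
    ("ID2", "title", "ABC"),
    ("ID2", "artist", "Artist One"),
    ("ID3", "type", "BookType"),
    ("ID3", "title", "MNO"),
    ("ID3", "language", "English"),
]


def vertically_partition(triples):
    """Return {property: [(subject, object), ...]}, each table sorted by subject."""
    tables = defaultdict(list)
    for subj, prop, obj in triples:
        tables[prop].append((subj, obj))
    return {prop: sorted(rows) for prop, rows in tables.items()}


def authors_of_titles_containing(tables, word):
    """Join the 'title' and 'author' tables on subject; other tables are never read."""
    matching = {subj for subj, title in tables.get("title", []) if word in title}
    return sorted(obj for subj, obj in tables.get("author", []) if subj in matching)
```

A subject with two authors simply contributes two rows to the author table, and a subject without a property contributes no row at all, so neither NULLs nor overflow columns are needed. The query touches only the two tables it names.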
KEY CONCEPTS: SW-STORE

SW-Store is a column-oriented DBMS optimized for storing RDF data.

Single column table for subjects.

Representing Sparse data

Overflow tables
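One way to picture the role of an overflow table is the toy model below. It is a sketch under assumptions, not SW-Store's actual design: the class name, the batch_size threshold and the merge trigger are hypothetical. New triples land in an unsorted overflow list, a batch step moves them into the sorted per-property tables, and lookups consult both.

```python
import bisect


class TinyStore:
    """Toy model: sorted per-property tables plus an unsorted overflow list."""

    def __init__(self, batch_size=3):
        self.partitions = {}   # property -> sorted list of (subject, object)
        self.overflow = []     # pending (subject, property, object) triples
        self.batch_size = batch_size  # hypothetical merge threshold
        self.merges = 0        # number of batch merges performed so far

    def insert(self, subj, prop, obj):
        """Append to the overflow list; merge once the batch is full."""
        self.overflow.append((subj, prop, obj))
        if len(self.overflow) >= self.batch_size:
            self.merge()

    def merge(self):
        """Move every pending triple into its sorted property table."""
        for subj, prop, obj in self.overflow:
            bisect.insort(self.partitions.setdefault(prop, []), (subj, obj))
        self.overflow.clear()
        self.merges += 1

    def lookup(self, prop):
        """Read one property from the merged table and the overflow list."""
        merged = list(self.partitions.get(prop, []))
        pending = [(s, o) for s, p, o in self.overflow if p == prop]
        return sorted(merged + pending)
```

In this model a smaller batch_size, or a faster insert rate, means merge runs more often, and every lookup pays for scanning whatever is still pending.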
ASSUMPTIONS


Postgres is assumed to be the best available row-oriented
RDBMS for the comparison because it handles NULLs effectively.
Queries that do not restrict on property values are assumed to be
very rare in RDF applications.

Only a moderate number of inserts/updates on the RDF store.

Critique of the assumption of limited inserts/updates:

If the overflow tables fill rapidly, the batch operation that updates the
column-oriented store runs more often, degrading overall performance.
VALIDATION METHODOLOGY


Barton Libraries dataset provided by the Simile Project at MIT
(http://simile.mit.edu/rdf-test-data/barton).
The benchmark is a set of 7 queries based on a browsing
session of Longwell, a UI built by the Simile group for querying the
library dataset. These queries are executed on:
 Triple store (a single subject, property, object table on Postgres, with no
further improvements).
 Property tables (on Postgres).
 Vertically partitioned data in a row-oriented store (Postgres).
 Vertically partitioned data in a column-oriented store (C-Store).
VALIDATION METHODOLOGY


Strengths :
 Real-world data and query scenarios.
 Comparison of all the existing techniques with the proposed technique.
Weaknesses :
 The benchmark avoids queries with unrestricted properties, which are
particularly problematic for vertically partitioned stores.
 Results for property tables depend on the accuracy of the clustering.
 Performance may differ when using different underlying
databases.
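The unrestricted-property weakness can be illustrated with a short sketch. The table contents and function names are hypothetical, and Python dictionaries stand in for the per-property tables. A lookup that names a property reads one table, while a lookup that leaves the property open has to visit every table and union the results.

```python
# Hypothetical vertically partitioned tables: property -> [(subject, object)].
TABLES = {
    "type": [("ID1", "BookType"), ("ID2", "CDType")],
    "title": [("ID1", "XYZ"), ("ID2", "ABC")],
    "author": [("ID1", "Author One")],
    "artist": [("ID2", "Artist One")],
    "language": [("ID3", "English")],
}


def objects_for(tables, subject, prop):
    """Property-restricted lookup: reads exactly one table."""
    values = [obj for subj, obj in tables.get(prop, []) if subj == subject]
    return values, 1


def describe_subject(tables, subject):
    """Unrestricted lookup: every property table is visited and unioned."""
    touched = 0
    result = []
    for prop, rows in tables.items():
        touched += 1
        result.extend((prop, obj) for subj, obj in rows if subj == subject)
    return sorted(result), touched
```

With hundreds of properties, the second kind of query grows into a union over hundreds of tables, so a benchmark that rarely issues it will favor vertical partitioning.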
RESULTS

From the results, the proposed storage scheme
outperforms the existing methods in terms of query time.
IMPROVEMENTS – SPATIAL PERSPECTIVE
Schema design - Queries run against the vertically partitioned tables as
well as the overflow tables. Because spatial data is heavy, a spatial
index such as an R*-tree or a grid should be used to make these
queries faster.
 Restrictive nature - Spatial queries are not restricted to specific
“properties”, which is an important assumption on the authors' part.
 E.g. Landmarks
 Tables should be partitioned in a better way than simply
one property per table,
e.g. by grouping similar properties together based on domain knowledge.