SW-STORE: A VERTICALLY PARTITIONED
DBMS FOR SEMANTIC WEB DATA
MANAGEMENT
Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate
Hollenbach. 2009. The VLDB Journal.
Group 4
Surabhi Mithal 4282643
Nipun Garg 4282567
http://www-users.cs.umn.edu/~smithal/
OUTLINE
Introduction to Semantic Web
Motivation
Problem Statement
Challenges
Major Contributions
Related Work
Key Concepts
Assumptions
Validation Methodology
Results
Improvements
INTRODUCTION TO SEMANTIC WEB : AN EXAMPLE
Simplified bookstore data (dataset “A”)

ID                 Author   Title              Publisher   Year
ISBN 0006511409X   id_xyz   The Glass Palace   id_qpr      2000

ID       Name            Homepage
id_xyz   Ghosh, Amitav   http://www.amitavghosh.com

ID       Publisher’s name   City
id_qpr   Harper Collins     London

Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/
EXAMPLE CONT. : GRAPH REPRESENTATION
[Figure: RDF graph of dataset “A”. Nodes: http://…isbn/000651409X; The Glass Palace; 2000; London; Harper Collins; Ghosh, Amitav; http://www.amitavghosh.com. Edge labels: a:author, a:name, a:homepage.]
ANOTHER BOOKSTORE DATA (DATASET “F”)
Spreadsheet layout (rows 3-5 and 8-9 are empty):

     A                    B                       C            D
1    ID                   Titre                   Traducteur   Original
2    ISBN 2020286682      Le Palais des Miroirs   $A12$        ISBN 0-00-6511409X
6    ID                   Auteur
7    ISBN 0-006511409-X   $A11$
10   Nom
11   Ghosh, Amitav
12   Besse, Christianne
EXAMPLE CONT. : GRAPH REPRESENTATION
[Figure: RDF graph of dataset “F”. Nodes: http://…isbn/000651409X; http://…isbn/2020386682; Le palais des miroirs; Ghosh, Amitav; Besse, Christianne. Edge labels: f:auteur, f:traducteur, f:nom.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the graphs of datasets “A” and “F” shown side by side. Dataset “A” edge labels: a:author, a:name, a:homepage. Dataset “F” edge labels: f:auteur, f:traducteur, f:nom. The node http://…isbn/000651409X appears in both graphs.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the graphs of datasets “A” and “F” side by side, with the annotation “SAME URI” marking the node http://…isbn/000651409X, which appears in both.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the merged graph, joined at the shared node http://…isbn/000651409X. Edge labels: a:author, a:name, a:homepage, f:original, f:auteur, f:traducteur, f:nom.]
A user of dataset “F” can now ask queries like:
“give me the title of the original”
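The integration idea from the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tutorial's implementation: each dataset is a set of (subject, property, object) triples, merging is a plain set union, and two nodes with the same URI become a single node. The property names a:title and f:titre, the blank-node labels such as _:author_a, and the helper functions are hypothetical names introduced for this example. The elided URIs are kept exactly as they appear on the slides.

```python
# Triples are (subject, property, object) tuples.
# The URIs are elided on the slides and are kept elided here.
BOOK_EN = "http://…isbn/000651409X"
BOOK_FR = "http://…isbn/2020386682"

dataset_a = {
    (BOOK_EN, "a:title", "The Glass Palace"),  # a:title is an assumed property name
    (BOOK_EN, "a:author", "_:author_a"),       # _:author_a is a hypothetical node label
    ("_:author_a", "a:name", "Ghosh, Amitav"),
    ("_:author_a", "a:homepage", "http://www.amitavghosh.com"),
}

dataset_f = {
    (BOOK_FR, "f:titre", "Le palais des miroirs"),  # f:titre is an assumed property name
    (BOOK_FR, "f:original", BOOK_EN),
    (BOOK_FR, "f:traducteur", "_:translator_f"),
    (BOOK_EN, "f:auteur", "_:author_f"),
    ("_:author_f", "f:nom", "Ghosh, Amitav"),
    ("_:translator_f", "f:nom", "Besse, Christianne"),
}

# Merging is a set union: triples about the same URI now describe one node.
merged = dataset_a | dataset_f


def objects_of(triples, subject, prop):
    """Return all objects o such that (subject, prop, o) is in triples."""
    return sorted(o for s, p, o in triples if s == subject and p == prop)


def title_of_original(triples, book):
    """Follow f:original from `book`, then read a:title of the target node."""
    titles = []
    for original in objects_of(triples, book, "f:original"):
        titles.extend(objects_of(triples, original, "a:title"))
    return titles
```

Calling `title_of_original(merged, BOOK_FR)` finds the English title only in the merged graph; on `dataset_f` alone the same call returns an empty list, because the title lives in dataset “A”.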
MOTIVATION
Integration and sharing of data across different
applications and organizations.
The Semantic Web logical data model is called the
“Resource Description Framework” (RDF).
The Semantic Web concept has scalability and
performance issues due to the nature of the data.
Current data management solutions for RDF scale poorly.
PROBLEM STATEMENT
Input : RDF data in the form of triples
<subject,property,object>
e.g. The Glass Palace hasAuthor Amitav Ghosh
Output : Efficient storage system for RDF data.
Objective : Improve query performance for complex real-world
queries.
CHALLENGES
Example query: find all authors of
books whose title contains
the word “Transaction”.
On a single triples table this needs a 5-way self-join!
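To make the self-join cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name triples, the property names (type, title, hasAuthor, isNamed), and the sample rows are assumptions made for illustration; they are not taken from the slides. The point is only that every property the query touches adds one more copy of the same table to the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical sample data.
        ("book1", "type", "book"),
        ("book1", "title", "Transaction Processing"),
        ("book1", "hasAuthor", "person1"),
        ("book2", "type", "book"),
        ("book2", "title", "The Glass Palace"),
        ("book2", "hasAuthor", "person2"),
        ("person1", "type", "author"),
        ("person1", "isNamed", "Author One"),
        ("person2", "type", "author"),
        ("person2", "isNamed", "Ghosh, Amitav"),
    ],
)

# Five aliases of the same table: one per triple pattern in the query.
FIVE_WAY_SELF_JOIN = """
SELECT p5.obj
FROM triples AS p1, triples AS p2, triples AS p3, triples AS p4, triples AS p5
WHERE p1.prop = 'title'     AND p1.obj LIKE '%Transaction%'
  AND p2.prop = 'type'      AND p2.obj = 'book'   AND p2.subj = p1.subj
  AND p4.prop = 'hasAuthor' AND p4.subj = p1.subj
  AND p3.prop = 'type'      AND p3.obj = 'author' AND p3.subj = p4.obj
  AND p5.prop = 'isNamed'   AND p5.subj = p4.obj
"""

authors = [row[0] for row in conn.execute(FIVE_WAY_SELF_JOIN)]
```

With the sample rows above, only the author of the first book is returned. On a large triples table each alias scans or probes the same big relation, which is the scalability problem the slide points at.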
MAJOR CONTRIBUTIONS AND NOVELTY
Introduction of a new concept of vertically
partitioning RDF data and use of a column-oriented
database to improve performance and increase simplicity.
A performance evaluation of the new and
existing techniques on a real-world example.
A new column-oriented database, SW-Store, is
proposed, based on the above approach.
RELATED WORK – PROPERTY TABLES
HP LABORATORIES - JENA
Property-Clustered Tables and Property-Class Tables
Approach 1 (property-clustered tables): a data clustering approach.
Approach 2 (property-class tables): creates clusters based on the subject’s type.
Limitations:
Accuracy of clustering algorithms.
NULLs in data.
Multivalued attributes.
SAMPLE DATABASE
Too many NULLs
Source: - SW-Store: a vertically partitioned DBMS for Semantic Web data management
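The NULL and multivalued-attribute limitations listed for property tables can be illustrated with a small Python sketch. The sample triples, the property list, and the helper names below are hypothetical; the code does not reproduce Jena's clustering, it only shows what happens when heterogeneous subjects are forced into one wide table.

```python
def build_property_table(triples, properties):
    """Return (table, leftover): one row per subject, one column per property.

    Cells with no value stay None (a NULL). A second value for an already
    filled cell cannot be stored in the row, so it is returned in `leftover`.
    """
    table = {}
    leftover = []
    for subj, prop, obj in triples:
        row = table.setdefault(subj, {p: None for p in properties})
        if prop in row and row[prop] is None:
            row[prop] = obj
        else:
            leftover.append((subj, prop, obj))
    return table, leftover


def count_nulls(table):
    """Count the empty cells across all rows of the wide table."""
    return sum(value is None for row in table.values() for value in row.values())


# Hypothetical sample data: two books and one CD with two artists.
triples = [
    ("book1", "type", "Book"),
    ("book1", "title", "The Glass Palace"),
    ("book1", "author", "Ghosh, Amitav"),
    ("book2", "type", "Book"),
    ("book2", "title", "Le palais des miroirs"),
    ("book2", "author", "Ghosh, Amitav"),
    ("book2", "translator", "Besse, Christianne"),
    ("cd1", "type", "CD"),
    ("cd1", "title", "Untitled"),
    ("cd1", "artist", "Artist One"),
    ("cd1", "artist", "Artist Two"),
]
properties = ["type", "title", "author", "translator", "artist"]

table, leftover = build_property_table(triples, properties)
null_cells = count_nulls(table)
```

Here 5 of the 15 cells are NULL, and the second artist of cd1 does not fit into the row at all, so it has to be handled outside the table.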
KEY CONCEPTS:
VERTICAL PARTITIONING AND COLUMN ORIENTED STORE
RDF data is vertically partitioned, and the partitions are then stored
in a column-oriented database.
One two-column (subject, object) table for each property. Advantages:
Effective handling of multivalued attributes.
Elimination of NULLs.
Fewer unions are needed than with property tables.
Column-oriented storage. Advantages:
No bandwidth is wasted, as projections on the data happen before it is pulled into main
memory.
The record header is stored in separate columns, reducing the tuple width and
letting us choose a different compression technique for each column.
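The per-property layout described above can be sketched as follows. This is a minimal in-memory illustration, assuming hypothetical sample triples and helper names; it uses a plain hash join for brevity and does not model column storage or compression.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Group triples into one sorted (subject, object) table per property."""
    tables = defaultdict(list)
    for subj, prop, obj in triples:
        tables[prop].append((subj, obj))
    return {prop: sorted(rows) for prop, rows in tables.items()}


def join_on_subject(left, right):
    """Hash-join two (subject, object) tables on subject."""
    by_subject = defaultdict(list)
    for subj, obj in right:
        by_subject[subj].append(obj)
    return [
        (subj, left_obj, right_obj)
        for subj, left_obj in left
        for right_obj in by_subject.get(subj, [])
    ]


# Hypothetical sample data: two books and one CD with two artists.
triples = [
    ("book1", "title", "The Glass Palace"),
    ("book1", "author", "Ghosh, Amitav"),
    ("book2", "title", "Le palais des miroirs"),
    ("book2", "author", "Ghosh, Amitav"),
    ("book2", "translator", "Besse, Christianne"),
    ("cd1", "title", "Untitled"),
    ("cd1", "artist", "Artist One"),
    ("cd1", "artist", "Artist Two"),
]

tables = vertically_partition(triples)

# A query on title and author touches only those two tables.
titles_with_authors = join_on_subject(tables["title"], tables["author"])
```

Every triple becomes exactly one row, so no NULLs are stored, and the two artists of cd1 are simply two rows in the artist table.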
KEY CONCEPTS: SW-STORE
SW-Store is a column-oriented DBMS optimized for storing RDF data.
Single column table for subjects.
Representing Sparse data
Overflow tables
ASSUMPTIONS
Postgres is assumed to be the best available choice for a row-oriented
RDBMS because of its effective handling of NULLs.
Queries that do not restrict on property values are very rare for
RDF applications.
Only a moderate number of inserts/updates on the RDF store.
Critique of the assumption of limited inserts/updates:
If the overflow tables fill rapidly, the batch operation that updates the
column-oriented store will run more often, degrading overall
performance.
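The critique above can be illustrated with a toy model of an overflow buffer that is merged into the main tables in batches. This is a sketch under assumptions, not SW-Store's actual mechanism: the class name PartitionedStore, the merge_threshold parameter, and the merge policy (merge as soon as the buffer reaches the threshold) are all hypothetical.

```python
from collections import defaultdict


class PartitionedStore:
    """Toy store: sorted per-property tables plus an unsorted overflow buffer."""

    def __init__(self, merge_threshold):
        self.merge_threshold = merge_threshold
        self.tables = defaultdict(list)  # property -> sorted (subject, object) rows
        self.overflow = []               # recent (subject, property, object) inserts
        self.merge_count = 0             # number of batch merges performed

    def insert(self, subj, prop, obj):
        """Buffer the insert; trigger a batch merge when the buffer is full."""
        self.overflow.append((subj, prop, obj))
        if len(self.overflow) >= self.merge_threshold:
            self.merge_overflow()

    def merge_overflow(self):
        """Move buffered triples into the main tables and re-sort touched tables."""
        touched = set()
        for subj, prop, obj in self.overflow:
            self.tables[prop].append((subj, obj))
            touched.add(prop)
        for prop in touched:
            self.tables[prop].sort()
        self.overflow.clear()
        self.merge_count += 1

    def lookup(self, prop):
        """Queries must consult both the main table and the overflow buffer."""
        pending = [(s, o) for s, p, o in self.overflow if p == prop]
        return sorted(self.tables.get(prop, []) + pending)


store = PartitionedStore(merge_threshold=3)
for i in range(7):
    store.insert(f"book{i}", "title", f"Title {i}")
```

With a threshold of 3, seven inserts trigger two merges and leave one triple in the buffer. A higher insert rate, or a smaller buffer, means more frequent merges, which is the cost the critique refers to.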
VALIDATION METHODOLOGY
Barton Libraries dataset provided by the Simile Project at MIT
(http://simile.mit.edu/rdf-test-data/barton).
The benchmark is a set of 7 queries based on a browsing
session of Longwell, a UI built by the Simile group for querying the
library dataset. These queries are executed on:
Triple store (a single subject, property, object table on Postgres, with no
improvements).
Property tables (on Postgres).
Vertically partitioned data in a row-oriented store (Postgres).
Vertically partitioned data in a column-oriented store (C-Store).
VALIDATION METHODOLOGY
Strengths :
Real world data and query scenarios.
Comparison of all the existing techniques with the proposed technique.
Weaknesses : Avoids queries involving the unrestricted-property problem, which
is particularly pronounced in vertically partitioned schemes.
Accuracy of clustering for property tables.
Performance may differ when using different underlying
databases.
RESULTS
From the results, it is clear that the proposed storage scheme
outperforms the existing methods in terms of query time.
IMPROVEMENTS – SPATIAL PERSPECTIVE
Schema design - Queries run against the vertically partitioned tables as
well as the overflow tables. Because spatial data is heavy, a spatial
index such as an R*-tree or a grid should be used to make these
queries faster.
Restrictive nature - Spatial queries are not restricted to specific
“properties”, yet restricted properties are an important assumption of the paper.
E.g. Landmarks
Tables should be partitioned in a better way than simply
storing one property per table,
e.g. grouping similar properties together based on domain knowledge.
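As an illustration of the grid index suggested under schema design, here is a minimal uniform-grid sketch in Python. The class name GridIndex, the cell_size parameter, and the landmark coordinates are hypothetical, and nothing here is part of SW-Store; the sketch only shows how a range query can skip cells that cannot contain a match.

```python
from collections import defaultdict


class GridIndex:
    """Uniform grid over 2D points; each cell lists the subjects located in it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cell_x, cell_y) -> [(subject, x, y)]

    def _cell(self, x, y):
        """Map a coordinate to the integer index of the cell that contains it."""
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, subject, x, y):
        self.cells[self._cell(x, y)].append((subject, x, y))

    def range_query(self, x_min, y_min, x_max, y_max):
        """Return subjects inside the rectangle, visiting only overlapping cells."""
        cx_min, cy_min = self._cell(x_min, y_min)
        cx_max, cy_max = self._cell(x_max, y_max)
        hits = []
        for cx in range(cx_min, cx_max + 1):
            for cy in range(cy_min, cy_max + 1):
                for subject, x, y in self.cells.get((cx, cy), []):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.append(subject)
        return sorted(hits)


# Hypothetical landmark locations.
index = GridIndex(cell_size=10.0)
index.insert("landmark1", 5.0, 5.0)
index.insert("landmark2", 15.0, 5.0)
index.insert("landmark3", 95.0, 95.0)

nearby = index.range_query(0.0, 0.0, 20.0, 10.0)
```

The query rectangle overlaps only six cells, so the far-away landmark3 is never examined. An R*-tree would serve the same purpose with cells that adapt to the data.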