A3-PPT - CSE User Home Pages

SW-STORE: A VERTICALLY PARTITIONED DBMS FOR SEMANTIC WEB DATA MANAGEMENT
Surabhi Mithal
Nipun Garg
INTRODUCTION TO SEMANTIC WEB : AN EXAMPLE
A simplified bookstore data (dataset “A”)
ID                 Author   Title              Publisher   Year
ISBN 0006511409X   id_xyz   The Glass Palace   id_qpr      2000

ID       Name            Homepage
id_xyz   Ghosh, Amitav   http://www.amitavghosh.com

ID       Publisher’s name   City
id_qpr   Harper Collins     London
Source : http://www.w3.org/People/Ivan/CorePresentations/SWTutorial/
EXAMPLE CONT : GRAPH REPRESENTATION
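[Figure: RDF graph of dataset “A”. The graph contains the node http://…isbn/000651409X and the values The Glass Palace, 2000, London and Harper Collins. An a:author edge leads to a node with a:name “Ghosh, Amitav” and a:homepage http://www.amitavghosh.com.]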
ANOTHER BOOKSTORE DATA (DATASET “F”)
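Spreadsheet with columns A to D and rows 1 to 12 (unlisted cells are empty):
Row 1:   A = ID                    B = Titre                   C = Traducteur   D = Original
Row 2:   A = ISBN 2020286682       B = Le Palais des Miroirs   C = $A12$        D = ISBN 0-00-6511409X
Row 6:   A = ID                    B = Auteur
Row 7:   A = ISBN 0-006511409-X    B = $A11$
Row 10:  A = Nom
Row 11:  A = Ghosh, Amitav
Row 12:  A = Besse, Christianne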
EXAMPLE CONT : GRAPH REPRESENTATION
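[Figure: RDF graph of dataset “F”. The graph contains the nodes http://…isbn/2020386682 (Le palais des miroirs) and http://…isbn/000651409X. An f:auteur edge leads to a node with f:nom “Ghosh, Amitav”, and an f:traducteur edge leads to a node with f:nom “Besse, Christianne”.]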
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
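[Figure: the graphs of datasets “A” and “F” side by side. Dataset “A”: http://…isbn/000651409X with The Glass Palace, 2000, London and Harper Collins, plus an a:author node with a:name “Ghosh, Amitav” and a:homepage http://www.amitavghosh.com. Dataset “F”: http://…isbn/2020386682 (Le palais des miroirs) and http://…isbn/000651409X, with an f:auteur node (f:nom “Ghosh, Amitav”) and an f:traducteur node (f:nom “Besse, Christianne”).]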
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the same two graphs side by side. The node http://…isbn/000651409X, which appears in both datasets, is marked “SAME URI”.]
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the merged graph. The two http://…isbn/000651409X nodes are now a single node, so the f:original edge from http://…isbn/2020386682 (Le palais des miroirs) leads to the node that also carries the dataset “A” data: The Glass Palace, 2000, London, Harper Collins, a:author, a:name, a:homepage. The f:auteur and f:traducteur edges and their f:nom values are unchanged.]
User of data “F” can now ask queries like:
“give me the title of the original”
DATA INTEGRATION ACROSS THE TWO DATASETS : SEMANTIC WEB
[Figure: the merged graph, extended with two r:type edges that point to http://…foaf/Person. The rest of the graph (f:original, a:author, f:auteur, f:traducteur, a:name, f:nom, a:homepage) is as on the previous slide.]
Even richer queries can be answered: “give me the home page of the original’s ‘auteur’”
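The merge step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the URIs are placeholders, the blank-node names are invented, and the property names a:title and f:titre do not appear on the slides (only a:author, a:name, a:homepage, f:auteur, f:nom, f:traducteur and f:original do).

```python
# Each graph is a set of (subject, property, object) triples.
BOOK_EN = "http://example.org/isbn/000651409X"   # placeholder URI (assumption)
BOOK_FR = "http://example.org/isbn/2020386682"   # placeholder URI (assumption)

dataset_a = {
    (BOOK_EN, "a:title", "The Glass Palace"),    # a:title is an assumed name
    (BOOK_EN, "a:author", "_:author_a"),
    ("_:author_a", "a:name", "Ghosh, Amitav"),
    ("_:author_a", "a:homepage", "http://www.amitavghosh.com"),
}
dataset_f = {
    (BOOK_FR, "f:titre", "Le palais des miroirs"),  # f:titre is an assumed name
    (BOOK_FR, "f:original", BOOK_EN),
    (BOOK_EN, "f:auteur", "_:author_f"),
    ("_:author_f", "f:nom", "Ghosh, Amitav"),
}

# Because both datasets use the same URI for the English book,
# merging is just a set union: no schema mapping is needed.
merged = dataset_a | dataset_f


def objects(graph, subject, prop):
    """Return all objects of triples matching (subject, prop, ?)."""
    return sorted(o for s, p, o in graph if s == subject and p == prop)


def title_of_original(graph, book):
    """Follow f:original, then read a:title of the book it points to."""
    return [
        title
        for original in objects(graph, book, "f:original")
        for title in objects(graph, original, "a:title")
    ]


print(title_of_original(dataset_f, BOOK_FR))  # dataset "F" alone has no a:title
print(title_of_original(merged, BOOK_FR))     # answered only after the merge
```

The query crosses from an “F” property to an “A” property, which is why it can only be answered on the merged graph.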
PROBLEM STATEMENT
• The Semantic Web faces scalability and performance problems.
• Current storage schemes for RDF perform poorly because of the nature and size of the data.
• RDF data takes the form of triples <subject, property, object>. The triples are stored in a single relational table. This table becomes very large, and queries over it need many self-joins, which leads to poor performance.
• Traditional row-oriented databases do not handle this kind of data well, so a new storage mechanism is needed.
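The self-join problem can be seen on a toy triple table. The sketch below uses Python's built-in sqlite3 module purely for illustration (the paper's experiments use Postgres and C-Store); the table name, column names and sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("book1", "type", "Text"),
        ("book1", "title", "The Glass Palace"),
        ("book1", "language", "en"),
        ("book2", "type", "Text"),
        ("book2", "title", "Le palais des miroirs"),
        ("book2", "language", "fr"),
        ("cd1", "type", "Audio"),
        ("cd1", "title", "Some Recording"),
    ],
)

# "Titles of English-language Text resources": every extra property
# mentioned in the query adds one more self-join on the same big table.
SQL = """
SELECT t2.obj
FROM triples AS t1
JOIN triples AS t2 ON t2.subj = t1.subj
JOIN triples AS t3 ON t3.subj = t1.subj
WHERE t1.prop = 'type'     AND t1.obj = 'Text'
  AND t2.prop = 'title'
  AND t3.prop = 'language' AND t3.obj = 'en'
"""
titles = [row[0] for row in conn.execute(SQL)]
print(titles)
```

A query that touches three properties already joins the triples table with itself twice; on a table with tens of millions of rows this is where the cost comes from.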
MAJOR CONTRIBUTIONS
• An analysis of the current mechanisms for storing RDF data in databases.
• A new approach, vertical partitioning of RDF data, to improve query performance, together with a way to combine it with a column-oriented database for better performance and greater simplicity.
• A performance evaluation of the new and existing techniques on a real-world dataset, with an explanation of the results. The results show that the new technique outperforms the existing mechanisms by a significant margin.
• A new column-oriented database, SW-Store, designed to store vertically partitioned RDF data effectively. Its basic structure is explained with some examples.
KEY CONCEPTS – PROPERTY TABLES
Property Clustered Tables
• A data clustering algorithm is used to find related properties, which are then stored together in one table.
• Limitation: a property can occur in at most one property table.

Property Class Tables
• Clusters are formed on the subject’s type property, and one property can occur in multiple property tables.

Problems with property tables
• NULLs in the data.
• Multivalued attributes.
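Both problems show up as soon as triples are flattened into one wide table. The following is a minimal sketch, not the clustering procedure from the paper: the column set is fixed by hand, and the sample data and the `overflow` list are illustrative assumptions.

```python
triples = [
    ("book1", "type", "Text"),
    ("book1", "title", "The Glass Palace"),
    ("book1", "keyword", "Burma"),
    ("book2", "type", "Text"),
    ("book2", "title", "Le palais des miroirs"),
    ("book2", "keyword", "Burma"),
    ("book2", "keyword", "history"),   # second value for the same property
    ("cd1", "type", "Audio"),          # no title, no keyword
]
COLUMNS = ["type", "title", "keyword"]   # hand-picked, stands in for a cluster


def build_property_table(triples, columns):
    """One row per subject, one column per property, one value per cell."""
    table = {}
    overflow = []   # triples that do not fit into the wide table
    for subj, prop, obj in triples:
        row = table.setdefault(subj, {c: None for c in columns})
        if prop in row and row[prop] is None:
            row[prop] = obj
        else:
            # Unknown property, or the cell is already taken (multivalued).
            overflow.append((subj, prop, obj))
    return table, overflow


table, overflow = build_property_table(triples, COLUMNS)
null_cells = sum(v is None for row in table.values() for v in row.values())

print(table["cd1"])   # two of three cells are NULL (None)
print(null_cells)
print(overflow)       # the multivalued keyword had to go somewhere else
```

Subjects that lack a property leave NULL cells behind, and a second value for the same property cannot be stored in the row at all.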
SAMPLE DATABASE
Source: SW-Store: a vertically partitioned DBMS for Semantic Web data management
KEY CONCEPTS: VERTICAL PARTITIONING
• The authors propose a storage scheme that vertically partitions the data and then stores the partitions in a column-oriented database.
• A two-column table is created for each unique property in the RDF data. The first column contains the subjects and the second column contains the object values for that property. The advantages of this approach are:
    • Effective handling of multivalued attributes.
    • Elimination of NULLs.
    • No clustering of properties is required.
    • Fewer unions are needed.
• Column-oriented storage:
    • No bandwidth is wasted, because projections on the data happen before it is pulled into main memory.
    • Record headers are stored in separate columns, which reduces the tuple width, and a different compression technique can be chosen for each column.
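As a rough illustration of the partitioning step, the sketch below splits a list of triples into one two-column table per property, using plain Python lists in place of database tables. The sample data is invented. Each table is kept sorted by subject, as the paper does for its two-column tables so that merge joins can be used.

```python
from collections import defaultdict

triples = [
    ("book1", "type", "Text"),
    ("book1", "title", "The Glass Palace"),
    ("book1", "keyword", "Burma"),
    ("book2", "type", "Text"),
    ("book2", "title", "Le palais des miroirs"),
    ("book2", "keyword", "history"),
    ("book2", "keyword", "Burma"),
    ("cd1", "type", "Audio"),
]


def vertically_partition(triples):
    """Return {property: [(subject, object), ...]} sorted by subject."""
    tables = defaultdict(list)
    for subj, prop, obj in triples:
        tables[prop].append((subj, obj))
    for rows in tables.values():
        rows.sort()
    return dict(tables)


tables = vertically_partition(triples)
for prop, rows in tables.items():
    print(prop, rows)
```

A multivalued property simply becomes several rows with the same subject, and a subject that lacks a property has no row in that table, so no NULLs are stored.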
KEY CONCEPTS: SW-STORE
The authors propose SW-Store, a column-oriented DBMS optimized for storing RDF triples after partitioning them vertically. The key concepts in this DBMS include:
• Storage system: some properties can be stored as a single-column table that has a one-to-one mapping with the table containing the subjects.
• Query engine: query plans in column-oriented databases are not strictly trees, which can lead to a mismatch in the rate at which data is consumed. This is overcome by ensuring that the plan graph is still rooted at a single node with no parents and that the parents request data at the same rate.
• Overflow tables and query translation.
• Materialized joins: subject-object joins can be removed using materialized paths. For example, a path of several properties can be stored as one property, which avoids repeated joins in frequent queries.

ASSUMPTIONS MADE BY THE AUTHORS
• Postgres is assumed to be the best available row-oriented RDBMS because it handles NULLs effectively.
• The authors assume that queries that do not restrict on property values are very rare in RDF applications, and they have not evaluated their technique on such queries.
• The authors assume only a moderate amount of inserts and updates on the RDF store, which keeps compression and decompression of the data to a minimum.
• The authors assume that both attributes of the two-column tables in the vertically partitioned RDF store are fixed length, using dictionary encoding for strings, but they ignore the overhead this involves.
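Dictionary encoding, as assumed in the last point, replaces every string with a fixed-width integer and keeps the strings in a separate lookup structure. The sketch below is a generic illustration of the idea, not SW-Store's actual encoding; the function names and sample tables are assumptions.

```python
def dictionary_encode(tables):
    """Replace every string in the two-column tables with an integer id."""
    ids = {}   # string -> integer id, assigned in order of first appearance

    def encode(value):
        return ids.setdefault(value, len(ids))

    encoded = {
        prop: [(encode(subj), encode(obj)) for subj, obj in rows]
        for prop, rows in tables.items()
    }
    return encoded, ids


def dictionary_decode(encoded, ids):
    """Inverse of dictionary_encode: map the integer ids back to strings."""
    strings = {i: s for s, i in ids.items()}
    return {
        prop: [(strings[s], strings[o]) for s, o in rows]
        for prop, rows in encoded.items()
    }


tables = {
    "type": [("book1", "Text"), ("cd1", "Audio")],
    "title": [("book1", "The Glass Palace")],
}
encoded, ids = dictionary_encode(tables)
print(encoded)
print(dictionary_decode(encoded, ids) == tables)
```

The overhead the slide refers to is visible here: every result has to go back through the lookup table before it can be shown to the user.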
VALIDATION METHODOLOGY
• The research group uses the publicly available Barton Libraries dataset provided by the Simile Project at MIT (http://simile.mit.edu/rdf-test-data/barton).
• The set of queries is based on a browsing session of Longwell, a UI built by the Simile group for querying the library dataset. These queries are executed on all the existing mechanisms and on the newly proposed techniques, namely:
    • Triple store (a subject, property, object table on Postgres with no improvements).
    • Property tables (on Postgres).
    • Vertically partitioned data in a row-oriented store (Postgres).
    • Vertically partitioned data in a column-oriented store (C-Store).
• The results are captured and analysed.
VALIDATION METHODOLOGY
Strengths:
• Real-world data and good coverage of the different query scenarios.
• All existing techniques and the proposed technique are compared in a setting that is not biased towards any particular approach.
• Special or practical scenarios can be added and the results re-evaluated without much difficulty.
Weaknesses:
• Queries with unrestricted properties are avoided, although this problem is particularly relevant to vertically partitioned stores.
• The methodology uses a clustering technique for the property tables, and the authors assume that its accuracy is very high.
• In a different setting, e.g. with a different underlying database, the performance may differ.
Reasons for choosing this methodology:
• The paper describes the storage and efficient retrieval of RDF data, and the most logical way to test database performance is to measure the response time of complex real-world queries.
• An unstructured real-world dataset is already publicly available, with an effective UI to query it. This helped the authors design an execution flow that simulates real-world use.
EXAMPLE QUERIES AND EXECUTION ANALYSIS
Q1.
Counts the different types of data in the RDF store. This requires a
search for the objects and counts of those objects with property Type.
Analysis:
The vertically partitioned table and the column-store aggregate the
object values for the Type table. Because the property table solution
has the same schema as the vertically partitioned table for this
query, the query plan is the same.
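A small sketch of Q1 on both layouts, again using sqlite3 only as a stand-in for the systems in the paper. The table names `triples` and `type_table` and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("book1", "type", "Text"),
        ("book1", "title", "The Glass Palace"),
        ("book2", "type", "Text"),
        ("book2", "title", "Le palais des miroirs"),
        ("cd1", "type", "Audio"),
    ],
)

# Vertically partitioned layout: the Type property gets its own
# two-column (subject, object) table.
conn.execute("CREATE TABLE type_table (subj TEXT, obj TEXT)")
conn.execute(
    "INSERT INTO type_table SELECT subj, obj FROM triples WHERE prop = 'type'"
)

# Q1 on the triple store: filter the whole table on prop, then aggregate.
q1_triple_store = conn.execute(
    "SELECT obj, COUNT(*) FROM triples WHERE prop = 'type' "
    "GROUP BY obj ORDER BY obj"
).fetchall()

# Q1 on the vertically partitioned store: aggregate the Type table directly.
q1_vertical = conn.execute(
    "SELECT obj, COUNT(*) FROM type_table GROUP BY obj ORDER BY obj"
).fetchall()

print(q1_triple_store)
print(q1_vertical)
```

Both forms return the same counts; the difference is that the second one only reads the rows of the Type table.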
Q2.
The user selects Type: Text from the previous panel. Longwell must then display a
list of other defined properties for resources of Type: Text. It must also calculate
the frequency of these properties. For example, the Language property is defined
1,028,826 times for resources that are of Type: Text.
Analysis:
On a triple-store, this query requires a selection on property=Type and object=Text,
followed by a self-join on subject to find what other properties are defined for these
subjects. The final step is an aggregation over the properties of the newly joined
triples table.
In the property table solution, the selection predicate Type=Text is applied, and then
the counts of the non-NULL values for each of the 28 columns are written to a
temporary table. The counts are then selected out of the temporary table and
unioned together to produce the correct results schema.
The vertically partitioned store and column-store select the subjects for which the
Type table has object value Text, and store these in a temporary table, t. They
then union the results of joining each property’s table with t and count all
elements of the resulting joins.
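The vertically partitioned plan for Q2 can be sketched in plain Python: build the temporary set t from the Type table, then join each property's table against t and count the matches. The data and the function name are illustrative assumptions, and the sketch iterates over every property table rather than the 28 properties used in the paper.

```python
from collections import Counter

# One (subject, object) list per property, as in a vertically partitioned store.
tables = {
    "type": [("book1", "Text"), ("book2", "Text"), ("cd1", "Audio")],
    "title": [
        ("book1", "The Glass Palace"),
        ("book2", "Le palais des miroirs"),
        ("cd1", "Some Recording"),
    ],
    "language": [("book1", "en")],
}


def property_frequencies(tables, type_value):
    """Count, per property, how often it is defined for subjects of a type."""
    # Step 1: temporary table t = subjects whose Type is type_value.
    t = {subj for subj, obj in tables["type"] if obj == type_value}

    # Step 2: join each property table with t and count the joined rows.
    # Collecting the per-property counts plays the role of the union.
    counts = Counter()
    for prop, rows in tables.items():
        counts[prop] += sum(1 for subj, _ in rows if subj in t)
    return dict(counts)


print(property_frequencies(tables, "Text"))
```

There is one join per property, which is why the number of properties in the query matters for this layout.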
RESULTS
• The results show that column-stores offer clear advantages for storing RDF data.
• A new RDF database, SW-Store, is proposed, which is expected to execute RDF queries faster.
IMPROVEMENTS
The authors have not paid much attention to RDF from a spatial perspective:
• Schema design: queries are run against the vertically partitioned tables as well as the overflow tables. Because spatial data is heavy, a spatial index such as an R*-tree or a grid is needed to make these queries faster.
• Restrictive nature: spatial queries are not restricted to specific “properties”, whereas the authors rely on the assumption that queries are. E.g. landmarks.
• Tables should be partitioned in a better way than simply one property per table, e.g. taxonomies can help group similar properties together.
Thank you !
