Powerpoint - Mathematical & Computer Sciences


A survey “Off the Record” – Using
Alternative Data Models to
Increase Data Density in Data
Warehouse Environments.
Presented by: Victor Gonzalez-Castro
Lachlan MacKinnon
1
Agenda
• Introduction
• Data Sparsity
• State of the art
  • Relational Model
  • The Triple Store
  • The Binary Model
  • The Associative model
  • The Transrelational model
• Our proposal
• Questions
2
Introduction
• In Data Warehouse
environments Data Sparsity is a
common issue that remains
unresolved.
• Alternative Data Models that
abandon the traditional record
storage/manipulation structure
have been researched.
• We are investigating the use of
these alternative data models to
increase data density, with the
aim of reducing data sparsity.
3
Origin of Data Sparsity
• Data sparsity originates from the aim of
answering all possible user queries from the
information stored in a Data Warehouse that
contains Nulls.
[Figure: a Time Dimension with the levels Year, Month and Day; some cells are marked "$".]
Fig.1. A three level dimension and Nulls. After [6]
4
Origin of Data Sparsity (Cont…)
• Data Sparsity is the result of the Cartesian
product of all dimensions and all aggregation
levels.
[Figure: two panels, labelled (Sparse) and (Dense).]
Fig.2. Data Sparsity and data density. From [6].
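As a rough illustration of the Cartesian-product argument, the sketch below enumerates every possible cell for a few dimensions and measures how many of them hold a fact. The dimension members and the facts are invented for the example and are not taken from [6].

```python
from itertools import product

# Hypothetical dimension members (illustrative only, not from the slides).
dimensions = {
    "year": [2002, 2003],
    "product": ["Nut", "Bolt", "Screw"],
    "city": ["London", "Paris", "Oslo"],
}

# Hypothetical facts: only a few combinations ever occurred.
facts = {
    (2002, "Nut", "London"): 120.0,
    (2002, "Bolt", "Paris"): 75.5,
    (2003, "Screw", "Oslo"): 40.0,
}

# Every possible cell is one element of the Cartesian product of the dimensions.
cells = list(product(*dimensions.values()))
filled = sum(1 for cell in cells if cell in facts)

density = filled / len(cells)  # fraction of cells that hold a value
sparsity = 1 - density         # fraction of cells that would be Null

print(f"{len(cells)} cells, {filled} filled, "
      f"density {density:.2%}, sparsity {sparsity:.2%}")
```

Adding aggregation levels (for example Month and Day under Year) multiplies the number of cells again, while the number of facts stays the same.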
5
State of the art. (Relational)
• The Relational Model [7] uses the traditional record
storage/manipulation structure.
1234   Nut   Red   London
• It is the base model against which the other models
will be compared.
• All RDBMSs manage sparsity (missing
information) poorly.
• Codd [7] suggested a fundamental
change in the Relational Model Version 2: the
use of a four-valued logic.
• No one has implemented this
fundamental change.
6
State of the art. (Relational)
• Major players in the Relational Market
[Figure: vendor logos; the only surviving text is "SQL Server".]
7
State of the art. (TripleStore)
• The Triple Store [1], [2] uses a structure called
the Name Store to keep all the names.

Name Store:
Identifier   Name
1            Nut
2            Red
3            London
…            …

Triple Store:
1   2   3
4   5   6
…   …   …
l   m   n

• To construct the processing structure, it uses Triples.
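A minimal sketch of the idea, assuming a name store that gives each distinct name one identifier and triples that hold only identifiers. The class and function names (`NameStore`, `add_triple`, `resolve`) are hypothetical and do not come from TriStarp; the second triple is invented to show a name being reused.

```python
class NameStore:
    """Maps each distinct name to one identifier and back (hypothetical sketch)."""

    def __init__(self):
        self._id_of = {}
        self._name_of = {}

    def intern(self, name):
        # A name is stored only once; later uses get the same identifier.
        if name not in self._id_of:
            new_id = len(self._id_of) + 1
            self._id_of[name] = new_id
            self._name_of[new_id] = name
        return self._id_of[name]

    def name(self, identifier):
        return self._name_of[identifier]


names = NameStore()
triples = []  # each triple holds three identifiers, never the names themselves


def add_triple(a, b, c):
    triples.append(tuple(names.intern(n) for n in (a, b, c)))


def resolve(triple):
    """Turn a triple of identifiers back into names."""
    return tuple(names.name(i) for i in triple)


add_triple("Nut", "Red", "London")  # the triple from the slide
add_triple("Bolt", "Red", "Paris")  # invented; "Red" reuses its identifier

for t in triples:
    print(t, "->", resolve(t))
```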
8
State of the art. (TripleStore)
• The major project in Triple Store is TriStarp.
• TriStarp was established in 1984, led by Peter
King with support from IBM Hursley Labs.
• Dr. Sharman from IBM Hursley [1] is visiting the
TriStarp team.
• Current directions:
  • Further development of the persistent Triple
  Store Repository.
  • Continuing research on the graph-based
  model.
  • Extending the technology to manage partially
  structured data.
9
State of the art. (Binary)
• The Binary Model [4] considers that all tables are
Binary tables.

Original table:
Sur   Pname   Color   City
s1    Nut     Red     London
s2    Bolt    Green   Paris
s3    Screw   Blue    Oslo

Binary tables:
Sur   City
s1    London
s2    Paris
s3    Oslo

Sur   Pname
s1    Nut
s2    Bolt
s3    Screw

Sur   Color
s1    Red
s2    Green
s3    Blue
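The decomposition can be sketched as follows, assuming `Sur` is the key that every binary table shares. The helper names `decompose` and `recompose` are hypothetical; the join back to the original table is done on `Sur`.

```python
# The original table from the slide, one dict per record.
relation = [
    {"Sur": "s1", "Pname": "Nut", "Color": "Red", "City": "London"},
    {"Sur": "s2", "Pname": "Bolt", "Color": "Green", "City": "Paris"},
    {"Sur": "s3", "Pname": "Screw", "Color": "Blue", "City": "Oslo"},
]


def decompose(rows, key):
    """Split a table into one binary (key, value) table per non-key attribute."""
    attributes = [a for a in rows[0] if a != key]
    return {a: [(row[key], row[a]) for row in rows] for a in attributes}


def recompose(binary_tables, key):
    """Join the binary tables back together on the shared key."""
    merged = {}
    for attribute, pairs in binary_tables.items():
        for k, value in pairs:
            merged.setdefault(k, {key: k})[attribute] = value
    return list(merged.values())


binary_tables = decompose(relation, "Sur")
for attribute, pairs in binary_tables.items():
    print(attribute, pairs)
```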
10
State of the art. (Binary)
• A major project in the Binary Model [4] is MonetDB.
• It is a DBMS designed to provide high performance on complex
queries against real-world sized databases.
• It achieves this goal using innovations at all layers of a
DBMS: a storage model based on vertical fragmentation,
processing speed by self-tuning relational operators,
algorithms designed to exploit modern hardware, self-managing
indexing structures, a modular and extensible
software architecture, etc.
• It is developed at CWI, the Institute for Mathematics and
Computer Science Research of The Netherlands.
11
State of the art. (Associative)
• The Associative Model [3] comprises two types of
data structures: Items and Links.

Items:
Identifier   Name
77           Nut
08           Red
32           London
12           That is
67           Is located in

Links:
Identifier   Source   Verb   Target
74           77       12     08
03           74       67     32
• It differs from Binary and Triple store in one
fundamental way; Associations themselves may be
either the source or the target of other associations.
• It uses Quadruplets.
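The Items and Links from the slide can be resolved recursively, as in this small sketch. It assumes each link is the quadruplet (identifier, source, verb, target) and that a source or target may itself be a link identifier. The function name `render` is hypothetical.

```python
# Items: identifier -> name (values copied from the slide).
items = {
    "77": "Nut",
    "08": "Red",
    "32": "London",
    "12": "That is",
    "67": "Is located in",
}

# Links: identifier -> (source, verb, target); with the key this is a quadruplet.
links = {
    "74": ("77", "12", "08"),
    "03": ("74", "67", "32"),  # the source of link 03 is link 74 itself
}


def render(identifier):
    """Resolve an item or, recursively, a link into readable text."""
    if identifier in items:
        return items[identifier]
    source, verb, target = links[identifier]
    return f"({render(source)} {render(verb)} {render(target)})"


print(render("74"))
print(render("03"))
```

Link 03 uses link 74 as its source, which is the property that sets this model apart from the Binary and Triple Store models.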
12
State of the art. (Associative)
• The Major product in the Associative Model is SentencesDB.
• Instead of using a separate, unique table for every different
type of data, it uses a single, generic structure to contain all
types of data.
• Information about the logical structure of the data and the
rules that govern it are stored alongside the data in the
database.
• The programs are truly reusable, and no longer need to be
amended when the data structures change.
13
State of the art. (Transrelational)
• The TransRelational™ Model [5] keeps the Relational
model itself but abandons the record storage structure. It
uses two structures:

The Field Values Table.
P#   PNAME   COLOR   CITY
P1   Bolt    Blue    London
P2   Cam     Blue    London
P3   Cog     Green   London
P4   Nut     Red     Oslo
P5   Screw   Red     Paris
P6   Screw   Red     Paris

The Record Reconstruction Table.
P#   PNAME   COLOR   CITY
4    3       2       1
1    1       4       4
5    6       5       6
6    4       1       3
2    2       3       2
3    5       6       5
• Since there is currently no instantiation of the
Transrelational Model available, we will build an
implementation of the essential algorithms.
14
Transrelational. Algorithms
1. A file for the parts relation
P#   PNAME   COLOR   CITY
P1   Nut     Red     London
P2   Bolt    Green   Paris
P3   Screw   Blue    Oslo
P4   Screw   Red     London
P5   Cam     Blue    Paris
P6   Cog     Red     London

2. Sort each column in ascending order.

Field Values Table (FVT)
P#   PNAME   COLOR   CITY
P1   Bolt    Blue    London
P2   Cam     Blue    London
P3   Cog     Green   London
P4   Nut     Red     Oslo
P5   Screw   Red     Paris
P6   Screw   Red     Paris

Record Reconstruction Table (RRT)
P#   PNAME   COLOR   CITY
4    3       2       1
1    1       4       4
5    6       5       6
6    4       1       3
2    2       3       2
3    5       6       5

Record reconstruction:
1. Go to cell [1,1] of the FVT and fetch the value stored there (P1).
2. Go to the same cell [1,1] in the RRT and fetch the value (4). It is
interpreted to mean that the next field value (PNAME) is in the 4th row
of the FVT. Go to that cell and fetch the value (Nut).
3. Go to the corresponding RRT cell [4,2] and fetch the row number (4).
The next field value (the 3rd, or COLOR) is in the 4th row of the FVT (Red).
4. Go to the corresponding RRT cell [4,3] and fetch the value (1). The next
field value (the 4th, or CITY) is in the 1st row of the FVT (London).
5. Go to the corresponding RRT cell [1,4] and fetch the value
(1). A 5th column does not exist, so the walk wraps
around to the 1st column: the next value is in the 1st row of the
FVT (P1), which closes the record.
Reconstructed record: P1, Nut, Red, London
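The two algorithms can be sketched as below. This is only an illustration under stated assumptions, not the implementation the authors plan to build: the function names `build_tables` and `reconstruct` are hypothetical, and ties between equal values are broken by original record order. Because equal values can be ordered either way, the RRT is not unique, so the table computed here need not match the one shown above cell for cell.

```python
# The file from step 1 of the slide.
PARTS = [
    ("P1", "Nut", "Red", "London"),
    ("P2", "Bolt", "Green", "Paris"),
    ("P3", "Screw", "Blue", "Oslo"),
    ("P4", "Screw", "Red", "London"),
    ("P5", "Cam", "Blue", "Paris"),
    ("P6", "Cog", "Red", "London"),
]


def build_tables(rows):
    """Return (fvt, rrt). RRT entries are 1-based row numbers, as on the slide."""
    n_rows, n_cols = len(rows), len(rows[0])
    # order[j][i]: index of the original record whose value lands in
    # row i of sorted column j (sorted() is stable, so ties keep file order).
    order = [
        sorted(range(n_rows), key=lambda r, j=j: rows[r][j])
        for j in range(n_cols)
    ]
    # position[j][r]: row of sorted column j holding the value of record r.
    position = [{r: i for i, r in enumerate(col)} for col in order]
    fvt = [[rows[order[j][i]][j] for j in range(n_cols)] for i in range(n_rows)]
    # Cell [i, j] points to the row of the next column (wrapping around)
    # that belongs to the same original record.
    rrt = [
        [position[(j + 1) % n_cols][order[j][i]] + 1 for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return fvt, rrt


def reconstruct(fvt, rrt, start_row):
    """Rebuild one record by walking the RRT from start_row of the 1st column."""
    record, row = [], start_row
    for col in range(len(fvt[0])):
        record.append(fvt[row - 1][col])  # fetch the field value
        row = rrt[row - 1][col]           # row to visit in the next column
    return tuple(record)


fvt, rrt = build_tables(PARTS)
for i in range(1, len(PARTS) + 1):
    print(reconstruct(fvt, rrt, i))
```

Starting the walk at each row of the first column in turn yields every record of the original file.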
15
Alternative Data Models Comparison
Model             Storage Structure     Linkage Structure
Relational        Table (Relation)      By position
Triple Store      Name Store            Triple Store
Binary            Binary Table          Joins
Associative       Items                 Links
Transrelational   Field Values Table    Record Reconstruction Table
16
Our proposal (Our aims)
• To carry out an impartial survey on
alternative Data Models.
• Determine whether the use of
alternative data models can improve
Data Density in Data Warehouse
environments.
• Observe the effect that such an
increase in data density has on
data sparsity.
17
Our proposal (How…)
• We intend to use an implementation of each data model.
[Figure: implementation logos; the only surviving label is "TransRelational™".]
• We will use the TPC-H data set to load each database.
• Run a set of benchmark metrics where available; if
not, we will develop our own metrics to determine
relative performance, and then consider relative
data density and sparsity.
18
Just Remember…
• Instead of storing data horizontally, do it
vertically and eliminate duplicate values.
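[Figure: records stored vertically with duplicate values eliminated. Column values shown: 123, 456, 789, 234, 567; Bolt, Screw, Nut, Nail; Black, Blue, White; Paris, London. Callout: "Here are the Savings".]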
• We are abandoning the traditional Record
Structure; we are going “off the record”.
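A minimal sketch of storing data vertically and eliminating duplicate values: each column keeps its distinct values once, plus a list of positions that says which value each record uses. The values are the ones on the slide, but their grouping into records is an assumption made for the example, and the function names `to_columns` and `to_rows` are hypothetical.

```python
# Hypothetical records; only the individual values appear on the slide.
records = [
    (123, "Bolt", "Black", "Paris"),
    (456, "Screw", "Blue", "London"),
    (789, "Nut", "White", "Paris"),
    (234, "Nail", "Black", "London"),
    (567, "Bolt", "Blue", "Paris"),
]


def to_columns(rows):
    """Store each column as (distinct values, position of each record's value)."""
    columns = []
    for values in zip(*rows):  # zip(*rows) turns records into columns
        distinct = sorted(set(values))
        index = {value: i for i, value in enumerate(distinct)}
        columns.append((distinct, [index[value] for value in values]))
    return columns


def to_rows(columns):
    """Rebuild the original records from the vertical representation."""
    decoded = [[distinct[i] for i in positions] for distinct, positions in columns]
    return list(zip(*decoded))


columns = to_columns(records)
horizontal_values = len(records) * len(records[0])
distinct_values = sum(len(distinct) for distinct, _ in columns)
print(f"values stored horizontally: {horizontal_values}")
print(f"distinct values stored vertically: {distinct_values}")
```

The position lists still have to be stored, so the count of distinct values overstates the real saving; the sketch only shows where the saving comes from.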
19
Questions?
20
Thanks !!
[email protected]
[email protected]
21
References
1. Sharman, G.C.H. and Winterbottom, N. The Universal Triple Machine: a Reduced
Instruction Set Repository Manager. Proceedings of BNCOD 6, pp. 189-214, 1988.
2. TriStarp Web Site. http://www.dcs.bbk.ac.uk/~tristarp. Updated November 2000.
3. Williams, Simon. The Associative Model of Data, Second Edition. Lazy Software Ltd.
ISBN 1-903453-01-1. www.lazysoft.com
4. MonetDB. ©1994-2004 by CWI. http://monetdb.cwi.nl
5. Date, C.J. An Introduction to Database Systems, Eighth Edition. Appendix A: The
TransRelational Model. Addison-Wesley, USA, 2004. ISBN 0-321-18956-6.
6. Pendse, Nigel. Database explosion. http://www.olapreport.com. Updated August 2003.
7. Codd, E.F. The Relational Model for Database Management: Version 2. Addison-Wesley,
1990. ISBN 0-201-14192-2.
22