Oid - Rakesh Agrawal
Download
Report
Transcript Oid - Rakesh Agrawal
Storage and Retrieval of
E-Commerce Data
R. Agrawal, A. Somani, Y. Xu: VLDB-2001
Overview
E-Commerce Data Characteristics
Alternative Physical Representations
Querying Mapping Layer
Performance evaluation of various approaches
Conclusions
Typical E-Commerce Data
Characteristics
An Experimental E-marketplace for
Computer components
Nearly 2 Million components
More than 2000 leaf-level
categories
Large number of Attributes (5000)
Constantly evolving schema
Sparsely populated data (about
50-100 attributes/component)
Outline
E-Commerce Data Characteristics
Alternative Physical Representations
Horizontal - One n-ary relation
Binary - N 2-ary relations
Vertical - One 3-ary relation
Query Mapping Layer
Performance evaluation
Conclusion
Conventional horizontal representation
(n-ary relation)
Name
Monitor
Height
Recharge
Output
playback
Smooth scan
Progressive Scan
PAN DVD-L75
7 inch
-
Built-in
Digital
-
-
-
KLH DVD221
-
3.75
-
S-Video
-
-
No
SONY S-7000
-
-
-
-
-
-
-
SONY S-560D
-
-
-
-
Cinema Sound
Yes
-
…
…
…
…
…
…
…
…
DB Catalogs do not support thousands of columns (DB2/Oracle
limit: 1012 columns)
Storage overhead of NULL values Nulls increase the index size
and they sort high in DB2 B+ tree index
Hard to load/update
Schema evolution is expensive
Querying is straightforward
Binary Representation
(N 2-ary relations)
Monitor
Height
Output
Name
Val
Name
Val
PAN DVD-L75
7 inch
KLH DVD221
3.75
Dense representation
Manageability is hard
because of large number of
tables
Schema evolution expensive
Name
Val
PAN DVD-L75
Digital
KLH DVD221
S-Video
Decomposition Storage Model
[Copeland et al SIGMOD 85],
[Khoshafian et al ICDE 87]
Monet: Binary Attribute Tables
[Boncz et al VLDB Journal 99]
Attribute Approach for storing
XML Data [Florescu et al INRIA
Tech Report 99]
Vertical representation
(One 3-ary relation)
Oid (object identifier) Key (attribute name) Val (attribute value)
Oid
Key
Val
0
‘Name’
‘PAN DVDL75’
0
‘Monitor’
‘7 inch’
0
‘Recharge’
‘Built-in’
0
‘Output’
‘Digital’
1
‘Name’
‘KLH DVD221’
1
‘Height’
‘3.75’
1
‘Output’
‘S-Video’
1
‘Progressiv
e Scan’
‘No’
2
‘Name’
‘SONY S-7000’
…
…
…
Objects can have large number of
attributes
Handles sparseness well
Schema evolution is easy
Implementation of SchemaSQL [LSS 99]
Edge Approach for storing XML Data [FK
99]
Querying over Vertical
Representation
Simple query on a Horizontal scheme
SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’
Becomes quite complex:
SELECT v1.Val
FROM vtable v1, vtable v2
WHERE v1.Key = ‘Monitor’
AND v2.Key = ‘Output’
AND v2.Val = ‘Digital’
AND v1.Oid = v2.Oid
Writing applications becomes much harder. What can we do ?
Solution : Query Mapping on
Vertical Representation
Simplify querying of the data
– Provide a horizontal view of the vertical table
– Automatically transform queries on the
horizontal view to the vertical table
Solution (Continued)
Translation layer maps relational algebraic
operations on H to operations on V
Horizontal
view (H)
Attr1
…
Attr2
Query Mapping Layer
Vertical
table (V)
Oid
Key
Val
Attrk
…
Key Issues
Can we define a translation algebra ?
Standard database view mechanism does not
work
attribute values
become column names (need higher
order views)
Can the vertical representation support fast
querying ?
What are the new requirements from the database
engines?
Outline
E-Commerce data characteristics
Alternative Approaches
Query Mapping Layer
Transformation Algebra
Implementation Strategies
Performance evaluation
Conclusion
Transformation Algebra
Defined an algebra for transforming expressions
over horizontal views into expressions over the
vertical representation. Two key operators:
– v2h (Vertical to Horizontal)
– h2v (Horizontal to Vertical)
Notation
Select (s)
Project (p)
Join ()
Left Outerjoin ()
Right Outerjoin ()
Union ()
Intersection ()
Difference (-)
Aggregation (f)
Null Value ()
Sample Algebraic Transforms
v2h () Operation – Convert from vertical to horizontal
k(V) = [Oid(V)] [i=1,k Oid,Val(sKey=‘Ai’(V))]
h2V () Operation – Convert from horizontal to vertical
k(H) = [i=1,k Oid,’Ai’Ai(sAi ‘’(V))]
[i=1,k Oid,’Ai’Ai(si=1,k Ai=‘’(V))
Similar operations such as Unfold/Fold and Gather/Scatter
exist in SchemaSQL [LSS 99] and [STA 98] respectively
Complete transforms in VLDB-2001 Paper
From the Algebra to SQL
Equivalent SQL transforms for algebraic transforms
– Select, Project
– Joins (self, two verticals, a horizontal and a vertical)
– Cartesian Product
– Union, Intersection, Set difference
– Aggregation
Extend DDL to provide the Horizontal View
CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable
USING COLUMNS (Attr1, Attr2, … Attrk, …)
Four implementation strategies to provide Horizontal View
Implementation Strategies on
Vertical Representation
VerticalSQL
– Uses only SQL-92 level capabilities
– Used XQGM code to represented parsed SQL
queries
VerticalUDF
– Exploits User Defined Functions and Table
Functions to provide a direct implementation
Alternative Implementation
Strategies
Binary (hand-coded queries)
– 2-ary representation with one relation per
attribute (using only SQL-92 transforms)
SchemaSQL
– Addresses a more general problem
– Performed 2-3X worse than Vertical
representation because of temporary tables
Transformation strategy: VerticalSQL
Horizontal view
Attr1
Attr2
…
SELECT Attr1, Attr2 FROM hview
Query transformation
SELECT t7.Attr1, t7.Attr2
FROM ( SELECT t4.Oid, t4.Attr1, t6.Attr2
FROM (SELECT t0.Oid, t3.Attr1
FROM (SELECT DISTINCT t0.Oid FROM vtable AS t0
) AS t1(Oid)
LEFT OUTER JOIN
(SELECT t2.Oid, t2.Val FROM vtable AS t2
WHERE t2.Key = ‘Attr1’
) AS t3(Oid, Attr1)
ON t1.Oid = t3.Oid
) AS t4(Oid, Attr1)
LEFT OUTER JOIN
(SELECT t5.Oid, t5.Val FROM vtable AS t5
WHERE t5.Key = ‘Attr2’
) AS t6(Oid, Attr2)
ON t4.Oid = t6.Oid
) AS t7(Oid, Attr1, Attr2)
Vertical table
Oid
Key
Val
…
…
Transformation strategy: VerticalUDF
Horizontal view
Attr1
Attr2
…
…
…
SELECT Attr1, Attr2 FROM hview
Query transformation
SELECT t1.Attr1, t1.Attr2
FROM vtable AS t0,
TABLE(v2h(t0.Oid, t0.Key, t0.Val)) AS t1(Oid, Attr1, Attr2)
WHERE t0.Key = ‘Attr1’ OR t0.Key = ‘Attr2’
Vertical table
The
Oid
Key
Val
0
‘Attr1’
…
v2h table function reads tuples of vertical table
sorted on Oid and outputs a horizontal tuple for each oid
Severely penalized because of lack of engine support
Transformation strategy: Binary
Horizontal view
Attr1
…
Attr2
…
…
SELECT Attr1, Attr2 FROM hview
Query transformation
SELECT t2.Attr1, ATTR2.val
FROM ( SELECT t1.Oid, ATTR1.Val
FROM ( SELECT Oid FROM ALLPROD
) AS t1(Oid)
LEFT OUTER JOIN
ATTR1
ON t1.Oid = Attr1.Oid
) AS t2(Oid, Attr1)
LEFT OUTER JOIN
ATTR2
ON t2.Oid = ATTR2.Oid
ALLPROD
Binary Relations
Oid
ATTR1
Oid
Val
ATTR2
Oid
Val
Story So Far …
E-Commerce Data
– High-Arity, Sparse and Constantly Evolving
Three Approaches
Horizontal
Vertical (w/
Mapping)
Binary (w/
Mapping)
Manageability
+
+
-
Flexibility
-
+
-
Querying
+
+
+
But what about Performance ?
Performance:
Experimental setup
600 MHz dual-processor Intel Pentium machine
512 MB RAM
Windows NT 4.0
Database IBM DB2 UDB 7.1
Two 30GB IDE Drives
Buffer Pool Size – 50 MB
All numbers reported are cold numbers
Experimental Setup (Continued)
Parameters used for Synthetic Data
– Number of columns
– Number of rows
– Non-null density
– Selectivity of a predicate for a column
For example, 200X100K & r=10% implies 200 columns,
100K rows and non-null density = 10%
Complete results in technical report
Clustering by Key outperforms
clustering by Oid
Execution time (seconds)
density = 10%, 1000 cols x 20K rows
25
20
VerticalSQL_oid
15
VerticalSQL_key
10
5
0
0.1%
1%
Join selectivity
Join
5%
VerticalSQL comparable to Binary
and outperforms Horizontal
density = 10%
Execution time (seconds)
60
50
40
HorizontalSQL
30
VerticalSQL
20
Binary
10
0
200x100K
400x50K
800x25K
1000x20K
Table (#cols x #rows)
Projection of 10 columns
VerticalUDF is the best approach
density = 10%
Execution time (seconds)
30
20
VerticalSQL
Binary
10
VerticalUDF
0
200x100K
400x50K
800x25K
1000x20K
Table (#cols x #rows)
Projection of 10 columns
Wish List from Database Engines
– VerticalUDF approach showed need for
– Enhanced table functions
– First class treatment of table functions
– Native support for v2h and h2v operations
– Partial indices
Summary
Horizontal
Vertical (w/
Mapping)
Binary (w/
Mapping)
Manageability
+
+
-
Flexibility
-
+
-
Querying
+
+
+
Performance
-
+
+
Conclusions
Vertical Representation attractive for dealing with ECommerce Data
High-arity, sparse and constantly evolving schema
Get the best of both horizontal and vertical representations
Translation layer => Querying as easy as horizontal
representation
Great flexibility and manageability
Performance
– VerticalUDF best approach
– VerticalSQL comparable to Binary and outperforms Horizontal
Full Report @ http://www.almaden.ibm.com/cs/quest/