Oid - Rakesh Agrawal

Download Report

Transcript Oid - Rakesh Agrawal

Storage and Retrieval of
E-Commerce Data
R. Agrawal, A. Somani, Y. Xu: VLDB-2001
Overview





E-Commerce Data Characteristics
Alternative Physical Representations
Querying Mapping Layer
Performance evaluation of various approaches
Conclusions
Typical E-Commerce Data
Characteristics
An Experimental E-marketplace for
Computer components

Nearly 2 Million components
 More than 2000 leaf-level
categories
 Large number of Attributes (5000)

Constantly evolving schema
Sparsely populated data (about
50-100 attributes/component)

Outline

E-Commerce Data Characteristics

Alternative Physical Representations
 Horizontal - One n-ary relation
 Binary - N 2-ary relations
 Vertical - One 3-ary relation

Query Mapping Layer
 Performance evaluation
 Conclusion
Conventional horizontal representation
(n-ary relation)
Name
Monitor
Height
Recharge
Output
playback
Smooth scan
Progressive Scan
PAN DVD-L75
7 inch
-
Built-in
Digital
-
-
-
KLH DVD221
-
3.75
-
S-Video
-
-
No
SONY S-7000
-
-
-
-
-
-
-
SONY S-560D
-
-
-
-
Cinema Sound
Yes
-
…
…
…
…
…
…
…
…

DB Catalogs do not support thousands of columns (DB2/Oracle
limit: 1012 columns)
 Storage overhead of NULL values Nulls increase the index size
and they sort high in DB2 B+ tree index
 Hard to load/update
 Schema evolution is expensive
 Querying is straightforward
Binary Representation
(N 2-ary relations)
Monitor

Height
Output
Name
Val
Name
Val
PAN DVD-L75
7 inch
KLH DVD221
3.75
Dense representation
 Manageability is hard
because of large number of
tables
 Schema evolution expensive
Name



Val
PAN DVD-L75
Digital
KLH DVD221
S-Video
Decomposition Storage Model
[Copeland et al SIGMOD 85],
[Khoshafian et al ICDE 87]
Monet: Binary Attribute Tables
[Boncz et al VLDB Journal 99]
Attribute Approach for storing
XML Data [Florescu et al INRIA
Tech Report 99]
Vertical representation
(One 3-ary relation)
Oid (object identifier) Key (attribute name) Val (attribute value)
Oid
Key
Val
0
‘Name’
‘PAN DVDL75’
0
‘Monitor’
‘7 inch’
0
‘Recharge’
‘Built-in’
0
‘Output’
‘Digital’
1
‘Name’
‘KLH DVD221’
1
‘Height’
‘3.75’
1
‘Output’
‘S-Video’
1
‘Progressiv
e Scan’
‘No’
2
‘Name’
‘SONY S-7000’
…
…
…





Objects can have large number of
attributes
Handles sparseness well
Schema evolution is easy
Implementation of SchemaSQL [LSS 99]
Edge Approach for storing XML Data [FK
99]
Querying over Vertical
Representation

Simple query on a Horizontal scheme
SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’
Becomes quite complex:
SELECT v1.Val
FROM vtable v1, vtable v2
WHERE v1.Key = ‘Monitor’
AND v2.Key = ‘Output’
AND v2.Val = ‘Digital’
AND v1.Oid = v2.Oid
Writing applications becomes much harder. What can we do ?
Solution : Query Mapping on
Vertical Representation

Simplify querying of the data
– Provide a horizontal view of the vertical table
– Automatically transform queries on the
horizontal view to the vertical table
Solution (Continued)

Translation layer maps relational algebraic
operations on H to operations on V
Horizontal
view (H)
Attr1
…
Attr2
Query Mapping Layer
Vertical
table (V)
Oid
Key
Val
Attrk
…
Key Issues

Can we define a translation algebra ?
 Standard database view mechanism does not
work
 attribute values
become column names (need higher
order views)

Can the vertical representation support fast
querying ?
 What are the new requirements from the database
engines?
Outline

E-Commerce data characteristics
 Alternative Approaches


Query Mapping Layer
 Transformation Algebra
 Implementation Strategies
Performance evaluation
 Conclusion
Transformation Algebra

Defined an algebra for transforming expressions
over horizontal views into expressions over the
vertical representation. Two key operators:
– v2h (Vertical to Horizontal)
– h2v (Horizontal to Vertical)
Notation










Select (s)
Project (p)
Join ()
Left Outerjoin ()
Right Outerjoin ()
Union ()
Intersection ()
Difference (-)
Aggregation (f)
Null Value ()
Sample Algebraic Transforms

v2h () Operation – Convert from vertical to horizontal
k(V) = [Oid(V)]  [i=1,k Oid,Val(sKey=‘Ai’(V))]

h2V () Operation – Convert from horizontal to vertical
k(H) = [i=1,k Oid,’Ai’Ai(sAi  ‘’(V))] 
[i=1,k Oid,’Ai’Ai(si=1,k Ai=‘’(V))

Similar operations such as Unfold/Fold and Gather/Scatter
exist in SchemaSQL [LSS 99] and [STA 98] respectively

Complete transforms in VLDB-2001 Paper
From the Algebra to SQL

Equivalent SQL transforms for algebraic transforms
– Select, Project
– Joins (self, two verticals, a horizontal and a vertical)
– Cartesian Product
– Union, Intersection, Set difference
– Aggregation

Extend DDL to provide the Horizontal View
CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable
USING COLUMNS (Attr1, Attr2, … Attrk, …)

Four implementation strategies to provide Horizontal View
Implementation Strategies on
Vertical Representation

VerticalSQL
– Uses only SQL-92 level capabilities
– Used XQGM code to represented parsed SQL
queries
 VerticalUDF
– Exploits User Defined Functions and Table
Functions to provide a direct implementation
Alternative Implementation
Strategies

Binary (hand-coded queries)
– 2-ary representation with one relation per
attribute (using only SQL-92 transforms)

SchemaSQL
– Addresses a more general problem
– Performed 2-3X worse than Vertical
representation because of temporary tables
Transformation strategy: VerticalSQL
Horizontal view
Attr1
Attr2
…
SELECT Attr1, Attr2 FROM hview
Query transformation
SELECT t7.Attr1, t7.Attr2
FROM ( SELECT t4.Oid, t4.Attr1, t6.Attr2
FROM (SELECT t0.Oid, t3.Attr1
FROM (SELECT DISTINCT t0.Oid FROM vtable AS t0
) AS t1(Oid)
LEFT OUTER JOIN
(SELECT t2.Oid, t2.Val FROM vtable AS t2
WHERE t2.Key = ‘Attr1’
) AS t3(Oid, Attr1)
ON t1.Oid = t3.Oid
) AS t4(Oid, Attr1)
LEFT OUTER JOIN
(SELECT t5.Oid, t5.Val FROM vtable AS t5
WHERE t5.Key = ‘Attr2’
) AS t6(Oid, Attr2)
ON t4.Oid = t6.Oid
) AS t7(Oid, Attr1, Attr2)
Vertical table
Oid
Key
Val
…
…
Transformation strategy: VerticalUDF
Horizontal view
Attr1
Attr2
…
…
…
SELECT Attr1, Attr2 FROM hview
Query transformation
SELECT t1.Attr1, t1.Attr2
FROM vtable AS t0,
TABLE(v2h(t0.Oid, t0.Key, t0.Val)) AS t1(Oid, Attr1, Attr2)
WHERE t0.Key = ‘Attr1’ OR t0.Key = ‘Attr2’
Vertical table
 The
Oid
Key
Val
0
‘Attr1’
…
v2h table function reads tuples of vertical table
sorted on Oid and outputs a horizontal tuple for each oid
 Severely penalized because of lack of engine support
Transformation strategy: Binary
Horizontal view
Attr1
…
Attr2
…
…
SELECT Attr1, Attr2 FROM hview
Query transformation
SELECT t2.Attr1, ATTR2.val
FROM ( SELECT t1.Oid, ATTR1.Val
FROM ( SELECT Oid FROM ALLPROD
) AS t1(Oid)
LEFT OUTER JOIN
ATTR1
ON t1.Oid = Attr1.Oid
) AS t2(Oid, Attr1)
LEFT OUTER JOIN
ATTR2
ON t2.Oid = ATTR2.Oid
ALLPROD
Binary Relations
Oid
ATTR1
Oid
Val
ATTR2
Oid
Val
Story So Far …

E-Commerce Data
– High-Arity, Sparse and Constantly Evolving

Three Approaches
Horizontal
Vertical (w/
Mapping)
Binary (w/
Mapping)
Manageability
+
+
-
Flexibility
-
+
-
Querying
+
+
+

But what about Performance ?
Performance:
Experimental setup

600 MHz dual-processor Intel Pentium machine
 512 MB RAM
 Windows NT 4.0
 Database IBM DB2 UDB 7.1
 Two 30GB IDE Drives

Buffer Pool Size – 50 MB

All numbers reported are cold numbers
Experimental Setup (Continued)

Parameters used for Synthetic Data
– Number of columns
– Number of rows
– Non-null density
– Selectivity of a predicate for a column
For example, 200X100K & r=10% implies 200 columns,
100K rows and non-null density = 10%

Complete results in technical report
Clustering by Key outperforms
clustering by Oid
Execution time (seconds)
density = 10%, 1000 cols x 20K rows
25
20
VerticalSQL_oid
15
VerticalSQL_key
10
5
0
0.1%
1%
Join selectivity
Join
5%
VerticalSQL comparable to Binary
and outperforms Horizontal
density = 10%
Execution time (seconds)
60
50
40
HorizontalSQL
30
VerticalSQL
20
Binary
10
0
200x100K
400x50K
800x25K
1000x20K
Table (#cols x #rows)
Projection of 10 columns
VerticalUDF is the best approach
density = 10%
Execution time (seconds)
30
20
VerticalSQL
Binary
10
VerticalUDF
0
200x100K
400x50K
800x25K
1000x20K
Table (#cols x #rows)
Projection of 10 columns
Wish List from Database Engines
– VerticalUDF approach showed need for
– Enhanced table functions
– First class treatment of table functions
– Native support for v2h and h2v operations
– Partial indices
Summary
Horizontal
Vertical (w/
Mapping)
Binary (w/
Mapping)
Manageability
+
+
-
Flexibility
-
+
-
Querying
+
+
+
Performance
-
+
+
Conclusions

Vertical Representation attractive for dealing with ECommerce Data


High-arity, sparse and constantly evolving schema
Get the best of both horizontal and vertical representations
 Translation layer => Querying as easy as horizontal
representation
 Great flexibility and manageability
 Performance
– VerticalUDF best approach
– VerticalSQL comparable to Binary and outperforms Horizontal
Full Report @ http://www.almaden.ibm.com/cs/quest/