No Slide Title

Transcript No Slide Title

Parallel Data Cube
F ord
B y M a ke & Y e ar
C he vy
B y Y ear
1 99 0
1 99 1
1 99 2
1 99 3
Data Mining
B y M ak e
Red
OLAP (On-line
analytical processing)
W hite
B lu e
B y C o lo ur
& Y e ar
cube / group-by
operator in SQL
B y M a ke & C olou r
B y C olou r
Frank Dehne
www.dehne.net
Data Warehousing for Decision
Support
Query Reports
Analysis
Data Mining
Front-End Tools
Olap Server
Output
Olap Server
Olap Engines
Monitoring
Administration
Data Warehouse
Data Storage
Meta Data Repository
Extract
Clean
Transform
Load
Refresh
Operational Databases
Frank Dehne
Data Marts
External Sources
Data Cleaning
and
Integration
• Operational data
collected into DW
• DW used to support
multi-dimensional
views
• Views form the basis
of OLAP processing
• Our focus: the OLAP
server
www.dehne.net
Multi-dimensional views
• Collection of feature
attributes
• Aggregate along one
or more measure
attributes
• Reduce the granularity
by collapsing
dimensions
F ord
B y M a ke & Y e ar
C he vy
B y Y ear
1 99 0
1 99 1
1 99 2
1 99 3
B y M ak e
Red
W hite
B lu e
B y C o lo ur
& Y e ar
B y M a ke & C olou r
B y C olou r
Frank Dehne
www.dehne.net
Data Cube Generation
• Proposed by Gray et al
(Microsoft) in 1995
• Exploits the relationship
between cuboids to
compute all 2d cuboids
• In OLAP views are
typically pre-computed
to improve query
response time
Frank Dehne
ABCD
ABC
AC
AB
A
ABD
ACD
AD
BC
B
C
BCD
BD
CD
D
All
www.dehne.net
Sequential Solutions
• Top Down Cube
– Compute high
dimension views first
– Exploit shared
dimensions
– Pipesort
– PipeHash
Frank Dehne
• Bottom Up Cube
– Minimizes external
memory sorting by
partitioning first on
single attributes
• ArrayCube
www.dehne.net
Sequential Solutions
MOLAP
• array representation
• easy to build and
query
• large storage
• needs translation
from/to relational
model
Frank Dehne
ROLAP
• relational data
representation
• harde to build and
query
• smaller storage
• no translation from/to
relational model
www.dehne.net
Top Down Cube (Pipesort)
ABCD
ABC
AC
AB
A
ABD
ACD
AD
BC
B
C
All
Frank Dehne
BCD
BD
CD
D
• Construct the data
cube lattice
• Estimate the edge
costs
• Find a least cost
spanning tree
• Compute the views by
following the “pipes”
www.dehne.net
Optimizations
– Share-sorts - sharing sorting cost across
multiple group-bys.
– Smallest parent - computing a cuboid from the
smallest previously computed parent.
– Cache results - reduce I/O by caching (in
memory) parent views from which other
cuboids are computed.
– Amortize disk-scans - compute as many child
views as possible when scanning each parent.
Frank Dehne
www.dehne.net
Bottom Up Cube
ABCD
ABC
AC
AB
A
ABD
ACD
AD
BC
B
C
BCD
BD
CD
D
All
Frank Dehne
www.dehne.net
Bottom Up Cube
C1
B1
C2
A1
B2
B3
B4
A2
A3
Frank Dehne
D1
D2
• Partition large view
into memory-sized
units
• Perform sorting
operations in memory
• May significantly
reduce external
memory processing
www.dehne.net
Our Results
– Parallel top-down ROLAP cube construction
for shared disks (Distributed and Parallel Databases, 2002)
– Parallel top-down ROLAP cube construction
for distributed disks (IPDPS 2002)
– Parallel bottom-up ROLAP cube construction
for shared and distributed disks (Distributed and
Parallel Databases, 2002)
– Parallel ROLAP cube indexing for distributed
disks (CCGrid 2003)
Frank Dehne
www.dehne.net
Our Results
– Parallel top-down ROLAP cube construction
for shared disks (Distributed and Parallel Databases, 2002)
– Parallel top-down ROLAP cube construction
for distributed disks (IPDPS 2002)
– Parallel bottom-up ROLAP cube construction
for shared and distributed disks (Distributed and
Parallel Databases, 2002)
– Parallel ROLAP cube indexing for distributed
disks (CCGrid 2003)
Frank Dehne
www.dehne.net
Parallel top-down ROLAP cube
Our approach:
– Partition the load in advance and assign cuboids
to individual processors
– Local computation exploits existing optimized
sequential algorithms (ROLAP)
– Communication is reduced to a single phase in
which work lists are distributed
Frank Dehne
www.dehne.net
Parallel top-down ROLAP cube
CBAD
CBA
BAD ACD
AC AD
BA
A
B
Frank Dehne
CB DB
C
All
BCD
CD
D
• Cut the process tree
into p “equal weight”
sub-trees
• Each processor
independently
generates cuboids
from its own sub-tree
• Load balance/stripe
the output
www.dehne.net
Tree Partitioning
• Optimal tree partitioning is NP-complete
• Min-max tree k-partitioning: Given a tree T with
n vertices and a positive weight assigned to each vertex,
delete k edges in the tree to obtain k connected
components T1, T2, ... Tk+1 such that the largest total
weight of a resulting sub-tree is minimized.
• O(n) time, Frederickson 1990
• O(Rk(k + log d)+n) time - Becker, Perl and
Schach ‘82
Frank Dehne
www.dehne.net
Over Sampling
p subtrees
s * p subtrees
p subsets
Frank Dehne
www.dehne.net
Time vs. #Proc
Frank Dehne
www.dehne.net
For more information ...
http://cgm.dehne.net
Frank Dehne
www.dehne.net
Frank Dehne
www.dehne.net

No Slide Title

Transcript No Slide Title

Directory