FREERIDE: System Support for High Performance Data Mining
Ruoming Jin, Leo Glimcher, Xuan Zhang, Ge Yang, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
Motivation: Data Mining Problem
• Datasets available for mining are often large
• Our understanding of which algorithms and parameters will give the desired insights is limited
• The time required to create scalable implementations of different algorithms, and to run them with different parameters on large datasets, slows down the data mining process
Project Overview
• FREERIDE (Framework for Rapid Implementation of Datamining Engines) as the base system
• Demonstrated for a variety of standard mining algorithms
• FREERIDE offers:
  - The ability to rapidly prototype a high-performance mining implementation
  - Distributed memory parallelization
  - Shared memory parallelization
  - The ability to process large and disk-resident datasets
  - Only modest modifications to a sequential implementation for all three of the above
Key Observation from Mining Algorithms
• Popular algorithms have a common canonical loop
• This loop can be used as the basis for supporting a common middleware

    While( ) {
      forall( data instances d ) {
        I = process(d)
        R(I) = R(I) op d
      }
      .......
    }
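
To make the structure concrete, below is a minimal C++ sketch of the canonical loop, using one-dimensional k-means-style accumulation as the reduction operation; ReductionObject, process, and canonicalLoop are illustrative names, not FREERIDE's actual API.

    #include <cmath>
    #include <vector>

    // Illustrative reduction object: one accumulator per cluster.
    struct ReductionObject {
        std::vector<double> sum;    // running sum of the points in each cluster
        std::vector<long>   count;  // number of points in each cluster
    };

    // process(d): maps a data instance to the reduction element it
    // updates (for k-means: the index of the nearest cluster center).
    int process(double d, const std::vector<double>& centers) {
        int best = 0;
        for (int i = 1; i < (int)centers.size(); ++i)
            if (std::abs(d - centers[i]) < std::abs(d - centers[best]))
                best = i;
        return best;
    }

    void canonicalLoop(const std::vector<double>& data,
                       std::vector<double>& centers,
                       ReductionObject& R) {          // R sized by the caller
        bool done = false;
        while (!done) {                               // While( ) { ... }
            for (double d : data) {                   // forall( data instances d )
                int I = process(d, centers);          // I = process(d)
                R.sum[I] += d;                        // R(I) = R(I) op d
                R.count[I] += 1;
            }
            // ... recompute centers from R and test convergence ...
            done = true;                              // placeholder stop condition
        }
    }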
Shared Memory Parallelization Techniques
• Full Replication: create a copy of the reduction object for each thread
• Full Locking: associate a lock with each element
• Optimized Full Locking: put each element and its corresponding lock on the same cache block
• Fixed Locking: use a fixed number of locks
• Cache-Sensitive Locking: one lock for all the elements on a cache block
Memory Layout for Various Locking Schemes
[Figure: memory layouts for full locking, optimized full locking, fixed locking, and cache-sensitive locking; the legend distinguishes locks from reduction elements.]
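
The two non-trivial layouts can be sketched in C++ as follows; this is a minimal sketch assuming 64-byte cache blocks and a one-byte spinlock, and the types are illustrative rather than FREERIDE's implementation.

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t CACHE_LINE = 64;  // assumed cache-block size

    // A one-byte spinlock keeps the layout arithmetic simple.
    struct SpinLock {
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        void lock()   { while (f.test_and_set(std::memory_order_acquire)) {} }
        void unlock() { f.clear(std::memory_order_release); }
    };

    // Optimized full locking: each element shares a cache block with its
    // lock, so an update touches a single cache line.
    struct alignas(CACHE_LINE) LockedElement {
        SpinLock lock;
        double   value;
    };

    // Cache-sensitive locking: one lock guards all the elements that fit
    // on the rest of the cache block.
    struct alignas(CACHE_LINE) LockedBlock {
        SpinLock lock;
        double   values[7];  // lock + padding + 7 doubles fill 64 bytes
    };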
Trade-offs between Techniques
• Memory requirements: high memory requirements can cause memory thrashing
• Contention: if the number of reduction elements is small, contention for locks can be a significant factor
• Coherence cache misses and false sharing: more likely with a small number of reduction elements
Combining Shared Memory and Distributed Memory Parallelization
• Distributed memory parallelization by replication of the reduction object
• Naturally combines with full replication on shared memory
• For locking with non-trivial memory layouts, two options:
  - Communicate the locks as well
  - Copy the reduction elements to a separate buffer (see the sketch below)
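
The slides do not show the communication layer itself; assuming an MPI-style backend, the replication-based global reduction after the local phase might look like the following sketch, which uses the second option above (elements copied into a contiguous buffer so that locks are never communicated) and assumes the reduction op is addition.

    #include <mpi.h>
    #include <vector>

    // localElems holds this node's reduction elements, already copied out
    // of whatever locked layout the shared memory phase used.
    void globalReduce(std::vector<double>& localElems) {
        std::vector<double> globalElems(localElems.size());
        MPI_Allreduce(localElems.data(), globalElems.data(),
                      (int)localElems.size(), MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        localElems.swap(globalElems);  // every node now has the global result
    }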
Apriori Association Mining
[Chart: relative performance, time (s), of full replication (fr), optimized full locking (ofl), and cache-sensitive locking (csl) at support levels of 0.10%, 0.05%, 0.03%, and 0.02%. 500 MB dataset, N2000, L20, 4 threads.]
K-means Shared Memory Parallelization
[Chart: execution time of full replication, optimized full locking, and cache-sensitive locking with 1, 4, and 16 threads.]
Performance on Cluster of SMPs
[Chart: Apriori association mining execution time with 1, 2, and 3 threads per node on 1, 2, 4, and 8 nodes.]
Results from EM Clustering Algorithm
• EM is a popular data mining algorithm
• Can we parallelize it using the same support that worked for another clustering algorithm (k-means) and for algorithms for other mining tasks?
[Chart: EM execution time with 1, 2, 3, and 4 threads per node on 1, 2, 4, and 8 nodes.]
Results from FP-tree
[Chart: FP-tree on an 800 MB dataset with 20 frequent itemsets; execution time with 1, 2, and 3 threads per node on 1, 2, 4, and 8 nodes.]
A Case Study: Decision Tree Construction
• Question: can we parallelize decision tree construction using the same framework?
• Most existing parallel algorithms have a fairly different structure (sorting, writing back …)
• Being able to support decision tree construction will significantly add to the usefulness of the framework
Approach
• Implemented the RainForest framework (Gehrke)
• Current focus is on RF-read
• Overview of the algorithm (sketched in code below):
  - While the stop condition is not satisfied:
    - read the data
    - build the AVC-groups for the nodes
    - choose the splitting attributes to split the nodes
    - select a new set of nodes to process, as many as main memory can hold
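
A schematic of the RF-read loop just described; Record, TreeNode, AVCGroup, and the helper functions are hypothetical stand-ins for the real implementation, not RainForest's code.

    #include <cstddef>
    #include <vector>

    struct Record { /* attribute values plus a class label */ };

    struct TreeNode {
        int activeIndex = -1;  // index into this pass's AVC-groups, or -1
    };

    struct AVCGroup { /* Attribute-Value-Classlabel counts for one node */ };

    // Hypothetical helpers standing in for the real implementation.
    TreeNode* assignToNode(const Record& r);         // route a record down the tree
    void updateAVC(AVCGroup& g, const Record& r);    // tally (attribute, value, class)
    void splitNode(TreeNode* n, const AVCGroup& g);  // choose the splitting attribute
    bool stopConditionSatisfied();
    std::vector<TreeNode*> selectNodesFittingInMemory();

    void rfRead(const std::vector<Record>& dataset) {
        while (!stopConditionSatisfied()) {
            // Select as many unexpanded nodes as main memory can hold.
            std::vector<TreeNode*> active = selectNodesFittingInMemory();
            std::vector<AVCGroup> avc(active.size());

            // One read pass over the data builds all active AVC-groups;
            // structurally, this is the canonical reduction loop again.
            for (const Record& r : dataset) {
                TreeNode* n = assignToNode(r);
                if (n != nullptr && n->activeIndex >= 0)
                    updateAVC(avc[n->activeIndex], r);
            }

            // Choose splitting attributes and split the active nodes.
            for (std::size_t i = 0; i < active.size(); ++i)
                splitNode(active[i], avc[i]);
        }
    }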
Parallelization Strategies
• Pure approach: apply only one of full replication, optimized full locking, and cache-sensitive locking
• Vertical approach: use replication at the top levels of the tree, locking at the lower levels
• Horizontal approach: use replication for attributes with a small number of distinct values, locking otherwise
• Mixed approach: combine the above two (one reading is sketched below)
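
One way to read the last three strategies is as a policy that picks a technique per (tree depth, attribute) pair; the thresholds below are invented for illustration only.

    // Illustrative policy only: the cutoffs are not from the paper.
    enum class Technique { FullReplication, Locking };

    // Vertical: few nodes near the root, so replication is cheap and
    // contention-free there; locking bounds memory use lower down.
    Technique vertical(int treeDepth) {
        return treeDepth <= 3 ? Technique::FullReplication : Technique::Locking;
    }

    // Horizontal: small AVC tables are cheap to replicate; large ones
    // (attributes with many distinct values) are locked instead.
    Technique horizontal(int distinctValues) {
        return distinctValues <= 32 ? Technique::FullReplication
                                    : Technique::Locking;
    }

    // Mixed: replicate only when both criteria say replication is cheap.
    Technique mixed(int treeDepth, int distinctValues) {
        bool cheap = vertical(treeDepth) == Technique::FullReplication &&
                     horizontal(distinctValues) == Technique::FullReplication;
        return cheap ? Technique::FullReplication : Technique::Locking;
    }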
Results
[Chart: time (s) of the pure versions (fr, ofl, csl) on 1, 2, 4, and 8 nodes. 1.3 GB dataset with 32 million records in the training set, function 7, decision tree depth = 16.]
Results
[Chart: time (s) of the horizontal, vertical, and mixed approaches on 1, 2, 4, and 8 nodes, combining full replication and full locking.]
Results
[Chart: time (s) of the horizontal, vertical, and mixed approaches on 1, 2, 4, and 8 nodes, combining full replication and cache-sensitive locking.]
Combining Distributed Memory and Shared Memory Parallelization for Decision Tree
• The key problem: the large size of the AVC-groups means a very high communication volume
• This results in very limited speedups
• Can we modify the algorithm to reduce the communication volume?
SPIES On (a) FREERIDE
• Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
• Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume (rough sketch below)
• Does not require sorting of the data, or partitioning and writing back of records
• Paper in the SDM regular program
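
A rough sketch of the interval idea (the statistical pruning itself is in the SDM paper and not reproduced here): class counts are kept per fixed-width interval of a numerical attribute rather than per distinct value, which shrinks both the AVC memory footprint and the volume communicated between nodes. The constants are illustrative.

    #include <array>

    constexpr int NUM_INTERVALS = 64;  // illustrative interval count
    constexpr int NUM_CLASSES   = 2;   // illustrative class count

    // One class-count histogram over equal-width intervals of [lo, hi).
    using IntervalHist =
        std::array<std::array<long, NUM_CLASSES>, NUM_INTERVALS>;

    void addSample(IntervalHist& h, double value, int classLabel,
                   double lo, double hi) {
        int idx = (int)((value - lo) / (hi - lo) * NUM_INTERVALS);
        if (idx < 0) idx = 0;
        if (idx >= NUM_INTERVALS) idx = NUM_INTERVALS - 1;
        h[idx][classLabel] += 1;
    }
    // A refined second pass over only the intervals that could still
    // contain the best split point restores exactness.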
[Chart: SPIES execution time (s) with 1, 2, and 3 threads per node on 1 and 8 nodes.]
Applying FREERIDE for Scientific Data Mining
• Joint work with Machiraju and Parthasarathy
• Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
• A feature is a region of interest in a dataset
• A suite of algorithms for extracting and tracking features
Summary
• Demonstrated a common framework for parallelizing a wide range of mining algorithms:
  - Association mining: Apriori and FP-tree
  - Clustering: k-means and EM
  - Decision tree construction
  - Nearest neighbor search
• Both shared memory and distributed memory parallelism
• A number of advantages:
  - Eases parallelization
  - Supports higher-level interfaces