Compiler Techniques for Data Parallel Applications With Very Large

Download Report

Transcript Compiler Techniques for Data Parallel Applications With Very Large

System Support for High Performance
Scientific Data Mining
Gagan Agrawal
Ruoming Jin
Raghu Machiraju
S. Parthasarathy
Department of Computer and Information Sciences
Ohio State University
Scientific Data Mining Problem



Datasets used for scientific
data mining are large –
particularly from simulations
Our understanding of what
algorithms and parameters
will give desired insights is
limited
Time required for
implementing different
algorithms and running them
with different parameters on
large datasets slows down
the scientific data mining
process
Project Overview



FREERIDE (Framework for
Rapid Implementation of
datamining engines) as the
base system
Already demonstrated for a
variety of standard mining
algorithms
Working for feature analysis
and mining of simulation
data currently
FREERIDE offers:





The ability to rapidly prototype a highperformance mining implementation
Distributed memory parallelization
Shared memory parallelization
Ability to process large and disk-resident
datasets
Only modest modifications to a sequential
implementation for the above three
Key Observation from Mining Algorithms
While( ) {


Popular algorithms have
a common canonical
loop
Can be used as the
basis for supporting a
common middleware
forall( data instances d) {
I = process(d)
R(I) = R(I) op d
}
…….
}
Performance of Shared Memory
Parallelization
K-means
clustering
1600
1400
1200
full repl
1000
800
opt full locks
600
400
cache sens.
Locks
200
0
1
thread
4
threads
16
threads
Performance on Cluster of SMPs
Apriori
Association
Mining
70000
60000
50000
40000
1 thread
2 threads
3 threads
30000
20000
10000
0
1 node
2
nodes
4
nodes
8
nodes
SPIES On (a) FREERIDE



Developed a new
communication efficient decision
tree construction algorithm –
Statistical Pruning of Intervals
for Enhanced Scalability (SPIES)
Combines RainForest with
statistical pruning of intervals of
numerical attributes to reduce
memory requirements and
communication volume
Does not require sorting of
data, or partitioning and
writing-back of records
7000
6000
5000
1
thread
2
threads
3
threads
4000
3000
2000
1000
0
1
node
8
nodes
Broader Research Agenda
Applying FREERIDE for Scientific Data
Mining



Focusing on feature
extraction, tracking, and
mining approach developed
by Machiraju et al.
A feature is a region of
interest in a dataset
A suite of algorithms for
extracting and tracking them
A Feature Analysis Algorithm
Data
Transform
Tour Grid
Aggregate
Denoise
Operator
Classify Points
Rank
Track
Catalog
ROIs
Classify-Aggregate
Ongoing Work – Parallelization Using
FREERIDE



Most of the steps involve
generalized reductions supported well in FREERIDE
Extensions to FREERIDE
required for aggregation and
tracking steps
Overall, FREERIDE can allow
rapid implementation of
scalable versions of a variety of
steps and algorithms that are
part of the feature mining
paradigm