High Performance Data Mining on Multi-core


Service Aggregated Linked Sequential Activities:
High Performance Data Mining on Multi-core Systems
Geoffrey Fox, Huapeng Yuan, Seung-Hee Bae
Xiaohong Qiu
Community Grids Laboratory, Indiana University Bloomington
Research Computing UITS
Technology Collaboration: George Chrysanthakopoulos, Henrik Frystyk Nielsen (Microsoft)
Application Collaboration: Cheminformatics: Rajarshi Guha, David Wild; Bioinformatics: Haixu Tang; Demographics (GIS): Neil Devadasan (IUPUI)
GOALS: The increasing number of cores is accompanied by a continued data deluge. Develop scalable parallel data mining algorithms with good multicore and cluster performance; understand the software runtime and parallelization method. Use managed code (C#) and package the algorithms as services to encourage broad use, assuming experts parallelize the core algorithms.
CURRENT RESULTS: Microsoft CCR supports MPI, dynamic threading, and, via DSS, a service model of computing; detailed performance measurements. Speedups of 7.5 or above on 8-core systems for "large problems" with deterministic annealing (avoids local minima) algorithms for clustering, Gaussian mixtures, and GTM (dimension reduction); extending to new algorithms/applications.
Speedup = (Number of cores) / (1 + f), where f = (Sum of overheads) / (Computation per core)
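Worked example (using the measured bound quoted below): with f = 0.05 on 8 cores, Speedup = 8 / (1 + 0.05) ≈ 7.62, which is why our applications see speedups above 7.6.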
Parallel Programming Strategy
[Figure: a "Main Thread" with memory M coordinates subsidiary threads t = 0, 1, ..., 7, each with its own memory m_t.]
Computation has grain size n (data points per thread) and K clusters. Overheads are:
• Synchronization: small with CCR
• Load Balance: good
• Memory Bandwidth Limit: tends to 0 as K → ∞
• Cache Use/Interference: important
• Runtime Fluctuations: dominant for large n and K
Use data decomposition as in classic distributed memory, but use shared memory for read variables; each thread uses a "local" array for written variables to get good cache performance (see the sketch below).
All our "real" problems have f ≤ 0.05, and speedups on 8-core systems are greater than 7.6.
[Figure: DA Clustering Performance, plotting fractional overhead f (0 to 0.4) against 10000/(grain size n) for K = 10, 20, and 30 clusters.]
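A minimal sketch of that shared-read / local-write decomposition using plain .NET threads (our illustration; the measured code uses CCR, and all names here are hypothetical):

```csharp
using System;
using System.Threading;

class DecompositionSketch
{
    static double[][] points;    // shared read-only data: N points, D dimensions
    static double[][] centers;   // shared read-only cluster centers: K x D

    static void Main()
    {
        int N = 100000, D = 2, K = 10, cores = 8;
        var rnd = new Random(1);
        points = RandomMatrix(N, D, rnd);
        centers = RandomMatrix(K, D, rnd);

        double[][] partial = new double[cores][];    // one "local" written array per thread
        var threads = new Thread[cores];
        for (int t = 0; t < cores; t++)
        {
            int id = t;                              // capture loop index by value
            threads[id] = new Thread(() =>
            {
                // Data decomposition: thread id owns a contiguous block of points.
                int lo = id * N / cores, hi = (id + 1) * N / cores;
                var local = new double[K];           // written only by this thread
                for (int i = lo; i < hi; i++)
                    for (int k = 0; k < K; k++)
                        local[k] += SquaredDistance(points[i], centers[k]);
                partial[id] = local;                 // publish the partial result once
            });
            threads[id].Start();
        }
        foreach (var th in threads) th.Join();

        // Cheap sequential reduction over the per-thread partial sums.
        var total = new double[K];
        for (int t = 0; t < cores; t++)
            for (int k = 0; k < K; k++)
                total[k] += partial[t][k];
        Console.WriteLine("total[0] = " + total[0]);
    }

    static double SquaredDistance(double[] a, double[] b)
    {
        double s = 0.0;
        for (int d = 0; d < a.Length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
        return s;
    }

    static double[][] RandomMatrix(int rows, int cols, Random rnd)
    {
        var m = new double[rows][];
        for (int r = 0; r < rows; r++)
        {
            m[r] = new double[cols];
            for (int c = 0; c < cols; c++) m[r][c] = rnd.NextDouble();
        }
        return m;
    }
}
```

Writing into a per-thread local array and merging once at the end avoids threads repeatedly writing the same cache lines, which is the cache-interference overhead listed above.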
MPI Exchange Latency in µs (20-30 µs computation between messaging)

Machine                                     | OS     | Runtime      | Grains  | Parallelism | MPI Latency (µs)
Intel8c:gf12 (8 core 2.33 GHz) (in 2 chips) | Redhat | MPJE (Java)  | Process | 8           | 181
                                            |        | MPICH2 (C)   | Process | 8           | 40.0
                                            |        | MPICH2: Fast | Process | 8           | 39.3
                                            |        | Nemesis      | Process | 8           | 4.21
Intel8c:gf20 (8 core 2.33 GHz)              | Fedora | MPJE         | Process | 8           | 157
                                            |        | mpiJava      | Process | 8           | 111
                                            |        | MPICH2       | Process | 8           | 64.2
Intel8b (8 core 2.66 GHz)                   | Vista  | MPJE         | Process | 8           | 170
                                            | Fedora | MPJE         | Process | 8           | 142
                                            | Fedora | mpiJava      | Process | 8           | 100
                                            | Vista  | CCR (C#)     | Thread  | 8           | 20.2
AMD4 (4 core 2.19 GHz)                      | XP     | MPJE         | Process | 4           | 185
                                            | Redhat | MPJE         | Process | 4           | 152
                                            |        | mpiJava      | Process | 4           | 99.4
                                            |        | MPICH2       | Process | 4           | 39.3
                                            | XP     | CCR          | Thread  | 4           | 16.3
Intel (4 core)                              | XP     | CCR          | Thread  | 4           | 25.8

(Blank Machine/OS cells repeat the entry above.)

Runtime fluctuations contribute 2% to 5% overhead.
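For context on the CCR rows above, here is a toy sketch of a thread-based ping-pong exchange of the kind being timed, assuming the Microsoft.Ccr.Core library (Port, Arbiter, Dispatcher) is referenced; this is our illustration, not the benchmark code itself:

```csharp
using System;
using System.Threading;
using Microsoft.Ccr.Core;

class CcrPingPong
{
    static void Main()
    {
        using (var dispatcher = new Dispatcher(2, "pool"))        // 2 worker threads
        using (var queue = new DispatcherQueue("q", dispatcher))
        {
            var ping = new Port<int>();
            var pong = new Port<int>();
            var done = new ManualResetEvent(false);
            int rounds = 10000;

            // Echo handler: bounce every message back (persistent receive).
            Arbiter.Activate(queue, Arbiter.Receive(true, ping, (int i) => pong.Post(i)));

            // Driver: post the next message until 'rounds' exchanges complete.
            Arbiter.Activate(queue, Arbiter.Receive(true, pong, (int i) =>
            {
                if (i + 1 < rounds) ping.Post(i + 1); else done.Set();
            }));

            var watch = System.Diagnostics.Stopwatch.StartNew();
            ping.Post(0);
            done.WaitOne();
            watch.Stop();
            Console.WriteLine("µs per exchange: " + 1000.0 * watch.ElapsedMilliseconds / rounds);
        }
    }
}
```

The real measurements exchange messages among all eight cores with 20-30 µs of computation between messaging, as the table caption notes.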
Deterministic Annealing Clustering of Indiana Census Data
Decrease temperature (distance scale) to discover more clusters.
[Figure: two views of the census clustering at resolution T 0.5; legend: p: Total, a: Asian, r: Renters, h: Hispanic.]
[Figure: linear PCA vs. nonlinear GTM on 6 Gaussians in 3D; GTM projection of 2 clusters of 335 compounds in 155 dimensions.]
Stop Press: GTM projection of PubChem (10,926,940 compounds in a 166-dimensional binary property space) takes 4 days on 8 cores. A 64x64 mesh of GTM clusters interpolates PubChem; we could usefully use 1024 cores! David Wild will use this for a GIS-style 2D browsing interface to chemistry.
Bioinformatics: annealed clustering and Euclidean embedding for repetitive sequences and gene/protein families; use GTM to replace PCA in structure analysis.
General Formula (covers DAC, GM, GTM, DAGTM, DAGM)
N data points E(x) in D-dimensional space; minimize F by EM:

F = -T \sum_{x=1}^{N} a(x) \ln\Big\{ \sum_{k=1}^{K} g(k) \exp\big[-0.5\,(E(x)-Y(k))^2/(T\,s(k))\big] \Big\}

One formula thus covers parallel generalized cluster analysis, dimensional scaling, and metric embedding.
Deterministic Annealing Clustering (DAC)
• a(x) = 1/N, or generally p(x) with Σ_x p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ to a final value of 1
• Vary the cluster center Y(k), but P_k and σ(k) (even a matrix σ(k)) can be calculated using IDENTICAL formulae to those for Gaussian mixtures
• K starts at 1 and is incremented by the algorithm

Generative Topographic Mapping (GTM)
• a(x) = 1 and g(k) = (1/K)(β/2π)^(D/2)
• s(k) = 1/β and T = 1
• Y(k) = Σ_{m=1..M} W_m φ_m(X(k))
• Choose fixed φ_m(X) = exp(-0.5 (X - μ_m)²/σ²)
• Vary W_m and β, but fix the values of M and K a priori
• Y(k), E(x), and W_m are vectors in the original high-dimensional space (dimension D); X(k) and μ_m are vectors in the 2-dimensional mapped space

DAGTM (Deterministic Annealing GTM): GTM has several natural annealing versions, based on either DAC or DAGM; under investigation.

Deterministic Annealing Gaussian mixture models (DAGM)
• a(x) = 1
• g(k) = {P_k/(2πσ(k)²)^(D/2)}^(1/T)
• s(k) = σ(k)² (taking the case of a spherical Gaussian)
• T is the annealing temperature, varied down from ∞ to a final value of 1
• Vary Y(k), P_k, and σ(k)
• K starts at 1 and is incremented by the algorithm

Traditional Gaussian mixture models (GM): as DAGM, but set T = 1 and fix K.

We also have Principal Component Analysis.

Near-Term Future Work: parallel algorithms for Random Projection Metric Embedding (Bourgain), MDS Dimensional Scaling (EM-like SMACOF), and the Marquardt Algorithm for Newton's Method. Later: HMM and SVM, and other embeddings.
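To make the EM iteration concrete, here is a minimal sketch of one update for DAC with the parameter choices listed above (a(x) = 1/N, g(k) = 1, s(k) = 0.5). This is our illustration, not SALSA code; all names are ours:

```csharp
using System;

class DacSketch
{
    // One EM update for deterministic annealing clustering (DAC).
    // points[x] is E(x) (N x D), centers[k] is Y(k) (K x D), T is the annealing temperature.
    static void EmStep(double[][] points, double[][] centers, double T)
    {
        int N = points.Length, K = centers.Length, D = points[0].Length;
        double[][] resp = new double[N][];                   // responsibilities p(k|x)

        // E step: p(k|x) proportional to g(k) exp[-0.5 (E(x)-Y(k))^2 / (T s(k))],
        // with g(k) = 1 and s(k) = 0.5.
        for (int x = 0; x < N; x++)
        {
            resp[x] = new double[K];
            double sum = 0.0;
            for (int k = 0; k < K; k++)
            {
                double d2 = 0.0;
                for (int d = 0; d < D; d++)
                {
                    double diff = points[x][d] - centers[k][d];
                    d2 += diff * diff;
                }
                resp[x][k] = Math.Exp(-0.5 * d2 / (T * 0.5));
                sum += resp[x][k];
            }
            for (int k = 0; k < K; k++) resp[x][k] /= sum;   // normalize over k
        }

        // M step: Y(k) = sum_x p(k|x) E(x) / sum_x p(k|x)  (a(x) = 1/N gives uniform weights).
        for (int k = 0; k < K; k++)
        {
            double weight = 0.0;
            double[] numer = new double[D];
            for (int x = 0; x < N; x++)
            {
                weight += resp[x][k];
                for (int d = 0; d < D; d++) numer[d] += resp[x][k] * points[x][d];
            }
            for (int d = 0; d < D; d++) centers[k][d] = numer[d] / weight;
        }
    }
}
```

An annealing driver would start at large T, iterate EmStep to convergence, lower T toward the final value of 1, and increment K as clusters split; the inner loops over data points are what the data decomposition parallelizes.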
We need:
• A large Windows cluster
• Linking of CCR and MPI (or cross-cluster CCR)
• Linear algebra for C#: multiplication, SVD, equation solving
• High-performance C# math libraries