PowerPoint - Oklahoma Supercomputing Symposium 2006

Download Report

Transcript PowerPoint - Oklahoma Supercomputing Symposium 2006

Improving Parallelism in
Structural Data Mining
Min Cai,
Istvan Jonyer, Marcin Paprzycki
Computer Science Department,
Oklahoma State University,
Stillwater, Oklahoma 74078,
U.S.A.
Who am I?

Min Cai: [email protected]
 Ph.D. Student of Computer Science
Department at Oklahoma State
University
 Research Interests:
 Parallel
and distributed computing
 Data Mining
Introduction
Data warehouses of increasing size
 Data mining  technique for discovering
interesting properties in data


structural data mining
 data represented as a graph
 aim  substructure discovery  finding
“interesting” and recurring subgraphs in a
labeled graph
SUBDUE (1)

Discovers substructures utilizing minimum
description length (MDL) principle

Cook, D.J., Holder, L.B., G alal, G., Maglothin, R.: Approaches
to Parallel Graph-Based Knowledge Discovery. Journal of
Parallel and Distributed Computing, 61(3) (2001) 427-446
Data objects  graph vertices
 Relationships  graph edges
 Substructure  connected subgraph
 NOTE  graph algorithms are notorious
for long execution times

SUBDUE (2)

Algorithm  two basic steps

substructure discovery



substructure replacement


apply minimal description length (MDL) principle to
find the “best” / “most important” structure in the
graph
possibly stop here  this is the answer
replace the substructure found in the first step by a
single vertex and repeat the process
results


single substructure
hierarchy of substructures
Parallel SUBDUE
Data-parallel approach
 Graph divided into subgraphs and send to
separate processor
 Processors find their best structure and
communicate with the rest
 The best overall substructure is found
 Hierarchical process can be repeated

MPI-SUBDUE
Graph divided into subgraphs using METIS
 point-to-point communications (MPI_Send
and MPI_Recv) used to communicate
between processors



NOTE  best structure in data set “7” may be
dreadful when confronted with data set “18”
Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a
Knowledge Discovery System by Exploiting Parallelism In the
Proceedings of the Third International Conference on
Knowledge Discovery and Data Mining (1997) 171-174
NEW-MPI-SUBDUE

Improvements




use PARMETIS to divide the initial graph
use global communication (MPI_Allgatherv)
use binary summation
YES, these changes do not look like much 
NEW-MPI-SUBDUE
Spawn P(0), P(1), P(2), ... , P(n)
Apply PARMETIS to partition G into n partitions
for all P(i) where 1 ≤ i ≤ n do
discover the best substructure in partition
broadcast best substructure to all other
processors
evaluate best substructure and broadcast
results
parallel-binary summation of results to find the
best overall partition
P(0) finds the best overall structure
EXPERIMENTAL SETUP





Mutagenesis data from OxUni
 datasets collected in order to predict mutagenicity
of aromatic and heteroaromatic nitro compounds
Graph 1
 2844 vertices and 2883 edges
Graph 2
 2896 vertices and 2934 edges
Graph 3
 22268 vertices and 22823 edges
16 node cluster (32 processors)


two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR
SDRAM, full-backplane Gigabit Ethernet switch
RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2
EXPERIMENTAL RESULTS I
EXPERIMENTAL RESULTS II
EXPERIMENTAL REULTS III
COMMENTS
 Graphs
1 and 2 that were large in 2000
are small and “useless” today
 Graph 3 gives realistic performance
picture  gains about 33%
 Speedup over original SUBDUE  268
on 32 processors for Graph 3

this IS “cheating” as some information may
be lost due to graph partitioning but…
 Graph
partitioning and balancing matter
THE END
THANK YOU!