Scalable Clustering on the Data Grid

Download Report

Transcript Scalable Clustering on the Data Grid

Scalable Clustering
on the Data Grid
Patrick Wendel ([email protected])
Moustafa Ghanem
Yike Guo
Discovery Net
Department of Computing
Imperial College, London
Outline
Discovery Net
 Data Clustering
 Mining Distributed Data
 Description of the strategy
 Deployment
 Evaluation
 Conclusions – Future Works

All Hands Meeting, Nottingham
20/09/2005
Discovery Net

Multidisciplinary project funded by the EPSRC under the UK e-Science
programme (started Oct 2002, ended March 05)

Developed an infrastructure for Knowledge Discovery Services for
integrating and analysing data collected from high throughput devices
and sensors

Applications to:

Life Sciences
• High throughput genomics and proteomics

Real-time Environmental Monitoring
• High throughput dispersed air sensing technology

Geo-Hazard modelling
• Earthquake modelling through satellite imagery

The project covered many areas including infrastructure, applications
and algorithms (text mining)

Produced the Discovery Net platform which aims to integrate, compose,
coordinate and deploy knowledge discovery services using a workflow
technology.
All Hands Meeting, Nottingham
20/09/2005
Discovery Net
Scientific
Information
Scientific
Discovery
Literature
Databases
Images
Operational
Data
e-Science
large scale science that will increasingly
be carried out through distributed global
collaborations enabled by the Internet.
Using Distributed Computing Resources
Instrument
Data
All Hands Meeting, Nottingham
20/09/2005
Data Clustering

We concentrate on a particular class of data mining
algorithms: Clustering

A class of explorative data mining techniques, used to
find out groups of points that are similar/close to each
other.

Popular analysis technique. Useful for exploring,
understanding, modelling large data sets

Two main types of clustering:



Hierarchical: Reorganises the data set into a hierarchy
of clusters based on their similarity.
Partition/Model based: Tries to partition the data set into
a number of clusters or try to fit a statistical model (e.g.
mixture of Gaussians) to a data set
Successfully applied to sociological data, image
processing and genomic data.
All Hands Meeting, Nottingham
20/09/2005
Mining Data on the Grid


Changing environment for data analysis:
 From analysing data files held locally (or
close to the algorithm), to using remote data
source, using remote services through
portals, now towards distributed data
executions.
Distributed data sources:


Data mining processes can now require data
spread across multiple organisations
Service-oriented approach:

High-level functionalities are now available through
well-defined services, instead of providing low-level
(terminal etc..) access to resources
All Hands Meeting, Nottingham
20/09/2005
Goal

Design a service-oriented distributed data
clustering strategy:

that can be deployed on a Grid environment
(i.e. a standard-based, service oriented,
secure distributed environment)

that would allow the end-user/data analysts
to deploy easily against its own data sets
All Hands Meeting, Nottingham
20/09/2005
Requirements 1/2




Performance issues:
 The analysis process using data grids directly and
analysis services must be more efficient than
gathering all the data on my desktop!
Accuracy:
 The strategy should at least provide a model more
representative of the overall data set
Security
 The deployed strategy should ensure consistent
handling of authentication and authorization aspects
throughout
Privacy:
 Restricted access to the data source
All Hands Meeting, Nottingham
20/09/2005
Requirements 2/2

Heterogeneity of the resources used and/or connectivity


Loose-coupling between resources participating in the
distributed analysis


It’s very unlikely the set of resources involved in the
distributed analysis process will be similar or work over
networks of similar bandwidth
The analyst has less control on what is available/provided
by each data grid or each analysis service. Therefore the
framework should, as much as possible, be unaffected by
minor differences between functionalities provided by each
site.
Service-oriented approach: The deployment of the analysis
process should be based on the co-ordination of high-level
services (instead of a dedicated distributed algorithm, e.g.
MPI implementation)
All Hands Meeting, Nottingham
20/09/2005
Current strategy

We restrict the current framework to the case where
instances are distributed but have the same attributes
on each different fragments (~ horizontal fragments)

Based on the EM-Clustering algorithm (mixture of
Gaussian model fitting algorithm).
 Hierarchical clustering inherently complex to distribute
 Statistical approach of EM provides a sound basis to
define a model combination strategy
All Hands Meeting, Nottingham
20/09/2005
Approach




Generate clustering models at each data
source location (compute near the data)
Transfer partial models in standard format
(PMML) to a combiner site
Normalise the relative weights of each
model
Perform an EM-based method on partial
models to generate a global model.
All Hands Meeting, Nottingham
20/09/2005
Combining Cluster Models




Derived from the EM-Clustering algorithm
itself
Adapted to take as input the models
generated at each site
Each partial model is treated like a (very)
compressed representation of the fragment
(similar to the two step approaches of some
scalable clustering algorithms).
More detailed algorithm and formulae in
proceedings
All Hands Meeting, Nottingham
20/09/2005
Deployment: Discovery Net

The Discovery Net platform is used to build and deploy this framework.

Implementation based on an open architecture re-using common protocols
and common infrastructure elements (such as the Globus Toolkits).

It also defines its own protocol for workflows, Discovery Process Markup
Language (DPML) which allows the definition of data analysis workflows to
be executed on distributed resources.

The platform comprises a server that stores, schedules the workflows and
manage the data, and a thick client to help the workflow construction
process.

Thus giving the end user the ability to define application-specific workflows
performing such tasks as distributed data mining.

The model combiner is implemented as a workflow activity in Discovery Net
All Hands Meeting, Nottingham
20/09/2005
Deployment
Data sources
Discovery Net servers
Source A
Partial models
PMML
Global model
Partial clustering
Source B
PMML
Partial clustering
Combiner site
PMML
Source C
Partial clustering
All Hands Meeting, Nottingham
20/09/2005
Deployment: Workflow


The Discovery Net client enables the composition
and the execution of the distributed process as a
workflow constructed visually.
The execution engine will coordinate the distributed
execution
All Hands Meeting, Nottingham
20/09/2005
Accuracy Evaluation: Data
Distribution



Comparison of the accuracy of the combined model with the
average accuracy of partial models against the entire data sets
(i.e. have we gained some accuracy by considering the
fragments together)
Accuracy will strongly depend on how the data is distributed
among different sites. In the evaluation we introduce a
randomness ratio to determine how similar the data distribution
is among fragments.
 0 meaning that each site would have data drawn from
different distributions
 1 meaning that the data from all fragments are drawn from
the same distribution
Measured by log-likelihood function of the test data set:
 The likelihood function of a data set represents how much
that data is likely to be following the distribution function
defined by the model
All Hands Meeting, Nottingham
20/09/2005
Accuracy Evaluation: Data
distribution
Randomness ratio effect
0
LLikelihood
-20000
0
0.2
0.4
0.6
0.8
1
1.2
-40000
Average log-Likelihood
-60000
Combined log-Likelihood
-80000
-100000
-120000
Ratio

As expected, the ratio has a huge effect on gained accuracy.
For low levels, each fragment becomes less and less
representative of the complete data set, therefore the combined
model will outperform partial ones.
All Hands Meeting, Nottingham
20/09/2005
Accuracy Evaluation: Number of
fragments
0
2
3
4
5
6
7
8
9
10
-10000
-20000
-30000
Avg Likelihood
-40000
Combined Likelihood
-50000
-60000
-70000
-80000

(r= 0.2, 10,000 points, 5 clusters) The accuracy does degrade
with increasing number of fragments, but so does the average
accuracy of models generated from individual fragments.
All Hands Meeting, Nottingham
20/09/2005
Accuracy Evaluation: Increasing data size
Increasing data size
0
0
LLikelihood
-1000000
200000 400000 600000 800000 100000 120000
0
0
-2000000
Average log-Likelihood
-3000000
Combined log-Likelihood
-4000000
-5000000
# instances

(r=0.2,d=5,5 fragments). Consistent behaviour of the combined model’s
accuracy over partial ones.
All Hands Meeting, Nottingham
20/09/2005
Performance Evaluation
Performance evaluation is only partially relevant, as the process does not
feed back combined models and partial models are generated near the
data.
The heterogeneity of real deployments is difficult to take into account.



Time in seconds, for an increasing number of fragments
All Hands Meeting, Nottingham
20/09/2005
Performance Evaluation

Execution time with lower dimensionality and larger data sets
All Hands Meeting, Nottingham
20/09/2005
Conclusions




Encouraging results in terms of accuracy vs.
performance, given the constraints.
But is the trade-off between accuracy and flexibility
(generally the case in distributed data mining)
acceptable?
This should be part of a wider explorative process,
probably as a first step into the understanding of the
data set.
Being part of the Discovery Net platform, the
distributed analysis process can be simply designed
from the Discovery Net client software.
All Hands Meeting, Nottingham
20/09/2005
Future Works



First step towards more generic distributed data mining
strategies (classification algorithms, association rules)
Need evaluation against real data sets !
Possible improvements including:
 Refinement through feedback
 Use of a more complex intermediate summary
structure for the partial models (e.g. tree structures
containing summary information)
 Estimation of the number of clusters (using Bayesian
Information Criteria)
 Plenty of possible clustering algorithms to try to use.
All Hands Meeting, Nottingham
20/09/2005
Questions?
All Hands Meeting, Nottingham
20/09/2005