Cross-Validation Tools Overview Presentation
Data Mining and Cross-Validation
over distributed / grid enabled
networks: current state of the art
Presented by: Juan Bernal
COT4930 - Introduction to Data Mining
Instructor: Dr. Khoshgoftaar,
Florida Atlantic University
Spring, 2008
Topics
Introduction
Cross-Validation definition and importance
Why is Cross-Validation a computational-intensive task?
Distributing Data-Mining processes over a computer network.
WEKA and distributed Data Mining: How it is done
Other Projects implementing grids/distributed networks
Weka Parallel
Grid Weka
Weka4WS
Inhambu
Conclusion
Introduction
Data Mining today is performed on vast amounts of ever-growing
data. The need to analyze and extract information from
databases in different domains demands more computational
resources, with results expected in the minimum amount of
time possible.
There are many different projects that address Data Mining
processes over distributed or grid-enabled networks. All of
them attempt to use all the available computer resources in a
grid or networked environment to reduce the time it takes to
obtain results and even to increase the accuracy of the
results obtained.
One of the most computationally intensive Data Mining tasks is
Cross-Validation, which is the focus of many grid/distributed
network Data Mining tools.
Cross-Validation
Cross-Validation (CV) is the standard Data Mining method for
evaluating the performance of classification algorithms, mainly
by estimating the error rate of a learning technique.
In CV a dataset is partitioned into n folds. Each fold in turn is
used for testing while the remainder is used for training. The
procedure of training and testing is repeated n times so that
each fold is used exactly once for testing.
The standard way of predicting the error rate of a learning
technique given a single, fixed sample of data is to use a
stratified 10-fold cross-validation.
Stratification means making sure that, when sampling is done,
each class is properly represented in both training and test
datasets. This is achieved by sampling randomly within each
class when creating the n folds, so that every fold has roughly
the same class proportions as the full dataset.
10-Fold Cross-Validation
In a stratified 10-fold Cross-Validation the data is divided
randomly into 10 parts in which the
class is represented in
approximately the same proportions
as in the full dataset. Each part is
held out in turn and the learning
scheme trained on the remaining
nine-tenths; then its error rate is
calculated on the holdout set. The
learning procedure is executed a
total of 10 times on different training
sets, and finally the 10 error rates
are averaged to yield an overall
error estimate.
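The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not WEKA's implementation: the function names, the toy majority-class learner, and the card-dealing way of stratifying are all assumptions made for the example, and the dataset is assumed to have at least as many instances as folds.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of instance indices with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class across the folds like cards.
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def cross_validate(instances, labels, train, n_folds=10, seed=0):
    """Average the per-fold error rate of the classifier returned by train()."""
    error_rates = []
    for test_indices in stratified_folds(labels, n_folds, seed):
        held_out = set(test_indices)
        train_x = [x for i, x in enumerate(instances) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_x, train_y)
        wrong = sum(predict(instances[i]) != labels[i] for i in test_indices)
        error_rates.append(wrong / len(test_indices))
    return sum(error_rates) / n_folds


def train_majority(train_x, train_y):
    """Toy learner (hypothetical): always predicts the most frequent class."""
    majority = max(set(train_y), key=train_y.count)
    return lambda instance: majority
```

Any learner that takes training instances and labels and returns a prediction function could be passed in place of train_majority.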
3-fold cross-validation
graphical example
Why is Cross-Validation a
computational-intensive task?
When seeking an accurate error estimate, it
is standard procedure to repeat the 10-fold CV
process 10 times. This means invoking the
learning algorithm 10 x 10 = 100 times, which is
a computationally intensive and time-consuming
task.
Given the nature of Cross-Validation, many
researchers have worked on executing this
process more efficiently over grid or
networked computer environments.
Distributing Data-Mining processes
over a computer network
Different projects, including WEKA, have
implemented a way to distribute Data Mining
processes, and in particular Cross-Validation, over
networked computers. In almost all projects a
client-server approach is used, and methods like Java RMI
(Remote Method Invocation) and WSRF (Web
Services Resource Framework) are used to
allow network communication between clients and
servers.
Also, WEKA is the main tool on which different
projects are based to achieve Data Mining over
computer networks, due to its easily accessible Java
source code and adaptability.
WEKA distribution of Data Mining
Processes over several computers
The WEKA tool contains a feature to split an experiment
and distribute it across several processors.
Distributing an experiment involves splitting it into
subexperiments that RMI sends to the hosts for
execution. The experiment can be partitioned by
dataset, where each subexperiment is self-contained
and applies all schemes to a single dataset. On the other
hand, with few datasets the experiment can be partitioned
by run. For example, a 10 times 10-fold CV would be split
into 10 subexperiments, one per run.
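The two partitioning strategies can be illustrated with a short Python sketch. It only models the bookkeeping, and it is an assumption-laden illustration rather than the WEKA Experimenter code: split_experiment and its parameters are hypothetical names.

```python
from itertools import product


def split_experiment(datasets, schemes, runs=10, folds=10, by="dataset"):
    """Split an experiment into self-contained subexperiments.

    by="dataset": one subexperiment per dataset, covering all schemes and runs.
    by="run":     one subexperiment per run, covering all datasets and schemes.
    """
    if by == "dataset":
        groups = [([d], list(range(1, runs + 1))) for d in datasets]
    elif by == "run":
        groups = [(list(datasets), [r]) for r in range(1, runs + 1)]
    else:
        raise ValueError("by must be 'dataset' or 'run'")
    subexperiments = []
    for group_datasets, group_runs in groups:
        tasks = list(product(group_datasets, schemes, group_runs))
        subexperiments.append({
            "tasks": tasks,
            # Every (dataset, scheme, run) task trains one model per fold.
            "learner_invocations": len(tasks) * folds,
        })
    return subexperiments
```

With one dataset and one scheme, splitting by run gives 10 subexperiments of 10 learner invocations each, which accounts for the 100 invocations of a 10 times 10-fold CV.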
This feature is available from the Experimenter section of
the WEKA tool, which is the main section under which
research is done.
In the Experimenter, the ability to distribute
processes is found in the advanced version of the
Setup panel.
WEKA requirements for
distributing experiments
Each host:
Needs Java installed
Needs access to databases to be used
Needs to be running the
weka.experiment.RemoteEngine experiment server
Distributing an experiment works best if the results
are sent to a central database by selecting JDBC as
the result destination. Alternatively, each host
can save its results to a separate ARFF file, and the
files can be merged afterwards.
WEKA difficulties for
distributed implementation
File and directory permissions can be difficult to set
up.
Each host must be manually installed and configured with
the Weka experimenter server and the remote.policy file,
which grants the remote engine permissions for network
operations.
Each host must be started manually.
A centralized database server and access to it must be set up.
On the positive side, once all these configurations and
preparations are done, the experiment can be
executed and time can be saved by distributing the
workload among the hosts.
WEKA Experimenter Tutorial:
http://sourceforge.net/project/downloading.php?groupname=weka&filename=ExperimenterTutorial3.4.12.pdf&use_mirror=internap
Other Projects implementing grids / distributed
networks for Data Mining and Cross-Validation
Based on Weka there are some projects that try to
improve the process of performing data mining and
cross-validation over numerous computers:
Weka Parallel
Grid Weka
Inhambu
Weka4WS
Weka-Parallel:
Machine Learning in Parallel
Weka-Parallel was created with the intention of being able to run the
cross-validation portion of any given classifier very quickly. This
speed increase is accomplished by simultaneously calculating the
necessary information using many different machines.
To communicate between the computer running Weka (client)
and the other computers (servers), Weka-Parallel uses a simple
connection established by the Socket class in the java.net package.
Each server starts a daemon that listens on a port; the socket
connection then opens data and object streams to send and receive
information.
RMI is not used to invoke the server methods that calculate
specific folds of the CV. Instead, the client sends integer codes
to the servers telling them which methods to run.
Each server receives a copy of the dataset and information on which
fold it has to perform. The client computer maintains an index of
which fold each server performs and assigns folds using a
round-robin algorithm.
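A round-robin assignment of folds to servers can be sketched as follows. This is a static illustration in Python under assumed names (assign_folds_round_robin and the server labels are hypothetical); it does not reproduce Weka-Parallel's socket protocol or its integer codes.

```python
from itertools import cycle


def assign_folds_round_robin(n_folds, servers):
    """Map each server to the list of fold numbers it should evaluate."""
    if not servers:
        raise ValueError("at least one server is required")
    assignment = {server: [] for server in servers}
    # cycle() walks the server list repeatedly: s1, s2, s3, s1, s2, ...
    for fold, server in zip(range(n_folds), cycle(servers)):
        assignment[server].append(fold)
    return assignment
```

With 10 folds and 3 servers, the first server receives four folds and the other two receive three each.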
Weka-Parallel:
Speedup performance analysis
An experiment was done
running the J48 decision tree
classifier with default
parameters on the Waveform-5000
dataset from the UCI
repository. The Waveform-5000
dataset contains 5300 points in
21 dimensions, and the goal is
to find the classifier that
correctly distinguishes between
3 classes of waves. A 500-fold
cross-validation was run using
up to 14 computers with similar
hardware.
Weka-Parallel link: http://weka-parallel.sourceforge.net/
Grid-Weka
In the Grid-enabled Weka, execution of the following tasks can be distributed
across several computers in an ad-hoc Grid:
Building a classifier on a remote machine.
Testing a previously built classifier on several machines in parallel.
Labeling a dataset using a previously built classifier on several machines in
parallel.
Using several machines to perform parallel cross-validation.
Labeling involves applying a previously learned classifier to an unlabelled
data set to predict instance labels.
Testing takes a labeled data set, temporarily removes class labels, applies
the classifier, and then analyses the quality of the classification algorithm by
comparing the actual and the predicted labels.
Finally, for n-fold cross-validation a labeled data set is partitioned into n folds,
and n training and testing iterations are performed. On each iteration, one
fold is used as a test set, and the rest of the data is used as a training set. A
classifier is learned on the training set and then validated on the test data.
Grid-Weka is similar to the Weka-Parallel project, but allows more
functions to be performed in parallel on remote machines (and also includes
better load balancing, fault monitoring, and dataset management).
Grid-Weka
The labeling function is
distributed by partitioning the
data set, labeling several
partitions in parallel on different
available machines, and
merging the results into a single
labeled data set.
The testing function is
distributed in a similar way, with
test statistics being computed in
parallel on several machines for
different subsets of the test
data.
Distributing cross-validation is
also straightforward: individual
iterations for different folds are
executed on different machines.
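The partition, label in parallel, and merge pattern described for the labeling function can be sketched in Python. Threads stand in for remote machines here, which is an assumption made only to keep the example self-contained; partition and label_in_parallel are hypothetical names, not Grid-Weka classes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(instances, n_parts):
    """Split instances into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(instances), n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        end = start + size + (1 if part < extra else 0)
        chunks.append(instances[start:end])
        start = end
    return chunks


def label_in_parallel(classifier, instances, n_workers=4):
    """Label each partition on its own worker and merge the results in order."""
    chunks = partition(instances, n_workers)

    def label_chunk(chunk):
        return [classifier(instance) for instance in chunk]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        labeled_chunks = list(pool.map(label_chunk, chunks))
    return [label for chunk in labeled_chunks for label in chunk]
```

Because the chunks are merged in their original order, the result matches what a single machine would produce when labeling the whole dataset.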
Grid-Weka : Setup details
It uses a custom interface for communication between clients
and servers, using native Java object serialization for data
exchange.
It is mainly run from the Java command line.
It uses a .weka-parallel configuration file on the client computer
to set up the list of servers, in the following format:
PORT=<Port number>
<Machine IP address or DNS name>
<Number of Weka servers running on this machine>
<Max. amount of memory on this machine in Mbytes>
<Machine IP address or DNS name>
For each Weka server, a copy of the Weka software (the .jar
file) is placed on the selected machine and the Weka server
class is run as follows:
java weka.core.DistributedServer <Port number>
If a machine is going to run more than one Weka server, each
server should have its own directory so that the generated
results are not mixed together.
Performance analysis between
Weka-Parallel and Grid-Weka
Grid-Weka sacrifices
some performance in
exchange for more
features compared to
Weka-Parallel. These
features are load balancing,
data recovery/fault monitoring,
and more data mining
functions than just
cross-validation.
Grid-Weka Development: http://userweb.port.ac.uk/~khusainr/weka/Xin_thesis.pdf
Grid-Weka HowTo: http://userweb.port.ac.uk/~khusainr/weka/gweka-howto.html
Inhambu
Inhambu is a distributed object-oriented system that
supports the execution of data mining applications
on clusters of PCs and workstations.
Inhambu is a system that uses the idle resources in
a cluster composed of commodity PCs, supporting
the execution of DM applications based on the
Weka tool.
Its goal is to address scheduling and
load sharing, overloading and contention
avoidance, heterogeneity, and fault tolerance
when performing Data Mining processes on a grid or
cluster of computers.
Inhambu: architecture
The architecture of Inhambu implements:
An application layer, which consists of a modified implementation of
Weka with specific components deployed at the client and server
sides. The client component executes the user
interface and generates DM tasks, while the server contains the
core Weka classes which execute the DM tasks.
A resource management layer, which provides for the execution of
Weka in a distributed environment.
The trader provides publishing and discovery mechanisms for
clients and servers.
Inhambu: Improvements
Scheduling and load sharing: Implementation of static and
dynamic performance indices. Static performance indices are
usually implemented by static values that express or quantify
amounts of resources and capacities. After an index is created,
dynamic performance monitoring updates it.
Overloading and contention avoidance: Implementation of a “best
effort” policy: to avoid overloading a computer, it can only be
chosen to receive load entities if its load index is below a given
threshold. The default threshold is 0.7, chosen from the
relationship between the utilization index and the response time
of a computer system.
Heterogeneity: Based on the maintained Capacity State Index, the
distribution of work can be improved in heterogeneous
environments.
Fault tolerance: Checkpointing and recovery were implemented on
the client side.
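The "best effort" policy can be illustrated with a small Python sketch. Only the 0.7 default threshold comes from the slide; the function names, the dictionary of load indices, and the rule of preferring the least-loaded eligible host are assumptions for illustration, not Inhambu's actual scheduler.

```python
def eligible_hosts(load_index, threshold=0.7):
    """Return hosts whose load index is below the threshold, least loaded first."""
    candidates = [host for host, load in load_index.items() if load < threshold]
    return sorted(candidates, key=lambda host: load_index[host])


def choose_host(load_index, threshold=0.7):
    """Pick the least-loaded eligible host, or None if every host is overloaded."""
    hosts = eligible_hosts(load_index, threshold)
    return hosts[0] if hosts else None
```

A host whose load index equals the threshold is treated as overloaded here, since the policy requires the index to be below the threshold.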
Inhambu: performance against
Weka-Parallel.
Performance was evaluated by
running experiments on 2
real-world databases:
Adults Census Income, and
a dataset for diffuse
large B-cell lymphoma
(DLBCL).
The first performance test was
done to determine scalability,
as shown in the tables, when
using the J48 and PART
classifiers.
Inhambu and Weka-Parallel
perform roughly the same for
fine-granularity tasks, and
Inhambu performs better than
Weka-Parallel when running
tasks whose granularity is
coarser.
Inhambu: Performance on non-dedicated and heterogeneous clusters
Notice that Weka-Parallel
can lead to better
performance in the presence of
shorter tasks, such as J4.8,
due to its low communication
overhead (it uses sockets).
Despite the higher
overhead due to the use of
RMI, Inhambu has better
performance in the presence of
longer tasks.
Inhambu link: http://inhambu.incubadora.fapesp.br/portal
Weka4WS
The goal of Weka4WS is to extend Weka to support
remote execution of the data mining algorithms
through the Web Services Resource Framework
(WSRF) Web Services.
To enable remote invocation, all the data mining
algorithms provided by the Weka library are
exposed as a Web Service.
Weka4WS has been developed using the WSRF
Java library provided by Globus Toolkit 4 (GT4),
which is an implementation of OGSA (Open Grid
Services Architecture).
Weka4WS structure
In the Weka4WS framework all
nodes use the GT4 services for
standard Grid functionalities, such
as security and data
management. These nodes fall
into two categories:
1. user nodes, which are the local
machines of the users providing
the Weka4WS client software
2. computing nodes, which
provide the Weka4WS Web
Services allowing the execution of
remote data mining tasks.
A storage node can be used
when a centralized database is
used.
Weka4WS: Setup details
Weka4WS requires Globus Toolkit 4 on the computing nodes and
only the Java WS Core (a subset of Globus Toolkit) on the user
nodes. Since GT4 runs only on Unix platforms, the computing
nodes need to be Unix or Linux.
The Weka4WS client can be installed in either a Unix or a Windows
environment.
Because of the web-service-oriented approach there are
security requirements: Weka4WS runs in a security context
and uses grid-map authorization (only users listed in the service
grid-map can execute it). Authentication with certificates is required.
On the client computer a machines file is needed to list all the
computing nodes. This is the only setup/configuration Weka4WS
needs. The format of this file is:
# ==================== computing node ====================
# hostname              container port   gridFTP port
pluto.deis.unical.it    8443             2811
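A small Python sketch shows how a client might read such a machines file. It assumes that each computing node is one whitespace-separated row of hostname, container port, and gridFTP port, and that lines starting with # are comments; parse_machines_file is a hypothetical helper, not part of Weka4WS.

```python
def parse_machines_file(text):
    """Parse machines-file text into a list of computing-node dictionaries."""
    nodes = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        hostname, container_port, gridftp_port = fields
        nodes.append({
            "hostname": hostname,
            "container_port": int(container_port),
            "gridftp_port": int(gridftp_port),
        })
    return nodes
```

Rejecting rows with the wrong number of fields makes a malformed entry fail early on the client instead of surfacing later as a connection error.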
Weka4WS: performance
The performance of Weka4WS
was analyzed by executing a
typical data mining task in
different network scenarios.
In particular, the execution
times of the different steps
needed to perform the
overall data mining task
were evaluated to determine
overhead on LAN vs. WAN
networks.
No performance
comparisons were done
against other Grid-enabled
data mining tools.
Weka4WS paper:
http://grid.deis.unical.it/papers/pdf/PKDD2005.pdf
Conclusion
The area of Data Mining and Cross-Validation over
Grid-enabled environments is in constant development.
The latest efforts try to develop and implement standard
frameworks such as OGSA (Open Grid Services
Architecture) for data mining tools.
From the analysis of each of the presented tools, Weka4WS
is the most interesting overall. Still, other projects have
positive features that will eventually be combined into a
single Grid Data Mining Tool based on Weka.
Further research will focus on enhancing the performance of the
current tools that use RMI and WSRF by reducing the
overhead of communications. A further research
topic is the use of available peer-to-peer or Internet networks to
facilitate performing data mining tasks over an Internet cluster
available to everyone.