Appendix: The WEKA Data Mining Software

http://www.cs.waikato.ac.nz/ml/weka/
WEKA: Introduction




WEKA (Waikato Environment for Knowledge Analysis) was developed at
the University of Waikato, New Zealand.
History: first version (version 2.1) in 1996; version 2.3 in 1998;
version 3.0 in 1999; version 3.4 in 2003; version 3.6 in 2008.
WEKA provides a collection of data mining and machine learning
algorithms, together with preprocessing tools.
It includes algorithms for regression, classification, clustering,
association rule mining and attribute selection.
It also has data visualization facilities.
WEKA is an environment for comparing learning algorithms.
With WEKA, researchers can implement new data mining algorithms and
add them to WEKA.
WEKA is one of the best-known open-source data mining packages.
WEKA: Introduction (cont.)

WEKA is written in Java.
WEKA 3.4 consists of 271,477 lines of code.
WEKA 3.6 consists of 509,903 lines of code.
It runs on Windows, Linux and Macintosh.
Users can access its components through Java programming or
through a command-line interface.
It offers three main graphical user interfaces: Explorer,
Experimenter and Knowledge Flow.
The easiest way to use WEKA is through the Explorer, the main
graphical user interface.
Data can be loaded from various sources, including files, URLs and
databases. Database access is provided through Java Database
Connectivity (JDBC).
WEKA data format

WEKA stores data in flat files (ARFF format).
It is easy to convert an Excel file to ARFF format.
An ARFF file consists of a list of instances.
We can create an ARFF file using Notepad or Word.
The name of the dataset is declared with @relation.
Attribute information is declared with @attribute.
The data follows @data.
Besides ARFF, WEKA also accepts CSV, LibSVM and C4.5 formats.
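As a rough illustration of the Excel-to-ARFF step, the sketch below assumes the spreadsheet has already been exported as CSV with a header row. The function name csv_to_arff and the type-guessing rule (a column is "real" if every value parses as a number, nominal otherwise) are assumptions made for this example, not WEKA's own converter. Values containing spaces or commas would need quoting, which is not handled here.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text.

    A column is declared `real` when every value parses as a float;
    otherwise it becomes a nominal attribute listing its distinct values.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s real" % name)
        else:
            # Keep first-seen order so the output is deterministic.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ", ".join(distinct)))
    lines.append("@data")
    lines.extend(", ".join(row) for row in data)
    return "\n".join(lines)


sample = """outlook,temperature,play
sunny,85,no
overcast,83,yes
"""
print(csv_to_arff(sample, "weather"))
```

The resulting text can be saved with an .arff extension and opened like any other ARFF file.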
WEKA ARFF format
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 68, 80, FALSE, yes
……………………………
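To make the three ARFF sections concrete, here is a minimal reader sketch in Python. It is only an illustration under simplifying assumptions (no quoted values, no sparse instances, no string or date attributes); the function name parse_arff is hypothetical and this is not how WEKA itself loads files.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        lower = line.lower()  # ARFF keywords are case-insensitive
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


weather = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
"""
relation, attributes, rows = parse_arff(weather)
print(relation, len(attributes), rows[0])
```

Values are returned as strings; converting the "real" columns to numbers is left to the caller.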
Explorer GUI

The Explorer consists of six panels, one for each data mining task:
Preprocess
Classify
Cluster
Associate
Select Attributes
Visualize
Preprocess:
This panel gives access to WEKA’s data preprocessing tools (called
“filters”), which transform the dataset in several ways.
WEKA contains filters for:
Discretization, normalization, resampling, attribute selection,
transforming and combining attributes, …
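Two of the filter types named above can be sketched in a few lines. The example assumes min-max scaling for normalization and equal-width binning for discretization; these are common choices picked for illustration, and the function names are hypothetical rather than WEKA filter names.

```python
def normalize(values):
    """Rescale numbers linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width binning: map each number to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Humidity column of the five weather rows shown in the ARFF example.
humidity = [85, 90, 86, 96, 80]
print(normalize(humidity))      # [0.3125, 0.625, 0.375, 1.0, 0.0]
print(discretize(humidity, 2))  # [0, 1, 0, 1, 0]
```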
Explorer (cont.)

Classify:
Regression techniques (predictors of “continuous classes”):
- Linear regression
- Logistic regression
- Neural network
- Support vector machine
Classification algorithms:
- Decision trees: ID3, C4.5 (called J48 in WEKA)
- Naïve Bayes, Bayes network
- k-nearest neighbors
- Rule learners: Ripper, Prism
- Lazy rule learners
- Meta learners (bagging, boosting)
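As one small example from the list above, k-nearest-neighbors can be sketched as follows. This is a plain illustration using squared Euclidean distance and a majority vote, not WEKA's implementation; the function name knn_predict and the query point are made up for the example.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    `train` is a list of (features, label) pairs with numeric features.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbours = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# (temperature, humidity) -> play, from the five weather rows shown earlier.
train = [((85, 85), "no"), ((80, 90), "no"), ((83, 86), "yes"),
         ((70, 96), "yes"), ((68, 80), "yes")]
print(knn_predict(train, (69, 85), k=3))  # yes
```

In practice the attributes would usually be normalized first so that no single attribute dominates the distance.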

Clustering
Clustering algorithms:
- K-Means, X-Means, FarthestFirst
- Likelihood-based clustering: EM (Expectation-Maximization)
- Cobweb (incremental clustering algorithm)
Clusters can be visualized and compared to “true” clusters (if
given).
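The K-Means idea can be sketched as below. This is a bare-bones version of the standard assignment/update loop with a simplistic start (the first k points serve as initial centroids) and a fixed iteration count; those choices are assumptions for the example and do not reflect WEKA's implementation.

```python
def kmeans(points, k, iterations=10):
    """Basic k-means on tuples of numbers; returns (centroids, assignment)."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic, deterministic start
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0]
```

When "true" cluster labels are available, the returned assignment can be compared against them directly.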
Attribute Selection: This panel provides access to various methods
for measuring the utility of attributes and identifying the most
important attributes in a dataset.
- Filter method: the attribute set is filtered to produce the most
promising subset before learning begins.
- A wide range of filtering criteria, including correlation-based
feature selection, the chi-square statistic, gain ratio, information
gain and a support-vector-machine-based criterion.
- A variety of search methods: forward and backward selection,
best-first search, genetic search and random search.
- PCA (principal component analysis) to reduce the dimensionality
of a problem.
- Discretizing numeric attributes.
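To show what "measuring the utility of attributes" can look like, the sketch below ranks two nominal attributes by information gain (entropy reduction with respect to the class). It is a textbook-style illustration with hypothetical function names, not WEKA's attribute evaluators.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting on a nominal attribute."""
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


# The five weather rows shown earlier, nominal attributes only.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy"]
windy = ["FALSE", "TRUE", "FALSE", "FALSE", "FALSE"]
play = ["no", "no", "yes", "yes", "yes"]

scores = {"outlook": information_gain(outlook, play),
          "windy": information_gain(windy, play)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['outlook', 'windy']
```

On these five rows outlook separates the classes perfectly, so it ranks first; with the full dataset the scores would differ.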


Explorer (cont.)

Association rule mining

Apriori algorithm
Works only with discrete data
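The frequent-itemset part of Apriori can be sketched as follows. Each instance is treated as a set of discrete "attribute=value" items, which is why numeric attributes must be discretized first. This is a compact illustration with an absolute support count; the function name and the item encoding are assumptions, and rule generation from the itemsets is omitted.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: unions of frequent itemsets that have the next size.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))]
    return frequent


transactions = [
    {"outlook=sunny", "play=no"},
    {"outlook=sunny", "play=no"},
    {"outlook=overcast", "play=yes"},
    {"outlook=rainy", "play=yes"},
    {"outlook=rainy", "play=yes"},
]
frequent = apriori(transactions, min_support=2)
print(frequent[frozenset({"outlook=sunny", "play=no"})])  # 2
```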
Visualization




Scatter plots, ROC curves, trees, graphs
WEKA can visualize single attributes (1-d) and pairs of
attributes (2-d).
Color-coded class values.
“Zoom-in” function
[Figure: Explorer GUI (Classify)]
WEKA Experimenter





This interface is designed to facilitate experimental comparisons of
the performance of algorithms based on many different evaluation
criteria.
Experiments can involve many algorithms that are run on multiple
datasets.
Experiments can also iterate over different parameter settings.
Experiments can also be distributed across different computer nodes
in a network.
Once an experiment has been set up, it can be saved in either XML or
binary form, so that it can be revisited.
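The kind of comparison described above (several algorithms, several datasets, one evaluation criterion) can be sketched as a small grid of cross-validation runs. Everything here is hypothetical and chosen for brevity: the two toy learners, the toy dataset, the use of contiguous folds and plain accuracy. It is not how the Experimenter is implemented.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(learner, rows, k):
    """Accuracy of `learner` over k folds.

    `learner(train_rows)` must return a function mapping features to a label;
    each row is a (features, label) pair.
    """
    correct = 0
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        model = learner(train)
        correct += sum(1 for i in fold if model(rows[i][0]) == rows[i][1])
    return correct / len(rows)


def majority_class(train):
    """Baseline learner: always predict the most frequent training label."""
    labels = [label for _, label in train]
    winner = max(sorted(set(labels)), key=labels.count)
    return lambda features: winner


def first_label(train):
    """Deliberately weak learner: always predict the first training label."""
    label = train[0][1]
    return lambda features: label


rows = [((0,), "b"), ((1,), "a"), ((2,), "a"),
        ((3,), "a"), ((4,), "a"), ((5,), "a")]
datasets = {"toy": rows}
learners = {"majority": majority_class, "first": first_label}
# One result per (dataset, learner) pair, as in an experiment grid.
results = {(d, l): cross_validate(learners[l], datasets[d], k=3)
           for d in datasets for l in learners}
print(results)
```

Adding a dataset or a learner only means adding an entry to the corresponding dictionary.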
Knowledge Flow Interface




The Explorer is designed for batch-based data processing: training
data is loaded into memory and then processed.
However, WEKA also implements some incremental algorithms.
The Knowledge Flow interface can handle incremental updates. It can
load and preprocess individual instances before feeding them into
incremental learning algorithms.
The Knowledge Flow interface also provides nodes for visualization
and evaluation.
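The difference between batch and incremental learning can be illustrated with a classifier that is updated one instance at a time and never stores the data. The class below is a hypothetical sketch of an incremental Naïve Bayes over nominal attributes with Laplace smoothing; it is meant to show the idea only and is not WEKA code.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # value_counts[(attribute index, value, label)] -> occurrences
        self.value_counts = defaultdict(int)
        self.values_seen = defaultdict(set)

    def update(self, features, label):
        """Fold a single instance into the counts; the instance is not stored."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, features):
        """Return the label with the highest Laplace-smoothed score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, n in self.class_counts.items():
            score = n / total
            for i, value in enumerate(features):
                count = self.value_counts.get((i, value, label), 0)
                distinct = len(self.values_seen.get(i, ()))
                score *= (count + 1) / (n + distinct)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# (outlook, windy) -> play, streamed from the five weather rows shown earlier.
stream = [(("sunny", "FALSE"), "no"), (("sunny", "TRUE"), "no"),
          (("overcast", "FALSE"), "yes"), (("rainy", "FALSE"), "yes"),
          (("rainy", "FALSE"), "yes")]
model = IncrementalNaiveBayes()
for features, label in stream:
    model.update(features, label)  # one instance at a time
print(model.predict(("sunny", "TRUE")))  # no
```

Because only counts are kept, memory use does not grow with the number of instances processed.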
Conclusions





Compared to R, WEKA is weaker in classical statistics but stronger
in machine learning (data mining) algorithms.
WEKA has developed a set of extensions covering diverse areas, such
as text mining, visualization and bioinformatics.
WEKA 3.6 includes support for importing PMML (Predictive Modeling
Markup Language) models. PMML is an XML-based standard for
expressing statistical and data mining models.
WEKA 3.6 can read and write data in the format used by the
well-known LibSVM and SVM-Light support vector machine
implementations.
WEKA has two limitations:
- Most of the algorithms require all the data to be stored in main
memory, which restricts application to small or medium-sized
datasets.
- A Java implementation is somewhat slower than an equivalent one in
C/C++.