An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results
• Test-driving WEKA
Machine Learning Software
• Suites (general purpose)
  – WEKA (source: Java)
  – MLC++ (source: C++)
  – SIPINA
  – List from KDnuggets (various)
• Specific tools
  – Classification: C4.5, SVMlight
  – Association rule mining
  – Bayesian networks, …
• Commercial vs. free vs. programming your own
What does WEKA do?
• Implementations of state-of-the-art learning algorithms
• Main strengths are in classification
• Regression, association rule, and clustering algorithms are also included
• Extensible, so new learning schemes can be tried
• Large variety of handy tools (transforming datasets, filters, visualization, etc.)
WEKA resources
• API documentation, tutorial, source code
• WEKA mailing list
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
• Weka-related projects:
  – Weka-Parallel: parallel processing for Weka
  – RWeka: linking R and Weka
  – YALE: Yet Another Learning Environment
  – Many others…
Getting Started
• Installation (Java runtime + WEKA)
• Setting up the environment (CLASSPATH)
• Reference book and online API documentation
• Preparing data sets
• Running WEKA
• Interpreting results
ARFF Data Format
• Attribute-Relation File Format
  – Header: describes the attribute types
  – Data: the instances (examples), one comma-separated list per line
• Use the right data format:
  – Filestem (C4.5) and CSV files must be converted to ARFF format
  – Use C45Loader and CSVLoader to convert
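For illustration, a minimal ARFF sketch modeled on WEKA's bundled weather data (the same data that appears in the decision tree output later in this exercise); the attribute values here are made up:

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,FALSE,no
    overcast,83,FALSE,yes
    rainy,70,FALSE,yes
    sunny,65,TRUE,yes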
Launching WEKA
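Assuming the Java runtime is installed and weka.jar is available, the GUI chooser typically starts from the command line with:

    java -jar weka.jar

(or java -cp weka.jar weka.gui.GUIChooser if the CLASSPATH is set explicitly).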
Load Dataset into WEKA
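The same step can be done through the Java API instead of the GUI; a minimal sketch, with hypothetical file names:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadData {
        public static void main(String[] args) throws Exception {
            // Read an ARFF file directly (file name is a placeholder)
            Instances data = DataSource.read("weather.arff");
            // By convention the class attribute is the last one
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println("Loaded " + data.numInstances() + " instances");

            // CSV -> Instances via CSVLoader, as suggested on the ARFF slide
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("weather.csv"));
            Instances csvData = loader.getDataSet();
        }
    }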
Data Filters
• Useful support for data preprocessing
• Removing or adding attributes, resampling the dataset, removing examples, etc.
• Can create stratified cross-validation folds of the given dataset, so that class distributions are approximately retained within each fold
• Data are typically split as 2/3 for training and 1/3 for testing
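A minimal sketch of applying one such filter programmatically (the dataset file and attribute index are placeholders):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class FilterExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Remove the first attribute (option indices are 1-based)
            Remove remove = new Remove();
            remove.setAttributeIndices("1");
            remove.setInputFormat(data);  // must be called before filtering
            Instances filtered = Filter.useFilter(data, remove);
            System.out.println(filtered.numAttributes() + " attributes remain");
        }
    }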
Building Classifiers
• A classifier is a model: a mapping from the dataset attributes to the class (target) attribute
• Creation and form differ from scheme to scheme
• Decision tree and Naïve Bayes classifiers
• Which one is the best? No Free Lunch!
Building Classifiers
(1) weka.classifiers.rules.ZeroR
• Builds and uses a 0-R classifier: predicts the mean (for a numeric class) or the mode (for a nominal class)
(2) weka.classifiers.bayes.NaiveBayes
• Class for building a Naïve Bayes classifier
(3) weka.classifiers.trees.J48
• Class for generating an unpruned or a pruned C4.5 decision tree
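All three schemes share the same buildClassifier interface; a minimal sketch, with a placeholder file name:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] schemes = { new ZeroR(), new NaiveBayes(), new J48() };
            for (Classifier c : schemes) {
                c.buildClassifier(data);
                System.out.println(c);  // prints the learned model, e.g. the J48 tree
            }
        }
    }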
Test Options
• Percentage split (2/3 training; 1/3 testing)
• Cross-validation: estimates the generalization error by resampling when data are limited; the per-fold error estimates are averaged
  – stratified
  – 10-fold
  – leave-one-out (LOO)
  – 10-fold vs. LOO
Understanding Output
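The output on the following slides can also be produced through the API; a minimal sketch of a stratified 10-fold cross-validation (file name is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            Evaluation eval = new Evaluation(data);
            // 10-fold stratified CV; use data.numInstances() folds for leave-one-out
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());
        }
    }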
Decision Tree Output (1)

J48 pruned tree
------------------

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
=== Error on training data ===

Correctly Classified Instances    14    100      %
Incorrectly Classified Instances   0      0      %
Kappa statistic                    1
Mean absolute error                0
Root mean squared error            0
Relative absolute error            0      %
Root relative squared error        0      %
Total Number of Instances         14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0        1          1       1          yes
1        0        1          1       1          no

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no
Decision Tree Output (2)

=== Stratified cross-validation ===

Correctly Classified Instances     9    64.2857 %
Incorrectly Classified Instances   5    35.7143 %
Kappa statistic                    0.186
Mean absolute error                0.2857
Root mean squared error            0.4818
Relative absolute error           60      %
Root relative squared error       97.6586 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.778    0.6      0.7        0.778   0.737      yes
0.4      0.222    0.5        0.4     0.444      no

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
Performance Measures
• Accuracy & error rate
• Mean absolute error
• Root mean squared error (square root of the average quadratic loss)
• Confusion matrix (a contingency table)
• True positive rate & false positive rate
• Precision & F-measure
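For reference, the per-class measures in the outputs above follow the standard definitions (TP, FP, FN counted per class):

    \text{Precision} = \frac{TP}{TP + FP}, \qquad
    \text{Recall} = \text{TP rate} = \frac{TP}{TP + FN}, \qquad
    F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

For example, in the cross-validation output above, precision for "yes" is 7 / (7 + 3) = 0.7 and recall is 7 / (7 + 2) ≈ 0.778.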
Decision Tree Pruning
• Overcomes over-fitting
• Pre-pruning and post-pruning
• Reduced-error pruning
• Subtree raising with different confidence factors
• Compare tree size and accuracy
• Subtree replacement
  – Bottom-up: a tree is considered for replacement once all its subtrees have been considered
Subtree Raising
• Deletes a node and redistributes its instances
• Slower than subtree replacement
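In WEKA these pruning strategies correspond to options on J48; a minimal sketch:

    import weka.classifiers.trees.J48;

    public class PruningOptions {
        public static void main(String[] args) throws Exception {
            J48 tree = new J48();
            // Post-pruning with subtree raising (on by default)
            tree.setSubtreeRaising(true);
            tree.setConfidenceFactor(0.25f);  // smaller values prune more aggressively

            // Alternatives, for comparing tree size and accuracy:
            // tree.setReducedErrorPruning(true);  // reduced-error pruning
            // tree.setUnpruned(true);             // no pruning at all
        }
    }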
Naïve Bayesian Classifier
• Outputs the conditional probability tables (CPTs) and the same set of performance measures
• By default, uses a normal distribution to model numeric attributes
• A kernel density estimator can improve performance when the normality assumption is incorrect (the -K option)
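The corresponding API call, as a minimal sketch:

    import weka.classifiers.bayes.NaiveBayes;

    public class KernelNB {
        public static void main(String[] args) {
            NaiveBayes nb = new NaiveBayes();
            // Replace the default normal-distribution model for numeric
            // attributes with a kernel density estimator (the -K option)
            nb.setUseKernelEstimator(true);
        }
    }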
Data Sets to Work On
• Data sets were preprocessed into ARFF format
• Three data sets from the UCI repository
• Two data sets from computational biology:
  – Protein function prediction
  – Surface residue prediction
Protein Function Prediction
• Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions
• Each attribute (motif) has a Prosite accession number: PS####
• Class labels use Prosite doc IDs: PDOC####
• 73 attributes (binary) & 10 classes (PDOC)
• Suggested method: use 10-fold CV and prune the tree with the subtree-raising method
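Putting the suggested method together, a minimal sketch (the ARFF file name is a placeholder for the preprocessed protein dataset):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ProteinExercise {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("protein.arff");
            data.setClassIndex(data.numAttributes() - 1);  // PDOC class attribute

            J48 tree = new J48();
            tree.setSubtreeRaising(true);  // prune via subtree raising

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));  // 10-fold CV
            System.out.println(eval.toSummaryString());
        }
    }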
Surface Residue Prediction
• Prediction is based on the identity of the target residue and its 4 sequence neighbors:
  X1 X2 X3 X4 X5   (window size = 5)
• Is the target residue on the surface or not?
• 5 attributes and a binary class
• Suggested method: use a Naïve Bayes classifier with no kernels
Your Turn to Test Drive!