An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results
• Test-driving WEKA
Machine Learning Software
Suites (General Purpose)
WEKA (Source: Java)
MLC++ (Source: C++)
SIPINA
List from KDnuggets (various)
Specific
Classification: C4.5, SVMlight
Association Rule Mining
Bayesian Networks …
Commercial vs. Free vs. Programming
What does WEKA do?
Implementations of state-of-the-art learning algorithms
Main strength is classification; regression, association rule and clustering algorithms are also included
Extensible, so new learning schemes are easy to try out
Large variety of handy tools (transforming datasets, filters, visualization, etc.)
WEKA resources
API Documentation, Tutorial, Source code.
WEKA mailing list
Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations
Weka-related Projects:
Weka-Parallel - parallel processing for Weka
RWeka - linking R and Weka
YALE - Yet Another Learning Environment
Many others…
Getting Started
Installation (Java runtime +WEKA)
Setting up the environment (CLASSPATH; see the commands below)
Reference Book and online API document
Preparing Data sets
Running WEKA
Interpreting Results
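A typical setup from a shell, as a sketch (the weka.jar path is an assumption; point it at your own install):

export CLASSPATH=$CLASSPATH:/path/to/weka.jar   # make the WEKA classes visible to the JVM
java weka.gui.GUIChooser                        # launch the WEKA GUI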
ARFF Data Format
Attribute-Relation File Format
Header – describes the attribute types
Data – instances (examples) as comma-separated lists
Use the right data format: Filestem (C4.5) and CSV → ARFF
Use C45Loader and CSVLoader to convert
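For reference, here is a fragment of the weather data set that ships with WEKA (the same data behind the J48 output later in these slides), in ARFF:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
...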
Launching WEKA
Load Dataset into WEKA
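In the Explorer this is the Open file... button; from Java code, a minimal sketch (the file name weather.arff is an assumption):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file into memory
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        // WEKA does not guess the class attribute; the last one is the usual choice
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded " + data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}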
Data Filters
Useful support for data preprocessing:
removing or adding attributes, resampling the dataset, removing examples, etc.
Filters such as StratifiedRemoveFolds create stratified cross-validation folds of the given dataset, with class distributions approximately retained within each fold
Data are typically split 2/3 for training and 1/3 for testing (see the sketch below)
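A minimal sketch of both ideas, assuming the weather.arff file from earlier (attribute index and seed are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Drop the first attribute (filter indices are 1-based)
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(data);               // must be called before filtering
        Instances reduced = Filter.useFilter(data, remove);
        System.out.println(reduced.numAttributes() + " attributes remain");

        // A simple 2/3 training, 1/3 testing split after shuffling
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
        System.out.println(train.numInstances() + " training / "
                + test.numInstances() + " test instances");
    }
}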
Building Classifiers
A classifier is a model: a mapping from the dataset's attributes to the class (target) attribute
How a classifier is created, and the form it takes, differ from scheme to scheme
Decision Tree and Naïve Bayes Classifiers
Which one is the best?
No Free Lunch!
Building Classifiers
(1) weka.classifiers.rules.ZeroR
Building and using a 0-R classifier. Predicts the
mean (for a numeric class) or the mode (for a
nominal class).
(2) weka.classifiers.bayes.NaiveBayes
Class for building a Naive Bayesian classifier
(3) weka.classifiers.trees.J48
Class for generating an
unpruned or a pruned
C4.5 decision tree.
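A minimal sketch that builds the third of these on the weather data and prints the tree (the same pruned tree shown two slides ahead):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();        // a pruned C4.5 tree by default
        tree.buildClassifier(data);
        System.out.println(tree);    // text rendering of the learned tree
    }
}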
Test Options
Percentage Split (2/3 Training; 1/3 Testing)
Cross-validation
estimates the generalization error by resampling when data are limited; fold error estimates are averaged
stratified
10-fold
leave-one-out (LOO)
10-fold vs. LOO
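A minimal sketch of stratified 10-fold cross-validation via the Evaluation class (the seed value is illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation with a fixed random seed; WEKA stratifies the folds
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}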
Understanding Output
Decision Tree Output (1)
J48 pruned tree
------------------

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8
=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances               14
=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
1         0         1           1        1           yes
1         0         1           1        1           no
=== Confusion Matrix ===

a b   <-- classified as
9 0 | a = yes
0 5 | b = no
Decision Tree Output (2)
=== Stratified cross-validation ===
Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0.186
Mean absolute error                      0.2857
Root mean squared error                  0.4818
Relative absolute error                 60      %
Root relative squared error             97.6586 %
Total Number of Instances               14
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.778     0.6       0.7         0.778    0.737       yes
0.4       0.222     0.5         0.4      0.444       no
=== Confusion Matrix ===
a b <-- classified as
7 2 | a = yes
3 2 | b = no
Performance Measures
Accuracy & Error rate
Mean absolute error
Root mean-squared error (square root of the average quadratic loss)
Confusion matrix – contingency table
True Positive rate & False Positive rate
Precision & F-Measure
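These can be read straight off the cross-validated confusion matrix above. Taking yes as the positive class (TP = 7, FN = 2, FP = 3, TN = 2):

TP rate (recall) = TP / (TP + FN) = 7 / 9  = 0.778
FP rate          = FP / (FP + TN) = 3 / 5  = 0.6
Precision        = TP / (TP + FP) = 7 / 10 = 0.7
F-Measure        = 2PR / (P + R)  = 2(0.7)(0.778) / 1.478 = 0.737
Accuracy         = (TP + TN) / 14 = 9 / 14 = 64.2857 %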
Decision Tree Pruning
Overcome Over-fitting
Pre-pruning and Post-pruning
Reduced error pruning
Subtree raising with different confidence factors; compare tree size and accuracy (see the sketch below)
Subtree replacement
Bottom-up: tree is considered for replacement
once all its subtrees have been considered
Subtree Raising
Deletes node and redistributes instances
Slower than subtree replacement
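In the Explorer these choices live on J48's parameter panel; from code, a minimal sketch of the same knobs (the values are illustrative, not recommendations):

import weka.classifiers.trees.J48;

public class PruningOptions {
    public static void main(String[] args) throws Exception {
        J48 tree = new J48();
        tree.setUnpruned(false);           // post-prune the tree (the default)
        tree.setConfidenceFactor(0.1f);    // smaller values prune more aggressively
        tree.setSubtreeRaising(true);      // allow subtree raising during pruning
        // Or switch to reduced-error pruning instead of C4.5's error estimate:
        // tree.setReducedErrorPruning(true);
        System.out.println(java.util.Arrays.toString(tree.getOptions()));
    }
}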
Naïve Bayesian Classifier
Outputs conditional probability tables (CPTs) and the same set of performance measures
By default, numeric attributes are modeled with a normal distribution
A kernel density estimator (the -K option) can improve performance when the normality assumption does not hold
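The same switch from code, as a sketch:

import weka.classifiers.bayes.NaiveBayes;

public class KernelNB {
    public static void main(String[] args) throws Exception {
        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(true);   // equivalent to the -K command-line flag
        System.out.println(java.util.Arrays.toString(nb.getOptions()));
    }
}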
Data Sets to work on
Data sets were preprocessed into ARFF format
Three data sets from UCI repository
Two data sets from Computational Biology
Protein Function Prediction
Surface Residue Prediction
Protein Function Prediction
Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions
Each attribute (motif) has a PROSITE accession number: PS####
Class labels use PROSITE documentation IDs: PDOC####
73 attributes (binary) & 10 classes (PDOC).
Suggested method: use 10-fold CV and prune the tree with subtree raising (see the sketch below)
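A minimal sketch of that suggestion; the file name protein.arff is an assumption, so substitute whatever the lab handout provides:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class ProteinJ48 {
    public static void main(String[] args) throws Exception {
        // File name is hypothetical; use the ARFF file supplied with the lab
        Instances data = new Instances(new BufferedReader(new FileReader("protein.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // the PDOC class attribute

        J48 tree = new J48();
        tree.setSubtreeRaising(true);                   // prune with subtree raising

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}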
Surface Residue Prediction
Prediction is based on the identity of the target
residue and its 4 sequence neighbors
X1 X2 X3 X4 X5   (window size = 5)
Is the target residue on the surface or not?
5 attributes and a binary class
Suggested method: use the Naïve Bayes classifier with no kernel estimator (see the sketch below)
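The same evaluation scaffold works here; surface.arff is again a placeholder name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class SurfaceNB {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("surface.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // surface / non-surface class

        NaiveBayes nb = new NaiveBayes();               // default: no kernel estimator

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}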
Your Turn to Test Drive!