Transcript Lect15-Weka

Weka
Just do it
Free and Open Source
ML Suite
Ian Witten & Eibe Frank
University of Waikato
New Zealand
Overview
• Classifiers, Regressors, and clusterers
• Multiple evaluation schemes
• Bagging and Boosting
• Feature Selection:
– right features and data key to successful learning
• Experimenter
• Visualizer
• Text not up to date.
• They welcome additions.
Learning Tasks
• Classification: given examples labelled
from a finite domain, generate a procedure
for labelling unseen examples.
• Regression: given examples labelled with a
real value, generate a procedure for labelling
unseen examples.
• Clustering: from a set of examples,
partition the examples into “interesting”
groups. What scientists want.
Data Format: IRIS
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
General form:
@attribute attribute-name REAL or {list of values}
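To make the file layout concrete, here is a minimal sketch of a reader for this kind of file, written in plain Python rather than with Weka's own loader. The function name parse_arff and the constant IRIS_SAMPLE are made up for illustration. It assumes a small, dense file with only REAL-style or {value list} attributes, and it ignores quoting, missing values ("?"), sparse rows, and string/date attributes.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows).

    Hypothetical helper for illustration; not Weka's loader.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # Numeric attributes become floats; nominal ones stay strings.
            rows.append([float(v) if kind == "numeric" else v
                         for (_, kind), v in zip(attributes, values)])
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # ARFF keywords are case-insensitive
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "@relation":
            relation = rest
        elif keyword == "@attribute":
            name, spec = rest.split(None, 1)
            spec = spec.strip()
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = "numeric"  # assumption: REAL / NUMERIC / INTEGER
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


IRIS_SAMPLE = """\
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(IRIS_SAMPLE)
print(relation, len(attributes), rows[0])
```

Each data row comes back as a list with one entry per declared attribute, in declaration order.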
J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Leaf counts (n/m): n = number of instances under the node, m = number of those classified wrong.
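Read top to bottom, the tree is just a chain of threshold tests. A minimal sketch of the same tree as a Python function follows. The function name classify_iris is made up for illustration, and the thresholds are copied from the printed tree, not learned by this code.

```python
def classify_iris(petallength, petalwidth):
    """Apply the printed J48 tree by hand (thresholds copied from the slide)."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # From here on: petalwidth > 0.6
    if petalwidth > 1.7:
        return "Iris-virginica"
    # From here on: 0.6 < petalwidth <= 1.7
    if petallength <= 4.9:
        return "Iris-versicolor"
    # From here on: petallength > 4.9
    if petalwidth <= 1.5:
        return "Iris-virginica"
    return "Iris-versicolor"


print(classify_iris(petallength=1.4, petalwidth=0.2))
```

Note that only two of the four attributes appear in the tree; sepal length and width are never tested.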
Cross-validation
• Correctly Classified Instances 143 95.33 %
• Incorrectly Classified Instances 7 4.67 %
• Default: 10-fold cross-validation, i.e.
– Split data into 10 equal-sized pieces
– Train on 9 pieces and test on the remaining one
– Repeat so each piece is the test set once, and average
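The three steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Weka's implementation, which may differ in details such as how the folds are drawn. The names k_fold_indices, cross_validate, train_majority, and predict_majority are made up here, and the majority-class learner is only a stand-in for a real classifier. The sketch assumes at least k examples.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, train, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the accuracies."""
    folds = k_fold_indices(len(xs), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        correct = sum(predict(model, xs[j]) == ys[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical stand-in learner: always predict the most common training label.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


xs = list(range(20))
ys = ["a"] * 15 + ["b"] * 5
print(cross_validate(xs, ys, train_majority, predict_majority, k=5))
```

Any learner can be plugged in by passing a different train/predict pair.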
J48 Confusion Matrix
Old data set from statistics: 50 of each class
  a  b  c   <-- classified as
 49  1  0 | a = Iris-setosa
  0 47  3 | b = Iris-versicolor
  0  3 47 | c = Iris-virginica
Precision, Recall, and Accuracy
• Precision: probability of being correct, given
your decision.
– Precision of iris-setosa is 49/49 = 100%
– Positive predictive value in medical literature
• Recall: probability of correctly identifying a member of
the class.
– Recall for Iris-setosa is 49/50 = 98%
– Sensitivity in medical literature
• Accuracy: # right/total = 143/150 ≈ 95%
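The three measures can be checked directly against the J48 confusion matrix. The sketch below uses plain Python; the names MATRIX, precision, recall, and accuracy are chosen for illustration. Rows are the actual class and columns are the predicted class, as in the Weka output.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows = actual class, columns = predicted class (from the J48 slide).
MATRIX = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Of everything predicted as class c, the fraction that really is c."""
    predicted = sum(row[c] for row in matrix)  # column total
    return matrix[c][c] / predicted if predicted else 0.0


def recall(matrix, c):
    """Of everything that really is class c, the fraction predicted as c."""
    actual = sum(matrix[c])  # row total
    return matrix[c][c] / actual if actual else 0.0


def accuracy(matrix):
    """Diagonal (correct) count divided by the total count."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


for c, label in enumerate(LABELS):
    print(label, round(precision(MATRIX, c), 3), round(recall(MATRIX, c), 3))
print("accuracy", round(accuracy(MATRIX), 3))
```

Precision uses the column total and recall uses the row total, which is why the two differ for Iris-setosa (49/49 versus 49/50).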
Other Evaluation Schemes
• Leave-one-out cross-validation
– n-fold cross-validation where n = number of training
instances
• Specific train and test set
– Allows for exact replication
– OK if the train and test sets are large, e.g. in the 10,000 range.
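Leave-one-out is the extreme case: every instance is the test set exactly once. A minimal sketch in plain Python follows. The names leave_one_out_accuracy, train_1nn, and predict_1nn are made up for illustration, and the one-dimensional nearest-neighbour learner and toy data are only stand-ins for a real classifier and data set.

```python
def leave_one_out_accuracy(xs, ys, train, predict):
    """Hold out each instance in turn, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = train(train_x, train_y)
        correct += predict(model, xs[i]) == ys[i]
    return correct / len(xs)


# Hypothetical stand-in learner: 1-nearest-neighbour on a single numeric feature.
def train_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


xs = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]  # toy data, not from the lecture
ys = ["a", "a", "a", "b", "b", "b"]
print(leave_one_out_accuracy(xs, ys, train_1nn, predict_1nn))
```

The cost is n training runs, which is why this scheme suits small data sets or cheap learners.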
Bootstrap sampling
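• Randomly select n instances, with replacement, from the n available
• Expect about 2/3 (1 - 1/e ≈ 63%) of the distinct instances to be chosen for training
– Prob. a given instance is never chosen = (1-1/n)^n ≈ 1/e ≈ 0.37
• Test on the remainder (the instances never chosen)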
• Repeat about 30 times and average.
• Avoids partition bias
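The sampling step and the 1/e estimate can both be checked numerically. The sketch below uses only the Python standard library. The names bootstrap_split and mean_chosen_fraction are made up for illustration, and the choice of n = 1000 is arbitrary.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; test on the never-drawn ones."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def mean_chosen_fraction(n, repeats=30, seed=0):
    """Average, over several bootstrap samples, the fraction of distinct instances drawn."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(repeats):
        _, test = bootstrap_split(n, rng)
        fractions.append(1 - len(test) / n)
    return sum(fractions) / repeats


n = 1000
print("(1 - 1/n)^n    =", (1 - 1 / n) ** n)
print("1/e            =", 1 / math.e)
print("chosen (mean)  =", mean_chosen_fraction(n))
print("1 - 1/e        =", 1 - 1 / math.e)
```

In a real evaluation, each of the roughly 30 repeats would train a model on the drawn indices and score it on the remainder, and the scores would then be averaged.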