Evaluation of Classifiers Biplav Srivastava
Evaluation of Results
(classifiers, and beyond)
Biplav Srivastava
Sources:
• [Witten&Frank00] Witten, I.H. and Frank, E. Data Mining: Practical
Machine Learning Tools and Techniques with Java Implementations,
Morgan Kaufmann, 2000.
• [Lim et al 99] Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. A Comparison of
Prediction Accuracy, Complexity, and Training Time of Thirty-three
Old and New Classification Algorithms, Machine Learning, forthcoming.
(Appendix contains complete tables of error rates, ranks, and training
times; the data sets can be downloaded in C4.5 format.)
Evaluation of Methods
Ideally: find the single best method
Practically: identify classes of comparable methods
• Classification accuracy
– Effect of noise on accuracy
• Comprehensibility of result
– Compactness
– Complexity
• Training time
• Scalability with increase in sample size
Types of classifiers
• Decision trees
• Neural Network
• Statistical
– Regression
• Pi = w0 + w1*x1 + w2*x2 + ... + wm*xm
• Fit the weights by least squares on the data set:
min Sum_i (Ci - (w0 + w1*x1 + ... + wm*xm))^2
– Classification
• Perform a regression for each class: target = 1 for instances
of Ci and 0 otherwise, giving one linear expression per class
• Given test data, evaluate each linear expression and choose
the class corresponding to the largest value (see the sketch
below)
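A minimal sketch of this regression-to-classification scheme in Python/NumPy, assuming numeric features; the names fit_regression_classifier and predict are illustrative, not from the slides:

import numpy as np

def fit_regression_classifier(X, y, n_classes):
    # One least-squares regression per class, with target 1 for
    # members of the class and 0 otherwise.
    X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend a column for w0
    W = np.empty((n_classes, X1.shape[1]))
    for c in range(n_classes):
        target = (y == c).astype(float)
        W[c], _, _, _ = np.linalg.lstsq(X1, target, rcond=None)
    return W

def predict(W, X):
    # Evaluate each class's linear expression; pick the largest.
    X1 = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(X1 @ W.T, axis=1)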
Assumptions in classification
• Data set is a representative sample
• Prior probability of each class is proportional to its
frequency in the training sample
• Changing categories to numerical values:
if attribute X takes k values C1, C2, ..., Ck,
use a (k-1)-dimensional vector d1, d2, ..., dk-1 such that
di = 1 if X = Ci and di = 0 otherwise.
For X = Ck, the vector contains all zeros (see the sketch below).
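A small sketch of this (k-1)-dimensional coding; the helper name dummy_code is illustrative:

def dummy_code(value, categories):
    # d_i = 1 iff value == C_i; the last category C_k maps to all zeros.
    return [1 if value == c else 0 for c in categories[:-1]]

categories = ["red", "green", "blue"]   # k = 3
print(dummy_code("red", categories))    # [1, 0]
print(dummy_code("blue", categories))   # [0, 0]  (last category: all zeros)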
Error Rate of Classifier
• If data set is large
– use result from test data
– test data is usually 1/3 of total data
• If data size is small
– K-fold cross validation (usually 10)
• Divide the data set into K roughly equal sets.
• Use K-1 sets for training, the remaining set for testing.
• Repeat K times and average the error rates (see the
sketch below).
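A minimal sketch of K-fold cross validation, assuming NumPy arrays; train_fn and error_fn are hypothetical placeholders for any learner and its error measure on held-out data:

import numpy as np

def k_fold_error(X, y, train_fn, error_fn, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)              # K roughly equal sets
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])    # train on K-1 sets
        errors.append(error_fn(model, X[test], y[test]))
    return np.mean(errors)                      # average over the K runs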
Error Rate of Classifier (cont)
– Fold choice can affect result
• STRATIFICATION: Class ratio in each set is same
as in whole data set.
• Repeat K-fold cross validation K times to overcome
random variation in fold choice.
– Leave-one-out: N-fold cross validation
– Bootstrap:
• Training Set = select N data items at random with
replacement. Prob(item never selected) = (1 - 1/N)^N ≈ 1/e ≈ 0.368
• Test Set = Data Set - Training Set
• e = 0.632 * e(test) + 0.368 * e(training)
(see the sketch below)
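A sketch of this 0.632 bootstrap estimate, under the same assumptions as the K-fold sketch above (train_fn and error_fn are hypothetical placeholders):

import numpy as np

def bootstrap_632_error(X, y, train_fn, error_fn, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    train = rng.integers(0, n, size=n)           # N draws with replacement
    test = np.setdiff1d(np.arange(n), train)     # items never drawn, ~36.8%
    model = train_fn(X[train], y[train])
    e_test = error_fn(model, X[test], y[test])
    e_train = error_fn(model, X[train], y[train])
    return 0.632 * e_test + 0.368 * e_train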
Evaluation in [Lim et al 99]
• 22 DT, 9 Statistical, 2 NN classifiers
• Time Measurement
– Use SPEC marks to rate platforms and scale
results based on these marks.
• Error rates
– Calculation
• If the test set has more than 1000 items, use its error rate
• Else, use 10-fold cross validation
– Multiple 10-fold runs are not used.
Evaluation in [Lim et al 99]
– Acceptable performance
• If p is the minimum error rate over all classifiers, those
within one standard error of p are accepted (see the
sketch below)
• Std error of p = sqrt(p*(1-p)/N)
• Statistical significance of error rates
– Null hypothesis: all algorithms have the same mean
error rate; test differences of mean error rates.
(Hypothesis REJECTED.)
– Tukey method: a difference between mean error
rates is significant at the 10% level if it exceeds 0.058.
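A worked sketch of the acceptance rule above; the error rates and test-set size are made up for illustration:

import math

def acceptable(error_rates, n_test):
    p = min(error_rates)
    se = math.sqrt(p * (1 - p) / n_test)   # std error of p
    return [e for e in error_rates if e <= p + se]

rates = [0.10, 0.12, 0.11, 0.20]
print(acceptable(rates, n_test=1000))      # [0.1]; se is about 0.0095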
Evaluation in [Lim et al 99]
• Rank analysis (no normality assumption)
– For each data set: sort classifiers by ascending error
rate and assign ranks (ties are given the average rank)
Error rate: 0.1  0.342  0.5  0.5  0.5  0.543  0.677  0.789
Rank:       1    2      4    4    4    6      7      8
– Get the mean rank across all data sets (see the
sketch below)
• Statistical significance of ranks
– The null hypothesis of equal mean ranks
(Friedman test) is REJECTED.
– A difference in mean ranks greater than 8.7 is
significant at the 10% level.
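A sketch of rank assignment with tie-averaging, reproducing the example table above; the helper name average_ranks is illustrative:

def average_ranks(error_rates):
    order = sorted(range(len(error_rates)), key=lambda i: error_rates[i])
    ranks = [0.0] * len(error_rates)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and error_rates[order[j]] == error_rates[order[i]]:
            j += 1                              # find the extent of the tie
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2   # average of positions i+1..j
        i = j
    return ranks

print(average_ranks([0.1, 0.342, 0.5, 0.5, 0.5, 0.543, 0.677, 0.789]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0, 8.0]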
Evaluation in [Lim et al 99]
• Classifiers equivalent to the best (POL)
(training time <= 10 min)
– 15 found using mean error rate
– 18 found using mean rank
• Training time reported in orders of magnitude slower
than the fastest classifier (bucket x covers 10^(x-1) to
10^x times slower)
– Decision tree learners (C4.5; FACT, which uses statistical
tests to choose splits) and regression methods train fast
– Spline-based statistical methods and NNs are slower
Evaluation in [Lim et al 99]
• Size of trees (in decision trees)
– Use 10-fold cross-validation
– Noise typically reduces tree size
• Scalability with data set size
– If the data set is small, use bootstrap re-sampling
• N items are drawn with replacement
• The class attribute is randomly changed with prob = 0.1
to a value selected uniformly from the valid set
(see the sketch below)
– Else, use the given data size
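A sketch of this noisy re-sampling step, assuming NumPy arrays; noisy_bootstrap is an illustrative name, and the uniform draw may repeat the original label:

import numpy as np

def noisy_bootstrap(X, y, classes, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))   # N items with replacement
    Xb, yb = X[idx].copy(), y[idx].copy()
    flip = rng.random(len(yb)) < noise           # change each label w.p. 0.1
    yb[flip] = rng.choice(classes, size=int(flip.sum()))
    return Xb, yb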
Evaluation in [Lim et al 99]
– log(training time) increases linearly with log(N)
– Decision trees usually scale
• C4.5 with rules does not scale, but trees do
• QUEST, FACT (multi-attribute) scale
– Regression methods usually scale
– NN methods were not tested
[Lim et al 99] Summary
• Use method that fits the requirement
– error rate, training time, ...
• Decision tree with univariate splits (single
attribute) for data interpretation
– C4.5 trees are big, C4.5 rules don’t scale
• Simple regression methods are good: fast,
easy implementation, scalable
What more?
• Precision / Recall (see the sketch below)

                 PREDICTED +ve   PREDICTED -ve
ACTUAL +ve       True (+)ve      False (-)ve
ACTUAL -ve       False (+)ve     True (-)ve

• Cost-based analysis
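A sketch computing precision, recall, and an expected cost from the confusion-matrix cells above; the cell counts and costs are made up for illustration:

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)   # predicted positives that are truly positive
    recall = tp / (tp + fn)      # actual positives that were found
    return precision, recall

def expected_cost(fp, fn, c_fp=1.0, c_fn=5.0):
    # Cost-based analysis: weight each error type by its cost;
    # correct predictions are assumed to cost 0.
    return c_fp * fp + c_fn * fn

print(precision_recall(tp=40, fp=10, fn=20))   # (0.8, 0.666...)
print(expected_cost(fp=10, fn=20))             # 110.0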
Demonstration from WEKA