
Data Mining
Chapter 5
Credibility: Evaluating What’s
Been Learned
Kirk Scott
1
• One thing you’d like to do is evaluate the
performance of a data mining algorithm
• More specifically, you’d like to evaluate
results obtained by applying a data mining
algorithm to a certain data set
• For example, can you estimate what
percent of new instances it will classify
correctly?
2
• Problems include:
• The performance level on the training set
is not a good indicator of performance on
other data sets
• Data may be scarce, so that enough data
for separate training and test data sets
may be hard to come by
• A related question is how to compare the
performance of 2 different algorithms
3
• Parameters for evaluation and
comparison:
• Are you predicting classification?
• Or are you predicting the probability that
an instance falls into a classification?
• Or are you doing numeric prediction?
• What does performance evaluation mean
for things other than prediction, like
association rules?
4
• Can you define the cost of a
misclassification or other “failure” of
machine learning?
• There are false positives and false
negatives
• Different kinds of misclassifications will
have different costs in practice
5
• There are broad categories of answers to
these questions
• For evaluation of one algorithm, a large
amount of data makes estimating
performance easy
• For smaller amounts of data, a technique
called cross-validation is commonly used
• Comparing the performance of different
data mining algorithms relies on statistics
6
5.1 Training and Testing
7
• For tasks like prediction, a natural
performance measure is the error rate
• This is the number of incorrect predictions
out of all made
• How to estimate this?
8
• A rule set, a tree, etc. may be imperfect
• It may not classify all of the training set
instances correctly
• But it has been derived from the training
set
• In effect, it is optimized for the training set
• Its error rate on an independent test set is
likely to be much higher
9
• The error rate on the training set is known
as the resubstitution error
• You put the same data into the classifier
that was used to create it
• The resubstitution error rate may be of
interest
• But it is not a good estimate of the “true”
error rate
10
• In a sense, you can say that the classifier,
by definition, is overfitted on the training
set
• If the training set is a very large sample, its
error rate could be closer to the true rate
• Even so, as time passes and conditions
change, new instances might not fall into
quite the same distribution as the training
instances and will have a higher rate of
misclassification
11
• Not surprisingly, the way to estimate the
error rate is with a test set
• You’d like both the training set and the test
set to be representative samples of all
possible instances
• It is important that the training set and the
test set be independent
• Any test data should have played no role
in the training of the classifier
12
• There is actually a third set to consider
• The validation set
• It may serve either of these two purposes:
• Selecting one of several possible data
mining algorithms
• Optimizing the one selected
13
• All three sets should be independent
• Train only with the training set
• Validate only with the validation set
• Test only with the test set
• In general, error rate estimation is done on
the test set
• After all decisions are made, it is
permissible to combine all three sets and
retrain on this superset for better results
14
5.2 Predicting Performance
15
• This section can be dealt with quickly
• Its main conclusions are based on
statistical concepts, which are presented
in a box
• You may be familiar with the derivations
from statistics class
• They are beyond the scope of this course
16
• These are the basic ideas:
• Think in terms of a success rate rather
than an error rate
• Based on a sample of n instances, you
have an observed success rate in the test
data set
17
• Statistically, you can derive a confidence
interval around the observed rate that
depends on the sample size
• Doing this provides more complete
knowledge about the real success rate,
whatever it might be, based on the
observed success rate
18
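To make the idea concrete, here is a minimal Python sketch (not the book's derivation) that puts a normal-approximation confidence interval around an observed success rate; the success count and sample size are hypothetical.

import math

def success_rate_interval(successes, n, z=1.96):
    # Normal-approximation confidence interval for a success rate;
    # z = 1.96 corresponds to roughly 95% confidence.
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical example: 750 correct predictions out of 1000 test instances
low, high = success_rate_interval(750, 1000)
print(f"observed rate 0.750, approximate 95% interval ({low:.3f}, {high:.3f})")

The larger the test sample, the tighter the interval around the observed rate, which is the point the slides are making.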
5.3 Cross-Validation
19
• Cross-validation is the general term for
techniques that can be used to select
training/testing data sets and estimate
performance when the overall data set is
limited in size
• In simple terms, this might be a
reasonable rule of thumb:
• Hold out 1/3 of the data for testing, for
example, and use the rest for training (and
validation)
20
Statistical Problems with the
Sample
• Even if you do a proper random sample,
poorly matched test and training sets can
lead to poor error estimates
• Suppose that all of the instances of one
classification fall into the test set and none
fall into the training set
21
• The rules resulting from training will not be
able to classify those test set instances
• The error rate will be high (justifiably)
• But a different selection of training set and
test set would give rules that covered that
classification
• And when the rules are evaluated using
the test set, a more realistic error rate
would result
22
Stratification
• Stratification can be used to better match
the test set and the training set
• You don’t simply obtain the test set by
random sampling
• You randomly sample from each of the
classifications in the overall data set in
proportion to their occurrence there
• This means all classifications are
represented in both the test and training
sets
23
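A minimal sketch of how a stratified holdout might be drawn, assuming instances are identified by index and class labels are given as a parallel list (both hypothetical here): sample the test set from each class in proportion to that class's frequency.

import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=0):
    # Group instance indices by class, then draw the same fraction from each class
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    test_idx = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        test_idx.extend(idxs[:round(len(idxs) * test_fraction)])
    train_idx = [i for i in range(len(labels)) if i not in set(test_idx)]
    return train_idx, test_idx

# Hypothetical labels for 9 instances in two classes
train, test = stratified_holdout(["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no"])
print("train:", train, "test:", test)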
Repeated Holdout
• With additional computation, you can
improve on the error estimate obtained
from a stratified sample alone
• n times, randomly hold out 1/3 of the
instances for testing (possibly using
stratification) and train on the rest
• Average the error rates over the n times
• This is called repeated holdout
24
Cross-Validation
• The idea of multiple samples can be
generalized
• Partition the data into n “folds” (partitions)
• Do the following n times:
• Train on (n – 1) of the partitions
• Test on the remaining 1
• Average the error rates over the n times
25
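Here is a minimal sketch of the n-fold procedure just described. The train_and_evaluate callback is a hypothetical stand-in for whatever learner is being assessed; it is assumed to build a model on the training partition and return the error rate on the test partition.

def n_fold_cross_validation(instances, n, train_and_evaluate):
    # Partition the data into n folds (simple round-robin split here),
    # train on n - 1 of them, test on the remaining one, and average
    folds = [instances[i::n] for i in range(n)]
    error_rates = []
    for i in range(n):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        error_rates.append(train_and_evaluate(train_set, test_set))
    return sum(error_rates) / n

# Toy usage with a dummy evaluator that just reports the test-set fraction
dummy = lambda train, test: len(test) / (len(train) + len(test))
print(n_fold_cross_validation(list(range(100)), 10, dummy))

For stratified cross-validation the folds would be built per class, along the lines of the earlier stratification sketch.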
Stratified 10-Fold Cross-Validation
• The final refinement of cross validation is
to make the partitions so that all
classifications are roughly in proportion
• The standard rule of thumb is to use 10
partitions
• Why?
• The answer boils down to this essentially:
• Human beings have 10 fingers…
26
• In general, it has been observed that 10
partitions lead to reasonable results
• 10 partitions is not computationally out of
the question
27
• The final refinement presented in the book
is this:
• If you want really good error estimates, do
10-fold cross validation 10 times with
different stratified partitions and average
the results
• At this point, estimating error rates has
become a computationally intensive task
28
5.4 Other Estimates
29
• 10 times 10-fold stratified cross-validation
is the current standard for estimating error
rates
• There are other methods, including these
two:
• Leave-One-Out Cross-Validation
• The Bootstrap
30
Leave-One-Out Cross-Validation
• This is basically n-fold cross validation
taken to the max
• For a data set with n instances, you hold
out one for testing and train on the
remaining (n – 1)
• You do this for each of the instances and
then average the results
31
• This has these advantages:
• It's deterministic:
• There's no sampling
• In a sense, you maximize the information
you can squeeze out of the data set
32
• It has these disadvantages:
• It’s computationally intensive
• By definition, a holdout of 1 can’t be
stratified
• By definition, the classification of a single
instance will not conform to the distribution
of classifications in the remaining
instances
• This can lead to poor error rate estimates
33
The Bootstrap
• This is another technique that ultimately
relies on statistics (presented in a box)
• The question underlying the development
and use of the bootstrap technique is this:
• Is there a way of estimating the error rate
that is especially well-suited to small data
sets?
34
• The basic statistical idea is this:
• Do not just randomly select a test set from
the data set
• Instead, select a training set (the larger of
the two sets) using sampling with
replacement
35
• Sampling with replacement in this way will
lead to these results:
• Some of the instances will not be selected
for the training set
• These instances will be the test set
• By definition, you expect duplicates in the
training set
36
• Having duplicates in the training set will
mean that it is a poorer match to the test
set
• The error estimate is skewed to the high
side
• This is corrected by combining this
estimate with the resubstitution error rate,
which is naturally skewed to the low side
37
• This is the statistically based formula for
the overall error rate:
• Overall error rate =
• .632 * test set error rate
• + .368 * training set error rate
• To improve the estimate, randomly sample
with replacement multiple times and
average the results
38
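Here is a minimal sketch of one round of the 0.632 bootstrap, working on instance indices; error_rate is a hypothetical callback that trains on the first index list and measures the error rate on the second.

import random

def bootstrap_632(n_instances, error_rate, seed=0):
    rng = random.Random(seed)
    # Sample n indices WITH replacement for training; duplicates are expected
    train_idx = [rng.randrange(n_instances) for _ in range(n_instances)]
    # Indices never drawn become the test set
    test_idx = [i for i in range(n_instances) if i not in set(train_idx)]
    test_err = error_rate(train_idx, test_idx)    # skewed to the high side
    resub_err = error_rate(train_idx, train_idx)  # skewed to the low side
    return 0.632 * test_err + 0.368 * resub_err

# Toy usage: a dummy callback that returns a higher error rate on unseen data
dummy = lambda train_idx, eval_idx: 0.10 if eval_idx == train_idx else 0.30
print(bootstrap_632(100, dummy))

Repeating the sampling with different seeds and averaging the results is the "multiple times" refinement the slide mentions.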
• Like the leave-one-out technique, there
are cases where this technique does not
give good results
• This depends on the distribution of the
classifications in the overall data set
• Sampling with replacement in some cases
can lead to the component error rates
being particularly skewed in such a way
that they don’t compensate for each other
39
5.5 Comparing Data Mining
Schemes
40
• This section can be dealt with quickly
because it is largely based on statistical
derivations presented in a box
• The fundamental idea is this:
• Given a data set (or data sets) and two
different data mining algorithms, you’d like
to choose which one is best
41
• For researchers in the field, this is the
broader question:
• Across all possible data sets in a given
problem domain (however that may be
defined) which algorithm is superior
overall?
• We’re really only interested in the simpler,
applied question
42
• At heart, to compare two algorithms, you
compare their estimated error or success
rates
• In a previous section it was noted that you
can get a confidence interval for an
estimated success rate
• In a situation like this, you have two
probabilistic quantities you want to
compare
43
• The paired t-test is an established
statistical technique for comparing two
such quantities and determining if they are
significantly different
44
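A minimal sketch of the paired t-test idea, assuming the two schemes have been evaluated on the same cross-validation folds; the per-fold error rates below are invented for illustration, and the resulting statistic would be compared against a t table at the stated degrees of freedom.

import math

def paired_t_statistic(errors_a, errors_b):
    # t statistic for paired samples, based on the per-fold differences
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt(var / k), k - 1  # statistic, degrees of freedom

# Hypothetical per-fold error rates for schemes A and B on the same 10 folds
a = [0.20, 0.22, 0.19, 0.25, 0.21, 0.23, 0.18, 0.24, 0.22, 0.20]
b = [0.24, 0.25, 0.22, 0.26, 0.24, 0.25, 0.21, 0.27, 0.24, 0.23]
t, df = paired_t_statistic(a, b)
print(f"t = {t:.2f} with {df} degrees of freedom")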
5.6 Predicting Probabilities
45
• The discussion so far had to do with
evaluating schemes that do simple
classification
• Either an instance is predicted to be in a
certain classification or it isn’t
• Numerically, you could say a successful
classification was a 1 and an unsuccessful
classification was a 0
• In this situation, evaluation boiled down to
counting the number of successes
46
Quadratic Loss Function
• Some classification schemes are more
finely tuned
• Instead of just predicting a classification,
they produce the following:
• For each instance examined, a probability
that that instance falls into each of the
classes
47
• The set of k probabilities can be thought of
as a vector of length k containing elements
pi
• Each pi represents the probability of one of
the classifications for that instance
• The pi sum to 1
48
• The following discussion will be based on
the case of one instance, not initially
considering the fact that a data set will
contain many instances
• How do you judge the goodness of the
vector of pi‘s that a data mining algorithm
produces?
49
• Evaluation is based on what is known as a
loss function
• Loss is a measure of the probabilities
against the actual classification found
• A good prediction of probabilities should
have a low loss value
• In other words, the difference is small
between what is observed and the
predicted probabilities
50
• Consider the loss for one instance
• Out of k possible classifications, the
instance actually falls into one of them,
say the ith
• This fact can be represented by a vector a
with ai = 1 and 0 in all other positions
51
• The data mining algorithm finds a vector of
length k containing the probabilities that the
instance falls into each of the classifications
• This fact can be represented by a vector p
containing values pi
52
• Now for a single instance, consider this
sum over the different classes and their
probabilities:
Σj (pj – aj)^2
53
Σj (pj – aj)^2
• Ignore the squaring for a moment
• Excluding the ith case, this is a sum of the
probabilities that didn’t come true
• For a good prediction, you’d like this to be
small
54
Σj (pj – aj)^2
• For the ith case you have pi – ai = pi – 1
• For a good scheme you’d like the
probability of predicting the correct
classification to be as near to 1 as
possible
• In other words, you’d like the difference
between the probability and 1 to be as
small as possible
55
• Putting the two parts back together again,
you'd like this sum to be small:
Σj (pj – aj)
• As for the squaring, this is basically just
statistical standard operating procedure
• Squaring makes the function amenable to
calculus techniques
56
Σj (pj – aj)^2
• The final outcome is this
• The goal is to minimize the mean square
error (MSE)
• Because of the squaring this is called the
quadratic loss function
• Recall that this discussion was for one
instance
• Globally you want to minimize the MSE
over all instances
57
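A minimal sketch of the quadratic loss for a single instance, using the p and a vectors as defined above; the probability vectors are hypothetical.

def quadratic_loss(p, actual_index):
    # Sum over classes of (p_j - a_j)^2, where a is 1 for the actual class, 0 elsewhere
    a = [1 if j == actual_index else 0 for j in range(len(p))]
    return sum((pj - aj) ** 2 for pj, aj in zip(p, a))

# Hypothetical three-class predictions; the instance actually belongs to class 0
print(quadratic_loss([0.7, 0.2, 0.1], 0))  # 0.09 + 0.04 + 0.01 = 0.14
print(quadratic_loss([0.3, 0.4, 0.3], 0))  # a weaker prediction, so a larger loss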
Informational Loss Function
• The informational loss function is another
way of evaluating a probabilistic predictor
• Like the discussion of the quadratic loss
function, this will be discussed in terms
of a single instance
• For a whole data set you would have to
consider the losses over all instances
58
• The book explains that the informational
loss function is related to the information
function for tree formation
• It is also related to logistic regression
• Suppose that a particular instance is
classified into the ith category
• For that instance, this is the loss function:
• - log2 pi
59
• I will give a simplified information theoretic
explanation at the end
• In the meantime, it is possible to
understand what the function does with a
few basic definitions and graphs
• Remember the following:
• The sum of all of the pi = 1
• Each individual pi < 1
60
• Now observe that the information loss
function involves only pi, the probability
value for the category that the instance
actually classified into
• The information loss function does not
include any information about the
probabilities of the categories that the
instance didn’t classify into
61
• So, restrict your attention to the ith
category, the one the instance classified
into
• A measure of a good predictor would be
how close the prediction probability, pi, is
to the value 1 (perfect prediction) for that
instance
62
• Here is the standard log function
• log2 pi
• Its graph is shown on the following
overhead
63
64
• Here is the informational loss function
• - log2 pi
• Its graph is shown on the following
overhead
• We are only interested in that part of the
graph where 0 < pi < 1, since that is the
range of possible values for pi
65
66
• The goal is to minimize the loss function
• What does that mean, visually?
• It means pushing pi as close to one as
possible
• When pi = 1 the information loss function
takes on the value 0
• An information loss function value of 0 is
the optimal result
67
• In vague information theoretic terms, you
might say this:
• The greater the probability of your
prediction, the more (valid) information you
apparently had to base it on
• Or, the less information you would have
needed in order to make it optimal
• As pi → 1, -log2 pi → 0
68
• Conversely, the less the probability of your
prediction the less (valid) information you
apparently had to base it on
• Or the more information you would have
needed to make it optimal
• As pi → 0, -log2 pi → ∞
69
• The book observes that if you assign a
probability of 0 to an event and that event
happens, the informational loss function
“cost” is punishing—namely, infinite
• A prediction that is flat wrong would
require an unmeasurable amount of
information to correct
70
• This leads to another very informal insight
into the use of the logarithm in the
function, and what that means
• If you view the graph from right to left
instead of left to right, as the probability of
a prediction gets smaller and smaller, the
information needed in order to make it
better grows “exponentially”
71
• In practice, because of the potential for an
infinite informational loss function value,
no event is given a probability of 0
72
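A minimal sketch of the informational loss for a single instance, with hypothetical probability vectors; it shows the loss approaching 0 as pi approaches 1 and growing without bound as pi approaches 0.

import math

def informational_loss(p, actual_index):
    # -log2 of the probability assigned to the class that actually occurred
    return -math.log2(p[actual_index])

# Hypothetical three-class predictions; the instance actually belongs to class 0
for probs in ([0.99, 0.005, 0.005], [0.5, 0.3, 0.2], [0.01, 0.49, 0.5]):
    print(probs, "->", round(informational_loss(probs, 0), 3))
# A probability of exactly 0 for the class that occurred would make the
# loss infinite, which is why no event is given a probability of 0.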
• Consider the graph again
• Assume that information was measured in
bits rather than decimal numbers
• A more theoretical statement about the
relative costs would go like this:
• The informational loss function represents
the relative number of bits of information
that would be needed to effect a change in
the probability of a prediction
73
Discussion
• Which is better, the quadratic loss function
or the informational loss function?
• It’s a question of preference
• The quadratic loss function is based on
the probability of cases that didn’t occur as
well as cases that occurred
• The information loss function is based only
on the probability of the case that occurred
74
• The quadratic loss function is bounded (a
finite sum of squares of differences)
• The informational loss function is
unbounded
• Statisticians would probably find the
quadratic loss function (MSE criterion)
familiar and useful
75
• Information theorists would probably find
the informational loss function familiar and
useful
• The book points out that in theory you can
measure the information contained in a
structural representation in bits
• Advanced analyses can be based on
formulas that combine the information in a
representation and the bits involved in
measuring the error rate (bits still needed)
76
5.7 Counting the Cost
• Preliminary note:
• The book discusses cost in this section
• It also discusses things that don’t really
involve cost
• Incidentally, it also discusses things that
would involve (differential) cost, but
simplify by using cost coefficients of 1 and
0 rather than anything more complicated
77
• In a way, the contents of this section might
come across as a mish-mash
• Hopefully the overheads will add clarity to
the book’s presentation
• Whether mish-mash or not, ultimately the
topics covered will be important because
they are aspects of what is available in
Weka
78
Fundamental Cost Ideas
• For any prediction scheme there will be
successes and failures, that is, successes
and errors
• Consider a binary classification scheme
where the output is yes/no, true/false,
positive/negative
79
Two Kinds of Successes
• Correct predictions of positive, true
positives, TP
• Correct predictions of negative, true
negatives, TN
• Quite often, the cost (i.e., the benefit) of
the two kinds of successes is taken to be
the same
• You can deal with TP + TN together
80
Two Kinds of Errors
• Incorrect predictions of positive, false
positives, FP
• Incorrect predictions of negative, false
negatives, FN
• In virtually every applied situation the
costs of false positives and false negatives
materially differ
81
Difference in Error Costs
• The book illustrates the idea with several
examples
• One is direct mail advertising, which it
uses again later in the section
• Consider a mailing to a predicted potential
customer who doesn’t respond
• This is a false positive
• The individual cost of the unsuccessful
mailing is low
82
• Consider a mailing that was never sent to
what would have been a customer
• This is a false negative
• The algorithm incorrectly predicted
negative for that potential customer
• The cost of the lost business is high
compared to the cost of a mailing
83
• In order to stay in business, the cost of
mailing has to be less than the cost of
business generated
• You can afford lots of false positives,
although in aggregate their cost adds up
• In this domain the cost of lost opportunity
is invisible, but in practical terms, the cost
of a few false negatives quickly outweighs
the cost of many false positives
84
Confusion Matrices
• The idea of a confusion matrix will recur in
the section
• A confusion matrix is a graphical way of
summarizing information about successes
and failures in prediction
85
• The simplest of confusion matrices, one
for two classifications, is illustrated in
Table 5.3, shown on the following
overhead
• The entries in the matrix are the counts of
the different kinds of successes and
failures for a given situation
86
87
• The book shows Table 5.4a and Table
5.4b together
• I will break up the presentation
• Consider Table 5.4a, shown on the
following overhead
• It shows the confusion matrix resulting
from the application of a non-binary
classification scheme
88
89
• The rows represent the actual class for the
instances in a data set
• The columns represent the class
predictions given by the classification
scheme
• The true predictions are on the main
diagonal of the matrix
• The errors are off of the main diagonal
90
False Positives
• Look at column a, for example
• The entry for row a, 88, is the count of the
number of correct predictions
• The entries for rows b and c, 14 and 18,
are the counts of false positives with
respect to class a
• The algorithm falsely predicted this many
instances of b and c were of class a
91
False Negatives
• Now look at row a
• The entries for columns b and c, 10 and 2,
are the counts of false negatives with
respect to class a
• The algorithm falsely predicted this many
instances of class a were in classes b and
c
92
Food for Thought
• Now look at row b
• The entry for column a, 14, is a false
negative with respect to class b
• This entry was identified as a false positive
with respect to class a
93
• So for a classification that isn’t binary,
whether something is a false positive or
false negative is a matter of perspective
• Geometrically, it depends on whether
you’re looking from the perspective of a
row or a column
94
• The rows represent the actual
classification
• Taken from the perspective of a row, an
error entry is a failure to correctly place
something into that classification
• This is a false negative
95
• The columns represent the predicted
classification
• Taken from the perspective of a column,
an error entry is an incorrect placement of
something into that classification
• This is a false positive
96
• The row b, column a entry is an error,
false positive or false negative, depending
on perspective
• The kind of error doesn’t really matter
• It is simply a unique error case, distinct
from the other error entries in the table
• It might have a cost/weight associated with
it that is also distinct from the cost/weight
of any other error entry in the matrix
97
Evaluating Classification Performance
with a Confusion Matrix
• Next, the book shows a way of evaluating
the performance of a classification scheme
• This evaluation method has nothing to do
with costs
• It’s simply an application built on top of,
and graphically illustrated by confusion
matrices
98
• The previous sections considered
evaluating classification performance in
terms of estimating error rates of schemes
• It also considered comparing the
performance of two different schemes
• The idea presented now is that
performance can be evaluated on the
basis of how much better it is than what
would be achieved by random results
99
• Table 5.4a and Table 5.4b combined
illustrate the idea
• These tables are shown on the following
overhead
• Explanations follow that
100
101
• First consider the bottom row of 5.4a
• The values 120, 60, and 20 are the totals
predicted for classes a, b, and c overall
• Now consider the rightmost column of 5.4a
• 100, 60, and 40 are the counts of the
actual occurrence of instances of classes
a, b, and c
102
• Now consider Table 5.4b
• The rightmost column of that table
contains the same values as the rightmost
column of 5.4a
• The count of the actual number of
instances of each class does not change
• The bottom row of 5.4b is also the same
as the bottom row of 5.4a
103
• Table 5.4a represents the actual predictor
we are trying to evaluate
• Table 5.4b represents a hypothetical
predictor that we are comparing it against
• The fact that the bottom rows of the two
tables are the same means that the
hypothetical predictor of interest gives the
same overall count of predictions for each
class
104
• It’s the bodies of the two tables that differ
• They differ in how the correct and incorrect
predictions for each class are distributed
• The hypothetical predictor’s values work in
this way:
• The values in the rows for every actual
class are distributed in proportion to the
overall totals in the bottom row
105
• 120 is to 60 is to 20
• As 60 is to 30 is to 10
• As 36 is to 18 is to 6
• As 24 is to 12 is to 4
• These sets of values represent the so-called
random predictor
• (It’s not exactly random. It’s based on the
prediction counts for the actual scheme.)
106
• The random predictor is the point of
comparison
• Can you make a quantitative comparison
between the performance of the actual
classification scheme and the random
predictor?
• (If your classification scheme isn’t better
than random, you’ve got a big problem.
Your predictor is sort of an anti-predictor.)
107
Comparing with the Random
Predictor
• Sum the counts on the main diagonal of
the matrix for the actual scheme
• It got 88 + 40 + 12 = 140 correct
• Do the same for the random predictor
• It got 60 + 18 + 4 = 82 correct
• Clearly the actual predictor is better
• Can this be quantified?
108
• There are 200 instances altogether
• The random predictor left 200 – 82 = 118
remaining incorrectly predicted
• The actual predictor got 140 – 82 = 58
more correct than the random predictor
• The actual predictor got 58/118 = 49.2% of
those remaining instances correct
109
• This computation is known as the Kappa
statistic
• In practice, its value can range from 0 to 1
• A negative value would mean a predictor
worse than random prediction
• 0 means no better than random
• 1 means perfect prediction
110
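The Kappa computation can be reproduced directly from the counts quoted above for Table 5.4a. Here is a minimal sketch; the confusion matrix is filled in from those counts (rows are actual classes a, b, c; columns are predicted classes).

def kappa(confusion):
    # Rows = actual class, columns = predicted class
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k))          # correct predictions
    col_totals = [sum(row[j] for row in confusion) for j in range(k)]
    row_totals = [sum(row) for row in confusion]
    # Correct count expected from the "random" predictor that keeps the same
    # column totals but spreads them in proportion to the actual class totals
    expected = sum(col_totals[i] * row_totals[i] / total for i in range(k))
    return (observed - expected) / (total - expected)

table_5_4a = [[88, 10, 2],
              [14, 40, 6],
              [18, 10, 12]]
print(round(kappa(table_5_4a), 3))  # 58 / 118, about 0.492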
• Note again that it’s a bit of an anomaly that
this method appears in the section on
counting the cost
• The method doesn’t count the cost of FP
or FN
• It simply counts the successes vs. the
errors and compares with the hypothetical
case
111
Cost-Sensitive Classification
• Suppose you don’t assign different cost
values to different success types (TP, TN)
and different failure types (FP, FN)
• Cost of success = 0
• Cost of failure = 1
112
• The default cost matrices for the two-class
and three-class cases are quite simple
• These are given in Table 5.5, shown on
the following overhead
• For all practical purposes, when you do
this, you’re back simply counting
113
114
• The book doesn’t give a concrete example
where the costs are distinguished, and
neither will these overheads
• However, the relative cost of successes
and errors can be included in a pretty
straightforward way when evaluating data
mining schemes
115
Performance Evaluation with
Costs
• Suppose you have a data mining scheme
that gives straight classification predictions
• With a test set, you have both a prediction
and the known classification
• Up until now, performance was evaluated
on the basis of a count of errors
116
• Potentially, each element of the cost
matrix could contain a different decimal
value which was the weight for that case
• Successes (benefits) could be positive
• Failures/errors (costs) would be negative
• You would evaluate performance by
summing up the total of successes and
errors multiplied by the corresponding
weight in the cost matrix
117
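A minimal sketch of that evaluation: multiply each cell of the confusion matrix by the corresponding entry of a cost/benefit matrix and sum. The counts and weights below are hypothetical.

def weighted_outcome(confusion, weights):
    # Sum of count x weight over all cells; positive weights act as benefits,
    # negative weights as costs
    return sum(c * w
               for conf_row, weight_row in zip(confusion, weights)
               for c, w in zip(conf_row, weight_row))

# Hypothetical two-class results (rows = actual, columns = predicted)
confusion = [[50, 10],    # actual yes: 50 TP, 10 FN
             [20, 120]]   # actual no:  20 FP, 120 TN
weights = [[15.0, -5.0],  # hypothetical benefit of a TP, cost of an FN
           [-0.5, 0.0]]   # hypothetical cost of an FP, neutral TN
print(weighted_outcome(confusion, weights))  # 750 - 50 - 10 + 0 = 690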
Complete Financial Analysis
• Data mining doesn’t exist in a vacuum
• It is a real technique applied in businesslike situations
• Suppose you were comparing the cost of
two different algorithms
• Not only might the costs associated with
the predictions be taken into account
118
• The better performing algorithm might
have data preprocessing or computational
costs that are significant
• Its predictions might be X dollars more
valuable than those of another scheme
• But the comparative costs of running the
algorithm may be greater than X
• A complete analysis would take this into
account when picking an algorithm
119
Classification Including Costs
• This section represents a shift of focus
• So far we’ve been concerned with doing a
cost evaluation of prediction performance
after the fact
• Now we want to try and take cost into
account when making the prediction
120
• Suppose you’re using a classification
scheme that generates probabilities
• Up until now, if you had to classify under
this condition, you picked the classification
with the highest probability
• With cost information, another possibility
presents itself
121
• For a given instance that you want to
classify:
• For each different classification, find (1 –
p), the probability that it’s not correct
• Multiply each of the (1 – p) values by the
cost factor of a misclassification
• Give the instance the classification where
that product is the smallest
• In other words, classify on the basis of
minimum cost
122
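A minimal sketch of one reading of that recipe, with hypothetical class names, probabilities, and cost factors: for each candidate class, multiply the probability of being wrong by the cost of that kind of mistake, and pick the smallest product.

def min_cost_class(probabilities, misclassification_costs):
    # Choose the class whose (1 - p) x cost product is smallest
    def expected_cost(cls):
        return (1 - probabilities[cls]) * misclassification_costs[cls]
    return min(probabilities, key=expected_cost)

# Hypothetical instance: "a" is the most probable class, but wrongly predicting
# "a" is assumed to be far more costly than wrongly predicting "b",
# so "b" wins on minimum expected cost
probs = {"a": 0.5, "b": 0.4, "c": 0.1}
costs = {"a": 10.0, "b": 1.0, "c": 1.0}
print(min_cost_class(probs, costs))  # prints "b"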
Probabilities vs. Straight
Classification
• The book notes that there are ways of
converting straight classification results
into probabilities
• It doesn’t give details (although it’s pretty
clear this would involve counting and
dividing in Bayesian style)
• The point is that you could then apply cost
in the same way when deciding
classification for those schemes too
123
Summary of the previous idea
• The approach for factoring cost into
predictions could be summarized as
follows:
• When running the data mining algorithm,
do not take cost into account when
deriving the rules
• However, after the rules are derived, take
cost into account when using them to
make a classification
124
Cost-Sensitive Learning
• Now we would like to try and take cost into
account when deriving the rules
• It is possible to follow a line of reasoning
that explains how this is accomplished
• Not all the details will be given
• However, the approach can be outlined
based on concepts that have already been
presented
125
There are Schemes that give
Probabilities as Results
• We have repeatedly been told that some
data mining algorithms will give
probabilities instead of classifications
• Naïve Bayes gave these kinds of results
• We infer that there are other, more
sophisticated data mining schemes that
also give probabilistic results
• In other words, probabilistic based
schemes have general application
126
What about the Loss Function?
• In this chapter the idea of a loss function
was introduced
• The statement was made that the way to
improve a classification scheme was to
minimize the loss
• In fact, we have not yet been told how
data mining schemes can be guided to
give better results by a process of
minimizing loss
127
Cost and Loss
• As understood so far, cost is external to a
classification scheme
• Cost is associated with particular
classification outcomes, successful or
unsuccessful
• Loss is internal to a classification scheme
• It is a measure of how greatly predicted
classification probabilities differ from
observed actual classifications
128
• Cost and loss are not the same
• However, when evaluating a scheme, the
cost is calculated by multiplying the
probability of an outcome times the cost of
that outcome
• If the mining scheme is guided by
minimizing loss, this will tend to minimize
cost
129
Making the Jump—Bias to Cost
• This is where you make a jump
• Recall this idea from earlier in the chapter:
• You achieve quality evaluations of a
scheme by cross-validation
• If you want accurate performance
estimates, the distribution of classifications
in the test set should match the distribution
of classifications in the training set
130
• If there was a mismatch between training
and test set some test set instances might
not classify correctly
• The mismatch in the sets means that the
training set was biased
• If it was biased against certain
classifications, this suggests that it was
biased in favor of other classifications
131
Including Cost in Rule
Derivation
• The approach to including cost in the
derivation in classification rules is this:
• Intentionally bias the training set by
increasing the representation of some
classifications
• The algorithm will produce rules more
likely to correctly predict these
classifications
132
• The practical questions are these:
• Which classifications to over-represent?
• This is based on knowledge of the cost of
errors for those classifications
• How much to over-represent them?
• The answer to this is pretty much just
empirical rule of thumb
133
Concretely…
• Take this as a concrete example:
• Suppose that false positives are more
costly than false negatives
• Over-represent “no” instances in the
training set
• When you run the classification algorithm
on this training set, it will overtrain on no
134
• That means the rules derived by the
algorithm will be more likely to reach a
conclusion of no
• Incidentally, this will increase the number
of false negatives
• However, it will decrease the number of
false positives for these classifications
• In this way, cost has been taken into
account in the rules derived, reducing the
likelihood of expensive FP results
135
• The book notes that you aren’t limited to
accomplishing this by over-representation
• You can allow duplicate instances in the
training set (like sampling with
replacement)
• In practice, some data mining schemes
just allow heavier or lighter weights to be
assigned to instances of given
classifications
136
Cost and Loss, Again
• Technically, we still didn’t use an algorithm
that was guided by loss
• Skewing the sample was the proxy for
tuning loss, which in turn was also the
proxy for cost in the rule derivation
137
Lift Charts
• Consider the direct mailing example
• Suppose you know as a starting point that
out of a population (training set/test set) of
1,000,000, there will be 1,000 yeses
• This is a response rate of .1%
138
Defining the Lift Factor
• Now suppose you have a data mining
algorithm that gives probabilities of yes for
the instances
• This means you can identify people more
likely to respond yes and those less likely
to respond yes
• You can identify subpopulations (subsets,
samples) with a greater likelihood of yes
139
• The increase in the response rate is
known as the lift factor
• This idea is summarized in the table on
the following overhead
140
Sample      Yeses   Response Rate   Lift Factor
1,000,000   1,000   .1%
400,000     800     .2%             2
100,000     400     .4%             4
141
• Assuming that you can do this, note this
obvious result:
• You’ve increased the response rate—good
• You’ve accomplished this by reducing the
sample size—also good
• On the other hand, the absolute number of
yeses has dropped
• In other words, you are excluding some
yeses from consideration
142
Finding Lift Factors
• Given a training set or population, how do
you come up with the figures for lift
factors?
• The data mining algorithm gives a
probability of a yes
• The training data also tells you whether a
given instance is actually yes
143
• Rank all of the instances by their
probability and keep track of their actual
class
• Note that we’re operating under the
(reasonable) assumption that the data
mining algorithm really works
• The higher the predicted probability, the
more likely an instance really is to take on
a certain value
144
• Table 5.6, given on the following
overhead, shows instances ranked by their
predicted probability, and with their actual
classification also listed
145
146
• Using a table like this, you can find the lift
factor for a given sample size, or you can
find the sample size for a given lift factor
147
Finding the Lift factor for a
Given Sample Size
• If you want the lift factor for a particular
sample size, just count down the list until
you reach that sample size
• The response rate = count(yeses) /
count(number of instances in sample)
• The lift factor is response rate(sample) /
response rate(population)
148
Finding the Sample Size for a
Given Lift Factor
• If you want a particular lift factor, keep a
running computation of response rate and
lift factor as you go down the list
• When you reach the desired lift factor, the
number of instances you’ve gone through
is your sample size
149
Making a Lift Chart
• To make a lift chart, you need the lift factor
for each possible sample size
• In other words, work your way down the
ranked list one instance at a time
• For each one, calculate the response rate
so far and the lift factor so far
• You then plot the results of this
150
• The x-axis is the percent of the population
you’ve worked through
• The y-axis is the actual number of yes
responses generated
• This plot is what is known as a lift chart
• Figure 5.1 shown on the following
overhead illustrates the idea for the direct
mailing example
151
152
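Here is a minimal sketch of that procedure: walk down a list of actual outcomes already ranked by predicted probability of yes, and compute the response rate and lift factor for each prefix. The ranked outcomes below are a small made-up stand-in for Table 5.6.

def lift_points(ranked_actuals):
    # ranked_actuals: True/False outcomes sorted by decreasing predicted
    # probability of yes; returns one (sample %, lift factor) point per prefix
    total = len(ranked_actuals)
    overall_rate = sum(ranked_actuals) / total   # population response rate
    points, yes_so_far = [], 0
    for i, actual in enumerate(ranked_actuals, start=1):
        yes_so_far += actual
        points.append((100 * i / total, (yes_so_far / i) / overall_rate))
    return points

# Hypothetical ranked outcomes, most confident predictions first
ranked = [True, True, False, True, False, False, True, False, False, False]
for sample_pct, lift in lift_points(ranked):
    print(f"{sample_pct:5.1f}% of population, lift factor {lift:.2f}")

Figure 5.1's y-axis plots the cumulative number of yes responses rather than the lift factor, but the same walk down the ranked list produces either plot.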
What a Lift Chart Tells You
• The lift chart tells you something about
your data mining algorithm
• If the algorithm is any good at all, the
curve should be above the diagonal
• Otherwise, the algorithm is determining
the yes probability for instances with a
success rate lower than a random sample
153
• In general, the closer the lift chart curve
comes to the upper left hand corner, the
better
• A minimal sample size and a maximum
response rate are good
• The hypothetical ideal would be a mailing
only to those who would respond yes,
namely a 100% success rate, with no one
left out (no false negatives)
154
Cost Benefit Curves
• The costs and benefits of differing sample
sizes/mailing scenarios will differ
• As noted earlier, the cost of the algorithm
alone is a factor
• But for the time being we will ignore it
• The assumption is that we have already
done the work of running the algorithm
• We want to know what this completed
work can tell us
155
• The cost of interest now is the cost of
mailing an individual item
• We’ll assume that the cost is constant per
item
• The benefit of interest is the value of the
business generated per yes response
• We’ll assume that the benefit is constant
per positive response
156
• It is now possible to form a cost/benefit
curve across the same domain (percent of
sample size) as the lift chart
• Figure 5.2, given on the following
overhead, shows lift charts on the left and
cost/benefit curves on the right for various
scenarios
• Figure 5.2 will be discussed in greater
detail on the overheads following it
157
158
Comments on Figure 5.2
• For Figure 5.2, a benefit of +15.00 per
positive response is assumed
• The charts at the top assume a mailing
cost per item of .50
• A lift chart is shown on the left and a
cost/benefit curve is shown on the right
• The cost/benefit curve keeps on rising all
the way to 100% of the population, so it
makes sense to send mail to everybody if
you can afford it
159
• The charts at the bottom assume a cost
per item of .80
• A lift chart is shown on the left and a
cost/benefit curve is shown on the right
• The cost/benefit curve has a peak in the
middle
• This means you only want to mail to this
proportion of the population
160
• Keep in mind that the sample is selected
by ranking by the algorithm, not randomly
• This means roughly speaking (gauging by
the shape of the curve)
• You would like to mail to those prospective
customers in the top half of the data set
• Mailing to more than these gives
diminishing returns
161
What Else is in Figure 5.2?
• The details of Figure 5.2 are impossible to
see
• In brief, the GUI includes the following:
• A confusion matrix
• A cost matrix where you input the costs of
TP, TN, FP, FN
• A slider for selecting the proportion
• Numeric results of the computation as well
as the graphs
162
A Side Note on Where You
Might Be in Exploring Weka
• Figure 5.2 was generated by Weka
• If you haven’t downloaded and installed
Weka yet, it’s not too soon to do so
• The book also has links to the example
data sets, already in ARFF (format)
• Looking through the GUI, you can find the
options to run various data mining
algorithms
163
• You can also find the options to run
various tools that allow you to explore and
compare the results of running the
algorithms
• There are no homework assignments, so
it’s up to you to start looking into these
things on a schedule that agrees with you
164
A Preliminary Note on the
Project
• The syllabus contained some information
on the general form of the project
• Here is a little more preliminary detail
• You can do the project using the UCI data
sets or any of the other data sets made
available on the Weka Web site
• If you do so, you are on the hook for the
full 8 combinations of data and algorithm
165
• If you use another data set, the number of
combinations you have to run will be
reduced
• In either case, for full points you will also
have to do some exploring and comparing
of results
• Further details on these aspects of the
project will be posted later
166
ROC Curves
• The acronym ROC stands for receiver
operating characteristic
• This term originated in the analysis of
signal to noise ratio in a communication
channel
• It is a way of comparing true positives to
false positives
167
• In general terms, you can think of it as
“pushing” for greater positives and
measuring your success against the false
positives generated
• An ROC curve is similar to a lift chart
• It provides a way of drawing conclusions
about one data mining algorithm
168
• Multiple ROC curves can also be used to
compare data mining algorithms
• ROC curves also graphically illustrate a
means of getting the best possible results
over a range of parameter values by
combining two different data mining
algorithms
• For starters, take a look at the data in
Table 5.6 again (repeated on the following
overhead)
169
170
• For those instances with a predicted
probability > .5 the prediction is yes
• (The book never talks about whether the
prediction is reversed when the predicted
probability < .5)
171
• Anyway, looking at the probability
predictions vs. the actual class in the
table, the TP’s are clearly identified
• Likewise, the FP’s are clearly identified
• Looking at the right hand column, the
universes of actual positives and actual
negatives are also clearly identified
172
• Incidentally, you could break things down
in this way:
• Actual positives = TP + FN
• Actual negatives = TN + FP
• The book does this in part of its
explanation
• Most of the time I will just refer to actual
positives and negatives
173
Forming the ROC Curve
• Let a set of data be given that has been
sorted by the probability prediction like in
Table 5.6
• The ROC curve plots the TP rate on the
y-axis and the FP rate on the x-axis
• The ROC curve for the Table 5.6 data is
given in Figure 5.3, shown on the following
overhead
• The jagged line represents a discrete plot
of the values in the table
174
175
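A minimal sketch of how the jagged ROC line can be produced from the same kind of ranked list used for the lift chart: after each instance, plot the TP count as a percentage of all actual positives against the FP count as a percentage of all actual negatives. The ranked outcomes are hypothetical.

def roc_points(ranked_actuals):
    # ranked_actuals: True/False outcomes sorted by decreasing predicted
    # probability of yes; returns (FP rate %, TP rate %) points
    positives = sum(ranked_actuals)
    negatives = len(ranked_actuals) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for actual in ranked_actuals:
        if actual:
            tp += 1
        else:
            fp += 1
        points.append((100 * fp / negatives, 100 * tp / positives))
    return points

# Hypothetical ranked outcomes, most confident predictions first
ranked = [True, True, False, True, False, False, True, False, False, False]
for fp_rate, tp_rate in roc_points(ranked):
    print(f"FP rate {fp_rate:5.1f}%, TP rate {tp_rate:5.1f}%")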
Similarity between the ROC
Curve and the Lift Chart
• Note that the ROC curve is similar to the
lift chart curve shown earlier
• In the lift chart the x-axis was the sample
size as a percent of the population
• In the ROC curve the x-axis is the # of FP
as a % of the actual number of negatives
• The similarity here stems from the fact that
in the direct mailing example the number
of negatives for a sample or a population
is nearly the size of the total
176
The Upper Left is Good
• With a different example an ROC curve
and a lift curve may not be so similar
• However, there is still an overall similarity:
• The upper left hand corner of the graph is
“good”
• The graph displays TP vs. FP
• You’d like to maximize TP and minimize
FP, which would occur in the upper left
177
Arriving at the Smooth Curve
• Figure 5.3 is repeated on the following
overhead for reference
• Note the smooth curve that is shown in
addition to the jagged one
178
179
• The jagged line represented a single
display of one data set
• For general analytical purposes, you might
assume that across the whole problem
domain, with different data sets, the
relationship between TP and FP would be
a smooth curve
• This is shown in the figure as a dashed
line
180
• The book gives several explanations of
how to get the smooth curve
• The basic idea is to use cross-validation
• To me the explanations are foggy
• I’d like to appeal to a visual argument
again
181
• Suppose you did 10-fold cross-validation
• You’d have 10 test sets
• If you graphed TP vs. FP for each test set,
you’d get 10 different jagged lines
• If you averaged the y values for each x
value, the results would approximate a
smooth curve
• Since the axes are percents, the shape of
the curve would be the same as for the
population
182
ROC Curves for Different
Learning Schemes
• Figure 5.4, given on the following
overhead, shows ROC curves for two data
mining schemes
• Scheme A is further to the upper left for
smaller sample sizes
• Scheme B is further to the upper left for
larger sample sizes
• You can compare and choose between
the two if you have a preference based on
sample size
183
184
• The shaded area on the graph is referred
to as the convex hull of the two curves
• Part of the hull lies outside the boundaries
of the curves for A and B
• For sample sizes in that range, it is
possible to achieve ROC results on the
outer edge with a linear combination of A
and B
185
• The outer boundary of the hull is tangent
to A and B
• Let (x1, y1) and (x2, y2) be the tangent
points for A and B, respectively
• Suppose you wanted to find the point 1/nth
of the way from (x1, y1) to (x2, y2)
186
• xnew = x1 + (1/n)(x2 – x1)
• ynew = y1 + (1/n)(y2 – y1)
• xnew = (1/n) x2 + (x1 – (1/n) x1)
• ynew = (1/n) y2 + (y1 – (1/n) y1)
• xnew = (1/n) x2 + (1 – 1/n) x1
• ynew = (1/n) y2 + (1 – 1/n) y1
187
• 1/n and (1 – 1/n) can work like
probabilities
• They sum to 1
• You can use a Monte Carlo like approach
to achieving the best possible ROC value
for any point between the tangent points to
A and B
188
• This would be cross-validation like in
nature
• For multiple data sets, apply A at the ratio
of FP to FN that gives its tangent point
• For multiple data sets, apply B at the ratio
of FP to FN that gives its tangent point
189
• For a given problem where you want an
ROC (ratio of FP to FN) that lies 1/nth of
the way from one tangent point to the
other:
• Let the ratio of B data sets to A data sets
be 1/n to (1 – 1/n)
• Combine the results of the computations
generated in this proportion and it will be
the desired point on the line
190
• Note that the foregoing overheads are my
attempt to explain what the book says
• I don’t claim that they are wrong
• I did find the book’s explanation kind of
incomplete
• My review of it is no better
• There seem to be some missing links
191
Recall-Precision Curves
• Even though the section heading is
recall-precision curves, we won't actually see
a graph in this section…
• The general idea behind lift charts and
ROC curves is trade-off
• They are measures of good outcomes vs.
unsuccessful outcomes
• This is cost-benefit in simple terms (not in
terms of a cost factor and cost equation)
192
• Trying to measure the tradeoff between
desirable and undesirable outcomes
occurs in many different problem domains
• Different measures may be used in
different domains
• In the area of information retrieval, two
measures are used:
• Recall and precision
193
• Recall = (# of relevant docs retrieved) /
(total # of relevant docs)
• Precision = (# of relevant docs retrieved) /
(total # of docs retrieved)
194
• Notice the general idea illustrated by these
measures and inherent in the other ones
we’ve looked at:
• You can increase the number of relevant
documents you retrieve by increasing the
total number of documents you retrieve
• But as you do so, the proportion of
relevant documents falls
195
• Consider the extreme case
• How would you be guaranteed of always
retrieving all relevant documents?
• Retrieve all documents in total
• But you’ve gained nothing
• If you retrieve all documents, the task still
remains of winnowing the relevant
documents from the rest
196
Discussion
• The book notes that in non-computer
fields, like medical testing, the same kinds
of ideas crop up
• For a given medical test:
• Sensitivity = proportion of people with the
disease who test positive
• Specificity = proportion of people without
the disease who test negative
• For both of these measures, high is good
197
• Incidentally, these concepts also remind
me of association rule mining
• Support and confidence seem to have
something in common with the
dichotomies presented in this section
198
One Figure Measures
• In addition to 2-dimensional graphs or
curves like lift charts and ROC curves,
there are also techniques for trying to
express the goodness of a scheme in a
single number
• For example, in information retrieval there
is the concept of “average recall”
measures
199
• Average recall measures are actually
averages of precision over several recall
values
• Three-point average recall is the average
precision for recall figures of 20%, 50%,
and 80%
• Eleven-point average recall is the average
precision for figures of 0%-100% by 10’s
200
• The book also cites the F-measure
• (2 x recall x precision) / (recall + precision)
• = (2 x TP) / (2 x TP + FP + FN)
• If you go back to the definitions you can
derive the 2nd expression from the 1st
• Looking at the second expression, the
simple point is that a larger value is better
201
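A minimal sketch of the F-measure computed from the counts, using hypothetical values of TP, FP, and FN.

def f_measure(tp, fp, fn):
    # (2 x recall x precision) / (recall + precision) = (2 x TP) / (2 x TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical counts: 40 true positives, 10 false positives, 5 false negatives
print(round(f_measure(40, 10, 5), 3))  # 80 / 95, about 0.842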
• The simplest one-figure measure of
goodness is simply the success rate:
• (TP + TN) / (TP + FP + TN + FN)
• One-figure measures don’t contain as
much information as graphs
• By definition, graphs tell you the outcome
over a range
202
• Table 5.7, shown on the following
overhead, summarizes lift charts, ROC
curves, and recall precision curves
• Even though we haven’t been given a
graph of a recall-precision curve, this is
where such curves are put in context with
lift charts and ROC curves
203
204
Cost Curves
• In the previous section two ROC curves
were graphed together
• Their performance on samples of different
sizes could be compared
• This comparison didn’t include cost
• It is not straightforward to see how cost
can be incorporated into lift charts or ROC
curves
205
• The book introduces error and cost curves
as a way of including cost in scheme
comparisons
• Even though error and cost curves have a
purpose similar to lift charts and ROC
curves, they are quite different from them
• In trying to understand error curves and
cost curves it’s important not to confuse
them with lift charts and ROC curves
206
The Error Curve
• The book presents an error curve and a
cost curve together in Figure 5.5
• I will deal with the two curves one at a time
• The presentation of error curves will come
first
• Consider the error curve, Figure 5.5a,
shown on the following overhead
207
208
• The error curve shown is based on binary
classification
• In the figure, a yes classification is
symbolized by “+”, so the probability of a
yes is p[+]
• By extension a no is symbolized by “-”
• The solid curve (line) shows the
performance of a classification scheme, A
209
The Axes of the Graph
• The x-axis is the probability of a yes, p[+],
classification in a data set that A is applied
to
• In other words, A’s performance may differ
for different probabilities of yes in the data
set
• The domain is the interval of probabilities
from 0 to 1
210
• The y-axis, the performance measure, is
the expected error
• This is the probability of a misclassification
by A for a given value of p[+]
• The range is the interval of expected
errors from 0 to 1
211
• The expected error is not the count of
errors, but the error rate from 0 to 100%
• The error rates for false negatives and
false positives are given using italics as fn
and fp, respectively
212
Characteristics of the Curve for
Data Mining Algorithm A
• A is shown as linear
• It is not known whether classification
scheme performance is necessarily linear,
but for the purposes of discussion, this
one is
• Also note that the slope of A is not zero
• A’s performance does differ for different
values of x
213
• Under the assumption that A is linear, A is
not hard to plot
• The left hand endpoint, the y intercept, is
defined as the false positive rate, fp, that A
gives when the probability of yes in the
data set is 0
• The y value for the right hand endpoint is
defined as the false negative rate, fn, that
A gives when the probability of a yes in the
214
data set is 1
• For a given probability of yes on the x-axis
between 0 and 1:
• The y value would be the expected
misclassification rate, whether fp or fn, for
that probability p[+]
215
• A is lower on the left than on the right
• For data sets more likely to contain no, A
has fewer errors
• For data sets less likely to contain no, A
has more errors
• In a sense, you might say that the error
curve (line) shows that A is biased
• A is apparently more likely to give a
classification of no than yes
216
Other Things Shown in the
Figure
• The book indicates other things in the
figure
• The horizontal line at the bottom, the
x-axis, would represent the performance of a
perfect predictor—expected error = 0
• The horizontal line at the top would
represent the performance of a predictor
that was always wrong—expected error =
1
217
• The dashed diagonal from the origin to the
upper right would represent the
performance if you always predicted yes
• The dashed diagonal from the upper left to
the lower right would represent the
performance if you always predicted no
218
• The two diagonals seem uninteresting, but
they bring out a useful point
• You don’t have to use the same predictor
across the whole domain
• Below a certain point, simply predicting no
outperforms A
• Above a certain point, simply predicting
yes outperforms A
219
• In theory, you could take a random sample
of a data set and see what the probability
of a yes was
• You could then choose which predictor to
use accordingly
220
The Cost Curve
• An example of a cost curve, Figure 5.5b, is
shown on the following overhead
• It looks somewhat similar to the error
curve of Figure 5.5a
• This is not an accident, but the axes of the
curves are different
• As the name implies, this curve contains
information about costs, not just error
rates
221
222
• It is not reasonable to try and derive or
prove the formulas they use in creating the
cost curve
• It’s also not reasonable to think we’ll come
to much of a theoretical or intuitive
understanding of the formulas
223
• My goal in presenting this will be the
following:
• Try to figure out what practical outcomes
in the graph result from the formulas used
• Then try to figure out how these practical
outcomes make it possible to read the
graph and understand what it represents
overall, even if the derivations are not
given in detail
224
The Axes of the Cost Curve—
The X-Axis
• The starting point for understanding the
cost curve has to be the axes
• The x-axis is labeled as the probability
cost function
• The book’s notation for this is pc[+]
• This is an extension of the notation p[+],
the probability of a yes, which was the
x-axis of the error curve
225
• By way of introduction, consider the
following:
• Yes, we want a curve that graphs cost
• You might assume that you graph cost on
the y-axis against something else entirely
different on the x-axis
226
• It turns out that cost characteristics are
going to be part of the x-axis
• Part of the goal of this discussion will be to
explain how a cost curve can be based on
an x-axis that contains cost information
• The book’s expression for the x-axis,
along with an explanatory version, are
given on the following overhead
227
• pc[+], the x-axis of the cost curve
• = (p[+] x C[-|+]) /
(p[+] x C[-|+] + p[-] x C[+|-])
• = (prob(yes) x cost(FN)) /
(prob(yes) x cost(FN) + prob(no) x cost(FP))
228
• We don’t know what this “means” or
“where it came from”, but certain things
about it can be observed
• pc[+] is a transformation of p[+]
• p[+] appears in both the numerator and
denominator
• You might interpret p[-] as (1 – p[+])
229
• If you did a little simple graphing of the
function, you would find that it has curves
and discontinuities (where the
denominator goes to 0)
• So the x-axis of the cost curve is a
function of the x-axis of the error curve
230
• If you substitute p[+] = 0 and p[+] = 1, you
get the following
• If prob(yes) = p[+] = 0, pc[+] = 0
• If prob(yes) = p[+] = 1, pc[+] = 1
• Expressing the x-axis as a function, this is:
• pc[+] = fcn(p[+])
• For p[+] = 0, fcn(p[+]) = 0
• For p[+] = 1, fcn(p[+]) = 1
231
• In words, the x-axis of the cost curve is a
transformation of the x-axis of the error
curve
• The endpoints of the interval of interest of
the error curve map to the same, desired
endpoints for the interval of interest for the
cost curve, 0 and 1
• In between these endpoints, pc[+], the
x-axis of the cost curve, is actually a
curvilinear function of p[+], the x-axis of
the error curve
232
The Y-Axis
• With the x-axis established, what is the y
value being mapped against it?
• The y-axis represents what is known as
the normalized expected cost
• The normalized expected cost is a function
of the probability cost function
• y = NEC = f(pc[+]) = f(x)
233
• This is the formula for the normalized
expected cost, namely the formula given in
the book for the function that defines the y
coordinate of the graph:
• fn x pc[+] + fp x (1 – pc[+])
• Note that fn and fp are small letters in
italics
• These italicized things are not the counts,
FP and FN
234
• fn and fp are the rates or ratios of false
negatives and false positives
• This was also the notation used in the
error curve graph
• In the error curve, fn and fp varied
according to p[+]
235
• The formula given for the y coordinate of
the cost curve appears quite simple
• However, it would be wise to take into
account that when calculating it, fn and fp
would themselves vary across the range of
p[+]
236
• In practical terms, what have we got?
• The normalized expected cost—
• Is the sum of two terms—
• Each term consisting of two factors—
• Where each of the factors falls between 0
and 1
237
• For a classifier that does at least as well
as guessing, the sum of fp and fn won't be
greater than 1
• You hope it's well less than 1
• In the worst case, fp = fn = 1 and all of
your predictions are in error
• The sum of pc[+] and (1 – pc[+]) is clearly 1
238
• These arithmetic facts mean that the
normalized expected cost can only fall
between 0 and 1
• This is why it’s called normalized
• From a purely pragmatic point of view this
normalization is the reason (perhaps
among others) why the x-axis is defined
the way it is and the function y = f(x) is
defined the way it is
239
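• Not from the book: a short Python version of the normalized expected cost, with made-up rates, makes the normalization easy to check

# Sketch (not the book's code): normalized expected cost for one
# classifier at one value of pc[+].  fn_rate and fp_rate are the false
# negative and false positive *rates*, not the counts FN and FP.
def normalized_expected_cost(fn_rate, fp_rate, pc_yes):
    return fn_rate * pc_yes + fp_rate * (1.0 - pc_yes)

# Both factors in each term lie between 0 and 1, so the result does too:
print(normalized_expected_cost(fn_rate=0.3, fp_rate=0.1, pc_yes=0.6))  # 0.22
print(normalized_expected_cost(fn_rate=0.3, fp_rate=0.1, pc_yes=0.0))  # 0.1
print(normalized_expected_cost(fn_rate=0.3, fp_rate=0.1, pc_yes=1.0))  # 0.3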
What Does the Graph Portray?
• The cost curve is shown again on the
following overhead as a memory refresher
• With the axes defined, what information
can be read from the graph?
240
241
• The line A represents the cost curve for a
particular data mining algorithm
• The x-axis, the probability cost function,
contains information on two things at once:
• The probability of a yes or no for instances
of various different data sets
• The cost of FP and FN for a particular
problem domain
242
• The x-axis doesn’t include any information
about a data mining algorithm or its
performance
• The x-axis contains information about the
domain only (as it should)
243
• The y-axis is the dimension along which
you graph algorithm dependent
information
• The fp and fn in the function being
graphed are performance measures for an
algorithm at given probabilities of yes/no in
the data set
244
• Remember, for example, that if an
algorithm has been trained to lean towards
no, but the data set leans towards yes,
you will tend to get more FN’s and fn will
be higher
• This is the kind of information that the
factors fn and fp include in the formula
245
• After all this blah blah blah, presumably
the following simple description of the cost
curve will make some sense:
• The cost curve graphs the normalized cost
of applying a data mining algorithm to data
sets where the p[+] ranges from 0 to 1
246
What Else is in the Figure?
• Given the explanation so far, what else is
in the figure and what does it signify?
• Once again, the figure is repeated as a
reminder of what’s in it
247
248
• The performance cost function of a given
algorithm is shown as a straight, solid line
• The book still doesn’t say why the lines
would necessarily be straight, but we’ll
take them as they’re given
• There are lines for A and B in the figure,
plus shadowy lines for other hypothetical
data mining algorithms
249
• A and B cross and have different
performance/cost characteristics over
different ranges of p[+] or pc[+] values
• The always yes/always no predictors are
shown as crossed, dashed, diagonal lines
• The curved, dotted line is referred to as an
envelope
250
• The idea behind the shadowy lines and
the envelope is that you might have
multiple data mining algorithms
• If you used each algorithm only in that
range of pc[+] where it was the best, the
envelope is the minimum cost curve over
the problem
251
• Finding the best algorithm to use, given
this envelope, is not hard to accomplish
• A statistical sample will tell you the p[+] for
any given data set out of a problem
domain
• Then use the algorithm that performs best
for that value
252
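• Not from the book: that selection step can be sketched in Python; the (fn, fp) pairs below are invented

# Sketch (not the book's code): pick the algorithm whose cost line is
# lowest at a given value of pc[+].  Each algorithm is summarized by its
# false negative and false positive rates; its cost line is
# fn * pc + fp * (1 - pc).  The envelope is just this minimum taken
# across the whole range of pc[+].
ALGORITHMS = {                     # illustrative numbers only
    "A":          (0.30, 0.10),
    "B":          (0.10, 0.35),
    "always_no":  (1.00, 0.00),
    "always_yes": (0.00, 1.00),
}

def cheapest(pc_yes):
    costs = {name: fn * pc_yes + fp * (1.0 - pc_yes)
             for name, (fn, fp) in ALGORITHMS.items()}
    best = min(costs, key=costs.get)
    return best, round(costs[best], 3)

for pc in (0.05, 0.3, 0.7, 0.95):
    print(pc, cheapest(pc))        # different winners over different ranges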
5.8 Evaluating Numeric
Prediction
• All the discussion up to this point has
concerned evaluation of classification
schemes
• Fundamentally, this is based on counting
errors (incorrect predictions)
• Numeric prediction gives a decimal value
• Evaluation is an exercise in estimating
“how far off” the numeric values are
253
• The general approaches to evaluation still
hold:
• Training set vs. test set
• Holdout of instances for these purposes
• Cross-validation (partitioning the sets n-fold, averaging results across partitions)
• Having done this, the question remains,
how to evaluate the results
254
• The book discusses 7 different measures
of numeric error that can be used
• They all may have a “statistical feel” to
them
• Some may look more familiar than others
• Rather than discuss them all, I will simply
wave my hands at Table 5.8, shown on
the following overhead
255
256
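• Not from the book: several of the measures in Table 5.8 can be computed directly from the predicted and actual values; this sketch uses the standard definitions and invented numbers

import math

# Sketch (not the book's code): a few common numeric-prediction error
# measures, computed directly.  p = predicted values, a = actual values.
def error_measures(p, a):
    n = len(p)
    mean_a = sum(a) / n
    sq_err  = sum((pi - ai) ** 2 for pi, ai in zip(p, a))
    abs_err = sum(abs(pi - ai) for pi, ai in zip(p, a))
    # the relative measures compare against always predicting the mean actual
    sq_dev  = sum((ai - mean_a) ** 2 for ai in a)
    abs_dev = sum(abs(ai - mean_a) for ai in a)
    return {
        "root mean-squared error":     math.sqrt(sq_err / n),
        "mean absolute error":         abs_err / n,
        "root relative squared error": math.sqrt(sq_err / sq_dev),
        "relative absolute error":     abs_err / abs_dev,
    }

print(error_measures(p=[2.1, 0.9, 4.2], a=[2.0, 1.0, 4.0]))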
• So of the seven, which way of evaluating a
numeric scheme is the best?
• On the one hand, answering this question
fully would require statistically analyzing
the errors that occur in a problem domain
for a given data mining algorithm
257
• However, there’s good news
• In practice, if a particular data mining
algorithm tends to be better than others, it
will also tend to win regardless of the
evaluation technique used
• This idea is illustrated in Table 5.9, shown
on the following overhead
258
259
• The previous table does invite one more
question
• How do you know that one data mining
algorithm is better than another when
comparing their evaluation results for
a particular error measure?
260
• The answer is essentially the same as it
was earlier in the chapter
• Since these are statistical measures
themselves, it is possible to use statistical
tests on them
• Given two results, you can check to see
whether the difference between them is
statistically significant, meaning that one
algorithm is actually better than another
261
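• Not from the book: as one example of such a check, a plain paired t statistic can be computed on the per-fold error measures of the two schemes; the fold results below are invented

import math

# Sketch (not the book's code): paired t statistic on the per-fold error
# measures of two schemes evaluated on the same cross-validation folds.
def paired_t(errors_a, errors_b):
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    return mean_d / math.sqrt(var_d / k)   # compare to a t table, k - 1 d.f.

rmse_a = [0.31, 0.29, 0.33, 0.30, 0.32]   # invented per-fold RMSE values
rmse_b = [0.27, 0.26, 0.30, 0.28, 0.29]
print(paired_t(rmse_a, rmse_b))           # large value: the difference looks real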
5.9 Minimum Description
Length Principle
• This discussion stems from the
informational loss function of section 5.6
• This is the informational loss function
• -log2 pi
• As pi  1
• -log2 pi  0
262
• The basic idea was this:
• Let pi represent the probability that a
mining scheme assigns an instance to
classification i
• A data mining scheme is good if the
informational loss function value, -log2 pi,
is small
263
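• Not from the book: the loss function itself is one line of Python; the probabilities below are invented

import math

# Sketch (not the book's code): informational loss for a single instance,
# where p_true_class is the probability the scheme assigned to the class
# the instance actually belongs to.
def informational_loss(p_true_class):
    return -math.log2(p_true_class)

print(informational_loss(0.99))  # near 0: confident and correct
print(informational_loss(0.50))  # exactly 1 bit
print(informational_loss(0.10))  # large: most of the probability was elsewhere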
• Given a data mining scheme that
generates probabilities, pi, the loss
function has an informational interpretation
• -log2 pi is the measure of how much
additional information is needed to
correctly predict all instances that belong
in the class
264
• Another way of putting it is that the data
mining algorithm gives you 1 – (-log2 pi) =
1 + log2 pi information
• -log2 pi is the amount of additional
information needed due to the fact that the
predictions aren’t perfect and they include
(1 – pi) uncertainty
265
The Minimum Description
Length (MDL) Principle Defined
• The principle can be informally stated in
this way:
• The best “theory” for a data set, namely
the best outcome from applying a data
mining algorithm, has the characteristic
stated on the following overhead
266
• The amount of information needed to
encode the theory (the size of the theory)
plus the amount of information needed to
encode those exceptional cases that the
theory doesn’t handle correctly (the loss
due to incorrect predictions) is a minimum
• In brief:
• The size of the theory plus the size of the
exceptions is a minimum
267
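• Not from the book: the comparison the principle calls for can be sketched with invented bit counts; a real application would need genuinely comparable encoding schemes for the theories and their exceptions

# Sketch (not the book's code): compare candidate theories by total
# description length.  All bit counts here are invented for illustration.
CANDIDATE_THEORIES = {
    # name: (bits to encode the theory, bits to encode its exceptions)
    "rule set with no exceptions": (5000, 0),     # huge theory, no errors
    "pruned rule set":             (800, 600),    # small theory, some errors
    "single default rule":         (20, 4000),    # tiny theory, many errors
}

def description_length(theory_bits, exception_bits):
    return theory_bits + exception_bits

best = min(CANDIDATE_THEORIES,
           key=lambda name: description_length(*CANDIDATE_THEORIES[name]))
print(best)   # MDL prefers the minimum total: here, the pruned rule set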
Occam’s Razor
• The previous statement about a good data
mining scheme is related to Occam’s
Razor, which can be stated in this way:
• All else being equal, the simplest theory is
the best
• Occam’s Razor is more accurately stated:
• The theory that makes the fewest
assumptions is the best
268
William of Ockham
From Wikipedia, the free encyclopedia
• William of Ockham ( /ˈɒkəm/; also Occam, Hockham,
or several other spellings; c. 1288 – c. 1348) was an
English Franciscan friar and scholastic philosopher, who
is believed to have been born in Ockham, a small village
in Surrey.[1] He is considered to be one of the major
figures of medieval thought and was at the centre of the
major intellectual and political controversies of the
fourteenth century. Although he is commonly known for
Occam's razor, the methodological principle that bears
his name, William of Ockham also produced significant
works on logic, physics, and theology. In the Church of
England, his day of commemoration is 10 April.[2]
269
William of Ockham, from stained
glass window at a church in Surrey
270
• William of Ockham – Sketch labelled
"frater Occham iste", from a manuscript of
Ockham's Summa Logicae, 1341
271
272
Occam’s Razor Applied to Data
Mining
• Suppose that you have a scheme for
encoding a “theory” of a data set, namely
a rule set, a decision tree, some output of
a data mining scheme
• Add to this the information needed to
handle the cases that the theory doesn’t
correctly predict
273
• Together, this sum is the measure of the
complexity of the theory derived by the
data mining scheme
• The phrase “minimum description length”
is a simple way of describing this
complexity
• The shorter the encoding of the
information, the simpler the theory
274
• The range of “theories” comes from the
application of different data mining
schemes
• At various points throughout the book the
statement has been made that we’re
willing to accept some misclassifications
as long as the scheme is simple enough
275
• Recall the greedy algorithm for making
decisions about splits when forming a tree
• Ultimately, impure leaf nodes might result,
but at least we had a simple heuristic that
didn’t explode computationally and still arrived
at a useful result
• So the question is, where’s the trade-off
point between a good theory and the
number of exceptions it generates?
276
• Here’s one extreme:
• Take the training set and use a technique
like covering that results in rules where
there are no misclassifications
• There are no exceptions and the
complexity is simply the description length
of the rule set
277
• In some sense, this rule set is overtrained
• It is specific to the training set, since it
makes no mistakes on it
• In a sense, the encoding of the rule set is
an encoding of the complete training set
• The information contained in both is
essentially the same
278
• This is the alternative to the extreme case:
• You have an algorithm that gives
misclassifications
• That algorithm can be encoded in a
particular length
• The exceptional cases can also be
encoded in a particular length
279
• The sum of the lengths of the two parts of
the imperfect algorithm is less than the
length of the encoding of the perfect
algorithm
• If this holds true, then Occam’s Razor
says that the imperfect algorithm is
actually the simpler, and preferred
explanation of the data
280
Measures of Goodness
• In some sense the foregoing discussion
has been about finding an absolute
measure of goodness
• It hasn’t been about counting the error rate
for a single algorithm or comparing two
algorithms based on their error rates
281
• It turns out that minimizing the description
length is equivalent to maximizing the
probability—so superficially it sounds like
a tractable statistical problem
• In practice, what makes this hard to
calculate is the fact that you need coding
schemes that are as efficient as possible
in order to make fair comparisons
282
• You can’t fairly compare the MDL’s of two
different algorithms if they were encoded
differently and the encodings weren’t
equally efficient
• In the absence of that, it is possible that
differences in the comparison just result
from differences in the encoding schemes,
not differences in the algorithms
283
Another Philosophical Note
• There is a philosophical principle that is in
competition with Occam’s Razor
• It is known as Epicurus’s principle of
multiple solutions
• A sentence from the Wikipedia article on
Epicurus is given on the following
overhead, which states the principle
284
• Epicurus emphasized the senses in his
epistemology, and his Principle of Multiple
Explanations ("if several theories are
consistent with the observed data, retain
them all") is an early contribution to the
philosophy of science.
285
• The authors of the book point out that
there are advanced data mining
techniques that achieve better results by
combining several different data mining
schemes
• This lends credence to the idea that it’s
worth retaining all different theories that
explain the data
• Some additional goodies from Wikipedia
are on the following overheads
286
Roman marble bust of Epicurus
287
288
• Bust of Epicurus leaning against his
disciple Metrodorus in the Louvre Museum
289
290
Epicurus, Nuremberg Chronicle
291
Hickam’s Dictum
• This is a concrete statement of Epicurus’s
general idea in a specific problem domain,
medicine
• It is a pithy summary of the idea that it can
be a big mistake to try and simplify too
much
• If complexity is there, don’t ignore it
• Wikipedia information is given on the
following overheads
292
Hickam's dictum
From Wikipedia, the free encyclopedia
• Hickam's dictum is a counterargument to the use of
Occam's razor in the medical profession.[1] The principle
is commonly stated: "Patients can have as many
diseases as they damn well please". The principle is
attributed to John Hickam, MD. Hickam was a faculty
member at Duke University in the 1950s, and was later
chairman of medicine at Indiana University.[2]
293
• When discussing Occam's razor in
contemporary medicine, doctors and
philosophers of medicine speak of diagnostic
parsimony. Diagnostic parsimony advocates that
when diagnosing a given injury, ailment, illness,
or disease a doctor should strive to look for the
fewest possible causes that will account for all
the symptoms. However, this principle has very
important limits in medical practice.
294
• The actual process that occurs when diagnosing a
patient is a continuous flow of hypothesis and testing of
that hypothesis, then modifying the hypothesis, and so
on. In the context of this method, the principle of
Hickam's dictum asserts that at no stage should a
particular diagnosis be excluded solely because it
doesn't appear to fit the principle of Occam's razor. The
principle of Occam's razor, or parsimony, does not
demand that the diagnostician necessarily opt for the
simplest explanation, but instead guides the medical
practitioner to seek explanations, without unnecessary
additional assumptions, which are capable of accounting
for all relevant evidence.
295
• A key reason for using Hickam's dictum as
a limiting principle to that of Occam's razor
is that it is often statistically more likely
that a patient has several common
diseases rather than having a single, rarer
disease that explains his myriad
symptoms. Another key reason is that,
independent of statistical likelihood, some
patients do in fact turn out to have multiple
diseases.
296
• In such cases, multiple categories of
diagnosis may indeed have independent
causes rather than a single source, i.e.,
may be due to separate events or
combinations of events to which the
patient may have been subjected or
exposed. Thus, Hickam's dictum provides
physicians with a counterbalancing
principle to the unfettered use of Occam's
razor in diagnosis.
297
• An example of the utility of Hickam's
dictum is Saint's triad of hiatus hernia,
gallbladder disease, and diverticulosis.
C.F.M. Saint was a British surgeon. His
triad has no known pathophysiological
relationship, nullifying the usefulness of
Occam's razor.[3] Hickam's dictum is
similar in principle to Walter Chatton's anti-razor.
298
The End
299