
Data Mining
Chapter 5
Credibility: Evaluating What’s
Been Learned
Kirk Scott
1
• This set of overheads consists of just one
section in the book:
• 5.7 Counting the Cost
• There are approximately 200 overheads
• This is a long and important section
5
5.7 Counting the Cost
• Preliminary note:
• The book discusses cost in this section
• It also discusses things that don’t really
involve cost
• Incidentally, it also discusses things that
would involve (differential) cost, but
simplifies by using cost coefficients of 1
and 0 rather than anything more
complicated
6
• In a way, the contents of this section might
come across as a mish-mash
• Hopefully the overheads will add clarity to
the book’s presentation
• Whether mish-mash or not, ultimately the
topics covered will be important because
they are aspects of what is available in
Weka
7
Fundamental Cost Ideas
• For any prediction scheme there will be
successes and failures, that is, successes
and errors
• Consider a binary classification scheme
where the output is yes/no, true/false,
positive/negative
8
Two Kinds of Successes
• Correct predictions of positive, true
positives, TP
• Correct predictions of negative, true
negatives, TN
• Quite often, the cost (i.e., the benefit) of
the two kinds of successes is taken to be
the same
• You can deal with TP + TN together
9
Two Kinds of Errors
• Incorrect predictions of positive, false
positives, FP
• Incorrect predictions of negative, false
negatives, FN
• In virtually every applied situation the
costs of false positives and false negatives
materially differ
10
Difference in Error Costs
• The book illustrates the idea with several
examples
• One is direct mail advertising, which it
uses again later in the section
• Consider a mailing to a predicted potential
customer who doesn’t respond
• This is a false positive
• The individual cost of the unsuccessful
mailing is low
11
• Consider a mailing that was never sent to
what would have been a customer
• This is a false negative
• The algorithm incorrectly predicted
negative for that potential customer
• The cost of the lost business is high
compared to the cost of a mailing
12
• In order to stay in business, the cost of
mailing has to be less than the cost of
business generated
• You can afford lots of false positives,
although in aggregate their cost adds up
• In this domain the cost of lost opportunity
is invisible, but in practical terms, the cost
of a few false negatives quickly outweighs
the cost of many false positives
13
Confusion Matrices
• The idea of a confusion matrix will come
up more than once in this section
• A confusion matrix is a graphical way of
summarizing information about successes
and failures in prediction
14
• The simplest of confusion matrices, one
for two classifications, is illustrated in
Table 5.3, shown on the following
overhead
• The entries in the matrix are the counts of
the different kinds of successes and
failures for a given situation
15
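• As an illustration, here is a minimal sketch (in Python, with made-up yes/no data rather than anything from the book) of how those counts are tallied from actual and predicted classes:

from collections import Counter

# Hypothetical actual and predicted classes for a two-class problem.
actual    = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

# Each cell of the confusion matrix is the count of one (actual, predicted) pair.
counts = Counter(zip(actual, predicted))

print("            predicted yes  predicted no")
print(f"actual yes  {counts[('yes', 'yes')]:>7} (TP)  {counts[('yes', 'no')]:>6} (FN)")
print(f"actual no   {counts[('no', 'yes')]:>7} (FP)  {counts[('no', 'no')]:>6} (TN)")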
16
• Next the book presents Table 5.4, which
consists of parts (a) and (b)
• I will break up the presentation
• Consider Table 5.4a, shown on the
following overhead
• It shows the confusion matrix resulting
from the application of a non-binary
classification scheme
17
18
• The rows represent the actual class for the
instances in a data set
• The columns represent the class
predictions given by the classification
scheme
• The true predictions are on the main
diagonal of the matrix
• The errors are off of the main diagonal
19
False Positives
• Look at column a, for example
• The entry for row a, 88, is the count of the
number of correct predictions
• The entries for rows b and c, 14 and 18,
are the counts of false positives with
respect to class a
• The algorithm falsely predicted this many
instances of b and c were of class a
20
False Negatives
• Now look at row a
• The entries for columns b and c, 10 and 2,
are the counts of false negatives with
respect to class a
• The algorithm falsely predicted this many
instances of class a were in classes b and
c
21
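• Here is a minimal Python sketch of that row/column bookkeeping, using the Table 5.4(a) counts quoted on these overheads and assuming numpy is available:

import numpy as np

# Table 5.4(a): rows are actual classes a, b, c; columns are predicted classes.
m = np.array([[88, 10,  2],
              [14, 40,  6],
              [18, 10, 12]])

for i, name in enumerate("abc"):
    tp = m[i, i]               # correct predictions for this class
    fn = m[i, :].sum() - tp    # row errors: actual class i predicted as something else
    fp = m[:, i].sum() - tp    # column errors: other classes predicted as class i
    print(name, "TP =", tp, "FN =", fn, "FP =", fp)
# For class a this prints TP = 88, FN = 12 (10 + 2), FP = 32 (14 + 18)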
Food for Thought
• Now look at row b
• The entry for column a, 14, is a false
negative with respect to class b
• This entry was identified as a false positive
with respect to class a
22
• So for a classification that isn’t binary,
whether something is a false positive or
false negative is a matter of perspective
• Geometrically, it depends on whether
you’re looking from the perspective of a
row or a column
23
• The rows represent the actual
classification
• Taken from the perspective of a row, an
error entry is a failure to correctly place
something into that classification
• This is a false negative
24
• The columns represent the predicted
classification
• Taken from the perspective of a column,
an error entry is an incorrect placement of
something into that classification
• This is a false positive
25
• The row b, column a entry is an error,
a false positive or a false negative, depending
on perspective
• It is simply a unique error case, distinct
from the other error entries in the table
26
• It might have a cost associated with it that
is distinct from the cost of any other error
entry in the matrix
• At this point, the kind of error doesn’t really
matter
• However, the kind of error may eventually
be figured into the calculation of the cost
27
Evaluating Classification Performance
with a Confusion Matrix
• Next, the book shows a way of evaluating
the performance of a classification scheme
• This evaluation method has nothing to do
with costs
• It’s simply an application built on top of,
and graphically illustrated by, confusion
matrices
28
• Previous sections of the book evaluated
data mining algorithm performance by
estimating error rates of classification
schemes
• It also considered comparing the
performance of two different schemes
29
• The idea presented now is that the
performance of a data mining algorithm
can be evaluated by comparing its results
to the results of random classification
• The results of an actual predictor will be
compared with a hypothetical predictor
that does classification randomly
30
• Table 5.4 (a) and Table 5.4 (b) combined
illustrate the idea
• Table 5.4 (a) represents the actual
predictor
• Table 5.4 (b) represents the hypothetical
predictor
• These tables are shown on the following
overhead
• Explanations follow that
31
32
• First consider the bottom row of 5.4 (a)
• The values 120, 60, and 20 are the totals
predicted for classes a, b, and c overall
• Now consider the rightmost column of 5.4
(a)
• 100, 60, and 40 are the counts of the
actual occurrence of instances of classes
a, b, and c
33
• Now consider Table 5.4 (b)
• The bottom row of 5.4 (b) is the same as
the bottom row of 5.4 (a)
• The totals predicted for the classes are the
same
• The rightmost column of 5.4 (b) is the
same values as the rightmost column of
5.4a
• The count of the actual number of instances of each class does not change
34
• Table 5.4 (a) represents the actual predictor we are trying to evaluate
• Table 5.4 (b) represents a hypothetical
predictor that we are comparing it against
• The fact that the bottom rows of the two
tables are the same means that the
hypothetical predictor of interest gives the
same overall count of predictions for each
class
35
• It’s the bodies of the two tables that differ
• They differ in how the correct and incorrect
predictions for each class are distributed
• The hypothetical predictor’s values work in
this way:
• The values in the rows for every actual
class are distributed in proportion to the
overall totals in the bottom row
36
• 120 is to 60 is to 20
• As 60 is to 30 is to 10
• As 36 is to 18 is to 6
• As 24 is to 12 is to 4
• These sets of values represent the so-called random predictor
• (It’s not exactly random. It’s based on the
prediction counts for the actual scheme.)
37
• The random predictor is the point of
comparison
• Can you make a quantitative comparison
between the performance of the actual
classification scheme and the random
predictor?
• If your classification scheme isn’t better
than random, you’ve got a big problem
• Your predictor is sort of an anti-predictor
38
Comparing with the Random
Predictor
• Sum the counts on the main diagonal of
the matrix for the actual scheme
• It got 88 + 40 + 12 = 140 correct
• Do the same for the random predictor
• It got 60 + 18 + 4 = 82 correct
• Clearly the actual predictor is better
• Can this be quantified?
39
• There are 200 instances altogether
• The random predictor had 200 – 82 = 118
incorrectly classified instances
• The actual predictor got 140 – 82 = 58
more correct than the random predictor
• Of the 118 incorrectly classified by the
random predictor, the actual predictor
classified 58 of them correctly
40
• You measure performance by finding the
ratio 58/118 = 49.2%
• This ratio is known as the Kappa statistic
• For a predictor that is no worse than
random, the value of the statistic can
range from 0 to 1
• 0 means no better than random
• 1 means perfect prediction
41
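• A minimal Python sketch of the whole Kappa computation just described, reconstructing Table 5.4(a) and the proportional Table 5.4(b) predictor (numpy assumed):

import numpy as np

# Table 5.4(a): rows = actual, columns = predicted.
observed = np.array([[88, 10,  2],
                     [14, 40,  6],
                     [18, 10, 12]])

total = observed.sum()                    # 200 instances
actual_totals = observed.sum(axis=1)      # 100, 60, 40
predicted_totals = observed.sum(axis=0)   # 120, 60, 20

# Table 5.4(b): distribute each actual-class row in proportion to the
# overall prediction totals in the bottom row.
expected = np.outer(actual_totals, predicted_totals) / total

observed_correct = np.trace(observed)     # 88 + 40 + 12 = 140
expected_correct = np.trace(expected)     # 60 + 18 + 4  = 82

kappa = (observed_correct - expected_correct) / (total - expected_correct)
print(round(kappa, 3))                    # 0.492, i.e., 58 / 118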
Counting as Opposed to Cost
• The method given here doesn’t count the
cost of FP or FN
• It is similar to previous approaches
• It simply counts the successes vs. the
errors and compares with the hypothetical
case
42
• What we just looked at were confusion
matrices
• These structures contained counts of
correct and incorrect classifications
• It was possible to compare these values
for an actual and a hypothetical predictor
43
Cost-Sensitive Classification
and Cost Matrices
• The next step will be to include cost in the
performance calculation
• Suppose you don’t assign different cost
values to different success types (TP, TN)
and different failure types (FP, FN)
• Cost of success = 0
• Cost of failure = 1
44
• The simple cost matrices for the two-class
and three-class cases are given in Table
5.5, shown on the following overhead
45
46
Performance Evaluation with
Costs
• Suppose you have a data mining scheme
that gives straight classification predictions
• With a test set, you have both a prediction
and the known classification
• Up until now, performance was evaluated
on the basis of a count of errors
47
• Potentially, each element of the cost
matrix could contain a different decimal
value which was the weight for that case
• Successes (benefits) could be positive
• Failures/errors (costs) would be negative
• You would evaluate performance by
summing up the total of successes and
errors multiplied by the corresponding
weight in the cost matrix
48
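• A minimal Python sketch of that weighted sum, reusing the Table 5.4(a) counts with a made-up weight matrix (the weights are illustrative only; numpy assumed):

import numpy as np

# Confusion matrix (rows = actual, cols = predicted), as in Table 5.4(a).
confusion = np.array([[88, 10,  2],
                      [14, 40,  6],
                      [18, 10, 12]])

# Hypothetical weight matrix: positive benefits on the diagonal (successes),
# negative costs off the diagonal (errors); the values are made up.
weights = np.array([[ 1.0, -2.0, -2.0],
                    [-5.0,  1.0, -2.0],
                    [-5.0, -5.0,  1.0]])

# Overall score: every count multiplied by its weight, then summed.
score = (confusion * weights).sum()
print(score)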
Complete Financial Analysis
• Data mining doesn’t exist in a vacuum
• It is a real technique applied in businesslike situations
• Suppose you were comparing the cost of
two different algorithms
• Not only might the costs associated with
the predictions be taken into account
49
• The better performing algorithm might
have data preprocessing or computational
costs that are significant
• Its predictions might be X dollars more
valuable than those of another scheme
• But the comparative costs of running the
algorithm may be greater than X
• A complete analysis would take this into
account when picking an algorithm
50
Classification Including Costs
• This section represents a shift of focus
• So far we’ve been concerned with doing a
cost evaluation of prediction performance
after the fact
• Now we want to try and take cost into
account when making the prediction
51
• Suppose you’re using a classification
scheme that generates probabilities
• Up until now, if you had to classify under
this condition, you picked the classification
with the highest probability
• With cost information, another possibility
presents itself
52
• For a given instance that you want to
classify:
• pi is the predicted probability that the
instance falls into the ith class
• Then for each class, (1 – pi) is the
probability of a false positive
• In other words, classify on the basis of
minimum cost
53
• Multiply each of the (1 – pi) by the cost
factor of a misclassification (false positive)
• Do not classify according to the largest pi
• Instead, give the instance the classification
where the product of the probability of a
misclassification and the cost of a
misclassification is lowest
54
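• A minimal Python sketch of this decision rule, with made-up probabilities and misclassification costs chosen so that the minimum-cost choice differs from the maximum-probability choice (numpy assumed):

import numpy as np

# Hypothetical predicted probabilities for one instance over three classes.
p = np.array([0.5, 0.4, 0.1])

# Hypothetical cost of a false positive for each class.
misclassification_cost = np.array([4.0, 1.0, 1.0])

# Expected cost of assigning class i: probability of being wrong times its cost.
expected_cost = (1 - p) * misclassification_cost   # [2.0, 0.6, 0.9]

print(np.argmax(p))               # 0: the maximum-probability choice
print(np.argmin(expected_cost))   # 1: the minimum expected cost choice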
Probabilities vs. Straight
Classification
• The book notes that there are ways of
converting the results of straight
classification schemes into probabilities
• It doesn’t give details (although it’s pretty
clear this would involve counting and
dividing in Bayesian style)
• The idea is that you could then apply cost
metrics when deciding classification for
those schemes
55
Summary of the previous idea
• The approach for factoring cost into
predictions given thus far could be
summarized as follows:
• When running the data mining algorithm,
do not take cost into account when
deriving the rules
• However, after the rules are derived, take
cost into account when using them to
make a classification
56
Cost-Sensitive Learning
• Now we would like to try and take cost into
account when deriving the rules
• It is possible to follow a line of reasoning
that explains how this is accomplished
• Not all the details will be given
• However, the approach can be outlined
based on concepts that have already been
presented
57
There are Schemes that give
Probabilities as Results
• We have repeatedly been told that some
data mining algorithms will give
probabilities instead of classifications
• Naïve Bayes gave these kinds of results
• We infer that there are other, more
sophisticated data mining schemes that
also give probabilistic results
• In other words, probability-based
schemes have general application
58
What about the Loss Function?
• In this chapter the idea of a loss function
was introduced
• The statement was made that the way to
improve a classification scheme was to
minimize the loss
• In fact, we have not yet been told how
data mining schemes can be guided to
give better results by a process of
minimizing loss
59
Cost and Loss
• Cost is external to a classification scheme
• There are costs in the real world
• Total cost depends on how successful or
unsuccessful classification is
60
• Loss is internal to a classification scheme
• It is a measure of how greatly predicted
classification probabilities differ from
observed actual classifications
• Cost and loss are not the same, but they
are related
61
• When evaluating a scheme, the cost is
calculated by multiplying the probability of
an outcome times the cost of that outcome
• If the mining scheme is guided by
minimizing loss, that means it’s guided by
improving the probability estimates
• This will tend to minimize cost
62
Making the Jump—Bias to Cost
• This is where you make a jump
• Recall this idea from earlier in the chapter:
• You achieve quality evaluations of a
scheme by cross-validation
• If you want accurate performance
estimates, the distribution of classifications
in the test set should match the distribution
of classifications in the training set
63
• If there was a mismatch between the training
and test sets, some test set instances might
not classify correctly
• The mismatch in the sets means that the
training set was biased
• If it was biased against certain
classifications, this suggests that it was
biased in favor of other classifications
64
• Note that bias is closely related to the
concept of fitting or overfitting
• In nonjudgmental terms, there is simply a
mismatch between training and test set
• In more judgmental terms, the training set
is overfitted or biased in favor of giving
some predictions or probabilities which are
not as prevalent in the test set
65
Including Cost in Rule Derivation by
Intentionally Using Bias
• The approach to including cost in the
derivation in classification rules is this:
• Intentionally bias the training set by
increasing the representation of some
classifications
• The algorithm will produce rules more
likely to correctly predict these
classifications
66
• The practical questions are these:
• Which classifications to over-represent?
• In general you want to over-represent
classifications where misclassification has
a low cost
• How much to over-represent them?
• The answer to this is pretty much just
empirical rule of thumb
67
Concretely…
• Take this as a concrete example:
• Suppose that false positives are more
costly than false negatives
• Over-represent “no” instances in the
training set
• When you run the classification algorithm
on this training set, it will overtrain on no
68
• That means the rules derived by the
algorithm will be more likely to reach a
conclusion of no
• Incidentally, this will increase the number
of false negatives
• However, it will decrease the number of
false positives for these classifications
• In this way, cost has been taken into account in the rules derived, reducing the likelihood of expensive FP results
69
• The book notes that you aren’t limited to
accomplishing this by over-representation
• You can allow duplicate instances in the
training set (like sampling with
replacement)
• In practice, some data mining schemes
just allow heavier or lighter weights to be
assigned to instances of given
classifications
70
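• A minimal Python sketch of biasing a training set this way; the instances and the 3-to-1 over-representation of no are made up for illustration:

# Hypothetical training set: (attribute values, class) pairs.
training = [({"age": 30}, "yes"), ({"age": 50}, "no"), ({"age": 41}, "no"),
            ({"age": 25}, "yes"), ({"age": 63}, "no")]

# Suppose false positives are the expensive error: duplicate "no" instances
# so the learner leans toward predicting no.
copies = {"yes": 1, "no": 3}
biased_training = [inst for inst in training for _ in range(copies[inst[1]])]

print(len(training), len(biased_training))   # 5 instances become 11

# A scheme that supports instance weights could be given copies[label]
# as a weight directly instead of physically duplicating instances.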
Cost and Loss, Again
• Technically, we still didn’t use an algorithm
that was guided by loss
• Skewing the sample was the proxy for
tuning loss, which in turn was also the
proxy for cost in the rule derivation
71
Lift Charts
• The term lift describes certain techniques
based on the following idea
• You may not be able to perfectly classify
instances and identify only those with the
desired classification
• However, you may be able to identify
subsets of a data set where the probability
of the desired classification is higher
72
• Consider the direct mailing example which
was mentioned quite a while back
• Suppose you know as a starting point that
out of a population (training set/test set) of
1,000,000, there will be 1,000 yeses
• This is a response rate of .1%
73
Defining the Lift Factor
• Now suppose you have a data mining
algorithm that gives probabilities of yes for
the instances
• This means you can identify people more
likely to respond yes and those less likely
to respond yes
• You can identify subpopulations (subsets, samples) with a greater likelihood of yes
74
• The proportional increase in the response
rate is known as the lift factor
• This idea is summarized in the table on
the following overhead
75
Sample       Yeses   Response Rate   Lift Factor
1,000,000    1,000   .1%
400,000        800   .2%             2
100,000        400   .4%             4
76
• Assuming that you can do this, note this
obvious result:
• You’ve increased the response rate—good
• You’ve accomplished this by reducing the
sample size—also good
• On the other hand, the absolute number of
yeses has dropped
• In other words, you are excluding some
yeses from consideration
77
Finding Lift Factors
• Given a training set or population, how do
you come up with the figures for lift
factors?
• The data mining algorithm gives a
probability of a yes
• The training data also tells you whether a
given instance is actually yes
78
• Rank all of the instances by their
probability and keep track of their actual
class
• Note that we’re operating under the
(reasonable) assumption that the data
mining algorithm really works
• The higher the predicted probability, the
more likely an instance really is to take on
a certain value
79
• Table 5.6, given on the following
overhead, shows instances ranked by their
predicted probability, and with their actual
classification also listed
80
81
• Using a table like this, you can find the lift
factor for a given sample size, or you can
find the sample size for a given lift factor
82
Finding the Lift Factor for a
Given Sample Size
• If you want the lift factor for a particular
sample size, just count down the list until
you reach that sample size
• The response rate = count(yeses) /
count(number of instances in sample)
• The lift factor is response rate(sample) /
response rate(population)
83
Finding the Sample Size for a
Given Lift Factor
• If you want a particular lift factor, keep a
running computation of response rate and
lift factor as you go down the list
• When you reach the desired lift factor, the
number of instances you’ve gone through
is your sample size
84
Making a Lift Chart
• To make a lift chart, you need the lift factor
for each possible sample size
• In other words, work your way down the
ranked list one instance at a time
• For each one, calculate the response rate
so far and the lift factor so far
• You then plot the results of this
85
• The x-axis is the percent of the population
you’ve worked through
• The y-axis is the actual number of yes
responses generated
• This plot is what is known as a lift chart
• Figure 5.1 shown on the following
overhead illustrates the idea for the direct
mailing example
86
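• A minimal Python sketch of that walk down the ranked list, using a small made-up list of actual classes in place of Table 5.6:

# Actual classes of instances, already ranked by predicted probability of yes
# (most likely first); the values are made up, not Table 5.6's.
ranked_actuals = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "no", "no"]

population_rate = ranked_actuals.count("yes") / len(ranked_actuals)

# For every possible sample size, compute the response rate and lift factor;
# plotting sample size against yeses-so-far gives the lift chart.
yes_so_far = 0
for n, actual in enumerate(ranked_actuals, start=1):
    yes_so_far += actual == "yes"
    response_rate = yes_so_far / n
    lift = response_rate / population_rate
    print(n, yes_so_far, round(response_rate, 2), round(lift, 2))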
87
What a Lift Chart Tells You
• The lift chart tells you something about
your data mining algorithm
• If the algorithm is any good at all, the
curve should be above the diagonal
• Otherwise, the instances the algorithm
ranks as likely yeses have a success rate
lower than a random sample would give
88
• In general, the closer the lift chart curve
comes to the upper left hand corner, the
better
• A minimal sample size and a maximum
response rate is good
• The hypothetical ideal would be a mailing
only to those who would respond yes,
namely a 100% success rate, with no one
left out (no false negatives)
89
Cost Benefit Curves
• The costs and benefits of differing sample
sizes/mailing scenarios will differ
• As noted earlier, the cost of the algorithm
alone is a factor
• But for the time being we will ignore it
• The assumption is that we have already
done the work of running the algorithm
• We want to know what this completed
work can tell us
90
• The cost of interest now is the cost of
mailing an individual item
• We’ll assume that the cost is constant per
item
• The benefit of interest is the value of the
business generated per yes response
• We’ll assume that the benefit is constant
per positive response
91
• It is now possible to form a cost/benefit
curve across the same domain (percent of
sample size) as the lift chart
• Figure 5.2, given on the following
overhead, shows lift charts on the left and
cost/benefit curves on the right for various
scenarios
• Figure 5.2 will be discussed in greater
detail on the overheads following it
92
93
Comments on Figure 5.2
• For Figure 5.2, a benefit of +15.00 per
positive response is assumed
• The charts at the top assume a mailing
cost per item of .50
• A lift chart is shown on the left and a
cost/benefit curve is shown on the right
• The cost/benefit curve keeps on rising all the way to 100% of the population, so it makes sense to send mail to everybody if you can afford it
94
• The charts at the bottom assume a cost
per item of .80
• A lift chart is shown on the left and a
cost/benefit curve is shown on the right
• The cost/benefit curve has a peak in the
middle
• This means you only want to mail to this
proportion of the population
95
• Keep in mind that the sample is selected
by ranking by the algorithm, not randomly
• This means roughly speaking (gauging by
the shape of the curve)
• You would like to mail to those prospective
customers in the top half of the data set
• Mailing to more than these gives
diminishing returns
96
What Else is in Figure 5.2?
• The details of Figure 5.2 are impossible to
see
• In brief, the GUI includes the following:
• A confusion matrix
• A cost matrix where you input the costs of
TP, TN, FP, FN
• A slider for selecting the proportion
• Numeric results of the computation as well
as the graphs
97
A Side Note on Where You
Might Be in Exploring Weka
• Figure 5.2 was generated by Weka
• If you haven’t downloaded and installed
Weka yet, it’s not too soon to do so
• The book also has links to the example
data sets, already in ARFF format
• Looking through the GUI, you can find the
options to run various data mining
algorithms
98
• You can also find the options to run
various tools that allow you to explore and
compare the results of running the
algorithms
• There are no homework assignments, so
it’s up to you to start looking into these
things on a schedule that agrees with you
99
A Preliminary Note on the
Project
• The syllabus contained some information
on the general form of the project
• Here is a little more preliminary detail
• You can do the project using the UCI data
sets or any of the other data sets made
available on the Weka Web site
• If you do so, you are on the hook for the
full 8 combinations of data and algorithm
100
• If you use another data set, the number of
combinations you have to run will be
reduced
• In either case, for full points you will also
have to do some exploring and comparing
of results
• Further details on these aspects of the
project will be posted later
101
ROC Curves
• The acronym ROC stands for receiver
operating characteristic
• This term originated in the analysis of
signal to noise ratio in a communication
channel
• It is a way of comparing true positives to
false positives
102
• In general terms, you can think of it as
“pushing” for greater positives and
measuring your success against the false
positives generated
• An ROC curve is similar to a lift chart
• It provides a way of drawing conclusions
about one data mining algorithm
103
• Multiple ROC curves can also be used to
compare data mining algorithms
• ROC curves also graphically illustrate a
means of getting the best possible results
over a range of parameter values by
combining two different data mining
algorithms
• For starters, take a look at the data in
Table 5.6 again (repeated on the following
104
overhead)
105
• For those instances with a predicted
probability > .5 the prediction is yes
• (The book never talks about whether the
prediction is reversed when the predicted
probability < .5)
106
• Anyway, looking at the probability
predictions vs. the actual class in the
table, the TP’s are clearly identified
• Likewise, the FP’s are clearly identified
• Looking at the right hand column, the
universes of actual positives and actual
negatives are also clearly identified
107
• Incidentally, you could break things down
in this way:
• Actual positives = TP + FN
• Actual negatives = TN + FP
• The book does this in part of its
explanation
• Most of the time I will just refer to actual
positives and negatives
108
Forming the ROC Curve
• Let a set of data be given that has been
sorted by the probability prediction like in
Table 5.6
• The ROC curve plots the TP rate on the y-axis and the FP rate on the x-axis
• The ROC curve for the Table 5.6 data is
given in Figure 5.3, shown on the following
overhead
• The jagged line represents a discrete plot of the values in the table
109
110
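• A minimal Python sketch of computing the jagged ROC points from a ranked list (the actual classes here are made up, not Table 5.6's):

# Actual classes ranked by predicted probability of yes, most likely first.
ranked_actuals = ["yes", "yes", "no", "yes", "yes", "no", "no", "yes", "no", "no"]

total_pos = ranked_actuals.count("yes")
total_neg = len(ranked_actuals) - total_pos

# Step down the ranking; each prefix contributes one (FP rate, TP rate) point.
tp = fp = 0
points = [(0.0, 0.0)]
for actual in ranked_actuals:
    if actual == "yes":
        tp += 1
    else:
        fp += 1
    points.append((fp / total_neg, tp / total_pos))

for x, y in points:
    print(f"FP rate {x:.2f}  TP rate {y:.2f}")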
Similarity between the ROC
Curve and the Lift Chart
• Note that the ROC curve is similar to the
lift chart curve shown earlier
• In the lift chart the x-axis was the sample
size as a percent of the population
• In the ROC curve the x-axis is the # of FP
as a % of the actual number of negatives
• The similarity here stems from the fact that in the direct mailing example the number of negatives for a sample or a population is nearly the size of the total
111
The Upper Left is Good
• With a different example an ROC curve
and a lift curve may not be so similar
• However, there is still an overall similarity:
• The upper left hand corner of the graph is
“good”
• The graph displays TP vs. FP
• You’d like to maximize TP and minimize
FP, which would occur in the upper left
112
Arriving at the Smooth Curve
• Figure 5.3 is repeated on the following
overhead for reference
• Note the smooth curve that is shown in
addition to the jagged one
113
114
• The jagged line represented a single
display of one data set
• For general analytical purposes, you might
assume that across the whole problem
domain, with different data sets, the
relationship between TP and FP would be
a smooth curve
• This is shown in the figure as a dashed
line
115
• The book gives several explanations of
how to get the smooth curve
• The basic idea is to use cross-validation
• To me the explanations are foggy
• I’d like to appeal to a visual argument
again
116
• Suppose you did 10-fold cross-validation
• You’d have 10 test sets
• If you graphed TP vs. FP for each test set,
you’d get 10 different jagged lines
• If you averaged the y values for each x
value, the results would approximate a
smooth curve
• Since the axes are percents, the shape of the curve would be the same as for the population
117
ROC Curves for Different
Learning Schemes
• Figure 5.4, given on the following
overhead, shows ROC curves for two data
mining schemes
• Scheme A is further to the upper left for
smaller sample sizes
• Scheme B is further to the upper left for
larger sample sizes
• You can compare and choose between the two if you have a preference based on sample size
118
119
• The shaded area on the graph is referred
to as the convex hull of the two curves
• Part of the hull lies outside the boundaries
of the curves for A and B
• For sample sizes in that range, it is
possible to achieve ROC results on the
outer edge with a linear combination of A
and B
120
• The outer boundary of the hull is tangent
to A and B
• Let (x1, y1) and (x2, y2) be the tangent
points for A and B, respectively
• Suppose you wanted to find the point 1/nth
of the way from (x1, y1) to (x2, y2)
121
• xnew = x1 + (1/n)(x2 – x1)
• ynew = y1 + (1/n)(y2 – y1)
• xnew = (1/n)x2 + (x1 – (1/n)x1)
• ynew = (1/n)y2 + (y1 – (1/n)y1)
• xnew = (1/n)x2 + (1 – 1/n)x1
• ynew = (1/n)y2 + (1 – 1/n)y1
122
• 1/n and (1 – 1/n) can work like
probabilities
• They sum to 1
• You can use a Monte Carlo like approach
to achieving the best possible ROC value
for any point between the tangent points to
A and B
123
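• A minimal numeric sketch in Python of the interpolation above, with made-up tangent points for A and B and n = 4:

# Hypothetical tangent points where the hull touches the curves for A and B.
x1, y1 = 0.15, 0.60   # scheme A (FP rate, TP rate)
x2, y2 = 0.45, 0.85   # scheme B

n = 4   # the point 1/4 of the way from A's tangent point toward B's

x_new = (1 / n) * x2 + (1 - 1 / n) * x1   # 0.225
y_new = (1 / n) * y2 + (1 - 1 / n) * y1   # 0.6625

print(x_new, y_new)

# Operationally: for each instance use scheme B with probability 1/n and
# scheme A with probability (1 - 1/n); on average the combined scheme
# operates at (x_new, y_new) on the ROC plot.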
• This would be cross-validation like in
nature
• For multiple data sets, apply A at the ratio
of FP to FN that gives its tangent point
• For multiple data sets, apply B at the ratio
of FP to FN that gives its tangent point
124
• For a given problem where you want an
ROC (ratio of FP to FN) that lies 1/nth of
the way from one tangent point to the
other:
• Let the ratio of B data sets to A data sets
be 1/n to (1 – 1/n)
• Combine the results of the computations
generated in this proportion and it will be
the desired point on the line
125
• Note that the foregoing overheads are my
attempt to explain what the book says
• I don’t claim that they are wrong
• I did find the book’s explanation kind of
incomplete
• My review of it is no better
• There seem to be some missing links
126
Recall-Precision Curves
• Even though the section heading is recall-precision curves, we won’t actually see a graph in this section…
• The general idea behind lift charts and
ROC curves is trade-off
• They are measures of good outcomes vs.
unsuccessful outcomes
• This is cost-benefit in simple terms (not in
terms of a cost factor and cost equation)
127
• Trying to measure the tradeoff between
desirable and undesirable outcomes
occurs in many different problem domains
• Different measures may be used in
different domains
• In the area of information retrieval, two
measures are used:
• Recall and precision
128
• Recall = (# of relevant docs retrieved) / (total # of relevant docs)
• Precision = (# of relevant docs retrieved) / (total # of docs retrieved)
129
• Notice the general idea illustrated by these
measures and inherent in the other ones
we’ve looked at:
• You can increase the number of relevant
documents you retrieve by increasing the
total number of documents you retrieve
• But as you do so, the proportion of
relevant documents falls
130
• Consider the extreme case
• How would you be guaranteed of always
retrieving all relevant documents?
• Retrieve all documents in total
• But you’ve gained nothing
• If you retrieve all documents, the task still
remains of winnowing the relevant
documents from the rest
131
Discussion
• The book notes that in non-computer
fields, like medical testing, the same kinds
of ideas crop up
• For a given medical test:
• Sensitivity = proportion of people with the
disease who test positive
• Specificity = proportion of people without
the disease who test negative
• For both of these measures, high is good
132
• Incidentally, these concepts also remind
me of association rule mining
• Support and confidence seem to have
something in common with the
dichotomies presented in this section
133
One-Figure Measures
• In addition to 2-dimensional graphs or
curves like lift charts and ROC curves,
there are also techniques for trying to
express the goodness of a scheme in a
single number
• For example, in information retrieval there
is the concept of “average recall”
measures
134
• Average recall measures are actually
averages of precision over several recall
values
• Three-point average recall is the average
precision for recall figures of 20%, 50%,
and 80%
• Eleven-point average recall is the average
precision for figures of 0%-100% by 10’s
135
• The book also cites the F-measure
• (2 x recall x precision) / (recall + precision)
• = (2 x TP) / (2 x TP + FP + FN)
• If you go back to the definitions you can
derive the 2nd expression from the 1st
• Looking at the second expression, the
simple point is that a larger value is better
136
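• A quick Python check, with made-up counts, that the two F-measure expressions agree:

# Hypothetical counts for a binary problem.
TP, FP, FN = 40, 10, 20

recall = TP / (TP + FN)       # 0.667
precision = TP / (TP + FP)    # 0.8

f_from_rates = (2 * recall * precision) / (recall + precision)
f_from_counts = (2 * TP) / (2 * TP + FP + FN)

print(round(f_from_rates, 4), round(f_from_counts, 4))   # both 0.7273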
• The simplest one-figure measure of
goodness is simply the success rate:
• (TP + TN) / (TP + FP + TN + FN)
• One-figure measures don’t contain as
much information as graphs
• By definition, graphs tell you the outcome
over a range
137
• Table 5.7, shown on the following
overhead, summarizes lift charts, ROC
curves, and recall precision curves
• Even though we haven’t been given a
graph of a recall-precision curve, this is
where such curves are put in context with
lift charts and ROC curves
138
139
Cost Curves
• In the previous section two ROC curves
were graphed together
• Their performance on samples of different
sizes could be compared
• This comparison didn’t include cost
• It is not straightforward to see how cost
can be incorporated into lift charts or ROC
curves
140
• The book introduces error and cost curves
as a way of including cost in scheme
comparisons
• Even though error and cost curves have a
purpose similar to lift charts and ROC
curves, they are quite different from them
• In trying to understand error curves and
cost curves it’s important not to confuse
them with lift charts and ROC curves
141
The Error Curve
• The book presents an error curve and a
cost curve together in Figure 5.5
• I will deal with the two curves one at a time
• The presentation of error curves will come
first
• Consider the error curve, Figure 5.5a,
shown on the following overhead
142
143
• The error curve shown is based on binary
classification
• In the figure, a yes classification is
symbolized by “+”, so the probability of a
yes is p[+]
• By extension a no is symbolized by “-”
• The solid curve (line) shows the
performance of a classification scheme, A
144
The Axes of the Graph
• The x-axis is the probability of a yes, p[+],
classification in a data set that A is applied
to
• In other words, A’s performance may differ
for different probabilities of yes in the data
set
• The domain is the interval of probabilities
from 0 to 1
145
• The y-axis, the performance measure, is
the expected error
• This is the probability of a misclassification
by A for a given value of p[+]
• The range is the interval of expected
errors from 0 to 1
146
• The expected error is not the count of
errors, but the error rate from 0 to 100%
• The error rates for false negatives and
false positives are given using italics as fn
and fp, respectively
147
Characteristics of the Curve for
Data Mining Algorithm A
• A is shown as linear
• It is not known whether classification
scheme performance is necessarily linear,
but for the purposes of discussion, this
one is
• Also note that the slope of A is not zero
• A’s performance does differ for different
values of x
148
• Under the assumption that A is linear, A is
not hard to plot
• The left hand endpoint, the y intercept, is
defined as the false positive rate, fp, that A
gives when the probability of yes in the
data set is 0
• The y value for the right hand endpoint is defined as the false negative rate, fn, that A gives when the probability of a yes in the data set is 1
149
• For a given probability of yes on the x-axis
between 0 and 1:
• The y value would be the expected
misclassification rate, whether fp or fn, for
that probability p[+]
150
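• A minimal Python sketch of that straight-line error curve, with made-up endpoint rates for A:

# Hypothetical error rates for scheme A at the two ends of the x-axis.
fp = 0.1   # false positive rate when the data set contains no yeses (p[+] = 0)
fn = 0.3   # false negative rate when the data set is all yeses (p[+] = 1)

# Under the linearity assumption of Figure 5.5(a), the expected error at a
# given probability of yes is a weighted mix of the two endpoint rates.
def expected_error(p_yes):
    return fn * p_yes + fp * (1 - p_yes)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, expected_error(p))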
• A is lower on the left than on the right
• For data sets more likely to contain no, A
has fewer errors
• For data sets less likely to contain no, A
has more errors
• In a sense, you might say that the error
curve (line) shows that A is biased
• A is apparently more likely to give a
classification of no than yes
151
Other Things Shown in the
Figure
• The book indicates other things in the
figure
• The horizontal line at the bottom, the x-axis, would represent the performance of a
perfect predictor—expected error = 0
• The horizontal line at the top would
represent the performance of a predictor
that was always wrong—expected error =
1
152
• The dashed diagonal from the origin to the
upper right would represent the
performance if you always predicted yes
• The dashed diagonal from the upper left to
the lower right would represent the
performance if you always predicted no
153
• The two diagonals seem uninteresting, but
they bring out a useful point
• You don’t have to use the same predictor
across the whole domain
• Below a certain point, simply predicting no
outperforms A
• Above a certain point, simply predicting
yes outperforms A
154
• In theory, you could take a random sample
of a data set and see what the probability
of a yes was
• You could then choose which predictor to
use accordingly
155
The Cost Curve
• An example of a cost curve, Figure 5.5b, is
shown on the following overhead
• It looks somewhat similar to the error
curve of Figure 5.5a
• This is not an accident, but the axes of the
curves are different
• As the name implies, this curve contains
information about costs, not just error
rates
156
157
• It is not reasonable to try and derive or
prove the formulas they use in creating the
cost curve
• It’s also not reasonable to think we’ll come
to much of a theoretical or intuitive
understanding of the formulas
158
• My goal in presenting this will be the
following:
• Try to figure out what practical outcomes
in the graph result from the formulas used
• Then try to figure out how these practical
outcomes make it possible to read the
graph and understand what it represents
overall, even if the derivations are not
given in detail
159
The Axes of the Cost Curve—
The X-Axis
• The starting point for understanding the
cost curve has to be the axes
• The x-axis is labeled as the probability
cost function
• The book’s notation for this is pc[+]
• This is an extension of the notation p[+],
the probability of a yes, which was the x-axis of the error curve
160
• By way of introduction, consider the
following:
• Yes, we want a curve that graphs cost
• You might assume that you graph cost on
the y-axis against something else entirely
on the x-axis
161
• It turns out that cost characteristics are
going to be part of the x-axis
• Part of the goal of this discussion will be to
explain how a cost curve can be based on
an x-axis that contains cost information
• The book’s expression for the x-axis,
along with an explanatory version, are
given on the following overhead
162
• pc[+], the x-axis of the cost curve
• = (p[+] x C[-|+]) / (p[+] x C[-|+] + p[-] x C[+|-])
• = (prob(yes) x cost(FN)) / (prob(yes) x cost(FN) + prob(no) x cost(FP))
163
• We don’t know what this “means” or
“where it came from”, but certain things
about it can be observed
• pc[+] is a transformation of p[+]
• p[+] appears in both the numerator and
denominator
• You might interpret p[-] as (1 – p[+])
164
• If you did a little simple graphing of the
function, you would find that it has curves
and discontinuities (where the
denominator goes to 0)
• So the x-axis of the cost curve is a
function of the x-axis of the error curve
165
• If you substitute p[+] = 0 and p[+] = 1, you
get the following
• If prob(yes) = p[+] = 0, pc[+] = 0
• If prob(yes) = p[+] = 1, pc[+] = 1
• Expressing the x-axis as a function, this is:
• pc[+] = fcn(p[+])
• For p[+] = 0, fcn(p[+]) = 0
• For p[+] = 1, fcn(p[+]) = 1
166
• In words, the x-axis of the cost curve is a
transformation of the x-axis of the error
curve
• The endpoints of the interval of interest of
the error curve map to the same, desired
endpoints for the interval of interest for the
cost curve, 0 and 1
• In between these points, pc[+] is actually
a curvilinear function of p[+]
167
The Y-Axis
• With the x-axis established, what is the y
value being mapped against it?
• The y-axis represents what is known as
the normalized expected cost
• The normalized expected cost is a function
of the probability cost function
• y = NEC = f(pc[+]) = f(x)
168
• This is the formula for the normalized
expected cost, namely the formula given in
the book for the function that defines the y
coordinate of the graph:
• fn x pc[+] + fp x (1 – pc[+])
• Note that fn and fp are small letters in
italics
• These italicized things are not the counts,
FP and FN
169
• fn and fp are the rates or ratios of false
negatives and false positives
• This was also the notation used in the
error curve graph
• In the error curve, fn and fp varied
according to p[+]
170
• The formula given for the y coordinate of
the cost curve appears quite simple
• However, it would be wise to take into
account that when calculating it, fn and fp
would themselves vary across the range of
p[+]
171
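• A minimal numeric Python sketch putting the two formulas together, with made-up domain costs and made-up fn and fp rates at one value of p[+]:

# Hypothetical domain information.
p_yes = 0.4      # probability of an actual yes in the data set, p[+]
cost_fn = 5.0    # cost of a false negative, C[-|+]
cost_fp = 1.0    # cost of a false positive, C[+|-]

# Hypothetical performance of a scheme at this p[+].
fn_rate = 0.2    # false negative rate, fn
fp_rate = 0.1    # false positive rate, fp

# Probability cost function: the x-axis of the cost curve.
pc_yes = (p_yes * cost_fn) / (p_yes * cost_fn + (1 - p_yes) * cost_fp)

# Normalized expected cost: the y-axis, always between 0 and 1.
nec = fn_rate * pc_yes + fp_rate * (1 - pc_yes)

print(round(pc_yes, 3), round(nec, 3))   # 0.769 0.177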
• In practical terms, what have we got?
• The normalized expected cost—
• Is the sum of two terms—
• Each term consisting of two factors—
• Where each of the factors falls between 0 and 1—
172
• The sum of fp and fn can’t be greater than
1
• You hope it’s less than 1
• Otherwise all of your predictions are in
error
• The sum of pc[+] and (1 – pc[+]) is clearly 1
173
• These arithmetic facts mean that the
normalized expected cost can only fall
between 0 and 1
• This is why it’s called normalized
• From a purely pragmatic point of view this
normalization is the reason (perhaps
among others) why the x-axis is defined
the way it is and the function y = f(x) is
defined the way it is
174
What Does the Graph Portray?
• The cost curve is shown again on the
following overhead as a memory refresher
• With the axes defined, what information
can be read from the graph?
175
176
• The line A represents the cost curve for a
particular data mining algorithm
• The x-axis, the probability cost function,
contains information on two things at once:
• The probability of a yes or no for instances
of various different data sets
• The cost of FP and FN for a particular
problem domain
177
• The x-axis doesn’t include any information
about a data mining algorithm or its
performance
• The x-axis contains information about the
domain only (as it should)
178
• The y-axis is the dimension along which
you graph algorithm dependent
information
• The fp and fn in the function being
graphed are performance measures for an
algorithm at given probabilities of yes/no in
the data set
179
• Remember, for example, that if an
algorithm has been trained to lean towards
no, but the data set leans towards yes,
you will tend to get more FN’s and fn will
be higher
• This is the kind of information that the
factors fn and fp include in the formula
180
• After all this blah blah blah, presumably
the following simple description of the cost
curve will make some sense:
• The cost curve graphs the normalized cost
of applying a data mining algorithm to data
sets where the p[+] ranges from 0 to 1
181
What Else is in the Figure?
• Given the explanation so far, what else is
in the figure and what does it signify?
• Once again, the figure is repeated as a
reminder of what’s in it
182
183
• The performance cost function of a given
algorithm is shown as a straight, solid line
• The book still doesn’t say why the lines
would necessarily be straight, but we’ll
take them as they’re given
• There are lines for A and B in the figure,
plus shadowy lines for other hypothetical
data mining algorithms
184
• A and B cross and have different
performance/cost characteristics over
different ranges of p[+] or pc[+] values
• The always yes/always no predictors are
shown as crossed, dashed, diagonal lines
• The curved, dotted line is referred to as an
envelope
185
• The idea behind the shadowy lines and
the envelope is that you might have
multiple data mining algorithms
• If you used each algorithm only in that
range of pc[+] where it was the best, the
envelope is the minimum cost curve over
the problem
186
• Finding the best algorithm to use, given
this envelope, is not hard to accomplish
• A statistical sample will tell you the p[+] for
any given data set out of a problem
domain
• Then use the algorithm that performs best
for that value
187
The End
188