Chapter 15 - VCU DMB Lab.


Chapter 15
ASSESSMENT OF DATA MODELS
Cios / Pedrycz / Swiniarski / Kurgan
Outline
• Introduction
• Models, Their Assessment and Selection
- The Bias-Variance Dilemma
• Simple Split and Cross-Validation
• Bootstrap
• Occam’s Razor Heuristic
• Minimum Description Length Principle
• Akaike’s Information Criterion and Bayesian Information
Criterion
• Sensitivity, Specificity and ROC Analyses
• Interestingness Criteria
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
2
Introduction
Assessment of a data model is done by a data miner
before selecting the “best” model and presenting it
to the user/owner/expert of data.
The user then makes the final decision whether to accept
the model in its current form and use it for some
purpose, or to ask for a new one to be generated (a frequent
outcome).
The user’s expectation of a DM project is to find some
new information/knowledge, hidden in the data, so
that it can be used for some advantage.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
3
Introduction
Problem:
A data miner often assesses the quality of the generated model
using the same data that were used to generate the model itself
(albeit divided into training and testing data sets),
while the owner/user depends not only on data and DM results but
also on his deep (expert) domain knowledge.
In spite of the KDP requirement that the data miner learns about
the domain and the data as much as possible,
his or her knowledge obviously constitutes only a subset of the
knowledge of the experts (data owners).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
4
Introduction
In practice:
• A data miner generates several models of the data and must choose
one “best” model, in terms of how well it explains the training data
and its predictive power, before presenting it to the data owner. We
will discuss the methods for selecting such “best” models.
• The process of selecting the “best” model by a data miner is called
model selection.
• The owner of the data then utilizes his deep domain knowledge for
model assessment.
Data miners attempt to mimic experts by designing artificial
interestingness measures.
• Next, we focus on the heuristic, data-reuse (data re-sampling), and
analytic methods for model selection and validation.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
5
Introduction
The terms model, classifier, and estimator will be used interchangeably.
• A model can be defined as a description of causal relationships
between input and output variables.
• A classifier is a model of data used for a classification purpose:
given a new input, it assigns it to one of the classes it was
designed/trained to recognize.
• An estimator is a method used to calculate a parameter; it is a
variable defined as a function of the sample values.
• The number of independent pieces of information required for
estimating the model is called the model's degrees of freedom.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
6
Introduction
• One of the simplest heuristics for model selection is to choose a
parsimonious model: one that uses the fewest parameters
from among several acceptably well-performing models.
• There is always a model error associated with any model.
It is calculated as a difference between the true value and the
model output value, and is expressed either as absolute or
squared error between the observed/true and model output
values.
• Model error can be calculated only if training data, meaning
known inputs corresponding to known outputs, are available.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
7
Introduction
• When we generate a model of the data, we fit the model
to the data.
• In addition to fitting the model to the data we are also
interested in using the model for prediction.
• Once we have generated several models and have
selected the “best” one, we need to validate the model
for its goodness of fit (fit error) and its
goodness of prediction (prediction error).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
8
Introduction
• In neural network and ML literature, goodness of
prediction is often referred to as the generalization
error. The latter term ties it into the concepts of
overfitting and underfitting the data.
• Overfitting means unnecessarily increasing the model's
complexity, for example, increasing the number of parameters
and the model's degrees of freedom beyond what is necessary.
• Underfitting is the opposite of overfitting: too simple a
model will not fit the data well.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
9
Introduction
We can divide model assessment techniques based on the nature of
the methods:
• data-reuse / re-sampling
(simple split, cross-validation, bootstrap)
• heuristic
(parsimonious model, Occam’s razor)
• analytical
(Minimum Description Length, Akaike's Information Criterion,
and Bayesian Information Criterion)
• interestingness measures
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
10
Bias-Variance Dilemma
Bias is defined as the error that cannot be reduced by increasing
the sample size:
• it is present even if an infinite sample is available
• it is a systematic error.
Some example sources of bias:
- measurement error (experimental error that cannot be removed)
- sample error (sample may not be correctly drawn from the
population, and thus may not represent the data correctly)
- error associated with a particular form of an estimator, etc.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
11
Bias-Variance Dilemma
Bias is calculated as the difference between the estimated expected value and
the true value of some parameter:

B = E(p̂) − p

Its squared value, B², is one of the two components (the other is the variance, S²)
of the mean square error, MSE, which is the mean squared difference
between the true value of a parameter, p, and its estimate p̂:

MSE(p̂) = E[(p̂ − p)²] = S²(p̂) + B²(p̂)

where the variance is (p̄ denoting the mean of the estimates p̂_i):

S² = Σ_i (p̂_i − p̄)² / (N − 1)    or, for the whole population,    S² = Σ_i (p̂_i − p̄)² / N
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
12
Bias-Variance Dilemma
Variance is an additional (to bias) error that is incurred given a finite
sample (because of sensitivity to random fluctuations).
Estimator is a method used to calculate a parameter.
Some examples:
- histogram density estimator, which estimates density based on counts per
bin
- Bayesian estimator, which estimates the a posteriori probability from the
a priori probability via Bayes’ rule.
The simplest nonparametric estimator
(meaning one that does not depend on complete knowledge of the
underlying distribution)
is the sample mean that estimates the population mean; it constitutes the
simplest model of the data.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
13
Bias-Variance Dilemma
Biased estimators have a non-zero bias
(meaning that the estimated expected value is different
from the true value)
while
unbiased estimators have zero bias of the estimate.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
14
Bias-Variance Dilemma
There exists a trade-off between bias and
variance, known as the bias-variance dilemma.
For a fixed MSE, an estimator with large bias has a small
accompanying variance, and vice versa.
We are interested in finding a model/estimator which is
neither too complex (may overfit the data)
nor too simple (may underfit the data).
Such a model can be found by minimizing the MSE value, with
acceptable bias and variance.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
15
Bias-Variance Dilemma
[Figure: illustrations of three estimators: high bias and low variance, low bias and high variance, and medium bias and medium variance.]
16
Bias-Variance Dilemma
[Figure: MSE, bias, and variance curves plotted against model complexity (with the optimal complexity at the minimum of the MSE) and against data size.]
Illustration of the bias-variance dilemma.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
17
Bias-Variance Dilemma
A model should be chosen in such a way that it does not
overfit the data,
which means that it should perform acceptably well on
both training and test data,
as measured by the error between the true/desired value
and the model output value.
High dimensionality increases variance (the curse of dimensionality);
removing irrelevant features (feature selection) can therefore improve accuracy.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
18
Bias-Variance Dilemma
[Figure: fit/training and prediction/test errors plotted against model complexity and data size; the underfitting region (high bias and low variance) and the overfitting region (low bias and high variance) bracket the optimal complexity, where the test-data error is lowest.]
Choosing optimally complex model with acceptable fit and prediction errors.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
19
Assessment Techniques
Model assessment techniques can be divided based on the nature
of the methods:
• data-reuse / re-sampling
(simple split, cross-validation, bootstrap)
• heuristic
(parsimonious model, Occam’s razor)
• analytical
(Minimum Description Length, Akaike's Information Criterion,
and Bayesian Information Criterion)
• interestingness measures
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
20
Simple Split
[Diagram: the available data set split into a Training Data part and a Test Data part.]
Given a large data set composed of known inputs
corresponding to known outputs,
we can evaluate the model by splitting the available
data into two parts:
the training part used for fitting the model, and
the test part used for evaluation of its goodness of
prediction.
Holding back 1/3 (or ½) of data for testing is typical.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
21
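As a concrete illustration of the simple split, here is a minimal sketch in Python/NumPy. The names `model`, `error`, and the fit/predict interface are assumptions for illustration, not from the chapter; X and y are assumed to be NumPy arrays of known inputs and known outputs.

import numpy as np

def simple_split(X, y, test_fraction=1/3, seed=0):
    """Shuffle the data and hold back a fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Usage (assuming `model` implements fit/predict and `error` compares outputs):
# X_tr, y_tr, X_te, y_te = simple_split(X, y)
# model.fit(X_tr, y_tr)
# fit_error = error(y_tr, model.predict(X_tr))         # goodness of fit
# prediction_error = error(y_te, model.predict(X_te))  # goodness of prediction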
Simple Split and Cross-Validation
If the results are unsatisfactory, then a more expensive
computational method, called
cross-validation, is used to estimate the goodness of
prediction of the model.
Informally, cross validation should be used in situations
where the data set is relatively small but “difficult”
(meaning that splitting into just two parts does not result
in good prediction).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
22
Cross-Validation
Let n be the number of data points in the training data set, and let k be
an integer much smaller than n.
In k-fold cross-validation, we divide the entire data set into k
equal-size subsets, use k−1 of them for training and the
remaining one for testing and calculation of the prediction error
(goodness of prediction).
We repeat the procedure k times, each time with a different test subset,
and report the average error over the k runs.
In an extreme situation, when the data set is very small, n-fold cross-validation,
also known as the leave-one-out method, is used.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
23
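A minimal k-fold cross-validation sketch in NumPy; `make_model` and `error` are placeholders for whatever learner and error measure are used (assumptions for illustration only).

import numpy as np

def k_fold_cv_error(X, y, make_model, error, k=10, seed=0):
    """Average prediction error over k folds; k = n gives leave-one-out."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)            # k (nearly) equal-size subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(k) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        errors.append(error(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(errors))             # reported CV estimate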
Nested Cross-Validation
[Figures illustrating nested (repeated double) cross-validation, from "Repeated double cross validation" by Filzmoser et al.: an inner cross-validation loop is used for model/parameter selection, and an outer loop estimates the prediction error of the selected model.]
24-26
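A hedged sketch of the general nested cross-validation idea (not the exact procedure of Filzmoser et al.): the inner loop picks the "best" model/parameter, the outer loop estimates its prediction error on data never used for that choice. Helper names (`make_model`, `error`, `candidate_params`) are illustrative assumptions.

import numpy as np

def nested_cv_error(X, y, candidate_params, make_model, error,
                    outer_k=5, inner_k=4, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    outer_folds = np.array_split(idx, outer_k)
    outer_errors = []
    for i in range(outer_k):
        test_idx = outer_folds[i]
        train_idx = np.hstack([outer_folds[j] for j in range(outer_k) if j != i])
        # Inner CV on the outer-training part only: model selection.
        best_param, best_err = None, np.inf
        inner_folds = np.array_split(train_idx, inner_k)
        for param in candidate_params:
            errs = []
            for m in range(inner_k):
                val_idx = inner_folds[m]
                fit_idx = np.hstack([inner_folds[n] for n in range(inner_k) if n != m])
                model = make_model(param)
                model.fit(X[fit_idx], y[fit_idx])
                errs.append(error(y[val_idx], model.predict(X[val_idx])))
            if np.mean(errs) < best_err:
                best_param, best_err = param, np.mean(errs)
        # Outer loop: refit with the selected parameter, test on the held-out fold.
        model = make_model(best_param)
        model.fit(X[train_idx], y[train_idx])
        outer_errors.append(error(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(outer_errors))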
Cross-Validation
Advantages:
• All data used for training
• All data used for test (once)
• No data wasted
Disadvantages:
• Computationally more expensive than simple split
(brute force)
• An estimate with good (low) bias can still have high variance
– Increasing k gives lower bias but higher variance
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
27
Bootstrap
Bootstrap gives a nonparametric estimate of the error of
a model in terms of its bias and variance.
It works as follows:
We draw x samples of equal size from a population
consisting of n samples,
with the purpose of calculating confidence intervals for
the estimates.
A strong assumption is made that the available data set
of n samples constitutes the population itself.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
28
Bootstrap
First, from this “population”, x samples of size n
(notice that it is of the same size as the original data set)
called bootstrap samples, are drawn, with replacement.
Next, we fit a model to each of the x bootstrap samples and assess
goodness of fit (error) for each of the bootstrap samples and
average it over the x bootstrap samples
(let us denote by xi the ith realization of the bootstrap sample, where i=1 , …, x)
and calculate the bootstrap estimate of the bias and variance as:
B(t) = (1/x) Σ_i x_i − t

S²(t) = Σ_i (x_i − t_ave)² / (x − 1)

t_ave = (1/x) Σ_i x_i

where the sums run over i = 1, …, x and t is the estimate calculated from the original data.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
29
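A minimal sketch of the bootstrap bias and variance estimates above, using the sample mean as the estimator t (the "simplest model of the data"); the choice of estimator and the number of bootstrap samples are illustrative assumptions.

import numpy as np

def bootstrap_bias_variance(data, estimator=np.mean, x=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    t = estimator(data)                               # estimate from the original data
    # x bootstrap samples of size n, drawn with replacement
    boot = np.array([estimator(rng.choice(data, size=n, replace=True))
                     for _ in range(x)])
    t_ave = boot.mean()
    bias = boot.mean() - t                            # B(t) = (1/x) * sum_i(x_i) - t
    variance = ((boot - t_ave) ** 2).sum() / (x - 1)  # S^2(t)
    return bias, variance

# Example: data = np.random.default_rng(1).normal(loc=5.0, size=30)
#          bias, var = bootstrap_bias_variance(data)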
Bootstrap
• Assume the available data set of n samples is the population
itself.
• Draw x "bootstrap samples", with replacement, from the
population of size n (the same size as the original data)
– With replacement means that we don’t remove the points
from the population when selected, so they may appear
multiple times in a single sample.
• Fit the model to each bootstrap sample and test it against the
original data.
• Average the error over the x bootstrap samples.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
30
Bootstrap

Data (original):  a b c d e f g h i j
Sample 1:  f g e a j g b d d i   → train on Sample 1, test (against the original data) → Err1
Sample 2:  d f i e g j a e j h   → train on Sample 2, test → Err2
Sample 3:  b d j e b g h j d f   → train on Sample 3, test → Err3

Calculate the bootstrap estimate of the bias and variance as:

B(t) = (1/x) Σ_i x_i − t

S²(t) = Σ_i (x_i − t_ave)² / (x − 1)

t_ave = (1/x) Σ_i x_i

x_i is the ith realization of the bootstrap sample, for i = 1, …, x,
and t is the estimate calculated from the original data.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
31
Bootstrap
To use bootstrap for calculating goodness of prediction we treat the
set of bootstrap samples as the training data set, and the original
data set as the test set.
We thus fit the model to all bootstrap samples, and calculate the
prediction error on the original data.
The problem with this approach is that the prediction error is
optimistic (too good), since the bootstrap samples are drawn with
replacement.
In other words, we calculate the prediction/generalization error on
highly overlapping data (many data items in the test set are the
same as in the training data).
This is a kind of "cheating" (note that CV is less optimistic).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
32
Bootstrap Cross Validation
To address the optimism of bootstrap:
•Generate x bootstrap samples of size n
•Perform leave-one-out CV on each bootstrap sample
•Average the CV estimates for each of the x bootstrap samples
This reduces the issue of testing on the same data as used for
training, but the same instance can still appear multiple times in one
bootstrap sample. Thus, bootstrap CV still underestimates the true
error.
Fu, Carroll and Wang. 2005. Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics, Vol. 21, Issue 9, pp. 1979-1986.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
33
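A hedged sketch of bootstrap cross-validation as summarized above (leave-one-out CV inside each bootstrap sample, averaged over samples); it is not the authors' reference implementation, and `make_model` and `error` are placeholders.

import numpy as np

def bootstrap_cv_error(X, y, make_model, error, x=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(x):
        b = rng.integers(0, n, size=n)        # bootstrap sample indices (with replacement)
        loo_errors = []
        for i in range(n):                    # leave-one-out within the bootstrap sample
            train = np.delete(b, i)
            model = make_model()
            model.fit(X[train], y[train])
            loo_errors.append(error(y[b[i:i+1]], model.predict(X[b[i:i+1]])))
        estimates.append(np.mean(loo_errors))
    return float(np.mean(estimates))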
.632 Bootstrap
Bootstrap is an optimistically biased estimator, due to test/training
data overlap.
.632 Bootstrap corrects this bias by combining two biased
estimators: Zero Bootstrap estimator and Resubstitution estimator.
Zero bootstrap: Use bootstrap sample to train, and use only the
samples not in a bootstrap as test data (eliminates test/training
overlap).
This is a very pessimistic estimator.
Efron, Bradley. 1983. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. Journal of the American Statistical Association, Vol. 78, No. 382, pp. 316-331
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
34
.632 Bootstrap
Resubstitution error is the error obtained when the full data set is used for both
training and testing (a very optimistic estimator; "true cheating").
The probability that any given instance from the original data/population is not
chosen after n samples are drawn is
(1 − 1/n)^n ≈ e^(−1) ≈ 0.368
thus, the expected proportion of distinct original instances appearing in a
(training) bootstrap sample is 0.632.
We then combine the two biased estimators to correct the bias:
err = (1/x) Σ_{i=1..x} (0.632 · c0_i + 0.368 · errs)
where c0_i is the (zero bootstrap) error estimate on bootstrap sample i and errs is the resubstitution error.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
35
.632 Bootstrap
Data:  a b c d e f g h i j   → train and test on the full data → errs
Sample 1:  f g e a j g b d d i   → train on Sample 1, test on the instances not drawn into it (c, h) → c01
Sample 2:  d f i e g j a e j h   → train on Sample 2, test on the left-out instances (b, c, i) → c02
Sample 3:  b d i e b g h i d f   → train on Sample 3, test on the left-out instances (a, c, j) → c03

err = 1/3 [(0.632 c01 + 0.368 errs) + (0.632 c02 + 0.368 errs) + (0.632 c03 + 0.368 errs)]
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
36
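A minimal sketch of the .632 bootstrap estimate as given above: combine the resubstitution error (errs) with the zero-bootstrap errors c0_i computed on the instances left out of each bootstrap sample. The helper names `make_model` and `error` are assumptions.

import numpy as np

def bootstrap_632_error(X, y, make_model, error, x=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Resubstitution: train and test on the full data (very optimistic).
    model = make_model()
    model.fit(X, y)
    errs = error(y, model.predict(X))
    c0 = []
    for _ in range(x):
        b = rng.integers(0, n, size=n)                   # bootstrap sample indices
        out = np.setdiff1d(np.arange(n), b)              # instances not drawn
        if len(out) == 0:
            continue                                     # (rare) no held-out points
        model = make_model()
        model.fit(X[b], y[b])
        c0.append(error(y[out], model.predict(X[out])))  # zero-bootstrap error
    # err = (1/x) * sum_i (0.632 * c0_i + 0.368 * errs)
    return float(np.mean([0.632 * c + 0.368 * errs for c in c0]))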
Occam’s Razor Heuristic
Occam's razor states that the simplest explanation (model) of the
observed phenomena in a given domain is the most likely to be correct.
Although we do not know how to determine "the simplest
explanation", we intuitively agree with William of Ockham that,
given several models
(specified, for example, in terms of production IF… THEN… rules),
a more "compact" model
(composed of fewer rules, especially if, on average, the rules in this
model are shorter than those in the other models)
should be chosen.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
37
Occam’s Razor Heuristic
The Occam's razor heuristic is built into most machine learning
algorithms: a simpler/shorter/more compact description of the data
is preferred over a more complex one.
Thus, a problem with this heuristic is that we plan to use it for
model assessment while we have already used it for generating
the model itself!
Another issue with Occam's razor is that some researchers argue it
may be entirely incorrect:
if we model data (a process that generated the data) known to be very
complex, why should a simple model of the data be preferred at all?
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
38
Minimum Description Length Principle
MDL principle was designed to be general and, importantly,
independent of any underlying probability distribution.
Its "father", Rissanen, wrote: "We never want to make a false
assumption that the observed data actually were generated by a
distribution of some kind, say, Gaussian, and then go on and
analyze the consequences and make further deductions.
Our deductions can be entertaining but quite irrelevant…”.
This statement is in stark contrast to statistical methods, because the
MDL provides a clear interpretation regardless of whether some
underlying “true” model of data exists or not.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
39
Minimum Description Length Principle
The basic idea of the MDL is connected with a specific
understanding of learning (model building).
Namely, MDL is about finding regularities in the data,
where regularity is understood as the ability to
compress the data.
Learning can also be understood as the ability to
compress the data, i.e., we want to come up with
compact (e.g., with fewer number of rules) description
of the data (the model).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
40
Minimum Description Length Principle
In the parlance of machine learning, we want to select the
most general model that does not overfit the data.
In the parlance of MDL, given a set of models (M) of the
data (D), we want to select the model that most
compresses the data.
Both formulations specify the same goal, stated in
different terminology.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
41
Minimum Description Length Principle
If a system can be defined in terms of input and corresponding
output data, then in the worst case (longest), it can be described
by supplying the entire data set, thus constituting the longest
(least compressed) model of the data.
However, if regularities can be discovered in the data, say,
expressed in the form of production rules, then a much shorter
description is possible, and it can be measured by the MDL
principle.
The MDL principle says that the complexity of a model/hypothesis
is measured by the number of bits needed to encode the model
itself, plus the number of bits needed to encode the data using
the model.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
42
Minimum Description Length Principle
Formally, from a set of models, we choose as the “best” the model that
minimizes the following sum:
L(M) + L(D│M)
where L(M) is length (in bits) of the description of the model, and
L(D│M) is length of the description of data encoded using model M.
The basic idea behind this definition can be also explained using
notions of underfitting and overfitting.
It is easy to find a complex model, meaning one having large L(M)
value that overfits the data (i.e., with small L(D│M) value).
It is equally easy to find a simple model, with the small L(M) value that
underfits the data, but which has large L(D│M) value.
Notice the similarity of the MDL principle to the bias-variance dilemma.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
43
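As a toy illustration of the two-part score L(M) + L(D|M), here is a sketch in Python. The coding scheme (a fixed number of bits per rule, exceptions encoded by their index and correct class) is an assumption made purely for illustration, not the formal MDL coding.

import math

def two_part_code_length(n_rules, bits_per_rule, n_examples, n_exceptions, n_classes):
    L_M = n_rules * bits_per_rule                        # bits to encode the model
    # Bits to encode the data given the model: which examples the rules
    # get wrong, and what their true classes are.
    L_D_given_M = n_exceptions * (math.log2(n_examples) + math.log2(n_classes))
    return L_M + L_D_given_M

# A complex rule set that fits almost everything vs. a simple one with more exceptions:
complex_model = two_part_code_length(n_rules=40, bits_per_rule=20,
                                     n_examples=1000, n_exceptions=5, n_classes=2)
simple_model = two_part_code_length(n_rules=5, bits_per_rule=20,
                                    n_examples=1000, n_exceptions=30, n_classes=2)
# MDL prefers whichever sum L(M) + L(D|M) is smaller.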
Minimum Description Length Principle
What we are looking for is a model that constitutes the best compromise
between the two cases.
Suppose that we have generated two models that explain/fit the data equally
well, then the MDL principle tells us to choose one that is simpler
(for instance, having a smaller number of parameters; recall that a model using
the smallest number of parameters is called parsimonious),
and that allows for the most compact description of the data.
In that sense, the MDL can be seen as a formalization of the Occam’s razor
heuristic.
[Figure: the two extremes, a simple model with small L(M) and large L(D|M) versus a complex model with large L(M) and small L(D|M).]
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
44
Akaike’s Information Criterion
Akaike's Information Criterion (AIC) and the Bayesian Information
Criterion (BIC) are statistical measures used for choosing between
models that use different numbers of parameters, d.
They are closely related to each other, as well as to the MDL.
The idea behind both is as follows. We want to estimate the prediction
error (E) and use it for model selection.
We can easily calculate the training error (TrE), however, the test
instances most often do not coincide with the training instances, so
the TrE is too optimistic.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
45
Akaike’s Information Criterion
To remedy this we estimate the error of optimism (Eop), and calculate
the in-sample (given the sample) error as follows:
E = TrE + Eop
AIC is defined as:

AIC = −(2/N) logL + 2 (d/N)

(equivalently, up to the factor N, AIC = −2 logL + 2d, the form used below),
where logL is the maximized log-likelihood, L(θ│y) = Σ_i log f(y_i│θ)
evaluated at the maximum-likelihood estimate:

logL = Σ_{i=1}^{N} log P_θ̂(y_i)

The log-likelihood of the model given the data reflects the overall fit of the model.
Here P_θ(Y) is a family of densities containing the "true" density,
θ̂ is the maximum-likelihood estimate of θ, and d is the
number of parameters in the model.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
46
Statement of the Maximum Likelihood Method (MLM):
Assume we have made N measurements of x: {x1, x2, ..., xN}.
Assume we know the probability distribution function that describes x: f(x, θ).
Assume we want to determine the parameter θ.
The MLM says that we pick θ so as to maximize the probability of getting the
measurements (the xi's) that we obtained!
How do we use the MLM?
The probability of measuring x1 is f(x1, θ)dx
The probability of measuring x2 is f(x2, θ)dx
The probability of measuring xN is f(xN, θ)dx
If the measurements are independent, the probability of getting our measurements is:
L = f(x1, θ) f(x2, θ) … f(xN, θ) dx^N
L is called the Likelihood Function. It is convenient to write L as:
L = Π_{i=1}^{N} f(xi, θ)
We want to pick the θ that maximizes L. Thus we want to solve:
∂L/∂θ = 0 at θ = θ*
From Richard Kass
47
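To make the MLM concrete, here is a minimal numerical example; the exponential density f(x, θ) = θ·exp(−θx) is an illustrative assumption, not from the slide.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)          # "measurements", true theta = 0.5

thetas = np.linspace(0.01, 2.0, 2000)
# log L(theta) = sum_i log f(x_i, theta) = N*log(theta) - theta*sum(x_i)
log_L = len(x) * np.log(thetas) - thetas * x.sum()

theta_mle_grid = thetas[np.argmax(log_L)]
theta_mle_analytic = 1.0 / x.mean()               # solves dL/dtheta = 0 analytically
print(theta_mle_grid, theta_mle_analytic)         # both close to 0.5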
Information Criteria
AIC and BIC are defined as:

−2 logL + k·p    (training-error term + error-of-optimism penalty)

where L is the likelihood function, p is the number of parameters in the model,
and k is 2 for AIC and log(n) for BIC.
48
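A hedged sketch of computing AIC and BIC from a fitted model's maximized log-likelihood, here illustrated with a Gaussian likelihood for the residuals (the Gaussian choice is an assumption for illustration).

import numpy as np

def gaussian_log_likelihood(residuals):
    n = len(residuals)
    sigma2 = np.mean(residuals ** 2)              # ML estimate of the noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def aic_bic(residuals, p):
    """p = number of estimated parameters in the model."""
    n = len(residuals)
    logL = gaussian_log_likelihood(residuals)
    aic = -2.0 * logL + 2.0 * p                   # AIC = -2*logL + 2*p
    bic = -2.0 * logL + p * np.log(n)             # BIC = -2*logL + p*log(n)
    return aic, bic

# Models with more parameters usually reduce -2*logL (better fit) but pay a larger
# penalty; we pick the candidate model with the smallest AIC or BIC value.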
AIC
AIC = -2logL + 2p
Goal: Evaluate a set of models for a given data set and determine which model
best fits the data - with the least amount of information loss – corresponding
to the smallest AIC value.
Akaike weights ( wi ) provide another
measure of the strength of each model, and
represent the ratio of delta AIC (Δ i ) values
for each model relative to the set of R
candidate models.
Mazerolle 2004
49
AIC
Table 1. Akaike’s second-order information criterion (AICc) of the regression models of
mass lost by frogs after 2 h across three substrate types in or out of the shade. A total of
121 observations were retained for analysis.
Mazerolle 2004
50
Bayesian Information Criterion
BIC is defined as:
BIC = - 2logL + d logN
The AIC and BIC definitions have the same form; BIC simply replaces the
factor 2 in the penalty term of AIC with logN.
For a Gaussian distribution with variance S² we can write BIC as:
BIC = (N/S²) (TrE + (d/N) logN)
We choose the optimal model as the one corresponding to the
minimum value of the BIC.
BIC favors simpler models, since it penalizes complex models more
heavily than AIC does (logN > 2 for N > 7).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
51
Sensitivity and Specificity
Let us assume that some underlying "truth" (also known as a gold
standard) exists; this is in contrast to the MDL principle, which does
not require such an assumption.
This means that training data must be available, i.e., known inputs
corresponding to known outputs.
It further implies that we know the total number of positive examples,
P, and the total number of negative examples, N.
Then, we are able to form a confusion matrix /
misclassification matrix / contingency table.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
52
Sensitivity and Specificity
Truth (gold standard) vs. test result:

Truth Positive, test result Positive:  True Positive (TP) - no error
Truth Positive, test result Negative:  False Negative (FN) - rejection error (a miss; Type II error)
Truth Negative, test result Positive:  False Positive (FP) - acceptance error (a false alarm; Type I error)
Truth Negative, test result Negative:  True Negative (TN) - no error

Row totals: P = TP + FN (total of true positives) and N = FP + TN (total of true negatives).
Column totals: TP + FP (test results recognized as Positive) and FN + TN (test results recognized as Negative); the grand total is the total population.

TP is the case in which the test result and gold standard/truth are both positive; FP is the
case in which the test result is positive but the gold standard is negative; TN is the case
where both are negative; and FN is the case where the test result is negative but the gold
standard is positive.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
53
Sensitivity and Specificity
Sensitivity = TP / P = TP / (TP+FN) = Recall = TP rate
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
54
Sensitivity and Specificity
Sensitivity measures how often we find what we are looking for (say
for a certain disease).
In other words, sensitivity is 1 if all instances of the True class are
classified to the True class.
What we are looking for is P = (TP+FN), i.e., all the cases in which the
gold-standard (disease) is actually positive (but we found only TP of
them), thus the ratio.
Sensitivity is also known in the literature under a variety of near-synonyms, such as TP rate and hit rate.
In information extraction / text mining literature sensitivity is known
under the term recall.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
55
Sensitivity and Specificity
Specificity = TN / N = TN / (FP+TN)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
56
Sensitivity and Specificity
Specificity measures how well we avoid flagging what we are not
looking for (say, a certain disease);
in other words, it measures the ability of a test to be negative when the
disease is not present.
Specificity is 1 if only instances of the True class are classified to the True
class, i.e., there are no false positives.
The reference set is N = (FP+TN), i.e., all the cases in which the gold
standard is actually negative (but we have correctly found only TN of them).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
57
Sensitivity and Specificity
People often report only accuracy of their results instead of
calculating both sensitivity and specificity – not acceptable in
medical data analyses.
In a situation where, say, sensitivity is high (say over 0.9) while
specificity is low (say about 0.3), the accuracy may look acceptable
(over 0.6 or 60%), as can be seen from its definition:
Accuracy = (TP+TN) / (P+N)
Sometimes a different definition for accuracy is used, one that ignores
the number of TN:
Accuracy2 = TP / (P+N)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
58
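A minimal sketch computing sensitivity, specificity, and accuracy (and precision, used just below) from binary labels, with 1 = positive and 0 = negative; pure NumPy, no external libraries assumed.

import numpy as np

def binary_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    TP = np.sum((y_true == 1) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    sensitivity = TP / (TP + FN)                 # = TP / P = recall = TP rate
    specificity = TN / (FP + TN)                 # = TN / N
    accuracy = (TP + TN) / (TP + TN + FP + FN)   # = (TP+TN) / (P+N)
    precision = TP / (TP + FP)
    return sensitivity, specificity, accuracy, precision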
Sensitivity and Specificity
In text mining a measure called precision is used:
Precision = TP / (TP + FP)
where TP + FN = P are called relevant documents and TP +FP are called
retrieved documents.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
59
Sensitivity and Specificity
Test results for 27 examples; rows give the true (gold standard) diagnosis and
columns the class the example was classified as:

True \ Classified as     A    B    C    D    Sum
A                        4    1    1    1      7
B                        0    7    0    0      7
C                        0    0    2    1      3
D                        0    0    0   10     10
Sum                      4    8    3   12     27
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
60
Sensitivity and Specificity
Using the same confusion matrix as on the previous slide (rows = true/gold
standard diagnosis, columns = test results/classified as; 27 examples in total),
each class is evaluated against the remaining classes with

Specificity = TN / N = TN / (FP+TN)
Accuracy = (TP+TN) / (P+N)

                Class A          Class B          Class C          Class D
Sensitivity     .57 (4/7)        1.0 (7/7)        .67 (2/3)        1.0 (10/10)
Specificity     1.0 (20/20)      .95 (19/20)      .96 (23/24)      .88 (15/17)
Accuracy        .89 ((4+20)/27)  .96 ((7+19)/27)  .93 ((2+23)/27)  .93 ((10+15)/27)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
61
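The per-class values in the table above can be reproduced by treating each class against the rest on the 4x4 confusion matrix; a minimal sketch:

import numpy as np

cm = np.array([[4, 1, 1, 1],     # true A
               [0, 7, 0, 0],     # true B
               [0, 0, 2, 1],     # true C
               [0, 0, 0, 10]])   # true D
total = cm.sum()                 # 27 examples

for k, label in enumerate("ABCD"):
    TP = cm[k, k]
    FN = cm[k, :].sum() - TP     # class-k examples classified elsewhere
    FP = cm[:, k].sum() - TP     # other classes classified as k
    TN = total - TP - FN - FP
    sens = TP / (TP + FN)
    spec = TN / (FP + TN)
    acc = (TP + TN) / total
    print(label, round(sens, 2), round(spec, 2), round(acc, 2))
# Prints: A 0.57 1.0 0.89 | B 1.0 0.95 0.96 | C 0.67 0.96 0.93 | D 1.0 0.88 0.93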
Sensitivity and Specificity
Class A (Abnormal)

                          Test result positive (A)   Test result negative (N or U)   Sum
positive rules (for A)              13                        0+2 = 2                 15
negative rules (for N)               3                        5+1 = 6                  9

24 examples: 15 Abnormal and 9 Normal; 6 are left unclassified, since 24 − (13+5) = 6 = 2+1+1+2.
Sensitivity = TP / (TP+FN) = 13/(13+2) = .867
Specificity = TN / (FP+TN) = 6/(3+(5+1)) = .667
Accuracy = (TP+TN) / (TP+TN+FP+FN) = (13+6) / (13+2+3+6) = .792
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
62
Sensitivity and Specificity
Class N (Normal)

                          Test result positive (N)   Test result negative (A or U)   Sum
negative rules (for N)               5                        3+1 = 4                  9
positive rules (for A)               0                        13+2 = 15               15
Sensitivity = TP / (TP+FN) = 5/(5+4) = .556
Specificity = TN / (FP+TN) = 15/(15+0) = 1.0
Accuracy = (TP+TN) / (TP+TN+FP+FN) = (5+15) / (5+4+15+0) = .833
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
63
Sensitivity and Specificity
Results for class    Sensitivity    Specificity    Accuracy
A                    .867           .667           .792
N                    .556           1.0            .833
MEAN                 .712           .834           .813
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
64
Sensitivity and Specificity
Other measures calculated from the misclassification matrix:
False Discovery Rate (FDR) = FP / (TP + FP)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
65
Sensitivity and Specificity
F-measure is calculated as the harmonic mean of recall/sensitivity and precision:
F-measure = (2 x Precision x Recall) / (Precision + Recall)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
66
Fß Measure
The Fß-measure places ß times more weight on recall than on precision:

Fß-measure = (1 + ß²) x (Precision x Recall) / ((ß² x Precision) + Recall)
F-measure is also known as F1-measure (ß=1) while:
F.5 - measure places less emphasis on recall
F2 - measure places more emphasis on recall
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
67
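A minimal sketch of the F-measure and Fß-measure defined above, computed from precision and recall; the example numbers are illustrative only.

def f_beta(precision, recall, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the F1 harmonic mean
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example with precision = 0.8 and recall = 0.6:
# f_beta(0.8, 0.6)             # F1  ~ 0.69 (harmonic mean)
# f_beta(0.8, 0.6, beta=2)     # F2  ~ 0.63 (more emphasis on recall)
# f_beta(0.8, 0.6, beta=0.5)   # F.5 = 0.75 (less emphasis on recall)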
ROC Analyses
Receiver Operating Characteristics (ROC) analysis is performed by
drawing curves in two-dimensional space, with axes defined by the
TP rate and FP rate, or equivalently, by using terms of sensitivity
and specificity.
Let us define the FP rate as being equal to FP/N; this is similar (but
using “negative” logic) to the definition of the TP rate
(which is TP/P and is better known as Sensitivity)
then we can re-write the specificity as
Specificity = 1 - FP rate
from which we obtain
FP rate = 1- Specificity
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
68
ROC Analyses
[Figure: a ROC graph with y-axis Sensitivity (TP rate) and x-axis 1 − Specificity (FP rate), both running from 0 to 1; the diagonal from (0,0) to (1,1) is shown, with point W at (0.5, 0.5) and a point P marking an evaluated classifier.]
A ROC graph.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
69
ROC Analyses
ROC plots allow for visual comparison of several models (classifiers).
For each model, we calculate its sensitivity and specificity, and draw it
as a point on the ROC graph.
A confusion matrix represents the evaluation of a model; drawn on the ROC
graph it becomes a single point, at the (1-specificity, sensitivity) value,
denoted as point P.
The ideal model would be one represented by a location (0,1) on the
graph, corresponding to 100% specificity and 100% sensitivity.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
70
ROC Analyses
Point (0,0) represents 100% specificity and 0% sensitivity, and point (1,1) 0%
specificity and 100% sensitivity; thus, neither corresponds to an acceptable
model.
All points lying on the line connecting points (0,0) and (1,1) represent random
guessing of the classes (equal values of 1-specificity and sensitivity, or in
other words, equal values of TP and FP rates).
That means that such models would recognize equal amounts of TP and FP,
with point W (0.5, 0.5) representing 50% specificity and 50% sensitivity.
These observations suggest the strategy of always "operating" in the region
above the diagonal line (y = x), since in this region the TP rate is higher than
the FP rate.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
71
ROC Analyses
How can we obtain a curve on the ROC plot corresponding to a
classifier, say A?
Let us assume that for a (large) class called Abnormal
(say one representing people with some disease)
we have drawn a distribution of the examples over some symptom
(feature).
And let us assume that we have done the same for (large) class
Normal (say representing people without a disease) over the same
symptom.
Note that such distributions would very often overlap.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
72
ROC Analyses
[Figure: two overlapping distributions over the symptom value, with the Normal mean and the Abnormal mean marked.]
Distribution of Normal (healthy) and Abnormal (sick) patients.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
73
ROC Analyses
Division into Normal and Abnormal patients using a threshold of 4.5.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
74
ROC Analyses
ROC curves for two classifiers: A and B
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
75
ROC Analyses
How to decide which of the two classifiers is a better model of the data?
By visual analysis: the curve more to the upper left indicates a better
classifier. However, the curves often overlap, and a decision may not be so
easy to make.
A popular method used to solve the problem is to calculate Area Under Curve
(AUC).
The area under the diagonal curve is 0.5. Thus, we are interested in choosing a
classifier which has maximum area under its ROC curve: the larger the area
the better performing the model is.
There exists a measure similar to the AUC for assessing the goodness of a
model, known as the Gini coefficient, which is defined as twice the area
between the diagonal and the ROC curve; the two measures are equivalent,
since
Gini + 1 = 2 AUC
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
76
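A hedged sketch of building a ROC curve by sweeping the decision threshold over a score (or symptom value), then computing AUC by the trapezoidal rule and the Gini coefficient from Gini + 1 = 2 AUC. The labeling convention (1 = Abnormal/positive, higher score = more abnormal) is an assumption for illustration.

import numpy as np

def roc_curve_points(y_true, scores):
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    P, N = (y_true == 1).sum(), (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]
    for thr in np.sort(scores)[::-1]:                 # lower the threshold step by step
        y_pred = (scores >= thr).astype(int)
        tpr.append(((y_pred == 1) & (y_true == 1)).sum() / P)   # sensitivity
        fpr.append(((y_pred == 1) & (y_true == 0)).sum() / N)   # 1 - specificity
    return np.array(fpr), np.array(tpr)

def auc_and_gini(fpr, tpr):
    auc = np.trapz(tpr, fpr)          # area under the ROC curve (trapezoidal rule)
    return auc, 2 * auc - 1           # Gini = 2*AUC - 1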
ROC Analyses
ROC curves for two classifiers: A and B
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
77
ROC Analyses
ROC curves for two classifiers: A and B
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
78
Interestingness Measures
Similarly to research carried out in the area of expert systems, where
computer scientists mimicked the human decision-making process
by interviewing experts and codifying the rules they use (e.g., how
to diagnose a certain disease) in terms of production rules,
data scientists undertook a similar effort to formalize the measures
(supposedly) used by domain experts to evaluate models of data.
Such criteria can be roughly divided into those assessing
interestingness of the association rules, which are generated from
unsupervised data,
and those for assessing interestingness of rules generated by
inductive machine learning (i.e., decision trees / rule learners) from
supervised data.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
79
Interestingness Measures
Rule refinement is used to reduce the number of rules and focus
only on some “more important - according to some criterion” rules.
First, we identify potentially interesting rules: those that satisfy user-specified
criteria such as strength and complexity, or that are similar to rules that already
satisfy such criteria.
Second, a subset of these rules, known as the technically interesting rules,
is selected using methods such as chi-square tests and AIC/BIC.
Third, we remove all but the technically interesting rules.
Notice that the just-described process of rule refinement is "equivalent" to
the selection of the "best" rules done by a user.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
80
Interestingness Measures
Interestingness measure is used to assess interestingness of
generated classification rules, one rule at a time.
Classification rules are divided into two types:
• Discriminant, when the evidence implies the hypothesis (e → h); such a rule
specifies conditions sufficient for distinguishing between the classes (these
are the most frequently used in practice).
• Characteristic, when the hypothesis implies the evidence (h → e); such a rule
specifies conditions necessary for membership in a class.
Given a rule of type e → h or h → e:
• The rule is X% COMPLETE if e satisfies (is true for / covers) X% of the
tuples satisfying h.
• The rule is Y% DISCRIMINANT if e satisfies (100−Y)% of the ¬h tuples
(i.e., tuples for which h is false).
81
Interestingness Measures
To assess interestingness of a characteristic rule, measures of
sufficiency and necessity are calculated
S(e → h) = p(e | h) / p(e | ¬h)

N(e → h) = p(¬e | h) / p(¬e | ¬h)
where → stands for implication and the ¬ symbol stands for negation.
These two measures are used for determining interestingness of different
forms of characteristic rules using the following relations:
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
82
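A small sketch estimating the sufficiency and necessity measures from tuple counts, assuming the definitions reconstructed above; the count arguments are illustrative inputs.

def sufficiency_necessity(n_e_h, n_e_noth, n_note_h, n_note_noth):
    """Counts of tuples satisfying (e,h), (e,not h), (not e,h), (not e,not h)."""
    p_e_given_h = n_e_h / (n_e_h + n_note_h)
    p_e_given_noth = n_e_noth / (n_e_noth + n_note_noth)
    p_note_given_h = n_note_h / (n_e_h + n_note_h)
    p_note_given_noth = n_note_noth / (n_e_noth + n_note_noth)
    S = p_e_given_h / p_e_given_noth          # sufficiency S(e -> h)
    N = p_note_given_h / p_note_given_noth    # necessity N(e -> h)
    return S, N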
Interestingness Measures
 (1  N(e  h))  p(h),
IC(h  e)  
0
0  N(e  h)  1
otherwise
 (1  S(e  h))  p(h),
IC(h  e)  
0
0  S(e  h)  1
otherwise
 (1  1/N(e  h))  p(h),
IC(h  e)  
0
 (1  1/S(e  h))  p(h),
IC(h  e)  
0
1  N(e  h)  
otherwise
1  S(e  h)  
otherwise
The calculated values of interestingness are in the range from 0 to 1,
and the owner of the data then uses some threshold (say, .5)
to make a decision of retaining or removing the rule.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
83
IC++
• Data were obtained for characteristic rules mined from a synthetic database of 4000
tuples described by 10 attributes
• IC++ was compared against the RI alg. (Piatetsky-Shapiro 1991), the J alg. (Smyth &
Goodman 1992), and the Certainty alg. (Hong & Mao 1991)
• IC++ chose rule 2 as the most interesting; it also showed that:
  - Rule 1: S & N are statistically independent; CE did not
  - Rule 4: IC+- shows that there is interestingness in that h → ¬e & e → ¬h
Kamber & Shinghal 1996
84
Interestingness Measures
Distance measure is another criterion that uses distance between
any two of the generated rules to find the strongest (with the
highest coverage) rules:
D(Ri, Rj) = [ DA(Ri, Rj) + 2·DV(Ri, Rj) − 2·EV(Ri, Rj) ] / [ N(Ri) + N(Rj) ],   if NO(Ri, Rj) = 0;   0 otherwise
where Ri and Rj are rules; DA(Ri, Rj) is the number of attributes present in Ri
but not in Rj plus the number of attributes present in Rj but not in Ri;
DV(Ri, Rj) is the number of attributes in Ri and Rj that have slightly (less than
2/3 of the range) overlapping values;
EV(Ri, Rj) is the number of attributes in both rules that have strongly
overlapping (more than 2/3) values in the range;
N(Ri) and N(Rj) are the number of attributes in each rule, and
NO(Ri, Rj) is the number of attributes in Ri and Rj with non-overlapping
values.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
85
Interestingness Measures
The distance criterion calculates values in the range from
-1 to 1, indicating strong and slight overlap,
respectively.
The value of 1 means no overlap.
The most interesting rules are those with the highest
average distance to the other rules.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
86
References
Cios, K.J., Pedrycz, W. and Swiniarski, R. 1998. Data Mining Methods for Knowledge
Discovery. Kluwer
Grunwald PD, Myung I.J and Pitt MA (eds.). 2005. Advances in Minimum Description
Length Theory and Applications. MIT Press
Hastie T, Tibshirani R and Friedman J. 2001. The Elements of Statistical Learning: Data
Mining, Inference and Prediction. Springer
Hilderman RJ and Hamilton HJ. 1999. Knowledge Discovery and Interestingness
Measures: A Survey. Technical Report CS 99-04, University of Regina, Regina,
Saskatchewan, Canada
Moore GW and Hutchins GM. 1983. Consistency versus completeness in medical
decision-making: Exemplar of 155 patients autopsied after coronary artery bypass
graft surgery. Med Inform (London). Jul-Sep., 8(3):197-207
Kecman V. 2001. Learning and Soft Computing. MIT Press
Rissanen J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific
van Rijsbergen, C.J. 1979. Information Retrieval. London. Butterworths
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
87
References
Kamber M and Shinghal R. 1996. Evaluating the Interestingness of Characteristic Rules.
Application of Mathematical Theories. AAAI.org
Mehta, Agrawal & Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining
Cavanaugh J., 2012 Lecture V: The Bayesian Information Criterion, Department of
Biostatistics, Department of Statistics and Actuarial Science, Univ. of Iowa
Mazerolle, 2007, Making sense out of Akaike’s Information Criterion (AIC)
http://avesbiodiv.mncn.csic.es/estadistica/senseaic.pdf
Efron, B. 1983. Estimating the Error Rate of a Prediction Rule: Improvement on
Cross-Validation. Journal of the American Statistical Association, Vol 78,
No. 382 pg 316-331.
Fu, Carroll and Wang. 2005. Estimating misclassification error with small samples via
bootstrap cross-validation. Bioinformatics, Vol 21, Issue 9 pg 1979-1986.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
88