
Midterm Review
Constructing decision trees
• Normal procedure: top-down, in a recursive divide-and-conquer fashion
– First: an attribute is selected for root node and a branch is
created for each possible attribute value
– Then: the instances are split into subsets (one for each
branch extending from the node)
– Finally: the same procedure is repeated recursively for each
branch, using only instances that reach the branch
• Process stops if all instances have the same class, or
when most instances have the same class.
Weather data
Which attribute to select?
(The slide shows four candidate splits, (a)–(d).) Choose the attribute that results in the lowest entropy of the children nodes.
Example: attribute “Outlook”
Information gain
• Usually, people don't use the entropy of a node directly; rather, the information gain is used.
• Clearly, the greater the information gain, the purer the children nodes. So, we choose "Outlook" for the root.
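As a concrete illustration, here is a minimal Python sketch that computes entropy and information gain on the 14-instance nominal weather data; the attribute values are assumed to be the usual version of this dataset, since the slides only show it as a figure:

```python
from collections import Counter
from math import log2

# The 14-instance nominal weather data: (outlook, temperature, humidity, windy, play).
data = [
    ("sunny", "hot", "high", "false", "no"),       ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),   ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),   ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),   ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def entropy(rows):
    """Entropy of the class distribution (the last field) of a set of rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr_idx):
    """Parent entropy minus the weighted average entropy of the children."""
    total = len(rows)
    children = {}
    for r in rows:
        children.setdefault(r[attr_idx], []).append(r)
    remainder = sum(len(ch) / total * entropy(ch) for ch in children.values())
    return entropy(rows) - remainder

for i, name in enumerate(attributes):
    print(f"gain({name}) = {info_gain(data, i):.3f}")
# Outlook has the highest gain (about 0.247), so it is chosen for the root.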
Continuing to split
Highly-branching attributes
• The weather data with ID code
Tree stump for the ID code attribute
Subsets are more likely to be pure if there is a large number of values
Information gain is biased towards choosing attributes with a large number of values
What’s the remedy?
Gain ratio
Gain ratios for weather data
In this example of only 14 training instances, the "ID code" attribute still has the greatest gain ratio, but its advantage is greatly reduced.
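Continuing the sketch above (and reusing its info_gain, data, and attributes definitions), a small illustration of the gain-ratio correction:

```python
def split_info(rows, attr_idx):
    """Intrinsic information of a split: the entropy of the branch sizes themselves."""
    total = len(rows)
    sizes = Counter(r[attr_idx] for r in rows).values()
    return -sum(s / total * log2(s / total) for s in sizes)

def gain_ratio(rows, attr_idx):
    si = split_info(rows, attr_idx)
    return info_gain(rows, attr_idx) / si if si > 0 else 0.0

for i, name in enumerate(attributes):
    print(f"gain_ratio({name}) = {gain_ratio(data, i):.3f}")

# An "ID code" attribute with 14 distinct values has gain 0.940 but split info
# log2(14) = 3.807, so its gain ratio falls to about 0.247: still the largest here,
# but the bias towards many-valued attributes is greatly reduced.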
Numerical attributes
• Tests in nodes can be of the form xj > constant
• Divides the space into rectangles.
Considering splits
• The only thing we need to do differently in our algorithm is to
consider splitting between each data point in each dimension.
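A small sketch of how the candidate thresholds for one numeric attribute could be enumerated; using midpoints is an assumption for illustration (any point between two consecutive distinct values works):

```python
def candidate_splits(values):
    """Candidate thresholds for one numeric attribute: midpoints between
    consecutive distinct sorted values."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

print(candidate_splits([65, 70, 70, 75, 80]))   # [67.5, 72.5, 77.5]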
Bankruptcy Example
Bankruptcy Example
• We consider all the possible splits in each dimension, and
compute the average entropies of the children.
Bankruptcy Example
• Now, we consider all the splits of the remaining part of space.
• Note that we have to recalculate all the average entropies again,
because the points that fall into the leaf node are taken out of
consideration.
Regression Trees
• Like decision trees, but with real-valued constant outputs at the
leaves.
Splitting
• Use average variance of the children to evaluate the quality of
splitting on a particular feature.
• Here we have a data set, for which I've just indicated the y
values.
Splitting
• Compute a weighted average variance
• We can see that the average variance of splitting on feature 3 (f3) is much lower than that of splitting on f7, so we'd choose to split on f3.
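A minimal sketch of scoring a numeric split by the weighted average variance of the children; the data below is hypothetical, since the slide's f3/f7 values are not in the transcript:

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def split_score(xs, ys, threshold):
    """Weighted average variance of the two children produced by the test x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

# Hypothetical 1-D data: the split at 5 separates low y values from high ones,
# so its weighted child variance is far lower than a split at 2.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1]
print(split_score(xs, ys, 5))   # ~0.01 (good split)
print(split_score(xs, ys, 2))   # ~2.8  (poor split)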
Stopping
• Stop when the variance at the leaf is small enough.
• Then, set the value at the leaf to be the mean of the y values of
the elements.
Rules: Coverage and Accuracy
• Coverage of a rule:
– Fraction of records that
satisfy the antecedent of a
rule
• Accuracy of a rule:
– Fraction of records that
satisfy both the antecedent
and consequent of a rule
(over those that satisfy the
antecedent)
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Status=Single) → No
Coverage = 40%, Accuracy = 50%
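The same coverage and accuracy numbers can be checked with a short sketch over the table above (the tuple encoding of the records is an assumption made for the example):

```python
# Records from the table above: (Tid, Refund, Marital Status, Taxable Income, Class).
records = [
    (1, "Yes", "Single",   "125K", "No"),
    (2, "No",  "Married",  "100K", "No"),
    (3, "No",  "Single",   "70K",  "No"),
    (4, "Yes", "Married",  "120K", "No"),
    (5, "No",  "Divorced", "95K",  "Yes"),
    (6, "No",  "Married",  "60K",  "No"),
    (7, "Yes", "Divorced", "220K", "No"),
    (8, "No",  "Single",   "85K",  "Yes"),
    (9, "No",  "Married",  "75K",  "No"),
    (10, "No", "Single",   "90K",  "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """Coverage: fraction of records satisfying the antecedent.
    Accuracy: fraction of those that also satisfy the consequent."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    accuracy = sum(consequent(r) for r in covered) / len(covered)
    return coverage, accuracy

# Rule: (Status=Single) -> No
cov, acc = coverage_and_accuracy(
    records,
    antecedent=lambda r: r[2] == "Single",
    consequent=lambda r: r[4] == "No",
)
print(cov, acc)   # 0.4, 0.5  i.e. coverage 40%, accuracy 50%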
A simple covering algorithm
• Generates a rule by adding tests that maximize the rule's accuracy.
• Here, each new test (growing the rule) reduces the rule's coverage.
• Goal: maximizing accuracy
  – t: total number of instances covered by the rule
  – p: positive examples of the class covered by the rule
  – t − p: number of errors made by the rule
• Select the test that maximizes the ratio p/t.
(Diagram: the space of examples, the rule so far, and the rule after adding a new term.)
Pseudo-code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
      (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E
The RIPPER algorithm is similar, but it uses information gain instead of p/t.
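For illustration, a rough Python sketch of a PRISM-style covering learner following the pseudo-code above; representing instances as dictionaries with a "class" key is an assumption made for the example:

```python
def learn_one_rule(instances, attributes, target_class):
    """Grow one rule for target_class by greedily adding A = v tests that
    maximize accuracy p/t, breaking ties by the larger p."""
    conditions = {}                       # the rule's left-hand side: {attribute: value}
    covered = list(instances)
    while covered and any(inst["class"] != target_class for inst in covered):
        best = None                       # (accuracy, p, attribute, value)
        for attr in attributes:
            if attr in conditions:        # each attribute is used at most once
                continue
            for value in {inst[attr] for inst in covered}:
                subset = [inst for inst in covered if inst[attr] == value]
                t = len(subset)
                p = sum(inst["class"] == target_class for inst in subset)
                candidate = (p / t, p, attr, value)
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
        if best is None:                  # no attributes left to add
            break
        _, _, attr, value = best
        conditions[attr] = value
        covered = [inst for inst in covered if inst[attr] == value]
    return conditions

def prism(instances, attributes, classes):
    """For each class, learn rules one at a time and remove the instances they cover."""
    rules = []
    for c in classes:
        remaining = list(instances)
        while any(inst["class"] == c for inst in remaining):
            rule = learn_one_rule(remaining, attributes, c)
            rules.append((rule, c))
            remaining = [inst for inst in remaining
                         if not all(inst[a] == v for a, v in rule.items())]
    return rules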
Probabilistic Reasoning
Conditional Independence – Naïve Bayes
P(cause | effect1, effect2)
  ∝ P(effect1 | effect2, cause) · P(effect2 | cause) · P(cause)
  ≈ P(effect1 | cause) · P(effect2 | cause) · P(cause)

P(cause | effect1, …, effectn) ∝ P(effect1 | cause) · … · P(effectn | cause) · P(cause)
Two assumptions:
• Attributes are equally important
• Conditionally independent (given the class value)
• This means that knowledge about the value of a particular attribute doesn’t
tell us anything about the value of another attribute (if the class is known)
• Although based on assumptions that are almost never correct, this scheme
works well in practice!
Weather Data
Here we don’t really have
effects, but rather
evidence.
The weather data example
P(play=yes | E) =
P(Outlook=Sunny | play=yes) *
P(Temp=Cool | play=yes) *
P(Humidity=High | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E) =
= (2/9) *
(3/9) *
(3/9) *
(3/9) *
(9/14) / P(E) = 0.0053 / P(E)
Don't worry about the 1/P(E); it's α, the normalization constant.
The weather data example
P(play=no | E) =
P(Outlook=Sunny | play=no) *
P(Temp=Cool | play=no) *
P(Humidity=High | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
= (3/5) *
(1/5) *
(4/5) *
(3/5) *
(5/14) / P(E) = 0.0206 / P(E)
Normalization constant
P(play=yes | E) + P(play=no | E) = 1
0.0053 / P(E) + 0.0206 / P(E) = 1
P(E) = 0.0053 + 0.0206
So,
P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(play=no | E) = 0.0206 / (0.0053 + 0.0206) = 79.5%
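The whole calculation fits in a few lines of Python; the conditional probabilities are the counts read off the weather data, as in the slides:

```python
# Conditional probabilities read off the weather data counts, as in the slides.
p_yes, p_no = 9 / 14, 5 / 14
cond_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
cond_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}

def score(prior, conds):
    s = prior
    for p in conds.values():
        s *= p
    return s

score_yes = score(p_yes, cond_yes)            # ~0.0053
score_no = score(p_no, cond_no)               # ~0.0206

alpha = 1 / (score_yes + score_no)            # the normalization constant 1/P(E)
print(alpha * score_yes, alpha * score_no)    # ~0.205 and ~0.795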
The "zero-frequency problem"
• What if an attribute value doesn't occur with every class value (e.g. "Humidity = High" for class "Play = Yes")?
  – The probability P(Humidity=High | play=yes) will be zero!
  – The a posteriori probability will then also be zero, no matter how likely the other values are!
• Remedy:
  – Add 1 to the count for every attribute value-class combination (the Laplace estimator);
  – Add k (the number of possible attribute values) to the denominator.
• With the Laplace estimator, instead of
  P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
  = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E),
  it will be
  P(play=yes | E) = ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (9/14) / P(E) = 0.007 / P(E),
  where 3 is the number of possible values for 'Outlook' (and 'Temperature') and 2 is the number of possible values for 'Humidity' and 'Windy'.
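A small sketch of the Laplace-corrected estimate for the play=yes factors above:

```python
def laplace(count, class_count, n_values):
    """Laplace estimate: add 1 to the count and the number of possible values
    to the denominator."""
    return (count + 1) / (class_count + n_values)

# The four factors for play=yes with smoothing: outlook and temperature have
# 3 possible values, humidity and windy have 2.
score_yes = 9 / 14
for f in [laplace(2, 9, 3),   # outlook=sunny  | yes
          laplace(3, 9, 3),   # temp=cool      | yes
          laplace(3, 9, 2),   # humidity=high  | yes
          laplace(3, 9, 2)]:  # windy=true     | yes
    score_yes *= f
print(score_yes)              # ~0.007, matching the slide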
Missing values
• Training: instance is not included in frequency count for
attribute value-class combination
• Classification: attribute will be omitted from calculation
• Example:
P(play=yes | E) =
P(Temp=Cool | play=yes) *
P(Humidity=High | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E) =
= (3/9) * (3/9) * (3/9) *(9/14) / P(E)
= 0.0238 / P(E)
P(play=no | E) =
P(Temp=Cool | play=no) *
P(Humidity=High | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
= (1/5) * (4/5) * (3/5) *(5/14) / P(E)
= 0.0343 / P(E)
After normalization: P(play=yes | E) = 41%,
P(play=no | E) = 59%
Dealing with numeric attributes
• Usual assumption: attributes have a normal or Gaussian
probability distribution (given the class).
• The probability density function for the normal distribution is:
  f(x | class) = 1 / (σ · sqrt(2π)) · exp( −(x − μ)² / (2σ²) )
• We approximate μ by the sample mean:
  x̄ = (1/n) · Σ_{i=1..n} xi
• We approximate σ² by the sample variance:
  s² = (1/(n−1)) · Σ_{i=1..n} (xi − x̄)²
Weather Data
outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no
f(temperature=66 | yes)
= exp( −(66 − m)² / (2·var) ) / sqrt(2·3.14·var)

m = (83+70+68+64+69+75+75+72+81) / 9 = 73

var = ( (83−73)² + (70−73)² + (68−73)² + (64−73)² + (69−73)² + (75−73)² + (75−73)² + (72−73)² + (81−73)² ) / (9−1) = 38

f(temperature=66 | yes) = exp( −(66−73)² / (2·38) ) / sqrt(2·3.14·38) = 0.034
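The same density value can be reproduced with a short sketch:

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
m = sum(temps_yes) / len(temps_yes)                                   # 73.0
var = sum((t - m) ** 2 for t in temps_yes) / (len(temps_yes) - 1)     # 38.0

print(gaussian_pdf(66, m, var))   # ~0.034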
Classifying a new day
• A new day E:
P(play=yes | E) =
P(Outlook=Sunny | play=yes) *
P(Temp=66 | play=yes) *
P(Humidity=90 | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E) =
= (2/9) * (0.0340) * (0.0221) * (3/9)
*(9/14) / P(E) = 0.000036 / P(E)
P(play=no | E) =
P(Outlook=Sunny | play=no) *
P(Temp=66 | play=no) *
P(Humidity=90 | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
= (3/5) * (0.0291) * (0.0380) * (3/5)
*(5/14) / P(E) = 0.000136 / P(E)
After normalization: P(play=yes | E) = 20.9%,
P(play=no | E) = 79.1%
Bayesian Net Semantics
We order the variables according to the topological order of the given Bayes net.
Suppose we have the variables X1,…,Xn.
The probability for them to have the values x1, …, xn respectively is P(xn, …, x1):

P(xn, …, x1)
= P(xn | xn−1, …, x1) · P(xn−1, …, x1)
= P(xn | xn−1, …, x1) · P(xn−1 | xn−2, …, x1) · P(xn−2, …, x1)
= …
= Π_{i=1..n} P(xi | xi−1, …, x1) = Π_{i=1..n} P(xi | parents(xi))

e.g.,
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) · P(m | a) · P(a | ¬b, ¬e) · P(¬b) · P(¬e)
= …

Here P(xn, …, x1) is short for P(Xn = xn, …, X1 = x1).
Inference in Bayesian Networks
• Notation:
– X denotes query variable
– E denotes the set of evidence variables E1,…,Em, and e is a
particular event, i.e. an assignment to the variables in E.
– Y will denote the set of the remaining variables (hidden variables).
• A typical query asks for the posterior probability P(X | e1, …, em).
• E.g., we could ask: what's the probability of a burglary if both Mary and John call, P(burglary | johncalls, marycalls)?
Classification
• We compute and compare the following:
  P(x | e1, …, em) = P(x, e1, …, em) / P(e1, …, em) ∝ P(x, e1, …, em)
  P(¬x | e1, …, em) = P(¬x, e1, …, em) / P(e1, …, em) ∝ P(¬x, e1, …, em)
• However, how do we compute P(x, e1, …, em) and P(¬x, e1, …, em)?
  What about the hidden variables Y1, …, Yk?
Inference by enumeration
P(x, e1, …, em) = Σ_{y1} … Σ_{yk} P(x, e1, …, em, y1, …, yk)
and
P(¬x, e1, …, em) = Σ_{y1} … Σ_{yk} P(¬x, e1, …, em, y1, …, yk)

Example: P(burglary | johncalls, marycalls)? (Abbrev. P(b | j, m))

P(b | j, m)
= α · P(b, j, m)
= α · Σ_a Σ_e P(b, j, m, a, e)
= α · [ P(b, j, m, a, e) + P(b, j, m, a, ¬e) + P(b, j, m, ¬a, e) + P(b, j, m, ¬a, ¬e) ]
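A minimal enumeration sketch for P(b | j, m); the CPT numbers are the usual textbook values for the burglary network and are assumed here, since the slides only show the network as a figure:

```python
from itertools import product

# CPTs for the burglary network -- the usual textbook numbers, assumed here.
P_B = {True: 0.001, False: 0.999}                       # P(b)
P_E = {True: 0.002, False: 0.998}                       # P(e)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}      # P(a=True | b, e)
P_J = {True: 0.90, False: 0.05}                         # P(j=True | a)
P_M = {True: 0.70, False: 0.01}                         # P(m=True | a)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) from the chain rule with parents."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e]
            * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

def p_burglary_given_calls():
    """P(b | j, m): sum the joint over the hidden variables a and e, then normalize."""
    scores = {b: sum(joint(b, e, a, True, True)
                     for a, e in product((True, False), repeat=2))
              for b in (True, False)}
    alpha = 1 / sum(scores.values())
    return alpha * scores[True]

print(p_burglary_given_calls())   # ~0.284 with these CPT values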
Weather data
What is the Bayesian Network
corresponding to Naïve Bayes?
Play probability table
Based on the data…
P(play=yes) = 9/14
P(play=no) = 5/14
Let's correct with Laplace …
P(play=yes) = (9+1)/(14+2) = .625
P(play=no) = (5+1)/(14+2) = .375
Outlook probability table
Based on the data…
P(outlook=sunny|play=yes) =
(2+1)/(9+3) = .25
P(outlook=overcast|play=yes) =
(4+1)/(9+3) = .417
P(outlook=rainy|play=yes) =
(3+1)/(9+3) = .333
P(outlook=sunny|play=no) =
(3+1)/(5+3) = .5
P(outlook=overcast|play=no) =
(0+1)/(5+3) = .125
P(outlook=rainy|play=no) =
(2+1)/(5+3) = .375
Windy probability table
Based on the data…let’s
find the conditional
probabilities for “windy”
P(windy=true|play=yes,outlook=sunny)
= (1+1)/(2+2) = .5
Windy probability table
Based on the data…
P(windy=true|play=yes,outlook=sunny)
= (1+1)/(2+2) = .5
P(windy=true|play=yes,outlook=overcast)
= 0.5
P(windy=true|play=yes,outlook=rainy)
= 0.2
P(windy=true|play=no,outlook=sunny)
= 0.4
P(windy=true|play=no,outlook=overcast)
= 0.5
P(windy=true|play=no,outlook=rainy)
= 0.75
Final figure
Classify it
Classification I
Classify it
P(play=yes | outlook=sunny, temp=cool, humidity=high, windy=true) =
α * P(play=yes)
  * P(outlook=sunny | play=yes)
  * P(temp=cool | play=yes, outlook=sunny)
  * P(humidity=high | play=yes, temp=cool)
  * P(windy=true | play=yes, outlook=sunny)
= α * 0.625 * 0.25 * 0.4 * 0.2 * 0.5
= α * 0.00625
Classification II
Classify it
P(play=no | outlook=sunny, temp=cool, humidity=high, windy=true) =
α * P(play=no)
  * P(outlook=sunny | play=no)
  * P(temp=cool | play=no, outlook=sunny)
  * P(humidity=high | play=no, temp=cool)
  * P(windy=true | play=no, outlook=sunny)
= α * 0.375 * 0.5 * 0.167 * 0.333 * 0.4
= α * 0.00417
Classification III
Classify it
P(play=yes | outlook=sunny, temp=cool, humidity=high, windy=true) = α * 0.00625
P(play=no | outlook=sunny, temp=cool, humidity=high, windy=true) = α * 0.00417
α = 1/(0.00625 + 0.00417) = 95.969
P(play=yes | outlook=sunny, temp=cool, humidity=high, windy=true) = 95.969 * 0.00625 = 0.60
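A short sketch reproducing Classification I–III with the CPT entries quoted in the slides:

```python
# CPT entries quoted in the slides (Laplace-corrected estimates).
prior = {"yes": 0.625, "no": 0.375}
outlook_sunny = {"yes": 0.25, "no": 0.5}          # P(outlook=sunny | play)
cool_given_sunny = {"yes": 0.4, "no": 0.167}      # P(temp=cool | play, outlook=sunny)
high_given_cool = {"yes": 0.2, "no": 0.333}       # P(humidity=high | play, temp=cool)
windy_given_sunny = {"yes": 0.5, "no": 0.4}       # P(windy=true | play, outlook=sunny)

scores = {c: prior[c] * outlook_sunny[c] * cool_given_sunny[c]
             * high_given_cool[c] * windy_given_sunny[c]
          for c in ("yes", "no")}                  # 0.00625 and ~0.00417

alpha = 1 / sum(scores.values())                   # ~95.97
print({c: alpha * s for c, s in scores.items()})   # ~{'yes': 0.60, 'no': 0.40}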
Classification IV (missing values or hidden variables)
P(play=yes | temp=cool, humidity=high, windy=true)
= α * Σ_outlook P(play=yes)
    * P(outlook | play=yes)
    * P(temp=cool | play=yes, outlook)
    * P(humidity=high | play=yes, temp=cool)
    * P(windy=true | play=yes, outlook)
= … (next slide)
Classification V (missing values or hidden variables)
P(play=yes | temp=cool, humidity=high, windy=true)
= α * Σ_outlook P(play=yes) * P(outlook | play=yes) * P(temp=cool | play=yes, outlook) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook)
= α * [
    P(play=yes) * P(outlook=sunny | play=yes) * P(temp=cool | play=yes, outlook=sunny) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=sunny)
  + P(play=yes) * P(outlook=overcast | play=yes) * P(temp=cool | play=yes, outlook=overcast) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=overcast)
  + P(play=yes) * P(outlook=rainy | play=yes) * P(temp=cool | play=yes, outlook=rainy) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=rainy)
  ]
= α * [ 0.625*0.25*0.4*0.2*0.5 + 0.625*0.417*0.286*0.2*0.5 + 0.625*0.33*0.333*0.2*0.2 ]
= α * 0.01645
Classification VI (missing values or hidden variables)
P(play=no | temp=cool, humidity=high, windy=true)
= α * Σ_outlook P(play=no) * P(outlook | play=no) * P(temp=cool | play=no, outlook) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook)
= α * [
    P(play=no) * P(outlook=sunny | play=no) * P(temp=cool | play=no, outlook=sunny) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=sunny)
  + P(play=no) * P(outlook=overcast | play=no) * P(temp=cool | play=no, outlook=overcast) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=overcast)
  + P(play=no) * P(outlook=rainy | play=no) * P(temp=cool | play=no, outlook=rainy) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=rainy)
  ]
= α * [ 0.375*0.5*0.167*0.333*0.4 + 0.375*0.125*0.333*0.333*0.5 + 0.375*0.375*0.4*0.333*0.75 ]
= α * 0.0208
Classification VII (missing values or hidden variables)
P(play=yes | temp=cool, humidity=high, windy=true) = α * 0.01645
P(play=no | temp=cool, humidity=high, windy=true) = α * 0.0208
α = 1/(0.01645 + 0.0208) = 26.846
P(play=yes | temp=cool, humidity=high, windy=true) = 26.846 * 0.01645 = 0.44
P(play=no | temp=cool, humidity=high, windy=true) = 26.846 * 0.0208 = 0.56
I.e., P(play=yes | temp=cool, humidity=high, windy=true) is 44% and P(play=no | temp=cool, humidity=high, windy=true) is 56%.
So, we predict 'play=no.'
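The marginalization over the hidden outlook variable can likewise be written out directly; the per-outlook conditional probabilities below are the ones quoted in Classification V and VI:

```python
# Per-outlook conditional probabilities quoted in Classification V and VI.
outlooks = ["sunny", "overcast", "rainy"]
p_outlook = {"yes": {"sunny": 0.25,  "overcast": 0.417, "rainy": 0.33},
             "no":  {"sunny": 0.5,   "overcast": 0.125, "rainy": 0.375}}
p_cool    = {"yes": {"sunny": 0.4,   "overcast": 0.286, "rainy": 0.333},
             "no":  {"sunny": 0.167, "overcast": 0.333, "rainy": 0.4}}
p_windy   = {"yes": {"sunny": 0.5,   "overcast": 0.5,   "rainy": 0.2},
             "no":  {"sunny": 0.4,   "overcast": 0.5,   "rainy": 0.75}}
prior     = {"yes": 0.625, "no": 0.375}
p_high    = {"yes": 0.2, "no": 0.333}              # P(humidity=high | play, temp=cool)

scores = {c: sum(prior[c] * p_outlook[c][o] * p_cool[c][o] * p_high[c] * p_windy[c][o]
                 for o in outlooks)
          for c in ("yes", "no")}                  # ~0.01645 and ~0.0208

alpha = 1 / sum(scores.values())                   # ~26.85
print({c: alpha * s for c, s in scores.items()})   # ~{'yes': 0.44, 'no': 0.56}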
Evaluation
Predicting performance
• We want to say: p (the probability of success) lies within a certain specified interval with a certain specified confidence.
• Example: S = 750 successes in N = 1000 trials
  – Estimated success rate: 75%
  – How close is this to the true success rate p?
  – Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
• Another example: S = 75 and N = 100
  – Estimated success rate: 75%
  – With 80% confidence, p ∈ [69.1%, 80.1%]
  – I.e., the probability that p ∈ [69.1%, 80.1%] is 0.8.
• The bigger N is, the more confident we are, i.e. the surrounding interval is smaller.
  – Above, for N = 100 we were less confident than for N = 1000.
Estimating p
• Let Y be the random variable with possible values 1 for success and 0 for error.
• Let the probability of success be p. Then the probability of error is q = 1 − p.
  – Mean is p
  – Variance is pq
• Let S be the random variable denoting the number of successes, i.e. S is the sum of N value samplings of Y.
• Now, we approximate p with the success rate in N trials, i.e. S/N.
• By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with
  – mean p and
  – variance pq/N.
Estimating p
• Suppose a random variable X has mean 0 and a symmetric distribution. Then
  Pr[−z ≤ X ≤ z] = c
  Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]
• Confidence limits for the normal distribution with mean 0 and variance 1:
  E.g.: Pr[−1.65 ≤ X ≤ 1.65] = 90%
Estimating p
Pr[−1.65 ≤ X ≤ 1.65] = 90%

To use this, we have to reduce our random variable S/N to have 0 mean and unit variance:
  Pr[ −1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65 ] = 90%

Now we solve two equations:
  (S/N − p) / σ_{S/N} = 1.65
  (S/N − p) / σ_{S/N} = −1.65

Let N = 100 and S = 70. σ_{S/N} = sqrt(pq/N), and we approximate it by sqrt(.7 * .3 / 100) = .046.

The two equations become:
  (0.7 − p) / .046 = 1.65   ⇒   p = .624
  (0.7 − p) / .046 = −1.65  ⇒   p = .776

With 90% confidence, p ∈ [.624, .776].
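A small sketch of the interval computation, using the simplified standard deviation sqrt(f(1−f)/N) from the slides; the z value 1.28 for 80% confidence is taken from a standard normal table (an added detail, not from the slides):

```python
import math

def confidence_interval(successes, n, z):
    """Approximate interval for p, using f = S/N and sqrt(f(1-f)/N) for the
    standard deviation, as in the simplified derivation above."""
    f = successes / n
    sd = math.sqrt(f * (1 - f) / n)
    return f - z * sd, f + z * sd

print(confidence_interval(70, 100, 1.65))     # ~(0.624, 0.776) at 90% confidence
print(confidence_interval(750, 1000, 1.28))   # ~(0.732, 0.768) at 80% confidence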
k-Fold Cross-Validation
• First step: split data into k subsets of equal size
• Second step: use each subset in turn for testing, the remainder for training
• The error estimates are averaged to yield an overall error estimate
• Frequent question: which of two learning schemes performs better?
• Obvious way: compare, for example, 10-fold cross-validation estimates.
• Problem: variance in the estimate.
• We don't know whether the results are reliable; we need to use a statistical test for that.
• Student's t-test tells whether the means of two samples are significantly different.
  – In our case the samples are cross-validation estimates for different datasets from the domain.
• Use a paired t-test because the individual samples are paired:
  – the same cross-validation is applied twice.
Distribution of the means
• x1, x2, … xk and y1, y2, … yk are the 2k samples for the k different datasets
• mx and my are the means
• Estimated variances of the means are sx² and sy²
• With enough samples, the mean of a set of independent samples is normally distributed
• With small samples (k < 30) the mean follows Student's distribution with k−1 degrees of freedom (similar shape, but wider than the normal distribution)
Distribution of the differences
• Let md = mx − my
• The difference of the means (md) also has a Student's distribution with k−1 degrees of freedom
• Let sd² be the estimated variance of the difference
• The standardized version of md is called the t-statistic:
  t = md / sqrt(sd² / k)
• Fix a significance level α
  – If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
• Look up the value for z that corresponds to α/2
• If t ≤ −z or t ≥ z then the difference is significant
  – i.e. the null hypothesis (that the difference is zero) can be rejected
Example
• We have compared two classifiers through cross-validation on 10 different datasets.
• The error rates are:

Dataset       1     2     3     4    5    6     7    8     9    10
Classifier A  10.6  9.8   12.3  9.7  8.8  10.6  9.8  12.3  9.7  8.8
Classifier B  10.2  9.4   11.8  9.1  8.3  10.2  9.4  11.8  9.1  8.3
Difference    0.4   0.4   0.5   0.6  0.5  0.4   0.4  0.5   0.6  0.5
Example
• md = 0.48
• sd = 0.0789
• t = md / sqrt(sd² / k) = 0.48 / sqrt(0.0789² / 10) = 19.24
• The critical value of t for a two-tailed statistical test with α = 10% and 9 degrees of freedom is 1.83.
• 19.24 is way bigger than 1.83, so classifier B is much better than A.
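The t-statistic can be reproduced from the error-rate table with a few lines:

```python
import math

a = [10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8]
b = [10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3]

d = [x - y for x, y in zip(a, b)]
k = len(d)
md = sum(d) / k                                           # 0.48
sd = math.sqrt(sum((x - md) ** 2 for x in d) / (k - 1))   # ~0.0789

t = md / math.sqrt(sd ** 2 / k)
print(t)   # ~19.24, far above the critical value 1.83 (two-tailed, alpha = 10%, 9 d.o.f.)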
True positives and False positives
and ROC Space
True positive rate: TP = positives correctly classified / total positives (P)
False positive rate: FP = negatives incorrectly classified / total negatives (N)
A discrete classifier produces an (fp rate, tp rate) pair corresponding to a single point in ROC space.
One point in ROC space is better than another if it is to the northwest of the first:
– its tp rate is higher and its fp rate is lower.
Curves in ROC space
• Some classifiers, such as a Naive Bayes classifier, yield an instance probability or score.
  – Such a ranking or scoring classifier can be used with a threshold to produce a discrete (binary) classifier:
    • if the classifier output is above the threshold, the classifier produces a Y,
    • else an N.
  – Each threshold value produces a different point in ROC space (corresponding to a different confusion matrix).
  – Conceptually, we may imagine varying a threshold from −infinity to +infinity and tracing a curve through ROC space.
Algorithm
• Exploit monotonicity of thresholded classifications:
  – Any instance that is classified positive with respect to a given threshold will be classified positive for all lower thresholds as well.
• Therefore, we can simply:
  – sort the test instances in decreasing order of their scores,
  – move down the list, processing one instance at a time, and
  – update TP and FP as we go.
• In this way, an ROC graph can be created from a linear scan.
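A minimal sketch of that linear scan; the scores and labels below are hypothetical, not the ones from the slide's figure, and a fuller version would emit a point only when the score changes, to handle ties:

```python
def roc_points(scores, labels):
    """Scan instances in decreasing score order, updating TP and FP as each
    instance in turn is treated as classified positive."""
    P = sum(labels)
    N = len(labels) - P
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]                 # threshold = +infinity
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))   # (fp rate, tp rate)
    return points

# Hypothetical scores and true labels (1 = positive):
scores = [0.9, 0.8, 0.7, 0.55, 0.54, 0.51, 0.4, 0.38, 0.35, 0.1]
labels = [1,   1,   0,   1,    1,    0,    0,   1,    0,    0]
print(roc_points(scores, labels))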
Example
A threshold of +∞ produces the point (0, 0).
As we lower the threshold to 0.9, the first positive instance is classified positive, yielding (0, 0.1).
As the threshold is further reduced, the curve climbs up and to the right, ending up at (1, 1) with a threshold of 0.1.
Lowering the threshold corresponds to moving from the "conservative" to the "liberal" areas of the ROC space.
Observations
• The ROC point at (0.1, 0.5) gives the classifier its highest accuracy (70%).
• Note that the classifier's best accuracy occurs at a threshold of .54, rather than at .5 as we might expect with a balanced class distribution.
Area under an ROC Curve
• AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
• Often used to compare classifiers:
  – the bigger the AUC, the better.
• AUC can be computed by a slight modification to the algorithm for constructing ROC curves.
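A hedged sketch of computing AUC directly from that ranking interpretation (the scores and labels are hypothetical):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive instance scores
    higher than a randomly chosen negative one (ties count as 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0]))   # 5/6 ~ 0.83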
Convex Hull
If you aim to cover just 40% of
the true positives you should
choose method A, which gives a
false positive rate of 5%.
If you aim to cover 80% of the
true positives you should choose
method B, which gives a false
positive rate of 60% as compared
with A’s 80%.
If you aim to cover 60% of the
true positives then you should
combine A and B.
• The shaded area is called the convex hull of the two curves.
• You should always operate at a point that lies on the upper boundary of the convex hull.
• What about some point in the middle, where neither A nor B lies on the convex hull?
• Answer: "randomly" combine A and B.
Combining classifiers
• Example (CoIL Symposium Challenge 2000):
– There is a set of 4000 clients to whom we wish to market a
new insurance policy.
– Our budget dictates that we can afford to market to only 800
of them, so we want to select the 800 who are most likely to
respond to the offer.
– The expected class prior of responders is 6%, so within the
population of 4000 we expect to have 240 responders
(positives) and 3760 non-responders (negatives).
Combining classifiers
• Assume we have generated two classifiers, A
and B, which score clients by the probability
they will buy the policy.
• In ROC space,
– A’s best point lies at (.1, .2) and
– B’s best point lies at (.25, .6)
• We want to market to exactly 800 people so
our solution constraint is:
fp rate * 3760 + tp rate * 240 = 800
• If we use A, we expect:
.1 * 3760 + .2*240 = 424 candidates, which
is too few.
• If we use B we expect:
.25*3760 + .6*240 = 1084 candidates,
which is too many.
• We want a classifier between A and B.
Combining classifiers
• The solution constraint is shown as a
dashed line.
• It intersects the line between A and B at
C,
approximately (.18, .42)
• A classifier at point C would give the
performance we desire and we can
achieve it using linear interpolation.
• Calculate k as the proportional distance
that C lies on the line between A and B:
k = (.18-.1) / (.25 – .1)  0.53
• Therefore, if we sample B's decisions
at a rate of .53 and A's decisions at a
rate of 1-.53=.47 we should attain C's
performance.
In practice this fractional sampling can be done as follows:
For each instance (person), generate a random number between zero and one.
If the random number is greater than k, apply classifier A to the instance and report its decision; else apply classifier B.
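A sketch of the interpolation and the sampling rule described above; classifier_a and classifier_b are assumed to be callables returning a Y/N decision:

```python
import random

def interpolation_weight(fp_a, fp_c, fp_b):
    """Proportional distance of the target point C between A and B (by fp rate)."""
    return (fp_c - fp_a) / (fp_b - fp_a)

k = interpolation_weight(0.10, 0.18, 0.25)   # ~0.53

def combined_decision(instance, classifier_a, classifier_b, k):
    """If the random number exceeds k, use A's decision; otherwise use B's."""
    chosen = classifier_a if random.random() > k else classifier_b
    return chosen(instance)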
Iso-Performance lines
• Let c(Y,n) be the cost of a false positive error.
• Let c(N,p) be the cost of a false negative error.
• Let p(p) be the prior probability of a positive example.
• p(n) = 1 − p(p) is the prior probability of a negative example.
• The expected cost of a classification by the classifier represented by a point (TP, FP) in ROC space is:
  p(p) * (1 − TP) * c(N,p) + p(n) * FP * c(Y,n)
• Therefore, two points (TP1, FP1) and (TP2, FP2) have the same performance if
  (TP2 − TP1) / (FP2 − FP1) = p(n) c(Y,n) / p(p) c(N,p)
Iso-Performance lines
• The equation defines the slope
of an isoperformance line, i.e.,
all classifiers corresponding to
points on the line have the same
expected cost.
• Each set of class and cost
distributions defines a family of
isoperformance lines.
  – Lines "more northwest" (having a larger TP intercept) are better because they correspond to classifiers with lower expected cost.
Lines α and β show the optimal classifier under different sets of conditions.
Discussion – Comparing Classifiers
Precision and Recall
Precision is the fraction of retrieved documents that are
relevant
Recall is the fraction of relevant documents that are retrieved
In terms of confusion matrix…
Why have two numbers?
• The advantage of having the two numbers for precision and recall is
that one is more important than the other in many circumstances.
• Typical web surfers:
– would like every result on the first page to be relevant (high
precision), but have not the slightest interest in knowing let alone
looking at every document that is relevant.
• In contrast, various professional searchers such as paralegals and
intelligence analysts:
– are very concerned with trying to get as high recall as possible,
and will tolerate fairly low precision results in order to get it.
• Nevertheless, the two quantities clearly trade off against one another:
– you can always get a recall of 1 (but very low precision) by
retrieving all documents for all queries!
• Recall is a non-decreasing function of the number of documents
retrieved.
– On the other hand, precision usually decreases as the number of
documents retrieved is increased.
What about a single number?
• The combined measure which is standardly used is called the F measure, the weighted harmonic mean of precision and recall:
  F = 1 / ( α/P + (1−α)/R ) = (β² + 1)·P·R / (β²·P + R),   where β² = (1−α)/α
• The default is to weight precision and recall equally, giving the balanced F measure.
  – This corresponds to making α = 1/2 or β = 1.
  – Commonly written as F1, which is short for Fβ=1.
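A one-line helper for the general Fβ form (the formula as reconstructed above):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall:
    F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the balanced F1."""
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

print(f_measure(0.8, 0.4))   # F1 ~ 0.53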
Evaluation of ranked retrieval results
• Precision-recall curves have a distinctive saw-tooth shape:
  – if the (k+1)th document retrieved is irrelevant then recall is the same as for the top k documents, but precision has dropped.
  – If it is relevant, then both precision and recall increase, and the curve jags up and to the right.
• It's often useful to remove these jiggles (by interpolation):
  – the precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r.
• Justification:
  – almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher).
(Interpolated precision is shown by the red line in the slide's figure.)
Precision at k
• The measures above consider precision at all recall levels.
• What matters is rather how many good results there are
on the first page or the first three pages.
• This leads to measures of precision at fixed low levels of
retrieved results, such as 10 or 30 documents.
• This is referred to as “Precision at k”, for example
“Precision at 10.”
Books and Slides
• Corresponding chapters in the book:
Ch 1,
Ch 2 (2.1, 2.2)
Ch 4 (4.1, 4.2, 4.3, 4.5, 4.6)
Ch 5 (5.1, 5.3, 5.4.1, 5.4.2, 5.4.3, 5.7.1, 5.7.2)
• However, the book isn't very clear on predictive data mining (it is pretty good for the second part of the course, which is the reason for choosing it).
• The Witten & Eibe book (on library reserve) is clearer for predictive data mining (but not good for the second part of the course).
– See my e-mail with respect to the relevant chapters.
• Exercises in the midterm will be similar to the examples in the slides and
assignment.
– Slides also expand more (than both books) in some topics (e.g. decision
trees, Bayesian Nets, ROC curves etc.)
• In the midterm: Calculators and a single-side help sheet are allowed.