
Learning and Making Decisions When Costs and Probabilities are Both Unknown
by B. Zadrozny and C. Elkan
Contents
• Introduction to the Problem
• Previous Work
• Direct Cost-Sensitive Decision Making
• The Dataset
• Estimating Class Membership Probabilities
• Estimating Costs
• Results and Conclusions
Introduction
• Costs/benefits are the values assigned to classification decisions.
• Costs are often different for different examples.
• In cost-sensitive learning we are often most interested in the rare class.
• Hence the problem of unbalanced data.
Cost-Sensitive Decisions
• Each training and test example has an associated cost.
• The general optimal prediction is
  argmin_i Σ_j P(j|x) C(i,j,x)
• Methods differ in how they obtain P(j|x) and C(i,j,x).
• Previous literature has assumed costs are known in advance and independent of examples:
  C(i,j,x) = C(i,j,y) for all x, y
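In code, the argmin rule is a one-liner over an expected-cost vector. Below is a minimal sketch, assuming per-example probability and cost arrays; the names `prob`, `cost`, and `optimal_decision` are illustrative, not from the paper.

```python
import numpy as np

def optimal_decision(prob, cost):
    """Pick the action i minimizing sum_j P(j|x) * C(i, j, x) for one example x.

    prob: shape (n_classes,), estimated P(j|x).
    cost: shape (n_actions, n_classes), example-dependent costs C(i, j, x).
    """
    expected = cost @ prob              # expected cost of each action i
    return int(np.argmin(expected))

# Example: "solicit" (action 1) costs $0.68 but gains $20 if x donates.
prob = np.array([0.95, 0.05])           # P(j=0|x), P(j=1|x)
cost = np.array([[0.00, 0.00],          # action 0: do nothing
                 [0.68, 0.68 - 20.0]])  # action 1: mailing cost minus donation
print(optimal_decision(prob, cost))     # 1: soliciting has lower expected cost
```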
MetaCost
• Estimation of P(j|x)
  – Estimated on the training set only.
• Estimation of C(i,j)
  – Example independent.
• Training relabels each training example with its optimal label.
• A classifier is then learned to predict these labels for test examples.
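As a rough sketch of the relabelling step MetaCost performs (under the example-independent cost matrix this slide assumes): estimate P(j|x) by bagging, then assign each training example its cost-minimizing label. The helper name and bagging details are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def metacost_relabel(X, y, cost, n_bags=10):
    """Relabel training data with cost-optimal labels, MetaCost-style.

    cost[i, j] = cost of predicting class i when the true class is j.
    Assumes every class appears in each bootstrap sample.
    """
    votes = np.zeros((len(X), cost.shape[1]))
    for _ in range(n_bags):                     # bagged probability estimates
        Xb, yb = resample(X, y)
        votes += DecisionTreeClassifier().fit(Xb, yb).predict_proba(X)
    prob = votes / n_bags                       # estimated P(j|x) per example
    return np.argmin(prob @ cost.T, axis=1)     # cost-minimizing label per x

# A classifier trained on these new labels then predicts labels for test examples.
```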
Direct Cost-Sensitive Decision Making
• Estimation of P(j|x)
  – Average of naïve Bayes and decision tree estimates.
  – Estimated on both the training and the test sets.
• Estimation of C(i,j,x)
  – Multiple linear regression.
  – Unbiased estimate using the Heckman procedure.
  – Example dependent.
• Evaluation
  – Evaluated against MetaCost and the KDD competition results.
  – Using a large and difficult dataset, KDD '98.
MetaCost Implementation
• For evaluation, MetaCost is adapted
  – Class probability estimates are obtained by simple methods using decision trees.
  – Costs are made example dependent during training.
• Adapted MetaCost vs. DCSDM
  – DCSDM applies two models to each test example; MetaCost applies one.
  – In DCSDM, P(j|x) is estimated on both training and test examples.
The Data Mining Task
• Data on persons who have donated to a certain charity in the past (KDD '98 competition)
  – Donor / non-donor labelling based on the last campaign.
• The task is to choose which people to solicit for new donations.
• Training/test sets
  – 95,412 records, labelled donor or non-donor, with donation amounts.
  – 96,367 unlabelled records from the same donation campaign.
Data Mining Task cont.
• Cost of soliciting: $0.68.
• Donations range from $1 to $200.
• 5% donors and 95% non-donors.
  – The very low response rate and the varying donation amounts make it hard to beat soliciting everyone.
• The dataset is hard
  – It has already been filtered to be a reasonable set of prospects.
  – The task is to improve upon the unknown method that produced the set.
Applied DCSDM
• For the KDD '98 task, C(i,j,x) becomes B(i,j,x)
  – Costs become benefits.

                      Predict Non-donor   Predict Donor
    Actual Non-donor         0                -0.68
    Actual Donor             0             y(x) - 0.68

• B(1,1,x) is example dependent
  – Previous literature replaced it with a constant.
Optimal Policy
• The expected benefit of not soliciting (i = 0):
  P(j=0|x) B(0,0,x) + P(j=1|x) B(0,1,x) = 0
• The expected benefit of soliciting (i = 1):
  P(j=0|x) B(1,0,x) + P(j=1|x) B(1,1,x)
  = (1 - P(j=1|x)) (-0.68) + P(j=1|x) (y(x) - 0.68)
  = P(j=1|x) y(x) - 0.68
• Optimal policy: solicit whenever P(j=1|x) y(x) - 0.68 > 0
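In code, the whole policy is a single comparison. A minimal sketch, where `p` and `y_hat` stand for the estimated P(j=1|x) and y(x):

```python
SOLICITATION_COST = 0.68

def solicit(p: float, y_hat: float) -> bool:
    """Solicit iff the expected donation exceeds the cost of mailing."""
    return p * y_hat - SOLICITATION_COST > 0

print(solicit(p=0.05, y_hat=20.0))  # True:  0.05 * 20 = 1.00 > 0.68
print(solicit(p=0.02, y_hat=10.0))  # False: 0.02 * 10 = 0.20 < 0.68
```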
Estimating P(j|x)
• Optimal decisions require P(j|x).
• Class sizes may be highly unbalanced.
• Two methods are proposed
  – Decision trees, with smoothing and curtailment.
  – Naïve Bayes, with binning.
Problems with Decision Trees
• Decision trees assign as a score to each leaf the raw training frequency of that leaf:
  p = k / n
  where k is the number of positive training examples in the leaf and n is the total number of examples in it.
• High bias
  – Tree-growing methods try to make leaves homogeneous, so the p's tend to be over- or under-estimates.
• High variance
  – When n is small, p is not to be trusted.
Smoothing
• Pruning does not help here.
• To make the estimates less extreme, replace
  p = k / n    with    p' = (k + b·m) / (n + m)
• b is the base rate; m is a heuristic value controlling the smoothing strength.
• Effect
  – Where k and n are small, p' is essentially just the base rate.
  – If k and n are larger, p' is a combination of the base rate and the original score.
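This is the familiar m-estimate correction; a minimal sketch, with `base_rate` and `m` matching the slide's b and m:

```python
def smoothed_score(k: int, n: int, base_rate: float, m: float) -> float:
    """Shrink a leaf's raw frequency k/n toward the base rate."""
    return (k + base_rate * m) / (n + m)

# With b = 0.05 and m = 200 (values chosen for illustration):
print(smoothed_score(k=2, n=2, base_rate=0.05, m=200))      # ~0.059, not 1.0
print(smoothed_score(k=200, n=1000, base_rate=0.05, m=200)) # 0.175 vs raw 0.20
```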
Smoothed Scores
[figure not included in transcript]
Curtailment
• What if the leaves have enough training examples to be statistically reliable?
  – Then smoothing seems unnecessary.
• Curtailment searches through the tree and removes nodes where n < v.
  – v is chosen either through cross-validation or by a heuristic such as b·v = 10.
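A minimal sketch of curtailed scoring on a hand-rolled tree structure (the node layout is illustrative, not the paper's): score each example at the deepest node on its path that still holds at least v training examples.

```python
# Node layout (illustrative): {"n": examples, "k": positives,
#   "split": (feature_index, threshold) or None, "left": node, "right": node}
def curtailed_score(node, x, v):
    """Score x at the deepest node on its path with at least v examples."""
    best = node["k"] / node["n"]        # score of the deepest valid node so far
    while node["split"] is not None:
        feature, threshold = node["split"]
        node = node["left"] if x[feature] <= threshold else node["right"]
        if node["n"] < v:               # child too small to be reliable: stop
            break
        best = node["k"] / node["n"]
    return best
```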
Curtailed Tree
[figure not included in transcript]
Curtailed Scores
[figure not included in transcript]
Naïve Bayes Classifiers
• Naïve Bayes assumes that within any class the attribute values are independent.
• This assumption gives inaccurate probability estimates.
• Attributes tend to be positively correlated, so naïve Bayes estimates tend to be too extreme, i.e. close to zero or one.
• Still, they do rank examples well:
  if n(x) > n(y) then P(j=1|x) > P(j=1|y)
  where n(x) is the naïve Bayes score of x.
Calibrating Naïve Bayes Scores
• The histogram method:
  – Sort the training examples by their naïve Bayes scores.
  – Divide the sorted set into b subsets of equal size, called bins.
  – For each bin, compute the lower and upper boundary naïve Bayes scores.
• Given a new data point x:
  – Calculate n(x) and find the associated bin.
  – Let P(j|x) be the fraction of positive training examples in that bin.
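A minimal sketch of this histogram binning, assuming naïve Bayes scores and binary labels for the training set; the function names are illustrative:

```python
import numpy as np

def fit_bins(scores, labels, b=10):
    """Equal-frequency bins over sorted scores: each bin keeps its upper
    boundary score and its fraction of positive training examples."""
    order = np.argsort(scores)
    score_bins = np.array_split(np.asarray(scores)[order], b)
    label_bins = np.array_split(np.asarray(labels)[order], b)
    uppers = np.array([sb[-1] for sb in score_bins])
    fracs = np.array([lb.mean() for lb in label_bins])
    return uppers, fracs

def calibrated_prob(score, uppers, fracs):
    """P(j=1|x) = positive fraction of the bin containing the new score."""
    i = np.searchsorted(uppers, score)
    return fracs[min(i, len(fracs) - 1)]
```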
Averaging Probability Estimates
• If probability estimates are partially uncorrelated, then averaging these estimates will reduce their variance.
• Assuming all estimates have the same variance, the average estimate has a variance given by:
  σ_avg² = σ² (1 + ρ(N − 1)) / N
where σ² is the individual classifier variance, N is the number of classifiers, and ρ is the correlation factor among all classifiers.
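A quick numeric check of the formula (values chosen for illustration; the paper's case is N = 2, averaging the naïve Bayes and decision-tree estimates):

```python
def avg_variance(sigma2: float, n: int, rho: float) -> float:
    """Variance of the mean of n equally correlated estimators."""
    return sigma2 * (1 + rho * (n - 1)) / n

print(avg_variance(1.0, 2, 1.0))  # 1.0:  fully correlated, no reduction
print(avg_variance(1.0, 2, 0.5))  # 0.75: partial correlation, some reduction
print(avg_variance(1.0, 2, 0.0))  # 0.5:  uncorrelated, full 1/n reduction
```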
Estimating Donation Amount
• Solicit a person based on the policy
  P(j=1|x) y(x) - 0.68 > 0
  where y(x) is the estimated donation amount.
Cost and Probability
• Good decisions
  – Estimating costs well is more important than estimating probabilities well.
• Why?
  – The relative variation of costs across examples is much greater than the relative variation of probabilities.
• Probability
  – Estimating donation probability is difficult.
• Estimating donation amounts is easier, because past amounts are excellent predictors of future amounts.
Training and Test Data
• Two random processes
  – Whether or not to donate.
  – How much to donate?
• Donation amount:

                    Donor     Non-donor
    Training data   Known     -
    Test data       unknown   unknown

• The method used for estimating the donation amount is multiple linear regression (MLR).
Multiple Linear Regression
• Two attributes are used
  – lastgift: dollar amount of the most recent gift.
  – ampergift: average gift amount in response to the last 22 promotions.
• A linear regression equation is used to estimate the donation amount.
• Only 46 of the 4,843 recorded donations have an amount above $50.
• Donors who donated at most $50 are used as input for the linear regression.
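A minimal sketch of that regression step with scikit-learn, assuming arrays holding the two attributes and the donation amounts of training-set donors (all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_donation_model(lastgift, ampergift, amount, cap=50.0):
    """Fit y(x) on donors only, excluding the rare donations above the cap."""
    X = np.column_stack([lastgift, ampergift])
    keep = amount <= cap                          # drop the 46 large outliers
    return LinearRegression().fit(X[keep], amount[keep])

# y_hat = model.predict(np.column_stack([lastgift_new, ampergift_new]))
```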
Problem of Sample Selection Bias
• Reasoning outside your learning space.
• Donation amount:

                    Donor     Non-donor
    Training data   Known     -
    Test data       unknown   unknown

• Estimating the donation amount
  – Any donation estimator is learned from the people who actually donated.
  – This estimator is then applied to a different population consisting of both donors and non-donors.
Donation Amount and Probability Estimates are Negatively Correlated
[figure not included in transcript]
Solution to Sample Selection Bias
• Heckman's procedure
  – Estimate the conditional probability P(j=1|x) using a linear probit model.
  – Estimate y(x) on the training examples with j = 1, including for each x a transformation of the estimated conditional probability.
• The authors' own procedure
  – The conditional probability is learned using a decision tree or naïve Bayes classifier.
  – These probability estimates are added as additional attributes when estimating y(x).
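A minimal sketch of the authors' variant described above: learn P(j=1|x), append it as an extra attribute, and regress on donors only. The classifier choice and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

def fit_with_prob_feature(X, donated, amount):
    """Counter selection bias by adding P(j=1|x) as a regression attribute."""
    clf = GaussianNB().fit(X, donated)          # donation-probability model
    p = clf.predict_proba(X)[:, 1]              # estimated P(j=1|x)
    Xp = np.column_stack([X, p])                # augmented attributes
    reg = LinearRegression().fit(Xp[donated == 1], amount[donated == 1])
    return clf, reg

def predict_amount(clf, reg, X_new):
    p = clf.predict_proba(X_new)[:, 1]
    return reg.predict(np.column_stack([X_new, p]))
```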
Experimental Results: Direct Cost-Sensitive Decision Making
[results table not included in transcript]
Experimental Results: MetaCost
[results table not included in transcript]
Interpretation
• With Heckman
  – Profit on the test set increases by $484 across all probability estimation methods.
  – This systematic improvement indicates that Heckman's procedure solves the problem of sample selection bias.
• MetaCost
  – The best MetaCost result is $14,113.
  – The best direct cost-sensitive result is $15,329.
  – On average, the profit achieved by MetaCost on the test set is $1,751 lower than that achieved by direct cost-sensitive decision making.
Statistical Significance of Results
• There are 4,872 donors in the fixed test set, with an average donation of $15.62.
• For a different test set drawn randomly from the same probability distribution, the number of donors would have a standard deviation of about sqrt(4872) ≈ 70.
• That fluctuation alone changes profit by about $1,090:
  sqrt(4872) × $15.62 ≈ $1,090
• A profit difference between two methods of less than $1,090 is therefore not significant.
Conclusions
• Direct cost-sensitive decision making performs better than MetaCost.
• It provides a solution to the fundamental problem of costs being example dependent.
• It identifies and solves the problem of sample selection bias for the KDD '98 dataset.
Questions?