Learning with Positive and Unlabeled Examples using Weighted Logistic Regression
Wee Sun Lee
National University of Singapore
Bing Liu
University of Illinois, Chicago
Personalized Web Browser
• Learn web pages that are of interest to you!
• Information that is available to the browser when it is installed:
– Your bookmarks (or cached documents) – positive examples
– All documents on the web – unlabeled examples!
Direct Marketing
• Company has a database with details of its customers – positive examples
• Wants to find people who are similar to its own customers
• Buys a database consisting of details of people, some of whom may be potential customers – unlabeled examples
Assumptions
• All examples are drawn independently from a fixed underlying distribution
• Negative examples are never labeled
• With fixed probability α, a positive example is independently left unlabeled (illustrated in the sketch below)
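As an illustration of these assumptions, here is a minimal simulation sketch in Python (the two-Gaussian class-conditional distributions, the value α = 0.3, and all variable names are made up for illustration; they are not from the paper):

    import numpy as np

    # Minimal simulation of the stated assumptions: (x, y) pairs are drawn
    # i.i.d. from a fixed distribution, negatives are never labeled, and each
    # positive is independently left unlabeled with probability alpha.
    rng = np.random.default_rng(0)
    alpha = 0.3                                   # assumed value, for illustration only
    n = 1000

    y = rng.random(n) < 0.5                       # true (hidden) class labels
    x = rng.normal(np.where(y, 1.0, -1.0)[:, None], 1.0, size=(n, 2))

    labeled_positive = y & (rng.random(n) >= alpha)   # positives that stay labeled
    unlabeled = ~labeled_positive                     # everything else is unlabeled
    print(labeled_positive.sum(), "labeled positives,", unlabeled.sum(), "unlabeled")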
Are Unlabeled Examples Helpful?
• Function known to be either x1 < 0 or x2 > 0
• Which one is it?
[Figure: 2D plot of positive (+) and unlabeled (u) examples, with the candidate boundaries x1 < 0 and x2 > 0 marked]
Not learnable with only positive examples. However, the addition of unlabeled examples makes it learnable.
Related Work
• Denis (1998) showed that function classes learnable in the statistical query model are learnable from positive and unlabeled examples.
• Muggleton (2001) showed that learning from positive examples is possible if the distribution of inputs is known.
• Liu et al. (2002) give sample complexity bounds and an algorithm based on EM.
• Yu et al. (2002) give an algorithm based on SVM.
• …
Approach
• Label all unlabeled examples as negative (Denis 1998) – see the sketch below
– Negative examples are always labeled negative
– Positive examples are labeled negative with probability α
• Training with one-sided noise
• Problem: α is not known
• Also, what if there is some noise on the negative examples? Negative examples are occasionally labeled positive with small probability.
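A minimal sketch of this relabeling step, assuming we already have feature matrices for the labeled positives and the unlabeled examples (pos_X and unl_X are placeholder names, not from the paper):

    import numpy as np

    # Placeholder feature matrices; in practice these come from the application
    # (e.g. bookmarked pages vs. all other pages).
    pos_X = np.random.randn(50, 2)     # labeled positive examples
    unl_X = np.random.randn(500, 2)    # unlabeled examples

    # Treat every unlabeled example as negative: s is the observed label.
    X = np.vstack([pos_X, unl_X])
    s = np.concatenate([np.ones(len(pos_X)), np.zeros(len(unl_X))])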
Selecting Threshold and Robustness to Noise
• Approach: reweight the examples and learn the conditional probability P(Y=1|X)
• If you weight the examples by
– multiplying the negative examples by a weight equal to the number of positive examples, and
– multiplying the positive examples by a weight equal to the number of negative examples
Selecting Threshold and Robustness to Noise
• then P(Y=1|X) > 0.5 when X is a positive example and P(Y=1|X) < 0.5 when X is a negative example, as long as
– α + β < 1, where
• α is the probability that a positive example is labeled negative
• β is the probability that a negative example is labeled positive
• Okay even if some of the positive examples are not actually positive (noise). A small sketch of the weighting follows.
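A small helper sketching the weighting rule above (hypothetical code; s is the observed label from the previous sketch, 1 for labeled positives and 0 for examples treated as negative):

    import numpy as np

    def pu_weights(s):
        """Negatives (here, unlabeled examples treated as negative) get weight
        equal to the number of positives; positives get weight equal to the
        number of negatives."""
        s = np.asarray(s)
        n_pos = int((s == 1).sum())
        n_neg = int((s == 0).sum())
        return np.where(s == 1, n_neg, n_pos).astype(float)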
Weighted Logistic Regression
• Practical algorithm: reweight the examples and then do logistic regression with a linear function to learn P(Y=1|X) – see the sketch below
– Compose a linear function with the sigmoid, then do maximum likelihood estimation
• Convex optimization problem
• Learns the correct conditional probability if it can be represented
• Minimizes an upper bound on the weighted classification error if it cannot be represented – still makes sense.
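A possible implementation sketch using scikit-learn (not the authors' code): the example weights from the previous sketch are passed as sample weights, and scikit-learn's C plays the role of an inverse regularization strength.

    from sklearn.linear_model import LogisticRegression

    def fit_weighted_lr(X, s, w, C=1.0):
        """Linear function composed with the sigmoid, fit by weighted maximum
        likelihood with L2 regularization."""
        clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
        clf.fit(X, s, sample_weight=w)
        return clf

    # e.g. clf = fit_weighted_lr(X, s, pu_weights(s))
    # predict positive when clf.predict_proba(X_new)[:, 1] > 0.5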
Selecting Regularization Parameter
• Regularization is important when learning with noise
• Add c times the sum of squared parameter values to the cost function as regularization (sketched below)
• How to choose the value of c?
– When both positive and negative examples are available, a validation set can be used to choose c.
– Could use weighted examples in a validation set to choose c, but it is not clear this makes sense.
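As a sketch (the exact parameterization in the paper may differ), the regularized weighted objective has the form

    minimize over (β, b):   Σ_i w_i log(1 + exp(−s_i (β·x_i + b)))  +  c Σ_j β_j²

where s_i ∈ {−1, +1} is the observed label in ±1 encoding (unlabeled examples treated as negative, s_i = −1) and w_i is the example weight from the reweighting scheme.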
Selecting Regularization Parameter
• The performance criterion pr/P(Y=1) can be estimated directly from a validation set as r²/P(f(X) = 1)
– Recall r = P(f(X) = 1 | Y = 1)
– Precision p = P(Y = 1 | f(X) = 1)
– By Bayes' rule p = r·P(Y=1)/P(f(X)=1), so pr/P(Y=1) = r²/P(f(X)=1); r can be estimated on the labeled positives and P(f(X)=1) on the whole validation set (see the sketch below)
• Can be used for
– tuning the regularization parameter c
– comparing different algorithms when only positive and unlabeled examples (no negatives) are available
• Behaves similarly to the commonly used F-score F = 2pr/(p+r)
– Reasonable whenever use of the F-score is reasonable
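A sketch of estimating this criterion from a positive/unlabeled validation set (val_X, val_s, and the function name are hypothetical; val_s = 1 marks labeled positives, 0 marks unlabeled examples; fit_weighted_lr and pu_weights are from the earlier sketches):

    import numpy as np

    def pu_criterion(clf, val_X, val_s):
        """Estimate pr/P(Y=1) as r^2 / P(f(X)=1): r is estimated on the labeled
        positives, P(f(X)=1) on the whole validation set."""
        f = clf.predict(val_X)                 # f(X) in {0, 1}
        r = f[np.asarray(val_s) == 1].mean()   # estimate of P(f(X)=1 | Y=1)
        p_fx1 = f.mean()                       # estimate of P(f(X)=1)
        return r * r / p_fx1 if p_fx1 > 0 else 0.0

    # e.g. best_C = max(candidate_Cs,
    #                   key=lambda C: pu_criterion(fit_weighted_lr(X, s, pu_weights(s), C),
    #                                              val_X, val_s))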
Experimental Setup
• 20 Newsgroups dataset
• 1 group positive, the other 19 negative
• Term frequency as features, normalized to length 1
• Randomly split:
– 50% train
– 20% validation
– 30% test
• Validation set used to select the regularization parameter from a small discrete set, then retrain on training + validation set (a setup sketch follows)
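One possible way to reproduce this setup with scikit-learn (a sketch, not the authors' preprocessing; the choice of the positive group and the use of L2 length normalization are assumptions):

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    # One newsgroup as the positive class, the other 19 as negative.
    data = fetch_20newsgroups(subset="all")
    y = (data.target == 0).astype(int)

    X = CountVectorizer().fit_transform(data.data)   # term frequencies
    X = normalize(X, norm="l2")                      # normalize each document to length 1

    # Random 50% / 20% / 30% split into train / validation / test.
    rng = np.random.default_rng(0)
    idx = rng.permutation(X.shape[0])
    train, val, test = np.split(idx, [int(0.5 * len(idx)), int(0.7 * len(idx))])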
Results
F-score averaged over 20 groups:

    α      Opt     pr/P(Y=1)   Weighted Error   S-EM    1-Cls SVM
    0.3    0.757   0.754       0.646            0.661   0.15
    0.7    0.675   0.659       0.619            0.59    0.153
Conclusions
• Learning from positive and unlabeled examples by learning P(Y=1|X) after setting all unlabeled examples to negative.
– Reweighting the examples allows a threshold at 0.5 and makes the method tolerant to negative examples that are mislabeled as positive
• The performance measure pr/P(Y=1) can be estimated from the data
– Useful when the F-score is reasonable
– Can be used to select the regularization parameter
• Logistic regression with a linear function, together with these methods, works well on text classification