Universal Learning over Related Distributions and Adaptive Graph Transduction

Erheng Zhong†, Wei Fan‡, Jing Peng*, Olivier Verscheure‡, and Jiangtao Ren†
†Sun Yat-Sen University
‡IBM T. J. Watson Research Center
*Montclair State University

1. Go beyond transfer learning to sample selection bias and uncertainty mining
2. Unified framework
3. One single solution: supervised case
Standard Supervised Learning
[Figure: a classifier is trained on labeled New York Times articles and tested on unlabeled New York Times articles; accuracy 85.5%]
Sample Selection Bias
[Figure: training (labeled) and test (unlabeled) data both come from the New York Times, but accuracy drops from 85.5% to 78.5%]
• Training and test data have different word vector distributions
  – August: a lot about the typhoon in Taiwan
  – September: a lot about the US Open
Uncertainty Data Mining
• Training data:
  – Both feature vectors and class labels contain noise (usually Gaussian)
  – Common for data collected from sensor networks
• Test data:
  – Feature vectors contain noise
Summary
• Traditional supervised learning:
  – Training and test data follow the identical distribution
• Transfer learning:
  – Training and test data come from different domains
• Sample selection bias:
  – Same domain, but the distributions differ
  • e.g., data missing not at random
• Uncertain data mining:
  – Data contain noise
• In other words: in all three cases, training and test data come from different distributions.
• Traditionally, each problem is handled separately.
Main Challenge
• Performance drop caused by: domain transfer, sample selection bias, noisy training data
• Common root: data from different but related distributions
• Could one solve these different but similar problems under a uniform framework, with the same solution?
→ Universal Learning
Universal Learning
• Consider the subsets of X that are the support of some hypothesis in a fixed hypothesis space ([Blitzer et al., 2008]).
• The distance between two distributions ([Blitzer et al., 2008]) is twice the largest difference in probability mass that the training and test distributions assign to any such subset.
• Example with hypotheses h1, h2, h3 over regions a, b, c, d of the training and test distributions:
  – h1(a)=0, h2(a)=1, h3(a)=1
  – h1(b)=1, h2(b)=1, h3(b)=1
  – h1(c)=1, h2(c)=1, h3(c)=1
  – h1(d)=0, h2(d)=1, h3(d)=0
  – Db = |ptr(b) - pte(b)|, Dc = |ptr(c) - pte(c)|
  – distance = 2 * max(Db, Dc)
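Written out formally (a sketch following the notation commonly used for this distance in Blitzer et al., 2008; the symbol names here are my choice):

```latex
% \mathcal{H}: the fixed hypothesis space.
% I(h) = \{x \in X : h(x) = 1\}: the support of hypothesis h.
% p_{tr}, p_{te}: the training and test distributions.
d_{\mathcal{H}}(p_{tr}, p_{te})
  = 2 \sup_{h \in \mathcal{H}}
    \Bigl| \Pr_{x \sim p_{tr}}\bigl[x \in I(h)\bigr]
         - \Pr_{x \sim p_{te}}\bigl[x \in I(h)\bigr] \Bigr|
```

In the example above, only the subsets b and c are supports of hypotheses in the space, so the supremum reduces to max(Db, Dc).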
How to Handle Universal Learning?
• Most traditional classifiers cannot guarantee performance when the training and test distributions are different.
• Could we find one classifier under a weaker assumption?
Graph Transduction?
Advantage of Graph Transduction
• Weaker assumption: the decision boundary lies in the low-density regions of the unlabeled data.
[Figure: two-Gaussians vs. two-arcs data sets]
Just Graph Transduction?
• "Un-smooth labels" (labeled examples falling in low-density regions) and "class imbalance" ([Wang et al., 2008]) may mislead the decision boundary into passing through high-density regions.
[Figure: the bottom part is closest to a red square, and there are more red squares than blue squares]
Sample Selection: which samples?
Maximum Margin Graph Transduction
• In margin terms, unlabeled data with low margin are likely misclassified!
[Figure: two decision boundaries on the same data; selecting a bad sample yields a low-margin boundary, selecting a good sample yields a high-margin boundary]
Main Flow
1. Sample Selection: from the labeled source-domain data, select the samples that maximize the unlabeled data margin (Maximum Margin Graph Transduction)
2. Label Propagation: predict the labels of the unlabeled target-domain data
3. Update the graph and repeat
4. Average Ensemble: combine the per-iteration predictions to further lift the unlabeled data margin
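The flow above can be sketched in code. This is a minimal illustration, not the paper's exact algorithm: the RBF graph, the per-round candidate sampling, and names such as `margin_graph` are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_affinity(X, sigma=1.0):
    """Pairwise RBF similarities used as graph edge weights."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def propagate(X_lab, y_lab, X_unlab, alpha=0.9, n_iter=60):
    """Label propagation on a joint graph; returns class scores for X_unlab."""
    X = np.vstack([X_lab, X_unlab])
    W = rbf_affinity(X)
    np.fill_diagonal(W, 0.0)
    d = W.sum(1)
    S = W / np.sqrt(np.outer(d, d))        # symmetric normalization
    Y = np.zeros((len(X), 2))
    Y[np.arange(len(X_lab)), y_lab] = 1.0  # clamp the labeled nodes
    F = Y.copy()
    for _ in range(n_iter):                # F <- alpha*S@F + (1-alpha)*Y
        F = alpha * (S @ F) + (1 - alpha) * Y
    F_u = F[len(X_lab):]
    return F_u / F_u.sum(1, keepdims=True)

def unlabeled_margin(F_u):
    """Average gap between the top-two class scores on the unlabeled data."""
    s = np.sort(F_u, axis=1)
    return float((s[:, -1] - s[:, -2]).mean())

def margin_graph(X_lab, y_lab, X_unlab, n_rounds=3, subset=20, n_cand=5):
    """Each round: among candidate labeled subsets, keep the one whose
    propagated labels maximize the unlabeled margin; average the ensemble."""
    F_sum = np.zeros((len(X_unlab), 2))
    for _ in range(n_rounds):
        best_F, best_m = None, -1.0
        for _ in range(n_cand):
            idx = rng.choice(len(X_lab), size=min(subset, len(X_lab)),
                             replace=False)
            if len(np.unique(y_lab[idx])) < 2:
                continue                   # skip degenerate one-class subsets
            F_u = propagate(X_lab[idx], y_lab[idx], X_unlab)
            m = unlabeled_margin(F_u)
            if m > best_m:
                best_m, best_F = m, F_u
        if best_F is not None:
            F_sum += best_F
    return F_sum.argmax(1)
```

On two well-separated Gaussian blobs this recovers the correct labels for the unlabeled points; the candidate-subset search stands in for the paper's margin-maximizing sample selection.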
Properties
• The test error of Adaptive Graph Transduction can be bounded in terms of the empirical training error, the distance between the training and test distributions, and the error of the ideal hypothesis approximating both.
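For reference, this is the shape of the standard domain-adaptation bound ([Blitzer et al., 2008]); a sketch in their notation, not the exact statement from this paper:

```latex
% \hat{\epsilon}_{tr}(h): empirical error on the training sample;
% d_{\mathcal{H}\Delta\mathcal{H}}(p_{tr}, p_{te}): the distribution
%   distance defined earlier, taken over the symmetric-difference space;
% \lambda = \min_{h' \in \mathcal{H}} \bigl[\epsilon_{tr}(h')
%   + \epsilon_{te}(h')\bigr]: error of the ideal joint hypothesis.
\epsilon_{te}(h) \;\le\; \hat{\epsilon}_{tr}(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(p_{tr}, p_{te})
  \;+\; \lambda
  \;+\; O\!\left(\sqrt{\tfrac{d \log n}{n}}\right)
```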
Properties
• If a classifier has a larger unlabeled data margin, its training error is smaller (recall the previous theorem).
• The average ensemble is likely to achieve a larger margin.
Experiment – Data Set
• Transfer Learning
  – Reuters: 21,578 Reuters news articles organized by category (e.g., org vs. place); the source domain uses subcategories org.subA and place.subA, the target domain uses org.subB and place.subB
  – SyskillWebert: HTML source of web pages (Sheep, Goats, Biomedical, Bands) plus the ratings of a user on those pages, with one topic as source domain and another as target domain
  – In both cases: first fill up the "gap" between the domains, then use a kNN classifier to do classification
Experiment – Data Set
• Sample Selection Bias Correction
  – UCI data sets: Ionosphere, Diabetes, Haberman, WDBC
  – Bias generation: (1) randomly select 50% of the features, then sort the data set according to each selected feature from high to low; (2) take the top instances from every sorted list as the training set
• Uncertainty Mining
  – Kent Ridge Biomedical Repository: high-dimensional, low-sample-size (HDLSS) data
  – Generate two different Gaussian noises and add them into the training and test sets
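The bias-generation procedure can be sketched as follows; `biased_split` and the `top_per_feature` knob are hypothetical names I introduce for illustration, not from the paper.

```python
import numpy as np

def biased_split(X, y, frac_features=0.5, top_per_feature=None, rng=None):
    """Sample-selection-bias generation as described above: randomly
    select 50% of the features, sort the data set on each selected
    feature from high to low, and take the top instances of every
    sorted list as the (biased) training set; the remaining instances
    form the test set."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    feats = rng.choice(d, size=max(1, int(frac_features * d)), replace=False)
    top = top_per_feature or max(1, n // (4 * len(feats)))
    train_idx = set()
    for f in feats:
        order = np.argsort(-X[:, f])        # sort this feature high -> low
        train_idx.update(order[:top].tolist())
    train_idx = np.array(sorted(train_idx))
    test_mask = np.ones(n, dtype=bool)
    test_mask[train_idx] = False            # everything not selected is test
    return (X[train_idx], y[train_idx]), (X[test_mask], y[test_mask])
```

Because training instances are the high end of the selected features, the training distribution is shifted relative to the test distribution, which is exactly the bias the correction methods must undo.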
Experiment -- Baseline methods
• Original graph transduction algorithm ([Zhu, 2005])
  – Using the entire training data set
  – Variation (RandGraph): a randomly selected sample whose size equals the one chosen by MarginGraph
• CDSC: transfer learning approach ([Ling et al., 2008])
  – Finds a mapping space that optimizes a consistency measure between the out-domain supervision and the in-domain intrinsic structure
• BRSD-BK / BRSD-DB: bias correction approach ([Ren et al., 2008])
  – Discover structure and re-balance using unlabeled data
Performance -- Transfer Learning
[Accuracy chart: Graph, RandGraph, CDSC, and MarginGraph on six data sets (O vs Pe, O vs Pl, Pe vs Pl, Sheep, Goats, Biomedical); y-axis 50-100]
Accuracy: MarginGraph performs best on 5 of 6 data sets!
[AUC chart: same methods and data sets; y-axis 50-100]
AUC: MarginGraph performs best on 5 of 6 data sets!
Performance -- Sample Selection Bias
[Accuracy and AUC charts: Graph, RandGraph, BRSD-BK, BRSD-DB, and MarginGraph on Ionosphere, Diabetes, Haberman, and WDBC; y-axis 50-100]
Accuracy: best on all 4 data sets!
AUC: best on 2 of 4 data sets.
Performance -- Uncertainty Mining
[Accuracy and AUC charts: Graph, RandGraph, and MarginGraph on Leukemia, ColonTumor, ProCancer, and CNS; accuracy y-axis 50-100, AUC y-axis 40-100]
Accuracy: best on all 4 data sets!
AUC: best on all 4 data sets!
Margin Analysis
• MarginBase is the base classifier of MarginGraph in each iteration.
• LowBase is a "minimal margin classifier" that selects samples so as to build a classifier with minimal unlabeled data margin.
• LowGraph is the averaging ensemble of LowBase.
Findings: maximal margin is better than minimal margin, and the ensemble is better than any single classifier.
Conclusion
• Covers different formulations where the training and test sets are drawn from related but different distributions.
• Flow
  – Step 1, Sample Selection: select labeled data from the different distribution that maximize the unlabeled data margin
  – Step 2, Label Propagation: label the unlabeled data
  – Step 3, Ensemble: further lift the unlabeled data margin
• Code and data available from http://www.cs.columbia.edu/~wfan