Linear Model (III)
Semi-supervised Learning
Rong Jin
Spectrum of Learning Problems
What is Semi-supervised Learning
Learning from a mixture of labeled and unlabeled examples
Labeled data: $L = \{(x^l_1, y_1), \ldots, (x^l_{n_l}, y_{n_l})\}$
Unlabeled data: $U = \{x^u_1, \ldots, x^u_{n_u}\}$
Total number of examples: $N = n_l + n_u$
Goal: learn a classification function $f(x): X \to Y$
Why Semi-supervised Learning?
Labeling is expensive and difficult
Labeling is unreliable
Ex. Segmentation applications
Need for multiple experts
Unlabeled examples
Easy to obtain in large numbers
Ex. Web pages, text documents, etc.
Semi-supervised Learning Problems
Classification
Transductive – predict labels of unlabeled data
Inductive – learn a classification function
Clustering (constrained clustering)
Ranking (semi-supervised ranking)
Almost every learning problem has a semi-supervised counterpart.
Why Unlabeled Data Could Be Helpful
Clustering assumption:
Unlabeled data help decide the decision boundary $f(X) = 0$
Manifold assumption:
Unlabeled data help decide the decision function $f(X)$
Clustering Assumption
Points with the same label are connected through high-density regions, thereby defining a cluster
Clusters are separated through low-density regions
This suggests a simple algorithm for semi-supervised learning
(Figure: labeled points of two classes, with unlabeled points marked "?")
Manifold Assumption
Graph representation
Vertex: training example
(labeled and unlabeled)
Edge: similar examples
(Figure: a graph over labeled and unlabeled examples, with two connected vertices $x_1$ and $x_2$ highlighted)
Regularize the classification function f(x):
if $x_1$ and $x_2$ are connected, then $|f(x_1) - f(x_2)|$ should be small
Manifold Assumption
Graph representation
Vertex: training example
(labeled and unlabeled)
Edge: similar examples
Manifold assumption
Data lies on a low-dimensional manifold
Classification function f(x) should “follow” the
data manifold
Statistical View
Generative model for classification
$\Pr(X, Y\,|\,\theta, \eta) = \Pr(X\,|\,Y; \theta)\,\Pr(Y\,|\,\eta)$
(Graphical model with variables X, Y and parameter θ)
Statistical View
Generative model for classification
$\Pr(X, Y\,|\,\theta, \eta) = \Pr(X\,|\,Y; \theta)\,\Pr(Y\,|\,\eta)$
Unlabeled data help estimate $\Pr(X\,|\,Y; \theta)$ (clustering assumption):
$\Pr(X_u)\,\Pr(X_l\,|\,Y) = \prod_{i=1}^{n_u}\left(\sum_{y=1}^{K}\Pr(x^u_i\,|\,y)\,\Pr(y)\right)\prod_{i=1}^{n_l}\Pr(x^l_i\,|\,y_i)$
Statistical View
Discriminative model for classification
$\Pr(X, Y\,|\,\theta, \mu) = \Pr(X\,|\,\mu)\,\Pr(Y\,|\,X; \theta)$
(Graphical model with variables X, Y and parameters θ, μ)
Statistical View
Discriminative model for classification
$\Pr(X, Y\,|\,\theta, \mu) = \Pr(X\,|\,\mu)\,\Pr(Y\,|\,X; \theta)$
Unlabeled data help regularize θ through a prior $\Pr(\theta\,|\,X)$ (manifold assumption):
$\Pr(\theta\,|\,X) \propto \exp\left(-\sum_{i,j=1}^{n_u} w_{i,j}\,\bigl[f(x^u_i) - f(x^u_j)\bigr]^2\right) = \exp\left(-\sum_{i,j=1}^{n_u} w_{i,j}\,\bigl[\theta^\top(x^u_i - x^u_j)\bigr]^2\right)$
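A short side derivation (mine, not on the slide): for a linear classifier $f(x) = \theta^\top x$, writing $X_u$ for the matrix whose rows are the unlabeled examples and assuming symmetric weights $w_{i,j} = w_{j,i}$, the exponent above is a quadratic form in θ built from the graph Laplacian $L = D - W$ of the unlabeled data (defined later in the talk):

$$\sum_{i,j=1}^{n_u} w_{i,j}\,\bigl[\theta^\top(x^u_i - x^u_j)\bigr]^2 \;=\; 2\,(X_u\theta)^\top (D - W)\,(X_u\theta) \;=\; 2\,\theta^\top X_u^\top L\, X_u\,\theta$$

so the prior penalizes directions of θ along which the predictions vary a lot across connected unlabeled points.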
Semi-supervised Learning Algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Label Propagation: Key Idea
A decision boundary
based on the labeled
examples is unable to
take into account the
layout of the data points
How to incorporate the
data distribution into the
prediction of class labels?
Label Propagation: Key Idea
Connect the data points
that are close to each
other
Label Propagation: Key Idea
Connect the data points
that are close to each
other
Propagate the class labels
over the connected graph
Label Propagation: Key Idea
Connect the data
points that are close
to each other
Propagate the class
labels over the
connected graph
Different from the K-Nearest-Neighbor (KNN) classifier
Label Propagation: Representation
Adjacency matrix $W \in \{0,1\}^{N\times N}$:
$W_{i,j} = \begin{cases}1 & x_i \text{ and } x_j \text{ are connected}\\ 0 & \text{otherwise}\end{cases}$
Similarity matrix $W \in \mathbb{R}_+^{N\times N}$:
$W_{i,j}$: similarity between $x_i$ and $x_j$
Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with $d_i = \sum_{j\neq i} W_{i,j}$
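The slides leave the construction of W abstract; below is a minimal sketch (my own, not from the talk) of one common choice, a Gaussian similarity restricted to the k nearest neighbours. The function name `build_graph` and the parameters `k` and `sigma` are illustrative.

```python
import numpy as np

def build_graph(X, k=5, sigma=1.0):
    """Symmetric k-NN similarity matrix W and degree matrix D for data X (rows = examples)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # pairwise squared distances
    W = np.exp(-d2 / (2.0 * sigma ** 2))                 # Gaussian similarity
    np.fill_diagonal(W, 0.0)                             # no self-loops
    drop = np.argsort(W, axis=1)[:, :-k]                 # all but the k most similar neighbours
    for i in range(n):
        W[i, drop[i]] = 0.0
    W = np.maximum(W, W.T)                               # symmetrize: keep an edge if either end keeps it
    D = np.diag(W.sum(axis=1))                           # degree matrix, d_i = sum_j W_ij
    return W, D
```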
Label Propagation: Representation
Given:
Similarity matrix $W \in \mathbb{R}_+^{N\times N}$
Label information:
$y^l = (y^l_1, \ldots, y^l_{n_l}) \in \{-1,+1\}^{n_l}$
$y^u = (y^u_1, \ldots, y^u_{n_u}) \in \{-1,+1\}^{n_u}$
$y = (y^l, y^u)$
Label Propagation
Initial class assignments $\hat{y} \in \{-1, 0, +1\}^N$:
$\hat{y}_i = \begin{cases}\pm 1 & x_i \text{ is labeled}\\ 0 & x_i \text{ is unlabeled}\end{cases}$
Predicted class assignments:
First predict the confidence scores $f = (f_1, \ldots, f_N) \in \mathbb{R}^N$
Then predict the class assignments $y \in \{-1,+1\}^N$:
$y_i = \begin{cases}+1 & f_i > 0\\ -1 & f_i \le 0\end{cases}$
Label Propagation (II)
One round of propagation:
$f_i = \begin{cases}\hat{y}_i & x_i \text{ is labeled}\\ \alpha\sum_{j=1}^{N} W_{i,j}\,\hat{y}_j & \text{otherwise}\end{cases}$
($\alpha$: weight for each propagation; this step acts like a weighted KNN)
In matrix form: $f^1 = \hat{y} + \alpha W\hat{y}$
Label Propagation (II)
Two rounds of propagation:
$f^2 = f^1 + \alpha W f^1 = \hat{y} + \alpha W\hat{y} + \alpha^2 W^2\hat{y}$
Result for any number $k$ of iterations:
$f^k = \hat{y} + \sum_{i=1}^{k}\alpha^i W^i\hat{y}$
Result for an infinite number of iterations (matrix inverse):
$f^{\infty} = \hat{y} + \sum_{i=1}^{\infty}\alpha^i W^i\hat{y} = (I - \alpha W)^{-1}\hat{y}$
Normalized similarity matrix: $\bar{W} = D^{-1/2}\, W\, D^{-1/2}$
Local and Global Consistency [Zhou et al., NIPS 2003]
Local consistency: like KNN
Global consistency: beyond KNN
Summary:
Construct a graph using pairwise similarities
Propagate class labels along the graph: $f = (I - \alpha\bar{W})^{-1}\hat{y}$
Key parameters:
$\alpha$: the decay of propagation
$W$: the similarity matrix
Computational complexity:
Matrix inverse: $O(N^3)$
Cholesky decomposition
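A compact sketch of the propagation just summarized (illustrative, not the authors' code). `W` is a similarity matrix such as the one constructed earlier, `y_init` holds ±1 for labeled points and 0 for unlabeled ones, and `alpha < 1` is the decay of propagation; `np.linalg.solve` is used instead of forming the $O(N^3)$ inverse explicitly.

```python
import numpy as np

def label_propagation(W, y_init, alpha=0.5):
    """Label propagation: f = (I - alpha * W_bar)^{-1} y_init, then threshold."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    W_bar = D_inv_sqrt @ W @ D_inv_sqrt                  # normalized similarity D^{-1/2} W D^{-1/2}
    N = W.shape[0]
    # alpha < 1 keeps (I - alpha * W_bar) invertible; solve() avoids building the inverse
    f = np.linalg.solve(np.eye(N) - alpha * W_bar, y_init)
    return np.where(f > 0, 1, -1)                        # confidence scores -> class assignments
```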
Questions
Cluster assumption or manifold assumption?
Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?
Application: Text Classification [Zhou et al., NIPS 2003]
Dataset: 20-newsgroups — autos, motorcycles, baseball, and hockey under rec
Pre-processing: stemming, removal of stopwords & rare words, and skipping headers
#Docs: 3970, #Words: 8014
(Figure: accuracy of SVM, KNN, and label propagation vs. number of labeled examples)
Application: Image Retrieval [Wang et al., ACM MM 2004]
5,000 images; relevance feedback for the top 20 ranked images
Classification problem: relevant or not? f(x): degree of relevance
Learning the relevance function f(x):
Supervised learning: SVM
Label propagation
(Figure: retrieval accuracy of SVM vs. label propagation)
Semi-supervised Learning Algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Graph Partitioning
Classification as graph partitioning:
Search for a classification boundary that is consistent with the labeled examples
Prefer the partition with a small graph cut
(Figure: two candidate partitions, one with graph cut = 2 and one with graph cut = 1)
Min-cuts [Blum and Chawla, ICML 2001]
Add two additional nodes: V+ (source) and V− (sink)
Infinite weights connect the labeled examples to the source/sink
High computational cost
Harmonic Function [Zhu et al., ICML 2003]
Weight matrix $W$: $w_{i,j} \ge 0$ is the similarity between $x_i$ and $x_j$
Membership vector $f = (f_1, \ldots, f_N)$:
$f_i = \begin{cases}+1 & x_i \in A\\ -1 & x_i \in B\end{cases}$
(Figure: a graph partitioned into clusters A, labeled +1, and B, labeled −1)
Harmonic Function (cont’d)
Graph cut:
$C(f) = \frac{1}{4}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4}\, f^\top (D - W)\, f = \frac{1}{4}\, f^\top L\, f$
Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$, with diagonal elements $d_i = \sum_{j\neq i} W_{i,j}$
(Figure: a partition into clusters A (+1) and B (−1))
Harmonic Function (cont’d)
Graph cut:
$C(f) = \frac{1}{4}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4}\, f^\top (D - W)\, f = \frac{1}{4}\, f^\top L\, f$
Graph Laplacian $L = D - W$:
Captures the pairwise relationships among data points
Captures the manifold geometry of the data
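A small numeric illustration (my own toy example, not from the talk) of why the quadratic form $f^\top L f$ is a useful objective: it is small for label assignments that cut only low-weight edges and large for assignments that cut through densely connected regions.

```python
import numpy as np

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1                          # weak bridge between the two clusters

L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian L = D - W

f_good = np.array([+1, +1, +1, -1, -1, -1])      # cuts only the weak bridge
f_bad = np.array([+1, +1, -1, -1, -1, -1])       # cuts through a dense triangle

print(f_good @ L @ f_good)                       # ~0.4  (small cut)
print(f_bad @ L @ f_bad)                         # ~8.0  (large cut)
```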
Harmonic Function
$\min_{f \in \{-1,+1\}^N}\; C(f) = \frac{1}{4} f^\top L f$  (consistency with the graph structure)
s.t. $f_i = y_i,\; 1 \le i \le n_l$  (consistency with the labeled data)
Challenge: discrete search space → combinatorial optimization
Harmonic Function
$\min_{f \in \{-1,+1\}^N}\; C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; 1 \le i \le n_l$
Relaxation: $\{-1,+1\}$ → continuous real numbers
$\min_{f \in \mathbb{R}^N}\; C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; 1 \le i \le n_l$
Convert the continuous $f$ back to binary class assignments (by thresholding)
Harmonic Function
$\min_{f \in \mathbb{R}^N}\; C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; 1 \le i \le n_l$
Partition the Laplacian and $f$ into labeled and unlabeled blocks:
$L = \begin{pmatrix} L_{l,l} & L_{l,u} \\ L_{u,l} & L_{u,u} \end{pmatrix}, \quad f = (f_l, f_u)$
Closed-form solution: $f_u = -L_{u,u}^{-1}\, L_{u,l}\, y_l$
Question: how to handle a large number of unlabeled data points?
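A minimal sketch of the closed-form solution above (illustrative; the function and variable names are mine). It assumes the rows and columns of the Laplacian `L` are ordered so that the first `n_l` indices correspond to the labeled examples.

```python
import numpy as np

def harmonic_solution(L, y_l):
    """Relaxed harmonic solution f_u = -L_uu^{-1} L_ul y_l, thresholded to labels."""
    n_l = len(y_l)
    L_uu = L[n_l:, n_l:]                 # Laplacian block over the unlabeled points
    L_ul = L[n_l:, :n_l]                 # coupling between unlabeled and labeled points
    f_u = -np.linalg.solve(L_uu, L_ul @ np.asarray(y_l, dtype=float))
    return np.where(f_u > 0, 1, -1)
```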
Harmonic Function
$f_u = -L_{u,u}^{-1}\, L_{u,l}\, y_l$
$L_{u,l}\, y_l$: local propagation of the label information; $L_{u,u}^{-1}$: global propagation over the graph
Sound familiar? (compare with label propagation: $f = (I - \alpha\bar{W})^{-1}\hat{y}$)
Spectral Graph Transducer [Joachims, 2003]
Start from the harmonic function problem:
$\min_{f \in \mathbb{R}^N}\; C(f) = \frac{1}{4} f^\top L f$ s.t. $f_i = y_i,\; 1 \le i \le n_l$
Soften the hard constraints into a penalty:
$\min_{f \in \mathbb{R}^N}\; C(f) = \frac{1}{4} f^\top L f + \alpha\sum_{i=1}^{n_l}(f_i - y_i)^2$ s.t. $\sum_{i=1}^{N} f_i^2 = N$
Solved as a constrained eigenvector problem
Manifold Regularization [Belkin, 2006]
Start from the spectral graph transducer:
$\min_{f \in \mathbb{R}^N}\; C(f) = \frac{1}{4} f^\top L f + \alpha\sum_{i=1}^{n_l}(f_i - y_i)^2$ s.t. $\sum_{i=1}^{N} f_i^2 = N$
(the constraint regularizes the norm of the classifier; the second term is the loss function for misclassification)
Manifold regularization generalizes this with an arbitrary loss function $l(f(x_i), y_i)$ and an explicit norm regularizer:
$\min_{f}\; f^\top L f + \alpha\sum_{i=1}^{n_l} l(f(x_i), y_i) + \gamma\,\|f\|_{\mathcal{H}_K}^2$
where $f^\top L f$ is the manifold regularizer and $\|f\|_{\mathcal{H}_K}^2$ is the RKHS norm of the classifier
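For the special case of squared loss over the graph nodes, the relaxed problem has a closed form. The sketch below is a simplified instance (my own, not the authors' formulation): the plain Euclidean norm of f stands in for the RKHS norm, and `alpha`, `gamma` are placeholder regularization weights.

```python
import numpy as np

def laplacian_rls(L, y_init, labeled_mask, alpha=1.0, gamma=0.01):
    """Minimize f^T L f + alpha * sum_labeled (f_i - y_i)^2 + gamma * ||f||^2 in closed form."""
    N = L.shape[0]
    J = np.diag(labeled_mask.astype(float))      # J selects the labeled entries of f
    # Setting the gradient to zero gives (L + alpha*J + gamma*I) f = alpha * J * y_init
    f = np.linalg.solve(L + alpha * J + gamma * np.eye(N), alpha * (J @ y_init))
    return np.where(f > 0, 1, -1)
```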
Summary
Construct a graph using pairwise similarity
Key quantity: the graph Laplacian, which captures the geometry of the graph
Decision boundary is consistent with both the graph structure and the labeled examples
Parameters: the regularization weights and the similarity matrix W
(Figure: a partition into clusters A (+1) and B (−1))
Questions
Cluster assumption or manifold assumption?
Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?
Application: Text Classification
Dataset: 20-newsgroups — autos, motorcycles, baseball, and hockey under rec
Pre-processing: stemming, removal of stopwords & rare words, and skipping headers
#Docs: 3970, #Words: 8014
(Figure: accuracy of SVM, KNN, label propagation, and the harmonic function vs. number of labeled examples)
PRBEP: precision-recall break-even point
(Figure: improvement in PRBEP by the Spectral Graph Transducer (SGT))
Semi-supervised Classification
Algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine
(TSVM)
Co-training
Transductive SVM
Support vector machine
Classification margin
Maximum classification
margin
Decision boundary given a
small number of labeled
examples
Transductive SVM
Decision boundary given a
small number of labeled
examples
How to change decision
boundary given both
labeled and unlabeled
examples ?
Transductive SVM
Decision boundary given a
small number of labeled
examples
Move the decision boundary to a region of low local density
Transductive SVM
$\omega(X, y; f)$: classification margin; $f(x)$: classification function
Supervised learning:
$f^* = \arg\max_{f \in \mathcal{H}_K}\; \omega(X, y; f)$
Semi-supervised learning: optimize over both $f(x)$ and the unknown labels $y_u$:
$f^* = \arg\max_{f \in \mathcal{H}_K,\; y_u \in \{-1,+1\}^{n_u}}\; \omega(X, y_l, y_u; f)$
Transductive SVM
Decision boundary given
a small number of
labeled examples
Move the decision boundary to a place with low local density
Classification results
How to formulate this
idea?
Transductive SVM: Formulation
Original SVM:
$\{w^*, b^*\} = \arg\min_{w,b}\; w^\top w$
s.t. $y_i(w^\top x_i + b) \ge 1,\; i = 1, \ldots, n$ (labeled examples)
Transductive SVM: introduce a binary variable for the label of each unlabeled example
$\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}}\; \arg\min_{w,b}\; w^\top w$
s.t. $y_i(w^\top x_i + b) \ge 1,\; i = 1, \ldots, n$ (labeled examples)
$\quad\;\; y_{n+j}(w^\top x_{n+j} + b) \ge 1,\; j = 1, \ldots, m$ (constraints for unlabeled data)
Computational Issue
With slack variables for both labeled and unlabeled examples:
$\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}}\; \arg\min_{w,b}\; w^\top w + C\sum_{i=1}^{n}\xi_i + C^*\sum_{j=1}^{m}\xi^*_j$
s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i,\; i = 1, \ldots, n$ (labeled examples)
$\quad\;\; y_{n+j}(w^\top x_{n+j} + b) \ge 1 - \xi^*_j,\; j = 1, \ldots, m$ (unlabeled examples)
No longer a convex optimization problem.
Alternating optimization
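A rough sketch of the alternating scheme: fix the labels of the unlabeled examples and solve a standard SVM, then fix the SVM and re-assign the unlabeled labels, and repeat. This is a simplification of TSVM (it omits the label-balance constraint and the separate penalty weight for unlabeled examples), intended only to illustrate the idea; scikit-learn's `LinearSVC` stands in for the SVM solver.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_alternating(X_l, y_l, X_u, n_iters=10, C=1.0):
    """Crude alternating optimization for a transductive linear SVM."""
    svm = LinearSVC(C=C).fit(X_l, y_l)            # initialize from the labeled data only
    y_u = svm.predict(X_u)                        # initial guess for the unlabeled labels
    for _ in range(n_iters):
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        svm = LinearSVC(C=C).fit(X_all, y_all)    # step 1: fix the labels, optimize w, b
        y_new = svm.predict(X_u)                  # step 2: fix w, b, re-assign the labels
        if np.array_equal(y_new, y_u):            # stop once the labeling is stable
            break
        y_u = y_new
    return svm, y_u
```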
Summary
Based on maximum margin principle
Classification margin is decided by
Labeled examples
Class labels assigned to unlabeled data
High computational cost
Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM
Questions
Cluster assumption or manifold assumption?
Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?
Text Classification by TSVM
10 categories from the Reuters collection
3299 test documents
1000 informative words
selected by MI criterion
Semi-supervised Classification
Algorithms
Label propagation
Graph partitioning based approaches
Transductive Support Vector Machine (TSVM)
Co-training
Co-training [Blum & Mitchell, 1998]
Classify web pages into a category for students and a category for professors
Two views of web pages
Content
“I am currently the second year Ph.D. student …”
Hyperlinks
“My advisor is …”
“Students: …”
Co-training for Semi-Supervised Learning
It is easier to classify this web page using hyperlinks
It is easy to classify the type of this web page based on its content
Co-training
Two representations for each web page
Content representation:
(doctoral, student, computer, university, …)
Hyperlink representation:
Inlinks: Prof. Cheng
Outlinks: Prof. Cheng
Co-training
Train a content-based classifier
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
Train a hyperlink-based classifier
Co-training
Train a content-based classifier using
labeled examples
Label the unlabeled examples that are
confidently classified
Train a hyperlink-based classifier
Label the unlabeled examples that are
confidently classified
Co-training
Assume two views of the objects
Key idea:
Two sufficient representations
Augment the training examples of one view by exploiting the classifier of the other view (a sketch follows below)
Extension to multiple views
Problem: how to find equivalent views
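A schematic of the loop described above, assuming the two views are given as separate feature matrices `X1`/`X2` for the same examples. The choice of naive Bayes classifiers and the confidence threshold `conf` are illustrative, not part of the original algorithm.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, conf=0.95):
    """Co-training sketch: each view's classifier labels examples for the other view."""
    y_l = np.asarray(y_l)
    unlabeled = np.arange(len(X1_u))
    for _ in range(n_rounds):
        c1 = GaussianNB().fit(X1_l, y_l)          # view 1 (e.g. page content)
        c2 = GaussianNB().fit(X2_l, y_l)          # view 2 (e.g. hyperlinks)
        if len(unlabeled) == 0:
            break
        picked = {}                               # index -> confidently predicted label
        for c, X_u in ((c1, X1_u), (c2, X2_u)):
            proba = c.predict_proba(X_u[unlabeled])
            for i in np.where(proba.max(axis=1) >= conf)[0]:
                picked[unlabeled[i]] = c.classes_[proba[i].argmax()]
        if not picked:
            break
        idx = np.array(list(picked.keys()))
        lab = np.array(list(picked.values()))
        X1_l = np.vstack([X1_l, X1_u[idx]])       # augment both views' training sets
        X2_l = np.vstack([X2_l, X2_u[idx]])
        y_l = np.concatenate([y_l, lab])
        unlabeled = np.setdiff1d(unlabeled, idx)
    return c1, c2
```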
Active Learning
Active learning
Select the most informative examples
In contrast to passive learning
Key question: which examples are
informative
Uncertainty principle: most informative example
is the one that is most uncertain to classify
Measure classification uncertainty
Active Learning
Query by committee (QBC):
Construct an ensemble of classifiers
Classification uncertainty = degree of disagreement among the committee members
SVM-based approach:
Classification uncertainty = distance to the decision boundary
Simple but very effective approaches
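A sketch of the margin-based criterion just mentioned (the SVM-based approach): query the unlabeled examples that lie closest to the current decision boundary. The `SVC` classifier and the `n_queries` parameter are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

def most_uncertain(X_l, y_l, X_u, n_queries=5):
    """Indices of the unlabeled examples closest to the current SVM decision boundary."""
    clf = SVC(kernel="linear").fit(X_l, y_l)
    margins = np.abs(clf.decision_function(X_u))   # small margin = high classification uncertainty
    return np.argsort(margins)[:n_queries]         # candidates to hand to a human labeler
```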
Semi-supervised Clustering
Clustering data into two clusters
Semi-supervised Clustering
Clustering data into two clusters
Side information: must-links vs. cannot-links
(Figure: data points with one must-link and one cannot-link marked)
Semi-supervised Clustering
Also called constrained clustering
Two types of approaches
Restricted data partitions
Distance metric learning approaches
Restricted Data Partition
Require data partitions to be consistent with the given links
Links as hard constraints
E.g., constrained K-means (Wagstaff et al., 2001)
Links as soft constraints
E.g., metric pairwise constrained K-means (Basu et al., 2004)
Restricted Data Partition
Hard constraints:
Cluster memberships must obey the link constraints
(Figure: example partitions with a must-link and a cannot-link; a partition is marked "Yes" when every link is respected and "No" when a link is violated)
Restricted Data Partition
Soft constraints:
Penalize a data partition if it violates some links
(Figure: the same must-link and cannot-link examples, scored with penalty = 0 when all links are respected and penalty = 1 when one link is violated)
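A bare-bones sketch in the spirit of hard-constrained K-means (e.g. COP-KMeans of Wagstaff et al., 2001): each point is assigned to the nearest cluster that does not break a must-link or cannot-link with an already-assigned point. This is my own simplification; the original algorithm handles must-link groups and infeasibility more carefully.

```python
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """Would assigning point i to cluster c break a constraint with an already-assigned point?"""
    for a, b in must_link:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] not in (-1, c):
            return True
    for a, b in cannot_link:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] == c:
            return True
    return False

def constrained_kmeans(X, k, must_link, cannot_link, n_iters=20, seed=0):
    """Hard-constrained K-means in the spirit of COP-KMeans (simplified sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iters):
        labels[:] = -1
        for i in range(len(X)):
            for c in np.argsort(np.linalg.norm(X[i] - centers, axis=1)):
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c                # nearest cluster that keeps all links satisfied
                    break
            if labels[i] == -1:                  # no feasible cluster: constraints are inconsistent
                raise ValueError("constraints cannot be satisfied")
        for c in range(k):                       # standard centroid update
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```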
Distance Metric Learning
Learning a distance metric from pairwise links
Enlarge the distance for a cannot-link
Shorten the distance for a must-link
Apply K-means with pairwise distances measured by the learned distance metric
(Figure: must-link and cannot-link pairs before and after transformation by the learned distance metric)
Example of Distance Metric Learning
2D data projection using Euclidean
distance metric
Solid lines: must links
dotted lines: cannot links
2D data projection using learned
distance metric
BoostCluster [Liu, Jin & Jain, 2007]
General framework for semi-supervised clustering
Improves any given unsupervised clustering algorithm with
pairwise constraints
Key challenges
How to influence an arbitrary clustering algorithm by side
information?
Encode constraints into data representation
How to take into account the performance of underlying clustering
algorithm?
Iteratively improve the clustering performance
BoostCluster
(Block diagram: data and pairwise constraints → kernel matrix → new data representation → clustering algorithm → clustering results → final results)
Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
Find the best data representation that encodes the unsatisfied pairwise constraints
Obtain the clustering results given the new data representation
Update the kernel with the clustering results
Run the procedure iteratively
Compute the final clustering result
Summary
Clustering data under given pairwise constraints (must-links vs. cannot-links)
Two types of approaches:
Restricted data partitions (either soft or hard)
Distance metric learning
Question: how to acquire the links/constraints?
Manual assignments
Derived from side information: hyperlinks, citations, user logs, etc.
May be noisy and unreliable
Application: Document Clustering
[Basu et al., 2004]
300 docs from topics
(atheism, baseball, space)
of 20-newsgroups
3251 unique words after
removal of stopwords and
rare words and stemming
Evaluation metric: Normalized Mutual Information (NMI)
KMeans-x-x: different
variants of constrained
clustering algs.
Kernel Learning
Kernel plays central role in machine learning
Kernel functions can be learned from data
Kernel alignment, multiple kernel learning, nonparametric learning, …
Kernel learning is suitable for IR
Similarity measure is key to IR
Kernel learning allows us to identify the optimal
similarity measure automatically
Transfer Learning
Different document categories are correlated
We should be able to borrow information of
one class to the training of another class
Key question: what to transfer between
classes?
Representation, model priors, similarity
measure …