
Semi-supervised Learning
Rong Jin
Spectrum of Learning Problems
What is Semi-supervised Learning

- Learning from a mixture of labeled and unlabeled examples
  - Labeled data: $L = \{(x^l_1, y_1), \ldots, (x^l_{n_l}, y_{n_l})\}$
  - Unlabeled data: $U = \{x^u_1, \ldots, x^u_{n_u}\}$
  - Total number of examples: $N = n_l + n_u$
- Goal: learn a function $f(x): X \to Y$
Why Semi-supervised Learning?


- Labeling is expensive and difficult
- Labeling is unreliable
  - Ex. segmentation applications
  - Need for multiple experts
- Unlabeled examples
  - Easy to obtain in large numbers
  - Ex. web pages, text documents, etc.
Semi-supervised Learning Problems

- Classification
  - Transductive: predict labels of unlabeled data
  - Inductive: learn a classification function
- Clustering (constrained clustering)
- Ranking (semi-supervised ranking)
- Almost every learning problem has a semi-supervised counterpart.
Why Unlabeled Could be Helpful

- Clustering assumption
  - Unlabeled data help decide the decision boundary: $f(X) = 0$
- Manifold assumption
  - Unlabeled data help decide the decision function $f(X)$
Clustering Assumption
- Points with the same label are connected through high-density regions, thereby defining a cluster
- Clusters are separated by low-density regions
- Question: suggest a simple algorithm for semi-supervised learning
Manifold Assumption

- Graph representation
  - Vertex: a training example (labeled or unlabeled)
  - Edge: connects similar examples
  (Figure: graph over the data with two labeled examples $x_1$ and $x_2$)
- Regularize the classification function f(x):
  if $x_1$ and $x_2$ are connected, then $|f(x_1) - f(x_2)|$ should be small
Manifold Assumption

- Graph representation
  - Vertex: a training example (labeled or unlabeled)
  - Edge: connects similar examples
- Manifold assumption
  - Data lies on a low-dimensional manifold
  - The classification function f(x) should "follow" the data manifold
Statistical View
- Generative model for classification:
  $\Pr(X, Y \mid \theta, \eta) = \Pr(X \mid Y; \theta)\,\Pr(Y \mid \eta)$
  (Graphical model: $Y \to X$, with $\theta$ parameterizing $\Pr(X \mid Y)$)
Statistical View
- Generative model for classification:
  $\Pr(X, Y \mid \theta, \eta) = \Pr(X \mid Y; \theta)\,\Pr(Y \mid \eta)$
- Unlabeled data help estimate $\Pr(X \mid Y; \theta)$ (clustering assumption); see the sketch below:
  $\Pr(X_u)\,\Pr(X_l \mid Y) = \prod_{j=1}^{n_u}\left(\sum_{y=1}^{K}\Pr(x^u_j \mid y)\,\Pr(y)\right)\;\prod_{i=1}^{n_l}\Pr(x^l_i \mid y_i)$
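
The factored likelihood above is exactly what EM maximizes once the class-conditional densities $\Pr(x \mid y; \theta)$ are given a parametric form. Below is a minimal sketch (not from the slides; a single Gaussian per class, the function name, and all parameters are assumptions): labeled points keep their class fixed, unlabeled points get soft responsibilities.

```python
# Minimal sketch of EM for the likelihood above, assuming one Gaussian per class.
# Labeled points have fixed (one-hot) responsibilities; unlabeled points are soft.
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em(X_l, y_l, X_u, n_classes, n_iter=50):
    """X_l: (n_l, d) labeled inputs; y_l: class indices in {0..K-1}; X_u: (n_u, d)."""
    d = X_l.shape[1]
    X = np.vstack([X_l, X_u])
    R = np.zeros((len(X), n_classes))            # responsibilities Pr(y | x)
    R[np.arange(len(X_l)), y_l] = 1.0            # labeled: fixed at the true class
    R[len(X_l):] = 1.0 / n_classes               # unlabeled: start uniform
    for _ in range(n_iter):
        # M-step: priors, means, covariances from responsibility-weighted data
        Nk = R.sum(axis=0)
        priors = Nk / Nk.sum()
        means = (R.T @ X) / Nk[:, None]
        covs = []
        for k in range(n_classes):
            diff = X - means[k]
            covs.append((R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d))
        # E-step: update responsibilities of the unlabeled points only
        like = np.column_stack([
            priors[k] * multivariate_normal.pdf(X_u, means[k], covs[k])
            for k in range(n_classes)
        ])
        R[len(X_l):] = like / like.sum(axis=1, keepdims=True)
    return priors, means, covs, R[len(X_l):]     # last entry: soft labels for X_u
```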
Statistical View

- Discriminative model for classification:
  $\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X; \theta)$
  (Graphical model: $X \to Y$, with $\theta$ parameterizing $\Pr(Y \mid X)$ and $\mu$ parameterizing $\Pr(X)$)
Statistical View

- Discriminative model for classification:
  $\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X; \theta)$
- Unlabeled data help regularize $\theta$ via a prior $\Pr(\theta \mid X)$ (manifold assumption):
  $\Pr(\theta \mid X) \propto \exp\left(-\sum_{i,j=1}^{n_u}\left[f(x^u_i) - f(x^u_j)\right]^2 w_{i,j}\right) = \exp\left(-\sum_{i,j=1}^{n_u}\left[\theta^\top(x^u_i - x^u_j)\right]^2 w_{i,j}\right)$
Semi-supervised Learning Algorithms




- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
Label Propagation: Key Idea
- A decision boundary based on the labeled examples alone is unable to take into account the layout of the data points
- How to incorporate the data distribution into the prediction of class labels?
- Connect the data points that are close to each other
- Propagate the class labels over the connected graph
- Different from K Nearest Neighbor
Label Propagation: Representation
- Adjacency matrix $W \in \{0, 1\}^{N \times N}$:
  $W_{i,j} = 1$ if $x_i$ and $x_j$ are connected, $0$ otherwise
- Similarity matrix $W \in \mathbb{R}_+^{N \times N}$:
  $W_{i,j}$ is the similarity between $x_i$ and $x_j$
- Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ with $d_i = \sum_{j \neq i} W_{i,j}$
  (a sketch of building $W$ and $D$ follows below)
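
A minimal sketch of how $W$ and $D$ might be built in practice; the Gaussian similarity, the k-nearest-neighbour sparsification, and the parameter names are assumptions, not part of the slides.

```python
import numpy as np

def build_graph(X, k=10, sigma=1.0):
    """X: (N, d) array of all examples, labeled and unlabeled."""
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # pairwise squared distances
    W = np.exp(-dist2 / (2 * sigma**2))                  # Gaussian similarity
    np.fill_diagonal(W, 0.0)                             # no self-similarity
    # keep only each point's k most similar neighbours, then symmetrize
    drop = np.argsort(-W, axis=1)[:, k:]
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    W = np.where(mask, W, 0.0)
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))                           # degree matrix
    return W, D
```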
Label Propagation: Representation
- Similarity matrix $W \in \mathbb{R}_+^{N \times N}$
- Given label information:
  - Labels of the labeled examples: $y_l = (y_1, y_2, \ldots, y_{n_l}) \in \{-1, +1\}^{n_l}$
  - Unknown labels of the unlabeled examples: $y_u = (y_1, y_2, \ldots, y_{n_u}) \in \{-1, +1\}^{n_u}$
  - $y = (y_l, y_u)$
Label Propagation
- Initial class assignments $\hat{y} \in \{-1, 0, +1\}^N$:
  $\hat{y}_i = \pm 1$ if $x_i$ is labeled, $0$ if $x_i$ is unlabeled
- Predicted class assignments:
  - First predict the confidence scores $f = (f_1, \ldots, f_N) \in \mathbb{R}^N$
  - Then predict the class assignments $y \in \{-1, +1\}^N$:
    $y_i = +1$ if $f_i > 0$, $y_i = -1$ if $f_i \le 0$
Label Propagation (II)
- One round of propagation (a weighted-KNN step; $\alpha$ is the weight of each propagation step):
  $f_i = \hat{y}_i$ if $x_i$ is labeled; $f_i = \alpha \sum_{j=1}^{N} W_{i,j}\,\hat{y}_j$ otherwise
  In matrix form: $f^1 = \hat{y} + \alpha W \hat{y}$
- Two rounds of propagation:
  $f^2 = f^1 + \alpha W f^1 = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$
- Result for any number $k$ of iterations:
  $f^k = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$
- Result for an infinite number of iterations (matrix inverse; a sketch follows below):
  $f^\infty = (I - \alpha W)^{-1} \hat{y}$
- Normalized similarity matrix: $\bar{W} = D^{-1/2}\, W\, D^{-1/2}$
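
A minimal sketch (not the authors' code) of the closed-form propagation with the normalized similarity matrix; the solver call and the choice $\alpha < 1$ for convergence are the only assumptions.

```python
import numpy as np

def label_propagation(W, y_hat, alpha=0.9):
    """W: (N, N) similarity matrix; y_hat: initial labels in {-1, 0, +1}; alpha in (0, 1)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    W_norm = D_inv_sqrt @ W @ D_inv_sqrt                     # D^{-1/2} W D^{-1/2}
    N = W.shape[0]
    f = np.linalg.solve(np.eye(N) - alpha * W_norm, y_hat)   # f = (I - aW)^{-1} y_hat
    return np.where(f > 0, 1, -1), f                         # hard labels and confidence scores
```

Using `np.linalg.solve` avoids forming the inverse explicitly, although the cost is still cubic in N, which matches the complexity noted in the summary below.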
Local and Global Consistency [Zhou et al., NIPS 03]
- Local consistency: like KNN
- Global consistency: beyond KNN
Summary:
- Construct a graph using pairwise similarities
- Propagate class labels along the graph: $f = (I - \alpha W)^{-1}\hat{y}$
- Key parameters
  - $\alpha$: the decay of propagation
  - $W$: the similarity matrix
- Computational complexity
  - Matrix inverse: $O(N^3)$
  - Speed-ups: Cholesky decomposition, clustering
Questions
- Cluster assumption or manifold assumption?
- Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?
Application: Text Classification [Zhou et al., NIPS 03]
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, remove stopwords & rare words, and skip headers
- #Docs: 3970, #Words: 8014
- Methods compared: SVM, KNN, label propagation
(Figure: classification accuracy of the three methods)
Application: Image Retrieval [Wang et al., ACM MM 2004]
- 5,000 images; relevance feedback for the top 20 ranked images
- Classification problem
  - Relevant or not?
  - f(x): degree of relevance
- Learning the relevance function f(x)
  - Supervised learning: SVM
  - Label propagation
(Figure: retrieval accuracy of SVM vs. label propagation)
Semi-supervised Learning Algorithms




- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary
  - Consistent with labeled examples
  - Partition with a small graph cut
(Figure: two candidate partitions, with graph cut = 2 and graph cut = 1; the cut of size 1 is preferred)
Min-cuts [Blum and Chawla, ICML 2001]

- Additional nodes
  - V+: source, V-: sink
  - Infinite weights connect the labeled examples to the source and the sink
- High computational cost
(Figure: graph with source V+ and sink V-, and a cut of size 1)
Harmonic Function [Zhu et al., ICML 2003]


- Weight matrix W
  - $w_{i,j} \ge 0$: similarity between $x_i$ and $x_j$
- Membership vector $f = (f_1, \ldots, f_N)$:
  $f_i = +1$ if $x_i \in A$, $f_i = -1$ if $x_i \in B$
(Figure: graph partitioned into cluster A, labeled +1, and cluster B, labeled -1)
Harmonic Function (cont'd)
- Graph cut:
  $C(f) = \frac{1}{4}\sum_{i<j} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4}\, f^\top (D - W)\, f = \frac{1}{4}\, f^\top L\, f$
- Degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$
  - Diagonal element: $d_i = \sum_{j \neq i} W_{i,j}$
(Figure: graph cut between cluster A, labeled +1, and cluster B, labeled -1)
Harmonic Function (cont'd)
- Graph cut:
  $C(f) = \frac{1}{4}\sum_{i<j} w_{i,j}\,(f_i - f_j)^2 = \frac{1}{4}\, f^\top (D - W)\, f = \frac{1}{4}\, f^\top L\, f$
  (a small numerical check follows below)
- Graph Laplacian $L = D - W$
  - Captures the pairwise relationships among data points
  - Captures the manifold geometry of the data
(Figure: graph cut between cluster A, labeled +1, and cluster B, labeled -1)
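
A tiny numerical check (toy graph, numbers invented for illustration) of the identity between the pairwise form of the cut and the quadratic form $f^\top L f$.

```python
import numpy as np

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
f = np.array([1., 1., -1., -1.])                 # nodes 0,1 in cluster A; nodes 2,3 in B

D = np.diag(W.sum(axis=1))
L = D - W                                        # graph Laplacian

cut_pairwise = 0.25 * np.sum(np.triu(W) * (f[:, None] - f[None, :])**2)
cut_quadratic = 0.25 * f @ L @ f
print(cut_pairwise, cut_quadratic)               # both print 2.0: two edges are cut
```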
Harmonic Function
- Optimization problem:
  $\min_{f \in \{-1,+1\}^N} \; C(f) = \frac{1}{4}\, f^\top L\, f \quad \text{s.t.} \quad f_i = y_i, \; 1 \le i \le n_l$
  - Objective: consistency with the graph structure
  - Constraints: consistency with the labeled data
- Challenge: the search space is discrete, so this is a combinatorial optimization problem
(Figure: graph with clusters A (+1) and B (-1))
Harmonic Function
- Relaxation: replace {-1, +1} by continuous real numbers:
  $\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4}\, f^\top L\, f \quad \text{s.t.} \quad f_i = y_i, \; 1 \le i \le n_l$
- Convert the continuous f back to binary class assignments (by thresholding at 0)
(Figure: graph with clusters A (+1) and B (-1))
Harmonic Function
- $\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4}\, f^\top L\, f \quad \text{s.t.} \quad f_i = y_i, \; 1 \le i \le n_l$
- Partition the Laplacian and f into labeled and unlabeled blocks:
  $L = \begin{pmatrix} L_{l,l} & L_{l,u} \\ L_{u,l} & L_{u,u} \end{pmatrix}, \quad f = (f_l, f_u)$
- Closed-form solution (see the sketch below): $f_u = -L_{u,u}^{-1} L_{u,l}\, y_l$
- How to handle a large number of unlabeled data points?
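
A minimal sketch (not the authors' code; the convention that labeled points come first is an assumption) of the closed-form harmonic solution.

```python
import numpy as np

def harmonic_function(W, y_l):
    """W: (N, N) similarity matrix with the labeled points first; y_l: (n_l,) in {-1, +1}."""
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian L = D - W
    L_uu = L[n_l:, n_l:]
    L_ul = L[n_l:, :n_l]
    f_u = -np.linalg.solve(L_uu, L_ul @ y_l)     # f_u = -L_uu^{-1} L_ul y_l, without the inverse
    return np.where(f_u > 0, 1, -1), f_u
```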
Harmonic Function
- $f_u = -L_{u,u}^{-1}\, L_{u,l}\, y_l$
  - $L_{u,l}\, y_l$: local propagation from the labeled neighbors (sound familiar?)
  - $L_{u,u}^{-1}$: global propagation over the unlabeled points
Spectral Graph Transducer [Joachims, 2003]
- Soften the hard constraints of the harmonic function into a penalty term:
  $\min_{f \in \mathbb{R}^N} \; C(f) = \frac{1}{4}\, f^\top L\, f + \alpha \sum_{i=1}^{n_l}\left(f_i - y_i\right)^2 \quad \text{s.t.} \quad \sum_{i=1}^{N} f_i^2 = N$
- Solved as a constrained eigenvector problem
Manifold Regularization [Belkin et al., 2006]
- Start from the graph-based objective:
  $\min_{f \in \mathbb{R}^N} \; \frac{1}{4}\, f^\top L\, f + \alpha \sum_{i=1}^{n_l}\left(f_i - y_i\right)^2 \quad \text{s.t.} \quad \sum_{i=1}^{N} f_i^2 = N$
  - $\sum_{i=1}^{n_l}(f_i - y_i)^2$: loss function for misclassification
  - $\sum_{i=1}^{N} f_i^2 = N$: regularizes the norm of the classifier
- Manifold regularization uses a general loss $\ell(f(x_i), y_i)$ and an RKHS norm; $f^\top L f$ is the manifold regularizer (see the sketch below):
  $\min_{f \in \mathcal{H}_K} \; f^\top L\, f + \alpha \sum_{i=1}^{n_l} \ell(f(x_i), y_i) + \gamma\, \|f\|_{\mathcal{H}_K}^2$
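
A minimal sketch of the manifold-regularization objective for the special case of a linear model $f(x) = w \cdot x$ with squared loss; this restriction and the closed-form solve are assumptions made for brevity, not Belkin et al.'s general RKHS algorithm.

```python
import numpy as np

def laplacian_rls(X_l, y_l, X_u, W, alpha=1.0, gamma=0.1):
    """Linear Laplacian-regularized least squares; W is the similarity matrix over all points."""
    X = np.vstack([X_l, X_u])                    # all N = n_l + n_u points, labeled first
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    d = X.shape[1]
    # minimize  alpha * ||X_l w - y_l||^2 + gamma * ||w||^2 + w^T X^T L X w
    A = alpha * X_l.T @ X_l + gamma * np.eye(d) + X.T @ L @ X
    b = alpha * X_l.T @ y_l
    return np.linalg.solve(A, b)                 # weight vector w
```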
Summary
- Construct a graph using pairwise similarity
- Key quantity: the graph Laplacian
  - Captures the geometry of the graph
- The decision boundary is consistent with
  - The graph structure
  - The labeled examples
- Parameters: the trade-off weights (e.g., $\alpha$, $\gamma$) and the similarity measure
(Figure: graph partitioned into clusters A (+1) and B (-1))
Questions
- Cluster assumption or manifold assumption?
- Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?
Application: Text Classification
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, remove stopwords & rare words, and skip headers
- #Docs: 3970, #Words: 8014
- Methods compared: SVM, KNN, propagation, harmonic function
(Figure: classification accuracy of the four methods)
Application: Text Classification
- PRBEP: precision-recall break-even point
- Improvement in PRBEP by the Spectral Graph Transducer (SGT)
(Figure: PRBEP results)
Semi-supervised Classification Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
Transductive SVM
- Support vector machine
  - Classification margin
  - Maximum classification margin
- Decision boundary given a small number of labeled examples
- How to change the decision boundary given both labeled and unlabeled examples?
- Move the decision boundary to a region of low local density
Transductive SVM
- $\omega(X, y, f)$: classification margin; $f(x)$: classification function
- Supervised learning:
  $f^* = \arg\max_{f \in \mathcal{H}_K} \; \omega(X, y, f)$
- Semi-supervised learning: optimize over both $f(x)$ and the unknown labels $y_u$:
  $f^* = \arg\max_{f \in \mathcal{H}_K, \; y_u \in \{-1,+1\}^{n_u}} \; \omega(X, y_l, y_u, f)$
Transductive SVM




- Decision boundary given a small number of labeled examples
- Move the decision boundary to a place with low local density
- Classification results
- How to formulate this idea?
Transductive SVM: Formulation
- Original SVM:
  $\{w^*, b^*\} = \arg\min_{w, b} \; w \cdot w$
  s.t. $y_i (w \cdot x_i + b) \ge 1, \; i = 1, \ldots, n$  (labeled examples)
- Transductive SVM: a binary variable for the label of each unlabeled example, plus constraints for the unlabeled data:
  $\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}} \; \arg\min_{w, b} \; w \cdot w$
  s.t. $y_i (w \cdot x_i + b) \ge 1, \; i = 1, \ldots, n$  (labeled examples)
       $y_i (w \cdot x_i + b) \ge 1, \; i = n+1, \ldots, n+m$  (unlabeled examples)
Computational Issue
- TSVM with slack variables:
  $\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}} \; \arg\min_{w, b} \; w \cdot w + C\left(\sum_{i=1}^{n}\varepsilon_i + \sum_{i=1}^{m}\varepsilon_{n+i}\right)$
  s.t. $y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i, \; i = 1, \ldots, n$  (labeled examples)
       $y_{n+i} (w \cdot x_{n+i} + b) \ge 1 - \varepsilon_{n+i}, \; i = 1, \ldots, m$  (unlabeled examples)
- No longer a convex optimization problem
- Solved by alternating optimization (see the sketch below)
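
A rough sketch of the alternating optimization: fix the unlabeled labels and train an SVM, then fix the SVM and re-guess the labels. This is a simplification (Joachims' TSVM additionally switches pairs of labels and anneals the weight on the unlabeled data); `LinearSVC` and its parameters are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_alternating(X_l, y_l, X_u, n_iter=10, C=1.0):
    X = np.vstack([X_l, X_u])
    svm = LinearSVC(C=C).fit(X_l, y_l)           # initialize from the labeled data only
    y_u = svm.predict(X_u)                       # initial guess for the unlabeled labels
    for _ in range(n_iter):
        svm = LinearSVC(C=C).fit(X, np.concatenate([y_l, y_u]))
        y_u_new = svm.predict(X_u)               # re-assign labels with the new boundary
        if np.array_equal(y_u_new, y_u):         # stop when the assignment is stable
            break
        y_u = y_u_new
    return svm, y_u
```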
Summary


- Based on the maximum margin principle
- The classification margin is decided by
  - Labeled examples
  - Class labels assigned to the unlabeled data
- High computational cost
- Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM
Questions
- Cluster assumption or manifold assumption?
- Transductive (predict classes for unlabeled data) or inductive (learn a classification function)?
Text Classification by TSVM
- 10 categories from the Reuters collection
- 3299 test documents
- 1000 informative words selected by the MI criterion
Semi-supervised Classification Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
Co-training [Blum & Mitchell, 1998]

- Classify web pages into a category for students and a category for professors
- Two views of web pages
  - Content: "I am currently the second year Ph.D. student ..."
  - Hyperlinks: "My advisor is ...", "Students: ..."
Co-training for Semi-Supervised Learning
- It is easy to classify the type of one web page based on its content
- It is easier to classify the other web page using its hyperlinks
(Figure: two example web pages)
Co-training

- Two representations for each web page
  - Content representation: (doctoral, student, computer, university, ...)
  - Hyperlink representation: inlinks: Prof. Cheng; outlinks: Prof. Cheng
Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier on the enlarged labeled set
- Label the unlabeled examples that are confidently classified
- Repeat
Co-training

- Assumes two views of objects
  - Two sufficient representations
- Key idea
  - Augment the training examples of one view by exploiting the classifier of the other view
- Extension to multiple views
- Problem: how to find equivalent views (a sketch of the loop follows below)
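
A schematic sketch of the co-training loop described above, assuming two feature views X1/X2 and scikit-learn-style classifiers that expose `predict_proba`; the function name and all parameters are illustrative.

```python
import numpy as np

def co_training(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, k=5):
    """Each round, each view pseudo-labels its k most confident unlabeled examples for both views."""
    y_l = np.asarray(y_l)
    unlabeled = np.arange(len(X1_u))
    for _ in range(n_rounds):
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            if len(unlabeled) == 0:
                break
            clf1.fit(X1_l, y_l)                      # retrain both view-specific classifiers
            clf2.fit(X2_l, y_l)
            proba = clf.predict_proba(X_view[unlabeled])
            picked = unlabeled[np.argsort(-proba.max(axis=1))[:k]]
            pseudo = clf.predict(X_view[picked])     # confident pseudo-labels from this view
            X1_l = np.vstack([X1_l, X1_u[picked]])   # grow the labeled set of *both* views
            X2_l = np.vstack([X2_l, X2_u[picked]])
            y_l = np.concatenate([y_l, pseudo])
            unlabeled = np.setdiff1d(unlabeled, picked)
    clf1.fit(X1_l, y_l)
    clf2.fit(X2_l, y_l)
    return clf1, clf2
```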
Active Learning

- Active learning
  - Select the most informative examples
  - In contrast to passive learning
- Key question: which examples are informative?
  - Uncertainty principle: the most informative example is the one that is most uncertain to classify
  - Measure classification uncertainty
Active Learning

- Query by committee (QBC)
  - Construct an ensemble of classifiers
  - Classification uncertainty: the largest degree of disagreement
- SVM-based approach
  - Classification uncertainty: distance to the decision boundary (see the sketch below)
- Simple but very effective approaches
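
A minimal sketch of the SVM-margin uncertainty criterion: query the pool points closest to the decision boundary. The pool-based setting and `LinearSVC` are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def most_uncertain(X_l, y_l, X_pool, n_query=5):
    """Return indices of the n_query pool points closest to the SVM decision boundary."""
    svm = LinearSVC().fit(X_l, y_l)
    margin = np.abs(svm.decision_function(X_pool))   # small margin = high uncertainty
    return np.argsort(margin)[:n_query]
```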
Semi-supervised Clustering
- Clustering data into two clusters
- Side information: must-links vs. cannot-links
(Figure: data points with a must-link and a cannot-link constraint)
Semi-supervised Clustering


- Also called constrained clustering
- Two types of approaches
  - Restricted data partitions
  - Distance metric learning approaches
Restricted Data Partition


- Require data partitions to be consistent with the given links
- Links as hard constraints (a sketch follows below)
  - E.g., constrained K-means (Wagstaff et al., 2001)
- Links as soft constraints
  - E.g., Metric Pairwise Constraints K-means (Basu et al., 2004)
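
A rough sketch in the spirit of constrained K-means with hard constraints (it follows the COP-KMeans idea of assigning each point to the nearest cluster that violates no link, but it is not the exact published algorithm).

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    """Would assigning point i to cluster c violate any must-link or cannot-link?"""
    for a, b in must:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] not in (-1, c):
                return True
    for a, b in cannot:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == c:
                return True
    return False

def constrained_kmeans(X, k, must, cannot, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        labels = np.full(len(X), -1)
        for i in range(len(X)):
            dist = np.linalg.norm(X[i] - centers, axis=1)
            for c in np.argsort(dist):               # nearest feasible cluster first
                if not violates(i, c, labels, must, cannot):
                    labels[i] = c
                    break
            if labels[i] == -1:                      # constraints conflict: fall back to nearest
                labels[i] = int(np.argmin(dist))
        for c in range(k):                           # recompute centers of non-empty clusters
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```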
Restricted Data Partition
- Hard constraints: cluster memberships must obey the link constraints
  (Figure: candidate partitions marked "Yes" when the must-link and cannot-link are respected, "No" when a constraint is violated)
- Soft constraints: penalize a data clustering if it violates some links
  (Figure: candidate partitions with penalty = 0 when all links are satisfied, penalty = 1 when one link is violated)
Distance Metric Learning

- Learn a distance metric from the pairwise links
  - Enlarge the distance for a cannot-link
  - Shorten the distance for a must-link
- Apply K-means with the pairwise distances measured by the learned distance metric (see the sketch below)
(Figure: must-link and cannot-link pairs before and after transformation by the learned distance metric)
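
A toy sketch (a simple gradient heuristic, not a published metric-learning algorithm) of the idea: learn per-feature weights that shrink must-link distances and stretch cannot-link distances, then run ordinary K-means on the rescaled data.

```python
import numpy as np

def learn_diagonal_metric(X, must, cannot, lr=0.01, n_iter=200):
    """must / cannot: lists of index pairs (i, j). Returns per-feature weights w (metric = diag(w))."""
    w = np.ones(X.shape[1])
    for _ in range(n_iter):
        grad = np.zeros_like(w)
        for a, b in must:                        # pull must-linked pairs together
            grad += (X[a] - X[b])**2
        for a, b in cannot:                      # push cannot-linked pairs apart
            grad -= (X[a] - X[b])**2
        grad /= max(1, len(must) + len(cannot))
        w = np.maximum(w - lr * grad, 1e-3)      # gradient step, keep the metric positive
    return w

# Usage: rescale the data, then hand it to any standard K-means implementation
# X_transformed = X * np.sqrt(learn_diagonal_metric(X, must, cannot))
```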
Example of Distance Metric Learning
(Figure, left: 2D data projection using the Euclidean distance metric; right: 2D data projection using the learned distance metric. Solid lines: must-links; dotted lines: cannot-links.)
BoostCluster [Liu, Jin & Jain, 2007]
- A general framework for semi-supervised clustering
  - Improves any given unsupervised clustering algorithm with pairwise constraints
- Key challenges
  - How to influence an arbitrary clustering algorithm with side information?
    Encode the constraints into the data representation
  - How to take into account the performance of the underlying clustering algorithm?
    Iteratively improve the clustering performance
BoostCluster
(Flow diagram: Data + Pairwise Constraints -> Kernel Matrix -> New Data Representation -> Clustering Algorithm -> Clustering Results, iterated; then -> Final Results)
- Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
- Find the best data representation that encodes the unsatisfied pairwise constraints
- Obtain the clustering results given the new data representation
- Update the kernel with the clustering results
- Run the procedure iteratively
- Compute the final clustering result
Summary

- Clustering data under given pairwise constraints
  - Must-links vs. cannot-links
- Two types of approaches
  - Restricted data partitions (either soft or hard)
  - Distance metric learning
- Question: how to acquire the links/constraints?
  - Manual assignment
  - Derived from side information: hyperlinks, citations, user logs, etc.
    - May be noisy and unreliable
Application: Document Clustering [Basu et al., 2004]
- 300 docs from three topics (atheism, baseball, space) of 20-newsgroups
- 3251 unique words after removal of stopwords and rare words, and stemming
- Evaluation metric: Normalized Mutual Information (NMI)
- KMeans-x-x: different variants of constrained clustering algorithms
(Figure: NMI of the different variants)
Kernel Learning


- Kernels play a central role in machine learning
- Kernel functions can be learned from data
  - Kernel alignment, multiple kernel learning, nonparametric learning, ...
- Kernel learning is well suited to IR
  - The similarity measure is key to IR
  - Kernel learning allows us to identify the optimal similarity measure automatically
Transfer Learning



- Different document categories are correlated
- We should be able to borrow information from one class for the training of another class
- Key question: what to transfer between classes?
  - Representations, model priors, similarity measures, ...