Semi-supervised learning
Guido Sanguinetti
Department of Computer Science, University of Sheffield
G. Sanguinetti, EPSRC Winter School 01/08
Programme
• Different ways of learning
– Unsupervised learning
• EM algorithm
– Supervised learning
• Different ways of going semi-supervised
– Generative models
• Reweighting
– Discriminative models
• Regularisation: clusters and manifolds
Disclaimer
• This is going to be a high-level introduction
rather than a technical talk
• We will spend some time talking about
supervised and unsupervised learning as well
• We will be unashamedly probabilistic, if not fully
Bayesian
• Main references: C. Bishop's book “Pattern Recognition and Machine Learning” and M. Seeger's chapter in “Semi-Supervised Learning”, Chapelle, Schölkopf and Zien, eds.
Different ways to learn
• Reinforcement learning: learn a strategy to
optimise a reward.
• Closely related to decision theory and
control theory.
• Supervised learning: learn a map f: x → y from inputs to targets.
• Unsupervised learning: estimate a density p(x).
Unsupervised learning
• We are given data x.
• We want to estimate the density
that generated the data.
• Generally, we assume a latent variable y, as in the graphical model below.
• A continuous y leads to dimensionality reduction (cf. the previous lecture), a discrete one to clustering.
[Figure: graphical model with parameter nodes π and θ, latent variable y and observed data x.]
Example: mixture of Gaussians
• Data: vector measurements x_i, i = 1, ..., N.
• Latent variable: K-dimensional binary vectors (class membership) y_i; y_ij = 1 means point i belongs to class j.
• The θ parameters are in this case the covariances and the means of each component (the resulting mixture density is written out below).
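Written out with the quantities above, the mixture density being estimated is:

```latex
% Gaussian mixture density with K components, in the notation of the slide
p(x \mid \pi, \theta) \;=\; \sum_{j=1}^{K} \pi_j \,\mathcal{N}\!\left(x \mid \mu_j, \Sigma_j\right),
\qquad \sum_{j=1}^{K} \pi_j = 1 .
```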
Estimating mixtures
• Two objectives: estimate the parameters π_j, μ_j and Σ_j, and estimate the posterior probabilities of class membership,
  γ_ij = π_j N(x_i | μ_j, Σ_j) / Σ_k π_k N(x_i | μ_k, Σ_k).
• The γ_ij are the responsibilities.
• We could use gradient descent.
• It is better to use EM.
Expectation-Maximization
• Iterative procedure to estimate the maximum
likelihood values of parameters in models with
latent variables.
• We want to maximise the log-likelihood of the model,
  log p(X | π, θ) = Σ_i log [ Σ_j π_j N(x_i | μ_j, Σ_j) ].
• Notice that the log of a sum is not nice.
• The key mathematical tool is Jensen’s inequality
Jensen’s inequality
• For every concave function f, random variable x and probability distribution q(x), f(E_q[x]) ≥ E_q[f(x)] (a small numerical example follows below).
• A cartoon of the proof,
which relies on the centre
of mass of a convex
polygon lying inside the
polygon.
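A tiny numerical check with the concave function f = log, a two-point variable x taking the values 1 and 4, and q = (1/2, 1/2):

```latex
% Jensen's inequality for f = log:
% the log of the average is at least the average of the logs.
\log\!\big(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4\big) = \log 2.5 \approx 0.916
\;\;\ge\;\;
\tfrac{1}{2}\log 1 + \tfrac{1}{2}\log 4 \approx 0.693 .
```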
Bound on the log-likelihood
• Jensen's inequality leads to log p(x | θ) ≥ Σ_c q(c) log p(x, c | θ) + H[q] (the derivation is sketched below).
• H[q] is the entropy of the distribution q. Notice the absence of the nasty log of a sum.
• If and only if q(c) is the posterior p(c | x, θ), the bound is saturated.
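The bound follows in one line from Jensen's inequality with f = log, after multiplying and dividing by q(c) inside the sum:

```latex
\log p(x \mid \theta)
 = \log \sum_c q(c)\,\frac{p(x, c \mid \theta)}{q(c)}
 \;\ge\; \sum_c q(c) \log \frac{p(x, c \mid \theta)}{q(c)}
 = \sum_c q(c) \log p(x, c \mid \theta) + H[q] .
```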
EM
• For a fixed value of the parameters, the posterior will saturate the bound.
• In the M-step, optimise the bound with respect to the parameters.
• In the E-step, recompute the posterior with the new value of the parameters.
• Exercise: EM for mixture of Gaussians (a sketch in Python follows below).
[Figure: EM as alternating E-steps and M-steps on the bound, as a function of the parameters θ.]
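A minimal sketch of the exercise in Python/NumPy, alternating the E-step (compute responsibilities) and the M-step (re-estimate π, μ, Σ); the initialisation and the small numerical jitter are illustrative choices, not part of the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of Gaussians: alternate the E-step (responsibilities)
    and the M-step (update mixing weights, means, covariances)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialisation: K random data points as means, shared covariance, uniform weights.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma_ij = p(y_ij = 1 | x_i).
        gamma = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                          for j in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: maximise the bound with respect to pi, mu, Sigma.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma
```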
Supervised learning
• The data consists of a set of inputs x and a set
of output (target) values y.
• The goal is to learn the functional relation f: x → y.
• Performance is evaluated using the prediction error (or classification accuracy), usually on a separate test set.
• Continuous y leads to regression, discrete to
classification
Classification-Generative
• The generative approach starts with modelling p(x|c), as in unsupervised learning.
• The model parameters are estimated (e.g. using maximum likelihood).
• The assignment is based on the posterior probabilities p(c|x).
• Requires estimation of many parameters.
[Figure: graphical model with parameter nodes π and θ, class label c and observed data x.]
Example: discriminant analysis
• Assume the class-conditional distributions to be Gaussian: p(x | c_j) = N(x | μ_j, Σ_j).
• Estimate the means and covariances using maximum likelihood (O(KD²) parameters).
• Classify novel (test) data using the posterior p(c_j | x) = π_j N(x | μ_j, Σ_j) / Σ_k π_k N(x | μ_k, Σ_k).
• Exercise: rewrite the posterior in terms of sigmoids (one possible solution is sketched below).
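One way to carry out the exercise for two classes: divide the numerator and the denominator of the posterior by the numerator, which turns the posterior into a logistic sigmoid of the log-odds a:

```latex
p(c_1 \mid x)
 = \frac{\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1)}
        {\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1) + \pi_2 \mathcal{N}(x \mid \mu_2, \Sigma_2)}
 = \sigma(a),
\quad
a = \log \frac{\pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1)}{\pi_2 \mathcal{N}(x \mid \mu_2, \Sigma_2)},
\quad
\sigma(a) = \frac{1}{1 + e^{-a}} .
```

With a shared covariance Σ_1 = Σ_2, the log-odds a is linear in x, which is why the result resembles the model used by logistic regression on the following slides.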
Classification-Discriminative
• Also called the diagnostic approach.
• Modelling the class-conditional distributions involves significant overheads.
• Discriminative techniques avoid this by modelling directly the posterior distribution p(c_j | x).
• Closely related to the concept of transductive learning.
[Figure: graphical model with parameter nodes θ and μ, class label c and observed data x.]
Example: logistic regression
• Restrict to the two-class case.
• Model the posterior as p(c = 1 | x) = σ(w^T x), where we have introduced the logistic sigmoid σ(a) = 1 / (1 + exp(-a)).
• Notice the similarity with the discriminant function in generative models.
• The number of parameters to be estimated is D.
Estimating logistic regression
• Let σ_i = σ(w^T x_i).
• Let c_i be 0 or 1.
• The likelihood for the data (x_i, c_i) is p(c | w) = Π_i σ_i^{c_i} (1 - σ_i)^{1 - c_i}.
• The gradient of the negative log-likelihood is ∇_w = Σ_i (σ_i - c_i) x_i.
• Overfitting is a danger, so we may need a prior on w (a simple gradient-descent sketch with a Gaussian prior follows below).
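A minimal sketch of the estimation in Python/NumPy: plain gradient descent on the negative log-likelihood, with a Gaussian prior on w entering as a quadratic penalty. The function names, learning rate and penalty strength are illustrative choices, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, c, lr=0.1, n_iter=1000, alpha=1.0):
    """Minimise the negative log-likelihood
        -sum_i [c_i log s_i + (1 - c_i) log(1 - s_i)]
    plus the Gaussian-prior term (alpha/2) ||w||^2, by gradient descent."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        s = sigmoid(X @ w)                 # s_i = sigma(w^T x_i)
        grad = X.T @ (s - c) + alpha * w   # gradient of the penalised objective
        w -= lr * grad
    return w

# Usage: predicted probabilities for test inputs X_test are sigmoid(X_test @ w).
```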
Semi-supervised learning
• In many practical applications, the labelling
process is expensive and time consuming.
• Plenty of examples of inputs x are available, but few of the corresponding targets c.
• The goal is still to predict p(c|x), and performance is still evaluated using classification accuracy.
• Semi-supervised learning is, in this sense, a
special case of supervised learning where we
use the extra unlabeled data to improve the
predictive power.
Notation
• We have a labelled or complete data set D_l, comprising N_l pairs (x, c).
• We have an unlabelled or incomplete data set D_u, comprising N_u vectors x.
• The interesting (and common) case is when N_u >> N_l.
Baselines
• One approach to semi-supervised learning would be to ignore the labels and estimate an unsupervised clustering model.
• Another would be to ignore the unlabelled data and just use the complete data.
• No free lunch: it is possible to construct data
distributions for which either of the baselines
outperforms a given SSL method.
• The art in SSL lies in designing good models for specific applications, rather than in general theory.
Generative SSL
• Generative SSL methods are the most intuitive.
• The graphical model for generative classification and unsupervised learning is the same.
• The likelihoods are combined:
  L(π, θ) = Π_{i ∈ D_l} π_{c_i} p(x_i | c_i, θ) · Π_{i ∈ D_u} Σ_c π_c p(x_i | c, θ).
• Parameters are estimated by EM (a semi-supervised EM sketch in Python follows below).
[Figure: graphical model with parameter nodes π and θ, class label c and observed data x.]
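A minimal sketch of the combined EM in Python/NumPy, under the Gaussian-mixture assumptions used earlier: labelled points have their responsibilities clamped to the observed class in the E-step, unlabelled points get the usual soft responsibilities, and the M-step is unchanged. The function and variable names are illustrative; integer labels 0..K-1 and at least one labelled point per class are assumed:

```python
import numpy as np
from scipy.stats import multivariate_normal

def ssl_em_gmm(X_l, c_l, X_u, K, n_iter=100):
    """Semi-supervised EM for a Gaussian mixture: clamp the responsibilities of
    labelled points to their observed class, soft-assign the unlabelled points."""
    X = np.vstack([X_l, X_u])
    N, D = X.shape
    Nl = X_l.shape[0]
    # Initialise from the labelled data only.
    pi = np.array([(c_l == j).mean() for j in range(K)])
    mu = np.array([X_l[c_l == j].mean(axis=0) for j in range(K)])
    Sigma = np.stack([np.cov(X_l.T) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iter):
        # E-step: soft responsibilities, then clamp the labelled ones.
        gamma = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                          for j in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        gamma[:Nl] = np.eye(K)[c_l]

        # M-step: identical to the fully unsupervised case.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```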
Discriminant analysis
• McLachlan (1977) considered SSL for a generative model with two Gaussian classes.
• Assume the covariance to be known; the means need to be estimated from the data.
• Assume we have N_u = M_1 + M_2 << N_l.
• Assume also that the labelled data is split evenly between the two classes, so N_l = 2n.
A surprising result
• Consider the expected misclassification error.
• Let R be the expected error obtained using only the labelled data, and R_1 the expected error obtained using all the data (estimating the parameters with EM).
• If a certain condition on the problem holds, the first-order expansion (in N_u/N_l) of R - R_1 is positive only for N_u < M*, with M* finite.
• In other words, unlabelled data helps only up to a point.
A way out
• Perhaps considering labelled and unlabelled data on the same footing is not ideal.
• McLachlan went on to propose estimating the class means by weighting the contribution of the unlabelled data.
• He then shows that, for a suitable weight, the expected error obtained in this way is, to first order, always lower than the one obtained without the unlabelled data.
A hornets’ nest
• In general, it seems sensible to reweight the labelled and unlabelled contributions to the log-likelihood (one possible form is sketched below).
• This is partly because, in the important case when N_l is small, we do not want the label information to be swamped.
• But how do you choose the weight λ?
• Cross-validation on the labels?
• Infeasible for small N_l.
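One natural way to write the reweighting, chosen here so that λ = 1 uses only the labelled data and λ = 0 only the unlabelled data (matching the convention on the stability slide that follows); the exact form is an assumption, not taken from the lecture:

```latex
\mathcal{L}_\lambda(\theta)
 \;=\; \lambda \sum_{i \in D_l} \log p(x_i, c_i \mid \theta)
 \;+\; (1 - \lambda) \sum_{i \in D_u} \log p(x_i \mid \theta),
\qquad \lambda \in [0, 1].
```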
Stability
• An elegant solution was proposed by Corduneanu and Jaakkola (2002).
• In the limit when all the data is labelled, the likelihood is unimodal.
• In the limit when all the data is unlabelled, the likelihood is multimodal (e.g. because of label permutations).
• When moving from λ = 1 to λ = 0, we must encounter a critical λ* at which the likelihood becomes multimodal.
• That λ* is the “optimal” choice.
Discriminative SSL
• The alternative paradigm for SSL is the discriminative approach.
• As in supervised learning, the discriminative approach models directly p(c | x), using the graphical model shown below: parameters θ for p(c | x, θ) and μ for p(x | μ).
• The total likelihood factorises as
  p(D_l, D_u | θ, μ) = Π_{i ∈ D_l} p(c_i | x_i, θ) p(x_i | μ) · Π_{i ∈ D_u} p(x_i | μ).
[Figure: graphical model with parameter nodes θ and μ, class label c and observed data x.]
Surprise!
• Using Bayes' theorem, we obtain the posterior over θ as
  p(θ | D_l, D_u) ∝ p(θ) Π_{i ∈ D_l} p(c_i | x_i, θ),
  which does not depend on the unlabelled data at all.
• This is seriously worrying!
• Why?
Regularization
• Bayesian methods cannot be employed in a straight discriminative SSL setting.
• There has been some limited success with non-Bayesian methods (e.g. Anderson 1978, for logistic regression over discrete variables).
• The most promising avenue is to modify the graphical model used.
• This is known as regularization.
[Figure: modified graphical model with parameter nodes θ and μ, class label c and observed data x.]
Discriminative vs Generative
• The idea behind regularization is that information about the generative structure of x feeds into the discriminative process.
• Does it still make sense to talk about a
discriminative approach?
• Strictly speaking, perhaps not.
• In practice, the hypotheses used for regularization are much weaker than modelling the full class-conditional distribution.
Cluster assumption
• Perhaps the most widely used form of
regularization
• It states that the boundary between classes
should cross areas of low data density
• It is often reasonable, particularly in high dimensions.
• It is implemented in a host of Bayesian and non-Bayesian methods (one simple example is sketched below).
• Can be understood in terms of smoothness of
the discriminant function
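One simple, concrete instance of the cluster assumption (an illustration chosen here, not taken from the lecture) is entropy regularization: penalise uncertain predictions on the unlabelled points, so that the decision boundary is pushed away from dense regions:

```latex
\min_{w}\;
 -\sum_{i \in D_l} \log p(c_i \mid x_i, w)
 \;+\; \lambda \sum_{i \in D_u} H\!\left[\, p(c \mid x_i, w) \,\right],
```

where H[·] is the entropy of the predictive distribution and λ > 0 trades off the labelled fit against the confidence on the unlabelled points.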
Manifold assumption
• Another convenient assumption is that the data
lies on a low-dimensional sub-manifold of a
high-dimensional space
• Often the starting point is to map the data to a
high dimensional space via a feature map
• The cluster assumption is then applied in the
high dimensional space
• The key concept is the graph Laplacian.
• These ideas were pioneered by Belkin and Niyogi.
Manifolds cont.
• We want discriminant functions f which vary little in regions of high data density.
• Mathematically, this is equivalent to ∫ ‖∇f‖² dp(x) being very small; this quantity can be written as ⟨f, Δf⟩, where Δ is the Laplace-Beltrami operator.
• The finite-sample approximation to Δ is given by the graph Laplacian.
• The objective function for SSL is a combination of the error given by f on the labelled data and of its norm under Δ (a sketch in Python follows below).
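A minimal sketch in Python/NumPy of the finite-sample objects, assuming a fully connected Gaussian-weighted graph and a squared-error loss on the labelled points (both choices are illustrative, not prescribed by the lecture): build the graph Laplacian L = D - W and minimise the labelled error plus λ fᵀLf.

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Dense Gaussian-weighted graph Laplacian L = D - W over the data points."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def manifold_regularised_fit(X, y_l, labelled_idx, lam=1.0, sigma=1.0):
    """Find f over all points minimising
        sum over labelled i of (f_i - y_i)^2  +  lam * f^T L f,
    i.e. labelled squared error plus the graph-Laplacian smoothness penalty.
    y_l are the labelled targets (e.g. +/-1 for two classes)."""
    N = X.shape[0]
    L = graph_laplacian(X, sigma)
    S = np.zeros((N, N))                 # diagonal selector of labelled points
    S[labelled_idx, labelled_idx] = 1.0
    y = np.zeros(N)
    y[labelled_idx] = y_l
    # Setting the gradient of the quadratic objective to zero gives (S + lam L) f = S y.
    f = np.linalg.solve(S + lam * L + 1e-9 * np.eye(N), S @ y)
    return f
```

The returned vector f gives values on labelled and unlabelled points alike; thresholding it (e.g. at zero for ±1 labels) yields the semi-supervised classification.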