
How to learn a generative
model of images
Geoffrey Hinton
Canadian Institute for Advanced Research
&
University of Toronto
How the brain works
• Each neuron receives inputs from thousands of other neurons
– A few neurons also get inputs from the sensory receptors
– A few neurons send outputs to muscles.
– Neurons use binary spikes of activity to communicate
• The effect that one neuron has on another is controlled by a
synaptic weight
– The weights can be
positive or negative
• The synaptic weights adapt so that the whole network learns
to perform useful computations
– Recognizing objects, understanding language, making
plans, controlling the body
How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights
on thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
– A blind search would be much too slow.
– DNA doesn’t have enough capacity to store the answer.
• So there must be an intelligent designer.
– What does she look like?
– Where did she come from?
The intelligent designer
• The intelligent designer is a learning algorithm.
– The algorithm adjusts the weights to give the neural
network a better model of the data it encounters.
• A learning algorithm is the differential equation of knowledge.
• Evolution produced the learning algorithm
– Trial and error in the space of learning algorithms is a
much better strategy than trial and error in the space of
synapse strengths.
• To understand the learning algorithm, we first need to
understand the type of network it produces.
– Shape recognition is a good task to consider.
– We are much better at it than computers, and it uses a lot of
neurons.
Hopfield nets
• Model each pixel in an image using a binary neuron that has states of 1 or 0.
• Connect the neurons together with symmetric connections.
• Update the neurons one at a time based on the total input they receive.
• Stored patterns correspond to the energy minima of the network.
[Figure: a small network of binary neurons with states 0 or 1, joined by symmetric weights such as 3.7 and -4.2.]
To store a pattern we change the weights to
lower the energy of that pattern.
E(s) = - \sum_{i \in \text{units}} s_i b_i - \sum_{i<j} s_i s_j w_{ij}

where E(s) is the energy of binary configuration s, s_i is the binary state of unit i in configuration s, b_i is the bias of unit i, w_{ij} is the weight between units i and j, and the second sum indexes every non-identical pair of i and j once.
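As a minimal illustration of the two operations described above (not from the lecture), the sketch below updates one unit from its total input and lowers the energy of a pattern with a simple Hebbian-style weight change; the function names and the particular storage rule are my own assumptions.

```python
import numpy as np

def update_unit(states, W, b, i):
    """Set unit i to 1 if its total input is positive, else 0 (this never raises the energy)."""
    total_input = b[i] + states @ W[:, i]
    states[i] = 1.0 if total_input > 0 else 0.0
    return states

def store_pattern(W, pattern):
    """Lower the energy of a binary pattern with a Hebbian-style change (one common choice)."""
    s = 2.0 * pattern - 1.0        # map 0/1 states to -1/+1 before taking the outer product
    dW = np.outer(s, s)
    np.fill_diagonal(dW, 0.0)      # no self-connections
    return W + dW
```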
Why a Hopfield net doesn’t work
• The ways in which shapes vary are much too
complicated to be captured by pair-wise
interactions between pixels.
– To capture all the allowable variations of a
shape we need extra “hidden” variables that
learn to represent the features that the shape
is composed of.
Some examples of real handwritten digits
From Hopfield Nets to Boltzmann Machines
• Boltzmann machines are stochastic Hopfield
nets with hidden variables.
• They have a simple learning algorithm that
adapts all of the interactions so that the
equilibrium distribution over the visible variables
matches the distribution of the observed data.
– The pair-wise interactions with the hidden
variables can model higher-order correlations
between visible variables.
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic
function of the neuron’s bias, b, and the input it receives
from other neurons.
p(s_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_j s_j w_{ji})}

[Figure: the logistic function. p(s_i = 1) rises smoothly from 0 to 1 as the total input b_i + \sum_j s_j w_{ji} increases, passing through 0.5 at zero total input.]
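A minimal NumPy sketch of this stochastic update, assuming a symmetric weight matrix with zero diagonal; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def sample_stochastic_binary(states, weights, biases, rng=None):
    """Sample new binary states with p(s_i = 1) = 1 / (1 + exp(-(b_i + sum_j s_j w_ji)))."""
    rng = rng or np.random.default_rng()
    total_input = biases + states @ weights       # b_i + sum_j s_j w_ji
    p_on = 1.0 / (1.0 + np.exp(-total_input))     # the logistic function plotted above
    return (rng.random(p_on.shape) < p_on).astype(float)
```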
How a Boltzmann Machine models data
• The aim of learning is to
discover weights that
cause the equilibrium
distribution of the whole
network to match the data
distribution on the visible
variables.
• Everything is defined in
terms of energies of joint
configurations of the
visible and hidden units.
[Figure: a layer of hidden units connected to a layer of visible units.]
The Energy of a joint configuration
E(v, h) = - \sum_{i \in \text{units}} s_i^{vh} b_i - \sum_{i<j} s_i^{vh} s_j^{vh} w_{ij}

where E(v, h) is the energy with configuration v on the visible units and h on the hidden units, s_i^{vh} is the binary state of unit i in joint configuration (v, h), b_i is the bias of unit i, w_{ij} is the weight between units i and j, and the second sum indexes every non-identical pair of i and j once.
Using energies to define probabilities
• The probability of a joint
configuration over both visible
and hidden units depends on
the energy of that joint
configuration compared with
the energy of all other joint
configurations.
• The probability of a
configuration of the visible
units is the sum of the
probabilities of all the joint
configurations that contain it.
p(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}

p(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}

The denominator, summed over all joint configurations (u, g), is the partition function.
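For a network small enough to enumerate, these probabilities can be computed exactly. The sketch below is my own illustration with invented names: it lists every joint configuration, computes its energy with the formula above, and normalizes by the partition function.

```python
import itertools
import numpy as np

def joint_and_visible_probs(W, b, n_visible):
    """Exact p(v, h) and p(v) for a tiny Boltzmann machine by brute-force enumeration.

    W: symmetric weight matrix with zero diagonal.  b: bias vector.
    The first n_visible units are taken to be the visible ones.
    """
    n = len(b)
    configs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    # E(s) = - sum_i s_i b_i - sum_{i<j} s_i s_j w_ij  (the 0.5 corrects for double counting)
    energies = -configs @ b - 0.5 * np.einsum('ci,ij,cj->c', configs, W, configs)
    unnormalized = np.exp(-energies)
    p_joint = unnormalized / unnormalized.sum()        # divide by the partition function
    p_visible = {}
    for config, p in zip(configs, p_joint):
        v = tuple(int(x) for x in config[:n_visible])
        p_visible[v] = p_visible.get(v, 0.0) + p       # sum over all hidden completions
    return p_joint, p_visible
```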
A very surprising fact
• Everything that one weight needs to know about
the other weights and the data in order to do
maximum likelihood learning is contained in the
difference of two correlations.
\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_v - \langle s_i s_j \rangle_{free}

The left-hand side is the derivative of the log probability of one training vector. \langle s_i s_j \rangle_v is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units, and \langle s_i s_j \rangle_{free} is the expected value of the product of states at thermal equilibrium when nothing is clamped.
The batch learning algorithm
• Positive phase
– Clamp a datavector on the visible units.
– Let the hidden units reach thermal equilibrium at a
temperature of 1 (may use annealing to speed this up)
– Sample ⟨s_i s_j⟩ for all pairs of units
– Repeat for all datavectors in the training set.
• Negative phase
– Do not clamp any of the units
– Let the whole network reach thermal equilibrium at a
temperature of 1 (where do we start?)
– Sample ⟨s_i s_j⟩ for all pairs of units
– Repeat many times to get good estimates
• Weight updates
– Update each weight by an amount proportional to the
difference in ⟨s_i s_j⟩ in the two phases.
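A minimal sketch of the weight update itself, assuming the two correlation matrices have already been estimated by sampling in the two phases (the names below are mine):

```python
def boltzmann_weight_update(W, corr_clamped, corr_free, learning_rate=0.01):
    """Update each weight in proportion to the difference of the two correlations.

    corr_clamped: <s_i s_j> averaged over datavectors with the visible units clamped.
    corr_free:    <s_i s_j> at thermal equilibrium with nothing clamped.
    """
    return W + learning_rate * (corr_clamped - corr_free)
```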
Three reasons why learning is impractical
in Boltzmann Machines
• If there are many hidden layers, it can take a
long time to reach thermal equilibrium when a
data-vector is clamped on the visible units.
• It takes even longer to reach thermal equilibrium
in the “negative” phase when the visible units
are unclamped.
– The unconstrained energy surface needs to
be highly multimodal to model the data.
• The learning signal is the difference of two
sampled correlations which is very noisy.
Restricted Boltzmann Machines
• We restrict the connectivity to make
inference and learning easier.
– Only one layer of hidden units.
– No connections between hidden
units.
• In an RBM, the hidden units are
conditionally independent given the
visible states. It only takes one step
to reach thermal equilibrium when
the visible units are clamped.
– So we can quickly get the exact value of ⟨s_i s_j⟩_v.
[Figure: a restricted Boltzmann machine, with a layer of hidden units j connected to a layer of visible units i and no connections within a layer.]
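Because the hidden units are conditionally independent given the visible units, ⟨s_i s_j⟩_v can be computed exactly without prolonged sampling. A NumPy sketch, assuming v is a batch of binary visible vectors and W, b_hid are the RBM's weights and hidden biases (names are mine):

```python
import numpy as np

def clamped_correlations(v, W, b_hid):
    """Exact <s_i s_j>_v for an RBM: visible state times p(h_j = 1 | v), averaged over the batch."""
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b_hid)))   # hidden units are independent given v
    return v.T @ p_h / len(v)
```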
A picture of the Boltzmann machine learning
algorithm for an RBM
[Figure: alternating Gibbs sampling in an RBM. At t = 0 the visible units i hold a data vector and ⟨s_i s_j⟩_data is measured; the chain is run through t = 1, t = 2, and so on, and at t = infinity the network is sampling a "fantasy", where ⟨s_i s_j⟩_fantasy is measured.]

Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

\Delta w_{ij} = \epsilon \, ( \langle s_i s_j \rangle_{data} - \langle s_i s_j \rangle_{fantasy} )
Contrastive divergence learning:
A quick way to learn an RBM
[Figure: one full step of alternating Gibbs sampling. At t = 0 the visible units hold the data and ⟨s_i s_j⟩_data is measured; at t = 1 they hold the reconstruction and ⟨s_i s_j⟩_recon is measured.]

Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.

\Delta w_{ij} = \epsilon \, ( \langle s_i s_j \rangle_{data} - \langle s_i s_j \rangle_{recon} )
This is not following the gradient of the log likelihood. But it works well.
It is trying to make the free energy gradient be zero at the data distribution.
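Putting the four steps and the update rule together, here is a minimal NumPy sketch of one CD-1 update on a batch of binary data vectors. It assumes logistic units, samples the hidden states to drive the reconstruction, and uses probabilities in the learning statistics, which is a common simplification; all names are illustrative.

```python
import numpy as np

def cd1_step(v_data, W, b_vis, b_hid, lr=0.05, rng=None):
    """One contrastive-divergence (CD-1) update on a batch of binary data vectors."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Positive phase: drive the hidden units from the data.
    p_h_data = sigmoid(v_data @ W + b_hid)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)

    # One step of alternating Gibbs sampling: reconstruct the visibles, re-drive the hiddens.
    p_v_recon = sigmoid(h_sample @ W.T + b_vis)
    p_h_recon = sigmoid(p_v_recon @ W + b_hid)

    # Change each weight in proportion to <s_i s_j>_data - <s_i s_j>_recon.
    n = len(v_data)
    W = W + lr * (v_data.T @ p_h_data - p_v_recon.T @ p_h_recon) / n
    b_vis = b_vis + lr * (v_data - p_v_recon).mean(axis=0)
    b_hid = b_hid + lr * (p_h_data - p_h_recon).mean(axis=0)
    return W, b_vis, b_hid
```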
How to learn a set of features that are good for
reconstructing images of the digit 2
[Figure: two copies of a network with 50 binary feature neurons above a 16 x 16 pixel image. For the data (reality), the weights between an active pixel and an active feature are incremented; for the reconstruction (which has lower energy than reality), the weights between an active pixel and an active feature are decremented.]
The weights of the 50 feature detectors
We start with small random weights to break symmetry
The final 50 x 256 weights
Each neuron grabs a different feature.
How well can we reconstruct the digit images
from the binary feature activations?
[Figure: data images paired with reconstructions from the activated binary features. Top: new test images from the digit class that the model was trained on. Bottom: images from an unfamiliar digit class (the network tries to see every image as a 2).]
Training a deep network
• First train a layer of features that receive input directly
from the pixels.
• Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
• It can be proved that each time we add another layer of
features we get a better model of the set of training
images.
– i.e. we assign lower free energy to the real data and
higher free energy to all other possible images.
– The proof uses the fact that the variational free
energy of a non-equilibrium distribution is always
higher than the variational free energy of the
equilibrium distribution.
– The proof depends on a neat equivalence.
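A sketch of this greedy layer-by-layer procedure, assuming a hypothetical train_rbm(data, n_hidden) routine (for example, repeated CD-1 steps) that returns the learned weights and biases; this is an illustration, not the exact code behind these slides.

```python
import numpy as np

def train_deep_network(data, layer_sizes, train_rbm):
    """Greedy layer-wise training: each layer's hidden activations become the next layer's data."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(layer_input, n_hidden)    # learn one RBM (e.g. with CD-1)
        weights.append((W, b_vis, b_hid))
        # Treat the hidden activation probabilities as if they were pixels for the next layer.
        layer_input = 1.0 / (1.0 + np.exp(-(layer_input @ W + b_hid)))
    return weights
```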
A causal network that is
equivalent to an RBM
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.
[Figure: an infinite directed network whose layers (v, h1, h2, h3, and so on) are connected by weight matrices that alternate between W and its transpose W^T, all tied together.]
Learning a deep causal
network
• First learn with all the weights tied.
[Figure: the infinite network with every weight matrix equal to W1.]
• Then freeze the bottom layer and relearn all the other layers.
[Figure: the weights into the bottom layer stay at W1 while every higher weight matrix becomes W2.]
• Then freeze the bottom two layers and relearn all the other layers.
[Figure: the two bottom weight matrices stay at W1 and W2 while every higher weight matrix becomes W3.]
The generative model after learning 3 layers
• To generate data:
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
  2. Perform a top-down pass to get states for all the other layers.
• So the lower-level bottom-up connections are not part of the generative model.
[Figure: a stack of layers (data, h1, h2, h3) connected by weight matrices W1, W2, W3.]
Why the hidden configurations should be treated
as data when learning the next layer of weights
• After learning the first layer of weights:

\log p(v) \ge \langle -\text{energy}(v) \rangle + \text{entropy}(h \mid v)
            = \sum_h p(h \mid v) \left[ \log p(h) + \log p(v \mid h) \right] + \text{entropy}

• If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, we get:

\log p(v) \ge \sum_h p(h \mid v) \log p(h) + \text{constant}

• Maximizing the RHS is equivalent to maximizing the log probability of "data" h that occurs with probability p(h \mid v).
A neural model of digit recognition
The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.

The model learns to generate combinations of labels and images.

To perform recognition we do an up-pass from the image followed by a few iterations of the top-level associative memory.

[Figure: the architecture. A 28 x 28 pixel image feeds into 500 neurons, then another 500 neurons, then 2000 top-level neurons, which are also connected to 10 label neurons.]
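A hypothetical sketch of that recognition procedure. The data structures (up_weights as a list of weight and bias pairs, top_rbm as the top-level weights and biases) and every name are invented for illustration; the real system differs in detail.

```python
import numpy as np

def recognize_digit(image, up_weights, top_rbm, n_labels=10, n_iters=10, rng=None):
    """Up-pass through the recognition weights, then a few iterations of the
    top-level associative memory with the image-derived features clamped."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Up-pass: propagate the image through the lower recognition layers.
    features = image
    for W, b in up_weights:
        features = sigmoid(features @ W + b)

    # Top-level RBM: its visible layer is the penultimate features plus the label units.
    W_top, b_vis, b_hid = top_rbm
    v = np.concatenate([features, np.full(n_labels, 0.1)])   # start the labels near uniform
    for _ in range(n_iters):
        p_top = sigmoid(v @ W_top + b_hid)
        top = (rng.random(p_top.shape) < p_top).astype(float)
        p_v = sigmoid(top @ W_top.T + b_vis)
        v = np.concatenate([features, p_v[len(features):]])  # keep the feature part clamped
    return int(np.argmax(v[len(features):]))                 # most active label unit
```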
Fine-tuning with the up-down algorithm:
A contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM
– This eliminates explaining away at the top-level.
– It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with
the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in
the vicinity of the data.
– It also reduces mode averaging. If the recognition
weights prefer one mode, they will stick with that mode
even if the generative weights like some other mode
just as much.
SHOW THE MOVIE
Examples of correctly recognized handwritten digits
that the neural network had never seen before
It's very good.
How well does it discriminate on MNIST test set with
no extra information about geometric distortions?
• Generative model based on RBM's             1.25%
• Support Vector Machine (Decoste et al.)     1.4%
• Backprop with 1000 hiddens (Platt)          1.6%
• Backprop with 500 --> 300 hiddens           1.6%
• K-Nearest Neighbor                          ~ 3.3%
• It's better than backprop and much more neurally plausible
because the neurons only need to send one kind of signal,
and the teacher can be another sensory input.
Learning perceptual physics
• Suppose we have a video sequence of some balls
bouncing in a box.
• A physicist would model the data using Newton’s laws.
To do this, you need to decide:
– How many objects are there?
– What are the coordinates of their centers at each time
step?
– How elastic are they?
• Does a baby do the same as a physicist?
– Maybe we can just learn a model of how the world
behaves from the raw video.
– It doesn’t learn the abstractions that the physicist has,
but it does know what it likes.
• And what it likes is videos that obey Newtonian physics
The conditional RBM model
• Given the data and the previous hidden
state and the previous visible frames, the
hidden units at time t are conditionally
independent.
– So it is easy to sample from their
conditional equilibrium distribution.
• Learning can be done by using
contrastive divergence.
– Reconstruct the data at time t from
the inferred states of the hidden units.
– The temporal connections between
hiddens can be learned as if they
were additional biases
\Delta w_{ij} = \epsilon \, s_i \left( \langle s_j \rangle_{data} - \langle s_j \rangle_{recon} \right)

[Figure: visible frames at times t-2, t-1, and t, with directed connections from units i in the earlier frames to the hidden units j at time t.]
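A minimal sketch of that update for the directed temporal connections, assuming s_prev holds a batch of binary states from an earlier frame and the hidden statistics come from the data and from the reconstruction (all names are mine):

```python
def temporal_connection_update(s_prev, p_h_data, p_h_recon, lr=0.05):
    """Learn a directed connection from unit i in an earlier frame to hidden unit j
    as a dynamic bias: delta w_ij = lr * s_i * (<s_j>_data - <s_j>_recon)."""
    return lr * s_prev.T @ (p_h_data - p_h_recon) / len(s_prev)
```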
Show Ilya’s movies
THE END
For more on this type of learning see:
www.cs.toronto.edu/~hinton/science.pdf
For the proof that adding extra layers makes the
model better see the paper on my web page:
“A fast learning algorithm for deep belief nets”
Learning with realistic labels
This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.

[Figure: the architecture. A 28 x 28 pixel image feeds into 500 units, then another 500 units, then 2000 top-level units, which are also connected to 10 label units.]
Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral
spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual
pathway. The auditory and visual inputs shared the top level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
[Figure: an ambiguous original visual input, together with the reconstruction produced when the auditory input is "six" and the reconstruction produced when it is "five".]
The features learned in the first hidden layer
Seeing what it is thinking
• The top level associative memory
has activities over thousands of
neurons.
– It is hard to tell what the network
is thinking by looking at the
patterns of activation.
• To see what it is thinking, convert
the top-level representation into an
image by using top-down
connections.
– A mental state is the state of a
hypothetical world in which the
internal representation is
correct.
[Figure: a brain state: the extra activation of cortex caused by a speech task. What were they thinking?]
What goes on in its mind if we show it an
image composed of random pixels and ask it
to fantasize from there?
[Figure: the network (a 28 x 28 pixel image, 500 neurons, another 500 neurons, and 2000 top-level neurons connected to 10 label neurons), with each stage's "brain" state shown alongside the corresponding "mind" state, together with feature, data, and reconstruction images.]