Learning Lateral Connections
between Hidden Units
Geoffrey Hinton
University of Toronto
in collaboration with
Kejie Bao
University of Toronto
Overview of the talk
• Causal Model: Learns to represent images using
multiple, simultaneous, hidden, binary causes.
– Introduce the variational approximation trick
• Boltzmann Machines: Learning to model the
probabilities of binary vectors.
– Introduce the brief Monte Carlo trick
• Hybrid model: Use a Boltzmann machine to model
the prior distribution over configurations of binary
causes. Uses both tricks
• Causal hierarchies of MRF’s:
Generalize the hybrid model to many hidden layers
– The causal connections act as insulators that
keep the local partition functions separate.
Bayes Nets: Hierarchies of causes
• It is easy to generate an unbiased example at the leaf nodes.
• It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
• Given samples from the posterior, it is easy to learn the local interactions.
[Figure: a directed graphical model with a hidden cause at the top and visible effects at the leaf nodes.]
A simple set of images
[Figure: two of the training images, the probabilities of turning on the binary hidden units, and the reconstructions of the images.]
The generative model
To generate a datavector:
• first generate a code from the prior distribution
• then generate an ideal datavector from the code
• then add Gaussian noise.

$$p(d) = \sum_{c \,\in\, \text{codes}} p(c)\, p(d \mid c)$$

$$p(d \mid c) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\big(d_i - \hat d_{i|c}\big)^2 / 2\sigma^2},
\qquad \hat d_{i|c} = b_i + \sum_j s_j^c\, w_{ji}$$

Here d̂_{i|c} is the value that code c predicts for the i'th component of the data vector, b_i is the bias for that component, s_j^c is the binary state of hidden unit j in code vector c, and w_{ji} is the weight from hidden unit j to pixel i.
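As a concrete illustration of this generative process, here is a minimal numpy sketch that samples a binary code from a factorial prior, forms the ideal datavector from the biases and generative weights, and adds Gaussian noise. The sizes, prior probabilities and weight values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_pixels, sigma = 4, 6, 0.5                 # made-up sizes and noise level
W = rng.normal(0, 0.5, size=(n_hidden, n_pixels))     # w_ji: hidden unit j -> pixel i
b = np.zeros(n_pixels)                                # pixel biases b_i
prior_p = np.full(n_hidden, 0.3)                      # factorial prior over codes

# 1. Generate a code from the prior distribution.
s = (rng.random(n_hidden) < prior_p).astype(float)
# 2. Generate the ideal datavector that this code predicts.
d_ideal = b + s @ W
# 3. Add Gaussian noise.
d = d_ideal + rng.normal(0, sigma, size=n_pixels)
print(s, d.round(2))
```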
Learning the model
• For each image in the training set we ought to consider
all possible codes. This is exponentially expensive.
$$\log p(d) = \log \sum_{c \,\in\, \text{codes}} p(c)\, p(d \mid c)$$

Here p(c) is the prior probability of code c and (d_i − d̂_{i|c}) is the prediction error of code c. Differentiating with respect to a generative weight gives

$$\frac{\partial \log p(d)}{\partial w_{ji}} = \frac{1}{\sigma^2} \sum_{c} p(c \mid d)\; s_j^c \,\big( d_i - \hat d_{i|c} \big)$$

where p(c|d) is the posterior probability of code c.
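To make the exponential sum concrete, the sketch below enumerates every code of a tiny model, computes the exact posterior p(c|d), and forms the weight gradient above. The sizes, prior and datavector are invented; with more than a handful of hidden units this enumeration is exactly the explosion the next slide tries to avoid.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_hidden, n_pixels, sigma = 4, 6, 0.5
W = rng.normal(0, 0.5, size=(n_hidden, n_pixels))     # generative weights w_ji
b = np.zeros(n_pixels)                                # pixel biases
prior_p = np.full(n_hidden, 0.3)                      # factorial prior over codes
d = rng.normal(0, 1, size=n_pixels)                   # one made-up datavector

codes = np.array(list(product([0, 1], repeat=n_hidden)))          # all 2^n_hidden codes
d_hat = b + codes @ W                                             # ideal datavector per code
log_prior = (codes * np.log(prior_p) + (1 - codes) * np.log(1 - prior_p)).sum(1)
log_lik = -0.5 * ((d - d_hat) ** 2).sum(1) / sigma**2 \
          - n_pixels * np.log(np.sqrt(2 * np.pi) * sigma)
log_joint = log_prior + log_lik
posterior = np.exp(log_joint - log_joint.max())
posterior /= posterior.sum()                                      # exact p(c | d)

# Exact gradient of log p(d) with respect to the generative weights w_ji.
grad_W = (posterior[:, None, None] *
          codes[:, :, None] * (d - d_hat)[:, None, :]).sum(0) / sigma**2
print(grad_W.shape)   # (n_hidden, n_pixels)
```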
How to beat the exponential explosion
of possible codes
• Instead of considering each code separately, we
could use an approximation to the true posterior
distribution. This makes it tractable to consider
all the codes at once.
• Instead of computing a separate prediction error
for each binary code, we compute the expected
squared error given the approximate posterior
distribution over codes
– Then we just change the weights to minimize
this expected squared error.
A factorial approximation
• For a given datavector, assume that each code
unit has a probability of being on, but that the
code units are conditionally independent of each
other.
$$Q(c \mid d) = \prod_j \Big[\, s_j^c\, q_j + (1 - s_j^c)(1 - q_j) \,\Big]$$

The product runs over all code units: the first term is used if code unit j is on in code vector c, otherwise the second term is used.
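A tiny helper, assuming the q values are given, that evaluates this factorial distribution for any particular binary code:

```python
import numpy as np

def factorial_code_prob(code, q):
    """Probability of a binary code vector under the factorial
    distribution Q(c|d) defined by the per-unit probabilities q."""
    code, q = np.asarray(code), np.asarray(q)
    return np.prod(code * q + (1 - code) * (1 - q))

# Example with made-up q values for 3 code units:
print(factorial_code_prob([1, 0, 1], [0.9, 0.2, 0.7]))   # 0.9 * 0.8 * 0.7
```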
The expected squared prediction error
$$\Big\langle \big(d_i - \hat d_{i|c}\big)^2 \Big\rangle
= \Big( d_i - \big( b_i + \sum_j q_j w_{ji} \big) \Big)^{2} \;+\; \sum_j q_j (1 - q_j)\, w_{ji}^2$$

The first term is the squared error of the expected prediction. The second term is the additional squared error caused by the variance in the prediction; it prevents the model from cheating by using the precise real-valued q values to make precise predictions.
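A quick numerical check of this identity; the q values, weights and pixel value below are arbitrary, and the closed form should match a brute-force expectation over all binary codes.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n_hidden = 3
w = rng.normal(size=n_hidden)      # weights w_ji from each hidden unit to one pixel i
b, d_i = 0.1, 0.8                  # pixel bias and observed pixel value (made up)
q = np.array([0.9, 0.3, 0.6])      # approximate posterior probabilities

# Closed form: squared error of the mean prediction plus the variance term.
closed = (d_i - (b + q @ w)) ** 2 + np.sum(q * (1 - q) * w ** 2)

# Brute force: expectation over all binary codes under the factorial distribution.
brute = 0.0
for bits in product([0, 1], repeat=n_hidden):
    s = np.array(bits)
    prob = np.prod(s * q + (1 - s) * (1 - q))
    brute += prob * (d_i - (b + s @ w)) ** 2

print(np.isclose(closed, brute))   # True
```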
Approximate inference
• We use an approximation to the posterior distribution
over hidden configurations.
– assume the posterior factorizes into a product of
distributions (q_j, 1 − q_j) for each hidden cause.
• If we use the approximation for learning, there is no
guarantee that learning will increase the probability that
the model would generate the observed data.
• But maybe we can find a different and sensible objective
function that is guaranteed to improve at each update.
A trade-off between how well the model fits
the data and the tractability of inference
$$G(\theta) = \sum_d \Big[\, \log p(d \mid \theta) \;-\; KL\big(\, Q(h \mid d) \,\|\, P(h \mid d, \theta) \,\big) \,\Big]$$

G is the new objective function; θ are the parameters, d the data, Q(h|d) the approximating posterior distribution, and P(h|d,θ) the true posterior distribution. The first term measures how well the model fits the data; the second term measures the inaccuracy of inference.
This makes it feasible to fit models that are so complicated that we
cannot figure out how the model would generate the data, even if we
know the parameters of the model.
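The reason G stays tractable is that expanding the KL term cancels the intractable true posterior, leaving only an expectation under Q and the entropy of Q. This standard rearrangement (not spelled out on the slide) is:

$$\begin{aligned}
G(\theta)
  &= \sum_d \Big[ \log p(d\mid\theta)
        - \sum_h Q(h\mid d)\,\log\frac{Q(h\mid d)}{P(h\mid d,\theta)} \Big] \\
  &= \sum_d \sum_h Q(h\mid d)\,\log\frac{p(d\mid\theta)\,P(h\mid d,\theta)}{Q(h\mid d)} \\
  &= \sum_d \Big[ \sum_h Q(h\mid d)\,\log p(d,h\mid\theta) \;+\; H\big(Q(\cdot\mid d)\big) \Big],
\end{aligned}$$

i.e. the expected log joint probability under Q plus the entropy of Q, neither of which requires the true posterior.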
Where does the approximate posterior come from?
• We have a tractable cost
function expressed in terms
of the approximating
probabilities, q.
• So we can use the gradient
of the cost function w.r.t. the
q values to train a
“recognition network” to
produce good q values.
Note: assume that the prior over codes also factors, so it can be represented by generative biases.
Two types of density model
Stochastic generative model using a directed acyclic graph (e.g. a Bayes net):
$$p(d) = \sum_c p(c)\, p(d \mid c)$$
– Generation from the model is easy.
– Inference can be hard.
– Learning is easy after inference.

Energy-based models that associate an energy with each data vector:
$$p(d) = \frac{e^{-E(d)}}{\sum_r e^{-E(r)}}$$
– Generation from the model is hard.
– Inference can be easy.
– Is learning hard?
A simple energy-based model
• Connect a set of binary stochastic units together
using symmetric connections. Define the energy
of a binary configuration, alpha, to be
$$E_\alpha = -\sum_{i<j} s_i^\alpha\, s_j^\alpha\, w_{ij}$$
• The energy of a binary vector determines its
probability via the Boltzmann distribution.
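A minimal sketch of such a model: with only a few units we can enumerate every binary configuration, evaluate its energy, and normalize to get the Boltzmann distribution. The weights here are random and purely illustrative.

```python
import numpy as np
from itertools import product

# A tiny network of 4 binary units with made-up symmetric weights
# (zero diagonal, w_ij = w_ji).
rng = np.random.default_rng(3)
n = 4
W = rng.normal(0, 1, size=(n, n))
W = np.triu(W, 1); W = W + W.T

def energy(s, W):
    """E_alpha = - sum_{i<j} s_i s_j w_ij for a binary configuration s."""
    return -0.5 * s @ W @ s           # 0.5 because the quadratic form counts each pair twice

configs = np.array(list(product([0, 1], repeat=n)))
E = np.array([energy(s, W) for s in configs])
p = np.exp(-E); p /= p.sum()          # Boltzmann distribution over all 2^n configurations
print(p.round(3))
```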
Maximum likelihood learning is hard in
energy-based models
• To get high probability for d we need low energy
for d and high energy for its main rivals, r
$$p(d) = \frac{e^{-E(d)}}{\sum_r e^{-E(r)}}$$

It is easy to lower the energy of d. We need to find the serious rivals to d and raise their energy, and this seems hard.
Markov chain Monte Carlo

$$\frac{\partial \log p(d)}{\partial \theta} = -\frac{\partial E(d)}{\partial \theta} \;+\; \sum_r p(r)\, \frac{\partial E(r)}{\partial \theta}$$

Can we sample rivals r with exactly this probability p(r)?
• It is easy to set up a Markov chain so that it finds
the rivals to the data with just the right probability
A picture of the learning rule for a fully visible
Boltzmann machine
[Figure: a chain of network states that starts at a training vector at t = 0 and is updated at t = 1, t = 2, …, t = infinity (a fantasy), with the pairwise statistics ⟨s_i s_j⟩^0, ⟨s_i s_j⟩^1, …, ⟨s_i s_j⟩^∞ collected at each stage.]
Start with a training vector. Then pick units at random
and update their states stochastically using the rule:
$$p(s_j = 1) = \frac{1}{1 + e^{-x_j}}, \qquad \text{where}\quad x_j = \sum_k s_k\, w_{jk}$$
The maximum likelihood learning rule is then
$$\Delta w_{ij} = \eta \,\Big( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^\infty \Big)$$
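The following numpy sketch implements this rule for a toy fully visible Boltzmann machine. The training vectors are random placeholders, and a fixed number of Gibbs sweeps (50 here) stands in for running the chain to equilibrium.

```python
import numpy as np

rng = np.random.default_rng(4)

def gibbs_step(s, W, rng):
    """One sweep of stochastic updates: visit each unit in random order and
    resample it from p(s_j = 1) = sigmoid(sum_k s_k w_jk)."""
    for j in rng.permutation(len(s)):
        x = s @ W[j]
        s[j] = rng.random() < 1.0 / (1.0 + np.exp(-x))
    return s

def pairwise_stats(samples):
    """<s_i s_j> averaged over a set of binary state vectors."""
    S = np.array(samples, dtype=float)
    return S.T @ S / len(S)

data = rng.integers(0, 2, size=(20, 5)).astype(float)   # made-up training vectors
W = np.zeros((5, 5))
eta = 0.05

for epoch in range(100):
    pos = pairwise_stats(data)                 # <s_i s_j>^0, clamped to the data
    fantasies = []
    for d in data:
        s = d.copy()
        for _ in range(50):                    # long chain as a crude stand-in for t = infinity
            s = gibbs_step(s, W, rng)
        fantasies.append(s.copy())
    neg = pairwise_stats(fantasies)            # approximately <s_i s_j>^infinity
    dW = eta * (pos - neg)
    np.fill_diagonal(dW, 0.0)                  # no self-connections
    W += (dW + dW.T) / 2                       # keep the weights symmetric
```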
A surprising shortcut
• Instead of taking the negative samples from the
equilibrium distribution, use slight corruptions of
the datavectors. Only run the Markov chain for
a few steps.
– Much less variance because a datavector
and its confabulation form a matched pair.
– Seems to be very biased, but maybe it is
optimizing a different objective function.
• If the model is perfect and there is an infinite
amount of data, the confabulations will be
equilibrium samples. So the shortcut will not
cause learning to mess up a perfect model.
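Under this shortcut the only change to the previous sketch is the negative phase: instead of long chains, each datavector is corrupted by a single Gibbs sweep and the resulting confabulation supplies the negative statistics (one-step contrastive divergence). The data are again placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

def gibbs_step(s, W, rng):
    # Same stochastic update rule as in the previous sketch.
    for j in rng.permutation(len(s)):
        s[j] = rng.random() < 1.0 / (1.0 + np.exp(-(s @ W[j])))
    return s

data = rng.integers(0, 2, size=(20, 5)).astype(float)   # made-up training vectors
W = np.zeros((5, 5))
eta = 0.05

for epoch in range(100):
    dW = np.zeros_like(W)
    for d in data:
        confab = gibbs_step(d.copy(), W, rng)            # brief Markov chain: one sweep
        dW += np.outer(d, d) - np.outer(confab, confab)  # matched positive/negative pair
    dW *= eta / len(data)
    np.fill_diagonal(dW, 0.0)
    W += (dW + dW.T) / 2
```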
Intuitive motivation
• It is silly to run the Markov chain all the way to
equilibrium if we can get the information
required for learning in just a few steps.
– The way in which the model systematically
distorts the data distribution in the first few
steps tells us a lot about how the model is
wrong.
– But the model could have strong modes far
from any data. These modes will not be
sampled by brief Monte Carlo. Is this a
problem in practice? Apparently not.
Mean field Boltzmann machines
• Instead of using binary units with stochastic updates,
approximate the Markov chain by using deterministic
units with real-valued states, q, that represent a
distribution over binary states.
$$Q(c) = \prod_j \Big[\, s_j^c\, q_j + (1 - s_j^c)(1 - q_j) \,\Big]$$
• We can then run a deterministic approximation to the
brief Markov chain:
t 1
qj

1
1 e
x j
where x j  
k
t
qk w jk
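A minimal sketch of this deterministic approximation, with made-up symmetric lateral weights and a short, fixed number of synchronous mean-field sweeps:

```python
import numpy as np

def mean_field_sweep(q, W, n_steps=3):
    """Deterministic approximation to a brief Markov chain:
    q_j <- sigmoid(sum_k q_k w_jk), applied synchronously for a few steps."""
    for _ in range(n_steps):
        q = 1.0 / (1.0 + np.exp(-(W @ q)))
    return q

# Made-up symmetric lateral weights and an initial q vector.
rng = np.random.default_rng(6)
W = rng.normal(0, 0.5, size=(6, 6))
W = np.triu(W, 1); W = W + W.T
q = np.full(6, 0.5)
print(mean_field_sweep(q, W))
```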
The hybrid model
• We can use the same factored distribution over
code units in a causal model and in a mean field
Boltzmann machine that learns to model the
prior distribution over codes.
• The stochastic generative model is:
– First sample a binary vector from the prior
distribution that is specified by the lateral
connections between code units
– Then use this code vector to produce an ideal
data vector
– Then add Gaussian noise.
A hybrid model
$$C_1 = -\sum_j q_j b_j \;-\; \sum_{j<k} q_j q_k w_{jk}
\;+\; \sum_j \big[\, q_j \log q_j + (1 - q_j)\log(1 - q_j) \,\big] \;+\; \log Z$$

The first two terms are the expected energy under the factorial distribution over codes (b_j are the code-unit biases and w_jk the lateral weights), the third term is minus its entropy, and the partition function Z is independent of the causal model.

$$C_0 = \frac{1}{2\sigma^2} \sum_i \Big[\, \mathrm{Var}(\hat d_i) + (d_i - \hat d_i)^2 \,\Big]$$

[Figure: the hybrid architecture; a recognition model maps the data d up to the laterally connected code units j, k, and the generative weights w_ji produce the predictions d̂_i.]
The learning procedure
• Do a forward pass through the recognition model to
compute q+ values for the code units
• Use the q+ values to compute top-down predictions of
the data and use the expected prediction errors to
compute:
– derivatives for the generative weights
– likelihood derivatives for the q+ values
• Run the code units for a few steps ignoring the data to
get the q- values. Use these q- values to compute
– The derivatives for the lateral weights.
– The derivatives for the q+ values that come from the
prior.
• Combine the likelihood and prior derivatives of the q+
values and backpropagate through the recognition net.
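The sketch below walks through these steps for a single datavector, under several simplifying assumptions: invented layer sizes, a single-layer deterministic recognition pass, three mean-field steps for the q- values, and no backpropagation through the recognition weights (that final step is only noted in a comment). It is meant to show the shape of the updates, not the exact procedure used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(7)
n_hidden, n_pixels, sigma, eta = 8, 16, 0.5, 0.01

R  = rng.normal(0, 0.1, size=(n_pixels, n_hidden))   # recognition weights (data -> code)
W  = rng.normal(0, 0.1, size=(n_hidden, n_pixels))   # generative weights w_ji
b  = np.zeros(n_pixels)                              # generative pixel biases
L  = np.zeros((n_hidden, n_hidden))                  # lateral weights among code units
bh = np.zeros(n_hidden)                              # code-unit biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = rng.normal(0, 1, size=n_pixels)                  # one made-up datavector

# 1. Recognition pass: q+ values for the code units.
q_pos = sigmoid(d @ R)

# 2. Top-down prediction and derivatives of the expected squared error C0.
d_hat = b + q_pos @ W
grad_W = (-np.outer(q_pos, d - d_hat) + (q_pos * (1 - q_pos))[:, None] * W) / sigma**2
grad_b = -(d - d_hat) / sigma**2

# 3. Run the code units for a few steps, ignoring the data, to get q- values.
q_neg = q_pos.copy()
for _ in range(3):
    q_neg = sigmoid(bh + L @ q_neg)

# 4. Contrastive update for the lateral weights and code biases.
dL = np.outer(q_pos, q_pos) - np.outer(q_neg, q_neg)
np.fill_diagonal(dL, 0.0)
L  += eta * (dL + dL.T) / 2
bh += eta * (q_pos - q_neg)

# 5. Update the generative weights and biases by gradient descent on C0.
W -= eta * grad_W
b -= eta * grad_b
# (Combining the likelihood and prior derivatives of q+ and backpropagating
#  them through the recognition weights R is omitted in this sketch.)
```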
Simulation by Kejie Bao
[Figure: the generative weights of the hidden units learned in the simulation.]
Adding more hidden layers
[Figure: the architecture with an extra hidden layer k above layer j. A recognition model maps the data d upward through the layers, layer k supplies top-down inputs x_j to layer j, and the generative weights w_ji produce the predictions d̂_i of the data.]
The cost function for a multilayer model
C_2 is like C_1 but without the top-down inputs.

$$C_1 = -\sum_j q_j (b_j + x_j) \;-\; \sum_{j<k} q_j q_k w_{jk}
\;+\; \sum_j \big[\, q_j \log q_j + (1 - q_j)\log(1 - q_j) \,\big] \;+\; \log Z\{x_j\}$$

Z{x_j} is a conditional partition function that depends on the current top-down inputs x_j to each unit.

$$C_0 = \frac{1}{2\sigma^2} \sum_i \Big[\, \mathrm{Var}(\hat d_i) + (d_i - \hat d_i)^2 \,\Big]$$
The learning procedure for multiple hidden
layers
• The top-down inputs control the conditional
partition function of a layer, but all the required
derivatives can still be found using the
differences between the q+ and the q- statistics.
• The learning procedure is just the same except
that the top-down inputs to a layer from the layer
above must be frozen in place while each layer
separately runs its brief Markov chain.
Advantages of a causal hierarchy of Markov
Random Fields
• Allows clean-up at each stage of
generation in a multilayer generative
model. This makes it easy to maintain
constraints.
• The lateral connections implement a prior
that squeezes the redundancy out of each
hidden layer by making most possible
configurations very unlikely. This creates a
bottleneck of the appropriate size.
• The causal connections between layers
separate the partition functions so that the
whole net does not have to settle. Each
layer can settle separately.
– This solves Terry’s problem.
THE END
Energy-Based Models with deterministic
hidden units
• Use multiple layers of
deterministic hidden units
with non-linear activation
functions.
• Hidden activities
contribute additively to
the global energy, E.
$$p(d) = \frac{e^{-E(d)}}{\sum_c e^{-E(c)}}$$

[Figure: a feed-forward network on the data in which each deterministic hidden layer contributes energies E_j, E_k additively to the global energy E(d).]
Contrastive divergence
Aim is to minimize the amount by which a step
toward equilibrium improves the data distribution.
$$CD = KL(Q^0 \,\|\, Q^\infty) - KL(Q^1 \,\|\, Q^\infty)$$

Here Q^0 is the data distribution, Q^1 is the distribution after one step of the Markov chain, and Q^∞ is the model's distribution. Minimizing the contrastive divergence means minimizing the divergence between the data distribution and the model's distribution while maximizing the divergence between the confabulations and the model's distribution.
Contrastive divergence
$$\frac{\partial}{\partial\theta}\, KL(Q^0 \,\|\, Q^\infty) \;=\; \Big\langle \frac{\partial E}{\partial\theta} \Big\rangle_{Q^0} - \Big\langle \frac{\partial E}{\partial\theta} \Big\rangle_{Q^\infty}$$

$$\frac{\partial}{\partial\theta}\, KL(Q^1 \,\|\, Q^\infty) \;=\; \Big\langle \frac{\partial E}{\partial\theta} \Big\rangle_{Q^1} - \Big\langle \frac{\partial E}{\partial\theta} \Big\rangle_{Q^\infty} + \frac{\partial Q^1}{\partial\theta}\, \frac{\partial\, KL(Q^1 \,\|\, Q^\infty)}{\partial Q^1}$$

Taking the difference, contrastive divergence makes the awkward ⟨∂E/∂θ⟩ terms under Q^∞ cancel. The remaining extra term arises because changing the parameters changes the distribution of confabulations Q^1; in practice this term is ignored.