
Introduction to DBN
Jiaxing Tan
Some slides are from the 2007 NIPS tutorial by Prof. Geoffrey Hinton.
Outline
 Shortcomings of label-based back-propagation
 What are Belief Nets?
 Restricted Boltzmann Machine
 Deep Belief Network
 Example
Outline
 Shortcomings of label-based back-propagation
 What are Belief Nets?
 Restricted Boltzmann Machine
 Deep Belief Network
 Example
What is wrong with backpropagation? (This may no longer be true today.)
 It requires labeled training data.
– Almost all data is unlabeled.
 The learning time does not scale well
– It is very slow in networks with multiple hidden layers.
Overcoming the limitations of backpropagation
 Keep the efficiency and simplicity of using a gradient method for adjusting
the weights, but use it for modeling the structure of the sensory input.
 Adjust the weights to maximize the probability that a generative model would
have produced the sensory input.
 Learn p(image), not p(label | image): to recognize shapes, first learn to
generate images.
Outline
 Shortcomings of label-based back-propagation
 What are Belief Nets?
 Restricted Boltzmann Machine
 Deep Belief Network
 Example
Belief Nets
 A belief net is a directed
acyclic(no directed cycle)
graph composed of stochastic
variables.
stochastic
hidden
cause
 We get to observe some of the
variables and we would like to
solve two problems:
 The inference problem: Infer
the states of the unobserved
variables.
 The learning problem: Adjust
the interactions between
variables to make the network
more likely to generate the
observed data.
visible
effect
We will use nets
composed of layers of
stochastic binary variables
with weighted connections
The learning rule for sigmoid belief nets
 Learning is easy if we can get
an unbiased sample from the
posterior distribution over
hidden states given the
observed data.
 For each unit, maximize the
log probability that its binary
state in the sample from the
posterior would be generated
by the sampled binary states
of its parents.
 If s_j = 0 there is no weight change; only when s_j = 1 does the sampled
state of unit i contribute to the update.
[Figure: parent unit j connected to child unit i by weight w_ji.]

p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp\left(-\sum_j s_j w_{ji}\right)}

\Delta w_{ji} = \varepsilon \, s_j (s_i - p_i)
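Below is a minimal NumPy sketch of this learning rule for one layer of parent and child units. The layer sizes, learning rate, and the assumption that we already hold one sample from the posterior are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 4 parent units j, 3 child units i.
n_parents, n_children = 4, 3
W = rng.normal(scale=0.1, size=(n_parents, n_children))  # w_ji
eps = 0.05  # learning rate

# One sampled configuration from the posterior (assumed given here).
s_parents = rng.integers(0, 2, size=n_parents)    # s_j
s_children = rng.integers(0, 2, size=n_children)  # s_i

# p_i = sigma(sum_j s_j w_ji): probability that each child turns on
# given the sampled binary states of its parents.
p = sigmoid(s_parents @ W)

# Delta w_ji = eps * s_j * (s_i - p_i); the update is zero whenever
# the parent s_j is off, as the slide points out.
W += eps * np.outer(s_parents, s_children - p)
```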
Outline
 Shortcomings of label-based back-propagation
 What are Belief Nets?
 Restricted Boltzmann Machine
 Deep Belief Network
 Example
Restricted Boltzmann Machines
 We restrict the connectivity to make learning easier.
 Only one layer of hidden units. (We will deal with more layers later.)
 No connections between hidden units.
[Figure: a layer of hidden units j above a layer of visible units i, with connections only between the layers.]
 In an RBM, the hidden units are conditionally independent given the visible
states, so we can quickly get an unbiased sample from the posterior distribution
for a given data vector (just one multiplication by W).
 This is a big advantage over directed belief nets, where explaining away makes
the hidden causes conditionally dependent given the data.
The Energy of a joint configuration
(ignoring terms to do with biases)
 The same Boltzmann form of distribution also appears in Brownian motion and in finance.

E(v, h) = -\sum_{i,j} v_i h_j w_{ij}

where v_i is the binary state of visible unit i, h_j is the binary state of
hidden unit j, w_{ij} is the weight between units i and j, and E(v, h) is the
energy of configuration v on the visible units and h on the hidden units. The
minus sign means that lower energy corresponds to higher probability.

-\frac{\partial E(v, h)}{\partial w_{ij}} = v_i h_j

 The energy involves only v, h, and the weights, so the weights can be fitted
by maximum-likelihood estimation (MLE).
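A tiny NumPy sketch of the energy function above; the weight matrix and unit states are made up, and bias terms are ignored exactly as on the slide.

```python
import numpy as np

def energy(v, h, W):
    """E(v, h) = -sum_{i,j} v_i h_j w_ij (bias terms ignored, as on the slide)."""
    return -float(v @ W @ h)

# Illustrative configuration: 3 visible units, 2 hidden units.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(3, 2))
v = np.array([1, 0, 1])
h = np.array([1, 1])

print(energy(v, h, W))
# The gradient of the (negative) energy is simple: -dE/dw_ij = v_i * h_j.
grad = np.outer(v, h)
```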
Weights → Energies → Probabilities
 Each possible joint configuration of the visible and hidden units has
an energy
 The energy is determined by the weights and biases (as in a Hopfield
net).
 The energy of a joint configuration of the visible and hidden units
determines its probability:
p(v, h) \propto e^{-E(v, h)}
 The probability of a configuration over the visible units is found by
summing the probabilities of all the joint configurations that contain it.
Using energies to define probabilities
 The probability of a joint
configuration over both
visible and hidden units
depends on the energy of
that joint configuration
compared with the energy
of all other joint
configurations.
 The probability of a
configuration of the visible
units is the sum of the
probabilities of all the joint
configurations that contain
it.
p(v, h) = \frac{e^{-E(v, h)}}{\sum_{u,g} e^{-E(u, g)}}

p(v) = \frac{\sum_h e^{-E(v, h)}}{\sum_{u,g} e^{-E(u, g)}}

The denominator \sum_{u,g} e^{-E(u, g)} is the partition function.
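For a toy RBM these probabilities can be computed exactly by brute-force enumeration of every joint configuration, which makes the role of the partition function concrete. The sizes and random weights below are illustrative only; this enumeration is infeasible for realistic models.

```python
import numpy as np
from itertools import product

def energy(v, h, W):
    return -float(np.asarray(v) @ W @ np.asarray(h))

# Toy RBM with made-up weights: 3 visible units, 2 hidden units.
n_vis, n_hid = 3, 2
rng = np.random.default_rng(2)
W = rng.normal(scale=0.5, size=(n_vis, n_hid))

configs_v = list(product([0, 1], repeat=n_vis))
configs_h = list(product([0, 1], repeat=n_hid))

# Partition function: sum over all joint configurations (u, g).
Z = sum(np.exp(-energy(u, g, W)) for u in configs_v for g in configs_h)

def p_joint(v, h):
    return np.exp(-energy(v, h, W)) / Z

def p_visible(v):
    # Marginal: sum the joint probability over all hidden configurations.
    return sum(p_joint(v, h) for h in configs_h)

print(sum(p_visible(v) for v in configs_v))  # should print 1.0 (up to rounding)
```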
A picture of the maximum likelihood
learning algorithm for an RBM
[Figure: alternating Gibbs sampling. At t = 0 a training vector is clamped on
the visible units i and the hidden units j are sampled, giving
\langle v_i h_j \rangle^0; at t = 1, 2, ... the visible and hidden units are
updated in turn; at t = infinity the chain settles to a "fantasy" from the
model, giving \langle v_i h_j \rangle^\infty.]
 Start with a training vector on the visible units.
 Then alternate between updating all the hidden units in parallel and updating
all the visible units in parallel (Gibbs sampling over v and h).
 Then change the weights:

\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty

where \langle v_i h_j \rangle^0 is the expected value of the product of states
at thermal equilibrium when the training vector is clamped on the visible units,
and \langle v_i h_j \rangle^\infty is the same expectation with nothing clamped.
 The beauty of the model is that we do not need to change the weights during
sampling.
A quick way to learn an RBM
[Figure: at t = 0, the data vector is on the visible units and the hidden units
are sampled, giving \langle v_i h_j \rangle^0; at t = 1, the "reconstruction" is
on the visible units and the hidden units are resampled, giving
\langle v_i h_j \rangle^1.]
Contrastive divergence:
 Start with a training vector on the visible units.
 Update all the hidden units in parallel.
 Update all the visible units in parallel to get a "reconstruction".
 Update the hidden units again.
 Then change the weights:

\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)

 This makes learning millions of times faster: roughly 100x less computation,
multiplied by computers that became about 10,000x faster over 17 years.
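A minimal CD-1 weight update in NumPy, assuming a bias-free RBM; biases, mini-batches, momentum, and the other details of a practical implementation are omitted, so treat it as a sketch of the update rule rather than a full recipe.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, eps=0.05):
    """One contrastive-divergence (CD-1) step for a bias-free RBM sketch."""
    # t = 0: sample hidden units given the data vector.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # t = 1: reconstruct the visible units, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)
    # delta w_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1)
    return W + eps * (np.outer(v0, ph0) - np.outer(v1, ph1))

# Toy usage with made-up sizes.
W = rng.normal(scale=0.1, size=(6, 4))
v0 = rng.integers(0, 2, size=6).astype(float)
W = cd1_update(v0, W)
```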
How to use an RBM
 M visible units, N hidden units.
 Suppose we have a weight matrix W, where W_ij is the weight w(v_i -> h_j).
 Given a new data vector, represented by the visible states v_i, calculate the
activations: for each hidden unit h_j, compute the probability that h_j is open,
p(h_j = 1 | v) = 1 / (1 + \exp(-\sum_i v_i W_{ij})).
http://blog.163.com/silence_ellen/blog/static/176104222201431710264087/ (Chinese Website)
How to use RBM (cont’)
 To decide whether h_j is open or not, sample u from a Uniform(0, 1) distribution.
 Then decide: h_j = 1 if u < p(h_j = 1 | v), otherwise h_j = 0.
 Given the weights in the other direction, W(h_i -> v_j), and an input Y, a
sample X of the visible units is generated by the same process (a code sketch
follows after the link below).
http://blog.163.com/silence_ellen/blog/static/176104222201431710264087/ (Chinese Website)
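The two sampling directions described above might look like the following sketch, assuming a trained M x N weight matrix W with W[i, j] = w(v_i -> h_j); the sizes and random weights here are placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical "trained" weights: M visible units, N hidden units.
M, N = 6, 4
W = rng.normal(scale=0.1, size=(M, N))

def sample_hidden(v, W):
    """Given visible states v, decide for each h_j whether it is open."""
    p_open = sigmoid(v @ W)            # p(h_j = 1 | v) for each hidden unit
    u = rng.random(N)                  # u ~ Uniform(0, 1)
    return (u < p_open).astype(float)  # h_j = 1 if u_j < p(h_j = 1 | v)

def sample_visible(h, W):
    """The reverse direction: given hidden states, sample the visible units."""
    p_open = sigmoid(h @ W.T)
    u = rng.random(M)
    return (u < p_open).astype(float)

v = rng.integers(0, 2, size=M).astype(float)
h = sample_hidden(v, W)
x = sample_visible(h, W)
```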
Outline
 Shortcomings of label-based back-propagation
 What are Belief Nets?
 Restricted Boltzmann Machine
 Deep Belief Network
 Example
Training a deep network
[Figure: an infinite directed net with tied weights, equivalent to an RBM:
layers v0, h0, v1, h1, v2, h2, etc., connected by alternating weight matrices
W and W^T.]
 Tied weights.
 Contrastive divergence for this restricted setting.
 Does the structure look familiar?
Training a deep network
 First train a layer of features that receive input directly from the
pixels.
 Then treat the activations of the trained features as if they were
pixels and learn features of features in a second hidden layer.
(greedy layer-wise training)
 It can be proved that each time we add another layer of
features we improve a variational lower bound on the log
probability of the training data.
The proof is slightly complicated, but it is based on a neat equivalence between
an RBM and a deep directed model.
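A greedy layer-wise training loop as described above might look like the sketch below. The `train_rbm` helper is a stripped-down, bias-free CD-1 trainer introduced only for illustration, and the data and layer sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, eps=0.05):
    """Placeholder CD-1 trainer for one RBM layer (bias-free sketch)."""
    W = rng.normal(scale=0.1, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T)
            ph1 = sigmoid(pv1 @ W)
            W += eps * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    return W

def train_dbn(data, layer_sizes):
    """Greedy layer-wise training: each layer's activations become
    the 'pixels' for the next layer, as described on the slide."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(layer_input, n_hidden)
        weights.append(W)
        layer_input = sigmoid(layer_input @ W)  # features of features
    return weights

# Toy usage with made-up data and sizes.
data = rng.integers(0, 2, size=(20, 16)).astype(float)
dbn_weights = train_dbn(data, [12, 8, 4])
```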
Fine-tune: the wake-sleep algorithm (actually sleep and wake here)
 Wake phase: do a bottom-up pass, sampling h with the recognition weights from
the input v for each RBM, and then adjust the generative weights by the RBM
learning rule. If the reality differs from the imagination, modify the
generative weights to bring what is imagined closer to the reality.
 Sleep phase: do a top-down pass, starting from a random state of h at the top
layer and generating v; then the recognition weights are modified. If the
illusions produced by the concepts learned during the wake phase differ from
those concepts, modify the recognition weights to bring the illusions closer to
the concepts.
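A one-hidden-layer sketch of the two phases, with separate recognition weights R and generative weights G. The sizes, the simple delta-rule updates written out here, and the single-layer setting are simplifying assumptions, not the full wake-sleep procedure used to fine-tune a DBN.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up sizes: R recognizes h from v, G generates v from h.
n_vis, n_hid, eps = 8, 5, 0.05
R = rng.normal(scale=0.1, size=(n_vis, n_hid))
G = rng.normal(scale=0.1, size=(n_hid, n_vis))

def wake_phase(v):
    """Bottom-up pass: recognize h, then adjust G so the top-down
    'imagination' of v gets closer to the real v."""
    global G
    h = (rng.random(n_hid) < sigmoid(v @ R)).astype(float)
    v_imagined = sigmoid(h @ G)
    G += eps * np.outer(h, v - v_imagined)

def sleep_phase():
    """Top-down pass: start from a random h, generate a fantasy v,
    then adjust R so recognition of the fantasy recovers that h."""
    global R
    h = rng.integers(0, 2, size=n_hid).astype(float)
    v_fantasy = (rng.random(n_vis) < sigmoid(h @ G)).astype(float)
    h_recognized = sigmoid(v_fantasy @ R)
    R += eps * np.outer(v_fantasy, h - h_recognized)

v = rng.integers(0, 2, size=n_vis).astype(float)
wake_phase(v)
sleep_phase()
```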
A neural model of digit recognition
[Figure: a 28 x 28 pixel image feeds 500 feature neurons, then another 500
feature neurons; the top level is an associative memory of 2000 neurons
connected to 10 label neurons.]
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do
an up-pass from the image, followed by a few iterations of the top-level
associative memory.
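The recognition procedure could be sketched as below, using hypothetical (untrained) weight matrices that match the layer sizes on the slide. In the real model the weights come from greedy pre-training plus fine-tuning, so this only illustrates the control flow of the up-pass and the top-level iterations.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder weights matching the slide's layer sizes (not trained).
W1 = rng.normal(scale=0.01, size=(784, 500))           # 28x28 pixels -> 500 features
W2 = rng.normal(scale=0.01, size=(500, 500))           # 500 -> 500 features
W_top = rng.normal(scale=0.01, size=(500 + 10, 2000))  # [features, labels] -> 2000 top-level units

def recognize(image, n_iters=10):
    """Up-pass from the image, then a few iterations of the top-level
    associative memory with the label units free to settle."""
    h1 = sigmoid(image @ W1)
    h2 = sigmoid(h1 @ W2)
    labels = np.full(10, 0.1)               # neutral starting state for the labels
    for _ in range(n_iters):
        vis = np.concatenate([h2, labels])  # clamp the features, let the labels move
        top = sigmoid(vis @ W_top)
        down = sigmoid(top @ W_top.T)
        labels = down[500:]
    return int(np.argmax(labels))

digit = recognize(rng.random(784))
```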
Outline
 Shortcomings of label-based back-propagation
 What are Belief Nets?
 Restricted Boltzmann Machine
 Deep Belief Network
 Example
Show DBN
 http://www.cs.toronto.edu/~hinton/adi/index.htm
Reference Materials
 Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for
deep belief nets. Neural Computation, 18, pp. 1527-1554.
 Hinton, G. E. (2007) Learning multiple layers of representation. Trends in
Cognitive Sciences, 11, pp. 428-434.
 Hinton, G. E. (2007) To recognize shapes, first learn to generate images. In
P. Cisek, T. Drew and J. Kalaska (Eds.), Computational Neuroscience: Theoretical
Insights into Brain Function. Elsevier.
 Hinton, G. E., Dayan, P., Frey, B. J. and Neal, R. M. (1995) The wake-sleep
algorithm for unsupervised neural networks. Science, 268, pp. 1158-1161.