CSC2535: 2011
Lecture 5b
Object Recognition and Information
Retrieval with Deep Belief Nets
Geoffrey Hinton
Training a deep network
• First train a layer of features that receive input directly
from the binary pixels.
• Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
• It can be proved that each time we add another layer of
features we get a better model of the set of training
images (if we do it right).
– The proof uses variational free energy.
– The proof doesn’t apply if the layers get smaller. Also,
it requires the weights in each RBM to be initialized
properly and to be trained by properly following the
gradient of the likelihood.
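A minimal sketch of this greedy, layer-by-layer procedure, assuming a hypothetical RBM class with train and hidden_probs methods (this is illustrative only, not code from the lecture):

    def train_deep_belief_net(data, layer_sizes, rbm_class, epochs=50):
        """Greedy layer-wise training: each RBM is trained on the hidden
        activities produced by the RBM below it."""
        rbms = []
        layer_input = data  # binary pixels for the first layer
        for n_hidden in layer_sizes:
            rbm = rbm_class(n_visible=layer_input.shape[1], n_hidden=n_hidden)
            rbm.train(layer_input, epochs=epochs)  # e.g. contrastive divergence
            # Treat the trained features as if they were pixels for the next layer.
            layer_input = rbm.hidden_probs(layer_input)
            rbms.append(rbm)
        return rbms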
Modeling real-valued data
• For images of digits it is possible to represent
intermediate intensities as if they were probabilities by
using “mean-field” logistic units.
– We can treat intermediate values as the probability
that the pixel is inked.
• This will not work for real images.
– In a real image, the intensity of a pixel is almost
always almost exactly the average of the neighboring
pixels.
– Mean-field logistic units cannot represent precise
intermediate values.
A standard type of real-valued visible unit
• We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

[Figure: the energy of a Gaussian visible unit as a function of v_i: a parabolic containment function plus the linear energy-gradient produced by the total input to the visible unit.]

E(v, h) = \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}
Welling et al. (2005) show how to extend RBM’s to the
exponential family. See also Bengio et al. (2007).
The two conditional distributions for a
Gaussian-Bernoulli RBM
p(v_i \mid h) = \mathcal{N}\!\left(b_i + \sigma_i \sum_{j \in \text{hid}} h_j w_{ij},\; \sigma_i^2\right)

p(h_j = 1 \mid v) = \text{logistic}\!\left(b_j + \sum_{i \in \text{vis}} \frac{v_i}{\sigma_i} w_{ij}\right)
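A small NumPy sketch of these two conditionals (my own variable names and shapes, not code from the lecture): W has shape (n_visible, n_hidden) and sigma holds the per-pixel standard deviations.

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_hidden(v, W, b_hid, sigma):
        # p(h_j = 1 | v) = logistic(b_j + sum_i (v_i / sigma_i) w_ij)
        p = logistic(b_hid + (v / sigma) @ W)
        return (np.random.rand(*p.shape) < p).astype(float)

    def sample_visible(h, W, b_vis, sigma):
        # v_i ~ N(b_i + sigma_i * sum_j h_j w_ij, sigma_i^2)
        mean = b_vis + sigma * (h @ W.T)
        return mean + sigma * np.random.randn(*mean.shape)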
A random sample of 10,000 binary filters learned
by Alex Krizhevsky on a million 32x32 color images.
3D Object Recognition: The NORB dataset
Stereo-pairs of grayscale images of toy objects.
[Figure: the normalized-uniform version of NORB, with example stereo pairs from the five classes: animals, humans, planes, trucks, cars.]
- 6 lighting conditions, 162 viewpoints
- Five object instances per class in the training set
- A different set of five instances per class in the test set
- 24,300 training cases, 24,300 test cases
NORB Pre-processing
• Each image in a stereo pair is 96x96, so a training case is 18432-dimensional.
• We reduce the dimensionality to 8976 by blurring the borders of an image. Only the central 64x64 region has full resolution.
• We also make the data zero-mean and scale each dimension to have unit variance.
• We then use an RBM with Gaussian visible units to learn a layer of 4000 binary features from the unlabelled data. These features are not fine-tuned.
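A hedged sketch of the normalization part of this pipeline (the exact border-blurring scheme is not reproduced; reduce_borders is a stand-in assumed to map the 18432 raw pixels to 8976 values):

    import numpy as np

    def preprocess_norb(stereo_pairs, reduce_borders):
        """stereo_pairs: array of shape (N, 18432), two 96x96 images per case.
        reduce_borders: user-supplied function implementing the border blurring
        described above, returning 8976 values per case."""
        x = np.stack([reduce_borders(case) for case in stereo_pairs])
        x = x - x.mean(axis=0)            # zero-mean each dimension
        x = x / (x.std(axis=0) + 1e-8)    # unit variance each dimension
        return x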
Five competing generative models
[Diagram: 8976 Gaussian units -> 4000 binary units (the pre-trained feature layer) -> one of five class-specific top-level models, each with 4000 binary units ("4000 binary units for class 1" or "for class 2" ... or "for class 5").]
• Each class-specific model is trained generatively on data from its own class.
• All five models are also trained discriminatively to make the right model have the lowest free energy.
The model contains ~116 million parameters and is trained with only 24,300 labeled images.
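A sketch of classification by free-energy competition, assuming each class-specific model is an RBM with binary hidden units whose inputs are the binary feature activations from the shared layer (helper names are my own, not the lecture's code):

    import numpy as np

    def free_energy(v, W, b_vis, b_hid):
        """F(v) = -v . b_vis - sum_j log(1 + exp(b_j + sum_i v_i w_ij))
        for a batch of input vectors v (here, the 4000 feature activations)."""
        hidden_input = b_hid + v @ W
        return -(v @ b_vis) - np.sum(np.logaddexp(0.0, hidden_input), axis=1)

    def classify(v, class_models):
        """Each class has its own top-level RBM (a dict of W, b_vis, b_hid);
        pick the class whose model assigns the lowest free energy."""
        F = np.stack([free_energy(v, **m) for m in class_models], axis=1)
        return np.argmin(F, axis=1)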
Results
(Nair & Hinton, NIPS 2009)
Error rates:
- Support Vector Machines: 11.6%
- Convolutional Neural Networks (built-in geometric knowledge): 6.0%
- Single top-level RBM with label variable, generative + discriminative training: 10.4%
- Competing top-level RBMs, generative + discriminative training: 6.5%
- Competing top-level RBMs, generative + discriminative training, plus extra unlabeled translated data: 5.3%
An embarrassing problem
• The front end was a Restricted Boltzmann
Machine with Gaussian visible units and binary
hidden units.
• We tried learning the residual noise model for
each Gaussian visible unit and it did not work.
– So we just fixed the residual noise variance to
be 1, which is much too big.
Why it’s hard to learn an appropriate noise
level for each visible unit
• Suppose the model of the residual
noise in a visible unit has a
standard deviation of 0.1. The
bottom-up weight is effectively
multiplied by 10 and the top-down
weight is effectively divided by 10.
– The resulting bottom-up effects
are much too strong. They
saturate the hidden units.
– The resulting top-down effects
are much too weak. They cannot
move the visible units far enough
away from their mean.
[Diagram: a hidden unit j connected to a visible unit i (noise level σ_i) by weight w_ij; the top-down effect on the visible unit is σ_i w_ij, while the bottom-up input to the hidden unit is v_i w_ij / σ_i.]
How to implement an integer-valued variable
(Teh and Hinton, 2001)
• One way to model an integer-valued variable is to make
N identical copies of a Bernoulli unit. All copies have the
same probability, p = logistic(x), of being “on”.
– The total number of “on” copies (the “firing rate”) has
a binomial distribution with variance N p(1-p)
– Provided p is small, the total number of “on” copies is
Poisson-distributed with variance approximately N p
• It is very hard to stay in the Poisson regime because the
“firing rate” grows exponentially with the input to the unit.
– So learning is very unstable.
How to fix the problem with an infinite model
• Make infinitely many copies of each binary hidden unit.
• All copies have the same weights and the same
adaptive bias, b, but they have different fixed offsets to
the bias:
b  0.5, b  1.5, b  2.5, b  3.5, ....
x
• Now if the bottom-up effects become 10 times stronger,
10 times as many hidden units will turn on producing a
top-down effect that is 10 times as big.
A fast approximation
\sum_{n=1}^{\infty} \text{logistic}(x + 0.5 - n) \approx \log(1 + e^{x})
• Contrastive divergence learning works well for the sum of
binary units with offset biases.
• It also works for rectified linear units. These are much faster
to compute than the sum of many logistic units.
output = max(0, x + randn*sqrt(logistic(x)) )
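The approximation and the sampling rule above can be checked numerically (a small illustrative script, not the lecture's code):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def stepped_sigmoid_sum(x, n_terms=100):
        # sum over copies with offset biases: sum_n logistic(x + 0.5 - n)
        n = np.arange(1, n_terms + 1)
        return logistic(x + 0.5 - n).sum()

    def noisy_relu_sample(x, rng=np.random):
        # the rectified linear approximation with matched sampling noise
        return max(0.0, x + rng.randn() * np.sqrt(logistic(x)))

    x = 3.0
    print(stepped_sigmoid_sum(x))   # ~3.05
    print(np.log1p(np.exp(x)))      # log(1 + e^x) is ~3.05 as well
    print(noisy_relu_sample(x))     # a noisy rectified linear sample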
A harder version of the NORB dataset
• The objects are jittered in position, size, and
intensity.
• A distractor object is placed off-center in the
image.
• A cluttered background is added to the image
(but it’s at a different depth, so stereo can be
quite effective at ignoring it).
• To make experiments quicker the 96x96 images
were reduced to 32x32 images in the simplest
possible way.
[Figure placeholder: example jittered-cluttered NORB images (see the ReLU paper).]
Results on jittered-cluttered NORB
(Nair & Hinton, ICML 2010)
Error rates:
- Multinomial logistic regression on the pixels: 49.9%
- Support Vector Machines: 43.3%
- Convolutional Neural Networks (built-in geometric knowledge): 7.2%
- Multinomial logistic regression on the second hidden layer, using binary units: 18.8%
- Multinomial logistic regression on the second hidden layer, using RELU units: 15.2%
Equivariance
• Simple convolutional neural nets give translational
equivariance, not translational invariance.
[Diagram: a translated image produces a correspondingly translated representation.]
• A small amount of translational invariance is then
achieved at each layer by using local averaging or
maxing.
Intensity equivariance
• Consider a net with multiple layers of rectified linear
units that all have a bias of zero.
– Ignore the sampling noise for now.
• If we scale all the intensities in the input image by S,
exactly the same set of hidden units will be above
threshold in every hidden layer, but all of their
activities will be scaled up by S.
– It’s a linear model.
– If the ratios of intensities do not change, it will not
switch to a different linear model.
– So we have a very non-linear model (globally)
that respects the linear property rep(S x) = S rep(x)
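A tiny numeric check of this property (my own example): with zero biases, scaling the input by a positive factor scales every ReLU activity by the same factor.

    import numpy as np

    rng = np.random.RandomState(0)
    W1, W2 = rng.randn(20, 50), rng.randn(50, 30)   # two weight layers, zero biases

    def relu_net(x):
        h1 = np.maximum(0.0, x @ W1)
        return np.maximum(0.0, h1 @ W2)

    x, S = rng.randn(20), 3.7
    print(np.allclose(relu_net(S * x), S * relu_net(x)))   # True for any S > 0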
Intensity invariance
• Equivariance ensures that if we can represent
an image well, we can also represent it just as
well when all the intensities are scaled up.
• Given equivariance to intensity as the basic
property of the model, we can then achieve
invariance by a little bit of local normalization at
each layer.
– Local divisive normalization does for intensity
what local averaging or maxing does for
translation in a convolutional net.
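A minimal sketch of local divisive normalization (a 1-D window over unit activities; real implementations normalize over 2-D spatial or channel neighbourhoods, and the details vary):

    import numpy as np

    def local_divisive_normalization(activities, pool_size=5, eps=1e-6):
        """Divide each unit's activity by the root-mean-square activity in a
        local window around it, removing the overall intensity scale."""
        padded = np.pad(activities, pool_size // 2, mode='edge')
        windows = np.lib.stride_tricks.sliding_window_view(padded, pool_size)
        local_scale = np.sqrt((windows ** 2).mean(axis=-1)) + eps
        return activities / local_scale

Scaling all the activities by S leaves the normalized output essentially unchanged, which is the invariance described above.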
Test examples from the CIFAR-10 dataset
[Figure: test examples from the 10 classes: plane, car, bird, cat, deer, dog, frog, horse, ship, truck.]
The CIFAR-10 labeled subset of the
MIT tiny images dataset
• There are 5000 32x32 carefully labelled training images
and 1000 32x32 carefully labelled testing images for
each of 10 different classes (so it’s very like MNIST!).
• There are 80 million very badly labelled images that are
found by trawling the web using about 50,000 search
words or phrases taken from “Wordnet”.
– Less than 10% of these images have a dominant
object that fits the search term.
– For example, about 90% of images found using “cat”
do not contain a cat!
• The badly labelled images are very useful for
unsupervised pre-training.
Some ways to make convolutional neural
nets work really well (Alex Krizhevsky)
• Use RELU units instead of binary units in the first hidden
layer.
• Use overlapping pools for “sub-sampling”
– This retains more information about the positions of
features, but codes this information in units that have
large receptive fields.
• In addition to the usual convolutional units that have
local receptive fields with weight-sharing, use a separate
set of global units.
– The global units learn to take care of edge effects that
would otherwise mess up the local units.
Local filters learned by the 64 convolutional
kernels in the first hidden layer of RELUs
[Figure panels: after unsupervised pre-training; after supervised fine-tuning.]
Results of Alex Krizhevsky’s
experiments
Error rates:
- Pre-train a DBN with 2 hidden layers of fully connected binary units, add a softmax layer, and fine-tune with backprop: ~35%
- Convolutional neural net (2 hidden layers with pooling): ~29%
- Convolutional RELU net + global filters (1 hidden layer but no pooling): ~29%
- Convolutional RELU net + global filters (1 hidden layer with overlapping pooling): ~25%
- Convolutional RELU net + global filters (2 hidden layers with overlapping pooling): ~21%
Mixtures of linear models
• Mixtures of linear models are easy to fit and
work surprisingly well on many simple problems.
• Given a set of images of digits that have been
approximately normalized for size and position,
we can get quite a good model by using a
mixture of factor analysers.
– Each factor analyser models one class or
subclass.
– The factors typically model small local
translations of parts of the digit.
• This type of model looks very different from a
multilayer neural net.
Some problems with
mixtures of linear models
• For data with many different things going on at once we
may need exponentially many components in the mixture
– Consider modeling pairs of digits.
• If there are too many components in the mixture they lose
statistical power.
– Each linear model only gets trained on a very small
fraction of the data.
– A mixture cannot benefit from the fact that the top of a
2 is like the top of a 3.
• It’s hard to stitch together the latent representations of
different factor analysers because they all use different
coordinate systems.
– This makes it hard to associate actions with the latent
representations.
An old idea for learning a mixture of exponentially
many linear models (Sallans, Hinton & Ghahramani, 1997)
• First generate a hierarchical
binary state by running a
sigmoid belief net.
• Each unit in the SBN can
switch in a corresponding
linear unit.
– So the SBN chooses which
factors to use in a hierarchical
linear model.
• We get combinatorially many
linear models that share
factors
– but inference is hard.
A better version of the old idea
• Combine the binary switch and the linear variable
into a single rectified linear unit.
– Rectification is equivalent to the rule: If the total
input is negative, exclude that unit from the linear
model.
– This has the nice property that there is no
discontinuity at the point at which the unit is
switched in, because its value is 0.
• Use an undirected model with only one hidden layer
so that inference is easy.
– We can infer the value of a hidden unit without
knowing which other units are switched in.
BREAK?
Overview
• An efficient way to train a multilayer neural network to
extract a low-dimensional representation.
• Document retrieval (published work with Russ Salakhutdinov)
– How to model a bag of words with an RBM
– How to learn binary codes
– Semantic hashing: retrieval in no time
• Image retrieval (unpublished work with Alex Krizhevsky)
– How good are 256-bit codes for retrieval of small
color images?
– Ways to use the speed of semantic hashing for
higher-quality image retrieval (work in progress).
Deep Autoencoders
(with Ruslan Salakhutdinov)
• They always looked like a really
nice way to do non-linear
dimensionality reduction:
– But it is very difficult to
optimize deep autoencoders
using backpropagation.
• We now have a much better way
to optimize them:
– First train a stack of 4 RBM’s.
– Then “unroll” them.
– Then fine-tune with backprop.
[Diagram: encoder 28x28 -> W1 -> 1000 neurons -> W2 -> 500 neurons -> W3 -> 250 neurons -> W4 -> 30-unit code; decoder 30 -> W4^T -> 250 neurons -> W3^T -> 500 neurons -> W2^T -> 1000 neurons -> W1^T -> 28x28 reconstruction.]
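A sketch of the "unrolling" step, assuming the four RBMs have already been trained (variable names are my own; the fine-tuning by backprop is not shown, and the 30-unit code layer is simply kept linear):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def unrolled_autoencoder(x, weights, hid_biases, vis_biases):
        """weights = [W1, W2, W3, W4] from the trained RBM stack.
        Encoder: 28x28 -> 1000 -> 500 -> 250 -> 30 using W1..W4.
        Decoder: 30 -> 250 -> 500 -> 1000 -> 28x28 using the transposes."""
        h = x
        for i, (W, b) in enumerate(zip(weights, hid_biases)):
            pre = h @ W + b
            h = pre if i == len(weights) - 1 else logistic(pre)  # linear code units
        code = h
        for W, b in zip(reversed(weights), reversed(vis_biases)):
            h = logistic(h @ W.T + b)
        return code, h   # the 30-D code and the 28x28 reconstruction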
A comparison of methods for compressing
digit images to 30 real numbers.
[Figure rows: real data; 30-D deep autoencoder; 30-D logistic PCA; 30-D PCA.]
Compressing a document count vector to 2 numbers
[Architecture: 2000 word counts -> 500 neurons -> 250 neurons -> 2 real-valued units -> 250 neurons -> 500 neurons -> 2000 reconstructed counts (the output vector).]
• We train the autoencoder to reproduce its input vector as its output.
• This forces it to compress as much information as possible into the 2 real numbers in the central bottleneck.
• These 2 numbers are then a good way to visualize documents.
We need a special type of RBM to model counts.
[Figure: all documents compressed to 2 numbers using a type of PCA, then colored by document category. Yuk!]
[Figure: the same documents compressed to 2 numbers by the deep autoencoder, then colored by document category.]
The replicated softmax model: How to
modify an RBM to model word count vectors
• Modification 1: Keep the binary hidden units but use
“softmax” visible units that represent 1-of-N
• Modification 2: Make each hidden unit use the same
weights for all the visible softmax units.
[Diagram: the family of RBMs with shared weights; all the models in this family have 5 hidden units, and the model shown is for 8-word documents.]
The replicated softmax model: How to
modify an RBM to model word count vectors
• Modification 3: Use as many softmax visible units
as there are non-stop words in the document.
– So it’s actually a family of different-sized RBMs
that share weights. It’s not a single generative
model.
• Modification 4: Multiply each hidden bias by the
number of words in the document
• The replicated softmax model (sketched in code below) gives much higher
probability to held-out bags of words than LDA topic
models (Salakhutdinov & Hinton, NIPS 2009).
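A sketch of the computations under these modifications (my own notation: counts is a documents-by-vocabulary matrix of word counts, and W is shared by all the softmax visible units):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(counts, W, b_hid):
        """Because all softmax visible units share weights, the input to hidden
        unit j is sum_k n_k w_kj, and each hidden bias is multiplied by the
        number of words N in the document (modification 4)."""
        N = counts.sum(axis=1, keepdims=True)
        return logistic(counts @ W + N * b_hid)

    def expected_counts(h, W, b_vis, N):
        """Top-down pass: a single softmax over the vocabulary, replicated N
        times; shown here as expected counts rather than a multinomial sample."""
        logits = h @ W.T + b_vis
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        return N * p / p.sum(axis=1, keepdims=True)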
The relative negative log probs assigned to bags
of words by LDA and replicated softmax RBM
Finding real-valued codes for retrieval
[Architecture: 2000 word counts -> 500 neurons -> 250 neurons -> 10 -> 250 neurons -> 500 neurons -> 2000 reconstructed counts.]
• Train an auto-encoder using 10 real-valued units in the code layer.
• Compare with Latent Semantic Analysis, which uses PCA on the transformed count vector.
• Non-linear codes are much better.
Retrieval performance on 400,000 Reuters
business news stories
Finding binary codes for documents
[Architecture: 2000 word counts -> 500 neurons -> 250 neurons -> 30 code units (with added noise) -> 250 neurons -> 500 neurons -> 2000 reconstructed counts.]
• Train an auto-encoder using 30 logistic units for the code layer.
• During the fine-tuning stage, add noise to the inputs to the code units.
– The “noise” vector for each training case is fixed, so we still get a deterministic gradient.
– The noise forces their activities to become bimodal in order to resist the effects of the noise.
– Then we simply threshold the activities of the 30 code units to get a binary code.
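A minimal sketch of this noise-then-threshold trick (illustrative only; the noise scale and training details are assumptions, not taken from the lecture):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def code_activities(code_input, fixed_noise):
        """During fine-tuning, a fixed per-case noise vector (drawn once and
        then reused on every pass) is added to the inputs of the 30 logistic
        code units, so the gradient stays deterministic while the activities
        are pushed towards 0 or 1."""
        return logistic(code_input + fixed_noise)

    def binary_code(code_input):
        """After training, simply threshold the code activities at 0.5."""
        return (logistic(code_input) > 0.5).astype(np.uint8)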
Using a deep autoencoder as a hash-function
for finding approximate matches
[Diagram: a document is mapped by the deep autoencoder, acting as a hash function, to a memory address; looking at nearby addresses then finds similar documents ("supermarket search").]
Another view of semantic hashing
• Fast retrieval methods typically work by
intersecting stored lists that are associated with
cues extracted from the query.
• Computers have special hardware that can
intersect 32 very long lists in one instruction.
– Each bit in a 32-bit binary code specifies a list
of half the addresses in the memory.
• Semantic hashing uses machine learning to map
the retrieval problem onto the type of list
intersection the computer is good at.
How good is a shortlist found this way?
• Russ has only implemented it for a million
documents with 20-bit codes --- but what could
possibly go wrong?
– A 20-D hypercube allows us to capture enough
of the similarity structure of our document set.
• The shortlist found using binary codes actually
improves the precision-recall curves of TF-IDF.
– Locality sensitive hashing (the fastest other
method) is much slower and has worse
precision-recall curves.
Semantic hashing for image retrieval
• Currently, image retrieval is typically done by
using the captions. Why not use the images too?
– Pixels are not like words: individual pixels do
not tell us much about the content.
– Extracting object classes from images is hard.
• Maybe we should extract a real-valued vector
that has information about the content?
– Matching real-valued vectors in a big
database is slow and requires a lot of storage
• Short binary codes are easy to store and match
A two-stage method
• First, use semantic hashing with 28-bit binary
codes to get a long “shortlist” of promising
images.
• Then use 256-bit binary codes to do a serial
search for good matches.
– This only requires a few words of storage per
image, and the serial search can be done
using fast bit-operations (see the sketch below).
• But how good are the 256-bit binary codes?
– Do they find images that we think are similar?
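The serial search over the shortlist is just Hamming distance on the 256-bit codes. A sketch using Python integers as codes (a real implementation would pack each code into four 64-bit words and use a hardware popcount):

    def hamming_distance(code_a, code_b):
        # population count of the XOR of the two 256-bit codes
        return bin(code_a ^ code_b).count("1")

    def rerank_shortlist(query_code, shortlist, codes, k=50):
        """Serial search over the shortlist produced by 28-bit semantic hashing:
        rank candidates by Hamming distance of their 256-bit codes."""
        ranked = sorted(shortlist, key=lambda i: hamming_distance(query_code, codes[i]))
        return ranked[:k]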
Some depressing competition
• Inspired by the speed of semantic hashing, Weiss,
Fergus and Torralba (NIPS 2008) used a very fast
spectral method to assign binary codes to images.
– This eliminates the long learning times required by
deep autoencoders.
• They claimed that their spectral method gave better
retrieval results than training a deep auto-encoder using
RBM’s.
– But they could not get RBM’s to work well for
extracting features from RGB pixels so they started
from 384 GIST features.
– This is too much dimensionality reduction too soon.
A comparison of deep auto-encoders and
the spectral method using 256-bit codes
(Alex Krizhevsky)
• Train auto-encoders “properly”
– Use Gaussian visible units with fixed variance.
Do not add noise to the reconstructions.
– Use a cluster machine or a big GPU board.
– Use a lot of hidden units in the early layers.
• Then compare with the spectral method
– The spectral method has no free parameters.
• Also compare with Euclidean match in pixel space
Krizhevsky’s deep autoencoder
[Diagram: the encoder, from the input upwards: a 32x32 color image (three 1024-pixel channels) -> 8192 -> 4096 -> 2048 -> 1024 -> 512 -> 256-bit binary code.]
• The encoder has about 67,000,000 parameters.
• It takes a few GTX 285 GPU-days to train on two million images.
• There is no theory to justify this architecture.
Reconstructions produced by 256-bit codes
Retrieval results
An obvious extension
• Use a multimedia auto-encoder that represents
captions and images in a single code.
– The captions should help it extract more
meaningful image features such as
“contains an animal” or “indoor image”
• RBM’s already work much better than standard
LDA topic models for modeling bags of words.
– So the multimedia auto-encoder should be
a win for images, a win for captions, and a
win for the interaction during training.
A less obvious extension
• Semantic hashing gives incredibly fast retrieval
but it’s hard to go much beyond 32 bits.
• We can afford to use semantic hashing several
times with variations of the query and merge the
shortlists
– It’s easy to enumerate the Hamming ball
around a query image address in ascending
address order, so merging is linear time (see the sketch below).
• Apply many transformations to the query image
to get transformation independent retrieval.
– Image translations are an obvious candidate.
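A sketch of the Hamming-ball enumeration mentioned above (for a 32-bit address and a small radius the ball is tiny, so enumerating it and sorting it is cheap; the resulting sorted lists can then be merged in linear time):

    from itertools import combinations

    def hamming_ball(address, n_bits=32, radius=2):
        """All addresses within the given Hamming radius of the query address,
        returned in ascending order."""
        neighbours = {address}
        for r in range(1, radius + 1):
            for bits in combinations(range(n_bits), r):
                flipped = address
                for b in bits:
                    flipped ^= 1 << b
                neighbours.add(flipped)
        return sorted(neighbours)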
A more interesting extension
• Computer vision uses images of uniform resolution.
– Multi-resolution images still keep all the high-resolution pixels.
• Even on 32x32 images, people use a lot of eye
movements to attend to different parts of the image.
– Human vision copes with big translations by
moving the fixation point.
– It only samples a tiny fraction of the image at
high resolution. The “post-retinal” image has
resolution that falls off rapidly outside the fovea.
– With fewer “neurons”, intelligent sampling
becomes even more important.
How we extract multiple images with
fewer pixels per image
• This variable resolution
“retina” is applied at 81
different locations in
the image.
• Each of the 81 “post-retinal” images is used
for semantic hashing.
– We also use the
whole image.
A more human metric for image similarity
• Two images are similar if fixating at point X in one
image and point Y in the other image gives similar
post-retinal images.
• So use semantic hashing on post-retinal images.
– The address space is used for post-retinal
images and each address points to the whole
image that the post-retinal image came from.
– So we can accumulate similarity over multiple
fixations.
• The whole image addresses found after each
fixation have to be sorted to allow merging.
Summary
• Restricted Boltzmann Machines provide an efficient way
to learn a layer of features without any supervision.
– Many layers of representation can be learned by
treating the hidden states of one RBM as the data for
the next.
• This allows us to learn very deep nets that extract short
binary codes for unlabeled images or documents.
– Using 32-bit codes as addresses allows us to get
approximate matches at the speed of hashing.
• Semantic hashing is fast enough to allow many retrieval
cycles for a single query image.
– So we can try multiple transformations of the query.