Convolutional neural networks

Convolutional Neural Networks
Amir Rosenfeld
Idan Himlich
April 2015
Outline
• Perceptron and back propagation
• Convolutional neural networks
– Pooling
– Dropout
• Multilayer generative model (unsupervised
learning)
Perceptrons:
The first generation of neural networks
• History
– Developed by Frank Rosenblatt in the early 1960’s.
– Lots of grand claims were made for what they
could learn to do.
– In 1969, Minsky and Papert published a book
called “Perceptrons” that analyzed what they
could do and showed their limitations.
Perceptrons
• The Perceptron architecture:
(Figure: feature units → weights → decision unit.)
• The goal of the perceptron is to learn the right values of the weights from the training data.
Perceptrons:
Learning rule
• w_i ← w_i + η (y − h(x)) x_i,  where η is the learning rate, y is the target output and h(x) is the perceptron’s output.
• The learning rule is guaranteed to find a set of weights that gets the right answer for all the training cases, if any such set exists.
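A minimal sketch of this rule in Python/NumPy, assuming a threshold decision unit h(x) = 1 if w·x + b ≥ 0; the toy data, learning rate and epoch count below are illustrative, not from the slides.

import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule: w_i <- w_i + eta * (y - h(x)) * x_i."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, target in zip(X, y):
            h = 1 if np.dot(w, x_i) + b >= 0 else 0   # threshold decision unit
            update = eta * (target - h)                # zero when the prediction is correct
            w += update * x_i
            b += update
            errors += int(update != 0)
        if errors == 0:                                # all training cases answered correctly
            break
    return w, b

# Linearly separable toy data: the AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)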
Perceptrons:
Learning Example
Perceptrons:
What they cannot do
• XOR example:
(1,0) → 1   (0,1) → 1
(0,0) → 0   (1,1) → 0
• The following set of inequalities (weights w_1, w_2 and threshold θ) is impossible:
w_1 ≥ θ,   w_2 ≥ θ   (positive cases)
0 < θ,   w_1 + w_2 < θ   (negative cases)
Adding the first two gives w_1 + w_2 ≥ 2θ > θ, contradicting the last one.
(Figure: the data space — the positive cases (1,0), (0,1) and the negative cases (0,0), (1,1) cannot be separated by a line.)
Perceptrons:
What they cannot do
• We use pixels as features.
• Can we use a perceptron to separate the two classes?
NO!
(Figure: three examples of pattern A and three of pattern B.)
Perceptrons:
What they cannot do
• We use pixels as features.
• Can we use perceptrons to classify shapes as connected or disconnected?
NO!
(Figure: four example shapes, A–D.)
Perceptrons:
What they cannot do
• Networks without hidden units are very limited in the input-output mappings they can learn to model.
• The only way to deal with such limitations is to build hand-coded features that convert the original task into a linearly separable one.
Neural Networks
• The network is given an input vector and it must produce an
output that represents:
– a classification (e.g. the identity of a face)
– or a prediction (e.g. the price of oil tomorrow)
• The network is made of multiple layers of non-linear neurons.
– Each neuron sums its weighted inputs from the layer below and then non-linearly transforms this sum into an output that is sent to the layer above.
Neural networks
(Figure: a layered network — input vector at the bottom, hidden layers in the middle, outputs at the top.)
Neuron Types
Linear neurons
y = b + Σ_i x_i w_i

Threshold neurons
z = b + Σ_i x_i w_i,   y = 1 if z ≥ 0, 0 otherwise

Neuron Types
Sigmoid neurons
z = b + Σ_i x_i w_i,   y = 1 / (1 + e^(−z))

Hyperbolic tangent neurons
z = b + Σ_i x_i w_i,   y = (e^z − e^(−z)) / (e^z + e^(−z))
Network Types
Feed-forward neural networks
(Figure: information flows in one direction, from the input layer toward the output layer.)
Recurrent networks
(Figure: connections form cycles, so information can also flow back to earlier neurons.)
Neural networks (1980’s)
Backpropagation
• The perceptron convergence procedure works by ensuring that every time
the weights change, they get closer to a feasible set of weights.
– This guarantee cannot be extended to more complex networks in
which the average of two good solutions may be a bad solution.
• Instead of making the weights closer to a good set of weights, we make the actual output values get closer to the target values.
Neural networks (1980’s)
Backpropagation
(Figure: the layered network again — compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning; input vector at the bottom, hidden layers in the middle, outputs at the top.)
Learning Algorithm:
Backpropagation
Bottom-Up flow
The input signal propagates forward through the network, layer by layer, until the outputs are produced.
Learning Algorithm:
Backpropagation
When the target output is provided, the error signal of the output-layer neurons is calculated.
Learning Algorithm:
Backpropagation
Once the error signal for each neuron is computed, the weight coefficients on each neuron’s input connections can be modified.
Learning Algorithm:
Backpropagation
Top-Down flow
The idea is to propagate the error signal δ (computed in a single teaching step) back to all the neurons whose output signals were inputs to the neuron in question.
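A compact sketch of the whole procedure for one hidden layer of sigmoid neurons, trained by gradient descent on the squared error. The layer sizes, random data and learning rate are illustrative assumptions, and biases are omitted for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((8, 4))            # 8 training vectors, 4 inputs each
T = rng.random((8, 2))            # 2 target outputs per vector
W1 = rng.normal(0, 0.1, (4, 3))   # input -> hidden weights (3 hidden units)
W2 = rng.normal(0, 0.1, (3, 2))   # hidden -> output weights
eta = 0.5

for epoch in range(1000):
    # Bottom-up pass: propagate the signal through the network.
    H = sigmoid(X @ W1)           # hidden activations
    Y = sigmoid(H @ W2)           # output activations

    # Compare the outputs with the correct answers to get an error signal.
    delta_out = (Y - T) * Y * (1 - Y)

    # Top-down pass: propagate the error signal back to the hidden neurons.
    delta_hid = (delta_out @ W2.T) * H * (1 - H)

    # Modify the weight coefficients using the resulting derivatives.
    W2 -= eta * H.T @ delta_out
    W1 -= eta * X.T @ delta_hid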
Outline
• Perceptron and back propagation
• Convolutional neural networks
– Pooling
– Dropout
• Multilayer generative model (unsupervised
learning)
Convolutional neural network
Lecun 89’
• Goal: Minimizing the space of possible
networks without excluding “good” networks
– Decrease learning time
– Minimize the overfitting effect
• Good generalization in real-life problems cannot be achieved unless some a priori knowledge about the task is built into the network.
Convolutional neural network
Lecun 89’
• Task – Image recognition:
Knowledge about the task → restriction on the network:
– There is an advantage in extracting local features and combining them into higher-order features → force each hidden unit to combine only a local part of the input.
– The precise location of a feature is not relevant for the classification → weight sharing.
Convolutional neural network
Lecun 89’
• Task – Image recognition:
– Use several different feature types, each with its
own map of replicated detectors.
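A sketch of what local connectivity plus weight sharing amount to in code: each feature map is produced by sliding one small set of shared weights (a kernel) over the image. The image size, kernel size and number of feature types below are illustrative assumptions.

import numpy as np

def feature_map(image, kernel):
    """One map of replicated detectors: every output unit looks at a local
    patch and shares the same weights (the kernel) with all other units."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]      # local connectivity
            out[r, c] = np.sum(patch * kernel)     # weight sharing
    return out

image = np.random.rand(16, 16)                       # e.g. a 16 x 16 input image
kernels = [np.random.randn(3, 3) for _ in range(4)]  # 4 different feature types
maps = [feature_map(image, k) for k in kernels]      # one feature map per type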
Convolutional neural network
Lecun 89’
Net-1
•10 sigmoid units
•Total of 2570 weights
•The network has successfully learned the training set
•Only ~76% correct on test set
Convolutional neural network
Lecun 89’
Net-2
•One hidden layer – 12 hidden units fully connected.
•Total of 3240 weights
•Only ~87% correct on test set
Convolutional neural network
Lecun 89’
Net-3
Locally Connected
•2 hidden layers:
•H1 – 2D array of size 8 by 8. Each unit connected to 9 input units
•H2 – 4 by 4 plane. Each unit is connected to 5 H1 units.
•Total of 1226 weights
•Only 88.5% correct on test set (much smaller computational cost than
Net-2)
Convolutional neural network
Lecun 89’
Net-4
Locally Connected
Weight Sharing
(only for the first
hidden layer)
•2 hidden layers:
•H1 – 2D feature map of size 8 by 8. Each unit is connected to 9 input
units and shares the same set of 9 weights with other units in the
same map.
•H2 – 4 by 4 plane. Each unit is connected to 5 H1 units.
•Total of 1132 weights (2266 connections)
•94% correct on test set
Convolutional neural network
Lecun 89’
Net-5
Locally Connected
Weight Sharing (for
both hidden layers)
•2 hidden layers:
•H1 – 2 feature maps of size 8 by 8. Each unit is connected to 9 input
units and shares the same set of 9 weights with other units in the
same map.
•H2 – 4 feature maps of size 4 by 4. Each unit is connected to 25 H1
units, and shares the same set of 25 weights with other units in the
same map.
•Total 1060 weights (5194 connections)
•98.4% correct on test set
Convolutional neural network
Lecun 89’
Nets Results:
Convolutional neural network
LeNet-5 (1998)
Pooling
• Transform the joint feature representation into a smaller one that preserves the important information while discarding irrelevant detail.
Pooling
• Reduce the number of hidden units in a hidden layer
• Achieve invariance to local translations
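A minimal max-pooling sketch (2 × 2 non-overlapping windows, an illustrative choice): each block of a feature map is replaced by its strongest response, which reduces the number of hidden units and gives invariance to small local translations.

import numpy as np

def max_pool(feature_map, size=2):
    """Keep the maximum of each size x size block of the feature map."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size              # drop ragged border rows/columns
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fm = np.random.rand(8, 8)       # an 8 x 8 feature map
pooled = max_pool(fm)           # -> 4 x 4: fewer hidden units, locally translation-invariant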
Prevent Neural Networks from Overfitting
Dropout (2014)
• Temporarily remove hidden units from the network, along with all their incoming and outgoing connections.
Prevent Neural Networks from Overfitting
Dropout (2014)
• Prevents overfitting.
• Approximates combining exponentially many different neural network architectures (see the sketch below).
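A sketch of dropout applied to one hidden layer, assuming units are kept with probability p = 0.5 during training and that test-time activations are scaled by p (one common formulation; the layer size is illustrative):

import numpy as np

def dropout(h, p_keep=0.5, training=True, rng=np.random.default_rng()):
    """Randomly drop hidden units (and hence their connections) during training."""
    if training:
        mask = rng.random(h.shape) < p_keep    # 1 = keep the unit, 0 = drop it
        return h * mask                        # dropped units output 0 on this pass
    return h * p_keep                          # test time: keep all units, scaled by p

h = np.random.rand(10)                         # activations of one hidden layer
h_train = dropout(h, training=True)            # a random thinned architecture each pass
h_test = dropout(h, training=False)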
The 82 errors made by LeNet-5
Notice that most of the
errors are cases that
people find quite easy.
The human error rate is
probably 20 to 30 errors
but nobody has had the
patience to measure it.
Data set expansion
• Lecun injected knowledge through:
– the local connectivity
– the weight-sharing
– the pooling
This achieves about 80 errors.
• Ciresan et al. (2010) injected knowledge by creating a huge amount of synthesized extra training data.
They achieve about 35 errors.
Data set expansion
(Figure: examples of synthesized training digits, including distortions not supported by LeNet.)
Outline
• Perceptron and back propagation
• Convolutional neural networks
– Pooling
– Dropout
• Multilayer generative model (unsupervised
learning)
A temporary digression (1990’s)
• Vapnik and his co-workers developed a “very clever type of
perceptron” called a Support Vector Machine.
• In the 1990’s, many researchers abandoned neural networks
with multiple adaptive hidden layers because Support Vector
Machines worked better.
What’s wrong with back-propagation?
(Geoffrey Hinton 2005)
• It requires labeled training data.
– Almost all data is unlabeled.
• The learning time does not scale well
– It is very slow in networks with multiple hidden layers.
Multilayer Generative model:
Unsupervised Learning
• We need to keep the efficiency of using a gradient method for
adjusting the weights, but use it for modeling the structure of
the sensory input.
– Adjust the weights to maximize the probability that a
generative model would have generated the sensory input.
– Learn p(image) not p(label | image)
The building blocks: Binary stochastic
neurons
• The output y_j is the probability of producing a spike:
total input to neuron j = external input + Σ_i y_i w_ij
y_j = 1 / (1 + e^(−total input)),
where y_i is the output of neuron i and w_ij is the synaptic weight from i to j.
(Figure: the logistic curve — y_j rises from 0 through 0.5 toward 1 as the total input increases.)
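A small sketch of sampling such neurons, assuming the logistic curve above; the example inputs are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sample_binary(total_input):
    """total_input = external input + sum_i y_i * w_ij, one entry per neuron."""
    p_spike = 1.0 / (1.0 + np.exp(-total_input))                 # probability of a spike
    return (rng.random(p_spike.shape) < p_spike).astype(float)   # 1 = spike, 0 = silent

z = np.array([-2.0, 0.0, 2.0])
print(sample_binary(z))           # e.g. [0., 1., 1.]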
Deep Belief Nets
• The learning: modify the
weights to make the
network more likely to
generate the training data.
(Figure: a layer of hidden neurons above a layer of visible neurons.)
Deep Belief Nets
Learning:
(Figure: visible units i and hidden units j, shown once with the data and once with the reconstruction.)
Δw_ij = ε ( ⟨v_i h_j⟩_data − ⟨v'_i h'_j⟩_reconstruction ),  with learning rate ε.
Deep Belief Nets
Learning:
(Figure, data phase: binary feature neurons above the image data (“reality”) — increment weights between an active pixel and an active feature. Reconstruction phase: the same feature neurons above the reconstructed image (what the model “thinks”) — decrement weights between an active pixel and an active feature.)
How to learn a set of features that are good
for reconstructing images of the digit 2
(Figure: 50 binary feature neurons above a 16 × 16 pixel image. On the data (“reality”), increment weights between an active pixel and an active feature; on the reconstruction (what the model “thinks”), decrement them.)
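A sketch of this procedure for the 16 × 16 image / 50-feature-neuron example, written as a single restricted-Boltzmann-machine style update Δw_ij = ε(⟨v_i h_j⟩ − ⟨v'_i h'_j⟩). The learning rate, the fake input image and the omission of biases are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_step(v, W, eps=0.1):
    """One weight update from one binary image v; returns the updated weights."""
    # Reality phase: activate the binary feature neurons from the data.
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
    # Reconstruction phase: what the model "thinks" the image looks like.
    v_rec = sigmoid(W @ h)
    h_rec = sigmoid(v_rec @ W)
    # Increment weights between active pixels and active features on the data,
    # decrement them on the reconstruction.
    return W + eps * (np.outer(v, h) - np.outer(v_rec, h_rec))

n_visible, n_hidden = 16 * 16, 50                    # 16 x 16 pixels, 50 feature neurons
W = 0.01 * rng.normal(size=(n_visible, n_hidden))    # small random weights to break symmetry
v = (rng.random(n_visible) < 0.2).astype(float)      # a fake binary "image" for illustration
W = learn_step(v, W)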
The weights of the 50 feature detectors
We start with small random weights to break symmetry
The final 50 x 256 weights
Each neuron grabs a different feature.
(Figure: the feature weights during training, alongside example data and reconstructions.)
How well can we reconstruct the digit
images from the binary feature activations?
(Figure, top: data and reconstructions from the activated binary features, for new test images from the digit class the model was trained on. Bottom: data and reconstructions for images from an unfamiliar digit class — the network tries to see every image as a 2.)