
Understanding Convolutional Neural
Networks for Object Recognition
Domen Tabernik
University of Ljubljana
Faculty of Computer and Information Science
Visual Cognitive Systems Laboratory
Visual object recognition
• How to capture a representation of objects in a computational/mathematical model?
  [Figure: raw pixels (x, y, RGB/gray) → ? → objects: a category, position or segmentation]
  – Impossible to explicitly model each one: too many objects, too many variations
  – Let the machine learn the model from samples
• Inspiration from the biological (human/primate) visual system
• Key element: a hierarchy
  – Biological plausibility
  – Part sharing between objects/categories → efficient representation
  – Object/part as a composition of other parts → compositional interpretation
Kruger et al., Deep Hierarchies in the Primate Visual Cortex: What Can We Learn For Computer Vision?, PAMI 2012
Deep learning – a sigmoid neuron
• Basic element: a sigmoid neuron (an improved perceptron from the 1960s)
  [Figure: image/pixels fed into the neuron, which outputs the probability of a car, y = 0.78]
• Mathematical form (written out below):
  – Weighted linear combination of inputs + bias
  – Sigmoid function
• Why sigmoid?
  – Equals a smooth threshold function
  – Smoothness → nice mathematical properties (derivatives)
  – Threshold → adds non-linearity when stacked
    • Captures more complex representations
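The equations on this slide appear only as images in the transcript; the standard form the two bullets describe is, in the usual notation (symbols z and σ are assumed here, not taken from the slide):

    z = \sum_i w_i x_i + b, \qquad y = \sigma(z) = \frac{1}{1 + e^{-z}}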
Deep learning – a sigmoid neuron
• Basic element: a sigmoid neuron (an improved perceptron from the 1960s)
  [Figure: image/pixels fed into the neuron, which outputs the probability of a car, y = 0.78]
• Learning process:
  – Known values: x, y → the learning input values
  – Unknown values: w, b → the learned output parameters
• Which w, b will for ALL learning images xn produce the correct output yn?
  – Basically, the cost is the average difference between the neuron's outputs and the actual correct outputs (a concrete cost function is written out below)
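The cost function itself is shown only as an image on the slide; a quadratic cost matching the description above would be (an assumed form with N training samples, not necessarily the exact formula from the talk):

    C(w, b) = \frac{1}{2N} \sum_{n=1}^{N} \bigl( y_n - \sigma(w \cdot x_n + b) \bigr)^2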
Deep learning – optimization
• The best solution is where the cost is lowest, therefore our goal is a minimal C(w,b)
• Basic optimization problem:
  – When is a function at a minimal point? (a high-school math problem)
  – When its derivative is ZERO
• How to find a ZERO derivative?
  – Analytically?
    • Need N > num(w)
    • Not possible when stacked
  – Naive iterative approach:
    • Start at a random position
    • Find a small combination of ∆w that minimizes C
    • If num(w) is big, there are too many combinations to check
  – Use gradient descent instead!
Deep learning – gradient descent
• Iterative process (a code sketch follows below):
  – Start at a random position
  – Compute the activations yn for all samples
  – Find the partial derivative/gradient for each parameter w (and b)
  – Move each wi (and b) in its gradient direction (actually in the negative gradient direction)
  – Repeat until the cost is low enough
• Stochastic gradient / mini-batch:
  – Take a smaller subset of samples at each step
  – Still has enough gradient information
• Not perfect – has multiple solutions!
  – Local minima
  – Plateaus
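As a concrete illustration of the iterative process above, here is a minimal mini-batch gradient descent loop for a single sigmoid neuron with a quadratic cost (a Python sketch; the function and hyper-parameter values are illustrative, not code from the talk):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, y, lr=0.5, epochs=100, batch_size=32, seed=0):
        """Mini-batch gradient descent for one sigmoid neuron, quadratic cost."""
        rng = np.random.default_rng(seed)
        n_samples, n_features = X.shape
        w = rng.normal(scale=0.01, size=n_features)   # start at a (near-)random position
        b = 0.0
        for _ in range(epochs):
            order = rng.permutation(n_samples)        # shuffle, then walk over mini-batches
            for start in range(0, n_samples, batch_size):
                idx = order[start:start + batch_size]
                xb, yb = X[idx], y[idx]
                a = sigmoid(xb @ w + b)               # activations for this mini-batch
                delta = (a - yb) * a * (1.0 - a)      # d(cost)/dz for cost = 0.5*mean((a-y)^2)
                w -= lr * (xb.T @ delta) / len(idx)   # move in the negative gradient direction
                b -= lr * delta.mean()
        return w, b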
Deep learning – gradient descent
• Heuristics to avoid local minima and plateaus (two of them are sketched in code below):
• Choose the step size carefully
  – Too small: slow convergence and unable to escape local minima
  – Too big: will not converge!
• Momentum:
  – Considers gradients from previous steps to increase or decrease the step size
  – Helps escape local minima without manually increasing the step-size parameter
• Weight decay (regularization)
  – Main goal: we want only a small number of weights to be active/big
  – Primarily used to fight overfitting but helps to escape local minima as well
• Second-order derivatives
  – The gradient for wi considers the other gradients wj (where i != j)
• Approximations to second-order derivatives
  – Nesterov's algorithm
  – AdaGrad
  – AdaDelta
  – …
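A sketch of how momentum and weight decay modify the basic update (illustrative Python, not code from the talk; the toy cost at the end exists only to make the snippet runnable):

    import numpy as np

    def sgd_momentum(grad_fn, w0, lr=0.1, momentum=0.9, weight_decay=1e-4, steps=200):
        """Gradient descent with momentum and L2 weight decay."""
        w = np.array(w0, dtype=float)
        velocity = np.zeros_like(w)
        for _ in range(steps):
            grad = grad_fn(w) + weight_decay * w   # weight decay pushes weights toward zero
            velocity = momentum * velocity - lr * grad
            w = w + velocity                       # accumulated past gradients keep the step going on plateaus
        return w

    # Toy usage: minimize C(w) = 0.5 * ||w - (1, 2)||^2, whose gradient is w - (1, 2).
    w_opt = sgd_momentum(lambda w: w - np.array([1.0, 2.0]), w0=[0.0, 0.0])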
Deep learning – back-propagation
• Single neuron: apply the chain rule for the derivative of f(g(x)) (written out below)
• Stacked (deep) neurons:
  – Keep repeating the chaining process from top to bottom
  – Take into account all paths where wi appears
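For a single sigmoid neuron with output a = σ(z), z = w·x + b and cost C, the chain rule referred to above gives (standard notation, assumed here):

    \frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial a}\,\sigma'(z)\,x_i, \qquad
    \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,\sigma'(z)

For stacked neurons the same factors are multiplied along every path from the cost down to wi, which is exactly the repeated chaining described above.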
Deep learning – convolutional net
• The previous slides were all general (not computer-vision specific)
• Applying only a fully connected deep neural network to an image is not feasible
• Image size 128x128 → 16k pixels
  – Input neurons = 16k
  – First-layer neurons = 4k (let's say we want to reduce the dimensions at each layer by 2x)
  – Number of weights = 16k × 4k (64 million for the first layer only!!; see the quick check below)
• We can exploit the spatial locality of images
  – Features are local; only a small neighborhood of pixels is needed
  – Features repeat throughout the image
• Local connections and weight sharing:
  – Divide neurons into sub-features; each RGB channel is a separate feature
  – One neuron looks only at a small local neighborhood of pixels (3x3, 5x5, 7x7, …)
  – Neurons of the same feature but at different positions share weights
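A quick back-of-the-envelope check of the counts above (Python; the number of shared filters and the kernel size in the convolutional case are assumed values for illustration):

    in_pixels = 128 * 128                 # 16k input neurons (one value per pixel)
    fc_out = 64 * 64                      # 4k first-layer neurons (spatial dimensions halved)
    fc_weights = in_pixels * fc_out       # ~67 million exact, ~64 million when rounded to 16k x 4k

    conv_features = 16                    # assumed number of shared filters
    kernel = 5                            # assumed 5x5 local neighborhood
    conv_weights = conv_features * kernel * kernel   # 400 weights, shared across all positions

    print(f"fully connected: {fc_weights:,}   convolutional: {conv_weights:,}")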
Deep learning – ReLU
• How does the sigmoid function affect learning?
• It enables easier computation of the derivative but has negative effects:
  – The neuron never reaches 1 or 0 → saturating
  – The gradient reduces the magnitude of the error
• This leads to two problems:
  – Slow learning when neurons are saturated, i.e. at big z values
  – The vanishing gradient problem (the gradient is always at most 25% of the error from the previous layer, as derived below!!)
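The 25% figure follows from the derivative of the sigmoid (a standard derivation, not reproduced explicitly in the transcript):

    \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}

so each sigmoid layer scales the back-propagated error by at most one quarter, which is why the gradient vanishes in deep stacks.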
Deep learning – ReLU
• Alex Krizhevsky (2011) proposed the Rectified Linear Unit instead of the sigmoid function
• Main purpose of ReLU: reduces the saturation and vanishing-gradient issues
• Still not perfect (both variants are sketched in code below):
  – Stops learning at negative z values (can use a piecewise linear variant – Parametric ReLU, He 2015 from Microsoft)
  – Bigger risk of saturating neurons towards infinity
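For reference, the two activations mentioned above can be written in a couple of lines (Python sketch; the slope value in the parametric/leaky variant is illustrative):

    import numpy as np

    def relu(z):
        # rectified linear unit: identity for positive z, zero otherwise
        return np.maximum(0.0, z)

    def parametric_relu(z, a=0.01):
        # piecewise linear variant: a small (possibly learned) slope a for negative z,
        # so learning does not stop completely when z < 0
        return np.where(z > 0.0, z, a * z)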
Deep learning – dropout
• Too many weights cause overfitting issues
• Weight decay (regularization) helps but is not perfect
  – It also adds another hyper-parameter to set up manually
• Srivastava et al. (2014) proposed a kind of "bagging" for deep nets (actually Alex Krizhevsky already used it in AlexNet in 2011)
• Main point (a minimal code sketch follows below):
  – Robustify the network by disabling neurons
  – Each neuron has a probability, usually of 0.4, of being disabled
  – The remaining neurons must adapt to work without them
• Applied only to fully connected layers
  – Conv. layers are less susceptible to overfitting
Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
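A minimal sketch of a dropout forward pass (Python). It uses the "inverted dropout" convention that rescales activations at training time; the original Srivastava et al. formulation instead rescales weights at test time, so treat this as an illustration rather than the exact scheme from the paper:

    import numpy as np

    def dropout_forward(activations, p_drop=0.4, training=True, rng=np.random.default_rng()):
        if not training:
            return activations                          # at test time every neuron is kept
        mask = rng.random(activations.shape) >= p_drop  # each unit dropped with probability p_drop
        return activations * mask / (1.0 - p_drop)      # rescale so the expected activation is unchanged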
Deep learning – batch norm
• The input needs to be whitened, i.e. normalized (LeCun 1998, Efficient BackProp)
  – Usually done on the first-layer input only
• The same reason for normalizing the first layer exists for the other layers as well
• Ioffe and Szegedy, Batch Normalization, 2015:
  – Normalize the input to each layer (the transform is given below)
  – Reduce internal covariate shift
  – Too slow to normalize over all the input data (>1M samples)
  – Instead normalize within the mini-batch only
    • Learning: normalize over the mini-batch data
    • Inference: normalize over all the training input data
• Better results while allowing a higher learning rate, higher decay, no dropout and no LRN
Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015
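The batch-normalization transform from Ioffe and Szegedy, for a mini-batch with mean μ_B and variance σ_B², is:

    \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta

where γ and β are learned per-feature parameters; at inference time μ_B and σ_B² are replaced by estimates accumulated over the training data.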
Deep learning – residual learning
• Current state of the art on ImageNet classification:
  – A CNN with ~150 layers (by Microsoft China)
• Key features:
  – Requires a reduction of internal covariate shift (Batch Normalization)
  – Only ~2M parameters (using many small kernels, 1x1, 3x3)
    • A CNN with 1500 layers had ~20M parameters and had overfitting issues
  – Adds an identity bypass (the residual formulation is given below)
• Why bypass?
  – If a layer is not needed it can simply be ignored; it will just forward its input as output
  – By default the weights are really small and F(x,{Wi}) is negligible compared to x
He et al., Deep Residual Learning for Image Recognition, CVPR2016
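The identity bypass mentioned above is the residual building block from He et al.:

    y = F(x, \{W_i\}) + x

so when the residual branch F is not needed, its weights can stay near zero and the block simply passes x through.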
Deep learning – visualizing features
• Difficult to understand the internals of CNNs
• Many visualization attempts, most quite complex:
  – Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV2013
  – Simonyan et al., Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR2014
  – Mahendran et al., Understanding Deep Image Representations by Inverting Them, CVPR2015
  – Yosinski et al., Understanding Neural Networks Through Deep Visualization, ICML2015
• Strange properties of CNNs: adversarial examples
  – Adding invisible perturbations to the pixels → completely incorrect classifications
  [Figure: images predicted as "a car" become "unknown" after an imperceptible perturbation; the image difference is shown alongside]
Szegedy et al., Intriguing properties of neural networks, ICLR2014
Deep learning – visualizing features
• Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV2013
  [Three figure slides: visualizations from the cited paper]
Deep learning – visualizing features
• Simonyan et al., Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, ICLR2014
  [Figure slide: visualizations from the cited paper]
Deep learning – visualizing features
• Mahendran et al., Understanding Deep Image Representations by Inverting Them, CVPR2015
  [Figure slide: visualizations from the cited paper]
Deep learning – visualizing features
• Yosinski et al., Understanding Neural Networks Through Deep Visualization, ICML2015
  [Figure slide: visualizations from the cited paper]
PART II
Convolutional neural networks
• Trained filters for a part on the second layer:
  – Which parts on the first layer are important?
  – Can we deduce anything about the object/part modeled this way?
  – Compositional interpretation?
  → CNNs are hierarchical but not compositional
Our approach
• A CNN might learn compositions, but the compositions are not explicit
  – Cannot utilize the advantages of compositions
• Capture compositions as structure in the filters with a weight parametrization
• Use a Gaussian distribution as the model (a plausible form is sketched below)
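The parametrization itself appears only as an equation image on the slide; a plausible form consistent with the description (a filter built from weighted Gaussian components, with weights, means and variances listed as the learned parameters on a later slide) would be, with all symbols assumed here rather than taken from the talk:

    f(x, y) = \sum_{k} w_k \exp\!\left(-\frac{(x - \mu_k^x)^2 + (y - \mu_k^y)^2}{2\sigma_k^2}\right)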
Compositional neural network
[Figure: combining compositional deep nets with convolutional neural nets]
• Possible benefits:
  – The model can be interpreted as a hierarchical composition!
  – Reduced number of parameters (faster learning? fewer training samples?)
  – Combines generative learning (co-occurrence statistics from compositional hierarchies) with discriminative optimization (gradient descent from CNNs)
  – Visualizations based on compositions, without additional data or complex optimization
Compositional neural network
• Back-propagation remains the same (the extra chain-rule factors are given below)
• Minimize the loss function C w.r.t.:
  – Weights
  – Means
  – Variances
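Under the Gaussian parametrization sketched earlier, back-propagation stays the same because each parameter only picks up one extra chain-rule factor. For a single component G_k(x, y) = exp(−((x − μ_k^x)² + (y − μ_k^y)²) / (2σ_k²)), these factors would be (again assuming that parametrization, not the exact formulas from the slide):

    \frac{\partial f}{\partial w_k} = G_k, \qquad
    \frac{\partial f}{\partial \mu_k^x} = w_k\,\frac{x - \mu_k^x}{\sigma_k^2}\,G_k, \qquad
    \frac{\partial f}{\partial \sigma_k} = w_k\,\frac{(x - \mu_k^x)^2 + (y - \mu_k^y)^2}{\sigma_k^3}\,G_k

and they are combined with ∂C/∂f(x, y) exactly as in standard back-propagation.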
Compositional neural network
[Figure slide]
First layer – activations
[Figure: first-layer activations for a normalized input image]
Second layer – weights
[Figure: second-layer weights of the Gaussian CNN vs. the standard CNN – 16 different features at the second layer, 16 different channel filters per feature]
Second layer – activations
[Figure: second-layer activations for a normalized input image]