Deep Learning Overview

Jaya Thomas
Computer Science Department
SUNY Korea
Sources: https://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10workshop-tutorial-final.pdf
http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/
Deep Learning = Learning Representations/Features
Deep Learning = Learning Hierarchical Representations
Trainable Feature Hierarchy
Three Types of Training Protocols
Deep Learning: Why Training is hard

Depending on the situation, one hypothesis or the other tends to prevail:

If the first hypothesis (underfitting) holds: use better optimization


Active area of research
If the second hypothesis (overfitting) holds: use better regularization



Unsupervised learning
Stochastic "dropout" training
Solution: Initialize hidden layers using unsupervised learning

Force the network to represent the latent structure of the input distribution

Encourage hidden layers to encode that structure
Unsupervised Pre-training

We will use a greedy, layer-wise procedure

Train one layer at a time, from first to last, with an unsupervised criterion

Fix the parameters of previous hidden layers

Previous layers are viewed as feature extractors


Procedure:
- First layer: find hidden-unit features that are more common in training inputs than in random inputs
- Second layer: find combinations of hidden-unit features that are more common than random combinations of hidden-unit features
- Third layer: find combinations of …
- Etc.
Pre-training initializes the parameters in a region where the nearby local optima overfit the data less.
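As a rough illustration of this greedy, layer-wise procedure, here is a minimal sketch in Python/numpy. It uses a small single-hidden-layer autoencoder as the unsupervised criterion; the function names, hyperparameters, and toy data are illustrative assumptions, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=50, seed=0):
    """Train a one-hidden-layer autoencoder on X; return the encoder weights."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
    V = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
    for _ in range(epochs):
        H = sigmoid(X @ W)                     # hidden code
        R = sigmoid(H @ V)                     # reconstruction of the input
        err = R - X                            # reconstruction error
        dV = H.T @ (err * R * (1 - R))
        dH = (err * R * (1 - R)) @ V.T
        dW = X.T @ (dH * H * (1 - H))
        W -= lr * dW / len(X)
        V -= lr * dV / len(X)
    return W

def greedy_pretrain(X, layer_sizes):
    """Pre-train a stack of layers, one at a time, freezing earlier layers."""
    weights, inputs = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(inputs, n_hidden)  # unsupervised criterion for this layer
        weights.append(W)
        inputs = sigmoid(inputs @ W)             # previous layers act as fixed feature extractors
    return weights

# toy usage: pre-train a 3-layer stack on random "data"
X = np.random.default_rng(1).random((100, 20))
pretrained = greedy_pretrain(X, layer_sizes=[16, 12, 8])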
Fine tuning

Once all the layers are pre-trained:
- Add an output layer
- Train the whole network using supervised learning (backpropagation)

Supervised learning is performed as in a regular feed-forward network:
- Forward propagation, backpropagation, and update

We call this last phase fine-tuning:
- All parameters are "tuned" for the supervised task at hand
- The representation is adjusted to be more discriminative
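A matching sketch of the fine-tuning phase in Python/numpy: an output layer is added on top of the pre-trained stack, and every layer, including the pre-trained ones, is updated by backpropagation. The softmax output layer, the cross-entropy loss, and the one-hot labels y_onehot are assumptions, not from the slides; `weights` can be the list returned by the pre-training sketch above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fine_tune(X, y_onehot, weights, n_classes, lr=0.1, epochs=100, seed=0):
    """Supervised fine-tuning: add an output layer, then backprop through ALL layers."""
    rng = np.random.default_rng(seed)
    W_out = rng.normal(0, 0.1, (weights[-1].shape[1], n_classes))
    for _ in range(epochs):
        # forward propagation through the pre-trained stack
        activations = [X]
        for W in weights:
            activations.append(sigmoid(activations[-1] @ W))
        probs = softmax(activations[-1] @ W_out)
        # backward propagation (softmax + cross-entropy gradient at the output)
        delta = (probs - y_onehot) / len(X)
        dW_out = activations[-1].T @ delta
        delta = (delta @ W_out.T) * activations[-1] * (1 - activations[-1])
        W_out -= lr * dW_out
        for i in reversed(range(len(weights))):
            dW = activations[i].T @ delta
            if i > 0:
                delta = (delta @ weights[i].T) * activations[i] * (1 - activations[i])
            weights[i] -= lr * dW          # all parameters are "tuned" for the task
    return weights, W_out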
Deep Learning
What kind of unsupervised learning algorithm?

Stacked restricted Boltzmann machines

Stacked autoencoders

Stacked denoising autoencoders

Stacked semi-supervised embeddings

Stacked kernel PCA

Stacked independent subspace analysis
Advantage

The architecture of a CNN is designed to take advantage of the 2D structure of an input image.

This is achieved with local connections and tied weights, followed by some form of pooling, which results in translation-invariant features. CNNs are also easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.
Architecture



- A CNN consists of a number of convolutional and subsampling layers, optionally followed by fully connected layers.
- The input to a convolutional layer is an m x m x r image, where m x m is the height and width of the image and r is the number of channels; e.g., an RGB image has r = 3.
- A convolutional layer will have k filters (or kernels) of size n x n x q, where
  - n is smaller than the dimension of the image, and
  - q can either be the same as the number of channels r or smaller, and may vary for each kernel.
Fig 1: First layer of a convolutional neural network with pooling. Units of the same color have tied weights, and units of different colors represent different filter maps.
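To make the dimensions above concrete, a tiny sketch in Python/numpy (the specific values m = 32, r = 3, k = 8, n = 5, q = 3 are illustrative assumptions):

import numpy as np

m, r = 32, 3          # m x m x r input image (e.g. RGB, so r = 3)
k, n, q = 8, 5, 3     # k filters (kernels), each of size n x n x q with q <= r

image = np.zeros((m, m, r))
filters = np.zeros((k, n, n, q))

# each filter produces one feature map of size (m - n + 1) x (m - n + 1)
out_size = m - n + 1
print(out_size, out_size, k)   # 28 28 8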

A convolutional neural network consists of several layers. These layers can be of three types:
- Convolutional
- Max-Pooling
- Fully-Connected
Source: http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/
Convolutional
Convolutional: Convolutional layers consist of a rectangular grid of neurons. This requires that the previous layer also be a rectangular grid of neurons. Each neuron takes inputs from a rectangular section of the previous layer; the weights for this rectangular section are the same for each neuron in the convolutional layer. Thus, the convolutional layer is just an image convolution of the previous layer, where the weights specify the convolution filter. In addition, there may be several grids in each convolutional layer; each grid takes inputs from all the grids in the previous layer, using potentially different filters.
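A minimal sketch of this "image convolution with shared (tied) weights" idea for a single input grid and a single filter, in plain numpy; the function name conv2d_valid and the toy values are illustrative assumptions.

import numpy as np

def conv2d_valid(prev_layer, filt):
    """Convolve a 2D grid of activations with one shared (tied-weight) filter."""
    N = prev_layer.shape[0]
    m = filt.shape[0]
    out = np.zeros((N - m + 1, N - m + 1))
    for i in range(N - m + 1):
        for j in range(N - m + 1):
            # every output neuron applies the SAME weights to its local window
            out[i, j] = np.sum(prev_layer[i:i + m, j:j + m] * filt)
    return out

# toy usage: 7x7 grid, 3x3 filter -> 5x5 output
grid = np.arange(49, dtype=float).reshape(7, 7)
edge_filter = np.array([[1., 0., -1.]] * 3)
print(conv2d_valid(grid, edge_filter).shape)   # (5, 5)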
Feature Extraction using CNN
Locally Connected Networks
The solution to this problem (learning fully connected features over a full-sized image is computationally very expensive) is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units.
- Each hidden unit will connect to only a small contiguous region of pixels in the input.
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).
Pooling
Using features obtained after convolution for classification:
- In theory, one could use all the extracted features with a classifier such as a softmax classifier, but this can be computationally challenging.
- Example: Consider images of size 96x96 pixels, and suppose we have learned 400 features over 8x8 inputs. Each convolution results in an output of size (96 − 8 + 1) × (96 − 8 + 1) = 89 × 89 = 7921, and since we have 400 features, this results in a vector of 89² × 400 = 3,168,400 features per example! Learning a classifier with inputs having 3+ million features can be unwieldy and can also be prone to overfitting.

Max-Pooling: After each convolutional layer, there may be a pooling layer. The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each block to produce a single output from that block.

There are several ways to do this pooling, such as taking the average or the maximum, or a learned linear combination of the neurons in the block. Our pooling layers will always be max-pooling layers; that is, they take the maximum of the block they are pooling.
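A minimal sketch of non-overlapping max-pooling over p x p blocks in numpy (the block size p = 2 and the function name are illustrative assumptions). Swapping .max for .mean would give average pooling instead.

import numpy as np

def max_pool(feature_map, p=2):
    """Subsample a feature map by taking the maximum of each p x p block."""
    H, W = feature_map.shape
    H2, W2 = H // p, W // p
    blocks = feature_map[:H2 * p, :W2 * p].reshape(H2, p, W2, p)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))     # [[ 5.  7.] [13. 15.]]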

Fully-Connected: Finally, after several convolutional and max-pooling layers, the high-level reasoning in the neural network is done via fully connected layers.

A fully connected layer takes all neurons in the previous layer (be it fully connected, pooling, or convolutional) and connects them to every single neuron it has. Fully connected layers are not spatially located anymore (you can visualize them as one-dimensional), so there can be no convolutional layers after a fully connected layer.
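A small sketch of the hand-off from spatial feature maps to a fully connected layer: the maps are flattened into one vector (discarding the spatial layout) and every entry is connected to every output neuron. All shapes and values here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
pooled = rng.random((8, 14, 14))           # 8 pooled feature maps of size 14 x 14

flat = pooled.reshape(-1)                  # spatial structure discarded: 8*14*14 = 1568 values
W = rng.normal(0, 0.01, (flat.size, 10))   # every input connects to every output neuron
b = np.zeros(10)
logits = flat @ W + b                      # fully connected layer output (e.g. 10 class scores)
print(logits.shape)                        # (10,)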
Forward Propagation
1. Compute the activations of layers whose inputs are known.
2. Compute the inputs of the next layer from these activations.
3. Repeat steps 1 and 2 until you reach the output layer and know the values of y_L.
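A minimal sketch of these three steps for a plain feed-forward stack of fully connected layers (Python/numpy; the sigmoid nonlinearity and the toy layer sizes are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Alternate activations -> next-layer inputs until the output y_L is known."""
    y = x
    for W, b in zip(weights, biases):
        x_next = y @ W + b        # step 2: inputs to the next layer from current activations
        y = sigmoid(x_next)       # step 1: activations from known inputs
    return y                      # step 3: stop at the output layer, y_L

rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (4, 6)), rng.normal(0, 0.1, (6, 3))]
biases = [np.zeros(6), np.zeros(3)]
print(forward(rng.random(4), weights, biases).shape)   # (3,)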
Forward Propagation in a Convolutional Neural Network

Suppose we have some N×N square layer of neurons which is followed by our convolutional layer. If we use an m×m filter ω, our convolutional layer output will be of size (N−m+1)×(N−m+1). In order to compute the pre-nonlinearity input to some unit x^ℓ_ij in our layer, we need to sum up the contributions (weighted by the filter components) from the previous layer cells:

x^ℓ_ij = Σ_{a=0..m−1} Σ_{b=0..m−1} ω_ab · y^(ℓ−1)_(i+a)(j+b)

Then, the convolutional layer applies its nonlinearity σ:

y^ℓ_ij = σ(x^ℓ_ij)
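A direct, unoptimized transcription of these two equations into numpy (illustrative; the slides do not fix the nonlinearity, so the sigmoid is assumed here for σ):

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_forward(y_prev, omega):
    """y_prev: N x N activations of the previous layer; omega: m x m filter."""
    N, m = y_prev.shape[0], omega.shape[0]
    x = np.zeros((N - m + 1, N - m + 1))      # pre-nonlinearity inputs x^l_ij
    for i in range(N - m + 1):
        for j in range(N - m + 1):
            for a in range(m):
                for b in range(m):
                    x[i, j] += omega[a, b] * y_prev[i + a, j + b]
    return x, sigma(x)                        # y^l_ij = sigma(x^l_ij)

y_prev = np.random.default_rng(0).random((6, 6))
x, y = conv_forward(y_prev, np.ones((3, 3)) / 9.0)
print(x.shape, y.shape)                       # (4, 4) (4, 4)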
Back Propagation
Back Propagation in Convolutional Network
Back Propagation

The upsample operation has to propagate the error through the pooling layer by calculating the error with respect to each unit coming into the pooling layer.

- Example: with mean pooling, upsample simply distributes the error for a single pooling unit uniformly among the units which feed into it in the previous layer.
- With max pooling, the unit which was chosen as the max receives all the error, since very small changes in the input would perturb the result only through that unit.
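A minimal sketch of this upsample step for both cases, assuming non-overlapping p x p pooling regions (Python/numpy; the function names and toy values are illustrative):

import numpy as np

def upsample_mean(delta_pooled, p=2):
    """Mean pooling: spread each pooled unit's error uniformly over its p x p block."""
    return np.kron(delta_pooled, np.ones((p, p))) / (p * p)

def upsample_max(delta_pooled, pre_pool, p=2):
    """Max pooling: route each pooled unit's error only to the unit that was the max."""
    H, W = pre_pool.shape
    out = np.zeros_like(pre_pool)
    for i in range(H // p):
        for j in range(W // p):
            block = pre_pool[i*p:(i+1)*p, j*p:(j+1)*p]
            a, b = np.unravel_index(np.argmax(block), block.shape)
            out[i*p + a, j*p + b] = delta_pooled[i, j]
    return out

delta = np.array([[1.0, 2.0], [3.0, 4.0]])   # errors at the pooling-layer units
pre = np.arange(16, dtype=float).reshape(4, 4)  # activations before pooling
print(upsample_mean(delta))
print(upsample_max(delta, pre))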
Gradient Descent
Thank You