Course overview


Deep Learning and Its
Application to Signal and
Image Processing and
Analysis
FALL 2016
TAMMY RIKLIN RAVIV, ELECTRICAL AND COMPUTER ENGINEERING
Information Processing & Neural Networks
• The Hubel and Wiesel Experiment 1959: They inserted a microelectrode into
the primary visual cortex of an anesthetized cat, and projected patterns of light
and dark on a screen in front of the cat.
• They found that some neurons fired rapidly when presented with lines
at one angle, while others responded best to another angle. Some of these
neurons responded to light patterns and dark patterns differently.
Hubel and Wiesel called these neurons simple cells.
• Still other neurons, which they termed complex cells, detected edges
regardless of where they were placed in the receptive field of the neuron
and could preferentially detect motion in certain directions.
• These studies showed how the visual system constructs complex
representations of visual information from simple stimulus features.
T. Wiesel (left) and D. Hubel (right)
co-recipients of the 1981 Nobel Prize in
Physiology or Medicine for their discoveries concerning
information processing in the visual system
Hubel and Wiesel Experiments
Some YouTube links:
https://www.youtube.com/watch?v=IOHayh06LJ4
https://www.youtube.com/watch?v=8VdFf3egwfg
https://www.youtube.com/watch?v=y_l4kQ5wjiw
https://www.youtube.com/watch?v=UU2esxycMAw
The biological neuron
• Our brains are made up of about 100 billion tiny
units called neurons.
• Each neuron is connected to thousands of other neurons
and communicates with them via electrochemical signals.
• Signals coming into the neuron are received via junctions called synapses,
which are located at the ends of branches of the nerve cell called dendrites.
• The axon carries the neuron’s output.
The biological neuron model
• A biological neuron model (spiking neuron model)
is a mathematical model of the electrical properties
of neuronal action potentials (APs).
• APs are sharp changes in the electrical potential
across the cell membrane that last for about one millisecond.
• Spiking neurons are known to be a major signaling
unit of the nervous system.
A neuronal action potential ("spike").
• Integrate-and-fire (earliest model); Hodgkin–Huxley model (Nobel Prize, 1963); and many more.
An Artificial Neuron
• Neural networks are made up of many artificial neurons.
• Artificial neurons - simplified models of biological neurons.
• Each input into the neuron is associated with a weight.
• A weight is simply a floating-point number, positive (excitatory) or
negative (inhibitory), that is adjusted during training.
• The weighted sum of the inputs gives us the activation.
• The neuron’s output is determined by an activation function.
Common activation functions:
• Simple threshold activation function
• Piecewise linear activation function
• Sigmoid activation function
• Rectified linear unit (ReLU) activation function
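To make the pieces above concrete, here is a minimal numpy sketch of an artificial neuron with the four activation functions just listed; the input values, weights, and bias are made-up numbers for illustration, and the clipping range of the piecewise-linear function is one common choice.

```python
import numpy as np

def threshold(a):
    """Simple threshold activation: output 1 if the activation exceeds 0."""
    return (a > 0).astype(float)

def piecewise_linear(a):
    """Piecewise linear activation: linear in [-1, 1], clipped outside."""
    return np.clip(a, -1.0, 1.0)

def sigmoid(a):
    """Sigmoid activation: smoothly squashes the activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    """Rectified linear unit: zero for negative activations, identity otherwise."""
    return np.maximum(0.0, a)

def neuron_output(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -0.2, 0.1])  # example inputs (hypothetical values)
w = np.array([0.4, 0.7, -1.2])  # weights: positive = excitatory, negative = inhibitory
print(neuron_output(x, w, b=0.1))
```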
Artificial neural network
Feed forward neural network:
Artificial neural network
Later we will also discuss networks with a different architecture, e.g.
recurrent neural networks:
Artificial neural network
Feed forward neural network:
• Each input is sent to every neuron in the hidden layer, and each
hidden neuron's output is connected to every neuron in the next layer.
• There can be any number of hidden layers within a
feedforward network, each with any number of neurons.
Artificial neural network
Simple (and classical) example
Character recognition:
Input: a binary vector of length N x N
(requires N x N input neurons).
Output: a one-hot binary vector, e.g.
(0,0,0,1, …,0),
where the position of the single 1 indicates the digit; here the answer is ‘4’.
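A quick sketch of this output encoding (assuming the digit classes are 0-indexed, so digit 4 maps to position 4):

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Encode a digit label as a binary vector with a single 1."""
    v = np.zeros(num_classes)
    v[digit] = 1.0
    return v

print(one_hot(4))  # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] -> the answer is '4'
```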
Training and Test
Supervised neural networks:
The weights (and the biases) are adjusted by
training on a training set.
A common way to train is back-propagation.
Once the weights are adjusted, run test examples.
We will talk about both supervised and unsupervised networks.
Back-propagation
• Most common
• Supervised
• It requires a teacher that knows,
or can calculate, the desired
output for any input in the training set.
Back-propagation algorithm
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output for that sample.
3. Calculate the error in each output neuron.
4. For each neuron, calculate what the output should have been, and a scaling
factor: how much lower or higher the output must be adjusted to match the
desired output. This is the local error.
5. Adjust the weights of each neuron to lower the local error.
6. Assign an ‘error’ to the neurons at the previous level.
7. Repeat from step 3 for the neurons at the previous level.
Training an NN as an XOR function
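The slide's figure is not reproduced in the transcript; as a stand-in, below is a minimal numpy sketch that trains a tiny network on XOR by following the back-propagation steps listed above. The architecture (four hidden sigmoid neurons), learning rate, epoch count, and random seed are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set: four input pairs and their desired outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# 2 inputs -> 4 hidden sigmoid neurons -> 1 output, small random weights
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5  # learning rate (hypothetical choice)

for epoch in range(10000):
    # Steps 1-2: present the samples, compute the network's output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Steps 3-4: local error at the output, scaled by the sigmoid derivative
    delta_out = (out - y) * out * (1 - out)
    # Step 6: assign error to the neurons at the previous (hidden) level
    delta_h = (delta_out @ W2.T) * h * (1 - h)
    # Step 5: adjust the weights to lower the local error (gradient descent)
    W2 -= lr * h.T @ delta_out; b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h;   b1 -= lr * delta_h.sum(axis=0)

print(np.round(out.ravel(), 2))  # should approach [0, 1, 1, 0]
```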
Overfitting vs. generalization
• Overfitting occurs when a model is excessively
complex, such as having too many parameters
relative to the number of observations.
• A model that has been overfit has poor predictive
performance, as it overreacts to minor fluctuations
in the training data.
• In other words, the model begins to ‘memorize’ the
training data rather than ‘learning’ to generalize
from a trend.
Rest of today’s plan
Course overview and motivation
Syllabus
Course objectives
Course structure
Course resources
About the instructor
Expectations & grading
Image classification
Deep Learning and Its Applications to
Signal and Image Processing and Analysis
Lecturer: Dr. Tammy Riklin Raviv
No.: 361-21120
Time: Monday 10:00-13:00
Location: Building 72 (Markus) Room 122
Graduate course
Course Web Site:
http://www.ee.bgu.ac.il/~rrtammy/DNN/DNN.html
Course Objectives
The primary objective of this course is to provide the students with the
necessary computational tools to:
1. Understand basic principles of artificial Neural Networks (NN) and deep
learning, and Machine Learning in general
2. Be familiar with a variety of NN architectures, training strategies,
challenges, and potential applications
3. Be familiar with up-to-date literature on ANN for signal processing /
image analysis
4. Implement, train, and test DNN for particular applications
Course description
The course will focus on both theoretical and practical aspects of
Neural Networks (NN) and deep learning and their applications to
processing and analysis of signals and images.
Among other topics we will discuss neural networks as
computational models, Rosenblatt's perceptron, logistic
regression, back-propagation, activation functions, gradient flow,
hyperparameters, Restricted Boltzmann Machines (RBM),
contrastive divergence, auto-encoders, and generative modeling
of neural networks.
Course description
In addition, we will discuss specific applications such as
classification, spatial localization, and segmentation using
Convolutional Neural Networks (CNN), Recurrent Neural Networks
(RNN), and Long Short-Term Memory (LSTM) networks, as well as
NN applications to signal processing, de-noising, and image
captioning.
An overview of commonly used DNN tools such as
Caffe/Torch/Theano/TensorFlow will be given.
We will focus on TensorFlow.
Course Structure
1. Overview lectures: basic introduction to ANN, machine
learning, image processing and analysis
2. Lab instruction: introduction to TensorFlow (by Ohad Shitrit);
tentative date: Nov 14 – save the date
3. Guest lectures:
Prof. Jacob Goldberger – BIU (Nov. 21),
Dr. Oren Shriki – BGU (Nov. 28), and others
4. Student lectures: each student will present a topic/paper to
the class, followed by a discussion; a list will be distributed soon
Course Resources
Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep
Learning. MIT Press. Online
http://www.deeplearningbook.org/
TensorFlow course: https://www.udacity.com/course/deep-learning--ud730
Convolutional Neural Networks for Visual Recognition – Stanford
http://cs231n.stanford.edu/ and a lecture series
https://www.youtube.com/playlist?list=PLLvH2FwAQhnpj1WEB-jHmPuUeQ8mXXXG
Neural Networks for Machine Learning – Coursera, by Geoffrey Hinton
https://www.coursera.org/learn/neural-networks
What should I do in order to succeed in
the course?
Active class participation (bonus of up to 5%)
Be present in at least 10 of the first 13 classes
The last class (project presentations) is mandatory
Reading and exercises (mandatory)
Class presentation: 30%
Final project: 70%
The instructor
Tammy Riklin Raviv,
Research interests:
Signal processing, biomedical image analysis, computer vision,
and machine learning
Contact info:
Telephone: 08-6428812
Fax: 08-647 2949
E-mail: [email protected]
Office: 212/33
Reception hours:
please coordinate via email
Personal web page:
http://www.ee.bgu.ac.il/~rrtammy/
Image classification
Assigning an input image one label from a fixed set of categories
Motivation: an important computer vision problem
with a large variety of practical applications.
Many other seemingly distinct computer vision tasks (such as object
detection and segmentation) can be reduced to image classification.
Example 1: Digit classification {0,1,2,3…8,9}
Example 2: Take a single image and assign probabilities to 4 labels, {cat, dog, hat,
mug}.
Hereafter, based on: Convolutional Neural Networks for Visual Recognition – Stanford course
Image classification – an example
The task in Image Classification is to predict
a single label (or a distribution over labels as
shown here to indicate our confidence) for
a given image. Images are 3-dimensional
arrays of integers from 0 to 255, of size
Width x Height x 3.
The 3 represents the three color channels
Red, Green, Blue.
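For instance, a hypothetical 248-pixel-wide, 400-pixel-high RGB image would be stored as follows (note that numpy orders the array as Height x Width x 3):

```python
import numpy as np

# A made-up 400 x 248 RGB image: integers in 0..255, three color channels
img = np.random.randint(0, 256, size=(400, 248, 3), dtype=np.uint8)
print(img.shape, img.dtype, img.min(), img.max())
```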
Image classification - challenge
Data driven approach
Instead of trying to specify what every one of the categories of interest looks like
directly in code, provide the computer with many examples of each class and
then develop learning algorithms that look at these examples and learn about
the visual appearance of each class.
Labeled database
An example training set for four visual categories. In practice we may have thousands
of categories and hundreds of thousands of images for each category.
MNIST database
Nearest Neighbor Classifier
An example image classification dataset: CIFAR-10.
Left: Example images from the CIFAR-10 dataset. Right: first column shows a few test images and next to each we show
the top 10 nearest neighbors in the training set according to pixel-wise difference.
CIFAR10
http://www.cs.toronto.edu/~kriz/cifar.html
This dataset consists of 60,000 tiny images that are 32 pixels high and wide.
Each image is labeled with one of 10 classes (for example “airplane, automobile, bird, etc”).
These 60,000 images are partitioned into a training set of
50,000 images and a test set of 10,000 images. In the previous slide you could see 10 random
example images from each one of the 10 classes.
CIFAR10 – nearest neighbor
Suppose now that we are given the CIFAR-10 training set of 50,000 images
(5,000 images for every one of the labels), and we wish to label the remaining 10,000.
The nearest neighbor classifier will take a test image, compare it to every single one of the
training images, and predict the label of the closest training image.
In the image above and on the right you can see an example result of such a procedure for
10 example test images. Notice that in only about 3 out of 10 examples an image of the
same class is retrieved, while in the other 7 examples this is not the case.
For example, in the 8th row the nearest training image to the horse head is a red car, presumably
due to the strong black background.
As a result, this image of a horse would in this case be mislabeled as a car.
Nearest neighbor – L1 distance
An example of using pixel-wise differences to compare two images with L1 distance
(for one color channel in this example). Two images are subtracted elementwise and then all differences are
added up to a single number.
If two images are identical the result will be zero. But if the images are very different the result will be large.
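The slide's formula is not reproduced in the transcript; the pixel-wise L1 distance described in the caption is, in LaTeX:

```latex
d_1(I_1, I_2) = \sum_{p} \left| I_1^{p} - I_2^{p} \right|
```

where the sum runs over all pixels p of the two images.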
K-nearest neighbors
Find the top k closest images, and have them vote on the label of the test image.
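A minimal numpy sketch of such a classifier, combining the L1 distance from the previous slide with k-nearest-neighbor voting; the class and method names are my own, loosely modeled on the cs231n notes, and no attempt is made at efficiency:

```python
import numpy as np

class KNearestNeighbor:
    """Nearest-neighbor classifier with pixel-wise L1 distance."""

    def train(self, X, y):
        # "Training" just memorizes the data: X is N x D (each row a
        # flattened image), y holds the N integer class labels.
        self.Xtr, self.ytr = X, y

    def predict(self, X, k=1):
        preds = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            # L1 distance from test image i to every training image
            dists = np.sum(np.abs(self.Xtr - X[i]), axis=1)
            # let the labels of the k closest training images vote
            closest = self.ytr[np.argsort(dists)[:k]]
            preds[i] = np.bincount(closest).argmax()
        return preds
```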
Validation sets for Hyperparameter tuning
K is a hyperparameter.
How can we determine K?
We CANNOT find K by tweaking it on the test set.
Use a validation set instead:
Split your training set into a (smaller) training set and a validation set.
Use the validation set to tune all hyperparameters.
At the end, run a single time on the test set and report performance.
Cross validation
Insufficient data? Instead of a single validation split, divide the
training set into equal folds; each fold in turn serves as the validation
set while the rest are used for training, and the results are averaged.
At the end, run a single time on the test set and report performance.
Cross validation
Example of a 5-fold cross-validation run for the
parameter k. For each value of k we train on 4 folds
and evaluate on the 5th. Hence, for each k we
receive 5 accuracies on the validation fold (accuracy
is the y-axis, each result is a point). The trend line is
drawn through the average of the results for each k
and the error bars indicate the standard deviation.
Note that in this particular case, the cross-validation
suggests that a value of about k = 7 works best on
this particular dataset (corresponding to the peak in
the plot). If we used more than 5 folds, we might
expect to see a smoother (i.e. less noisy) curve.
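A sketch of the 5-fold procedure described in the caption, reusing the hypothetical KNearestNeighbor class from the earlier sketch and assuming training arrays Xtr (N x D) and ytr (N,) are already loaded:

```python
import numpy as np

num_folds = 5
X_folds = np.array_split(Xtr, num_folds)
y_folds = np.array_split(ytr, num_folds)

for k in [1, 3, 5, 7, 10, 20, 50, 100]:  # candidate hyperparameter values
    accuracies = []
    for i in range(num_folds):
        # fold i is the validation fold; train on the remaining 4 folds
        X_val, y_val = X_folds[i], y_folds[i]
        X_train = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_train = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        knn = KNearestNeighbor()
        knn.train(X_train, y_train)
        accuracies.append(np.mean(knn.predict(X_val, k=k) == y_val))
    print(f"k = {k}: mean validation accuracy {np.mean(accuracies):.3f}")
```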
Pros and Cons of Nearest Neighbor
classifier
Pixel-based distances on high-dimensional data (and images especially) can be very unintuitive. An
original image (left) and three other images next to it that are all equally far away from it based on
L2 pixel distance. Clearly, the pixel-wise distance does not correspond at all to perceptual or
semantic similarity.
L2 pixelwise differences (CIFAR DB)
Intermediate summary (I)
We introduced the problem of Image Classification, in which we are given a set of images
that are all labeled with a single category. We are then asked to predict these categories for
a novel set of test images and measure the accuracy of the predictions.
We introduced a simple classifier called the Nearest Neighbor classifier. We saw that there
are multiple hyper-parameters (such as value of k, or the type of distance used to compare
examples) that are associated with this classifier and that there was no obvious way of
choosing them.
We saw that the correct way to set these hyperparameters is to split your training data in
two: a training set and a fake test set, which we call the validation set. We try different
hyperparameter values and keep the values that lead to the best performance on the
validation set.
Intermediate summary (II)
If the lack of training data is a concern, we discussed a procedure called cross-validation,
which can help reduce noise in estimating which hyperparameters work best.
Once the best hyperparameters are found, we fix them and perform a single evaluation on
the actual test set.
We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to
implement but requires us to store the entire training set and it is expensive to evaluate on a
test image.
Finally, we saw that the use of L1 or L2 distances on raw pixel values is not adequate since
the distances correlate more strongly with backgrounds and color distributions of images
than with their semantic content.
K-NN Disadvantages
1) The classifier must remember all of the training data and store it for future
comparisons with the test data. This is space inefficient because datasets may
easily be gigabytes in size.
2) Classifying a test image is expensive since it requires a comparison to all
training images.
Parameterized mapping from images to
label scores
Define a score function that maps the pixel values of an image to confidence
scores for each class:
Training dataset: N examples xi (each a D-dimensional column vector),
with labels yi from K categories.
Linear classifier: $f(x_i, W, b) = W x_i + b$
The matrix W (of size [K x D]) and the vector b (of size [K x 1]) are the
parameters (weights) of the function. In CIFAR-10, xi contains all pixels
in the i-th image flattened into a single column.
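As a concrete sketch with CIFAR-10 dimensions (the zero initialization and random input are placeholders):

```python
import numpy as np

def scores(x, W, b):
    """Linear classifier: map a flattened image x (D,) to K class scores."""
    return W @ x + b  # W is K x D, b is a length-K vector

D, K = 32 * 32 * 3, 10                # CIFAR-10: 3072 pixel values, 10 classes
W, b = np.zeros((K, D)), np.zeros(K)  # parameters, to be learned
x = np.random.randint(0, 256, size=D).astype(float)  # a random "image"
print(scores(x, W, b).shape)          # (10,) -- one score per class
```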
Interpreting a linear classifier
Analogy of images as high-dimensional
points.
Interpretation of linear classifiers as
template matching.
Each row of W corresponds to a template (prototype) for one of the classes.
The score of each class for an image is then obtained by comparing each
template with the image using an inner product (or dot product) one by one to
find the one that “fits” best – analogous to template matching.
Similar to doing Nearest Neighbor, but with a single image per class instead of
thousands of training images, and we use the (negative) inner product as the
distance instead of the L1 or L2 distance.
Example learned weights
Example learned weights at the end of learning for CIFAR-10. Note that, for example, the ship template contains a lot
of blue pixels as expected. This template will therefore give a high score once it is matched against images of
ships on the ocean with an inner product.
Looking ahead
Looking ahead a bit, a neural network will be able to develop intermediate
neurons in its hidden layers that could detect specific class types (e.g. green car
facing left, blue car facing front, etc.), and neurons on the next layer could
combine these into a more accurate car score through a weighted sum of the
individual class (e.g. car) detectors.
Bias trick
Extend the vector xi with one additional dimension that always holds the
constant 1. The extra column that W then gains corresponds to the bias b:
$f(x_i, W) = W x_i$
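A short numpy check of the trick (dimensions chosen to match CIFAR-10, values random for illustration):

```python
import numpy as np

D, K = 3072, 10
W, b = np.random.randn(K, D), np.random.randn(K)
x = np.random.randn(D)

x_ext = np.append(x, 1.0)           # x extended with a constant 1: now D + 1
W_ext = np.hstack([W, b[:, None]])  # b folded in as an extra column: K x (D + 1)
assert np.allclose(W_ext @ x_ext, W @ x + b)  # same scores, one parameter matrix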
Bias trick
A note on image data processing
Normalize your input features (e.g. pixel values)
Center your input features,
e.g. map [0,255] -> [-1,1]
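One common way to perform this mapping is to subtract the mid-point of the pixel range and then scale (a sketch; other centering schemes, such as subtracting the mean image, are also used):

```python
import numpy as np

pixels = np.random.randint(0, 256, size=(32, 32, 3)).astype(np.float32)
normalized = (pixels - 127.5) / 127.5  # [0, 255] -> [-1, 1]
print(normalized.min() >= -1.0, normalized.max() <= 1.0)
```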
A note on Support Vector Machine (SVM)
The operation of the SVM algorithm is based on finding
the hyperplane that gives the largest minimum
distance to the training examples. Twice this distance is
known as the margin in SVM theory. Therefore, the optimal
separating hyperplane maximizes the margin of the training data.
Loss function
Also termed a cost function or the objective.
Measures our unhappiness with outcomes )-:
Multiclass Support Vector Machine loss
(all vs. all)
The score for the j-th class is the j-th element: $s_j = f(x_i, W)_j$
The Multiclass SVM loss for the i-th example is then formalized as follows:
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
Margin
The SVM loss function wants the score of the correct class yi to be larger than the incorrect
class scores by at least Δ (delta). If this is not the case, we will accumulate loss.
SVM Loss (cont.)
Hinge loss: $\max(0, \cdot)$
Squared hinge loss SVM (or L2-SVM): $\max(0, \cdot)^2$, which penalizes violated
margins quadratically instead of linearly.
Regularization
We wish to encode a preference for a certain set of weights W over others, to
remove ambiguities.
We can do so by extending the loss function with a regularization penalty R(W).
The most common regularization penalty is the L2 norm, which discourages large
weights through an elementwise quadratic penalty over all parameters:
$R(W) = \sum_k \sum_l W_{k,l}^2$
Full multi-class SVM loss: $L = \frac{1}{N} \sum_i L_i + \lambda R(W)$
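A vectorized numpy sketch of this full loss, assuming integer labels and the [K x D] weight layout used above (the regularization strength lam stands in for λ):

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0, lam=0.5):
    """Mean multiclass hinge loss over N examples plus L2 regularization.
    W: K x D weights, X: N x D data rows, y: N correct-class indices."""
    scores = X @ W.T                        # N x K class scores
    correct = scores[np.arange(len(y)), y]  # score of each true class
    margins = np.maximum(0, scores - correct[:, None] + delta)
    margins[np.arange(len(y)), y] = 0       # the sum skips j == y_i
    data_loss = margins.sum() / len(y)
    reg_loss = lam * np.sum(W * W)          # elementwise quadratic penalty
    return data_loss + reg_loss
```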
Softmax classifier
Generalizes the logistic regression classifier to multiple classes, i.e.
$L_i = -\log\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)$
or equivalently: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$
Softmax classifier: Information theory view
The cross-entropy between a “true” distribution p and an estimated distribution
q is defined as:
$H(p, q) = -\sum_x p(x) \log q(x)$
The Softmax classifier is hence minimizing the cross-entropy between the
estimated class probabilities and the “true” distribution, which in this
interpretation is the distribution where all probability mass is on the correct class
(i.e. p = [0, …, 1, …, 0] contains a single 1 at the yi-th position).
Softmax classifier: probabilistic view
$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$
can be interpreted as the (normalized) probability assigned to the correct label
yi given the image xi, parameterized by W.
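A numpy sketch of the softmax cross-entropy loss; the score shift is a standard numerical-stability trick (it leaves the probabilities unchanged but prevents exp() from overflowing), and the example scores are made up:

```python
import numpy as np

def softmax_loss(f, y):
    """Cross-entropy loss of the Softmax classifier.
    f: N x K class scores, y: N correct-class indices."""
    f = f - f.max(axis=1, keepdims=True)  # shift so the max score is 0
    probs = np.exp(f) / np.exp(f).sum(axis=1, keepdims=True)
    # -log of the normalized probability assigned to the correct class
    return -np.log(probs[np.arange(len(y)), y]).mean()

f = np.array([[3.2, 5.1, -1.7]])  # example scores for one image, K = 3
print(softmax_loss(f, np.array([0])))
```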
Intermediate summary
We defined a score function from image pixels to class scores (in this section, a linear function
that depends on weights W and biases b).
Unlike the kNN classifier, this parametric approach has the advantage that once we learn the
parameters we can discard the training data. Additionally, the prediction for a new test image is
fast since it requires a single matrix multiplication with W, not an exhaustive comparison to
every single training example.
We introduced the bias trick, which allows us to fold the bias vector into the weight matrix for
convenience of only having to keep track of one parameter matrix.
We defined a loss function (we introduced two commonly used losses for linear classifiers: the
SVM and the Softmax) that measures how compatible a given set of parameters is with respect
to the ground truth labels in the training dataset. We also saw that the loss function was defined
in such a way that making good predictions on the training data is equivalent to having a small
loss.
What is next?
We now saw one way to take a dataset of images and map each one to class
scores based on a set of parameters, and we saw two examples of loss functions
that we can use to measure the quality of the predictions. But how do we
efficiently determine the parameters that give the best (lowest) loss? This
process is optimization, and it is the topic of the next class.