PPT - Aspiring Minds Research

Neural Network Models
Ashutosh Pandey and Shashank Srikant
Layout of talk
• Classification problem
• Idea of gradient descent
• Neural network architecture
• Learning a function using neural network
• Backpropagation algorithm
• Art of backpropagation
Classification Problem
• Finding the class of a given input.
• Build a model which can accurately predict the class based on input
• Given a training data set, use information in it to learn the model.
• A model architecture is fixed.
• Parameters of model are learned by minimizing a cost function.
• Cost function can be minimized using gradient descent.
• Neural network can be used for classification.
• Backpropagation algorithm is used to learn neural network model
• Backpropagation uses gradient descent for cost minimization
Idea of Gradient Descent
• Given a point on a surface, find the direction along which the function decreases at the maximum rate.
• Take a fixed step along that direction to reach a new point.
• Repeat the previous two steps at the new point.
• Why not use math formula?
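The three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the surface `f(x, y) = (x - 3)^2 + (y + 1)^2`, the step size and the iteration count are made up for the example and are not from the talk. The point of the sketch is that the procedure needs only the gradient at the current point, not a closed-form solution for the minimum.

```python
def gradient_descent(grad, start, step=0.1, iters=100):
    """Repeatedly take a fixed-size step against the gradient, starting at `start`."""
    point = list(start)
    for _ in range(iters):
        g = grad(point)
        # The negative gradient is the direction of steepest decrease.
        point = [p - step * gi for p, gi in zip(point, g)]
    return point


# Hypothetical surface: f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1).
def grad_f(p):
    x, y = p
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
```

With these settings `minimum` ends up very close to `(3, -1)`; a step that is too large would make the iterates diverge instead.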
Artificial Neural Networks
• Computing machines that try to mimic brain architecture
• A large network of interconnected units
• Each unit has a simple input-output mapping
• Each interconnection has a numerical weight attached to it
• The output of a unit depends on the outputs and connection weights of the units connected to it
• ‘Knowledge’ resides in the weights
• Problem-solving ability is often acquired through learning
Single Neuron Model
• $x_i$ are the inputs into the (artificial) neuron and $w_i$ are the corresponding weights. $y$ is the output of the neuron.
• Net input: $\eta = \sum_j w_j x_j$
• Output: $y = f(\eta)$, where $f(\cdot)$ is called the activation function.
Typical activation functions
1. Hard limiter:
$$f(x) = \begin{cases} 1 & \text{if } x > \tau \\ 0 & \text{otherwise} \end{cases}$$
• We can keep $\tau$ at zero and add one more input line to the neuron. An example of a single neuron with this activation function is the Perceptron.
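A minimal sketch of a single hard-limiter neuron, showing that a threshold $\tau$ can be replaced by an extra constant input. The weights and the threshold below are hypothetical values chosen so that the neuron computes a logical AND; they are not from the talk.

```python
def hard_limiter(x, tau=0.0):
    """Return 1 if x exceeds the threshold tau, else 0."""
    return 1 if x > tau else 0


def neuron_output(weights, inputs, tau=0.0):
    """Net input eta = sum_j w_j x_j, passed through the hard limiter."""
    eta = sum(w * x for w, x in zip(weights, inputs))
    return hard_limiter(eta, tau)


# Hypothetical neuron computing AND of two binary inputs.
weights = [1.0, 1.0]
tau = 1.5

# Same neuron with tau fixed at zero: add one input that is always 1
# and give it the weight -tau.
weights_aug = weights + [-tau]


def augment(inputs):
    return list(inputs) + [1.0]


examples = [[0, 0], [0, 1], [1, 0], [1, 1]]
with_threshold = [neuron_output(weights, x, tau) for x in examples]
with_extra_input = [neuron_output(weights_aug, augment(x)) for x in examples]
```

Both lists agree on every input, because $\eta > \tau$ is the same condition as $\eta - \tau > 0$.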
Activation functions (Contd.)
2. Sigmoid function
$$f(x) = \frac{a}{1 + \exp(-bx)}, \quad a, b > 0$$
Activation functions (Contd.)
3. tanh
$$f(x) = a \tanh(bx), \quad a, b > 0$$
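The sigmoid and tanh activations above, together with their derivatives (which gradient-based learning needs), might be written as follows. The function names and the default values `a = b = 1` are choices made for this sketch.

```python
import math


def sigmoid(x, a=1.0, b=1.0):
    """f(x) = a / (1 + exp(-b x)), with a, b > 0."""
    return a / (1.0 + math.exp(-b * x))


def sigmoid_prime(x, a=1.0, b=1.0):
    """Derivative of the sigmoid: f'(x) = b f(x) (1 - f(x) / a)."""
    s = sigmoid(x, a, b)
    return b * s * (1.0 - s / a)


def tanh_act(x, a=1.0, b=1.0):
    """f(x) = a tanh(b x), with a, b > 0."""
    return a * math.tanh(b * x)


def tanh_prime(x, a=1.0, b=1.0):
    """Derivative of a tanh(b x): f'(x) = a b (1 - tanh(b x)^2)."""
    t = math.tanh(b * x)
    return a * b * (1.0 - t * t)
```

Both derivatives can be computed from the function value itself. Both also shrink towards zero for large $|x|$, where the neuron saturates.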
Networks of neurons
• We can connect a number of such units or neurons to form a network. Inputs to a neuron can be
outputs of other neurons (and/or external inputs).
• Notation:
• $y_j$ - output of the $j$-th neuron;
• $w_{ij}$ - weight of the connection from neuron $i$ to neuron $j$
• A single neuron ‘represents’ a class of functions from $\Re^m$ to $\Re$.
• A specific set of weights realises a specific function.
• By interconnecting many units/neurons, networks can represent more complicated functions from $\Re^m$ to $\Re^{m'}$.
• The architecture constrains the function class that can be represented. Weights define specific
function in the class
• To form meaningful networks, nonlinearity of activation function is important
Why study such Models?
• A belief that the architecture of Brain is critical to intelligent behaviour.
• Models can implement highly nonlinear functions. They are adaptive and can be trained.
• Useful in many applications:
  • Time series prediction
  • System identification and control
  • Pattern recognition and regression
• Models can help us understand brain function (computational neuroscience)
Feedforward Neural network
• Feedforward network can always be organised as layered network
The function represented by the previous network can be written as
$$y_5 = f_5(w_{35} y_3 + w_{45} y_4)$$
$$= f_5\big(w_{35} f_3(w_{13} y_1 + w_{23} y_2) + w_{45} f_4(w_{14} y_1 + w_{24} y_2)\big)$$
$$= f_5\big(w_{35} f_3(w_{13} x_1 + w_{23} x_2) + w_{45} f_4(w_{14} x_1 + w_{24} x_2)\big)$$
We can similarly write $y_6$ as a function of $x_1, x_2$.
Multilayer feedforward network
• Here is a general multilayer feedforward network
Notation
• $L$ - number of layers
• $n_l$ - number of nodes in layer $l$, $l = 1, \dots, L$.
• $y_i^l$ - output of the $i$-th node in layer $l$, $i = 1, \dots, n_l$, $l = 1, \dots, L$.
• $w_{ij}^l$ - weight of the connection from node $i$ of layer $l$ to node $j$ of layer $l + 1$.
• $\eta_i^l$ - net input of node $i$ in layer $l$
• Our network represents a function from $\Re^{n_1}$ to $\Re^{n_L}$.
The output of a typical unit is computed as
$$\eta_j^l = \sum_{i=1}^{n_{l-1}} w_{ij}^{l-1} y_i^{l-1}, \qquad y_j^l = f(\eta_j^l)$$
• To get a specific function we need to learn appropriate weights
• Thus, $w_{ij}^l$, $i = 0, \dots, n_l$, $j = 1, \dots, n_{l+1}$, $l = 1, \dots, L - 1$, are the parameters to learn.
• Let 𝑾 represent all these parameters.
• The 𝒚𝒍𝒊 are functions of 𝑾 and the external inputs 𝑿 , though we may not always explicitly show it
in the notation.
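The layer-by-layer computation of $\eta_j^l$ and $y_j^l$ can be sketched as below. Assumptions made for this sketch: weights are stored as nested lists with `weights[l][i][j]` holding the weight from node `i` of one layer to node `j` of the next (0-based indices), every node uses the sigmoid, bias inputs are left out, and the 2-2-1 network and its weight values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, x, f=sigmoid):
    """Return the outputs of every layer, starting with the input layer."""
    outputs = [list(x)]
    for W in weights:
        y_prev = outputs[-1]
        n_next = len(W[0])
        # Net input of node j: sum over i of w_ij * y_i from the previous layer.
        eta = [sum(W[i][j] * y_prev[i] for i in range(len(y_prev)))
               for j in range(n_next)]
        outputs.append([f(e) for e in eta])
    return outputs


# Hypothetical network with L = 3 layers: 2 inputs, 2 hidden nodes, 1 output.
weights = [
    [[0.5, -0.5], [0.25, 0.75]],  # layer 1 -> layer 2
    [[1.0], [-1.0]],              # layer 2 -> layer 3
]
layers = forward(weights, [1.0, 2.0])
network_output = layers[-1]
```

Keeping the outputs of all layers (not just the last one) is deliberate, since the learning algorithm needs them later.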
Learning a function using neural network
• Let us first consider using neural network models to learn a function.
• Suppose we have training data $\{(X^i, d^i),\ i = 1, \dots, N\}$, where
• $X^i = [x_1^i, \dots, x_m^i]^T \in \Re^m$ and
• $d^i = [d_1^i, \dots, d_{m'}^i]^T \in \Re^{m'}$
• We want to learn a neural network to represent this function.
• We can use an $L$-layer network with $n_1 = m$ and $n_L = m'$.
• $L$ and $n_2, \dots, n_{L-1}$ are parameters which we fix (for now) arbitrarily.
• We assume that the nodes in all layers, including the output layer, use the sigmoid activation function.
• Hence we take $d^i \in [0, 1]^{m'}$, $\forall i$.
• We can always linearly scale the output as needed (Or we can also use a linear activation function
for output nodes).
• Let $y^L = [y_1^L, \dots, y_{m'}^L]^T$ denote the vector of outputs.
• We should actually write $y^L(W, X)$, $y_i^L(W, X)$ and so on.
• Now the risk minimization framework is as follows.
• We fix a hypothesis space $\mathcal{H}$ by fixing the architecture of the network (that is, fixing $L$ and $n_l$, $l = 1, \dots, L$).
• This hypothesis space is parametrized by $W$.
• We can now choose a loss function and learn ‘best’ 𝑾 by minimizing empirical risk
• We choose squared error loss function.
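• The empirical risk is given by
$$R_N(W) = \frac{1}{N} \sum_{i=1}^{N} \left\| y^L(W, X^i) - d^i \right\|^2 = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m'} \left( y_j^L(W, X^i) - d_j^i \right)^2$$
• We want to find a $W$ that is a minimizer of $R_N(W)$.
• This is the same as finding $W$ to minimize
$$J(W) = \sum_{i=1}^{N} J_i(W), \quad \text{where} \quad J_i(W) = \frac{1}{2} \sum_{j=1}^{m'} \left( y_j^L(W, X^i) - d_j^i \right)^2$$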
• $J_i$ is (half) the squared error between the output of the network and the desired output for the training example $X^i$.
• One method of finding a minimizer of 𝑱 is to use gradient-descent.
• This gives us the following learning algorithm
$$W(t+1) = W(t) - \lambda \nabla J(W(t)) = W(t) - \lambda \sum_{i=1}^{N} \nabla J_i(W(t))$$
where $t$ is the iteration count and $\lambda$ is the step-size parameter.
• To completely specify our algorithm for learning the weights, we need an expression for the gradient of $J_i$.
• In terms of the individual weights, the gradient descent is
$$w_{ij}^l(t+1) = w_{ij}^l(t) - \lambda \sum_{s=1}^{N} \frac{\partial J_s}{\partial w_{ij}^l}(W(t))$$
• Because of the structure of the network, there is an efficient way of computing such partial
derivatives.
• We look at this computation for any one training sample (and for a general multilayer
feedforward network).
• From now on we omit explicit mention of the specific training example.
• Any weight $w_{ij}^l$ can affect $J$ only by affecting the final output of the network.
• In a layered network, the weight $w_{ij}^l$ can affect the final output only through its effect on $\eta_j^{l+1}$.
• Hence, using the chain rule of differentiation, we have
$$\frac{\partial J}{\partial w_{ij}^l} = \frac{\partial J}{\partial \eta_j^{l+1}} \frac{\partial \eta_j^{l+1}}{\partial w_{ij}^l}$$
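• Recall that
$$\eta_j^{l+1} = \sum_{s=1}^{n_l} w_{sj}^l y_s^l \;\Rightarrow\; \frac{\partial \eta_j^{l+1}}{\partial w_{ij}^l} = y_i^l$$
• Define
$$\delta_j^l = \frac{\partial J}{\partial \eta_j^l}, \quad \forall j, l$$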
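• Now we get
$$\frac{\partial J}{\partial w_{ij}^l} = \frac{\partial J}{\partial \eta_j^{l+1}} \frac{\partial \eta_j^{l+1}}{\partial w_{ij}^l} = \delta_j^{l+1} y_i^l$$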
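• We can get all the needed partial derivatives if we calculate $\delta_j^l$ for all nodes.
• We can compute $\delta_j^l$ recursively:
$$\delta_j^l = \frac{\partial J}{\partial \eta_j^l} = \sum_{s=1}^{n_{l+1}} \frac{\partial J}{\partial \eta_s^{l+1}} \frac{\partial \eta_s^{l+1}}{\partial \eta_j^l} = \sum_{s=1}^{n_{l+1}} \frac{\partial J}{\partial \eta_s^{l+1}} \frac{\partial \eta_s^{l+1}}{\partial y_j^l} \frac{\partial y_j^l}{\partial \eta_j^l} = \sum_{s=1}^{n_{l+1}} \delta_s^{l+1} w_{js}^l f'(\eta_j^l)$$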
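• Recall that the partial derivatives are given by
$$\frac{\partial J}{\partial w_{ij}^l} = \delta_j^{l+1} y_i^l$$
• Since the weights have $l = 1, \dots, L - 1$, we need $\delta_j^l$ for $l = 2, \dots, L$ and all nodes $j$.
• Recall the recursive formula
$$\delta_j^l = \sum_{s=1}^{n_{l+1}} \delta_s^{l+1} w_{js}^l f'(\eta_j^l)$$
• So, we need to first compute $\delta_j^L$.
• By definition, we have
$$\delta_j^L = \frac{\partial J}{\partial \eta_j^L}$$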
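• We have
$$J = \frac{1}{2} \sum_{j=1}^{n_L} \left( y_j^L - d_j \right)^2$$
• Hence we have
$$\delta_j^L = \frac{\partial J}{\partial \eta_j^L} = \frac{\partial J}{\partial y_j^L} \frac{\partial y_j^L}{\partial \eta_j^L} = (y_j^L - d_j) f'(\eta_j^L)$$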
• Using the above we can compute $\delta_j^L$, $j = 1, \dots, n_L$.
• Then we can compute $\delta_j^l$, $j = 1, \dots, n_l$, for $l = L - 1, \dots, 2$, recursively, using
$$\delta_j^l = \sum_{s=1}^{n_{l+1}} \delta_s^{l+1} w_{js}^l f'(\eta_j^l)$$
• We call $\delta_j^l$ the ‘error’ at node $j$ of layer $l$.
• Then we compute all partial derivatives with respect to the weights as
$$\frac{\partial J}{\partial w_{ij}^l} = \delta_j^{l+1} y_i^l$$
and hence can update the weights using the gradient descent procedure.
• Note that these are actually the partial derivatives of $J_i$, the error on a single training example.
Computing output of network
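• For the input layer: $y_i^1 = x_i^s$, $i = 1, \dots, n_1$ (the $i$-th component of the training example $X^s$).
• For $l = 2, \dots, L$, we now compute
$$\eta_j^l = \sum_{i=1}^{n_{l-1}} w_{ij}^{l-1} y_i^{l-1}, \qquad y_j^l = f(\eta_j^l)$$
• Once we have the output of the network, we then need to compute the ‘errors’ $\delta_j^l$.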
Backpropagation of errors
• At the output layer
$$\delta_j^L = (y_j^L - d_j) f'(\eta_j^L)$$
• Now, for layers $l = L - 1, \dots, 2$, we compute
$$\delta_j^l = \sum_{s=1}^{n_{l+1}} \delta_s^{l+1} w_{js}^l f'(\eta_j^l)$$
• Once all $\delta_j^l$ are available, we update the weights by
$$w_{ij}^l(t+1) = w_{ij}^l(t) - \lambda \delta_j^{l+1} y_i^l$$
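Putting the forward pass, the backward pass and the weight update together, one training step on a single example might look like the sketch below. It is not the authors' implementation. Assumptions: all nodes use the sigmoid with $a = b = 1$ (so $f'(\eta) = y(1 - y)$), there are no bias inputs, weights are nested lists with `weights[l][i][j]` going from node `i` to node `j` of the next layer, and the names `backprop` and `sgd_step` are made up for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, x):
    """Outputs of every layer, input layer first."""
    ys = [list(x)]
    for W in weights:
        prev = ys[-1]
        eta = [sum(W[i][j] * prev[i] for i in range(len(prev)))
               for j in range(len(W[0]))]
        ys.append([sigmoid(e) for e in eta])
    return ys


def backprop(weights, x, d):
    """Gradient of J = 0.5 * sum_j (y_j^L - d_j)^2 with respect to every weight."""
    ys = forward(weights, x)
    # Output layer: delta_j^L = (y_j^L - d_j) f'(eta_j^L), with f' = y (1 - y).
    delta = [(y - t) * y * (1.0 - y) for y, t in zip(ys[-1], d)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        W, y_in = weights[l], ys[l]
        # dJ/dw_ij = delta_j (next layer) * y_i (this layer).
        grads[l] = [[delta[j] * y_in[i] for j in range(len(delta))]
                    for i in range(len(y_in))]
        if l > 0:
            # Propagate the errors one layer back (not needed for the input layer).
            delta = [sum(delta[s] * W[j][s] for s in range(len(delta)))
                     * y_in[j] * (1.0 - y_in[j])
                     for j in range(len(y_in))]
    return grads


def sgd_step(weights, x, d, lam=0.5):
    """One gradient descent update with step size lam on a single example."""
    grads = backprop(weights, x, d)
    return [[[w - lam * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, G)]
            for W, G in zip(weights, grads)]
```

A common sanity check is to compare the result of `backprop` with finite-difference estimates of the same derivatives on a small network; the two should agree to several decimal places.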
Backpropagation algorithm
• We train the neural network by minimization of empirical risk under squared error loss.
• We use gradient descent for the minimization.
• Because of the network structure, the needed partial derivatives are computed efficiently.
• One forward pass to compute outputs of network and one backward pass to compute all errors
and hence all the needed derivatives.
• This algorithm is called Backpropagation (or backpropagation of error).
• Suppose we have $M_w$ weights and $M_l$ nodes in the network.
• The computational complexity of the backpropagation algorithm is $O(M_w)$.
• In the backpropagation of error, the number of computations needed is of the same order as in the forward computation.
• Computing all the derivatives numerically (by finite differences) would instead need $O(M_w^2)$ computations.
• Consider a 3-layer network with $m$ input nodes, one output node and $p$ nodes in the hidden layer.
• This represents a function from $\Re^m$ to $\Re$.
• The output of this network can be written as
$$y = f\left( \sum_{j=1}^{p} w_{j1}^2 \, f\left( \sum_{i=1}^{m} w_{ij}^1 x_i \right) \right)$$
• Suppose the activation function of the output node is linear and all hidden nodes have bias inputs. Then we can rewrite this as
$$y = \sum_{j=1}^{p} \beta_j f\left( \sum_{i=1}^{m} w_{ij} x_i + b_j \right)$$
• The three-layer network represents a function
$$h(X) = \sum_{j=1}^{p} \beta_j f\left( \sum_{i=1}^{m} w_{ij} x_i + b_j \right)$$
• Linear regression models are of the form
$$h(X) = \sum_{j=1}^{p} \beta_j \varphi_j(X)$$
• What is the difference?
• The main difference is that the basis functions are not fixed beforehand.
• The basis functions themselves are adapted or learnt using the training data.
• We can think of the outputs of the hidden layer as a ‘proper’ representation of the input so that
now we can use a ‘linear’ model for predicting the target.
• ‘Backpropagation algorithm learns proper internal representations’.
• By using a neural network, we are adapting the basis rather than choosing a fixed basis.
• Is there some advantage?
• One can show that a neural network (whose weights globally minimize the sum of squared errors)
would achieve better approximation accuracy than the function learnt using a fixed basis
• That is, the approximation error with a neural network falls faster with 𝑁, the number of
examples.
Art of Backpropagation
• To use a network for learning a function, we have to decide on many ‘parameters’:
• Number of hidden layers and hidden nodes
• Activation function for nodes
• Online or batch mode for learning
• The initial values for weights
• Step-sizes and other issues with the learning algorithm
• We need to fix the structure of the network before we can learn weights using backpropagation
• Theoretically one hidden layer is enough
• But how many hidden nodes?
• In practice, how many hidden layers and nodes?
• The VC-dimension of these models is of the order of number of weights plus nodes.
• Structure should not be too complicated relative to the number of examples we have.
• What kind of activation functions to use?
• Both sigmoid and tanh are suitable.
• Should we do online or batch mode updates?
• We also generally normalize the input (or feature) vectors.
• We can use a linear transform to bring each feature value to [−1, 1].
• Or we can transform each feature to be a zero-mean unit variance random variable.
• Another factor that affects the performance of gradient descent is the initialization of weights
• Small random values for initialization are better.
• Backpropagation is a gradient descent in a very high dimensional space.
• Hence it has all problems associated with such gradient descent.
• It can get stuck in local minima. Often multiple starting points are used.
• It is also generally slow.
• There are some ways to improve this.
• Hessian techniques use second-order information and converge quickly.
• There is a version of backpropagation for computing the Hessian.
• Hessian is computationally intensive
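The two input normalizations mentioned above (a linear map of each feature to $[-1, 1]$, or a shift and scale to zero mean and unit variance) can be sketched as follows. The function names and the sample feature values are hypothetical, and each function works on one feature column at a time.

```python
def scale_to_unit_range(values):
    """Linearly map the values of one feature to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def standardize(values):
    """Shift and scale one feature to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 6.0, 8.0]  # hypothetical values of a single feature
scaled = scale_to_unit_range(feature)
z_scores = standardize(feature)
```

The statistics (`lo`, `hi`, `mean`, `std`) would normally be computed on the training data only and then reused unchanged for new inputs.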
Backpropagation with momentum term
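• One often uses a so-called momentum term and writes the algorithm as
$$\Delta w_{ij}^l(t) = -\lambda \frac{\partial J}{\partial w_{ij}^l}(t) + \gamma \, \Delta w_{ij}^l(t-1)$$
where $\Delta w_{ij}^l(t) = w_{ij}^l(t+1) - w_{ij}^l(t)$.
• At each iteration, we add a small term proportional to the step we took in the previous iteration.
• Equivalently, in terms of a velocity $v$ (with $\mu$ in place of $\gamma$, $\eta$ in place of $\lambda$ and $C$ in place of $J$):
$$v \to v' = \mu v - \eta \nabla C, \qquad w \to w' = w + v'$$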
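A small sketch of the velocity form of the momentum update. The quadratic cost $C(w) = \tfrac{1}{2}(w_0^2 + 10 w_1^2)$ and the values `eta = 0.1`, `mu = 0.9` are assumptions picked for illustration only.

```python
def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    """One update: v' = mu * v - eta * grad, then w' = w + v'."""
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Hypothetical cost C(w) = 0.5 * (w0^2 + 10 * w1^2), minimized at the origin.
def grad_c(w):
    return [w[0], 10.0 * w[1]]


w, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(300):
    w, v = momentum_step(w, v, grad_c(w))
```

Setting `mu = 0` recovers plain gradient descent, since the velocity is then just the current gradient step.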
The cross-entropy cost function
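• There is a problem with the sigmoid function.
• With the squared error cost, the partial derivative of the cost function with respect to the weights is proportional to the derivative of the sigmoid function, which is nearly zero when the neuron saturates.
• The cross-entropy cost function makes it proportional to the error only.
$$C = -\frac{1}{N} \sum_x \sum_j \left[ d_j \ln y_j^L + (1 - d_j) \ln(1 - y_j^L) \right]$$
• It can be shown that in this case the partial derivative is proportional to the error only.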
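The effect can be checked numerically for a single sigmoid output neuron. In this sketch, `z` stands for the net input $\eta^L$ of the output neuron and `d` for its desired output; the saturated example values are made up. With the squared error the ‘error’ term carries the factor $f'(z)$, while with the cross-entropy it reduces to $y - d$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_delta(z, d):
    """dJ/dz for J = 0.5 (y - d)^2 with y = sigmoid(z)."""
    y = sigmoid(z)
    return (y - d) * y * (1.0 - y)


def cross_entropy(z, d):
    """C = -[d ln y + (1 - d) ln(1 - y)] with y = sigmoid(z)."""
    y = sigmoid(z)
    return -(d * math.log(y) + (1.0 - d) * math.log(1.0 - y))


def cross_entropy_delta(z, d):
    """dC/dz for the cross-entropy cost: the sigmoid derivative cancels."""
    return sigmoid(z) - d


# A badly wrong, saturated neuron: output near 0 while the target is 1.
z, d = -6.0, 1.0
slow = quadratic_delta(z, d)
fast = cross_entropy_delta(z, d)
```

Here `slow` is tiny even though the output is almost as wrong as it can be, whereas `fast` stays close to the full error, so learning does not stall.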
Softmax
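$$y_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}, \quad \text{where} \quad z_j^L = \sum_k w_{jk}^L y_k^{L-1} + b_j^L$$
$$\sum_j y_j^L = \frac{\sum_j e^{z_j^L}}{\sum_k e^{z_k^L}} = 1$$
$C \equiv -\ln(y_j^L)$ if the input is of class $j$.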
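A sketch of the softmax output and the associated cost. Subtracting the largest $z$ before exponentiating is an implementation detail added here to avoid overflow; it leaves the result unchanged because the common factor cancels in the ratio. The function names and the example values are hypothetical.

```python
import math


def softmax(z):
    """y_j = exp(z_j) / sum_k exp(z_k)."""
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_cost(z, target):
    """C = -ln(y_target), where `target` is the index of the true class."""
    return -math.log(softmax(z)[target])


y = softmax([1.0, 2.0, 3.0])
cost_if_class_2 = log_likelihood_cost([1.0, 2.0, 3.0], 2)
```

The outputs are positive and sum to one, so they can be read as class probabilities, and the cost is small when the true class gets a high probability.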
Overfitting
• Occurs when the number of training data points is not sufficient for learning the parameters.
• Early stopping is a way to stop overfitting.
• Early stopping may not always work.
• Regularization is another way to avoid overfitting.
• Increasing the training data is also a solution, but it is not always practical.
• Reducing the size of the network also helps, but large networks are more powerful than small networks.
Regularization
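• Cross-entropy cost with L2 regularization:
$$C = -\frac{1}{N} \sum_x \sum_j \left[ d_j \ln y_j^L + (1 - d_j) \ln(1 - y_j^L) \right] + \frac{\lambda}{2N} \sum_w w^2$$
• Mean squared error (MSE) with L2 regularization:
$$C = \frac{1}{N} \sum_x \left\| d - y^L \right\|^2 + \frac{\lambda}{2N} \sum_w w^2$$
• General form:
$$C = C_0 + \frac{\lambda}{2N} \sum_w w^2$$
• Weight update:
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{N} w$$
• Bias update:
$$\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$
• Weight decay:
$$w \to \left( 1 - \frac{\eta \lambda}{N} \right) w - \eta \frac{\partial C_0}{\partial w}$$
Here $\eta$ is the learning rate and $\lambda$ is the regularization parameter (not the step size used earlier).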
• In neural networks regularization can play two important roles:
• Reduce overfitting
• Avoid local minima
• How does regularization reduce overfitting?
• How does regularization avoid local minima?
• Role of regularization in gradient descent
• The dynamics of gradient descent learning in multilayer nets has a ‘self-regularization’ effect
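L1 Regularization
$$C = C_0 + \frac{\lambda}{N} \sum_w |w|$$
• Weight update:
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{N} \operatorname{sgn}(w)$$
• L1 update rule:
$$w \to w - \frac{\eta \lambda}{N} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$$
• L2 update rule, for comparison:
$$w \to \left( 1 - \frac{\eta \lambda}{N} \right) w - \eta \frac{\partial C_0}{\partial w}$$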
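The two update rules can be compared directly on a single weight. In this sketch `grad_c0` stands for $\partial C_0 / \partial w$ and `n` for the number of training examples $N$; the function names and the numbers are illustrative assumptions.

```python
def sgn(w):
    """Sign of w: -1, 0 or +1."""
    return (w > 0) - (w < 0)


def l2_update(w, grad_c0, eta, lam, n):
    """Weight decay: shrink w by a factor, then take the usual gradient step."""
    return (1.0 - eta * lam / n) * w - eta * grad_c0


def l1_update(w, grad_c0, eta, lam, n):
    """L1: move w towards zero by a constant amount, then the gradient step."""
    return w - (eta * lam / n) * sgn(w) - eta * grad_c0


# With a zero data gradient, only the regularization acts on the weight.
large_l2 = l2_update(2.0, 0.0, eta=0.1, lam=1.0, n=10)
large_l1 = l1_update(2.0, 0.0, eta=0.1, lam=1.0, n=10)
small_l2 = l2_update(0.5, 0.0, eta=0.1, lam=1.0, n=10)
small_l1 = l1_update(0.5, 0.0, eta=0.1, lam=1.0, n=10)
```

L2 shrinks a weight in proportion to its size, while L1 shrinks every non-zero weight by the same amount, so under L1 small weights are driven to zero comparatively faster.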
Dropout
• We start by randomly (and temporarily) deleting half the hidden
neurons in the network, while leaving the input and output
neurons untouched
• Update the remaining weights and bias using backpropagation.
• Restore the dropped-out neurons.
• Repeat previous steps by randomly deleting half the nodes
• Different networks may overfit in different ways, and averaging
may help eliminate that kind of overfitting.
• "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons."
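The ‘temporarily delete half the hidden neurons’ step can be sketched as a 0/1 mask applied to the hidden-layer outputs. This only illustrates the masking; the forward and backward passes are left out, and the helper names and values are hypothetical.

```python
import random


def dropout_mask(n_hidden, rng):
    """Mask with 0 for a randomly chosen half of the hidden neurons, 1 for the rest."""
    dropped = set(rng.sample(range(n_hidden), n_hidden // 2))
    return [0 if i in dropped else 1 for i in range(n_hidden)]


def apply_mask(hidden_outputs, mask):
    """Zero the outputs of dropped neurons; the others pass through unchanged."""
    return [y * m for y, m in zip(hidden_outputs, mask)]


rng = random.Random(0)            # seeded so the example is repeatable
hidden = [0.2, 0.9, 0.5, 0.7]     # hypothetical hidden-layer outputs
mask = dropout_mask(len(hidden), rng)
masked = apply_mask(hidden, mask)

# For the next update a fresh mask is drawn, which 'restores' all neurons
# and deletes a different random half.
next_mask = dropout_mask(len(hidden), rng)
```

A dropped neuron outputs zero, so it contributes nothing to the next layer, and the gradients of the weights leaving it are zero for that update.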
Weight Initialization
• Why not initialize with zero mean, unit variance Gaussian random
variable?
• The cross-entropy cost function helped with saturated output neurons, but it does nothing at all for the problem of saturated hidden neurons.
• Normally one uses random initial weights drawn from a distribution with mean zero and variance $1/\nu$, where $\nu$ is the in-degree of the node to which this weight connects.
• It’s not only the speed of learning which is improved, it's sometimes also
the final performance.
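A sketch of this initialization for one fully connected layer, assuming a Gaussian distribution (the slide only fixes the mean and the variance). In a fully connected layer the in-degree $\nu$ of every receiving node equals the number of nodes in the previous layer, called `n_in` here; the layer sizes are made up.

```python
import random


def init_weights(n_in, n_out, rng):
    """Weights with mean 0 and variance 1 / n_in; result[i][j] goes from node i to node j."""
    std = (1.0 / n_in) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)            # seeded so the example is repeatable
W = init_weights(100, 50, rng)    # hypothetical layer: 100 inputs, 50 nodes
```

With this scaling, the net input of a node whose inputs are of order one also stays of order one, instead of growing with the number of inputs and pushing the sigmoid into saturation.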
Number of training epochs
• Use early stopping to determine the number of training epochs
• In the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting and use them to inform your approach to regularization.
• A good rule for early stopping is to terminate if the best classification accuracy doesn't improve for some number of epochs (say, 10).
• Regularization parameter 𝜆
• Start with no regularization and determine a value of 𝜂
• Use the validation data to select a good value for 𝜆
• That done, you should return and re-optimize 𝜂 again
• Online mode or Batch mode?
• Online mode is fast as it updates the weights after each training sample
• Mini-batch mode is a compromise between online and batch mode.
• We can use inbuilt math libraries to calculate derivatives on mini-batch data.
• Fortunately, the choice of mini-batch size at which the speed is maximized is relatively
independent of the other hyper-parameters (apart from the overall architecture)
• The space of hyper-parameters is so large that one never really finishes optimizing; one only abandons the network to posterity.
• "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or … insert your own favourite technique] and it just works. I don't have time to figure out just the right neural network."
• Is this the right attitude?
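The early-stopping rule from this section (terminate once the best validation accuracy has not improved for a set number of epochs) could be sketched as follows. The function works on a recorded list of per-epoch accuracies for simplicity; the name `early_stopping_epoch`, the `patience` parameter and the accuracy values are hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=10):
    """Return the 0-based epoch at which training would stop."""
    best = float("-inf")
    since_best = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            # New best accuracy: reset the counter.
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    # Never ran out of patience: training ran to the last recorded epoch.
    return len(val_accuracies) - 1


accuracies = [0.50, 0.60, 0.70, 0.69, 0.68, 0.70, 0.66]
stop_at = early_stopping_epoch(accuracies, patience=3)
```

In a real training loop the same counter would be updated after each epoch, and the weights from the best epoch would usually be the ones kept.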
Questions?
References
• http://neuralnetworksanddeeplearning.com/
• http://nptel.ac.in/downloads/117108048/
THANK YOU