Deep Feedforward Networks


Deep Feedforward Networks
6 ~ 6.2
Kyunghyun Lee
2016. 10. 17
Contents
1. Introduction
2. Example: Learning XOR
3. Gradient-Based Learning
   1. Cost Functions
   2. Output Units
1. Introduction
 Deep Feedforward Networks
• A feedforward network defines a mapping y = f(x; θ) and learns the value of
the parameters θ that result in the best function approximation
• Feedforward : information flows from x to f(x); there are no feedback
connections
• Networks : composing together many different functions
– e.g. a chain f(x) = f^(3)(f^(2)(f^(1)(x))), where f^(1) is the first layer,
f^(2) the second layer, and so on
1. Introduction
 Deep Feedforward Networks
[Figure : the network drawn as a chain of layers, from the input layer through
the first layer (hidden layer) to the second layer (output layer); the length
of the chain gives the depth of the model]
1. Introduction
 Deep Feedforward Networks
• Hidden layers
– training examples specify directly what the output layer must do
– the behavior of the other layers (hidden layers) is not directly specified
by the training data
– the learning algorithm must decide how to use these layers to best
implement an approximation of the desired function f*
• Each hidden layer of the network is typically vector-valued
– the dimensionality of these hidden layers determines the width of the model
2. Example: Learning XOR
 Learning the XOR function
• target : the XOR function y = f*(x)
• model : y = f(x; θ); fitting means adjusting the parameters θ to make f
match f* on the four XOR points
 Use linear regression
• loss function : J(θ) = (1/4) Σ_x (f*(x) − f(x; θ))²  (mean squared error)
x1   x2   f
0    0    0
0    1    1
1    0    1
1    1    0
• linear model : f(x; w, b) = xᵀw + b
• learning result : w = 0 and b = 1/2, so the model’s output is always 0.5!
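This failure is easy to reproduce. The following is a minimal sketch (assuming
NumPy; the variable names are just for illustration) that fits the linear model
f(x) = xᵀw + b to the four XOR points by least squares:

    import numpy as np

    # Four XOR input points and their target values
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)

    # Augment with a bias column and solve the least-squares problem
    Xb = np.hstack([X, np.ones((4, 1))])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    print(theta)       # approximately [0, 0, 0.5] : w = 0, b = 1/2
    print(Xb @ theta)  # the model outputs 0.5 for every input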
2. Example: Learning XOR
 Add a different feature space : h space
• model : h = f^(1)(x; W, c), y = f^(2)(h; w, b), giving the complete model
f(x; W, c, w, b) = f^(2)(f^(1)(x))
2. Example: Learning XOR
 Add a different feature space : h space
• we must use a nonlinear function to describe the features
• activation function : rectified linear unit or ReLU, g(z) = max{0, z},
applied elementwise : h = g(Wᵀx + c)
2. Example: Learning XOR
 Add a different feature space : h space
• solution
– let W = [[1, 1], [1, 1]], c = [0, −1]ᵀ, w = [1, −2]ᵀ, b = 0
– calculation : for the design matrix X of the four inputs, compute XW + c,
apply the activation function g(z) = max{0, z}, then finish by multiplying
by w; the result is [0, 1, 1, 0]ᵀ, exactly the XOR values (see the sketch below)
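A minimal sketch (assuming NumPy) that plugs the solution above into the
one-hidden-layer ReLU network and evaluates it on the four XOR inputs:

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # design matrix
    W = np.array([[1.0, 1.0], [1.0, 1.0]])
    c = np.array([0.0, -1.0])
    w = np.array([1.0, -2.0])
    b = 0.0

    h = np.maximum(0.0, X @ W + c)  # hidden features after the ReLU
    y = h @ w + b                   # linear output layer

    print(h)  # [[0, 0], [1, 0], [1, 0], [2, 1]]
    print(y)  # [0, 1, 1, 0] : exactly the XOR function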
2. Example: Learning XOR
 Conclusion
• we simply specified the solution, then showed that it obtained
zero error
• In a real situation, there might be billions of model parameters
and billions of training examples
• Instead, a gradient-based optimization algorithm can find
parameters that produce very little error
3. Gradient-Based Learning
 Neural network vs Linear models
• Designing and training a neural network is not much different
from training any other machine learning model with gradient
descent
• The largest difference between the linear models we have seen
so far and neural networks is that the nonlinearity of a neural
network causes most interesting loss functions to become non-convex
3. Gradient-Based Learning
 Non-convex optimization
• no convergence guarantee such as the one available for convex optimization
• sensitive to the values of the initial parameters
 For feedforward neural networks
• important to initialize all weights to small random values
• biases may be initialized to zero or to small positive values
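A minimal initialization sketch (assuming NumPy; the helper name and the 0.01
scale are hypothetical choices, not a scheme prescribed by the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(fan_in, fan_out, bias_value=0.0, scale=0.01):
        # small random weights break the symmetry between hidden units
        W = scale * rng.standard_normal((fan_in, fan_out))
        # biases start at zero, or at a small positive value for ReLU units
        b = np.full(fan_out, bias_value)
        return W, b

    W1, b1 = init_layer(2, 2, bias_value=0.1)  # hidden layer
    W2, b2 = init_layer(2, 1)                  # output layer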
3. Gradient-Based Learning
 Gradient descent
• train models such as linear regression and support vector
machines with gradient descent
• must choose a cost function
• must choose how to represent the output of the model
3.1 Cost Functions
 An important aspect of the design of a deep neural
network is the choice of the cost function
 When the parametric model defines a distribution p(y | x; θ)
• use the principle of maximum likelihood (the cross-entropy between the
training data and the model’s predictions as the cost function)
 predict some statistic of y conditioned on x
• specialized loss functions
3.1 Cost Functions
 Learning Conditional Distributions with Maximum
Likelihood
• Most modern neural networks are trained using maximum
likelihood
• the cost function is simply the negative log-likelihood, equivalently the
cross-entropy between the training data and the model distribution
– removes the burden of designing cost functions for each model
– model : p_model(y | x)
– cost function : J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)
3.1 Cost Functions
 Learning Conditional Statistics
• Instead of learning a full probability distribution, we often want to
learn just one conditional statistic
– ex) the mean of y given x
• two results derived using the calculus of variations (19.4.2)
– mean squared error : solving f* = argmin_f E_{x,y∼p_data} ||y − f(x)||²
yields f*(x) = E_{y∼p_data(y|x)}[y], the mean of y for each x
– mean absolute error : minimizing E_{x,y∼p_data} ||y − f(x)||₁ yields a
function that predicts the median of y for each x
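A quick numeric check of the two results (assuming NumPy; the sample values are
arbitrary): over a grid of constant predictions, the squared error is minimized
at the mean of y and the absolute error at the median.

    import numpy as np

    y = np.array([0.0, 1.0, 1.0, 10.0])      # a small sample with an outlier
    grid = np.linspace(-5.0, 15.0, 20001)    # candidate constant predictions

    mse = ((y[None, :] - grid[:, None]) ** 2).mean(axis=1)
    mae = np.abs(y[None, :] - grid[:, None]).mean(axis=1)

    print(grid[mse.argmin()], y.mean())      # both 3.0 : squared error -> mean
    print(grid[mae.argmin()], np.median(y))  # both 1.0 : absolute error -> median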
3.2 Output Units
 The choice of cost function is tightly coupled with the
choice of output unit
 suppose that the feedforward network provides a set of hidden features
h = f(x; θ)
 the role of the output layer is then to provide some additional
transformation from the features to complete the task
3.2 Output Units
 Linear Units for Gaussian Output Distributions
• Given features h, a layer of linear output units produces a vector
ŷ = Wᵀh + b
• Linear output layers are often used to produce the mean of a
conditional Gaussian distribution : p(y | x) = N(y; ŷ, I)
• Maximizing the log-likelihood is then equivalent to minimizing the
mean squared error
• may be used with a wide variety of gradient-based optimization
algorithms (linear units do not saturate)
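A small numeric check of the equivalence (assuming NumPy): the unit-variance
Gaussian negative log-likelihood differs from half the squared error only by a
constant that does not depend on ŷ, so both objectives have the same minimizer.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=3)      # targets
    y_hat = rng.normal(size=3)  # predicted conditional mean

    nll = 0.5 * np.sum((y - y_hat) ** 2) + 0.5 * y.size * np.log(2 * np.pi)
    half_sse = 0.5 * np.sum((y - y_hat) ** 2)

    print(nll - half_sse)  # (m/2) * log(2*pi), independent of y_hat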
3.2 Output Units
 Sigmoid Units for Bernoulli Output Distributions
• for predicting the value of a binary variable
– ex) classification problems with two classes
• Bernoulli distribution
– the neural net needs to predict only P(y = 1 | x)
– for this number to be a valid probability, it must lie in the interval
[0, 1]
• linear unit with threshold : P(y = 1 | x) = max{0, min{1, wᵀh + b}}
– any time that wᵀh + b strayed outside the unit interval, the
gradient of the output of the model with respect to its parameters would be 0,
so gradient descent could not improve the model
3.2 Output Units
 Sigmoid Units for Bernoulli Output Distributions
• sigmoid output units combined with maximum likelihood
– sigmoid output unit : ŷ = σ(z), where z = wᵀh + b
– probability distributions :
 assumption : the unnormalized log probabilities are linear in y and z,
log P̃(y) = yz, so P̃(y) = exp(yz), and normalizing gives
P(y) = exp(yz) / Σ_{y′∈{0,1}} exp(y′z) = σ((2y − 1)z)
3.2 Output Units
 Sigmoid Units for Bernoulli Output Distributions
• cost function with sigmoid approach
– the cost function used with maximum likelihood is J(θ) = −log P(y | x)
– loss function : J(θ) = −log σ((2y − 1)z) = ζ((1 − 2y)z), where ζ is the
softplus function
– this saturates only when (1 − 2y)z is very negative, i.e. when the model
already has the right answer, so gradient-based learning remains available
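A minimal sketch of this loss in the softplus form J = ζ((1 − 2y)z), assuming
NumPy; the stable softplus avoids overflow for large |z|:

    import numpy as np

    def softplus(a):
        # log(1 + exp(a)) computed without overflow for large |a|
        return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

    def bernoulli_nll(z, y):
        # z : logits w^T h + b, y : labels in {0, 1}
        return softplus((1.0 - 2.0 * y) * z)

    z = np.array([-30.0, -2.0, 0.0, 2.0, 30.0])
    y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
    print(bernoulli_nll(z, y))  # large only where the prediction is confidently wrong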
3.2 Output Units
 Softmax Units for Multinoulli Output Distributions
• Multinoulli
– generalization of the Bernoulli distribution
– probability distribution over a discrete variable with n possible
values
• Softmax function : softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
– often used as the output of a classifier, to represent the probability
distribution over the n classes
3.2 Output Units
 Softmax Units for Multinoulli Output Distributions
• log softmax function : log softmax(z)_i = z_i − log Σ_j exp(z_j)
– first term : the input z_i always has a direct contribution to the cost
function (this term cannot saturate)
– second term : log Σ_j exp(z_j) is roughly approximated by max_j z_j
– the negative log-likelihood cost function always strongly penalizes the
most active incorrect prediction
 for the correct answer, if it already has the largest input, the two terms
roughly cancel and the example contributes little to the overall cost
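A minimal sketch of the log softmax (assuming NumPy), using the usual
max-subtraction so exp never overflows; the printout also shows the
"second term roughly max_j z_j" approximation on a spread-out z:

    import numpy as np

    def log_softmax(z):
        z = z - z.max()  # shifting by a constant does not change the result
        return z - np.log(np.exp(z).sum())

    z = np.array([1.0, 2.0, 10.0])
    print(log_softmax(z))                    # cost for class i is -log_softmax(z)[i]
    print(np.log(np.exp(z).sum()), z.max())  # ~10.0003 vs 10.0 : log-sum-exp ~ max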
3.2 Output Units
 Softmax Units for Multinoulli Output Distributions
• softmax saturation
– when the differences between input values become extreme
– an output softmax(z)_i saturates to 1
 when its input z_i is maximal and much greater than all of the other inputs
– it saturates to 0
 when its input is not maximal and the maximum is much greater
– if the loss function is not designed to compensate for softmax
saturation, it can cause similar difficulties for learning
3.2 Output Units
 Softmax Units for Multinoulli Output Distributions
• softmax argument : z
– can be produced in two different ways
– using a linear layer : z = Wᵀh + b
 this approach actually overparametrizes the distribution
 the constraint that the n outputs must sum to 1 means that only n−1
parameters are necessary
– alternatively, impose a requirement that one element of z be fixed
 example : the sigmoid unit is exactly a softmax
with a two-dimensional z where one element is fixed at 0 (see the check below)
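A small check of the example above (assuming NumPy): a sigmoid over a scalar z
gives the same probability as a softmax over the two-dimensional vector [z, 0].

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for z in (-3.0, 0.0, 2.5):
        print(sigmoid(z), softmax(np.array([z, 0.0]))[0])  # identical values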
3.2 Output Units
 Other Output Types
• In general, we can think of the neural network as representing a
function f(x; θ)
• the outputs of this function are not direct predictions of the value y
• instead, f(x; θ) = ω
provides the parameters for a distribution over y
3.2 Output Units
 Other Output Types
• example : variance
– learn the variance of a conditional Gaussian for y, given x
– the negative log-likelihood −log p(y; ω(x)) will provide a cost function
– with the appropriate terms to make the optimization procedure incrementally
learn the variance
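A minimal sketch of such a cost (assuming NumPy): the network is taken to
output a mean and a log-variance, the log parametrization being one common but
hypothetical choice to keep the variance positive, and the Gaussian negative
log-likelihood is used directly as the loss.

    import numpy as np

    def gaussian_nll(y, mean, log_var):
        # -log N(y; mean, exp(log_var)), dropping the constant 0.5 * log(2 * pi)
        return 0.5 * (log_var + (y - mean) ** 2 / np.exp(log_var))

    y = np.array([1.0, -0.5])
    mean = np.array([0.8, 0.0])     # hypothetical network output
    log_var = np.array([0.0, 1.0])  # hypothetical network output
    print(gaussian_nll(y, mean, log_var).sum())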
3.2 Output Units
 Other Output Types
• example : mixture density networks
– Neural networks with Gaussian mixtures as their output
– The neural network must have three outputs
 a vector defining the mixture components p(c = i | x)
 a matrix providing the means μ^(i)(x)
 a tensor providing the covariances Σ^(i)(x)
3.2 Output Units
 Other Output Types
• example : mixture density networks
– mixture components
 multinoulli distribution over the n different components
 can typically be obtained by a softmax over an n-dimensional vector
– means
 indicate the center associated with the i-th Gaussian component
 Learning these means with maximum likelihood is slightly more
complicated than learning the means of a distribution with only one
output mode
3.2 Output Units
 Other Output Types
• example : mixture density networks
– covariances
 specify the covariance matrix for each component i
 maximum likelihood is complicated by needing to assign partial
responsibility for each point to each mixture component
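A minimal sketch of the resulting cost for a scalar y (assuming NumPy; the
argument names are hypothetical stand-ins for the three network outputs): the
mixture weights come from a softmax over n logits, each component contributes
a Gaussian log density, and the negative log-likelihood combines them with a
log-sum-exp.

    import numpy as np

    def mdn_nll(y, logits, means, log_vars):
        # mixture weights p(c = i | x) come from a softmax over the n logits
        log_w = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        # per-component Gaussian log densities log N(y; mu_i, sigma_i^2)
        log_comp = -0.5 * (np.log(2 * np.pi) + log_vars
                           + (y - means) ** 2 / np.exp(log_vars))
        # negative log of the mixture density, via a stable log-sum-exp
        a = log_w + log_comp
        return -(a.max() + np.log(np.exp(a - a.max()).sum()))

    print(mdn_nll(0.3, np.array([0.0, 1.0]),
                  np.array([-1.0, 0.5]), np.array([0.0, -1.0])))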