Deep Feedforward Networks
6 ~ 6.2
Kyunghyun Lee
2016. 10. 17
Table of Contents
1. Introduction
2. Example: Learning XOR
3. Gradient-Based Learning
   1. Cost Functions
   2. Output Units
1. Introduction
Deep Feedforward Networks
• A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation
• Feedforward: information flows from x to f(x); there are no feedback connections
• Networks: they are built by composing together many different functions
– e.g. a chain of three functions f(x) = f^(3)(f^(2)(f^(1)(x))), where f^(1) is the first layer, f^(2) is the second layer, and so on
1. Introduction
Deep Feedforward Networks
[Figure: a feedforward network drawn as a chain of layers: input layer, first layer (hidden layer), second layer (hidden layer), output layer; the length of this chain is the depth of the model]
1. Introduction
Deep Feedforward Networks
• Hidden layers
– the training examples specify directly what the output layer must do at each point x
– the behavior of the other layers (the hidden layers) is not directly specified by the training data
– the learning algorithm must decide how to use these layers to best implement an approximation of f*
• Each hidden layer of the network is typically vector-valued
– the dimensionality of these hidden layers determines the width of the model
2. Example: Learning XOR
Learning the XOR function
• target function: y = f*(x), the XOR function; model: y = f(x; θ); learning adapts the parameters θ to make f as similar as possible to f*
Use linear regression
• loss function (mean squared error): J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))²
x1  x2  f*(x)
0   0   0
0   1   1
1   0   1
1   1   0
• linear model: f(x; w, b) = x^T w + b
• learning result: solving the normal equations gives w = 0 and b = 1/2, so the model's output is 0.5 everywhere!
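This failure can be checked directly. Below is a minimal NumPy sketch (not from the slides) that solves the least-squares problem for the XOR data and shows that the best linear fit predicts 0.5 for every input.

import numpy as np

# Minimal sketch: least-squares linear regression on the XOR data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])

# Add a bias column and solve the normal equations for [w1, w2, b].
Xb = np.hstack([X, np.ones((4, 1))])
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(theta)        # approximately [0, 0, 0.5], i.e. w = 0 and b = 1/2
print(Xb @ theta)   # every prediction is 0.5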
2. Example: Learning XOR
Add a different feature space : h space
• model
– h = f^(1)(x; W, c), y = f^(2)(h; w, b)
– complete model: f(x; W, c, w, b) = f^(2)(f^(1)(x))
2. Example: Learning XOR
Add a different feature space : h space
• we must use a nonlinear function to describe the features; if f^(1) were linear, the network as a whole would remain a linear function of its input
• most neural networks use an affine transformation followed by a fixed nonlinear activation function: h = g(W^T x + c)
• activation function: rectified linear unit, or ReLU, g(z) = max{0, z}
2. Example: Learning XOR
Add a different feature space : h space
• solution
– let W = [[1, 1], [1, 1]], c = [0, −1]^T, w = [1, −2]^T, b = 0
– calculation on the design matrix X (one example per row):
XW = [[0, 0], [1, 1], [1, 1], [2, 2]]; adding c gives [[0, −1], [1, 0], [1, 0], [2, 1]]
applying the activation function (ReLU) gives [[0, 0], [1, 0], [1, 0], [2, 1]]
multiplying by w gives [0, 1, 1, 0]^T, the correct XOR outputs
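The same calculation can be reproduced numerically. A minimal NumPy sketch (not from the slides) using the hand-specified parameters above:

import numpy as np

# Minimal sketch: the hand-specified ReLU network that solves XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.0

h = np.maximum(0.0, X @ W + c)   # hidden features: ReLU(XW + c)
y_hat = h @ w + b                # linear output layer

print(y_hat)                     # [0. 1. 1. 0.], exactly the XOR targets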
2. Example: Learning XOR
Conclusion
• We simply specified the solution and showed that it achieves zero error
• In a real situation, there might be billions of model parameters and billions of training examples, so we cannot simply guess the solution as we did here
• Instead, a gradient-based optimization algorithm can find parameters that produce very little error
3. Gradient-Based Learning
Neural network vs Linear models
• Designing and training a neural network is not much different
from training any other machine learning model with gradient
descent
• The largest difference between the linear models we have seen
so far and neural networks is that the nonlinearity of a neural
network causes most interesting loss functions to become nonconvex
3. Gradient-Based Learning
Non-convex optimization
• no convergence guarantee (unlike convex optimization)
• sensitive to the values of the initial parameters
For feedforward neural networks
• important to initialize all weights to small random values
• biases may be initialized to zero or to small positive values
3. Gradient-Based Learning
Gradient descent
• we can train models such as linear regression and support vector machines with gradient descent as well
• to apply gradient-based learning to a neural network, we must choose a cost function
• we must also choose how to represent the output of the model
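As a concrete illustration, here is a minimal NumPy sketch (not from the slides) of gradient-based learning applied to the XOR network; the small random initial weights, zero biases, mean-squared-error cost, and fixed learning rate are assumptions made only for this example.

import numpy as np

# Minimal sketch: gradient descent on the two-layer ReLU network for XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])

W = 0.5 * rng.standard_normal((2, 2))   # small random initial weights
c = np.zeros(2)                         # biases initialized to zero
w = 0.5 * rng.standard_normal(2)
b = 0.0
lr = 0.1

for step in range(5000):
    a = X @ W + c
    h = np.maximum(0.0, a)              # ReLU hidden layer
    y_hat = h @ w + b                   # linear output layer
    err = (y_hat - y) / len(X)          # gradient of (1/2n) * sum of squared errors w.r.t. y_hat
    grad_w = h.T @ err
    grad_b = err.sum()
    dh = np.outer(err, w) * (a > 0.0)   # backpropagate through the ReLU
    grad_W = X.T @ dh
    grad_c = dh.sum(axis=0)
    W -= lr * grad_W; c -= lr * grad_c; w -= lr * grad_w; b -= lr * grad_b

# Predictions are often close to [0, 1, 1, 0], but the non-convex loss means the
# run may also settle in a poorer solution depending on the random seed.
print(np.round(np.maximum(0.0, X @ W + c) @ w + b, 3))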
3.1 Cost Functions
An important aspect of the design of a deep neural
network is the choice of the cost function
If the parametric model defines a distribution p(y | x; θ)
• use maximum likelihood, i.e. the cross-entropy between the training data and the model's predictions
If we instead predict only some statistic of y conditioned on x
• use specialized loss functions
3.1 Cost Functions
Learning Conditional Distributions with Maximum
Likelihood
• Most modern neural networks are trained using maximum
likelihood
• cost function is simply the negative log-likelihood, equivalently the cross-entropy between the training data and the model distribution
– J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)
– removes the burden of designing cost functions for each model
– example: model p_model(y | x) = N(y; f(x; θ), I)
cost function J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const, i.e. mean squared error
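A quick numeric check of the Gaussian example (not from the slides; it assumes SciPy is available): the negative log-likelihood differs from half the sum of squared errors only by a constant.

import numpy as np
from scipy.stats import norm

# Minimal sketch: with p_model(y|x) = N(y; f(x; theta), I), the negative
# log-likelihood is the squared error plus a constant.
rng = np.random.default_rng(0)
y = rng.standard_normal(5)          # targets
f_x = rng.standard_normal(5)        # stand-in for the model prediction f(x; theta)

nll = -np.sum(norm.logpdf(y, loc=f_x, scale=1.0))
half_sse = 0.5 * np.sum((y - f_x) ** 2)
const = 0.5 * len(y) * np.log(2 * np.pi)

print(np.isclose(nll, half_sse + const))   # True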
3.1 Cost Functions
Learning Conditional Statistics
• Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x
– ex) the mean of y
• two results derived using calculus of variations (Section 19.4.2)
– mean squared error: solving f* = argmin_f E_{x,y∼p_data} ||y − f(x)||² yields f*(x) = E_{y∼p_data(y|x)}[y], a function that predicts the conditional mean of y for each x
– mean absolute error: solving f* = argmin_f E_{x,y∼p_data} ||y − f(x)||_1 yields a function that predicts the conditional median of y for each x
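A small numeric illustration of the two results (not from the slides): for samples of y drawn from a skewed distribution, the constant prediction minimizing squared error is the mean, while the one minimizing absolute error is the median.

import numpy as np

# Minimal sketch: for a fixed x, the best constant prediction under MSE is the
# mean of y, and under MAE it is the median of y.
rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)               # a skewed "conditional" distribution of y

candidates = np.linspace(0, 3, 3001)
mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(mse)], y.mean())       # approximately equal: the mean minimizes MSE
print(candidates[np.argmin(mae)], np.median(y))   # approximately equal: the median minimizes MAE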
3.2 Output Units
The choice of cost function is tightly coupled with the
choice of output unit
Suppose that the feedforward network provides a set of hidden features h = f(x; θ).
The role of the output layer is then to provide some additional transformation from the features to complete the task.
3.2 Output Units
Linear Units for Gaussian Output Distributions
• Given features h, a layer of linear output units produces a vector ŷ = W^T h + b
• Linear output layers are often used to produce the mean of a conditional Gaussian distribution, p(y | x) = N(y; ŷ, I)
• Maximizing the log-likelihood is then equivalent to minimizing the mean squared error
• Because linear units do not saturate, they pose little difficulty for gradient-based optimization and may be used with a wide variety of optimization algorithms
3.2 Output Units
Sigmoid Units for Bernoulli Output Distributions
• for predicting the value of a binary variable
– ex) classification problems with two classes
• Bernoulli distribution over y conditioned on x
– the neural net needs to predict only P(y = 1 | x)
– for this number to be a valid probability, it must lie in the interval [0, 1]
• a linear unit with a threshold could do this: P(y = 1 | x) = max{0, min{1, w^T h + b}}
– but any time w^T h + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be zero, so gradient descent could make no progress
3.2 Output Units
Sigmoid Units for Bernoulli Output Distributions
• sigmoid output units combined with maximum likelihood
– sigmoid output unit: ŷ = σ(z) with z = w^T h + b
– probability distribution, under the assumption that the unnormalized log probabilities are linear in y and z:
log P̃(y) = yz, so P̃(y) = exp(yz) and P(y) = exp(yz) / Σ_{y'∈{0,1}} exp(y'z) = σ((2y − 1)z)
3.2 Output Units
Sigmoid Units for Bernoulli Output Distributions
• cost function with the sigmoid approach
– the cost function used with maximum likelihood is −log P(y | x)
– loss function: J(θ) = −log σ((2y − 1)z) = ζ((1 − 2y)z), where ζ(x) = log(1 + exp(x)) is the softplus
– the loss saturates only when (1 − 2y)z is very negative, i.e. when the model already has the right answer, so gradient-based learning remains effective
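A minimal sketch (not from the slides) of this loss computed through a numerically stable softplus; the helper names are illustrative only.

import numpy as np

# Minimal sketch: the sigmoid + maximum-likelihood loss J = softplus((1 - 2y) * z).
def softplus(x):
    # log(1 + exp(x)) without overflow for large x
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll(z, y):
    # z: logit w^T h + b, y: label in {0, 1}
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-10.0, 0.0, 10.0])
print(bernoulli_nll(z, 1.0))   # large loss for z = -10, ~0.69 for z = 0, ~0 for z = +10
print(bernoulli_nll(z, 0.0))   # ~0 for z = -10, large loss for z = +10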
3.2 Output Units
Softmax Units for Multinoulli Output Distributions
• Multinoulli
– generalization of the Bernoulli distribution
– probability distribution over a discrete variable with n possible
values
• Softmax function
– softmax(z)_i = exp(z_i) / Σ_j exp(z_j), where z_i can be interpreted as an unnormalized log probability log P̃(y = i | x)
– often used as the output of a classifier, to represent the probability distribution over the n classes
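A minimal sketch (not from the slides) of a numerically stable softmax: subtracting max(z) does not change the result, since softmax is invariant to adding a constant to every input, but it avoids overflow in exp.

import numpy as np

# Minimal sketch: numerically stable softmax.
def softmax(z):
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive exp(z) would overflow
print(softmax(z))                        # approximately [0.09, 0.245, 0.665], sums to 1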
3.2 Output Units
Softmax Units for Multinoulli Output Distributions
• log softmax function: log softmax(z)_i = z_i − log Σ_j exp(z_j)
– first term: the input z_i for the correct answer always has a direct contribution to the cost function and cannot saturate
– second term: log Σ_j exp(z_j) can be roughly approximated by max_j z_j
– so when the correct answer already has the largest input, the two terms roughly cancel; otherwise the negative log-likelihood cost function strongly penalizes the most active incorrect prediction
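A small numeric illustration (not from the slides) of the two terms of the negative log-likelihood and of the approximation log Σ_j exp(z_j) ≈ max_j z_j:

import numpy as np

# Minimal sketch: -log softmax(z)_i = -z_i + logsumexp(z), and logsumexp(z) ~ max(z).
def log_softmax(z):
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([5.0, 1.0, -2.0])
print(np.log(np.sum(np.exp(z))), np.max(z))   # logsumexp(z) ~ 5.02 vs max(z) = 5

print(-log_softmax(z)[0])   # correct class already maximal: loss ~ 0.02 (terms roughly cancel)
print(-log_softmax(z)[1])   # the most active incorrect prediction dominates: loss ~ 4.02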
3.2 Output Units
Softmax Units for Multinoulli Output Distributions
• softmax saturation
– occurs when the differences between the input values become extreme
– softmax(z)_i saturates to 1 when the input z_i is maximal and much greater than all of the other inputs
– softmax(z)_i saturates to 0 when z_i is not maximal and the maximum is much greater
– if the loss function is not designed to compensate for softmax saturation (as the negative log-likelihood is), it can cause similar difficulties for learning
3.2 Output Units
Softmax Units for Multinoulli Output Distributions
• softmax argument z
– can be produced in two different ways
– (1) have a linear layer output z = W^T h + b
this approach actually overparametrizes the distribution: the constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary
– (2) impose a requirement that one element of z be fixed
example: the sigmoid unit corresponds to a softmax with a two-dimensional z in which one element is fixed to 0, since exp(z) / (exp(z) + exp(0)) = σ(z)
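A quick numeric check (not from the slides) that the sigmoid agrees with a two-way softmax whose second logit is fixed to 0:

import numpy as np

# Minimal sketch: the sigmoid is a two-way softmax with one logit fixed to 0.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

for z in (-3.0, 0.0, 2.5):
    p_softmax = softmax(np.array([z, 0.0]))[0]   # probability of the "y = 1" class
    print(np.isclose(sigmoid(z), p_softmax))     # True for every z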
3.2 Output Units
Other Output Types
• In general, we can think of the neural network as representing a function f(x; θ) = ω
• the outputs of this function are not direct predictions of the value y
• instead, ω provides the parameters for a distribution over y, and the loss can be interpreted as −log p(y; ω(x))
3.2 Output Units
Other Output Types
• example: variance
– learn the variance of a conditional Gaussian for y, given x
– the negative log-likelihood −log p(y; ω(x)) will provide a cost function with the appropriate terms to make the optimization procedure incrementally learn the variance
3.2 Output Units
Other Output Types
• example: mixture density networks
– neural networks with Gaussian mixtures as their output: p(y | x) = Σ_{i=1}^{n} p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x))
– the neural network must have three outputs
a vector defining the mixture components p(c = i | x)
a matrix providing the means μ^(i)(x)
a tensor providing the covariances Σ^(i)(x)
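A minimal sketch (not from the slides) of how a raw network output could be split into these three pieces for a scalar y with per-component variances; the function names and the exp parametrization of the variances are illustrative assumptions, not from the text.

import numpy as np

# Minimal sketch: parametrizing a Gaussian mixture over a scalar y from one raw vector.
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mdn_params(raw, n):
    # raw: unconstrained output vector of length 3n from the last layer
    weights = softmax(raw[:n])          # mixture components p(c = i | x), positive and summing to 1
    means = raw[n:2 * n]                # component means, unconstrained
    variances = np.exp(raw[2 * n:])     # variances must be positive
    return weights, means, variances

def mixture_nll(y, weights, means, variances):
    # negative log-likelihood of y under the mixture
    comp = weights * np.exp(-0.5 * (y - means) ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    return -np.log(np.sum(comp))

raw = np.random.default_rng(0).standard_normal(9)   # pretend network output, n = 3
w, m, v = mdn_params(raw, 3)
print(w.sum(), mixture_nll(0.5, w, m, v))            # weights sum to 1; a finite loss value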
3.2 Output Units
Other Output Types
• example: mixture density networks
– mixture components p(c = i | x)
form a multinoulli distribution over the n different components
can typically be obtained by a softmax over an n-dimensional vector, guaranteeing that they are positive and sum to 1
– means μ^(i)(x)
indicate the center associated with the i-th Gaussian component
learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode
3.2 Output Units
Other Output Types
• example: mixture density networks
– covariances Σ^(i)(x)
specify the covariance matrix for each component i
maximum likelihood is complicated by the need to assign partial responsibility for each point to each mixture component