Artificial Neural Network
Yalong Li
Some slides are from http://www.cs.cmu.edu/~tom/10701_sp11/slides/NNets701-3_24_2011_ann.pdf
Structure
• Motivation
• Artificial neural networks
• Learning: Backpropagation Algorithm
• Overfitting
• Expressive Capabilities of ANNs
• Summary
Some facts about our brain
• Performance tends to degrade gracefully under partial damage
• Learn (reorganize itself) from experience
• Recovery from damage is possible
• Performs massively parallel computations extremely efficiently
• For example, complex visual perception occurs in less than 100 ms; at a synaptic processing speed of about 100 Hz (~10 ms per step), that leaves time for only about 10 sequential processing steps!
• Supports our intelligence and self-awareness
Neural Networks in the Brain
• Cortex, midbrain, brainstem and cerebellum
• Visual System
• 10 or 11 processing stages have been identified
• feedforward: from earlier processing stages (near the sensory input) to later ones (near the motor output)
• feedback
Neurons and Synapses
• The basic computational unit in the nervous system is the nerve cell, or neuron.
Synaptic Learning
• One way the brain learns is by altering the strengths of connections between neurons, and by adding or deleting connections between neurons
• LTP (long-term potentiation):
• An enduring (>1 hour) increase in synaptic efficacy that results from high-frequency stimulation of an afferent (input) pathway
• The efficacy of a synapse can change as a result of experience, providing both memory and learning through long-term potentiation. One way this happens is through the release of more neurotransmitter.
• Hebb's Postulate:
• "When an axon of cell A ... excite[s] cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency as one of the cells firing B is increased."
• Points to note about LTP:
• Synapses become more or less important over time (plasticity)
• LTP is based on experience
• LTP is based only on local information (Hebb's postulate)
Brain → Computer?
Structure
• Motivation
• Artificial neural networks
• Backpropagation Algorithm
• Overfitting
• Expressive Capabilities of ANNs
• Summary
Artificial Neural Networks to learn f: X → Y
• f might be a non-linear function
• X (vector of) continuous and/or discrete vars
• Y (vector of) continuous and/or discrete vars
• Represent f by a network of logistic units
• Each unit is a logistic function (sketched below):
unit output = 1 / (1 + exp(−(w0 + Σi wi xi)))
• MLE: train weights of all units to minimize the sum of squared errors of the predicted network outputs
• MAP: train to minimize the sum of squared errors plus the weight magnitudes
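As a concrete illustration, a minimal sketch of a single logistic unit in Python; the function name, inputs, and weights are illustrative assumptions, not from the slides:

```python
import math

def logistic_unit(x, w, w0):
    """unit output = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))"""
    net = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-net))

# Example with hand-picked weights: net = -0.5 + 2.0*1.0 = 1.5
print(logistic_unit([1.0, 0.0], w=[2.0, -1.0], w0=-0.5))  # ~0.82
```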
Artificial Neural Networks to learn f: X → Y
f: x → y
• f(·) is:
• a nonlinear activation function, for classification
• the identity, for regression
• Each basis function φj(x) depends on parameters, and these parameters can be adjusted along with the coefficients {wj}
• The sigmoid function can be logistic or tanh
Artificial Neural Networks to learn f: X → Y
• aj: activations
• h(·): nonlinear hidden-unit function
• σ: output activation function, determined by the nature of the data and the assumed distribution of the target variables
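A hedged sketch of the two-layer forward pass these symbols describe, with tanh as h(·) and a logistic output σ; the shapes, seed, and weights are illustrative assumptions:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, sigma):
    a = W1 @ x + b1     # hidden activations a_j
    z = np.tanh(a)      # z_j = h(a_j)
    out = W2 @ z + b2   # output pre-activations a_k
    return sigma(out)   # y_k = sigma(a_k)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = forward(x, rng.normal(size=(3, 4)), np.zeros(3),
            rng.normal(size=(2, 3)), np.zeros(2),
            sigma=lambda a: 1 / (1 + np.exp(-a)))
```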
Artificial Neural Networks to learn f: X → Y
How to define σ?
• For standard regression, σ is the identity:
yk = ak
• For multiple binary classifications, each output unit activation is transformed using a logistic sigmoid function, so that:
yk = σ(ak)
σ(a) = 1/(1+exp(−a))
• For multiclass problems, a softmax activation of the form:
yk = exp(ak) / Σj exp(aj)
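A small sketch of that softmax output activation; subtracting max(a) before exponentiating is a standard numerical-stability trick, not something from the slides:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1
```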
Artificial Neural Networks to learn f: X → Y
Why that σ?
There is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved:
• Regression:
linear outputs, Error = sum-of-squares error
• (Multiple independent) binary classifications:
logistic sigmoid outputs, Error = cross-entropy error function
• Multiclass classification:
softmax outputs, Error = multiclass cross-entropy error function
Two classes?
In each of these pairings, the derivative of the error function with respect to the activation of a particular output unit takes the same simple form:
∂E/∂ak = yk − tk
A probabilistic interpretation of the network outputs is given in Pattern Recognition and Machine Learning, C. M. Bishop.
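A quick numerical check of this result for one sigmoid output with cross-entropy error E = −(t ln y + (1 − t) ln(1 − y)); the test point and finite-difference step are illustrative choices:

```python
import math

def E(a, t):
    y = 1 / (1 + math.exp(-a))  # y = sigma(a)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

a, t, eps = 0.7, 1.0, 1e-6
numeric = (E(a + eps, t) - E(a - eps, t)) / (2 * eps)
y = 1 / (1 + math.exp(-a))
print(numeric, y - t)  # both ~ -0.332, i.e. dE/da = y - t
```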
Multilayer Networks of Sigmoid Units
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4-5
• Scene recognition time ~0.1 second
• 100 inference steps doesn't seem like enough
→ much parallel computation
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
Structure
• Motivation
• Artificial neural networks
• Learning: Backpropagation Algorithm
• Overfitting
• Expressive Capabilities of ANNs
• Summary
Backpropagation Algorithm
• Looks for the minimum of the error function in weight space using the method of gradient descent.
• The combination of weights which minimizes the error function
is considered to be a solution of the learning problem.
Sigmoid unit
Error Gradient for a Sigmoid Unit
Gradient Descent
Incremental (Stochastic) Gradient Descent
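To contrast the two update schemes, a minimal sketch of batch versus incremental (stochastic) gradient descent for a single sigmoid unit with squared error; the data, learning rate, and iteration counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy separable labels
sigmoid = lambda a: 1 / (1 + np.exp(-a))
eta = 0.5

w = np.zeros(2)
for _ in range(100):
    # Batch: accumulate the gradient over all examples, then step once.
    o = sigmoid(X @ w)
    w += eta * ((t - o) * o * (1 - o)) @ X

w_sgd = np.zeros(2)
for _ in range(100):
    # Incremental (stochastic): update after every single example.
    for x_d, t_d in zip(X, t):
        o = sigmoid(x_d @ w_sgd)
        w_sgd += eta * (t_d - o) * o * (1 - o) * x_d
```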
Backpropagation Algorithm (MLE)
Derivation of the BP rule:
Error: Ed(w) = ½ Σk∈outputs (tk − ok)²
Goal: compute ∂Ed/∂wji for every weight
Notation: xji = the i-th input to unit j, wji = the corresponding weight, netj = Σi wji xji, oj = σ(netj)
Backpropagation Algorithm (MLE)
For output unit j:
δj ≡ −∂Ed/∂netj = (tj − oj) oj (1 − oj)
Backpropagation Algorithm (MLE)
For hidden unit j:
δj = oj (1 − oj) Σk∈downstream(j) δk wkj
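Putting the two δ rules together, a compact sketch of one backpropagation update for a two-layer sigmoid network; the shapes, learning rate, and names are illustrative assumptions:

```python
import numpy as np

sigmoid = lambda a: 1 / (1 + np.exp(-a))

def bp_step(x, t, W1, W2, eta=0.1):
    # Forward pass through sigmoid hidden and output layers.
    h = sigmoid(W1 @ x)
    o = sigmoid(W2 @ h)
    # Output units: delta_j = (t_j - o_j) o_j (1 - o_j)
    delta_o = (t - o) * o * (1 - o)
    # Hidden units: delta_j = o_j (1 - o_j) * sum_k delta_k w_kj
    delta_h = h * (1 - h) * (W2.T @ delta_o)
    # Gradient-descent updates: w_ji += eta * delta_j * x_ji
    W2 += eta * np.outer(delta_o, h)
    W1 += eta * np.outer(delta_h, x)
    return W1, W2
```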
More on Backpropagation
Structure
• Motivation
• Artificial neural networks
• Learning: Backpropagation Algorithm
• Overfitting
• Expressive Capabilities of ANNs
• Summary
Overfitting in ANNs
Dealing with Overfitting
K-Fold Cross Validation
Leave-One-Out Cross Validation
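A minimal sketch of k-fold cross-validation; the train and evaluate callbacks are hypothetical placeholders, and setting k = len(X) gives leave-one-out cross validation:

```python
import numpy as np

def k_fold_cv(X, t, k, train, evaluate):
    # Shuffle indices, then split them into k roughly equal folds.
    folds = np.array_split(np.random.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]  # hold out fold i for validation
        trn = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train(X[trn], t[trn])
        scores.append(evaluate(model, X[val], t[val]))
    return np.mean(scores)
```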
Structure
• Motivation
• Artificial neural networks
• Backpropagation Algorithm
• Overfitting
• Expressive Capabilities of ANNs
• Summary
Expressive Capabilities of ANNs
• Single Layer: Perceptron
• XOR problem
• 8-3-8 problem
Single Layer: Perceptron
• Representational Power of Perceptrons
a hyperplane decision surface in the n-dimensional space of instances:
w · x = 0
• Linearly separable sets
• Logical functions: AND, OR, ...
• How to learn w? (see the sketch below)
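One standard answer is the perceptron training rule, w ← w + η(t − o)x, applied example by example; a hedged sketch where the data, learning rate, and epoch count are illustrative:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50):
    w = np.zeros(X.shape[1] + 1)               # weights plus bias w0
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend constant input
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            o = 1.0 if w @ x > 0 else 0.0      # threshold unit output
            w += eta * (target - o) * x        # perceptron rule
    return w

# Learns logical AND, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = train_perceptron(X, np.array([0.0, 0.0, 0.0, 1.0]))
```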
Single Layer: Perceptron
• Nonlinear sets of examples?
Multi-layer perceptron: XOR
• z = y1 AND NOT y2
= (x1 OR x2) AND NOT (x1 AND x2)
= x1 XOR x2
Boundaries:
x1 + x2 − 0.5 = 0 (the OR unit y1)
x1 + x2 − 1.5 = 0 (the AND unit y2)
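The construction can be verified directly; a tiny sketch with hard-threshold units whose thresholds match the two boundaries above:

```python
def step(a):
    return 1 if a > 0 else 0

def xor(x1, x2):
    y1 = step(x1 + x2 - 0.5)    # OR:  fires when x1 + x2 > 0.5
    y2 = step(x1 + x2 - 1.5)    # AND: fires when x1 + x2 > 1.5
    return step(y1 - y2 - 0.5)  # y1 AND NOT y2

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 1, 0]
```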
Multi-layer perceptron
Expressive Capabilities of ANNs
Learning Hidden Layer Representations
• 8-3-8 problem: train a network with 8 inputs, 3 hidden units, and 8 outputs to reproduce each one-hot input at its output
Autoencoder?
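In effect, yes: the 8-3-8 network is a small autoencoder. A sketch of training it with the backpropagation updates from earlier; the learning rate, iteration count, and initialization are illustrative assumptions:

```python
import numpy as np

sigmoid = lambda a: 1 / (1 + np.exp(-a))
rng = np.random.default_rng(0)
X = np.eye(8)  # the eight one-hot inputs double as targets
W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(8, 3)), np.zeros(8)

for _ in range(5000):
    for x in X:
        h = sigmoid(W1 @ x + b1)   # 3 hidden units: the learned code
        o = sigmoid(W2 @ h + b2)   # 8 outputs, target = x
        d_o = (x - o) * o * (1 - o)
        d_h = h * (1 - h) * (W2.T @ d_o)
        W2 += 0.3 * np.outer(d_o, h); b2 += 0.3 * d_o
        W1 += 0.3 * np.outer(d_h, x); b1 += 0.3 * d_h

# Hidden codes tend toward distinct, roughly binary 3-bit patterns.
print(np.round(sigmoid(W1 @ X.T + b1[:, None]).T, 1))
```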
Training
Neural Nets for Face Recognition
Learning Hidden Layer Representations
Structure
• Motivation
• Artificial neural networks
• Backpropagation Algorithm
• Overfitting
• Expressive Capabilities of ANNs
• Summary
Summary
• Brain
• parallel computing
• hierarchical network
• Artificial Neural Network
• Mathematical expression
• Activation function 𝜎 selection
• Gradient Descent and BP
• Error back-propagation for hidden units
• Overfitting
• Expressive capabilities of ANNs
• Decision surfaces, function approximation, hidden layer representations
Thank you!