Neural Networks


Machine Learning → Neural Networks → Human Brain → Neurons

[Figure: a neuron as an input-output transformation. Input spikes arriving at
the synapses produce EPSPs (Excitatory Post-Synaptic Potentials); the neuron's
output is a spike (a brief pulse).]
Human Learning
• Number of neurons:       ~10^10
• Connections per neuron:  ~10^4 to 10^5
• Neuron switching time:   ~0.001 second
• Scene recognition time:  ~0.1 second
• That leaves room for only ~100 sequential inference steps, which doesn't
  seem like much; the brain must rely on massively parallel computation
Machine Learning Abstraction
Artificial Neural Networks
• Typically, machine learning ANNs are very artificial, ignoring:
  – Time
  – Space
  – Biological learning processes
• More realistic neural models exist
  – Hodgkin & Huxley (1952) won a Nobel Prize for theirs (in 1963)
• Nonetheless, very artificial ANNs have been useful in many ML applications
Perceptrons
• The “first wave” in neural networks
• Big in the 1960s
  – McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962)
A single perceptron

[Figure: inputs x1 … x5 with weights w1 … w5, plus a bias input x0 (= 1,
always) with weight w0, feed a threshold unit:]

    o = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0
Logical Operators

[Figure: threshold units (o = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0) computing
logical functions:
  AND: bias weight w0 = -0.8, input weights w1 = 0.5, w2 = 0.5
  OR:  bias weight w0 = -0.3, input weights w1 = 0.5, w2 = 0.5
  NOT: bias weight w0 = 0.5, input weight w1 = -1.0]
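These units can be checked directly. A small Python sketch of the threshold
unit with the weights above (the NOT unit's bias weight of 0.5 is my reading
of the diagram; function and variable names are illustrative):

    def threshold(weights, inputs):
        # Output 1 if sum_i w_i x_i > 0 (with x0 = 1 as the bias input), else 0.
        x = [1.0] + list(inputs)
        return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

    AND_W = [-0.8, 0.5, 0.5]   # bias weight w0, then w1, w2 (from the slide)
    OR_W  = [-0.3, 0.5, 0.5]
    NOT_W = [0.5, -1.0]        # bias weight assumed to be 0.5; w1 = -1.0

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, threshold(AND_W, [a, b]), threshold(OR_W, [a, b]))
    print([threshold(NOT_W, [a]) for a in (0, 1)])   # -> [1, 0]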
Perceptron Hypothesis Space
• What’s the hypothesis space for a perceptron on n inputs?

    H = { w | w ∈ ℝ^{n+1} }
Learning Weights
• Perceptron Training Rule
• Gradient Descent
• (other approaches: Genetic Algorithms)

[Figure: a threshold unit (o = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0) on inputs
x0, x1, x2 with unknown weights ("?"). How do we learn them?]
Perceptron Training Rule
• Weights modified for each example
• Update Rule (sketched in code below):

    w_i ← w_i + Δw_i
    where Δw_i = η (t − o) x_i

  η: learning rate, t: target value, o: perceptron output, x_i: input value
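A runnable Python sketch of this rule (the function names, the random
initialization range, and the default η and epoch count are my choices, not
from the slides):

    import random

    def threshold_output(w, inputs):
        # o = 1 if sum_i w_i x_i > 0 (x0 = 1 is the bias input), else 0
        x = [1.0] + list(inputs)
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

    def train_perceptron(examples, n_inputs, eta=0.1, epochs=50):
        # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i,
        # applied once per training example.
        w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
        for _ in range(epochs):
            for inputs, t in examples:
                o = threshold_output(w, inputs)
                x = [1.0] + list(inputs)
                for i in range(len(w)):
                    w[i] += eta * (t - o) * x[i]
        return w

    # e.g. learning OR (linearly separable):
    # train_perceptron([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)],
    #                  n_inputs=2)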
What weights make XOR?

[Figure: a single threshold unit (o = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0) on
x0, x1, x2 with unknown weights ("?").]

• No combination of weights works
• Perceptrons can only represent linearly separable functions
Linear Separability

[Figure: positive and negative points for OR in the (x1, x2) plane; a single
line separates them.]

Linear Separability

[Figure: positive and negative points for AND in the (x1, x2) plane; a single
line separates them.]

Linear Separability

[Figure: positive and negative points for XOR in the (x1, x2) plane; no single
line can separate them.]
Perceptron Training Rule
• Converges to the correct classification IF
  – Cases are linearly separable
  – Learning rate is slow enough
  – Proved by Minsky and Papert in 1969
• Their analysis of perceptrons' limitations killed widespread interest in
  perceptrons till the '80s
XOR

[Figure: XOR built from three threshold units
(o = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0): two hidden units combine x1 and x2
with weights ±0.6, and their outputs feed a third unit with weights 1, which
produces XOR.]
What’s wrong with perceptrons?
• You can always plug multiple perceptrons together to calculate any function.
• BUT… who decides what the weights are?
  – Assignment of error to parent inputs becomes a problem…
  – This is because of the threshold…
• Who contributed the error?
Problem: Assignment of error

[Figure: a perceptron threshold (step function) unit on x0, x1, x2 with
unknown weights ("?").]

• Hard to tell from the output who contributed what
• Stymies multi-layer weight learning
Solution: Differentiable Function

[Figure: the same unit on x0, x1, x2, but with a simple linear output
function:]

    o = Σ_{i=0}^{n} w_i x_i

• Varying any input a little creates a perceptible change in the output
• This lets us propagate error to prior nodes
Measuring error for linear units
• Output Function:

    o(x) = w · x

• Error Measure:

    E(w) = (1/2) Σ_{d∈D} (t_d − o_d)^2

  where D is the training data, t_d the target value, and o_d the linear
  unit's output on example d
Gradient Descent
• Gradient:

    ∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d∈D} (t_d − o_d)^2 ]
            = Σ_{d∈D} (t_d − o_d)(−x_{i,d})

• Gradient Descent training rule (sketched in code below):

    w_i ← w_i + η Σ_{d∈D} (t_d − o_d) x_{i,d}
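A Python sketch of batch gradient descent for a single linear unit, following
the update rule above (the function name and the default η and epoch count
are illustrative choices):

    def train_linear_unit(examples, n_inputs, eta=0.01, epochs=500):
        # Batch gradient descent for a linear unit o = w . x, minimizing
        # E(w) = 1/2 * sum_{d in D} (t_d - o_d)^2.
        w = [0.0] * (n_inputs + 1)                  # w[0] is the bias weight
        for _ in range(epochs):
            delta = [0.0] * len(w)
            for inputs, t in examples:
                x = [1.0] + list(inputs)
                o = sum(wi * xi for wi, xi in zip(w, x))
                for i in range(len(w)):
                    delta[i] += eta * (t - o) * x[i]   # accumulate over all of D
            for i in range(len(w)):
                w[i] += delta[i]                       # one update per epoch
        return w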
Back to XOR

[Figure: the two-layer XOR network of threshold units
(o = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0) again.]
Gradient Descent for Multiple Layers

[Figure: the same two-layer XOR network, but with linear units
(o = Σ_{i=0}^{n} w_i x_i) and weights w_ij.]

We can compute:

    ∂E/∂w_ij
Gradient Descent vs. Perceptrons
• Perceptron Rule & Threshold Units
  – Learner converges on an answer ONLY IF data is linearly separable
  – Can’t assign proper error to parent nodes
• Gradient Descent
  – Minimizes error even if examples are not linearly separable
  – Works for multi-layer networks
    • But… linear units only make linear decision surfaces (can’t learn XOR
      even with many layers)
  – And the step function isn’t differentiable…
A compromise function
• Perceptron:

    output = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0

• Linear:

    output = net = Σ_{i=0}^{n} w_i x_i

• Sigmoid (Logistic):

    output = σ(net) = 1 / (1 + e^(−net))
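A minimal Python sketch of a sigmoid unit (names are illustrative):

    import math

    def sigmoid(net):
        # sigma(net) = 1 / (1 + e^(-net))
        return 1.0 / (1.0 + math.exp(-net))

    def sigmoid_unit(w, inputs):
        # o = sigma(sum_i w_i x_i), with x0 = 1 as the bias input
        x = [1.0] + list(inputs)
        return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))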
The sigmoid (logistic) unit
• Has a differentiable activation function
  – Allows gradient descent
• Can be used to learn non-linear functions

[Figure: a sigmoid unit on inputs x1 and x2 with unknown weights, computing
σ(Σ_{i=0}^{n} w_i x_i) = 1 / (1 + e^(−Σ w_i x_i)).]
Logistic function

[Figure: a single logistic unit as a model. The independent variables
(inputs) Age = 34, Gender = 1, and Stage = 4 are combined with coefficients
.5, .4, and .8 and passed through σ(Σ_{i=0}^{n} w_i x_i) = 1 / (1 + e^(−net));
the output, 0.6, is the prediction: the "Probability of being Alive".]
Neural Network Model

[Figure: a two-layer network. The independent variables Age = 34, Gender = 2,
and Stage = 4 feed two hidden sigmoid units through a first set of weights
(values such as .6, .2, .4, .5, .1, .2, .3, .7 in the diagram); the hidden
outputs feed a final sigmoid unit through a second set of weights (.8, .2),
producing the prediction 0.6, the dependent variable "Probability of being
Alive".]
Getting an answer from a NN

[Figure: the example network again, tracing one hidden unit. Its weighted
inputs (weights .6, .1, .7) are summed and squashed, and its output feeds the
output unit with weight .8.]
Getting an answer from a NN

[Figure: the same computation traced through the second hidden unit and its
weights.]
Getting an answer from a NN

[Figure: the full forward pass. Both hidden sigmoid units are computed from
the weighted inputs, and their outputs are combined by the output unit to give
the prediction 0.6, the "Probability of being Alive".]
Minimizing the Error

[Figure: an error surface over a weight w. Starting at w_initial with the
initial error, the negative derivative gives a positive change in w, moving
the weight toward w_trained, where the final error sits at a local minimum.]
Differentiability is key!
• Sigmoid is easy to differentiate:

    ∂σ(y)/∂y = σ(y) · (1 − σ(y))

• For gradient descent on multiple layers, a little dynamic programming can
  help:
  – Compute errors at each output node
  – Use these to compute errors at each hidden node
  – Use these to compute errors at each input node
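A quick numerical check of this derivative identity (Python sketch; the test
point and tolerance are arbitrary choices):

    import math

    def sigmoid(y):
        return 1.0 / (1.0 + math.exp(-y))

    def sigmoid_derivative(y):
        # d sigma / dy = sigma(y) * (1 - sigma(y))
        s = sigmoid(y)
        return s * (1 - s)

    # Finite-difference check at y = 0.3:
    eps = 1e-6
    approx = (sigmoid(0.3 + eps) - sigmoid(0.3 - eps)) / (2 * eps)
    print(abs(approx - sigmoid_derivative(0.3)) < 1e-6)   # True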
The Backpropagation Algorithm

For each training example ⟨x, t⟩:

1. Input instance x to the network and compute the output o_u for every unit
   u in the network
2. For each output unit k, calculate its error term δ_k:

    δ_k = o_k (1 − o_k)(t_k − o_k)

3. For each hidden unit h, calculate its error term δ_h:

    δ_h = o_h (1 − o_h) Σ_{k∈outputs} w_{hk} δ_k

4. Update each network weight w_ji:

    w_ji ← w_ji + η δ_j x_ji
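A compact Python sketch of this algorithm for a network with one sigmoid
hidden layer and a single sigmoid output unit, using stochastic (per-example)
updates as in the steps above. The initialization range, learning rate, and
epoch count are assumptions:

    import math, random

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def train_backprop(examples, n_in, n_hidden, eta=0.5, epochs=5000):
        # 'examples' is a list of (inputs, target) pairs.
        rnd = lambda: random.uniform(-0.5, 0.5)
        w_hid = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        w_out = [rnd() for _ in range(n_hidden + 1)]     # index 0 is the bias

        for _ in range(epochs):
            for inputs, t in examples:
                # 1. Forward pass: compute every unit's output.
                x = [1.0] + list(inputs)
                h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x)))
                             for row in w_hid]
                o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)))

                # 2. Output error term: delta_o = o(1 - o)(t - o).
                d_o = o * (1 - o) * (t - o)
                # 3. Hidden error terms: delta_h = o_h(1 - o_h) * w_h,out * delta_o.
                d_h = [h[j + 1] * (1 - h[j + 1]) * w_out[j + 1] * d_o
                       for j in range(n_hidden)]

                # 4. Update all weights: w_ji <- w_ji + eta * delta_j * x_ji.
                for j in range(n_hidden + 1):
                    w_out[j] += eta * d_o * h[j]
                for j in range(n_hidden):
                    for i in range(n_in + 1):
                        w_hid[j][i] += eta * d_h[j] * x[i]
        return w_hid, w_out

    # e.g. XOR with two hidden units:
    # train_backprop([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)],
    #                n_in=2, n_hidden=2)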
Getting an answer from a NN

[Figure: the full worked example network repeated; the forward pass produces
the predicted "Probability of being Alive".]
Expressive Power of ANNs
• Universal Function Approximator:
  – Given enough hidden units, can approximate any continuous function f
• Need 2+ hidden units to learn XOR
• Why not use millions of hidden units?
  – Efficiency (neural network training is slow)
  – Overfitting
Overfitting

[Figure: the real distribution vs. an overfitted model.]
Combating Overfitting in Neural Nets
• Many techniques
• Two popular ones:
– Early Stopping
• Use “a lot” of hidden units
• Just don’t over-train
– Cross-validation
• Choose the “right” number of hidden units
Early Stopping

[Figure: error vs. training epochs for an overfitted model. The training-set
error (error_b, b = training set) keeps falling, while the validation-set
error (error_a, a = validation set) reaches a minimum and then rises; the
stopping criterion is min(error_a).]
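A Python sketch of this stopping criterion (train_one_epoch and
validation_error are hypothetical caller-supplied callables, and the patience
parameter is an assumption, not something from the slides):

    def train_with_early_stopping(train_one_epoch, validation_error,
                                  max_epochs=1000, patience=10):
        # Train epoch by epoch; stop once validation error has not improved
        # for 'patience' consecutive epochs.
        best_err, best_epoch = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch()
            err = validation_error()
            if err < best_err:
                best_err, best_epoch = err, epoch
            elif epoch - best_epoch >= patience:
                break                      # stopping criterion reached
        return best_err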
Cross-validation
• Cross-validation: general-purpose technique for model selection
  – E.g., “how many hidden units should I use?”
• More extensive version of the validation-set approach
Cross-validation
• Break training set into k sets
• For each model M
  – For i = 1…k
    • Train M on all but set i
    • Test on set i
• Output M with highest average test score, trained on full training set
  (sketched in code below)
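A Python sketch of this procedure (train_fn and score_fn are hypothetical
callables supplied by the caller; higher scores are assumed to be better):

    def cross_validate(train_fn, score_fn, data, k=5):
        # Break the data into k folds; train on all but fold i, test on fold i,
        # and return the average test score.
        folds = [data[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            scores.append(score_fn(train_fn(train), test))
        return sum(scores) / k

    # Model selection: run cross_validate once per candidate (e.g. per number
    # of hidden units), pick the best, then retrain it on the full training set.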
Summary of Neural Networks
Non-linear regression technique that is trained
with gradient descent.
Question: How important is the biological
metaphor?
Summary of Neural Networks
When are Neural Networks useful?
– Instances represented by attribute-value pairs
• Particularly when attributes are real valued
– The target function is
• Discrete-valued
• Real-valued
• Vector-valued
– Training examples may contain errors
– Fast evaluation times are necessary
When not?
– Fast training times are necessary
– Understandability of the function is required
Advanced Topics in Neural Nets
• Batch Mode vs. incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network language models
Incremental vs. Batch Mode
• In Batch Mode we minimize the error over the entire training set D
  (the contrast with incremental mode is sketched in code below)
• Same as computing:

    Δw_D = Σ_{d∈D} Δw_d

• Then setting:

    w ← w + Δw_D
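A small Python sketch contrasting the two modes. Here delta_w is a
hypothetical function returning the per-example update direction (e.g.
(t_d − o_d)·x_d for a linear unit); none of these names come from the slides:

    def incremental_epoch(w, examples, delta_w, eta):
        # Incremental mode: apply each example's update as soon as it is computed.
        for d in examples:
            g = delta_w(w, d)
            w = [wi + eta * gi for wi, gi in zip(w, g)]
        return w

    def batch_epoch(w, examples, delta_w, eta):
        # Batch mode: sum the per-example updates over all of D, then apply once.
        total = [0.0] * len(w)
        for d in examples:
            g = delta_w(w, d)
            total = [ti + gi for ti, gi in zip(total, g)]
        return [wi + eta * ti for wi, ti in zip(w, total)]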
Advanced Topics in Neural Nets
• Batch Mode vs. incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network language models
Hidden Layer Representations
• Input → Hidden Layer mapping:
  – representation of input vectors tailored to the task
• Can also be exploited for dimensionality reduction (a sketch follows this
  list)
  – Form of unsupervised learning in which we output a “more compact”
    representation of input vectors
  – <x1, …, xn> → <x'1, …, x'm> where m < n
  – Useful for visualization, problem simplification, data compression, etc.
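One concrete way to get such a compact representation is the identity-mapping
("autoencoder") setup: train the network to reproduce its own input through a
hidden layer of m < n units, then read off the hidden activations as the code.
A Python sketch under that assumption (the 8-3-8 example in the final comment
is the classic textbook illustration, not necessarily what the following
slides show):

    import math, random

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def train_autoencoder(patterns, n_hidden, eta=0.3, epochs=5000):
        # Train a network to reproduce its input through a small hidden layer;
        # the hidden activations become a compact code (m < n).
        n = len(patterns[0])
        rnd = lambda: random.uniform(-0.5, 0.5)
        w_in  = [[rnd() for _ in range(n + 1)] for _ in range(n_hidden)]
        w_out = [[rnd() for _ in range(n_hidden + 1)] for _ in range(n)]
        for _ in range(epochs):
            for p in patterns:
                x = [1.0] + list(p)
                h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x)))
                             for row in w_in]
                o = [sigmoid(sum(w * hj for w, hj in zip(row, h)))
                     for row in w_out]
                # Backpropagation with the input itself as the target.
                d_out = [o[k] * (1 - o[k]) * (p[k] - o[k]) for k in range(n)]
                d_hid = [h[j + 1] * (1 - h[j + 1]) *
                         sum(w_out[k][j + 1] * d_out[k] for k in range(n))
                         for j in range(n_hidden)]
                for k in range(n):
                    for j in range(n_hidden + 1):
                        w_out[k][j] += eta * d_out[k] * h[j]
                for j in range(n_hidden):
                    for i in range(n + 1):
                        w_in[j][i] += eta * d_hid[j] * x[i]
        return w_in, w_out

    # Classic 8-3-8 example: eight one-hot patterns squeezed through 3 hidden units.
    # patterns = [[1 if i == j else 0 for i in range(8)] for j in range(8)]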
Dimensionality Reduction

[Figure: the network model and the function to learn.]

Dimensionality Reduction: Example

[Figures: a worked example, shown across several slides.]
Advanced Topics in Neural Nets
• Batch Mode vs. incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network language models
Advanced Topics in Neural Nets
• Batch Mode vs. incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network language models
Neural Networks on Silicon
• Currently:

    Continuous device physics (voltage)
      → digital computational model (thresholding)
      → simulation of continuous device physics (neural networks)

  Why not skip the digital middle step?
Example: Silicon Retina
• Simulates the function of a biological retina
• Single-transistor synapses adapt to luminance and temporal contrast
• Modeling the retina directly on chip => requires 100x less power!
Example: Silicon Retina
• Synapses modeled with single transistors

[Figures: Luminance Adaptation; Comparison with Mammal Data (real vs.
artificial).]

• Graphics and results taken from:
General NN learning in silicon?
• Seems less in vogue than in the late '90s
• Interest has turned somewhat to implementing Bayesian techniques in analog
  silicon
Advanced Topics in Neural Nets
• Batch Mode vs. incremental
• Hidden Layer Representations
• Hopfield Nets
• Neural Networks on Silicon
• Neural Network language models
Neural Network Language Models
• Statistical Language Modeling:
  – Predict the probability of the next word in a sequence

    I was headed to Madrid, ____
    P(___ = “Spain”) = 0.5, P(___ = “but”) = 0.2, etc.

• Used in speech recognition, machine translation, and (recently) information
  extraction
Formally
• Estimate:

    P(w_j | w_{j−1}, w_{j−2}, …, w_{j−n+1}) = P(w_j | h_j)
Optimizations
• Key idea: learn simultaneously
  – vector representations for each word (50 dimensions)
  – a predictor of the next word
• Short-lists
  – Much of the complexity is in the hidden → output layer
    • The number of possible next words is large
  – Only predict a subset of words (sketched below)
    • Use a standard probabilistic model for the rest
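A Python sketch of the short-list idea. The softmax normalization and the
names hidden, out_weights, and shortlist are my assumptions for illustration,
not details given in the slides:

    import math

    def shortlist_probs(hidden, out_weights, shortlist):
        # Softmax over a short-list of candidate next words. 'hidden' is the
        # network's hidden-layer activation vector and out_weights[word] is the
        # output weight vector for that word.
        scores = {w: sum(h * v for h, v in zip(hidden, out_weights[w]))
                  for w in shortlist}
        z = sum(math.exp(s) for s in scores.values())
        return {w: math.exp(s) / z for w, s in scores.items()}

    # Words outside the short-list would fall back to a standard probabilistic
    # (e.g. n-gram) model, as described above.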
Design Decisions (1)
• Number of hidden units
• Almost no difference…
Design Decisions (2)
• Word representation (# of dimensions)
• They chose 120
Comparison vs. state of the art