Transcript: Neural Networks

Neural Networks
ICS 273A UC Irvine
Instructor: Max Welling
Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built the first abstract model of a neuron.
$y = g\Big(\sum_i W_i x_i + b\Big)$

($y$: output, $g$: activation function, $x_i$: inputs, $W_i$: weights, $b$: bias)
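As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this abstract neuron, assuming a sigmoid activation for $g$; the names and values are my own:

```python
import numpy as np

def sigmoid(a):
    """Logistic activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron_output(x, W, b, g=sigmoid):
    """Abstract neuron: y = g(sum_i W_i x_i + b)."""
    return g(np.dot(W, x) + b)

# Example: three inputs with hand-picked weights and bias.
x = np.array([1.0, 0.0, -0.5])
W = np.array([0.4, -0.2, 0.7])
b = -0.1
print(neuron_output(x, W, b))  # a value between 0 and 1
```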
Neurons
• We have about $10^{11}$ neurons, each one connected to about $10^4$ other neurons on average.
• Each neuron needs at least $10^{-3}$ seconds to transmit a signal.
• So we have many, slow neurons. Yet we recognize our grandmother in about $10^{-1}$ seconds.
• Computers have much faster switching times: about $10^{-10}$ seconds.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.
Classification & Regression

[Figure: single-layer network with inputs $x_j$, weights $W_{ij}$, and outputs $f_i(x)$]
• Neural nets are a parameterized function Y=f(X;W) from inputs (X) to outputs (Y).
• If Y is continuous: regression, if Y is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the
model predictions.
$\text{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_\text{out}} \Big( y_{in} - \sum_{j=1}^{d_\text{in}} W_{ij} x_{jn} - b_i \Big)^2$
• This is just a perceptron with a quadratic cost function.
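A minimal NumPy sketch of this error function (illustrative only; the variable names are my own), stacking the $N$ data cases as columns of X and Y:

```python
import numpy as np

def squared_error(X, Y, W, b):
    """Sum-of-squares error of a single linear layer.

    X: (d_in, N) inputs, Y: (d_out, N) targets,
    W: (d_out, d_in) weights, b: (d_out,) biases.
    """
    predictions = W @ X + b[:, None]   # shape (d_out, N)
    return np.sum((Y - predictions) ** 2)
```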
Optimization
• We use stochastic gradient descent: pick a single data-item, compute the
contribution of that data-point to the overall gradient and update
the weights.
Stochastic Gradient Descent

[Figure: stochastic updates vs. full updates (averaged over all data-items)]
• Stochastic gradient descent does not converge to the minimum, but “dances”
around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them
(similar to bagging).
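A sketch of one stochastic-gradient step for the linear model above (my own illustrative code; the learning rate eta and its value are assumptions, and in practice it would be decreased over time as described above):

```python
import numpy as np

def sgd_step(x_n, y_n, W, b, eta=0.01):
    """Update W, b on a single data-item (x_n, y_n) for the squared error.

    The gradient of (y_n - W x_n - b)^2 w.r.t. W is -2 (y_n - W x_n - b) x_n^T,
    so gradient descent adds 2 * eta * residual * x_n^T.
    """
    residual = y_n - (W @ x_n + b)           # shape (d_out,)
    W += 2 * eta * np.outer(residual, x_n)
    b += 2 * eta * residual
    return W, b
```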
Multi-Layer Nets
$\hat{y}_i = g\Big(\sum_j W^3_{ij} h^2_j + b^3_i\Big)$
$h^2_i = g\Big(\sum_j W^2_{ij} h^1_j + b^2_i\Big)$
$h^1_i = g\Big(\sum_j W^1_{ij} x_j + b^1_i\Big)$

[Figure: layered network $x \rightarrow (W^1,b^1) \rightarrow h^1 \rightarrow (W^2,b^2) \rightarrow h^2 \rightarrow (W^3,b^3) \rightarrow y$]
• Single layers can only do linear things. If we want to learn non-linear
decision surfaces, or non-linear regression curves, we need more than
one layer.
• In fact, a NN with one hidden layer can approximate any boolean function and any continuous function.
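A minimal sketch of the forward pass for this three-layer network, again assuming sigmoid activations (the function and variable names are my own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, params, g=sigmoid):
    """Forward pass through a layered net.

    params = [(W1, b1), (W2, b2), (W3, b3)]; returns [h1, h2, y_hat].
    """
    activations = []
    h = x
    for W, b in params:
        h = g(W @ h + b)
        activations.append(h)
    return activations
```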
Back-propagation
• How do we learn the weights of a multi-layer network?
Answer: stochastic gradient descent. But now the gradients are harder to compute!
$\text{error} = \sum_{i,n} \Big[\, y_{in}\log h^3_{in} + (1-y_{in})\log(1-h^3_{in}) \,\Big]$

$\dfrac{d\,\text{error}_n}{d h^3_{in}} = \dfrac{y_{in}}{h^3_{in}} - \dfrac{1-y_{in}}{1-h^3_{in}}$

Chain rule for a middle-layer weight $W^2_{jk}$:

$\dfrac{d\,\text{error}}{d W^2_{jk}} = \sum_{i,n} \dfrac{d\,\text{error}_n}{d h^3_{in}}\; h^3_{in}(1-h^3_{in})\; \dfrac{d}{dW^2_{jk}}\Big(\sum_s W^3_{is} h^2_{sn} + b^3_i\Big)$

$= \sum_{i,n} \dfrac{d\,\text{error}_n}{d h^3_{in}}\; h^3_{in}(1-h^3_{in})\; W^3_{ij}\; \dfrac{d h^2_{jn}}{dW^2_{jk}}$

$= \sum_{i,n} \dfrac{d\,\text{error}_n}{d h^3_{in}}\; h^3_{in}(1-h^3_{in})\; W^3_{ij}\; h^2_{jn}(1-h^2_{jn})\; \dfrac{d}{dW^2_{jk}}\Big(\sum_s W^2_{js} h^1_{sn} + b^2_j\Big)$

$= \sum_{i,n} \dfrac{d\,\text{error}_n}{d h^3_{in}}\; h^3_{in}(1-h^3_{in})\; W^3_{ij}\; h^2_{jn}(1-h^2_{jn})\; h^1_{kn}$

$= \sum_{i,n} \dfrac{d\,\text{error}_n}{d h^3_{in}}\; h^3_{in}(1-h^3_{in})\; W^3_{ij}\; h^2_{jn}(1-h^2_{jn})\; \sigma\Big(\sum_l W^1_{kl} x_{ln} + b^1_k\Big)$
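The derivation above repeatedly uses the derivative of the logistic (sigmoid) activation; for reference, this standard identity (not spelled out on the slides) is where the factors $h(1-h)$ come from:

$h = \sigma(a) = \dfrac{1}{1+e^{-a}} \;\Longrightarrow\; \dfrac{dh}{da} = \sigma(a)\bigl(1-\sigma(a)\bigr) = h(1-h)$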
Back Propagation
Upward pass (compute the hidden units bottom-up):
$h^1_{in} = \sigma\Big(\sum_j W^1_{ij} x_{jn} + b^1_i\Big)$
$h^2_{in} = \sigma\Big(\sum_j W^2_{ij} h^1_{jn} + b^2_i\Big)$
$h^3_{in} = \sigma\Big(\sum_j W^3_{ij} h^2_{jn} + b^3_i\Big)$

Downward pass (propagate the errors top-down):
$\delta^3_{in} = h^3_{in}(1-h^3_{in}) \, \dfrac{d\,\text{error}_{in}}{d h^3_{in}}$
$\delta^2_{jn} = h^2_{jn}(1-h^2_{jn}) \sum_{i\,\text{upstream}} W^3_{ij}\, \delta^3_{in}$
$\delta^1_{kn} = h^1_{kn}(1-h^1_{kn}) \sum_{j\,\text{upstream}} W^2_{jk}\, \delta^2_{jn}$
Back Propagation
Deltas (from the downward pass):
$\delta^3_{in} = h^3_{in}(1-h^3_{in}) \, \dfrac{d\,\text{error}_{in}}{d h^3_{in}}$
$\delta^2_{jn} = h^2_{jn}(1-h^2_{jn}) \sum_{i\,\text{upstream}} W^3_{ij}\, \delta^3_{in}$
$\delta^1_{kn} = h^1_{kn}(1-h^1_{kn}) \sum_{j\,\text{upstream}} W^2_{jk}\, \delta^2_{jn}$

Gradient in terms of the deltas:
$\dfrac{d\,\text{error}}{d W^2_{jk}} = \sum_{i,n} \dfrac{d\,\text{error}_n}{d h^3_{in}}\; h^3_{in}(1-h^3_{in})\; W^3_{ij}\; h^2_{jn}(1-h^2_{jn})\; h^1_{kn} = \sum_n \delta^2_{jn}\, h^1_{kn}$

Stochastic updates (single data-item $n$, step-size $\eta$):
$W^2_{jk} \leftarrow W^2_{jk} + \eta\, \delta^2_{jn}\, h^1_{kn}$
$b^2_j \leftarrow b^2_j + \eta\, \delta^2_{jn}$
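A compact sketch of the full procedure on a single data-item (illustrative only; sigmoid activations and the learning rate eta are assumptions, and the "+" sign follows the slides' convention of ascending the log-likelihood "error"):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, params, eta=0.1):
    """One stochastic update of a 3-layer sigmoid net on (x, y).

    params = [(W1, b1), (W2, b2), (W3, b3)], updated in place.
    """
    # Upward pass: compute h1, h2, h3.
    hs, h = [], x
    for W, b in params:
        h = sigmoid(W @ h + b)
        hs.append(h)
    h1, h2, h3 = hs

    # Downward pass: deltas for each layer.
    d_error = y / h3 - (1 - y) / (1 - h3)            # d error / d h3
    delta3 = h3 * (1 - h3) * d_error
    delta2 = h2 * (1 - h2) * (params[2][0].T @ delta3)
    delta1 = h1 * (1 - h1) * (params[1][0].T @ delta2)

    # Updates: W <- W + eta * delta * (input to that layer)^T.
    inputs = [x, h1, h2]
    for (W, b), delta, inp in zip(params, [delta1, delta2, delta3], inputs):
        W += eta * np.outer(delta, inp)
        b += eta * delta
    return params
```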
ALVINN
Learning to drive a car
This hidden unit detects a mildly left-sloping road and advises steering left.
What might another hidden unit look like?
Weight Decay
• NN can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and growing them during learning.
• One can now check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights:
$W^2_{jk} \leftarrow W^2_{jk} + \eta\, \delta^2_{jn}\, h^1_{kn} - \lambda\, W^2_{jk}$
$b^2_j \leftarrow b^2_j + \eta\, \delta^2_{jn} - \lambda\, b^2_j$
• Now we need to set $\lambda$ using cross-validation.
• This is called “weight-decay” in NN jargon.
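The change relative to the plain update is a single shrinkage term toward zero; a minimal sketch (lambda_ is a hypothetical name for the decay strength, and its value is an assumption):

```python
import numpy as np

def weight_decay_update(W, b, delta, inp, eta=0.1, lambda_=1e-4):
    """Stochastic update with weight decay for one layer.

    delta: backpropagated delta for this layer, inp: the layer's input.
    """
    W += eta * np.outer(delta, inp) - lambda_ * W
    b += eta * delta - lambda_ * b
    return W, b
```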
Momentum
• In the beginning of learning it is likely that the weights are changed in a
consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent
changes. It’s like an adaptive stepsize.
• This idea is easily implemented by changing the gradient as follows:
$\Delta W^2_{jk}(\text{new}) = \eta\, \delta^2_{jn}\, h^1_{kn} + \gamma\, \Delta W^2_{jk}(\text{old})$
$W^2_{jk} \leftarrow W^2_{jk} + \Delta W^2_{jk}(\text{new})$
(and similarly for the biases)
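A sketch of the momentum update for one weight matrix (the momentum coefficient gamma and its value are assumptions):

```python
import numpy as np

def momentum_update(W, velocity, delta, inp, eta=0.1, gamma=0.9):
    """Accumulate a velocity of recent updates and apply it to W.

    velocity: the previous Delta-W (same shape as W); delta, inp as before.
    """
    velocity = eta * np.outer(delta, inp) + gamma * velocity
    W += velocity
    return W, velocity
```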
Conclusion
• NN are a flexible way to model input/output functions
• They are robust against noisy data
• Hard to interpret the results (unlike DTs)
• Learning is fast on large datasets when using stochastic gradient descent
plus momentum.
• Local minima in optimization are a problem
• Overfitting can be avoided using weight decay or early stopping
• There are also NN which feed information back (recurrent NN)
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...