Lecture 4
Neural Networks
ICS 273A UC Irvine
Instructor: Max Welling
Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built a first abstract model of a neuron:
$$y = g\Big(\sum_i W_i x_i + b\Big)$$
where $y$ is the output, $g$ the activation function, $x_i$ the inputs, $W_i$ the weights, and $b$ the bias.
Neurons
• We have about $10^{11}$ neurons, each one connected to $10^{4}$ other neurons on average.
• Each neuron needs at least $10^{-3}$ seconds to transmit the signal.
• So we have many, slow neurons. Yet we recognize our grandmother in $10^{-1}$ sec.
• Computers have much faster switching times: $10^{-10}$ sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well.
But since things are encoded redundantly by many of them,
their population can do computation reliably and fast.
Classification / Regression
• Neural nets are a parameterized function Y=f(X;W) from inputs (X) to outputs (Y).
• If Y is continuous: regression, if Y is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions; in other words, we maximize the conditional probability of the data (output given attributes).
• E.g. 2-class classification ($y = 0/1$):
$$P(y \mid x) = \prod_{n=1}^{N} f(x_n)^{y_n}\,\big(1 - f(x_n)\big)^{1 - y_n}$$
$$\text{error} = -\sum_{n=1}^{N} \Big[\, y_n \log f(x_n) + (1 - y_n) \log\big(1 - f(x_n)\big) \,\Big]$$
Looks familiar?
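As a concrete sketch, this error can be computed in a few lines of NumPy; the clipping epsilon is an assumption added to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y, f):
    """Negative log-likelihood of labels y in {0,1} under predictions f = f(x_n)."""
    f = np.clip(f, 1e-12, 1 - 1e-12)   # avoid log(0); the epsilon is an assumption
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

# Example with three made-up data points:
y = np.array([1, 0, 1])
f = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, f))
```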
Logistic Regression
[Diagram: single-layer network with inputs $x_j$, weights $W_{ij}$, and outputs $f_i(x)$]
• If we use the following model for f(x), we obtain logistic regression!
$$y_n = f(x_n; W) = \sigma\Big(\sum_i W_i x_{in} + b\Big), \qquad \sigma(z) = \frac{1}{1 + \exp(-z)}$$
• This is called a “perceptron” in neural networks jargon.
• Perceptrons can only separate classes linearly.
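A minimal sketch of this perceptron in NumPy; the weight and bias values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, W, b):
    """Logistic-regression 'perceptron': f(x; W) = sigma(sum_i W_i x_i + b)."""
    return sigmoid(np.dot(W, x) + b)

W = np.array([0.5, -1.0, 2.0])   # illustrative weights
b = -0.1                          # illustrative bias
x = np.array([1.0, 0.0, 0.5])
print(perceptron(x, W, b))        # a probability in (0, 1)
```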
Regression
• Probability of output given input attributes is Normally distributed.
$$p(y \mid x) = c \prod_{n=1}^{N} \prod_{i=1}^{d_{out}} \exp\Big(-\tfrac{1}{2}\big(y_{in} - \sum_{j} W_{ij} x_{jn} - b_i\big)^2\Big)$$
• Error is the negative log-probability = squared loss function:
$$\text{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \Big(y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i\Big)^2$$
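In NumPy this error could be written as a short function; the matrix shapes in the docstring are my assumption about how the data is laid out:

```python
import numpy as np

def squared_error(Y, X, W, b):
    """error = sum_n sum_i (y_in - sum_j W_ij x_jn - b_i)^2
    Assumed shapes: X (d_in, N), Y (d_out, N), W (d_out, d_in), b (d_out,)."""
    residual = Y - (W @ X + b[:, None])   # y_in - sum_j W_ij x_jn - b_i
    return np.sum(residual ** 2)
```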
Optimization
• We use stochastic gradient descent: pick a single data-item, compute the
contribution of that data-point to the overall gradient and update
the weights.
Repeat:
1) Pick a random data-item $(y_n, x_n)$
2a) $\dfrac{d\,\text{error}_n}{dW_{ij}} = -\Big(y_{in} - \sum_k W_{ik} x_{kn} - b_i\Big)\, x_{jn}$
2b) $\dfrac{d\,\text{error}_n}{db_i} = -\Big(y_{in} - \sum_k W_{ik} x_{kn} - b_i\Big)$
3a) $W_{ij} \leftarrow W_{ij} - \eta\, \dfrac{d\,\text{error}_n}{dW_{ij}}$
3b) $b_i \leftarrow b_i - \eta\, \dfrac{d\,\text{error}_n}{db_i}$
where $\eta$ is the step-size.
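A sketch of this loop in NumPy for the linear-regression error above; the constant step-size `eta`, the epoch count, and the zero initialization are illustrative assumptions:

```python
import numpy as np

def sgd_linear_regression(X, Y, eta=0.01, epochs=100):
    """Stochastic gradient descent for Y ~ W X + b.
    Assumed shapes: X (d_in, N), Y (d_out, N)."""
    d_in, N = X.shape
    d_out = Y.shape[0]
    W = np.zeros((d_out, d_in))
    b = np.zeros(d_out)
    for _ in range(epochs):
        for n in np.random.permutation(N):        # visit data-items in random order
            residual = Y[:, n] - (W @ X[:, n] + b) # y_n - W x_n - b
            # W <- W - eta * d error_n / dW  (the gradient is -residual * x_n)
            W += eta * np.outer(residual, X[:, n])
            b += eta * residual
    return W, b
```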
Stochastic Gradient Descent
[Figure: optimization paths comparing stochastic updates with full updates (averaged over all data-items)]
• Stochastic gradient descent does not converge to the minimum, but “dances”
around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them
(similar to bagging).
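For instance, a decreasing step-size can be as simple as the schedule below; the 1/t form and the constants are just one common choice, not taken from the lecture:

```python
eta0, tau = 0.1, 100.0   # illustrative constants

def step_size(t):
    """Decreasing step-size: eta_t -> 0 as t grows, so the updates settle down."""
    return eta0 / (1.0 + t / tau)
```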
Multi-Layer Nets
[Diagram: network with input $x$, hidden layers $h^1$ and $h^2$, output $y$, and parameters $W^1,b^1$, $W^2,b^2$, $W^3,b^3$]
$$\hat{y}_i = g\Big(\sum_j W^3_{ij} h^2_j + b^3_i\Big), \qquad h^2_i = g\Big(\sum_j W^2_{ij} h^1_j + b^2_i\Big), \qquad h^1_i = g\Big(\sum_j W^1_{ij} x_j + b^1_i\Big)$$
• Single layers can only do linear things. If we want to learn non-linear
decision surfaces, or non-linear regression curves, we need more than
one layer.
• In fact, a NN with 1 hidden layer can approximate any boolean and any continuous function.
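A minimal sketch of the upward pass for this three-layer net, taking g to be the logistic sigmoid as in the earlier slides; the parameter packing is my own convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Upward pass of the 3-layer net: x -> h1 -> h2 -> y_hat."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = sigmoid(W1 @ x + b1)      # h1_i = g(sum_j W1_ij x_j  + b1_i)
    h2 = sigmoid(W2 @ h1 + b2)     # h2_i = g(sum_j W2_ij h1_j + b2_i)
    y_hat = sigmoid(W3 @ h2 + b3)  # y_i  = g(sum_j W3_ij h2_j + b3_i)
    return h1, h2, y_hat
```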
Back-propagation
• How do we learn the weights of a multi-layer network?
Answer: Stochastic gradient descent. But now the gradients are harder!
example:
$$\text{error}_n = -\sum_i \Big[\, y_{in} \log \hat{y}_{in} + (1 - y_{in}) \log(1 - \hat{y}_{in}) \,\Big]$$
$$\frac{d\,\text{error}_n}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \frac{d\hat{y}_{in}}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, \frac{d\big(\sum_j W^3_{ij} h^2_{jn} + b^3_i\big)}{dW^2_{jk}}$$
$$= \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, \frac{dh^2_{jn}}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \frac{d\big(\sum_k W^2_{jk} h^1_{kn} + b^2_j\big)}{dW^2_{jk}}$$
$$= \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \sigma\Big(\sum_l W^1_{kl} x_{ln} + b^1_k\Big)$$
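A quick way to sanity-check a chain-rule gradient like the one above is a finite-difference check on a single weight. The sketch below assumes a tiny randomly initialized network with sigmoid units and the cross-entropy error of this example; all sizes and the epsilon are arbitrary choices for illustration:

```python
import numpy as np
rng = np.random.default_rng(0)

# Tiny network; the layer sizes are arbitrary.
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W3, b3 = rng.normal(size=(2, 5)) * 0.1, np.zeros(2)
x, y = rng.normal(size=3), np.array([1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(W2):
    """Forward pass followed by the cross-entropy error, as a function of W2."""
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Numerical d error / d W2_jk for the single entry (j, k) = (0, 0).
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
print((error(W2_plus) - error(W2_minus)) / (2 * eps))
```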
Back Propagation
[Diagram: the same network drawn twice; activations flow up in the upward pass, deltas flow down in the downward pass]
Upward pass:
$$\hat{y}_i = \sigma\Big(\sum_j W^3_{ij} h^2_j + b^3_i\Big), \qquad h^2_i = \sigma\Big(\sum_j W^2_{ij} h^1_j + b^2_i\Big), \qquad h^1_i = \sigma\Big(\sum_j W^1_{ij} x_j + b^1_i\Big)$$
Downward pass:
$$\delta^3_{in} = \hat{y}_{in}(1 - \hat{y}_{in})\, \frac{d\,\text{error}_n}{d\hat{y}_{in}}$$
$$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i \in \text{upstream}(j)} W^3_{ij}\, \delta^3_{in}$$
$$\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j \in \text{upstream}(k)} W^2_{jk}\, \delta^2_{jn}$$
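A minimal NumPy sketch of this downward pass, assuming the cross-entropy error and sigmoid units of the running example; the function name and argument layout are my own:

```python
import numpy as np

def backward_deltas(y, y_hat, h1, h2, W2, W3):
    """Downward pass: compute the deltas layer by layer (sigmoid activations assumed)."""
    # d error_n / d y_hat for the cross-entropy error of the example above.
    d_error = -(y / y_hat - (1 - y) / (1 - y_hat))
    delta3 = y_hat * (1 - y_hat) * d_error       # delta3_i  (simplifies to y_hat - y)
    delta2 = h2 * (1 - h2) * (W3.T @ delta3)     # sum over upstream units i
    delta1 = h1 * (1 - h1) * (W2.T @ delta2)     # sum over upstream units j
    return delta1, delta2, delta3
```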
Back Propagation
$$\delta^3_{in} = \hat{y}_{in}(1 - \hat{y}_{in})\, \frac{d\,\text{error}_n}{d\hat{y}_{in}}$$
$$\frac{d\,\text{error}_n}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} = \delta^2_{jn}\, h^1_{kn}$$
$$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i \in \text{upstream}(j)} W^3_{ij}\, \delta^3_{in}, \qquad \delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j \in \text{upstream}(k)} W^2_{jk}\, \delta^2_{jn}$$
Updates (with step-size $\eta$):
$$W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn}\, h^1_{kn}, \qquad b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn}$$
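A matching sketch of the update step, subtracting the gradient δ·(input activation) from each weight matrix; `eta` and the in-place update style are assumptions, not the lecture's exact code:

```python
import numpy as np

def update(params, x, h1, h2, deltas, eta=0.1):
    """Gradient step: W_jk <- W_jk - eta * delta_j * (input activation)_k, b_j <- b_j - eta * delta_j."""
    (W1, b1), (W2, b2), (W3, b3) = params
    delta1, delta2, delta3 = deltas
    W3 -= eta * np.outer(delta3, h2);  b3 -= eta * delta3
    W2 -= eta * np.outer(delta2, h1);  b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);   b1 -= eta * delta1
    return (W1, b1), (W2, b2), (W3, b3)
```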
ALVINN
Learning to drive a car
This hidden unit detects a mildly left-sloping road and advises steering left.
What would another hidden unit look like?
Weight Decay
• NN can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and letting them grow during learning.
• One can now check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights:
$$W^2_{jk} \leftarrow W^2_{jk} - \eta\,\big(\delta^2_{jn}\, h^1_{kn} + \lambda\, W^2_{jk}\big), \qquad b^2_j \leftarrow b^2_j - \eta\,\big(\delta^2_{jn} + \lambda\, b^2_j\big)$$
• Now we need to set $\lambda$ using cross-validation.
• This is called “weight-decay” in NN jargon.
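A sketch of one weight-decay step for the second-layer parameters; `eta` and `lam` are illustrative values, with `lam` the decay strength to be tuned by cross-validation:

```python
import numpy as np

def weight_decay_step(W2, b2, delta2, h1, eta=0.1, lam=1e-4):
    """One SGD step with weight decay: the extra lam*W term shrinks the weights toward zero."""
    W2 -= eta * (np.outer(delta2, h1) + lam * W2)
    b2 -= eta * (delta2 + lam * b2)
    return W2, b2
```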
Momentum
• At the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It acts like an adaptive step-size.
• This idea is easily implemented by changing the gradient as follows:
$$\Delta W^2_{jk}(\text{new}) = -\eta\, \delta^2_{jn}\, h^1_{kn} + \mu\, \Delta W^2_{jk}(\text{old}), \qquad W^2_{jk} \leftarrow W^2_{jk} + \Delta W^2_{jk}(\text{new})$$
(and similarly for the biases), where $\mu$ is the momentum coefficient.
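A sketch of the momentum update; the momentum coefficient `mu` and step-size `eta` are assumed values, and the function signature is my own:

```python
import numpy as np

def momentum_step(W2, dW2_old, delta2, h1, eta=0.1, mu=0.9):
    """Momentum update: the new change reuses a fraction mu of the previous change."""
    dW2_new = -eta * np.outer(delta2, h1) + mu * dW2_old
    W2 += dW2_new
    return W2, dW2_new
```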
Conclusion
• NN are a flexible way to model input/output functions
• They can be given a probabilistic interpretation
• They are robust against noisy data
• Hard to interpret the results (unlike DTs)
• Learning is fast on large datasets when using stochastic gradient descent
plus momentum.
• Overfitting can be avoided using weight decay or early stopping
• There are also NN which feed information back (recurrent NN)
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...