Lecture 4
Neural Networks
ICS 273A UC Irvine
Instructor: Max Welling
Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built a first abstract model of a neuron:
$$y = g\Big(\sum_i W_i x_i + b\Big)$$
where $y$ is the output, $g$ the activation function, $x_i$ the inputs, $W_i$ the weights, and $b$ the bias.
Neurons
• We have about $10^{11}$ neurons, each one connected to $10^{4}$ other neurons on average.
• Each neuron needs at least $10^{-3}$ seconds to transmit the signal.
• So we have many, slow neurons. Yet we recognize our grandmother in $10^{-1}$ sec.
• Computers have much faster switching times: $10^{-10}$ sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well.
But since things are encoded redundantly by many of them,
their population can do computation reliably and fast.
Classification / Regression
• Neural nets are a parameterized function Y=f(X;W) from inputs (X) to outputs (Y).
• If Y is continuous: regression, if Y is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions; in other words, we maximize the conditional probability of the data (output given attributes).
• E.g. 2-class classification ($y = 0/1$):
$$P(y \mid x) = \prod_{n=1}^{N} f(x_n)^{y_n}\,\big(1 - f(x_n)\big)^{1 - y_n}$$
$$\text{error} = -\sum_{n=1}^{N} \Big[\, y_n \log f(x_n) + (1 - y_n) \log\big(1 - f(x_n)\big) \,\Big]$$
Looks familiar?
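As a concrete sketch, this error can be computed in a few lines of NumPy; the clipping epsilon is an assumption added to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y, f):
    """Negative log-likelihood of labels y in {0,1} under predictions f = f(x_n)."""
    f = np.clip(f, 1e-12, 1 - 1e-12)   # avoid log(0); the epsilon is an assumption
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

# Example with three made-up data points:
y = np.array([1, 0, 1])
f = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, f))
```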
Logistic Regression
[Diagram: single-layer network with inputs $x_j$, weights $W_{ij}$, and outputs $f_i(x)$]
• If we use the following model for f(x), we obtain logistic regression!
$$y_n = f(x_n; W) = \sigma\Big(\sum_i W_i x_{in} + b\Big), \qquad \sigma(z) = \frac{1}{1 + \exp(-z)}$$
• This is called a “perceptron” in neural networks jargon.
• Perceptrons can only separate classes linearly.
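A minimal sketch of this perceptron in NumPy; the weight and bias values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, W, b):
    """Logistic-regression 'perceptron': f(x; W) = sigma(sum_i W_i x_i + b)."""
    return sigmoid(np.dot(W, x) + b)

W = np.array([0.5, -1.0, 2.0])   # illustrative weights
b = -0.1                          # illustrative bias
x = np.array([1.0, 0.0, 0.5])
print(perceptron(x, W, b))        # a probability in (0, 1)
```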
Regression
• Probability of output given input attributes is Normally distributed.
$$p(y \mid x) = c \prod_{n=1}^{N} \prod_{i=1}^{d_{out}} \exp\Big(-\tfrac{1}{2}\big(y_{in} - \sum_{j} W_{ij} x_{jn} - b_i\big)^2\Big)$$
• Error is the negative log-probability = squared loss function:
$$\text{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \Big(y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i\Big)^2$$
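In NumPy this error could be written as a short function; the matrix shapes in the docstring are my assumption about how the data is laid out:

```python
import numpy as np

def squared_error(Y, X, W, b):
    """error = sum_n sum_i (y_in - sum_j W_ij x_jn - b_i)^2
    Assumed shapes: X (d_in, N), Y (d_out, N), W (d_out, d_in), b (d_out,)."""
    residual = Y - (W @ X + b[:, None])   # y_in - sum_j W_ij x_jn - b_i
    return np.sum(residual ** 2)
```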
Optimization
• We use stochastic gradient descent: pick a single data-item, compute the
contribution of that data-point to the overall gradient and update
the weights.
Repeat:
1) Pick a random data-item $(y_n, x_n)$
2a) $\dfrac{d\,\text{error}_n}{dW_{ij}} = -\Big(y_{in} - \sum_k W_{ik} x_{kn} - b_i\Big)\, x_{jn}$
2b) $\dfrac{d\,\text{error}_n}{db_i} = -\Big(y_{in} - \sum_k W_{ik} x_{kn} - b_i\Big)$
3a) $W_{ij} \leftarrow W_{ij} - \eta\, \dfrac{d\,\text{error}_n}{dW_{ij}}$
3b) $b_i \leftarrow b_i - \eta\, \dfrac{d\,\text{error}_n}{db_i}$
where $\eta$ is the step-size.
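A sketch of this loop in NumPy for the linear-regression error above; the constant step-size `eta`, the epoch count, and the zero initialization are illustrative assumptions:

```python
import numpy as np

def sgd_linear_regression(X, Y, eta=0.01, epochs=100):
    """Stochastic gradient descent for Y ~ W X + b.
    Assumed shapes: X (d_in, N), Y (d_out, N)."""
    d_in, N = X.shape
    d_out = Y.shape[0]
    W = np.zeros((d_out, d_in))
    b = np.zeros(d_out)
    for _ in range(epochs):
        for n in np.random.permutation(N):        # visit data-items in random order
            residual = Y[:, n] - (W @ X[:, n] + b) # y_n - W x_n - b
            # W <- W - eta * d error_n / dW  (the gradient is -residual * x_n)
            W += eta * np.outer(residual, X[:, n])
            b += eta * residual
    return W, b
```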
Stochastic Gradient Descent
[Figure: optimization paths comparing stochastic updates with full updates (averaged over all data-items)]
• Stochastic gradient descent does not converge to the minimum, but “dances”
around it.
• To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them
(similar to bagging).
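For instance, a decreasing step-size can be as simple as the schedule below; the 1/t form and the constants are just one common choice, not taken from the lecture:

```python
eta0, tau = 0.1, 100.0   # illustrative constants

def step_size(t):
    """Decreasing step-size: eta_t -> 0 as t grows, so the updates settle down."""
    return eta0 / (1.0 + t / tau)
```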
Multi-Layer Nets
[Diagram: network with input $x$, hidden layers $h^1$ and $h^2$, output $y$, and parameters $W^1,b^1$, $W^2,b^2$, $W^3,b^3$]
$$\hat{y}_i = g\Big(\sum_j W^3_{ij} h^2_j + b^3_i\Big), \qquad h^2_i = g\Big(\sum_j W^2_{ij} h^1_j + b^2_i\Big), \qquad h^1_i = g\Big(\sum_j W^1_{ij} x_j + b^1_i\Big)$$
• Single layers can only do linear things. If we want to learn non-linear
decision surfaces, or non-linear regression curves, we need more than
one layer.
• In fact, a NN with 1 hidden layer can approximate any boolean and any continuous function.
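A minimal sketch of the upward pass for this three-layer net, taking g to be the logistic sigmoid as in the earlier slides; the parameter packing is my own convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Upward pass of the 3-layer net: x -> h1 -> h2 -> y_hat."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = sigmoid(W1 @ x + b1)      # h1_i = g(sum_j W1_ij x_j  + b1_i)
    h2 = sigmoid(W2 @ h1 + b2)     # h2_i = g(sum_j W2_ij h1_j + b2_i)
    y_hat = sigmoid(W3 @ h2 + b3)  # y_i  = g(sum_j W3_ij h2_j + b3_i)
    return h1, h2, y_hat
```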
Back-propagation
• How do we learn the weights of a multi-layer network?
Answer: Stochastic gradient descent. But now the gradients are harder!
example:
$$\text{error}_n = -\sum_i \Big[\, y_{in} \log \hat{y}_{in} + (1 - y_{in}) \log(1 - \hat{y}_{in}) \,\Big]$$
$$\frac{d\,\text{error}_n}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \frac{d\hat{y}_{in}}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, \frac{d\big(\sum_j W^3_{ij} h^2_{jn} + b^3_i\big)}{dW^2_{jk}}$$
$$= \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, \frac{dh^2_{jn}}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \frac{d\big(\sum_k W^2_{jk} h^1_{kn} + b^2_j\big)}{dW^2_{jk}}$$
$$= \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \sigma\Big(\sum_l W^1_{kl} x_{ln} + b^1_k\Big)$$
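A quick way to sanity-check a chain-rule gradient like the one above is a finite-difference check on a single weight. The sketch below assumes a tiny randomly initialized network with sigmoid units and the cross-entropy error of this example; all sizes and the epsilon are arbitrary choices for illustration:

```python
import numpy as np
rng = np.random.default_rng(0)

# Tiny network; the layer sizes are arbitrary.
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W3, b3 = rng.normal(size=(2, 5)) * 0.1, np.zeros(2)
x, y = rng.normal(size=3), np.array([1.0, 0.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(W2):
    """Forward pass followed by the cross-entropy error, as a function of W2."""
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y_hat = sigmoid(W3 @ h2 + b3)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Numerical d error / d W2_jk for the single entry (j, k) = (0, 0).
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
print((error(W2_plus) - error(W2_minus)) / (2 * eps))
```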
Back Propagation
[Diagram: the same network drawn twice; activations flow up in the upward pass, deltas flow down in the downward pass]
Upward pass:
$$\hat{y}_i = \sigma\Big(\sum_j W^3_{ij} h^2_j + b^3_i\Big), \qquad h^2_i = \sigma\Big(\sum_j W^2_{ij} h^1_j + b^2_i\Big), \qquad h^1_i = \sigma\Big(\sum_j W^1_{ij} x_j + b^1_i\Big)$$
Downward pass:
$$\delta^3_{in} = \hat{y}_{in}(1 - \hat{y}_{in})\, \frac{d\,\text{error}_n}{d\hat{y}_{in}}$$
$$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i \in \text{upstream}(j)} W^3_{ij}\, \delta^3_{in}$$
$$\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j \in \text{upstream}(k)} W^2_{jk}\, \delta^2_{jn}$$
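A minimal NumPy sketch of this downward pass, assuming the cross-entropy error and sigmoid units of the running example; the function name and argument layout are my own:

```python
import numpy as np

def backward_deltas(y, y_hat, h1, h2, W2, W3):
    """Downward pass: compute the deltas layer by layer (sigmoid activations assumed)."""
    # d error_n / d y_hat for the cross-entropy error of the example above.
    d_error = -(y / y_hat - (1 - y) / (1 - y_hat))
    delta3 = y_hat * (1 - y_hat) * d_error       # delta3_i  (simplifies to y_hat - y)
    delta2 = h2 * (1 - h2) * (W3.T @ delta3)     # sum over upstream units i
    delta1 = h1 * (1 - h1) * (W2.T @ delta2)     # sum over upstream units j
    return delta1, delta2, delta3
```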
Back Propagation
$$\delta^3_{in} = \hat{y}_{in}(1 - \hat{y}_{in})\, \frac{d\,\text{error}_n}{d\hat{y}_{in}}$$
$$\frac{d\,\text{error}_n}{dW^2_{jk}} = \sum_i \frac{d\,\text{error}_n}{d\hat{y}_{in}}\, \hat{y}_{in}(1 - \hat{y}_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} = \delta^2_{jn}\, h^1_{kn}$$
$$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i \in \text{upstream}(j)} W^3_{ij}\, \delta^3_{in}, \qquad \delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j \in \text{upstream}(k)} W^2_{jk}\, \delta^2_{jn}$$
Updates (with step-size $\eta$):
$$W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn}\, h^1_{kn}, \qquad b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn}$$
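A matching sketch of the update step, subtracting the gradient δ·(input activation) from each weight matrix; `eta` and the in-place update style are assumptions, not the lecture's exact code:

```python
import numpy as np

def update(params, x, h1, h2, deltas, eta=0.1):
    """Gradient step: W_jk <- W_jk - eta * delta_j * (input activation)_k, b_j <- b_j - eta * delta_j."""
    (W1, b1), (W2, b2), (W3, b3) = params
    delta1, delta2, delta3 = deltas
    W3 -= eta * np.outer(delta3, h2);  b3 -= eta * delta3
    W2 -= eta * np.outer(delta2, h1);  b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);   b1 -= eta * delta1
    return (W1, b1), (W2, b2), (W3, b3)
```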
ALVINN
Learning to drive a car
This hidden unit detects a mildly left-sloping road and advises steering left.
What would another hidden unit look like?
Weight Decay
• NN can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and letting them grow during learning.
• One can now check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights:
$$W^2_{jk} \leftarrow W^2_{jk} - \eta\,\big(\delta^2_{jn}\, h^1_{kn} + \lambda\, W^2_{jk}\big), \qquad b^2_j \leftarrow b^2_j - \eta\,\big(\delta^2_{jn} + \lambda\, b^2_j\big)$$
• Now we need to set $\lambda$ using cross-validation.
• This is called “weight-decay” in NN jargon.
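A sketch of one weight-decay step for the second-layer parameters; `eta` and `lam` are illustrative values, with `lam` the decay strength to be tuned by cross-validation:

```python
import numpy as np

def weight_decay_step(W2, b2, delta2, h1, eta=0.1, lam=1e-4):
    """One SGD step with weight decay: the extra lam*W term shrinks the weights toward zero."""
    W2 -= eta * (np.outer(delta2, h1) + lam * W2)
    b2 -= eta * (delta2 + lam * b2)
    return W2, b2
```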
Momentum
• At the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It acts like an adaptive step-size.
• This idea is easily implemented by changing the gradient as follows:
$$\Delta W^2_{jk}(\text{new}) = -\eta\, \delta^2_{jn}\, h^1_{kn} + \mu\, \Delta W^2_{jk}(\text{old}), \qquad W^2_{jk} \leftarrow W^2_{jk} + \Delta W^2_{jk}(\text{new})$$
(and similarly for the biases), where $\mu$ is the momentum coefficient.
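A sketch of the momentum update; the momentum coefficient `mu` and step-size `eta` are assumed values, and the function signature is my own:

```python
import numpy as np

def momentum_step(W2, dW2_old, delta2, h1, eta=0.1, mu=0.9):
    """Momentum update: the new change reuses a fraction mu of the previous change."""
    dW2_new = -eta * np.outer(delta2, h1) + mu * dW2_old
    W2 += dW2_new
    return W2, dW2_new
```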
Conclusion
• NN are a flexible way to model input/output functions
• They can be given a probabilistic interpretation
• They are robust against noisy data
• Hard to interpret the results (unlike DTs)
• Learning is fast on large datasets when using stochastic gradient descent
plus momentum.
• Overfitting can be avoided using weight decay or early stopping
• There are also NN which feed information back (recurrent NN)
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...