lec1b

Transcript lec1b

CSC2535: Computation in Neural Networks
Lecture 1b: Learning without hidden units.
Geoffrey Hinton
www.cs.toronto.edu/~hinton/csc2535/notes/lec1b.htm
Linear neurons
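• The neuron has a real-valued output which is a
weighted sum of its inputs:
$\hat{y} = \sum_i w_i x_i = \mathbf{w}^T \mathbf{x}$
where $\hat{y}$ is the neuron’s estimate of the desired output,
$\mathbf{w}$ is the weight vector and $\mathbf{x}$ is the input vector.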
• The aim of learning is to
minimize the discrepancy
between the desired output
and the actual output
– How do we measure the
discrepancies?
– Do we update the weights
after every training case?
– Why don’t we solve it
analytically?
The delta rule
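• Define the error as the squared
residuals summed over all
training cases:
$E = \frac{1}{2} \sum_n (y_n - \hat{y}_n)^2$
• Now differentiate to get error
derivatives for weights:
$\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial \hat{y}_n}{\partial w_i} \frac{dE_n}{d\hat{y}_n} = -\sum_n x_{i,n} (y_n - \hat{y}_n)$
where $E_n = (y_n - \hat{y}_n)^2$ is the squared residual for training case $n$.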
• The batch delta rule changes
the weights in proportion to
their error derivatives summed
over all training cases
$\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i}$
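A minimal sketch of the batch delta rule for a single linear neuron, in plain Python. The function names, the toy data (generated from y = 2*x1 - x2), the learning rate and the number of steps are all illustrative assumptions, not part of the lecture.

```python
def predict(w, x):
    """Linear neuron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_delta_step(w, inputs, targets, epsilon):
    """One batch update: w_i <- w_i - epsilon * dE/dw_i, summed over all cases."""
    grad = [0.0] * len(w)  # accumulates dE/dw_i = -sum_n x_{i,n} (y_n - yhat_n)
    for x, y in zip(inputs, targets):
        residual = y - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] -= xi * residual
    return [wi - epsilon * gi for wi, gi in zip(w, grad)]


def total_error(w, inputs, targets):
    """E = 1/2 * sum of squared residuals over all training cases."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in zip(inputs, targets))


# Hypothetical toy data generated by y = 2*x1 - 1*x2
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [2.0, -1.0, 1.0, 3.0]

w = [0.0, 0.0]
initial_error = total_error(w, inputs, targets)
for _ in range(200):
    w = batch_delta_step(w, inputs, targets, epsilon=0.05)
final_error = total_error(w, inputs, targets)
```

On this toy problem the weights approach (2, -1), the values used to generate the targets.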
The error surface
• The error surface lies in a space with a
horizontal axis for each weight and one vertical
axis for the error.
– It is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
[Figure: the error surface, with vertical axis E and horizontal axes w1 and w2]
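One way to see why the surface is a quadratic bowl is to expand the squared error of the linear neuron in terms of the weights. The symbols $A$, $\mathbf{b}$ and $c$ below are shorthand introduced here for illustration; they do not appear in the slides.

```latex
\begin{align*}
E(\mathbf{w}) &= \frac{1}{2} \sum_n \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2 \\
              &= \frac{1}{2}\, \mathbf{w}^T A\, \mathbf{w} - \mathbf{b}^T \mathbf{w} + c,
\qquad
A = \sum_n \mathbf{x}_n \mathbf{x}_n^T, \quad
\mathbf{b} = \sum_n y_n \mathbf{x}_n, \quad
c = \frac{1}{2} \sum_n y_n^2 .
\end{align*}
% Along any line w = w_0 + t d the error is a quadratic in t with leading
% coefficient (1/2) d^T A d >= 0, so vertical cross-sections are parabolas.
% When A is positive definite, the level sets E(w) = const are ellipses.
```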
Online versus batch learning
• Batch learning does
steepest descent on the
error surface
• Online learning zig-zags
around the direction of
steepest descent
[Figure: weight space with axes w1 and w2, showing the constraint from training case 1 and the constraint from training case 2]
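A small sketch of the online version, which updates the weights after every training case rather than after a full pass. The two training cases, the learning rate and the names are hypothetical. Each update moves the weight vector along the current input vector, that is, perpendicular to that case's constraint line, which is what produces the zig-zag.

```python
def predict(w, x):
    """Linear neuron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def online_delta_step(w, x, y, epsilon):
    """Update on a single case: w_i <- w_i + epsilon * x_i * (y - yhat)."""
    residual = y - predict(w, x)
    return [wi + epsilon * xi * residual for wi, xi in zip(w, x)]


# Two hypothetical training cases; each constrains w to a line in (w1, w2) space.
# Both constraints are satisfied exactly by w = (1, 2).
cases = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)]

w = [0.0, 0.0]
path = [tuple(w)]  # record the zig-zag trajectory through weight space
for sweep in range(100):
    for x, y in cases:
        w = online_delta_step(w, x, y, epsilon=0.1)
        path.append(tuple(w))
```

Plotting `path` would show the weights bouncing between the two constraint lines while drifting toward their intersection.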
Convergence speed
• The direction of steepest
descent does not point at
the minimum unless the
ellipse is a circle.
– The gradient is big in
the direction in which
we only want to travel
a small distance.
– The gradient is small in
the direction in which we
want to travel a large
distance.
$\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i}$
• This equation is sick. The
RHS needs to be multiplied
by a term of dimension w^2.
• A later lecture will cover
ways of fixing this problem.
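The mismatch between gradient size and distance to travel can be illustrated on an elongated bowl. This is only a sketch: the axis-aligned error function, the curvatures, the starting point and the learning rate are all assumptions chosen to make the effect visible.

```python
def grad(w, curvatures):
    """Gradient of E(w) = 0.5 * sum_i c_i * w_i**2, an axis-aligned quadratic bowl."""
    return [c * wi for c, wi in zip(curvatures, w)]


curvatures = [100.0, 1.0]  # hypothetical: steep along w1, shallow along w2
w = [1.0, 10.0]            # minimum is at (0, 0): short trip along w1, long along w2
epsilon = 0.01             # kept below 2/100 so the steep w1 direction does not diverge

start_grad = grad(w, curvatures)  # larger along w1 even though w1 has less far to go
for _ in range(100):
    g = grad(w, curvatures)
    w = [wi - epsilon * gi for wi, gi in zip(w, g)]
```

After 100 steps w1 has reached the minimum, while roughly a third of the initial distance along w2 still remains.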
Preprocessing the input vectors
• Instead of trying to predict the answer directly
from the raw inputs we could start by extracting
a layer of “features”.
– Sensible if we already know that certain
combinations of input values would be useful
– The features are equivalent to a layer of
hand-coded non-linear neurons.
• So far as the learning algorithm is concerned,
the hand-coded features are the input.
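A sketch of what "the hand-coded features are the input" means in code. The particular feature set (the raw inputs plus their product) and the weights are hypothetical; the point is only that the learner sees `features(x)` and never `x` itself.

```python
def features(x):
    """Hypothetical hand-coded, non-adaptive features: raw inputs plus their product."""
    x1, x2 = x
    return [x1, x2, x1 * x2]


def linear_neuron(w, inputs):
    """Weighted sum of whatever vector the neuron is given."""
    return sum(wi * xi for wi, xi in zip(w, inputs))


w = [0.5, -1.0, 2.0]  # arbitrary illustrative weights, one per feature
x = (1.0, 3.0)
y_hat = linear_neuron(w, features(x))  # the neuron only ever sees features(x)
```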
Is preprocessing cheating?
• It seems like cheating if the aim is to show how
powerful learning is. The really hard bit is done
by the preprocessing.
• It’s not cheating if we learn the non-linear
preprocessing.
– This makes learning much more difficult and
much more interesting.
• It’s not cheating if we use a very big set of
non-linear features that is task-independent.
– Support Vector Machines make it possible to
use a huge number of features without much
computation or data.
Perceptrons
The input is recoded using
hand-picked features that do
not adapt.
Only the last layer of weights
is learned.
The output units are binary
threshold neurons and are
learned independently.
[Figure: three layers, from bottom to top: input units, non-adaptive hand-coded features, output units]
Binary threshold neurons
• McCulloch-Pitts (1943)
– First compute a weighted sum of the inputs
from other neurons
– Then output a 1 if the weighted sum exceeds
the threshold.
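$z = \sum_i x_i w_i$
$y = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{otherwise} \end{cases}$
[Figure: step function of y against z, jumping from 0 to 1 at the threshold]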
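A minimal sketch of a binary threshold neuron. It assumes the unit outputs 1 when the weighted sum is greater than or equal to the threshold, written `theta` here. The AND example, its weights and its threshold are illustrative choices.

```python
def binary_threshold_neuron(x, w, theta):
    """Output 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    z = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if z >= theta else 0


# Hypothetical example: with weights (1, 1) and threshold 2 the unit computes AND.
and_outputs = {
    x: binary_threshold_neuron(x, w=(1, 1), theta=2)
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]
}
```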
The perceptron convergence procedure
• Add an extra component with value 1 to each input vector.
The “bias” weight on this component is minus the
threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that
every training case will keep getting picked
– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the
input vector to the weight vector.
– If the output unit incorrectly outputs a 1, subtract the
input vector from the weight vector.
• This is guaranteed to find a suitable set of weights if any
such set exists.
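The procedure above can be sketched as follows. Assumptions made for this sketch: cases are visited in a fixed cyclic order (one policy that keeps picking every case), the unit outputs 1 when the weighted sum is at least 0, and a hypothetical `max_sweeps` cap stops the loop when no suitable weights exist.

```python
def output(w, x):
    """Binary threshold unit with the threshold folded into the bias weight."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


def train_perceptron(cases, max_sweeps=100):
    """cases: list of (input_vector, target) pairs with target in {0, 1}."""
    augmented = [(list(x) + [1.0], t) for x, t in cases]  # extra component of value 1
    w = [0.0] * len(augmented[0][0])
    for _ in range(max_sweeps):
        mistakes = 0
        for x, t in augmented:  # cyclic policy: every case keeps getting picked
            if output(w, x) == t:
                continue  # correct: leave the weights alone
            # Wrongly output 0: add the input vector. Wrongly output 1: subtract it.
            sign = 1.0 if t == 1 else -1.0
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            mistakes += 1
        if mistakes == 0:
            return w  # a full sweep with no mistakes: all cases are correct
    return None  # no suitable weights found within max_sweeps


and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_cases)
```

The last component of the returned weight vector is the bias, that is, minus the threshold.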
Weight space
• Imagine a space in which
each axis corresponds to a
weight.
– A point in this space is a
weight vector.
• Each training case defines
a plane.
– On one side of the plane
the output is wrong.
• To get all training cases
right we need to find a point
on the right side of all the
planes.
[Figure: weight space showing the origin, an input vector, and regions labeled “good weights” and “bad weights”]
Why the learning procedure works
• Consider the squared
distance between any
satisfactory weight vector
and the current weight
vector.
– Every time the
perceptron makes a
mistake, the learning
algorithm moves the
current weight vector
towards all satisfactory
weight vectors (unless it
crosses the constraint
plane).
• So consider “generously satisfactory”
weight vectors that lie within the
feasible region by a margin at least as
great as the largest update.
– Every time the perceptron makes a
mistake, the squared distance to all
of these weight vectors is always
decreased by at least the squared
length of the smallest update vector.
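The claim about the squared distance can be checked with a short calculation. This is a sketch under assumptions not stated in the slides: the threshold has been folded into the bias, a mistake on input $\mathbf{x}$ with target 1 means $\mathbf{w}^T\mathbf{x} < 0$ and the update is $\mathbf{w}' = \mathbf{w} + \mathbf{x}$, and the margin is the Euclidean distance from the generously satisfactory vector $\mathbf{w}^*$ to the constraint plane.

```latex
\begin{align*}
\|\mathbf{w}^* - \mathbf{w}'\|^2
  &= \|\mathbf{w}^* - \mathbf{w} - \mathbf{x}\|^2 \\
  &= \|\mathbf{w}^* - \mathbf{w}\|^2
     - 2\,\mathbf{w}^{*T}\mathbf{x} + 2\,\mathbf{w}^T\mathbf{x} + \|\mathbf{x}\|^2 .
\end{align*}
% Margin condition: the distance from w^* to the plane w^T x = 0 is
% w^{*T} x / ||x||, and it is at least the largest update, so w^{*T} x >= ||x||^2.
% Mistake condition: w^T x < 0.
\[
\|\mathbf{w}^* - \mathbf{w}'\|^2
  < \|\mathbf{w}^* - \mathbf{w}\|^2 - 2\|\mathbf{x}\|^2 + \|\mathbf{x}\|^2
  = \|\mathbf{w}^* - \mathbf{w}\|^2 - \|\mathbf{x}\|^2 .
\]
% So the squared distance drops by at least ||x||^2, which is at least the
% squared length of the smallest update vector. The case of an incorrect
% output of 1 is symmetric, with x replaced by -x.
```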
What perceptrons cannot do
• The binary threshold output
units cannot even tell if two
single bit numbers are the
same!
Same: (1,1) → 1; (0,0) → 1
Different: (1,0) → 0; (0,1) → 0
• The following set of inequalities
is impossible:
$w_1 + w_2 \geq \theta, \quad 0 \geq \theta$
$w_1 < \theta, \quad w_2 < \theta$
[Figure: data space with the points (0,0), (0,1), (1,0) and (1,1).
The positive and negative cases cannot be separated by a plane.]
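A short derivation of why no weights and threshold can work. It assumes the unit outputs 1 exactly when $w_1 x_1 + w_2 x_2 \geq \theta$, so each of the four training cases gives one inequality.

```latex
\begin{align*}
(1,1) \to 1 &:\quad w_1 + w_2 \geq \theta  &  (0,0) \to 1 &:\quad 0 \geq \theta \\
(1,0) \to 0 &:\quad w_1 < \theta           &  (0,1) \to 0 &:\quad w_2 < \theta
\end{align*}
% Adding the two strict inequalities gives w_1 + w_2 < 2\theta.
% Since \theta \leq 0, we have 2\theta \leq \theta, hence
\[
w_1 + w_2 < 2\theta \leq \theta ,
\]
% which contradicts w_1 + w_2 \geq \theta.
```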
What can perceptrons do?
• They can only solve tasks if the hand-coded features
convert the original task into a linearly separable one.
How difficult is this?
• The N-bit parity task:
– Requires N features of the form: Are at least m bits
on?
– Each feature must look at all the components of the
input.
• The 2-D connectedness task
– requires an exponential number of features!
– Connectedness is much easier to compute using
iterative algorithms that propagate markers.
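A sketch of how N features of the form "are at least m bits on?" make parity solvable by a single binary threshold unit. The alternating weights +1, -1, +1, ... and the threshold of 1 are one choice that works, picked for this illustration; they are not given in the slides.

```python
from itertools import product


def at_least_m_features(bits):
    """Feature m (for m = 1..N) is 1 if at least m of the N bits are on."""
    n_on = sum(bits)
    return [1 if n_on >= m else 0 for m in range(1, len(bits) + 1)]


def parity_unit(bits):
    """Binary threshold unit over the features, with assumed weights +1, -1, +1, ..."""
    feats = at_least_m_features(bits)
    weights = [1 if m % 2 == 1 else -1 for m in range(1, len(bits) + 1)]
    z = sum(w * f for w, f in zip(weights, feats))
    return 1 if z >= 1 else 0  # assumed threshold of 1


N = 4
all_correct = all(
    parity_unit(bits) == sum(bits) % 2 for bits in product((0, 1), repeat=N)
)
```

With k bits on, the first k features are active and their alternating weights sum to 1 when k is odd and 0 when k is even, so the unit outputs the parity. Note that every feature depends on all N input bits.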
Distinguishing T from C in any orientation and position
• What kind of features are
required to distinguish two
different patterns of 5 pixels
independent of position and
orientation?
– Do we need to replicate T
and C templates across
all positions and
orientations?
– Looking at pairs of pixels
will not work
– Looking at triples will work
if we assume that each
input image only contains
one object.
Replicate the following two
feature detectors in all positions:
[Figure: two feature detectors, each a small pattern of + and - weights]
If any of these equal their threshold
of 2, it’s a C. If not, it’s a T.
Beyond perceptrons
• We need to learn the features, not just how to weight
them to make a decision.
– We may need to abandon guarantees of finding
optimal solutions.
• Need to make use of recurrent connections, especially
for modeling sequences.
– Long-term temporal regularities are hard to learn.
• Need to learn representations without a teacher.
– This makes it much harder to decide what the goal is.
• Need to learn complex hierarchical representations.
– Must traverse deep hierarchies using fixed hardware.
• Need to attend to one part of the sensory input at a time.
– Requires segmentation and sequential organization of
sensory processing.
