5-NeuralNetworks

Download Report

Transcript 5-NeuralNetworks

Machine Learning
Neural Networks
Ch. 4
Biological motivation
• Can we simulate the human learning process?  Two schools
• modeling biological systems
• ‘just’ building learning systems
• Biological learning system (brain)
– complex network of neurons
– properties
• Neuron switching time ~0.001 second
• Number of neurons ~ 1010
• Connections per neuron ~104 - 105
• Scene recognition time ~0.1 second
• 100 inference steps doesn't seem like enough
• much parallel computation
Artificial Neural Networks (ANN)
• ANN
– network of simple units
– real-valued inputs & outputs
•
•
•
•
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
Neural Speed Constraints
• Neurons have a “switching time” on the order of a few
milliseconds, compared to nanoseconds for current
computing hardware.
• However, neural systems can perform complex cognitive
tasks (vision, speech understanding) in tenths of a
second.
• Only time for performing 100 serial steps in this time
frame, compared to orders of magnitude more for current
computers.
• Must be exploiting “massive parallelism.”
• Human brain has about 1011 neurons with an average of
104 connections each.
R. Mooney, UT Austin
Neural Network Learning
• Learning approach based on modeling adaptation in
biological neural systems.
• Perceptron: Initial algorithm for learning simple neural
networks (single layer) developed in the 1950’s.
• Backpropagation: More complex algorithm for learning
multi-layer neural networks developed in the 1980’s.
R. Mooney, UT Austin
Real Neurons
How does our Brain Work?
• A neuron is connected to other neurons via its input and
output links.
• Each incoming neuron has an activation value and each
connection has a weight associated with it.
• The neuron sums the incoming weighted values and this
value is input to an activation function.
• The output of the activation function is the output from
the neuron.
Neural Communication
•
•
•
•
•
Electrical potential across cell
membrane exhibits spikes
called action potentials.
Spike originates in cell body,
travels down axon, and causes
synaptic terminals to release
neurotransmitters.
Chemical diffuses across
synapse to dendrites of other
neurons.
Neurotransmitters can be
excititory or inhibitory.
If net input of
neurotransmitters to a neuron
from other neurons is excititory
and exceeds some threshold,
it fires an action potential.
R. Mooney, UT Austin
Real Neural Learning
• To model the brain we need to model a neuron.
• Each neuron performs a simple computation.
– It receives signals from its input links and it uses these values to
compute the activation level (or output) for the neuron.
– This value is passed to other neurons via its output links.
Perceptrons
• Introduced in the late 50s .
Minsky and Papert.
• Perceptron convergence
theorem Rosenblatt 1962:
• Perceptron will learn to
classify any linearly
separable set of inputs.
XOR function
Perceptron is a network:
– single-layer
– feed-forward
the data only travels in
one direction (from the
input neurons to the
output neurons)
ALVINN drives 70 mph on
highways
Perceptron
The input value received of a neuron is calculated
by summing
n
the weighted input values from its input links  wi xi
i 0
threshold function
Vector notation:
Different Threshold Functions
 
  1, w  x  0
o( x )  
 
 1, w  x  0
 
  1, w  x  t
o( x )  
0, otherwise
 
  1, w  x  0
o( x )  
 
 1, w  x  0
 
  1, w  x  0
o( x )  
 

1
,
w
x 0

We should learn the weight w1,…, wn
Examples
(step activation function)
In1
In2
Out
In1
In2
Out
In
Out
0
0
0
0
0
0
0
1
0
1
0
0
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
1
Perceptron Training
• Assume supervised training examples giving the desired
output for a unit given a set of known input activations.
• Learn synaptic weights so that unit produces the correct
output for each example.
• Perceptron uses iterative update algorithm to learn a
correct set of weights.
R. Mooney, UT Austin
Perceptron Training Rule
• Learn weights of a single unit
• Problem
– given examples labeled +1/-1
– learn weight vector
• Algorithms
– perceptron rule, delta rule, …
– guaranteed to converge
– different acceptable hypotheses & assumptions
Perceptron Training Rule
• Update weights by:
wi  wi  wi
wi   (t  o ) wi
where
η is the learning rate
t – target output for the current training example
• Equivalent to rules:
– If output is correct do nothing.
– If output is high, lower weights on active inputs
– If output is low, increase weights on active inputs
Perceptron training rule
• Can prove it will converge
– if training data is linearly separable
– and η sufficiently small
Perceptron Learning Algorithm
• Iteratively update weights until convergence.
Initialize weights to random values
Until outputs of all training examples are correct
For each training pair, E, do:
Compute current output oj for E given its inputs
Compare current output to target value, tj , for E
Update synaptic weights and threshold using learning rule
• Each execution of the outer loop is typically called an
epoch.
R. Mooney, UT Austin
Delta Rule
• Works reasonably with non-separable data
• Minimizes error
• Gradient descent method
– basis of Backpropagation method
– basis for methods working in multidimensional
continuous spaces
Gradient Descent Preliminaries
• Tangent
• Derivatives
Tangent, Inflection Point
• In geometry, the tangent line (or simply the tangent) to a curve at a
given point is the straight line that "just touches" the curve at that
point
• In differential calculus, an inflection point, or point of inflection (or
inflexion) is a point on a curve at which the curvature changes sign.
The curve changes from being concave upwards (positive curvature)
to concave downwards (negative curvature), or vice versa.
Visualization (tangent)
Visualization (inflection point)
Extremum: Maximum and Minimum
global maximum
local maximum
local minimum
global minimum
• In mathematics, Fermat's theorem is a theorem in real analysis,
named after Pierre de Fermat.
• It gives a method to find local maxima and minima of differentiable
functions on open sets by showing that every local extremum of
the function is a stationary point (the function derivative is
zero in that point).
Hypothesis space
• Example case: two weights
– error surface E(w)
– parabolic (by definition)
– single global minimum
– arrow: negated gradient at one point
– steepest descent along the surface
Gradient Descent
•
To understand, consider a linear unit, where
o  w0  w1 x1  ...  wn xn
•
Let’s learn the wi’s that minimize the squared error
 1
E[ w]  2  (td  od ) 2
d D
•
•
•
Where D is the set of training examples
In statistics, mean squared error is one of many ways to quantify the amount
by which an estimator differs from the true value of the quantity being
estimated
We should minimize squared error
There exist alternative error functions
Hypothesis space
• Example case: two weights
– error surface E(w)
– parabolic (by definition)
– single global minimum
– arrow: negated gradient at one point
– steepest descent along the surface
Gradient Descent

E(w)  [ wE0 , wE1 ,..., wEn ]
Partial derivatives of E
with respect to weights wi
is itself a vector, whose components are the partial derivatives of E with
respect to each of the wi. When interpreted as a vector in weightspace,
the gradient specifies the direction that produces the steepest
increase in E. The negative of this vector therefore gives the direction of
steepest decrease
where
,
Hypothesis space
• Example case: two weights
– error surface E(w)
– parabolic (by definition)
– single global minimum
– arrow: negated gradient at one point
– steepest descent along the surface
Derivation of the rule
• Compute the gradient of E(w)
–
–
–
–
–
–
vector E(w) of partial derivatives
specifies the direction of steepest increase in E
training rule: w = -E(w)
componentwise
wi = wi +  wi,  wi = -E/ wi
wi is changed in proportion to E/ wi
Practical algorithm
• Efficient computation of E/ wi
– Reduces to sum ((t - o)(- xi))
• Converges to minimum error
– too large   may oscillate
– common modification: gradually decrease learning
rate
Problems with gradient descent
• Gradient descent
– continuous hypothesis space
– error can be differentiated wrt hypothesis parameters
• Difficulties
– converging can be slooooow
– many local minima  no guarantee we find the global one
Remarks
• Perceptron rule
– thresholded output
– converges after a finite # of iterations
– provided data is separable
• Delta rule
– unthresholded
– asymptotic convergence
– regardless of training data
Perceptron as a Linear Separator
• Since perceptron uses linear threshold function, it is
searching for a linear separator that discriminates the
classes.
o3
w12o2  w13o3  T1
??
w12
T1
o3  
o2 
w13
w13
o2
R. Mooney, UT Austin
Or hyperplane in
n-dimensional space
Concept Perceptron Cannot Learn
• Cannot learn exclusive-or, or parity function in
general.
o3
1
+
??
–
–
+
0
1
o2
R. Mooney, UT Austin
Perceptron Limits
• System obviously cannot learn concepts it cannot
represent.
• Minksy and Papert (1969) wrote a book analyzing the
perceptron and demonstrating many functions it could
not learn.
• These results discouraged further research on neural
nets; and symbolic AI became the dominate paradigm.
R. Mooney, UT Austin
Perceptron as Hill Climbing
•
•
•
•
The hypothesis space being search is a set of weights and a threshold.
Objective is to minimize classification error on the training set.
Perceptron effectively does hill-climbing (gradient descent) in this space,
changing the weights a small amount at each point to decrease training set
error.
For a single model neuron, the space is well behaved with a single minima.
training
error
0
weights
R. Mooney, UT Austin
Perceptron Performance
• Linear threshold functions are restrictive (high bias) but still
reasonably expressive; more general than:
– Pure conjunctive
– Pure disjunctive
– M-of-N (at least M of a specified set of N features must be present)
• In practice, converges fairly quickly for linearly separable data.
• Can effectively use even incompletely converged results when only
a few outliers are misclassified.
• Experimentally, Perceptron does quite well on many benchmark
data sets.
R. Mooney, UT Austin
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g. raw sensor
input)
• Output is discrete or real valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
• Examples:
– Speech phoneme recognition [Waibel]
– Image classification [Kanade, Baluja, Rowley]
– Financial prediction