Neural Networks

Download Report

Transcript Neural Networks

2101INT – Principles of Intelligent Systems
Lecture 10
Neural Networks




Since we are interested in creating artificial intelligence
in systems, it is reasonable that we would attempt to
mimic the human brain
The concept of an artificial neuron has been around
since at least 1943
This type of field is usually described as artificial neural
networks (ANNs), connectionism, parallel distributed
processing or neural computation
This field is also of interested to cognitive
psychologists who seek to better understand the
human brain
Neurons





A neuron is a cell in the brain that collects, processes
and disseminates electric signals
On their own, neurons are not particularly complex
Much of the brain’s information-processing capacity is
thought to stem from the number of and interrelationships between the neurons.
As such is an emergent property of the neurons, since
each of its own does not have the power of the whole
The human brain contains about 1010 neurons, each on
average connected to about 10,000 others
Neurons

Signals in a brain are noisy “spike trains” of electrical
energy
Neurons
Neurons



The axon endings almost touch the dendrites or cell
body of the next neuron – termed a synapse
Electrical signals are transferred with the aid of
neurotransmitters – chemicals which are released from
one neuron and which bind to another
Signal transmission depends on:
–
–
–
–
quantity of neurotransmitter
number and arrangement of receptors
neurotransmitter re-absorption
etc.
Carbon vs Silicon








Elements: 1014 synapses vs 108 transistors
Size: 10-6m vs 10-6m
Energy: 30W vs 30W (CPU)
Speed: 100Hz vs 109Hz
Architecture: Parallel/Distributed vs Serial/Centralised
Fault Tolerant: Yes vs a Little
Learns: Yes vs Maybe
Intelligent: Usually vs Not Yet
737
Mathematical Model of a Neuron

McCulloch and Pitts (1943)
Mathematical Model of a Neuron






ANNs are composed of many
units
A link from unit j to unit i
propagates the activation of
unit j (aj) to unit i
Each link also has weight Wj,i
which determines the strength
and sign of the link
Each unit computes the
weighted sum of its inputs, ini:
And then applies an activation
function to derive the output ai:
A bias weight is also present
n
ini  W j ,i a j
j 0
 n

ai  g (ini )  g  W j ,i a j 
 j 0

738
Activation Functions



The activation function should create a “high” output
(say 1) when the correct inputs are given and a “low”
output (say 0) otherwise
The function also needs to be non-linear (not of the
form y = mx + c), otherwise the network as a whole
would be a simple linear function, which isn’t
particularly powerful
Two choices are the threshold (unit step) function and
the sigmoid function = 1/(1+e-x)
738
Neurons as Logic Gates


A single neuron can implement the three most basic Boolean logic
functions (and also NAND and NOR)
A single unit cannot represent XOR
Usefulness of ANNs





So neurons and ANNs can be designed to exhibit
particular behaviours
More often, they are used to learn to recognise/classify
particular patterns.
Rosenblatt’s work (1958) explicitly considered this
problem when a teacher is providing advice to the ANN
From this we derive the term supervised classification
or supervised learning
It was he who introduced perceptrons: ANNs that
change their link weights when they make incorrect
decisions/classifications
738
Networks of Neurons




Two main categories: feed-forward networks and
recurrent networks
Feed-forward networks are acyclic: all links feed
forward in the network. A feed forward network is
simply a function of its current input. It has no internal
state.
Recurrent networks are cyclic: links can feed back into
themselves. Thus, the activation levels of the network
form a dynamic system, and can exhibit either stable,
oscillatory or even chaotic behaviour.
A recurrent network’s response will depend on its initial
state, which depends on prior inputs
740
A Single Layer Perceptron

A network with all inputs connected directly to the
outputs is called a single-layer neural network or a
perceptron network
What can single layer networks do?

Already seen that they can implement simple Boolean
logic functions
Can also represent other more complex functions, like
the majority function: returns T iff more than half of its
inputs are T
Decision tree representation would require O(2n) nodes

So why can’t we represent XOR?


740
Linear Seperability

The output of a threshold perceptron can be described
as:
n
W j x j  0
j 0

or as vectors:
W•x > 0

This equation defines a hyperplane in the input space
741
Linear Seperability cont.



Functions that can be divided by such a hyperplane
are termed linear seperable
In general, threshold perceptrons can represent only
linearly seperable functions
There are times when such networks are sufficiently
appropriate however
How does the brain learn?



Brains learn by altering the strength of connections
between neurons
They learn online, often without the benefit of explicit
training examples
Hebb’s Postulate:
"When an axon of cell A... excites[s] cell B and repeatedly or
persistently takes part in firing it, some growth process or
metabolic change takes place in one or both cells so that A's
efficiency as one of the cells firing B is increased."
741
Learning


What is learning?
Rosenblatt (1958) provided a learning scheme with the
property that: “if the patterns of the training set can be
seperated by some choice of weights and threshold, then the
scheme will eventually yield a satisfactory setting of the weights”
1. Pick a “representative set of patterns” – training set
2. Expose network to this set to adjust synaptic weights
using a learning rule – such as minimise the error
741
Learning


Minimise the error, in this case sum of square error
SSE:
1
1
2
2
E  Err   y  hW (x) 
2
2
Use gradient descent, to minimise the error wrt each
particular link weight. Use chain rule:
x x u

y u y
741
Learning cont.
1
1
2
Err 2   y  hW (x) 
2
2
Using the chain rule:
E
E Err

W j Err W j
E



First term reduces to:
E
 1 2 Err 2

 Err
Err
Err
Second term:
n

Err
 

g  y  W j x j 
W j W j 
j 0



741
Learning cont.


Combining these:
E
E Err

 Err  g ' (in )   x j
W j Err W j
E
  Err  g ' (in )  x j
W j
Definitions of terms
–
–
g’() is the derivative of the activation function
ini is the weighted sum of inputs
742
Updating weights






Having calculated the impact of each weight on the
overall error, can now adjust each Wj accordingly:
W j  W j    Err  g ' (in )  x j
Note that the minus has been dropped from the
previous equation: +ve error requires increased output
 is called the learning rate
The network is shown each training example, and the
weights are updated
Exposure to a complete set of training examples is
termed an epoch
The process is repeated until convergence occurs
743
Learning Examples
748
Learning Examples

Graph shows total error on a training set of 100
examples
744
Multilayer Feed Forward Networks


Will now consider networks where the inputs do not
connect directly to the outputs
Introduce some intervening units between input and
output which are termed hidden units
Why?
744
Effect of Hidden Units



Call these networks multilayer perceptron networks
Hidden layers remove the restriction to linearly
seperable functions
Using a sigmoid activation function, two hidden units
can classify a ridge, 4 a bump, >4 more bumps, etc
745pp
Weight learning in MLPs



Similar procedure as for a single layer, except that the
error must be propagated back through the hidden
layers
Gives rise to back propagation learning
The equations for back propagation learning are
derived on pages 745-747 of the text
Performance comparison MLP vs Single
Learning in ANNs
25pp
Readings for Week 10
1.
2.
Russell and Norvig, Chapter 1
http://www.willamette.edu/~gorr/classes/cs449/brain.ht
ml