Transcript Slides

CSM6120
Introduction to Intelligent Systems
Neural Networks
[email protected]
Sub-symbolic learning

When we use some sort of rule-based system, we generally
understand the rules

We understand the conclusions it draws, because it can tell us
how it reached them

When a system learns such rules, processes them in an
understood way, and comes up with an understandable result,
it is known as symbolic learning
Sub-symbolic learning

Some systems of learning – which we are coming to – operate
in quite a different way

Sub-symbolic learning is learning where we don't really
understand, or have control over, the process by which the
system comes to its conclusions

Though we could, with considerable effort and vast amounts of
time, find out

This includes neural networks, genetic algorithms, genetic
programming and, to some extent, statistical methods
Artificial neural networks

ANNs for short

We use the word “Artificial” to distinguish them from
biological neural networks

They come in various shapes and sizes

Inputs (variables), outputs (results)

May be one or more outputs
Neural networks (applications)

Face recognition
Event recognition, tracking
Time series prediction
Process control
Optical character recognition
Handwriting recognition
Robotics, movement
Games: car control, etc
Etc…
ANNs statistical?

We all think of ANNs as pure AI, right?

There is much published literature which claims that ANNs are
really no more than non-linear regression models, and that
they can be implemented using standard statistical software

But there is a difference: randomness
Neurons

Original ANN research aimed at modelling networks of
real neurons in the brain

The brain is massively parallel

10^10 neurons in the brain – an unrealistic target!
One job shared between many neurons
If one goes wrong, no big deal
Known as distributed processing

Fault-tolerant
Patterns within data

ANNs learn to recognise patterns within sets of data

Fault tolerance helps – can still cope with unexpected
input examples, placing emphasis only on what it
considers important
The neuron metaphor

Neurons

Accept information from multiple inputs
Transmit information to other neurons
Multiply inputs by weights along edges

Apply some function to the set of inputs at each node


ANN topology
Nodes (neurons, or perceptrons)

Each input node brings into the network the value of one
independent variable

The hidden layer nodes do most of the work

The output node(s) give us our result
Types of neuron
Linear neuron
Logistic neuron
Perceptron
Activation functions


Transforms a neuron's input into its output
Features of activation functions:

A squashing effect is required

Prevents accelerating growth of activation levels through the network
Simple and easy to calculate, differentiable...?
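As a concrete illustration, a minimal Python sketch (not from the original slides) of the logistic (sigmoid) activation: it squashes any input into (0, 1), is cheap to compute, and has a simple derivative, which matters for gradient-based learning:

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Conveniently expressed via the output itself: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))    # 0.5 – the near-linear region around zero
print(sigmoid(10.0))   # ~0.99995 – saturated ("squashed")
```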
Perceptrons

Formal neuron (McCulloch & Pitts, 1943)


Summation of inputs and threshold functions
Showed how any logical function could be computed with
simple model neurons
Perceptron
Perceptrons

Learning by weight adjustment


Output is 1 if the weighted sum of inputs is greater than a given
threshold, otherwise output is 0
An extra weight (bias) is used as a way of adjusting the
threshold
Perceptron
Alternative notation

The threshold (for these examples) is always = 0

If the sum of the weighted inputs >= 0 then P = 1, else P = 0
[Diagram: inputs A (weight w1) and B (weight w2), plus a constant input X=1 whose weight w0 acts as the bias, feeding the perceptron output P]
Example

If the inputs are binary (0 or 1) then the neuron units act like a
conventional logic module

In this example, P = A or B

Equates to: if 2A + B – 0.5 >= 0 then output 1, else output 0
[Diagram: A with weight 2, B with weight 1, bias input X=1 with weight -0.5, output P]
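A minimal Python sketch of such a threshold unit, checking that the weights above (2 for A, 1 for B, -0.5 on the bias input) really do compute A or B:

```python
def threshold_unit(inputs, weights, bias_weight):
    """Fires (returns 1) if the weighted sum, including the bias
    input fixed at 1, is >= 0."""
    total = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

# Weights from the example: 2A + B - 0.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, '->', threshold_unit((a, b), (2, 1), -0.5))  # reproduces A or B
```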
Example 2

What is being computed here?
[Diagram: as before, but with bias weight -2.5: A with weight 2, B with weight 1, output P]

By only changing the bias, we can model another logical
function = AND (2A + B - 2.5 >= 0 only when A = B = 1)
Example 3

What is being computed here?
[Diagram: a two-layer network of threshold units; inputs A and B feed a hidden unit P and a second unit C, which in turn feed the output Q; the figure's weights are -1.5, 2, 2, -2, -3, -0.5, 1 and 2.5, with bias inputs X=1 on each unit]
Standard backpropagation

The commonest learning rule for ANNs

Made popular by Rumelhart and McClelland in 1986

Connections between nodes given random initial weights

We therefore get a value at the output node(s) which is what
happens when these random weights are applied to the data at
the input

Use the difference between the output and correct values to
adjust weights
Training perceptrons
[Diagram: inputs A and B with unknown weights, bias input X = -1 with unknown weight, output P]

• What are the weight values?
• Initialize with random weights

For AND:
A B | P
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
Training perceptrons
[Diagram: bias input -1 with weight 0.3, A with weight 0.5, B with weight -0.4, output P]

For AND:
A B | P
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

Bias | A | B | Summation                            | Output
 -1  | 0 | 0 | (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3 | 0
 -1  | 0 | 1 | (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7 | 0
 -1  | 1 | 0 | (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2  | 1
 -1  | 1 | 1 | (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2 | 0
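A quick Python check of that table, using the slide's random initial weights (0.3 on the bias, 0.5 for A, -0.4 for B; the bias input here is fixed at -1):

```python
w_bias, w_a, w_b = 0.3, 0.5, -0.4   # random initial weights from the slide

def output(a, b):
    # The bias input is fixed at -1 in this example
    s = -1 * w_bias + a * w_a + b * w_b
    return 1 if s >= 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, '->', output(a, b))  # 0, 0, 1, 0: the last two rows are wrong for AND
```

Two of the four cases come out wrong, which is what the learning rule below will fix.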
Learning algorithm

Epoch: Presentation of the entire training set to the neural
network


In the case of the AND function an epoch consists of four sets of inputs
being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1])
Error: The error value is the amount by which the value
outputted by the network differs from the target value

For example, if we required the network to output 0 and it outputted a
1, then Error = -1
Learning algorithm

Error


The weightings on the connections are then adjusted to try to reduce
this error
Further iterations of weight adjustment proceed until we decide we’re
finished


And that is a study in itself
Epochs


There may be many thousands of epochs in one training run, before a
satisfactory model is achieved
But how many do we need?

The BIG question (or one of them!)
Perceptron learning rule
Weights are updated: wi = wi + Δwi
Δwi = η (t - o) xi

t is the target value
o is the perceptron output
η is a small constant (e.g. 0.12) called the learning rate

For each training example:
• If the output is correct (t = o) the weights wi are not changed
• If the output is incorrect (t ≠ o) the weights wi are changed
such that the output of the perceptron for the new weights
is closer to t
• The algorithm converges to the correct classification
• If the training data is linearly separable
• And η is sufficiently small
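A minimal sketch of this rule in Python, training AND from the earlier worked example (same starting weights and bias input of -1; η = 0.12 as suggested above). With these values it converges after a handful of epochs:

```python
eta = 0.12
weights = [0.3, 0.5, -0.4]   # [bias weight, w_A, w_B], as in the worked example
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # AND

def predict(a, b):
    s = -1 * weights[0] + a * weights[1] + b * weights[2]   # bias input is -1
    return 1 if s >= 0 else 0

for epoch in range(100):
    errors = 0
    for (a, b), t in data:
        o = predict(a, b)
        if o != t:                                  # only wrong outputs change weights
            errors += 1
            for i, x in enumerate((-1, a, b)):      # w_i <- w_i + eta * (t - o) * x_i
                weights[i] += eta * (t - o) * x
    if errors == 0:                                 # reached: AND is linearly separable
        break

print(epoch, weights, [predict(a, b) for (a, b), _ in data])
```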
Search space

It’s often useful to think of ANN training as a marble rolling
across a surface, which has valleys in it

Some valleys may be deeper than others – we want to find a
deep one

When we find it, we stop training
[Diagram: error surface with a local minimum and a deeper global minimum]
Gradient descent learning rule

Consider linear unit without threshold and continuous
output o (not just 0,1 as with perceptrons)


o = w0 + w1 x1 + … + wn xn
We train the wi such that they minimise the squared error

E[w1,…,wn] = ½ Σ_{d∈D} (t_d - o_d)²
where D is the set of training examples
Gradient descent
Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]

Δw = -η ∇E[w]
Δwi = -η ∂E/∂wi

[Diagram: one gradient descent step on the error surface, from (w1, w2) to (w1+Δw1, w2+Δw2)]
Gradient descent (linear units)
Gradient-descent(training_examples, η)

Each training example is a pair of the form <(x1,…,xn), t> where (x1,…,xn) is
the vector of input values, t is the target output value, and η is the learning
rate (e.g. 0.1)

• Initialize each wi to some small random value
• Until the termination condition is met:
  • Initialize each Δwi to zero
  • For each <(x1,…,xn), t> in training_examples:
    • Input the instance (x1,…,xn) to the linear unit and compute the output o
    • For each linear unit weight wi:
      Δwi = Δwi + η (t - o) xi    (update deltas)
  • For each linear unit weight wi:
    wi = wi + Δwi                (update weights)
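The same algorithm as a runnable Python sketch; the toy dataset (targets generated from t = 1 + 2x) is invented here purely for illustration:

```python
import random

# Toy data: one input per example, targets from t = 1 + 2x
examples = [((x,), 1 + 2 * x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
eta = 0.1
w = [random.uniform(-0.05, 0.05) for _ in range(2)]   # [w0 (bias), w1]

for epoch in range(1000):       # fixed number of epochs as the termination condition
    delta = [0.0] * len(w)
    for xs, t in examples:
        o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], xs))   # linear unit output
        delta[0] += eta * (t - o)               # bias input is implicitly 1
        for i, xi in enumerate(xs, start=1):
            delta[i] += eta * (t - o) * xi      # accumulate deltas over the whole batch
    w = [wi + d for wi, d in zip(w, delta)]     # then update the weights

print(w)   # approaches [1.0, 2.0]
```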
Incremental stochastic gradient descent

Two approaches to this:

Batch mode = gradient descent over the entire data D

Incremental mode = gradient descent over individual training
examples d
Incremental gradient descent can approximate batch gradient
descent arbitrarily closely if η is small enough
Perceptron vs gradient descent rule
Perceptron learning rule guaranteed to succeed if

Training examples are linearly separable
Sufficiently small learning rate η

Gradient descent learning for linear units

Guaranteed to converge to hypothesis with minimum squared error
Given sufficiently small learning rate η
Even when training data contains noise
Decision boundaries

In simple cases, divide feature space by drawing a hyperplane
across it
 Known as a decision boundary

Discriminant function: returns different values on opposite
sides (straight line)

Problems which can be thus classified are linearly separable
Linear separability
[Diagram: feature space with axes X1 and X2; class A points and class B points separated by a straight-line decision boundary]
Decision surface of a perceptron
[Diagram: two plots in the (x1, x2) plane; on the left, + and - points separated by a single line (linearly separable); on the right, + and - points that no single line can separate (non-linearly separable)]
• Perceptron is able to represent some useful functions
• AND(x1,x2): choose weights w0 = -1.5, w1 = 1, w2 = 1
• But functions that are not linearly separable (e.g. XOR)
are not representable – we need multilayer perceptrons here
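To make that concrete, a small sketch (with hand-picked weights of my own, not taken from the slides) showing XOR built from threshold units in two layers – OR and NAND in the hidden layer, AND at the output:

```python
def unit(inputs, weights, bias_weight):
    # Threshold unit with a constant bias input of 1
    s = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= 0 else 0

def xor(a, b):
    h1 = unit((a, b), (1, 1), -0.5)       # OR
    h2 = unit((a, b), (-1, -1), 1.5)      # NAND
    return unit((h1, h2), (1, 1), -1.5)   # AND of the two hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, '->', xor(a, b))      # 0 1 1 0
```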
Multilayer networks

Cascade neurons together
The output from one layer is the input to the next
Each layer has its own set of weights
Hidden layer(s)…
Linear regression neural networks

What happens when we arrange linear neurons in a
multilayer network?
Linear regression neural networks

…nothing special happens

The composition of two linear transformations is itself a linear
transformation
Neural networks

We want to introduce non-linearities to the network

Non-linearities allow a network to identify complex regions in space
Linear separability

A 1-layer network cannot handle XOR
More layers can handle more complicated spaces – but
require more parameters
Each node splits the feature space with a hyperplane
If the second layer is AND, a 2-layer network can represent
any convex hull
Separability

[Table/figure: decision regions by network depth, for the exclusive-OR
problem, classes with meshed regions, and the most general region shapes:
a single-layer network gives a half-plane bounded by a hyperplane, a
two-layer network gives convex (open or closed) regions, and a three-layer
network gives arbitrary region shapes]
Feed-forward networks

Predictions are fed forward through the network to
classify
Error backpropagation

Error backpropagation computes the gradient one partial
component (one weight) at a time

The "target" values for each layer come from the next layer
This feeds the errors back along the network
Backpropagation (BP) algorithm

BP employs gradient descent to attempt to minimize the
squared error between the network output values and
the target values for these outputs

Two stage learning


Forward stage: calculate outputs given input pattern
Backward stage: update weights by calculating deltas
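A compact illustrative sketch of the two stages for one hidden layer of sigmoid units, written for this transcript (the network size, learning rate and the XOR toy task are my choices, not from the slides):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: XOR. Inputs are augmented with a constant bias input of 1.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
rnd = random.Random(0)
n_hidden, eta = 3, 0.5
w_h = [[rnd.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
w_o = [rnd.uniform(-1, 1) for _ in range(n_hidden + 1)]

def forward(a, b):
    """Forward stage: calculate outputs given the input pattern."""
    x = (1, a, b)
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
    o = sigmoid(sum(w * hi for w, hi in zip(w_o, [1] + h)))
    return x, h, o

for epoch in range(10000):    # may need more epochs/another seed near a local minimum
    for (a, b), t in data:
        x, h, o = forward(a, b)
        # Backward stage: update weights by calculating deltas
        d_o = o * (1 - o) * (t - o)
        d_h = [hi * (1 - hi) * w_o[i + 1] * d_o for i, hi in enumerate(h)]
        for i, hi in enumerate([1] + h):
            w_o[i] += eta * d_o * hi
        for j in range(n_hidden):
            for i, xi in enumerate(x):
                w_h[j][i] += eta * d_h[j] * xi

print([round(forward(a, b)[2], 2) for (a, b), _ in data])   # should approach 0, 1, 1, 0
```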
Termination conditions for BP

The weight update loop may be iterated thousands of times in
a typical application

The choice of termination condition is important because

Too few iterations can fail to reduce error sufficiently
Too many iterations can lead to overfitting the training data (see later!)
Termination criteria

After a fixed number of iterations (epochs)
Once the training error falls below some threshold
Once the validation error meets some criterion
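The third criterion is the basis of early stopping. A generic sketch (the two callables are supplied by the caller; names and defaults are illustrative):

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=10000, patience=20):
    """Keep the weights from the epoch with the lowest validation error,
    and stop once it has failed to improve for `patience` epochs.

    train_one_epoch() runs one epoch of updates and returns a snapshot
    of the weights; validation_error() measures error on a held-out set.
    """
    best_err, best_snapshot, waited = float('inf'), None, 0
    for epoch in range(max_epochs):
        snapshot = train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_snapshot, waited = err, snapshot, 0
        else:
            waited += 1
            if waited >= patience:    # validation error has stopped improving
                break
    return best_snapshot, best_err
```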
Why BP works in practice
A possible scenario

Weights are initialized to values near zero

Early gradient descent steps will represent a very smooth
function (approximately linear). Why?

The sigmoid function is almost linear when the total input (the
weighted sum of inputs to a sigmoid unit) is near 0

The weights gradually move close to the global minimum

As weights grow in the later stages of learning, they represent
highly non-linear network functions

Gradient steps in this later stage move toward local minima in
this region, which is acceptable
Representational power of MLPs

Every boolean function can be represented exactly by some
network with two layers of units


Note: The number of hidden units required may grow exponentially with
the number of network inputs
Continuous functions

Every bounded continuous function can be approximated with arbitrarily
small error by a network with one hidden layer (Cybenko 1989, Hornik
1989)
Any function can be approximated to arbitrary accuracy by a network
with two hidden layers (Cybenko 1988)
The hidden layer

Without this, the network probably won’t perform well


No hidden layer means a linear model
But it’s another big question!

How many nodes do we need in the hidden layer?


And might it perform better with more than one?



http://vision.eng.shu.ac.uk/neural/FAQ/FAQ3.html#A_hu
http://vision.eng.shu.ac.uk/neural/FAQ/FAQ3.html#A_hl
(answers to both: in general, we don't know – experiment and see how
it works for your data)
See also http://www.heatonresearch.com/node/707
Interpretation of hidden layers

What are the hidden layers actually doing?!

Essentially feature extraction

The non-linearities in the feature extraction can make
interpretation of the hidden layers very difficult

This leads to neural networks being treated as black boxes
Overfitting

We’re trying to form a model using some data, which we can
then use to predict the result of other data


If we train too much, we end up with a model which can near perfectly
describe the data it’s seen, but can’t generalise
This model is said to have overtrained/overfitted

Very important factor in machine learning!
Overfitting example

We wanted the model to be able to predict the x points (the
model would have been simpler)


But it overtrained on the o points (this is a more complicated model)
Clearly this is a very exaggerated example…
[Figure: y plotted against x; o points (training set) and x points (test set), with two curves – the desired, simpler fit and the overfitted curve that chases every training point]
Stopping the training

So we have to stop the ANN training after it’s found a good
model, but before it goes too far

Since all datasets have different characteristics, we have to
experiment to find out what is optimal

This overfitting problem occurs in other AI methods too and
you must recognise it
Avoiding overfitting

In order to avoid overfitting, we need to make sure we validate
our model: an overfitted model cannot generalise for unseen
data

Use your query (test) set to check: does it still work?

If so, you can assume your model is OK and will work on
further unseen data

This assumes there is some more data of course!
Advantages of ANNs

No programming (training instead)

Very fast in operation (parallel)

Can generalize over examples

Can tolerate noise

Degrades gracefully
Disadvantages of ANNs

Needs large training sets



Training can take a long time
Both positive and negative examples needed
Supervised learning (desired results known)

Can be unpredictable on untrained data

Can’t explain results
When to consider ANNs

Input is high-dimensional discrete or real-valued

E.g. raw sensor data

Output is discrete or real-valued

Possibly noisy data

Form of target function is unknown

Human readability of result is unimportant
Example: Handwriting recognition

Demo:
http://www.cs.washington.edu/education/courses/cse473/06sp/NeuralNetsHandwriting/JRec.html

http://www.heatonresearch.com/encog/benchmark.html
More to try

Pythia




Evaluation copy, limited to 13 neurons
But good examples supplied
http://download.cnet.com/Pythia/3000-2651_4-192584.html
Sharky


Nice graphical displays, but more difficult to see what's
happening with the data
http://sharktime.com/
Other types of neural network

Multiple outputs
Skip layer networks
Recurrent neural networks
…
Multiple outputs
• Used for N-way classification
• Each node in the output layer corresponds to a different class
• No guarantee that the sum of the output vector will equal 1
Skip layer network

Input nodes are also sent directly to the output layer
Recurrent neural networks

Output or hidden layer information is stored in a context
or memory layer
[Diagram: input, hidden and output layers, with a context layer that stores hidden/output information and feeds it back into the network]
Time Delayed Recurrent Neural Networks

Outputs from time t are used as inputs to the hidden
layer at time t+1

[Diagram: input, hidden and output layers, with the output fed back to the hidden layer, with an optional decay]
Radial Basis Functions (RBFs)

Features

One hidden layer
Several outputs

[Diagram: inputs feeding a layer of radial units, which feed the outputs]
Radial Basis Functions (RBFs)

RBF hidden layer units have a receptive field which has a
centre

Generally, the hidden unit function is Gaussian

The output layer is linear

Realized function

s(x) = Σ_{j=1}^{K} W_j φ(‖x - c_j‖)

where φ(‖x - c_j‖) = exp( -‖x - c_j‖² / σ_j² )
Learning

The training is performed by deciding on

How many hidden nodes there should be
The centres and the sharpness (widths) of the Gaussians

Two steps:

In the 1st stage, the input dataset is used to determine the parameters
of the basis functions
In the 2nd stage, the basis functions are kept fixed while the second
layer weights are estimated (a simple backpropagation-style algorithm,
as for MLPs)
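A minimal sketch of both stages in Python; the toy 1-D task, the choice of centres (random training points) and the shared width are illustrative assumptions, and the second-layer weights are fitted with the delta rule on the linear output layer:

```python
import math, random

rnd = random.Random(1)
# Toy 1-D task: learn y = sin(x) on [0, pi]
data = [(i * math.pi / 20, math.sin(i * math.pi / 20)) for i in range(21)]

# Stage 1: fix the basis function parameters from the input data
K = 6
centres = rnd.sample([x for x, _ in data], K)   # centres = random training inputs
sigma = 0.8                                     # shared width (a simple choice)

def phi(x, c):
    return math.exp(-((x - c) ** 2) / sigma ** 2)

def rbf_output(x, w):
    return sum(wj * phi(x, c) for wj, c in zip(w, centres))   # linear output layer

# Stage 2: basis functions kept fixed; estimate the output weights
w = [0.0] * K
eta = 0.05
for epoch in range(2000):
    for x, t in data:
        o = rbf_output(x, w)
        h = [phi(x, c) for c in centres]
        for j in range(K):
            w[j] += eta * (t - o) * h[j]        # delta rule on the linear layer

print([(round(x, 2), round(rbf_output(x, w), 2)) for x, _ in data[::5]])
```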
MLPs versus RBFs

Classification

MLPs separate classes via hyperplanes
RBFs separate classes via hyperspheres

Learning

MLPs use distributed learning
RBFs use localized learning
RBFs train faster

Structure

MLPs have one or more hidden layers
RBFs have only one hidden layer
RBFs require more hidden neurons

[Diagrams: the same two-class data in the (X1, X2) plane, separated by hyperplanes for the MLP and by hyperspheres for the RBF]
Kohonen ANNs

= Self-organising maps (SOMs)

What is it?




It’s a neural network without any training outputs!
(unsupervised learning)
It doesn’t need to be told how to form its model; it organises
itself
A bit like PCA in that respect
http://www.cis.hut.fi/projects/somtoolbox/theory/somalgorithm.shtml
Self organizing maps

The purpose of SOMs is to map a multidimensional input
space onto a topology-preserving map of neurons


Preserve topology so that neighboring neurons respond to “similar”
input patterns
The topological structure is often a 2 or 3 dimensional space

Each neuron is assigned a weight vector with the same
dimensionality as the input space

Input patterns are compared to each weight vector and the
closest wins (Euclidean distance)

The activation of the neuron is spread in its direct
neighbourhood => neighbours become sensitive to the same
input patterns

The size of the neighbourhood is initially large but reduces
over time => specialization of the network

[Diagram: a grid of neurons with the winner at the centre; the first
and 2nd neighbourhoods are rings around it, measured by block distance]
Adaptation

During training, the “winner” neuron and its neighbourhood
adapt to make their weight vectors more similar to the input
pattern that caused the activation

The neurons are moved closer to the input pattern

The magnitude of the adaptation is controlled via a learning
parameter which decays over time
The algorithm

1. Randomise each output node’s weights
2. Choose a training input at random
3. See which output matches it most closely (this is the BMU:
Best Matching Unit)
4. Adjust the weights of all neighbouring nodes (nodes within
the radius of the BMU) to make them more like the training
input

The closer to the BMU, the greater the change

5. Repeat from step 2
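A minimal SOM sketch in Python following these steps; the grid size, toy 3-D inputs (think RGB colours) and the decay schedules for the radius and learning rate are illustrative choices:

```python
import math, random

rnd = random.Random(0)
GRID, DIM = 8, 3   # 8x8 map of neurons; 3-D inputs (e.g. colours)

# Step 1: randomise each output node's weights
weights = {(i, j): [rnd.random() for _ in range(DIM)]
           for i in range(GRID) for j in range(GRID)}
inputs = [[rnd.random() for _ in range(DIM)] for _ in range(200)]   # toy data

def bmu(x):
    # Step 3: the Best Matching Unit minimises Euclidean distance to x
    return min(weights, key=lambda n: sum((w - xi) ** 2 for w, xi in zip(weights[n], x)))

T = 2000
for t in range(T):
    x = rnd.choice(inputs)                    # step 2: random training input
    win = bmu(x)
    radius = (GRID / 2) * (1 - t / T) + 1     # neighbourhood shrinks over time
    eta = 0.5 * (1 - t / T)                   # learning rate decays over time
    for n, w in weights.items():              # step 4: adjust nodes within the radius
        dist = math.hypot(n[0] - win[0], n[1] - win[1])
        if dist <= radius:
            influence = math.exp(-dist ** 2 / (2 * radius ** 2))   # closer => bigger change
            weights[n] = [wi + eta * influence * (xi - wi)
                          for wi, xi in zip(w, x)]
# Step 5 is the loop itself; after training, nearby nodes hold similar vectors
```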
The radius

This defines which nodes are to be thought of as near to
the winning node (the BMU)

Initially (usually) it will encompass all the nodes in the
map

And decrease to 0 as training proceeds
Testing

We can then present to the network a new, unseen input

The node which it falls closest to will be deemed to be its
classification
In action…

http://fbim.fhregensburg.de/~saj39122/begrolu/kohonen.html


a good applet demonstration of a Kohonen net training
http://www.generation5.org/jdk/demos.asp

(scroll down): applet running Kohonen net on colours. Note
how it’s clear that radius decreases over time
Further reading

Follow up all the links in the slides

Read Russell & Norvig, chapter 20 (especially the Neural
Network section), but don't worry too much about the
maths at this stage

Useful resources

http://www.ai-junkie.com/ann/som/som1.html


Good tutorial
Enough information to write your own Kohonen net!