Transcript: Artificial Neural Networks (ANN)

Artificial neural networks
Biological inspiration
Animals are able to react adaptively to changes in their
external and internal environment, and they use their
nervous system to produce these behaviors.
An appropriate model or simulation of the nervous system
should be able to produce similar responses and
behaviors in artificial systems.
The nervous system is built from relatively simple units,
the neurons, so copying their behavior and functionality
seems a promising approach.
Comparison of Brains and Traditional Computers

Brain:
• 200 billion neurons, 32 trillion synapses
• Element size: 10^-6 m
• Energy use: 25 W
• Processing speed: 100 Hz
• Parallel, distributed
• Fault tolerant
• Learns: yes
• Intelligent/conscious: usually

Traditional computer:
• 1 billion bytes RAM, but trillions of bytes on disk
• Element size: 10^-9 m
• Energy use: 30-90 W (CPU)
• Processing speed: 10^9 Hz
• Serial, centralized
• Generally not fault tolerant
• Learns: some
• Intelligent/conscious: generally no
Biological Inspiration
Idea: to make computers more robust and intelligent, and to let them learn,
let's model our computer software (and/or hardware)
after the brain.
“My brain: It's my second favorite organ.”
- Woody Allen
Biological inspiration
[Diagram: a neuron, with labeled dendrites, soma (cell body), and axon]

Biological inspiration
[Diagram: two neurons; the axon of one meets the dendrites of the other at synapses]
The information transmission happens at the synapses.
Biological inspiration
The spikes travelling along the axon of the pre-synaptic neuron trigger the release of neurotransmitter substances at the synapse.
The neurotransmitters cause excitation or inhibition
in the dendrites of the post-synaptic neuron.
The integration of the excitatory and inhibitory
signals may produce spikes in the post-synaptic
neuron.
The contribution of the signals depends on the
strength of the synaptic connection.
Artificial neurons
Neurons work by processing information. They receive
and provide information in the form of spikes.

[Diagram: the McCulloch-Pitts model. Inputs x_1, x_2, ..., x_n arrive with synaptic weights w_1, w_2, ..., w_n; the neuron computes the weighted sum and produces the output y:]

z = Σ_{i=1}^{n} w_i x_i ;   y = H(z)

where H is the Heaviside step function.
Artificial neurons
The McCulloch-Pitts model:
• spikes are interpreted as spike rates;
• synaptic strengths are translated into synaptic weights;
• excitation means a positive product between the incoming spike rate and the corresponding synaptic weight;
• inhibition means a negative product between the incoming spike rate and the corresponding synaptic weight.
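To make this concrete, here is a minimal sketch of a McCulloch-Pitts unit in Python; the threshold value and the example inputs are illustrative assumptions, not taken from the slides.

```python
def mcculloch_pitts(x, w, threshold=0.0):
    """McCulloch-Pitts unit: weighted sum of inputs, then a step function."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # z = sum_i w_i * x_i
    return 1 if z >= threshold else 0         # y = H(z)

# One excitatory (positive-weight) and one inhibitory (negative-weight) input:
print(mcculloch_pitts(x=[1, 1], w=[0.7, -0.4]))  # z = 0.3 >= 0, so y = 1
```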
Biological Neural Network vs. Artificial Neural Network:

Soma      ->  Neuron
Dendrite  ->  Input
Axon      ->  Output
Synapse   ->  Weight
Artificial neurons
Nonlinear generalization of the McCulloch-Pitts neuron:

y = f(x, w)

where y is the neuron's output, x is the vector of inputs,
and w is the vector of synaptic weights.
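As a sketch, with the common choice of a sigmoid for f (the specific nonlinearity is an assumption; the model only requires some nonlinear f):

```python
import math

def neuron(x, w, f=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """Generalized neuron y = f(x, w): weighted sum through a nonlinearity f."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return f(a)

print(neuron([0.5, -1.0], [2.0, 1.0]))  # a = 0.0, so the sigmoid gives 0.5
```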
Artificial neural networks
[Diagram: many interconnected neurons transforming inputs into an output]
An artificial neural network is composed of many artificial
neurons that are linked together according to a specific network
architecture. The objective of the neural network is to transform
the inputs into meaningful outputs.
Artificial neural networks
Tasks to be solved by artificial neural networks:
• controlling the movements of a robot based on
self-perception and other information (e.g., visual
information);
• deciding the category of potential food items
(e.g., edible or non-edible) in an artificial world;
• recognizing a visual object (e.g., a familiar face);
• predicting where a moving object will go when a
robot wants to catch it.
Learning in biological systems
Learning = learning by adaptation
The young animal learns that the green fruits are
sour, while the yellowish/reddish ones are sweet. The
learning happens by adapting the fruit picking
behavior.
At the neural level, learning happens by changing
the synaptic strengths, eliminating some synapses,
and building new ones.
Learning in biological neural
networks
The learning rules of Hebb:
• synchronous activation increases the synaptic
strength;
• asynchronous activation decreases the synaptic
strength.
These rules fit with energy minimization principles:
maintaining synaptic strength costs energy, so strength
should be maintained at those places where it is needed,
and not maintained at places where it is not.
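A minimal sketch of Hebb's rules as a weight update, assuming binary (0/1) activities and an illustrative learning rate:

```python
def hebb_update(w, pre, post, lr=0.1):
    """Hebb's rules: synchronous pre/post activation strengthens a synapse,
    asynchronous activation weakens it, joint silence leaves it unchanged."""
    new_w = []
    for wi, xi in zip(w, pre):
        if xi == 1 and post == 1:    # synchronous activation
            new_w.append(wi + lr)
        elif xi != post:             # asynchronous activation
            new_w.append(wi - lr)
        else:                        # both silent
            new_w.append(wi)
    return new_w

print(hebb_update([0.5, 0.5], pre=[1, 0], post=1))  # ~[0.6, 0.4]
```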
Learning principle for
artificial neural networks
ENERGY MINIMIZATION
We need an appropriate definition of energy for
artificial neural networks, and having that we can
use mathematical optimisation techniques to find
how to change the weights of the synaptic
connections between neurons.
ENERGY = measure of task performance error
• Supervised learning
one set of observations, called inputs, is assumed
to be the cause of another set of observations,
called outputs.
• Unsupervised learning
can be used for bridging the causal gap between
input and output observations.
The Weights
• In supervised learning:
– Examples are presented along with the corresponding desired output vectors.
– Weights are adjusted with each iteration until the actual output for each input is close to the desired vector.
• In unsupervised learning:
– Examples are presented without any corresponding desired output vectors.
– Weights are adjusted in accordance with naturally occurring patterns in the data, using a suitable training algorithm.
– The output vector represents the position of the input vector within the patterns discovered in the data.
• When presented with noisy or incomplete data, an ANN produces an approximate answer rather than an incorrect one.
• When presented with unfamiliar data within the range of its previously seen examples, an ANN will generally produce a reasonable output, interpolated between the example outputs.
• However, an ANN is unable to extrapolate reliably beyond the range of the previously seen examples.
• Fuzzy logic can also be used for interpolation. ANNs and fuzzy logic are therefore alternative solutions to many engineering problems and may be combined in a hybrid system.
ANN applications
• ANNs can be applied to many tasks.
• An ANN associates an input vector (x1, x2, ..., xn) with an output vector (y1, y2, ..., ym).
• The function linking the input and output may be unknown and can be highly nonlinear.
– A linear function is one that can be represented as f(x) = mx + c, where m and c are constants;
– a nonlinear one may include higher-order terms in x, or trigonometric or logarithmic functions of x.
Application 1: Nonlinear estimation
• ANN techniques can determine values of variables that cannot be measured easily but are known to depend on other, more accessible variables.
• The measurable variables form the network input vector, and the unknown variables constitute the output vector.
• In nonlinear estimation, the network is initially trained using a set of examples known as the training data.
• Supervised learning is used;
– i.e., each example in the training data comprises two vectors: an input vector and its corresponding desired output vector.
– This assumes that some values for the less accessible variables have been obtained to form the desired outputs.
Application 1: Nonlinear estimation
• During training, the network learns to associate the example
input vectors with their desired output vectors.
• When it is subsequently presented with a previously unseen
input vector, the network is able to interpolate between similar
examples in the training data to generate an output vector.
Classification
• The output vector classifies the input into one of a set of known possible classes.
• Example: speech recognition system:
– Classify input into 3 different words: yes, no, and maybe.
– Input: Preprocessed digitized sound of the words
– Output: (0, 0, 1) for yes
(0, 1, 0) for no
(1, 0, 0) for maybe.
– During training, the network learns to associate similar input vectors
with a particular output vector.
– When it is subsequently presented with a previously unseen input
vector, the network selects the output vector that offers the closest
match.
• Example: a single threshold unit computes

X = Σ_{i=1}^{n} x_i w_i

Y = +1, if X ≥ θ
Y = −1, if X < θ
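In Python, this threshold unit looks as follows (the example weights and the threshold θ are made up for illustration):

```python
def threshold_unit(x, w, theta):
    """Y = +1 if the weighted sum X reaches the threshold theta, else -1."""
    X = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if X >= theta else -1

print(threshold_unit([1, 0], [0.6, 0.3], theta=0.5))  # X = 0.6, so Y = +1
```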
Clustering
• Unsupervised learning
• Input vectors are clustered into N groups (N is an integer; it may be pre-specified or may be allowed to grow according to the diversity of the data).
• Example: in speech recognition
– Input: only spoken words
– Training: cluster together examples that are similar to each other (e.g., according to different words or voices).
– Once the clusters have formed, a second neural network is trained to associate each
cluster with a particular desired output.
– The overall system then becomes a classifier, where the first network is unsupervised and
the second one is supervised.
– Clustering is useful for data compression and is an important aspect of data mining, i.e.,
finding patterns in complex data.
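One common neural clustering scheme is competitive (winner-take-all) learning; the sketch below illustrates that idea rather than the slides' specific method, and its data and parameters are invented:

```python
import math, random

def competitive_clustering(data, n_clusters=2, lr=0.2, epochs=20, seed=0):
    """Winner-take-all clustering: each prototype is pulled toward the
    inputs it wins, so prototypes drift toward the cluster centres."""
    random.seed(seed)
    protos = [p[:] for p in random.sample(data, n_clusters)]
    for _ in range(epochs):
        for x in data:
            # The closest prototype "wins" this input...
            win = min(range(n_clusters), key=lambda k: math.dist(protos[k], x))
            # ...and only the winner moves toward it.
            protos[win] = [p + lr * (xi - p) for p, xi in zip(protos[win], x)]
    return protos

data = [[0.1, 0.2], [0.0, 0.1], [0.9, 0.8], [1.0, 0.9]]
print(competitive_clustering(data))  # one prototype near each clump
```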
Content-addressable memory
• A form of unsupervised learning: no desired output vectors are associated with the training data. During training, each example input vector becomes stored in a dispersed form through the network.
• When a previously unseen vector is subsequently presented to the network, it is treated as though it were an incomplete or error-ridden version of one of the stored examples.
• So the network regenerates the stored example that most closely resembles the presented vector.
• This can be thought of as a type of classification, where each of the examples in the training data belongs to a separate class, and each represents the ideal vector for that class.
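A classical realization of content-addressable memory is the Hopfield network; the sketch below (Hebbian storage of bipolar patterns, then iterative recall) is an assumption about the mechanism, since the slides do not name a specific model:

```python
def train_hopfield(patterns):
    """Store bipolar (+1/-1) patterns in a weight matrix via outer products."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def recall(W, x, steps=5):
    """Repeatedly update units so the state settles on a stored pattern."""
    x = x[:]
    for _ in range(steps):
        for i in range(len(x)):
            s = sum(W[i][j] * x[j] for j in range(len(x)))
            x[i] = 1 if s >= 0 else -1
    return x

stored = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
W = train_hopfield(stored)
print(recall(W, [1, -1, 1, -1, 1, 1]))  # error-ridden probe -> first pattern
```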
Nodes and interconnections
• A node, or neuron, is a simple computing element having an input side and an output side.
• Each node may have directional connections to many other nodes at both its input and output sides.
• Each input xi is multiplied by its associated weight wi.
• Typically, the node's role is to sum each of its weighted inputs and add a bias term w0 to form an intermediate quantity called the activation, a.
• It then passes the activation through a nonlinear function ft known as the transfer function or activation function. The figure shows the function of a single neuron.
• The input layer represents the inputs (raw data).
• The hidden layer's job is to transform the inputs into something that the output layer can use.
• The output layer transforms the hidden-layer activations into whatever scale the output should be on.
Nodes and interconnections
• The behavior of a neural network depends on its topology, the weights, the bias terms, and the transfer function.
• The weights and biases can be learned, and the learning behavior of a network depends on the chosen training algorithm.
• Typically a sigmoid function is used as the transfer function.
• For each neuron, the activation is given by:

a = w0 + Σ_{i=1}^{n} w_i x_i

where n is the number of inputs and the bias term w0 is defined separately for each node.
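A single node with a bias term and a sigmoid transfer function, as a minimal sketch (the example weights are made up):

```python
import math

def node_output(x, w, w0):
    """One node: activation a = w0 + sum_i w_i x_i, then sigmoid transfer."""
    a = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

print(node_output([1.0, 0.5], [0.4, -0.2], w0=0.1))  # sigmoid(0.4) ~ 0.599
```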
Typical Transfer Functions
• Non-linear transfer functions:

Step function:    Y = 1 if X ≥ 0;  Y = 0 if X < 0
Sign function:    Y = +1 if X ≥ 0;  Y = −1 if X < 0
Sigmoid function: Y = 1 / (1 + e^(−X))
Linear function:  Y = X

[Figure: plots of the step, sign, sigmoid, and linear functions]
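The four transfer functions, transcribed directly into Python:

```python
import math

def step(X):    return 1 if X >= 0 else 0
def sign(X):    return 1 if X >= 0 else -1
def sigmoid(X): return 1.0 / (1.0 + math.exp(-X))
def linear(X):  return X

# Sample each function on either side of X = 0.
for f in (step, sign, sigmoid, linear):
    print(f.__name__, f(-1.0), f(1.0))
```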
MultiLayer Perceptron (MLP)
• The neurons are organized in layers.
• Each neuron is fully connected to the neurons in the layers above and below, but not to the neurons in the same layer.
• These networks are also called feedforward networks.
• MLPs can be used either for classification or as nonlinear estimators.
• The number of nodes in each layer and the number of layers are determined by the network builder, often on a trial-and-error basis.
• There is always an input layer and an output layer; the number of nodes in each is determined by the number of inputs and outputs being considered.
MultiLayer Perceptron (MLP)
• An MLP can have any number of hidden layers between the input and output layers.
• Hidden layers have no obvious meaning associated with them.
• If there are no hidden layers, the network is a single-layer perceptron (SLP).
• The network shown has:
– three input nodes;
– two hidden layers with four nodes each;
– one output layer of two nodes;
– its short-form name is 3–4–4–2 MLP.
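A forward pass through such a 3–4–4–2 MLP, as a sketch (the sigmoid transfer and the random initial weights are illustrative assumptions):

```python
import math, random

def layer(x, weights, biases):
    """One fully connected layer with a sigmoid transfer function."""
    return [1.0 / (1.0 + math.exp(-(b + sum(w * xi for w, xi in zip(ws, x)))))
            for ws, b in zip(weights, biases)]

def make_layer(n_in, n_out):
    """Random weights and biases for one layer (illustrative initialization)."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

random.seed(0)
# 3 inputs -> 4 hidden -> 4 hidden -> 2 outputs.
layers = [make_layer(3, 4), make_layer(4, 4), make_layer(4, 2)]

x = [0.2, 0.7, -0.1]
for weights, biases in layers:
    x = layer(x, weights, biases)
print(x)  # two output activations, each in (0, 1)
```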
Perceptrons as classifiers
• Normally there is one input node for each element of the input vector
and one output node for each element of the output vector.
• Each output node would usually represent a particular class.
• A typical representation for a class would be
– ~1 for that class and ~0 for the rest, e.g.
– (0, 0, 1) for yes
(0, 1, 0) for no
(1, 0, 0) for maybe.
• Other representations are possible, such as two output nodes representing four classes, e.g. (0,0), (0,1), (1,0), and (1,1).
Linear classifiers
• The output, prior to application of the transfer function, is given by:

a = w0 + w1 x1 + w2 x2 + ... + wn xn

• The dividing criterion is assumed to be a = 0, corresponding to an output of 0.5 after the application of the sigmoid transfer function.
• Thus the hyperplane that separates the two regions is given by:

w0 + w1 x1 + w2 x2 + ... + wn xn = 0

• Example: single-layer perceptron
– Input: 2 neurons
– Output: 3 classes
– Each class has one dividing line
– Linearly separable
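A sketch of a two-input linear classifier and its dividing line (the weights are invented; with w = (1, 1) and w0 = −1 the hyperplane is x1 + x2 − 1 = 0):

```python
import math

def classify(x, w, w0):
    """Sigmoid of the activation; a = 0 (output 0.5) is the dividing line."""
    a = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))

w, w0 = [1.0, 1.0], -1.0
print(classify([0.2, 0.3], w, w0))  # a = -0.5 -> output < 0.5, one side
print(classify([0.8, 0.9], w, w0))  # a = +0.7 -> output > 0.5, other side
```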
Training a perceptron
• Training separates the regions in state space by adjusting the network's weights and biases.
• The difference between the generated value and the desired value is the error.
• The overall error is expressed as the root mean square (RMS) of the individual errors (both negative and positive).
• Training minimizes the RMS error by altering the weights and biases through many passes of the training data.
• This search for the weights and biases that give the minimum RMS error is an optimization problem, with the RMS error as the cost function.
• When the RMS error settles within a small range, we say that the network has converged.
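The RMS error in Python; the sample values are the XOR figures quoted later in these slides (their sum of squared errors is the 0.0010 shown there):

```python
import math

def rms_error(desired, actual):
    """Root mean square of the output errors (squaring removes the sign)."""
    return math.sqrt(sum((d - a) ** 2 for d, a in zip(desired, actual))
                     / len(desired))

print(rms_error([0, 1, 1, 0], [0.0155, 0.9849, 0.9849, 0.0175]))  # ~0.0158
```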
Training Algorithm
• The most common is the back-error propagation (BP) algorithm (or generalized delta rule).
• It is a gradient-proportional descent technique used with a continuous and differentiable transfer function such as the sigmoid.
• For the sigmoid function

f(a) = 1 / (1 + e^(−a)),

the derivative is

f′(a) = f(a) (1 − f(a))
BP algorithm
 Two stages:
 Gather the error terms
 Update the weights
 Repeat as many times as required
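As a sketch of the two stages, here is back-propagation for a small 2-2-1 network learning XOR, the example tabulated below (the network size, learning rate, seed, and epoch count are illustrative choices):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_xor(epochs=20000, lr=0.5, seed=1):
    """Back-propagation: forward pass, gather error terms (deltas),
    then update the weights; repeat over the training data."""
    random.seed(seed)
    wh = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    bh = [random.uniform(-1, 1) for _ in range(2)]   # hidden weights/biases
    wo = [random.uniform(-1, 1) for _ in range(2)]
    bo = random.uniform(-1, 1)                       # output weights/bias
    data = [([1, 1], 0), ([0, 1], 1), ([1, 0], 1), ([0, 0], 0)]
    for _ in range(epochs):
        for x, yd in data:
            # Forward pass.
            h = [sigmoid(bh[j] + sum(wh[j][i] * x[i] for i in range(2)))
                 for j in range(2)]
            y = sigmoid(bo + sum(wo[j] * h[j] for j in range(2)))
            # Stage 1: gather error terms, using f'(a) = f(a)(1 - f(a)).
            d_o = (yd - y) * y * (1 - y)
            d_h = [d_o * wo[j] * h[j] * (1 - h[j]) for j in range(2)]
            # Stage 2: update the weights and biases.
            for j in range(2):
                wo[j] += lr * d_o * h[j]
                for i in range(2):
                    wh[j][i] += lr * d_h[j] * x[i]
                bh[j] += lr * d_h[j]
            bo += lr * d_o
    return wh, bh, wo, bo

# Note: XOR training can stall in a local minimum for unlucky seeds;
# re-run with another seed if the outputs do not approach 0 and 1.
wh, bh, wo, bo = train_xor()
for x, yd in [([1, 1], 0), ([0, 1], 1), ([1, 0], 1), ([0, 0], 0)]:
    h = [sigmoid(bh[j] + sum(wh[j][i] * x[i] for i in range(2)))
         for j in range(2)]
    y = sigmoid(bo + sum(wo[j] * h[j] for j in range(2)))
    print(x, yd, round(y, 4))
```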
BP
[Figure: a three-layer feedforward network. Input signals x1 ... xn enter the input layer, pass through hidden-layer nodes via weights wij and on through weights wjk to output-layer nodes producing y1 ... yl; error signals propagate backwards from the output layer through the hidden layer.]
Inputs          Desired output   Actual output   Error
x1    x2        yd               y5              e
1     1         0                0.0155          −0.0155
0     1         1                0.9849           0.0151
1     0         1                0.9849           0.0151
0     0         0                0.0175          −0.0175

Sum of squared errors: 0.0010
Hierarchical Perceptrons
• For complex problems it is recommended to divide the MLP into several smaller MLPs arranged in a hierarchy.
• Each MLP is independent of the others and can be trained separately or in parallel.
Some Practical Considerations
• Stop training once the RMS error stops improving, so as not to overtrain the network (an overtrained network is expert at giving correct outputs for the training data, but not for new data).
• Some causes of overtraining:
– too many cycles of training;
– an over-complex network (too many hidden layers or neurons).
• To avoid overtraining:
– divide the data into training, testing, and validation sets;
– use the leave-one-out method;
– use scaled data.
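A common way to implement the stopping rule is early stopping on a validation set; this sketch assumes `train_step` and `val_error` are callables supplied by whatever network is being trained (both are placeholders, not from the slides):

```python
def train_with_early_stopping(train_step, val_error, max_epochs=1000,
                              patience=10):
    """Stop when the validation error has not improved for `patience` epochs,
    to avoid overtraining on the training data."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                 # one pass over the training data
        err = val_error()            # RMS error on the validation set
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                    # validation error stopped improving
    return best

# Illustrative run with a fake error curve that bottoms out and then rises:
errs = iter([0.5, 0.3, 0.2, 0.15, 0.14, 0.15, 0.16, 0.18] + [0.2] * 100)
print(train_with_early_stopping(lambda: None, lambda: next(errs)))  # 0.14
```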
Effects of Overtraining