ARTIFICIAL INTELLIGENCE
[INTELLIGENT AGENTS PARADIGM]
LEARNING IN NEURAL NETWORKS
Professor Janis Grundspenkis
Riga Technical University
Faculty of Computer Science and Information Technology
Institute of Applied Computer Systems
Department of Systems Theory and Design
E-mail: [email protected]
LEARNING IN NEURAL NETWORKS (1)
• An (artificial) neural network is composed of a number of
simple arithmetic computing elements, or nodes (units),
connected by links.
– Synonyms: connectionism, parallel distributed processing, and
neural computation.
• From a biological viewpoint, a neural network is a
mathematical model of the operation of the brain.
• The nodes of a neural network correspond to neurons.
• A neuron is a cell in the brain whose principal function is
the collection, processing, and dissemination of electrical
signals.
LEARNING IN NEURAL NETWORKS (2)
[Figure: a biological neuron — the cell body (soma) with its nucleus, dendrites, an axon with axonal arborization, synapses, and an axon arriving from another cell.]
• The brain's information capacity is thought to
emerge primarily from networks of such
neurons.
COMPARING BRAINS WITH COMPUTERS (1)

                      Computer                          Human Brain
Computational units   1 CPU, 10^8 gates                 10^11 neurons
Storage units         10^10 bits RAM, 10^11 bits disk   10^11 neurons, 10^14 synapses
Cycle time            10^-9 sec                         10^-3 sec
Bandwidth             10^10 bits/sec                    10^14 bits/sec
Memory updates/sec    10^9                              10^14
COMPARING BRAINS WITH COMPUTERS (2)
COMPARISON OF MEMORY
• The human brain is evolving very slowly, whereas
computer memories are growing rapidly.
COMPARISON OF SWITCHING SPEED AND PARALLELISM
• Computer chips can execute an instruction in tens
of nanoseconds, whereas neurons require
milliseconds to fire.
• Most current computers have only one or at most
a few CPUs, whereas all neurons and synapses are
active simultaneously (parallel processing).
COMPARING BRAINS WITH COMPUTERS (3)
• A neural network running on a serial computer
requires hundreds of cycles to decide if a single
neuron-like unit will fire, whereas in a human
brain, all the neurons do this in a single step.
• CONCLUSION: Even though a computer is a
million times faster in raw switching speed, the
brain ends up being a billion times faster at what it
does.
COMPARING BRAINS WITH COMPUTERS (4)
• Brains are more fault-tolerant than computers.
• Brains are constantly faced with novel input, yet manage
to do something with it. Computer programs rarely work
as well with novel input, unless the programmer has been
exceptionally careful.
• The attraction of neural networks is graceful degradation:
they tend to have a gradual rather than sharp drop-off in
performance as conditions worsen.
• Another attraction of neural networks is that they are
designed to be trained using an inductive learning
algorithm.
A MATHEMATICAL MODEL FOR A NEURON (1)
• A simple mathematical model of the neuron
was devised by McCulloch and Pitts (1943).
[Figure: a unit's mathematical model — input links carry activations a_j, weighted by W_j,i (plus a bias weight W_0,i on the fixed input a_0 = -1), into the input function in_i; the activation function g then produces the output a_i = g(in_i), which is sent along the output links.]
A MATHEMATICAL MODEL FOR A NEURON (2)
• A neural network is composed of a number of
units, connected by links.
• Each link has a numeric weight associated with it.
Weights are the primary means of long-term
storage in neural networks, and learning usually
takes place by updating the weights.
• Some of the units are connected to the external
environment, and can be designated as input or
output units.
A MATHEMATICAL MODEL FOR A NEURON (3)
• The weights are modified so as to try to bring the
network’s input/output behaviour more into line
with that of the environment providing the inputs.
• Each unit has a set of input links from other units,
a set of output links to other units, a current
activation level, and a means of computing the
activation level at the next step in time.
• Each unit does a local computation based on
inputs of its neighbors, but without the need for
any global control over the set of units as a whole.
SIMPLE COMPUTING ELEMENTS (1)
• Each unit performs a simple computation: it
receives signals from its input links and
computes a new activation level that it sends
along each of its output links.
• The computation is split into two components.
• First is a linear component, called the input
function, in_i, that computes the weighted sum of
the unit's input values:

  in_i = Σ_{j=0}^{n} W_{j,i} a_j
SIMPLE COMPUTING ELEMENTS (2)
• Second is a nonlinear component, called the activation
function, g, that transforms the weighted sum into the final
value that serves as the unit's output (activation value), a_i:

  a_i = g(in_i) = g(Σ_{j=0}^{n} W_{j,i} a_j)

• The computation of the activation level is based on the
values of each input signal received from a neighbor node,
and the weights on each input line.
• A bias weight W_0,i connected to a fixed input a_0 = -1 sets
the actual threshold for the unit.
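As a concrete illustration of the two components, here is a minimal Python sketch of a single unit (the function names and the use of a sigmoid for g are illustrative choices, not taken from the slides):

```python
import math

def unit_output(weights, inputs, g):
    """Compute a_i = g(in_i) for one unit.

    weights[0] is the bias weight W_0,i applied to the fixed input a_0 = -1;
    weights[j] (j >= 1) is W_j,i for the activation inputs[j-1].
    """
    activations = [-1.0] + list(inputs)                       # prepend the fixed bias input a_0 = -1
    in_i = sum(w * a for w, a in zip(weights, activations))   # linear input function
    return g(in_i)                                             # nonlinear activation function

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Example: a unit with bias weight 0.5 and two input weights of 1.0
print(unit_output([0.5, 1.0, 1.0], [0.0, 1.0], sigmoid))
```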
ACTIVATION FUNCTION (1)
• The activation function g is designed to meet two
requirements.
• First, the unit must be “active” (near +1) when the
“right” inputs are given, and “inactive” (near 0)
when the “wrong” inputs are given.
• Second, the activation needs to be nonlinear,
otherwise the entire neural network collapses into
a simple linear function and cannot represent
more complex functions.
• Different models are obtained by using different
mathematical functions for g.
ACTIVATION FUNCTION (2)
• THREE COMMON CHOICES
• The threshold function:
  g(in_i) = 1 if in_i ≥ 0, 0 if in_i < 0
  [Plot: a step from 0 to +1 at in_i = 0.]
• The sign function:
  g(in_i) = +1 if in_i ≥ 0, -1 if in_i < 0
  [Plot: a jump from -1 to +1 at in_i = 0.]
• The sigmoid function:
  g(in_i) = 1 / (1 + e^(-in_i))
  [Plot: a smooth S-shaped curve rising from 0 to +1.]
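The three choices are straightforward to write down in code; a short Python sketch (the behaviour exactly at in_i = 0 is an assumption, since the boundary is not legible on the slide):

```python
import math

def threshold(x):
    """Threshold (step) function: 1 for non-negative input, else 0."""
    return 1 if x >= 0 else 0

def sign(x):
    """Sign function: +1 for non-negative input, else -1."""
    return 1 if x >= 0 else -1

def sigmoid(x):
    """Sigmoid function: smooth, differentiable squashing of the input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```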
REPRESENTATION OF BOOLEAN FUNCTIONS BY THRESHOLD UNITS
• Individual units are able to represent basic
Boolean functions (logic gates).
[Figure: threshold units implementing AND (bias weight W0 = 1.5, input weights W1 = W2 = 1), OR (W0 = 0.5, W1 = W2 = 1), and NOT.]
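A quick way to check these weights is to plug them into a threshold unit. The sketch below uses the AND and OR weights from the slide; the NOT weights (bias -0.5, input weight -1) are a standard choice assumed here because the slide's values are not legible:

```python
def threshold_unit(bias_weight, weights, inputs):
    """Output 1 if the weighted sum (with fixed bias input a_0 = -1) is non-negative."""
    total = bias_weight * -1 + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

def AND(x1, x2):  # W0 = 1.5, W1 = W2 = 1 (weights from the slide)
    return threshold_unit(1.5, [1, 1], [x1, x2])

def OR(x1, x2):   # W0 = 0.5, W1 = W2 = 1 (weights from the slide)
    return threshold_unit(0.5, [1, 1], [x1, x2])

def NOT(x1):      # W0 = -0.5, W1 = -1 (standard choice; assumed, not from the slide)
    return threshold_unit(-0.5, [-1], [x1])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))   # prints the AND and OR truth tables
```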
THE FUNCTION REPRESENTED BY THE NETWORK
• With a fixed structure and a fixed activation function g, the
functions representable by a feed-forward network are
restricted to a specific parameterized structure.
• The weights chosen for the network determine which of
these functions is actually represented.
EXAMPLE
[Figure: a network with input units I1, I2, hidden units H3, H4, and output unit O5, connected by weights W13, W14, W23, W24, W35, W45.]
  a5 = g(W35 a3 + W45 a4)
     = g(W35 g(W13 a1 + W23 a2) + W45 g(W14 a1 + W24 a2)),
where g is the activation function,
and ai is the output of node i.
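The nested evaluation can be written directly in Python; the weights below are arbitrary illustrative values, and bias terms are omitted just as in the formula above:

```python
import math

def g(x):
    """Sigmoid activation (one possible choice for g)."""
    return 1.0 / (1.0 + math.exp(-x))

def network_output(a1, a2, W13, W23, W14, W24, W35, W45):
    """a5 = g(W35*a3 + W45*a4), where a3 and a4 are the hidden-unit outputs."""
    a3 = g(W13 * a1 + W23 * a2)    # hidden unit H3
    a4 = g(W14 * a1 + W24 * a2)    # hidden unit H4
    return g(W35 * a3 + W45 * a4)  # output unit O5

# Example with arbitrary weights, just to show the nested evaluation:
print(network_output(1.0, 0.0, 0.5, -0.3, 0.8, 0.1, 1.2, -0.7))
```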
NETWORK REPRESENTATION
• The links are determined by three
parameters
– a start node of the link
– an end node of the link
– a numeric weight of the link
• Network topology (structure of
links)
– a weight matrix (see the sketch below)
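A minimal sketch of this representation, assuming nodes are numbered and that a zero entry means "no link" (both assumptions, not stated on the slide):

```python
# Each link (start node, end node, weight) goes into a weight matrix W,
# where W[j][i] is the weight on the link from node j to node i.

def build_weight_matrix(links, n_nodes):
    W = [[0.0] * n_nodes for _ in range(n_nodes)]
    for start, end, weight in links:
        W[start][end] = weight
    return W

# The example network above: inputs 1, 2; hidden 3, 4; output 5 (node 0 unused).
links = [(1, 3, 0.5), (2, 3, -0.3), (1, 4, 0.8), (2, 4, 0.1), (3, 5, 1.2), (4, 5, -0.7)]
W = build_weight_matrix(links, n_nodes=6)
print(W[3][5])   # weight on the link from hidden node 3 to output node 5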
NETWORK STRUCTURES (1)
TWO MAIN CATEGORIES
• FEED FORWARD NETWORKS
– The network structure is acyclic.
– A feed-forward network represents a function of its
current input (it has no internal state other than the
weights themselves).
EXAMPLES [diagrams of feed-forward networks]
NETWORK STRUCTURES (2)
• RECURRENT NETWORKS
– The network structure is cyclic.
EXAMPLES [diagrams of recurrent networks]
NETWORK STRUCTURES (3)
– A recurrent network feeds its outputs back
into its own inputs.
– The activation levels of recurrent networks
form a dynamic system that may reach a
stable state or exhibit oscillations or even
chaotic behaviour.
– The response of the network to a given input
depends on its initial state, which may
depend on previous inputs. Hence, recurrent
networks can support short-term memory.
NETWORK STRUCTURES (4)
• Hopfield networks are the best-understood class of
recurrent networks.
• They use bidirectional connections with symmetric
weights Wi,j = Wj,i.
• All of the units are both input and output units.
• The activation function g is the sign function, and the
activation levels can only be ±1.
• A Hopfield network functions as an associative memory:
after training on a set of examples, a new stimulus will
cause the network to settle into an activation pattern
corresponding to the example in the training set that most
closely resembles the new stimulus.
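A minimal sketch of such an associative memory, assuming the standard Hebbian outer-product training rule and asynchronous sign-function updates (the training rule itself is not given on the slide):

```python
import random

def train_hopfield(patterns):
    """Build symmetric weights W[i][j] from ±1 patterns (Hebbian outer products)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, state, steps=100):
    """Repeatedly update randomly chosen units with the sign function."""
    n = len(state)
    state = list(state)
    for _ in range(steps):
        i = random.randrange(n)
        total = sum(W[i][j] * state[j] for j in range(n))
        state[i] = 1 if total >= 0 else -1
    return state

stored = [[1, 1, 1, -1, -1, -1], [-1, -1, 1, 1, 1, -1]]
W = train_hopfield(stored)
print(recall(W, [1, 1, -1, -1, -1, -1]))   # a corrupted version of the first pattern
```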
FEED-FORWARD NEURAL NETWORKS
• SINGLE-LAYER FEED-FORWARD NEURAL NETWORK
– A network with all the inputs connected directly to the
outputs is called a single layer network or a
perceptron network.
– Each output unit is independent of the others.
EXAMPLES [diagrams of a perceptron network and a single perceptron]
PERCEPTRON (1)
• With a threshold activation function, the perceptron
represents a Boolean function.
• A threshold perceptron returns 1 if and only if the
weighted sum of its inputs (including the bias) is positive:

  Σ_{j=0}^{n} W_j x_j > 0

• The equation Σ_{j=0}^{n} W_j x_j = 0 defines a hyperplane in the
input space, so the perceptron returns 1 if and only if the
input is on one side of that hyperplane.
• For this reason, the threshold perceptron is called a linear
separator.
PERCEPTRON (2)
• LINEAR SEPARABILITY IN PERCEPTRONS
[Figure: (a) the Boolean function AND and (b) the Boolean function OR, each plotted in the (I1, I2) plane together with a separating line.]
• Each function is represented as a two-dimensional
plot, based on the values of the two inputs.
• Black dots indicate a point in the input space where the
value of the function is 1, and white dots indicate a
point where the value is 0.
PERCEPTRON (3)
• In Figure (a) one possible separating “plane” (a line) is
defined by the equation
I1 + I2 = 1,5 or I1 = - I2 + 1,5
The region above the line, where the output is 1, is given
by
I1 + I2 – 1,5 > 0
• In Figure (b) one possible separating “plane” (a line) is
defined by the equation
I1 + I2 = 0,5 or I1 = - I2 + 0,5
The region above the line, where the output is 1, is given
by
I1 + I2 – 0,5 > 0
LINEAR CLASSIFICATION (1)
EXAMPLE
Problem: CLASSIFICATION OF AIRPLANES
• RULES
IF WEIGHT > 0,80 AND SPEED < 0,55 THEN BOMBER
IF WEIGHT < 0,90 AND SPEED > 0,25 THEN FIGHTER
• Suppose we have 10 examples of airplanes.
[Plot: the examples in the (SPEED, WEIGHT) plane — SPEED on the horizontal axis from 0 to 1, WEIGHT on the vertical axis from 0 to 3 — with the BOMBER and FIGHTER examples marked and a separating line between the two classes.]
LINEAR CLASSIFICATION (2)
• One possible separating line is defined by the equation
I2 = 1,5 I1 + 0,5
• The equation may be used for the decision-making rule
f(I1, I2) = 1,5 I1 – I2 + 0,5
IF f(I1, I2) ≥ 0 THEN FIGHTER
IF f(I1, I2) < 0 THEN BOMBER
• THE CORRESPONDING PERCEPTRON
[Figure: a perceptron with weight 1,5 on input I1, weight -1,0 on input I2, and weight 0,5 on a fixed input of 1.]
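The decision rule is a one-liner in code; in the sketch below I1 is taken to be speed and I2 to be weight, which is the pairing implied by the worked example that follows:

```python
def f(i1, i2):
    """Decision function from the slide: f(I1, I2) = 1.5*I1 - I2 + 0.5."""
    return 1.5 * i1 - i2 + 0.5

def classify(speed, weight):
    """I1 = speed, I2 = weight (pairing assumed from the worked example below)."""
    return "FIGHTER" if f(speed, weight) >= 0 else "BOMBER"

print(classify(0.4, 0.8))   # the example airplane used later: speed 0.4, weight 0.8
```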
LEARNING LINEARLY SEPARABLE FUNCTIONS (1)
• There is a perceptron algorithm that will
learn any linearly separable function,
given enough training examples.
• The idea behind most algorithms for neural
network learning is to adjust the weights of
the network to minimize some measure of
the error on the training set.
• The initial network has randomly assigned
weights, usually from the range [-0,5, 0,5].
LEARNING LINEARLY SEPARABLE FUNCTIONS (2)
• The network is then updated to try to make
it consistent with the examples. This is done
by making small adjustments in the
weights to reduce the difference between
the observed and predicted values
(optimization search in weight space).
• Typically, the updating process is iterative.
Each iteration involves updating all the
weights for all the examples.
THE WEIGHT UPDATING PROCESS
THE MEASURE OF ERROR
• For a single training example:
First case: Err = T – O,
where O is the output of the perceptron on the example, and
T is the true output value.
Second case: Err = (1/2)(T – O)^2
• If the error is positive, then O must be increased.
• If the error is negative, then O must be decreased.
• Each input unit contributes Wj · Ij to the total input, so if Ij
is positive, an increase in Wj will tend to increase O, and if
Ij is negative, an increase in Wj will tend to decrease O.
THE PERCEPTRON LEARNING RULE (1)
Wj = Wj + α · Ij · Err,
where α is the learning rate.
EXAMPLE
The initial network
[Figure: a perceptron with weight 0,2 on input I1, weight -0,5 on input I2, and weight 0,2 on a fixed input of 1.]
The training example
A fighter with speed = 0,4 and weight = 0,8.
The output
O = 0,2 · 0,4 + (-0,5) · 0,8 + 0,2 = 0,08 – 0,4 + 0,2 = -0,12
f(I1, I2) < 0, so the classification is bomber.
The error
Err = T – O = 1 – 0 = 1
THE PERCEPTRON LEARNING RULE (2)
The updated weights (α = 1)
W1 = 0,2 + 0,4 · 1 = 0,6
W2 = -0,5 + 0,8 · 1 = 0,3
Wbias = 0,2 + 1 · 1 = 1,2
[Figure: the updated perceptron, with weight 0,6 on input I1, weight 0,3 on input I2, and weight 1,2 on the fixed input of 1.]
The new output
O = 0,6 · 0,4 + 0,3 · 0,8 + 1,2 = 0,24 + 0,24 + 1,2 = 1,68
f(I1, I2) > 0, so the classification is fighter.
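The whole update step can be reproduced in a few lines of Python; the thresholded output used for the error term is an assumption consistent with Err = 1 – 0 above:

```python
def perceptron_update(weights, bias, inputs, target, alpha=1.0):
    """One step of the perceptron learning rule: W_j <- W_j + alpha * I_j * Err."""
    raw = sum(w * x for w, x in zip(weights, inputs)) + bias
    output = 1 if raw >= 0 else 0                     # thresholded output O
    err = target - output                             # Err = T - O
    weights = [w + alpha * x * err for w, x in zip(weights, inputs)]
    bias = bias + alpha * 1 * err                     # bias weight on a fixed input of 1
    return weights, bias

# The slide's example: initial weights 0.2 and -0.5, bias 0.2;
# a fighter (target 1) with speed 0.4 and weight 0.8.
w, b = perceptron_update([0.2, -0.5], 0.2, [0.4, 0.8], target=1)
print(w, b)   # expected: [0.6, 0.3] and 1.2 (up to floating-point rounding)
```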
MULTILAYERED FEEDFORWARD NETWORKS (1)
• Networks with one or more layers of hidden units
are called multilayer networks.
[Figure: input units (activations a_k) feed a layer of hidden units (activations a_j) through weights Wkj, and the hidden units feed the output units (activations a_i) through weights Wji.]
• The advantage of adding hidden layers is that it enlarges
the space of hypotheses that the network can represent.
MULTILAYERED FEEDFORWARD NETWORKS (2)
• With a single, sufficiently large hidden layer, it
is possible to represent any continuous function
of the inputs with arbitrary accuracy.
• Unfortunately, for any particular network
structure, it is harder to characterize exactly
which functions can be represented and which
ones cannot. The problem of choosing the right
number of hidden units in advance is still not
well understood.
MULTILAYERED FEEDFORWARD NETWORKS (3)
EXAMPLE
Boolean XOR function
[Plot: the four input points in the (I1, I2) plane; linear classification is not possible.]
The two-layer feed-forward network:
[Figure: hidden unit a1 receives weight -1 from I1, weight -1 from I2, and weight 1,5 from a fixed input of 1; hidden unit a2 receives weight -1 from I1, weight -1 from I2, and weight 0,5 from the fixed input; output unit a3 receives weight 1 from a1, weight -1 from a2, and weight -0,5 from the fixed input.]
MULTILAYERED FEEDFORWARD NETWORKS (4)
First case
Input=(0,0)
a1 = (1·1,5)+(0·(-1)) + (0·(-1)) = 1,5 Output=1
a2 = (1·0,5)+(0·(-1))+(0·(-1)) = 0,5 Output=1
a3 = (1·(-0,5))+(1·1)+(1·(-1)) = -0,5 Output=0
Second case
Input=(1,0) a1 = (1·1,5)+(1·(-1)) + (0·(-1)) = 0,5 Output=1
a2 = (1·0,5)+(1·(-1))+(0·(-1)) = -0,5 Output=0
a3 = (1·(-0,5))+(1·1)+(0·(-1)) = 0,5 Output=1
MULTILAYERED FEEDFORWARD NETWORKS (5)
Third case
Input=(0,1)
a1 = (1·1,5)+(0·(-1)) + (1·(-1)) = 0,5 Output=1
a2 = (1·0,5)+(0·(-1))+(1·(-1)) = -0,5 Output=0
a3 = (1·(-0,5))+(1·1)+(0·(-1)) = 0,5 Output=1
Fourth case
Input=(1,1) a1 = (1·1,5)+(1·(-1)) + (1·(-1)) = -0,5 Output=0
a2 = (1·0,5)+(1·(-1))+(1·(-1)) = -1,5 Output=0
a3 = (1·(-0,5))+(0·1)+(0·(-1)) = -0,5 Output=0
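The four cases can be checked mechanically; a short Python sketch of the same network:

```python
def step(x):
    """Threshold unit output: 1 if the weighted sum is non-negative, else 0."""
    return 1 if x >= 0 else 0

def xor_network(i1, i2):
    a1 = step(1 * 1.5 + i1 * -1 + i2 * -1)    # hidden unit a1 (bias weight 1.5)
    a2 = step(1 * 0.5 + i1 * -1 + i2 * -1)    # hidden unit a2 (bias weight 0.5)
    a3 = step(1 * -0.5 + a1 * 1 + a2 * -1)    # output unit a3 (bias weight -0.5)
    return a3

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, xor_network(i1, i2))    # prints the XOR truth table
```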
CLASSIFICATION EXAMPLE
Problem: Any pair of persons must be classified into
one of two classes: brothers and friends. There are 3
brothers in each family. Persons from different
families are friends.
[Figure: a network that takes a pair of persons as input (John, Paul, and George are brothers in one family; Charles, Dick, and Rodger are brothers in the other) and activates either the BROTHERS or the FRIENDS output unit.]
LEARNING IN MULTILAYER NETWORKS
• Learning algorithms for multilayer networks are similar to
the perceptron learning algorithm.
• Inputs are presented to the network, and if the network
computes an output vector that matches the target, nothing
is done.
• If there is an error (a difference between the output and
target), then the weights are adjusted to reduce this error.
• The trick is to assess the blame for an error and divide it
among the contributing weights, because in multilayer
networks there are many weights connecting each input to
each output, and each of these weights contributes to
more than one output.
BACK-PROPAGATION LEARNING (1)
• The back-propagation algorithm is a sensible approach to
dividing the contribution of each weight.
• At the output layer, the weight update rule uses the activation
of the hidden unit aj instead of the input value, and the
rule contains a term for the gradient of the activation
function:
Wji = Wji + α · aj · Erri · g'(ini)
where Erri is the error (Ti – Oi) at the output node, and
g'(ini) is the derivative of the activation function g (for this
reason, of the three activation functions above, only the
differentiable sigmoid can be used in a multilayer network).
BACK-PROPAGATION LEARNING (2)
• A new error term Δi is defined, which for output
nodes is Δi = Erri · g'(ini).
• The update rule now is the following:
Wji = Wji + α · aj · Δi
• The propagation rule for the Δ values is the
following:
Δj = g'(inj) · Σi Wji Δi
• The weight update rule for the weights between the inputs
and the hidden layer is the following:
Wkj = Wkj + α · Ik · Δj
SUMMARIZED BACK-PROPAGATION ALGORITHM
• Compute the Δ values for the output units
using the observed error.
• Starting with the output layer, repeat the
following for each layer in the network,
until the earliest hidden layer is reached:
– Propagate the Δ values back to the previous
layer.
– Update the weights between the two layers.
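A compact Python sketch of these update rules for a network with one hidden layer, assuming sigmoid units and omitting bias weights for brevity (an illustration of the rules above, not the professor's code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(W_in_hidden, W_hidden_out, inputs, targets, alpha=0.5):
    """One back-propagation step for a network with a single hidden layer.

    W_in_hidden[k][j]  : weight from input k to hidden unit j
    W_hidden_out[j][i] : weight from hidden unit j to output unit i
    """
    # Forward pass
    in_j = [sum(W_in_hidden[k][j] * inputs[k] for k in range(len(inputs)))
            for j in range(len(W_hidden_out))]
    a_j = [sigmoid(v) for v in in_j]
    in_i = [sum(W_hidden_out[j][i] * a_j[j] for j in range(len(a_j)))
            for i in range(len(targets))]
    a_i = [sigmoid(v) for v in in_i]

    # Delta values: output layer first, then propagated back to the hidden layer
    g_prime = lambda a: a * (1.0 - a)                   # sigmoid derivative via its output
    delta_i = [(targets[i] - a_i[i]) * g_prime(a_i[i]) for i in range(len(a_i))]
    delta_j = [g_prime(a_j[j]) * sum(W_hidden_out[j][i] * delta_i[i]
                                     for i in range(len(delta_i)))
               for j in range(len(a_j))]

    # Weight updates: Wji += alpha * a_j * delta_i, Wkj += alpha * I_k * delta_j
    for j in range(len(a_j)):
        for i in range(len(delta_i)):
            W_hidden_out[j][i] += alpha * a_j[j] * delta_i[i]
    for k in range(len(inputs)):
        for j in range(len(delta_j)):
            W_in_hidden[k][j] += alpha * inputs[k] * delta_j[j]
    return a_i

# Example: a 2-2-1 network with small fixed initial weights
W1 = [[0.1, -0.2], [0.3, 0.4]]   # 2 inputs x 2 hidden units
W2 = [[0.2], [-0.1]]             # 2 hidden units x 1 output
print(backprop_step(W1, W2, inputs=[1.0, 0.0], targets=[1.0]))
```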
BUILDING OF NEURAL NETWORKS
• To build a neural network to perform some task, one
needs:
– To decide how many units are to be used.
– To decide what kind of units are appropriate.
– To decide how the units are connected to form a
network.
– To initialize the weights of the network.
– To decide how to encode the examples in terms of
inputs and outputs of the network.
– To train the weights using a learning algorithm
applied to a set of training examples for the task.