Transcript ppt

From Biological to
Artificial Neural Networks
NPCDS/MITACS Spring School, Montreal,
May 23-27, 2006
Helmut Kroger
Laval University
Outline
1. From biology to artificial NNs
2. Perceptron – supervised learning
3. Hopfield model – associative memory
4. Kohonen map – self organization
References
1. Hertz J., Krogh A., and Palmer R.G. Introduction to the Theory of Neural Computation.
2. Duda R.O., Hart P.E., and Stork D.G. Pattern Classification. John Wiley, 2001.
3. Haykin S. (1999). Neural Networks. Prentice Hall International.
4. Bishop C. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
5. Ripley B.D. Pattern Recognition and Neural Networks. Cambridge University Press, Jan 1996.
6. Mueller B. and Reinhardt J. Neural Networks: An Introduction. Springer-Verlag, Berlin, 1991.
7. McCulloch W.S. and Pitts W. (1943). "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5, 115-137.
Use of NNs: Neural networks are used for:

Applications              Science
Character recognition     Neuroscience
Optimization              Physics, Mathematics
Data mining               Computer science
…                         …
Biological neural networks
Nerve cells are called neurons. Many different types
exist. Neurons are extremely complex.
• Approx. 10^11 neurons in the brain. Each neuron has about 10^3 connections.
Neural communication
Neurons transport information via electrical action potentials. At the synapse, transmission is mediated by chemical messengers (neurotransmitters).
How two neurons communicate
Biology of neurons:
• Single neurons are highly complex electrochemical devices.
• Many forms of inter-neuron communication are now known, acting over many different spatial and temporal scales:
  - local
  - gaseous
  - volume signalling, etc.
The neuron is a computer:
[Figure: neuron schematic with inputs, the axon hillock, and the output.]
From the biology of the brain to information processing
• (1) Information is distributed.
• (2) The brain works in a deterministic mode as well as in a stochastic mode (generation of nerve signals, synaptic transmission).
• (3) Associative memory: retrieval not by checking bit by bit, but by gradual reproduction of the whole.
• (4) Architecture of the brain:
  - optimizes information transmission
  - is stable against errors
  - respects physical constraints: oxygen, blood, energy, cooling
  - Small World Architecture
• (5) 1/f frequency scaling in EEG: fractals, self-similarity. Dynamical origin: model of self-organized criticality (sand-pile analogue). Neural avalanches – do they transmit information? Does the brain work at some critical point?
• (6) How is neural connectivity generated? Nerve growth: a random process, guided by chemical markers, enhanced by stochastic resonance.
Layered structure of neocortex
Organization of neo-cortex
Biological parameters of neocortex
Artificial Neural Networks
A network with interactions: An attempt to mimic the
brain:
• Unit elements are artificial neurons (linear or nonlinear input-output units).
• Communications are encoded by weights, a measure of how strongly neurons affect each other.
• Architectures can be feed-forward, feedback, or recurrent.
Firing of a group of neurons: biology vs model
[Figure: model neuron with inputs x_1, …, x_n and synaptic strengths w_1, …, w_n.]
y = f( Σ_j w_j x_j + b )
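As a minimal sketch of this model neuron (Python with NumPy; the function name and the choice of tanh as the activation f are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def neuron_output(x, w, b, f=np.tanh):
    """Model neuron: weighted sum of the inputs plus a bias, passed through an activation f."""
    return f(np.dot(w, x) + b)

# Example: three inputs x with synaptic strengths w and bias b
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron_output(x, w, b=0.2))
```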
1. Multi-layer feed-forward network
Biological motivation for the multi-layer feed-forward network:
Modeled after the visual cortex, which is multi-layered (6 layers) and dominantly feed-forward. There is strong lateral inhibition: neurons in the same layer don't talk to each other.
High connectivity between neurons (10^4 connections per neuron) provides the basis for massively parallel computing.
High redundancy against errors.
The principal building blocks:
• Input vector x_k
• Weight matrix w_ik
• Activation function f
• Output vector y_i
• Dynamical update rule:
  y_i(t+1) = f( Σ_j w_ij x_j(t) − θ_i )
• Goal: find weights and an activation function such that for a given input the output is close to the desired output.
• (Training supervised by a teacher.)
Examples of activation functions
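The slide itself shows plots; as a hedged illustration, here are three activation functions commonly used in this setting (step/Heaviside, logistic sigmoid, tanh), written as a small Python sketch:

```python
import numpy as np

def heaviside(a):
    """Step (Heaviside) activation: -1 below threshold 0, +1 at or above."""
    return np.where(a >= 0, 1.0, -1.0)

def sigmoid(a):
    """Logistic sigmoid: smooth, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Hyperbolic tangent: smooth, bounded in (-1, 1)."""
    return np.tanh(a)

a = np.linspace(-3, 3, 7)
print(heaviside(a), sigmoid(a), tanh(a), sep="\n")
```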
The Perceptron (1962): a simple feed-forward network: 1 input layer, 1 output layer.
• The simplest possible artificial neural network able to do a calculation consists of two inputs and one output. It can be used to classify patterns or to perform basic logical operations such as AND, OR and NOT.
[Figure: perceptron computing logical AND. Inputs x and y enter with weights 1 and 1, plus a bias term of -1.5; the sum x + y - 1.5 is thresholded to give the output Θ(x + y - 1.5).]

Truth Table for Logical AND

x   y   x AND y
0   0   0
0   1   0
1   0   0
1   1   1
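A minimal check of this AND unit (a Python sketch; the helper name perceptron_and is mine, not from the slides): the weighted sum x + y - 1.5 is positive only when both inputs are 1.

```python
def perceptron_and(x, y):
    """Perceptron with weights (1, 1) and threshold 1.5 computing logical AND."""
    s = 1 * x + 1 * y - 1.5          # weighted sum minus threshold
    return 1 if s > 0 else 0          # step activation

for x in (0, 1):
    for y in (0, 1):
        print(x, y, perceptron_and(x, y))
```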
Perceptron learning algorithm
Training any unit consists of adjusting the weight vector and threshold so that the desired classification is performed.
Algorithm (inspired by Hebb's rule): For each training vector x, evaluate the output y and take the difference between the desired output t and the current output y. Then update
  w_i ← w_i + Δw_i,   Δw_i = η (t − y) x_i,
where η is a small positive learning rate. Repeat until y = t.

[Figure: perceptron unit: the inputs are multiplied by the weights and summed (Σ x_i w_i); the sum is thresholded to give the output.]

Rule:
If the sum of the weighted inputs exceeds a threshold, output 1, else output -1:
  output = +1 if Σ_i input_i · weight_i > threshold
  output = −1 if Σ_i input_i · weight_i < threshold
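A sketch of this training loop (Python/NumPy; the learning rate eta, the epoch limit, and folding the threshold in as a bias weight on a constant input are my assumptions, not taken from the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron learning: w_i <- w_i + eta * (t - y) * x_i until all outputs match the targets."""
    X = np.hstack([X, np.ones((len(X), 1))])   # constant input 1 -> last weight acts as the bias/threshold
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            y = 1 if np.dot(w, x) > 0 else -1  # step activation
            if y != target:
                w += eta * (target - y) * x    # update rule
                errors += 1
        if errors == 0:                        # all patterns classified correctly
            break
    return w

# Logical AND with targets in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))
```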
Interpretation of weights
The Heaviside function has its threshold at a = 0. The decision boundary is therefore given by:
  a = w·x + w0 = w0 + w1 x1 + w2 x2 = 0
Thus: x2 = −(w0 + w1 x1) / w2.

Look-up table: OR function, XOR function
[Figure: example weights/thresholds and look-up tables for the OR and XOR functions.]
Linear discriminant function:
separation of 2 classes
A linear discriminant function is a mapping which partitions feature space using a linear function (a straight line, or hyperplane).
D = 2 dimensions: the decision boundary is a straight line.

Simple form of classifier: "separate two classes using a straight line in feature space."
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs Feature 2); a linear decision boundary separates Decision Region 1 from Decision Region 2.]
[Figure: single-layer network with inputs x_1 … x_d plus a constant input −1 for the bias, outputs y_1 … y_K, and weights w_11 … w_Kd. The weight to output j from input k is w_jk; w_j0 is the bias weight.]

Separation of K classes
  y_j = g( Σ_k w_jk x_k + w_j0 )
• The perceptron can be used to discriminate between K classes by having K output nodes:
• x is in class C_j if y_j(x) ≥ y_k(x) for all k.
• The resulting decision boundaries divide the feature space into convex decision regions.
[Figure: feature space divided into convex decision regions C_1, C_2, C_3.]
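A small sketch of this decision rule (Python/NumPy; the weight values are arbitrary illustrations, not from the slides): with one output node per class, x is assigned to the class whose output y_j is largest.

```python
import numpy as np

def classify(x, W, w0):
    """Multi-class linear discriminant: pick the class with the largest output y_j."""
    y = W @ x + w0   # y_j = sum_k w_jk x_k + w_j0 (a monotone g is omitted: it does not change the argmax)
    return int(np.argmax(y))

# Example with K = 3 classes and d = 2 features (weights chosen arbitrarily for illustration)
W = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-0.6, 0.6]])
w0 = np.array([0.0, -0.2, 0.1])
print(classify(np.array([2.0, 1.0]), W, w0))
```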
Network training via gradient descent
A set of training data from known classes is used in conjunction with an error function E(w) (e.g. the squared difference between target t and response y), which must be minimized.
Then:
  w_new = w_old − η ∇E(w)
or, component-wise,
  w_jk(τ+1) = w_jk(τ) − η ∂E/∂w_jk
where ∇E(w) is a vector representing the gradient and η is the learning rate (small, positive).
1. Move downhill in the direction −∇E(w) (steepest descent, since ∇E(w) is the direction of steepest increase).
2. Termination is controlled by η.
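A minimal sketch of this update rule (Python/NumPy; the choice of a linear unit with squared error E = ½ Σ (t − y)², the toy data, and the learning rate are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent_step(w, X, t, eta=0.01):
    """One step of w_new = w_old - eta * grad E(w) for E = 0.5 * sum (t - y)^2 with y = X w."""
    y = X @ w
    grad = -X.T @ (t - y)        # dE/dw for the squared-error function
    return w - eta * grad

# Toy data: 4 samples, 2 weights
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
t = np.array([1.0, -1.0, 0.0, 0.0])
w = np.zeros(2)
for _ in range(200):
    w = gradient_descent_step(w, X, t)
print(w)
```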
Moving along the error function landscape
• Equivalent to moving up and down hills on the error surface.
• Problem: when to stop?
• Local minima:
• There may be multiple local minima. Note: for the single-layer perceptron, E(w) has only a single global minimum, so this is no problem!
• Gradient descent goes to the closest local minimum.
• General solution: random restarts from multiple places in weight space (or simulated annealing).
Training multi-layer NNs via the back-propagation algorithm
Back-Propagation
Back propagation cont’d
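The back-propagation slides above are figures; the following is only a hedged sketch of the idea (Python/NumPy, one hidden layer, sigmoid units, squared error, no biases: all of these are my choices, not taken from the slides). The output error is propagated backwards to obtain the gradients used in the gradient-descent update.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One back-propagation update for a 2-layer network with squared error."""
    # Forward pass
    h = sigmoid(W1 @ x)                    # hidden activations
    y = sigmoid(W2 @ h)                    # output activations
    # Backward pass: error signals (deltas) for output and hidden layers
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Gradient-descent updates of both weight matrices
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return W1, W2

# Toy example: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
```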
Perceptron as Classifier
For d-dimensional data the perceptron consists of d weights, a bias, and a thresholding activation function. For 2D data we have:

[Figure: inputs x1 and x2 with weights w1 and w2, plus a constant input 1 with weight w0 (view the bias as another weight from an input which is constantly on).]

1. Weighted sum of the inputs:
   a = w0 + w1 x1 + w2 x2
2. Pass through the Heaviside (threshold) function:
   y = g(a),   g(a) = −1 if a < 0,   g(a) = +1 if a ≥ 0
The output in {−1, +1} is the class decision.

If we group the weights as a vector w, the net output y is therefore given by:
  y = g(w · x + w0)
Failure of the Perceptron
[Figure: "successful" vs "unsuccessful" plotted against few vs many hours in the gym per week; footballers and academics form the two groups.]
…despite the simplicity of their relationship: Academics = Successful XOR Gym.
In this example, a perceptron would not be able to discriminate between the footballers and the academics (XOR cannot be represented by a single threshold sigma node).
This failure caused the majority of researchers to walk away.
Classification: Decision Trees
Color = dark:
  #nuclei = 1:
    #tails = 1 → healthy
    #tails = 2 → cancerous
  #nuclei = 2 → cancerous
Color = light:
  #nuclei = 1 → healthy
  #nuclei = 2:
    #tails = 1 → healthy
    #tails = 2 → cancerous
Example of Classification from Neural Networks:
"Which factors determine if a cell is cancerous?"
[Figure: cell features (Color = dark, # nuclei = 1, …, # tails = 2) fed into a network whose output labels the cell Healthy or Cancerous.]
Problem: Over-fitting
Feed-forward NN: letter recognition from writing on a PC screen. The letter scanner transforms the written character into ASCII code.
[Figure: a letter on a pixel grid is encoded as a binary (0/1) input vector for the network.]
SWN topology: network of 5 layers by 8 neurons. Simulation with 5 neurons per layer and 8 layers. The NN was trained with 40 patterns for 50 different runs.
Learning 40 patterns: the regular network almost fails to learn. With a few short-cuts the network learns well. The SWN architecture is better than the regular and the random architecture.
2. Hopfield NN:
Model of associative memory
Example: Restoring corrupted memory patterns
[Figure: the original pattern "T"; a version with half of the pattern corrupted; a version with 20% of "T" corrupted.]
Use in search engines (e.g. AltaVista): search from incomplete or corrupted items.
Approximate solution of combinatorial optimization problems: the Travelling Salesman Problem.
Associative memory problem:
• Store a set of patterns x.
• When presented with a pattern z, the network finds the pattern x closest to z.
Hopfield Model
• Dynamical variables: neuron i (i = 1, …, N) is represented by the variable S_i(t).
• At any time t the network is determined by the state vector { S_i(t), i = 1, …, N }.
• Dynamical update rule:
  S_i(t+1) = sgn( Σ_j w_ij S_j(t) − θ_i )
• The system evolves from a given state to some stable network state (attractor state, fixed point).
Hopfield Nets
1. Every node is connected to every other node.
2. Weights are symmetric (w_ij = w_ji) and w_ii = 0.
3. The flow of information is not unidirectional.
4. The state of the system is given by the node outputs.
Energy function of Hopfield net:
multidimensional landscape
H = −(1/2) Σ_{i,j} w_ij S_i S_j
Training of Hopfield nets: inspired by the biological Hebb's rule (unsupervised)
1. Present the components of the patterns to be stored at the outputs of the corresponding nodes of the net:
   w_ij = Σ_p v_i^p v_j^p
2. If two nodes have the same value, make a small positive increment to the internode weight. If they have opposite values, make a small negative decrement.
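A compact sketch of storage and recall (Python/NumPy; the zero diagonal, zero thresholds, and the asynchronous update order are my assumptions, consistent with the conditions and the update rule above):

```python
import numpy as np

def store(patterns):
    """Hebbian storage: w_ij = sum_p v_i^p v_j^p, with w_ii = 0."""
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)
    return W

def recall(W, s, sweeps=5):
    """Asynchronous updates S_i <- sgn(sum_j w_ij S_j), with thresholds theta_i = 0."""
    s = s.copy()
    for _ in range(sweeps):
        for i in np.random.permutation(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Store two random +/-1 patterns of length 50 and recall the first from a corrupted copy
rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(2, 50))
W = store(patterns)
noisy = patterns[0] * rng.choice([1, -1], size=50, p=[0.8, 0.2])   # flip about 20% of the bits
print(np.array_equal(recall(W, noisy), patterns[0]))
```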
Distance of NN state from the 1st training pattern
Phase diagram of attractor network
Load: α = (number of patterns) / (number of connections per node)
Character recognition
For a 256 × 256 character we have 65,536 pixels. One input for each pixel is not efficient. Reasons:
1. Poor generalisation: the data set would have to be vast to properly constrain all the parameters.
2. It takes a long time to train.
3. Answer: use averages over N² pixels for dimensionality reduction; each average could be a feature.
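A sketch of this dimensionality reduction (Python/NumPy; the block size N = 16 and the reshape trick are illustrative choices, not from the slide): average each non-overlapping N × N block of pixels into a single feature.

```python
import numpy as np

def block_average(image, N):
    """Reduce an image by averaging non-overlapping N x N pixel blocks (one feature per block)."""
    H, W = image.shape
    return image.reshape(H // N, N, W // N, N).mean(axis=(1, 3))

# 256 x 256 character reduced with N = 16 -> a 16 x 16 = 256-dimensional feature vector
img = np.random.rand(256, 256)
features = block_average(img, 16).ravel()
print(features.shape)   # (256,)
```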
Restoration of memory: regular network
Restoration of memory: SWN
3. Self-organization: Topographic
maps
Extend the ideas of competitive learning to incorporate the neighborhood around inputs and neurons.
We want a nonlinear transformation of input pattern space onto output feature space which preserves the neighbourhood relationships between the inputs: a feature map where nearby neurons respond to similar inputs.
E.g. place cells, orientation columns, somatosensory cells, etc.
The idea is that neurons selectively tune to particular input patterns in such a way that the neurons become ordered with respect to each other, so that a meaningful coordinate system for different input features is created.
Topographic map: spatial locations are indicative of the intrinsic statistical features of the input patterns, i.e. close in the input => close in the output.
E.g.: when the black input in the lower layer is active, the black neuron in the upper layer is the winner; similarly, when the red input in the lower layer is active, we want the upper red neuron to be the winner.
Example: Activity-based self-organization (von der Malsburg, 1973)
Incorporation of competitive and cooperative mechanisms to generate feature maps using unsupervised learning networks.
Biologically motivated: how can activity-based learning using highly interconnected circuits lead to an orderly mapping of visual stimulus space onto the cortical surface? (visual-tectum map)
[Figure: cortical units mapped onto visual space.]
A 2-layer network: each cortical unit is fully connected to visual space via Hebbian units.
Interconnections of cortical units are described by a 'Mexican-hat' (Gabor) function: short-range excitation and long-range inhibition.
Kohonen NN
Example of grid: mapping 2D to 2D
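A minimal sketch of Kohonen's algorithm for such a 2D-to-2D mapping (Python/NumPy; the Gaussian neighborhood, grid size, and decay schedules are my assumptions, not taken from the slides): for each input, find the winning neuron and pull it and its grid neighbors towards the input.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, eta0=0.5, sigma0=3.0):
    """Kohonen self-organizing map: 2D inputs mapped onto a 2D grid of neurons."""
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])  # grid positions
    W = np.random.rand(rows * cols, data.shape[1])                         # weight vectors
    for epoch in range(epochs):
        eta = eta0 * np.exp(-epoch / epochs)          # decaying learning rate
        sigma = sigma0 * np.exp(-epoch / epochs)      # shrinking neighborhood
        for x in np.random.permutation(data):
            winner = np.argmin(np.sum((W - x) ** 2, axis=1))               # best-matching unit
            d2 = np.sum((coords - coords[winner]) ** 2, axis=1)            # grid distance to winner
            h = np.exp(-d2 / (2 * sigma ** 2))                             # neighborhood function
            W += eta * h[:, None] * (x - W)                                # pull neighbors toward x
    return W

# Map uniform 2D data onto a 10 x 10 grid
W = train_som(np.random.rand(500, 2))
print(W.shape)   # (100, 2)
```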
Knowledge vs topology: SWN during most of the organization phase.
Problem
The SOM tends to overrepresent regions of low density and underrepresent regions of high density:
  W_x ∝ p(x)^(2/3)
where p(x) is the distribution density of the input data.
Also, there can be problems when there is no natural 2D ordering of the data.