Hebbian Model Learning


Cognitive Neuroscience
and Embodied Intelligence
Hebbian Model Learning
Based on courses taught by
Prof. Randall O'Reilly, University of Colorado,
Prof. Włodzisław Duch, Uniwersytet Mikołaja Kopernika
and http://wikipedia.org/
http://grey.colorado.edu/CompCogNeuro/index.php/CECN_CU_Boulder_OReilly
http://grey.colorado.edu/CompCogNeuro/index.php/Main_Page
Janusz A. Starzyk
EE141
So far
Elements: neurons, ions, channels, membranes, conductivity,
impulse generation...
Neural networks: signal transformation, filtering specific information,
amplification, contrast, network stability, winner takes most (WTM),
noise, network attractors...
Many specific mechanisms, e.g. mechanoelectrical transduction of sensory signals:
hair cells in the ear open ion channels with
the help of proteins that function like springs
attached to the ion channels, converting
mechanical vibrations into electrical
impulses.
How do network configurations that do interesting things form?
Learning is necessary!
Learning: types
1. How should an ideal learning system look?
2. How does a human being learn?
Detectors (neurons) can change
local parameters, but we want to
achieve a change in the
functioning of the entire
information processing network.
We will consider two types of
learning, requiring different
mechanisms:
- Learning an internal model of the environment (spontaneous).
- Learning a task set for the network (supervised).
- A combination of both.
Model learning
Internal representations of patterns appearing in incoming
signals in the environment of a given neural group.
Discovering correlations between signals.
[Figure: positive correlation]
Elements of images, movements, animal behavior or emotions: we can
correlate everything by creating a behavioral model.
Only strong correlations are relevant; there are too many weak ones
and they can be coincidental.
Example: hebb_correl.proj, in Chapter 4
Simulation
Select:
hebb_correl.proj, in Chapter 4
Click on r.wt in the network window; after
clicking on the hidden neuron we see that
the weights of the entire network are initialized
to 0.5.
Click on act in the network window, then on
run in the control window.
In effect we get binary weights.
lrate = ε = 0.005
p_right = probability of the first event
Defaults changes p_right from 1 to 0.7
Biological foundations: LTP, LTD
Long-Term Potentiation (LTP) was discovered in 1966, first in the
hippocampus, then in the cortex.
Stimulating a neuron with a current
of ~100 Hz for 1 second increases
synaptic efficiency by 50-100%;
this is a long-term effect.
Opposite effect:
LTD, Long-Term Depression.
The most common form of LTP/LTD is related to NMDA receptors.
Activity of NMDA channels requires presynaptic as well as postsynaptic
activity, and so is in compliance with the rule introduced by Donald
Hebb in 1949, tersely summarized thus:
Neurons that fire together wire together.
Neurons showing simultaneous activity strengthen their bonds.
NMDA receptors
1. Mg++ ions block NMDA channels.
An increase in postsynaptic potential is
necessary to remove them and enable
interactions with glutamate.
2. Presynaptic activity is necessary to
release the glutamate, which opens
NMDA channels.
3. Ca++ ions enter through these channels,
triggering a series of chemical
reactions that are not yet completely
understood.
The effect is nonlinear: small amounts of
Ca++ give LTD and large amounts give
LTP.
Many other processes play a role in LTP.
More detailed information on LTP/LTD.
Hebbian Correlation
From a theoretical point of view the biological mechanism of LTP is not
very relevant; we test only the simplest versions.
Simple Hebb's rule: $\Delta w_{ij} = \epsilon\, a_i a_j$
Change in weights is proportional to pre- and post-synaptic activity.
[Figure: input $x_i$, output $y_j = xw$]
Weights increase for neurons with strongly correlated activity and don't
change for neurons whose activity doesn't show a correlation.
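The simple rule can be tried out on a single linear receiving unit. The sketch below is only an illustration under assumed settings: the three-input layout, the +1/-1 input coding, the starting weights and the learning rate are hypothetical choices, not values from the simulator.

```python
import random

def hebb_step(w, x, lrate):
    """Apply one simple Hebbian update to the weights of a linear unit."""
    y = sum(xi * wi for xi, wi in zip(x, w))               # y_j = sum_k x_k w_kj
    return [wi + lrate * xi * y for xi, wi in zip(x, w)]   # dw_ij = lrate * x_i * y_j

rng = random.Random(0)           # fixed seed so the run is repeatable
w = [0.1, 0.1, 0.1]              # hypothetical starting weights
for _ in range(200):
    s = rng.choice([-1.0, 1.0])  # shared source: inputs 0 and 1 always agree
    n = rng.choice([-1.0, 1.0])  # independent source driving input 2
    w = hebb_step(w, [s, s, n], lrate=0.01)

print(w)  # weights 0 and 1 stay equal and end up larger in magnitude than weight 2
```

The two correlated inputs reinforce each other on every step, while the third weight is pushed up and down at random, so it has no systematic direction of growth.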
Hebb - normalization
Simple Hebb's rule:
$\Delta w_{ij} = \epsilon\, x_i y_j$
leads to an infinite increase in weights.
This can be avoided in many ways;
a normalization of weights is often employed:
$\Delta w_{ij} = \epsilon\, (x_i - w_{ij})\, y_j$
This has a biological justification:
- when x and y are large we have a strong LTP, much Ca++
- when y is large but x is small we have LTD, some Ca++
- when y is small nothing happens because Mg++ ions block the NMDA channels
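The three cases can be checked numerically. This is a minimal sketch, assuming the receiving activity y is simply clamped to a fixed value instead of being computed from the inputs; the function names and the numbers are illustrative only.

```python
def simple_hebb(w, x, y, lrate):
    """dw_i = lrate * x_i * y  (no normalization)."""
    return [wi + lrate * xi * y for xi, wi in zip(x, w)]

def normalized_hebb(w, x, y, lrate):
    """dw_i = lrate * (x_i - w_i) * y  (weights drift toward x while y is active)."""
    return [wi + lrate * (xi - wi) * y for xi, wi in zip(x, w)]

x = [1.0, 0.0]            # input 0 active, input 1 silent
w_simple = [0.5, 0.5]
w_norm = [0.5, 0.5]
for _ in range(2000):     # receiving unit held active (y = 1)
    w_simple = simple_hebb(w_simple, x, y=1.0, lrate=0.01)
    w_norm = normalized_hebb(w_norm, x, y=1.0, lrate=0.01)

# With y = 0 the normalized rule leaves the weights untouched
w_silent = normalized_hebb([0.5, 0.5], x, y=0.0, lrate=0.01)

print(w_simple)  # first weight keeps growing with the number of steps
print(w_norm)    # first weight approaches 1 (LTP), second approaches 0 (LTD)
print(w_silent)  # unchanged
```

Under the normalized rule each weight is pulled toward the value of its input, so it cannot leave the range spanned by the inputs.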

Model learning
Hebb's mechanism allows for learning correlations.
What happens if we add more postsynaptic neurons?
They will learn the same correlations!
If we use kWTA then output units will
compete with each other.
Learning = survival of the fittest (Darwin's mechanism) + specialization.
Learning based on self-organization:
- Inhibition of kWTA: only the strongest units remain active.
- Hebbian learning: the winners become even stronger.
- Result: different neurons react to different signal properties.
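A toy version of this loop, as a sketch only. The kWTA here is a simplification (the k units with the largest net input are set to 1, the rest to 0), and the two patterns and the small hand-chosen initial bias in the weights are hypothetical.

```python
def kwta(net, k):
    """Simplified kWTA: the k units with the largest net input get activity 1."""
    winners = sorted(range(len(net)), key=lambda j: net[j], reverse=True)[:k]
    return [1.0 if j in winners else 0.0 for j in range(len(net))]

def train(weights, patterns, k=1, lrate=0.1, epochs=100):
    for _ in range(epochs):
        for x in patterns:
            net = [sum(xi * wi for xi, wi in zip(x, w)) for w in weights]
            act = kwta(net, k)
            for j, y in enumerate(act):
                # normalized Hebbian update; only active (winning) units learn
                weights[j] = [wi + lrate * (xi - wi) * y
                              for xi, wi in zip(x, weights[j])]
    return weights

patterns = [[1.0, 1.0, 0.0, 0.0],   # pattern A
            [0.0, 0.0, 1.0, 1.0]]   # pattern B
weights = [[0.6, 0.6, 0.4, 0.4],    # unit 0 starts slightly biased toward A
           [0.4, 0.4, 0.6, 0.6]]    # unit 1 starts slightly biased toward B
weights = train(weights, patterns)

for row in weights:
    print([round(v, 3) for v in row])
```

Each unit wins for the pattern it was slightly biased toward, and the Hebbian update then sharpens that preference until the two units respond to different patterns.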
What do we want from model learning?
The environment supplies a lot of information, but the signals are
variable and of poor quality; the identification of objects and
relationships between them isn't possible without extensive knowledge
of what can be expected.
We need a model of the state of the environment, biased toward recognition and
correct behavior; correlations are a necessary (but not sufficient)
condition of causal relationships.
What do we want from model learning?
Expectations based on previous experience can
ease adaptation to a new situation.
Example: it's easier to learn a new video game
if you've already played other video games and
when the designers keep similar game elements.
This experience (bias) can also be a factor limiting recognition when we
stubbornly look for old solutions in the new game.
We assume that in the course of evolution nature worked out proven
mechanisms of getting to know the world.
- problem: these mechanisms aren't obvious or easy to identify.
Nativists (psychologists who stress genetic influences on behavior)
assume that people are born with specific knowledge about the world
- this isn't genetically justified
In contrast, a genetic record of connective structures is possible
and can constitute genetically encoded knowledge (for example how to
breathe or nurse).
What do we want from model learning?
It's more pragmatic to consider the necessity of
introducing initial knowledge through the model
designer.
The designer must substitute the mechanism of
property selection with his own model.
This is why many people avoid the introduction
of preliminary assumptions (biases), preferring
general machine learning mechanisms.
This leads to a discrepancy between the model and reality,
also called the bias-variance dilemma:
- a precise model hinders generalization
- an oversimplified model prevents correct representation
A simple (parsimonious) model was preferred in the 14th
century by William of Occam, leading to Occam's razor,
which cuts in favor of the simplest explanation of a
phenomenon.
Standard PCA
Principal component analysis (PCA) is a mathematical
technique for finding linear signal combinations with the
greatest variance.
The first neuron should learn the most
important correlations, so first we
calculate the correlations of its inputs
averaged over time:
$C_{ik} = \langle x_i x_k \rangle_t$ for the first element, then for
the next; but each neuron should be
independent, so it should calculate
orthogonal combinations.
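For comparison with the neural versions, the numerical route can be sketched in a few lines. Power iteration and deflation are used here as one common way of extracting successive orthogonal components; the four input patterns (two fully correlated inputs plus one independent input) are a hypothetical data set, and the helper names are made up for this example.

```python
import math

def correlation_matrix(patterns):
    """C_ik = <x_i x_k>_t, the time average of input products."""
    n, m = len(patterns[0]), len(patterns)
    return [[sum(x[i] * x[k] for x in patterns) / m for k in range(n)]
            for i in range(n)]

def matvec(c, v):
    return [sum(cik * vk for cik, vk in zip(row, v)) for row in c]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def leading_component(c, iters=200):
    """Power iteration: repeatedly apply C and renormalize."""
    v = [1.0 / (i + 1) for i in range(len(c))]   # arbitrary non-zero start
    for _ in range(iters):
        v = matvec(c, v)
        norm = math.sqrt(dot(v, v))
        v = [vi / norm for vi in v]
    return v, dot(v, matvec(c, v))               # component and its variance

def deflate(c, v, lam):
    """Remove a component already found so the next one is orthogonal to it."""
    n = len(c)
    return [[c[i][k] - lam * v[i] * v[k] for k in range(n)] for i in range(n)]

patterns = [[1, 1, 1], [1, 1, -1], [-1, -1, 1], [-1, -1, -1]]
C = correlation_matrix(patterns)
pc1, var1 = leading_component(C)
pc2, var2 = leading_component(deflate(C, pc1, var1))

print(pc1, var1)  # dominated by the two correlated inputs
print(pc2, var2)  # the independent third input
```

The explicit deflation step is exactly the orthogonalization that is awkward to realize with neurons.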
For a set of images, consecutive
components look like this:
[Figure: successive PCA components for a set of images]
How to do this with the help of
neurons?
PCA on one neuron
Let's assume that the environment is composed of
diagonal lines.
Let's accept a linear activation for moment t (image number t):
$y_j = \sum_k x_k w_{kj}$
Let the change in weights be specified by
the simple Hebb's rule:
$w_{ij}(t+1) = w_{ij}(t) + \epsilon\, x_i y_j$
After presentation of all n images:
$\Delta w_{ij} = \epsilon \sum_t x_i y_j = \epsilon' \frac{1}{n} \sum_t x_i y_j = \epsilon' \langle x_i y_j \rangle_t$, where $\epsilon' = n\epsilon$
The change in weights is proportional to the average of the product of
the inputs and outputs. Correlation can replace the average.
Hebbian Correlations
Correlation:
$C_{ij} = \frac{\langle (x_i - \langle x_i \rangle)(y_j - \langle y_j \rangle) \rangle_t}{\sqrt{\sigma_i^2 \sigma_j^2}}$
If the averages are zero and the variances are one then the average of the
product is the correlation; the change in weights is proportional to:
$\Delta w_{ij} \propto \langle x_i y_j \rangle_t = \langle x_i \sum_k x_k w_{kj} \rangle_t = \sum_k \langle x_i x_k \rangle_t \langle w_{kj} \rangle_t = \sum_k C_{ik} \langle w_{kj} \rangle_t$
$C_{ik} = \langle x_i x_k \rangle_t$ are correlations between inputs; the average of the weights changes
slowly. The change in weight for input i is then the weighted average of the
correlations between the activity of this input and the remaining ones.
After the presentation of many images, the weights will be dominated by the
strongest correlations and $y_j$ will calculate the strongest component of PCA.
Example
The first two inputs are completely correlated; the third is uncorrelated.
Changes follow according to Hebb's rule for $\epsilon = 1$.
Let's assume that the signals have a zero average ($x_i = +1$ the same
number of times as $x_i = -1$); for each vector $x = (x_1, x_2, x_3)$, y is calculated,
and then the new weights.
Correlated units determine the sign and scale of the weights,
and weights of these inputs grow quickly, whereas the weight of the
uncorrelated input $x_3$ lags behind (its share of the total decreases).
The weights of unit j change in this way:
$w(t+1) = w(t) + C\, w(t)$
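The update w(t+1) = w(t) + C w(t) can be iterated directly. A small sketch, assuming the idealized correlation matrix for two fully correlated inputs and one uncorrelated input, and hypothetical starting weights of 0.1:

```python
def hebb_batch_step(w, c):
    """w(t+1) = w(t) + C w(t), i.e. the averaged Hebbian update for learning rate 1."""
    return [wi + sum(cik * wk for cik, wk in zip(row, w))
            for wi, row in zip(w, c)]

# Correlation matrix: inputs 1 and 2 fully correlated, input 3 uncorrelated
C = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

w = [0.1, 0.1, 0.1]          # hypothetical starting weights
history = [w]
for _ in range(5):
    w = hebb_batch_step(w, C)
    history.append(w)

# Share of the total weight carried by the uncorrelated input x3
share_x3 = [abs(v[2]) / sum(abs(vi) for vi in v) for v in history]

for v, s in zip(history, share_x3):
    print([round(vi, 3) for vi in v], round(s, 3))
```

In this averaged form the weights of the correlated pair triple on every step while the third weight only doubles, so the share of x3 keeps shrinking even though no weight is bounded.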
Normalization
The simplest normalization avoiding an infinite increase in weights:
$\Delta w_{ij} = \epsilon\, (x_i - w_{ij})\, y_j$
Erkki Oja (1982) proposed:
$\Delta w_{ij} = \epsilon\, (x_i - y_j w_{ij})\, y_j$
For one unit, after learning the weights stop changing:
$\Delta w_{ij} = 0 = \epsilon\, (x_i - y_j w_{ij})\, y_j$
Weight $w_{ij} = x_i / y_j = x_i / \sum_k x_k w_{kj}$
The weight of a given input signal is then a fraction of the complete
weighted activity of all the signals.
This rule also leads to the calculation of the first principal
component. How to calculate the other components?
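Oja's rule can be checked on a small data set. The sketch below averages the update over all patterns before applying it (a batch variant, chosen here only so that the result is deterministic); the patterns, starting weights and learning rate are hypothetical.

```python
import math

def oja_epoch(w, patterns, lrate):
    """Average the Oja update dw_i = lrate * (x_i - y w_i) * y over all patterns."""
    dw = [0.0] * len(w)
    for x in patterns:
        y = sum(xi * wi for xi, wi in zip(x, w))      # linear activation
        for i, xi in enumerate(x):
            dw[i] += lrate * (xi - y * w[i]) * y / len(patterns)
    return [wi + dwi for wi, dwi in zip(w, dw)]

# Inputs 1 and 2 fully correlated, input 3 independent, zero mean
patterns = [[1, 1, 1], [1, 1, -1], [-1, -1, 1], [-1, -1, -1]]
w = [0.3, 0.2, 0.1]           # hypothetical starting weights
for _ in range(500):
    w = oja_epoch(w, patterns, lrate=0.1)

norm = math.sqrt(sum(wi * wi for wi in w))
print([round(wi, 4) for wi in w], round(norm, 4))
```

The weights settle on the direction of the two correlated inputs with unit length, instead of growing without limit as under the simple rule.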
Problems of PCA
How to generate the succeeding PCA components in neural networks?
Numerically we perform orthogonalization of successive $y_j$, but this is
not easy to do with the help of a neural network.
Sequential PCA orders components from the most important to the
least; this can be achieved by introducing connections between hidden
neurons, but this is an artificial solution.
PCA assumes a hierarchical structure: one most important component
for all images, then the next, and so on. In image analysis, for example, the successive
components come out as chessboards with an increasing number of squares,
since the correlations of pixels disappear over a large number of images.
The problem with PCA can be characterized as follows: PCA calculates
correlations in the entire input space, whereas useful correlations exist in
local subspaces.
Natural images create heterarchies: different combinations are equally
important for different images, and subsets of features relevant for certain
categories are not important for differentiating others.
Conditional PCA
Conditional principal component analysis (CPCA): calculate correlations
not for all features but only for those features which are present.
PCA functions on all features, giving
orthogonal components.
CPCA functions on subsets of
features, ensuring that different
components encode different
interesting combinations of signal
features, e.g. edges.
The competition realized with the help
of kWTA will ensure the activity of
different neurons for different images.
In effect: encoding of images. How to
do this with the help of neurons?
CPCA equations
A neuron is trained only on a subset of images
with predetermined features, e.g.
edges slanting in a certain way.
Normalized Hebb's rule:
$\Delta w_{ij} = \epsilon\, (x_i - w_{ij})\, y_j$
The weights move in the direction of $x_i$, on condition that $y_j$ is active.
In effect the weight becomes a conditional probability:
$P(x_i = 1 | y_j = 1) = P(x_i | y_j) = w_{ij}$
The weight $w_{ij}$ = the probability that the input unit $x_i$ is active given
that the receiving unit $y_j$ is also active.
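The probability reading of the weights can be seen in a small stochastic run. This is a sketch loosely modeled on a two-event environment with p_right = 0.7 and lrate = 0.005; treating the receiving unit as always active (y = 1) and reducing each line to a single input are simplifying assumptions made for this example.

```python
import random

def cpca_step(w, x, y, lrate):
    """CPCA update dw_i = lrate * (x_i - w_i) * y."""
    return [wi + lrate * (xi - wi) * y for xi, wi in zip(x, w)]

rng = random.Random(1)       # fixed seed so the run is repeatable
p_right, lrate = 0.7, 0.005
w = [0.5, 0.5]               # [weight from right-line input, weight from left-line input]
for _ in range(5000):
    right = rng.random() < p_right
    x = [1.0, 0.0] if right else [0.0, 1.0]
    w = cpca_step(w, x, y=1.0, lrate=lrate)

print([round(wi, 3) for wi in w])  # fluctuates around [0.7, 0.3]
```

The weights keep fluctuating a little around the conditional probabilities; a smaller learning rate gives a smoother but slower estimate.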
Probabilistic interpretation
The success of CPCA depends on the selection of a function
determining the activity of neurons – an automatic determination
process is possible in a few ways: self-organization or error correction.
Activations averaged over time are represented by probabilities $P(x_i|t)$,
$P(y_j|t)$. The change in weights for all images t appearing with probability P(t):
$\Delta w_{ij} = \epsilon \sum_t [P(y_j|t) P(x_i|t) - P(y_j|t)\, w_{ij}]\, P(t)$
In a state of equilibrium $\Delta w_{ij} = 0$, so:
$w_{ij} = \frac{\sum_t P(y_j|t) P(x_i|t) P(t)}{\sum_t P(y_j|t) P(t)} = \frac{\sum_t P(y_j, x_i, t)}{\sum_t P(y_j, t)} = \frac{P(x_i, y_j)}{P(y_j)} = P(x_i|y_j)$
Weight $w_{ij}$ = conditional probability of $x_i$ under condition $y_j$.
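The equilibrium can be verified with concrete numbers. In this sketch the two image types and their probabilities are hypothetical; the helper names are made up for the example.

```python
def cpca_equilibrium(p_t, p_x_given_t, p_y_given_t):
    """w = sum_t P(y|t) P(x|t) P(t) / sum_t P(y|t) P(t)."""
    joint = sum(py * px * pt
                for pt, px, py in zip(p_t, p_x_given_t, p_y_given_t))
    p_y = sum(py * pt for pt, py in zip(p_t, p_y_given_t))
    return joint / p_y

def expected_dw(w, p_t, p_x_given_t, p_y_given_t, lrate=0.005):
    """Expected CPCA weight change summed over all images t."""
    return lrate * sum((py * px - py * w) * pt
                       for pt, px, py in zip(p_t, p_x_given_t, p_y_given_t))

p_t = [0.7, 0.3]   # two image types appearing with these probabilities
p_x = [1.0, 0.0]   # the input is active only for the first image type
p_y = [1.0, 0.5]   # the receiving unit is always active for type 1, half the time for type 2

w_eq = cpca_equilibrium(p_t, p_x, p_y)
print(w_eq)                               # P(x|y) = 0.7 / 0.85
print(expected_dw(w_eq, p_t, p_x, p_y))   # approximately 0 at equilibrium
```

Because the receiving unit is also active for some images in which the input is off, the weight settles above P(x) = 0.7 but below 1.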
How to biologically justify normalization?
Biological interpretation
Normalized Hebb's rule:
$\Delta w_{ij} = \epsilon\, (x_i - w_{ij})\, y_j$
Let's assume that the weights are $w_{ij}$ ~ 0.5; there are then 3 possibilities:
1. $x_i$, $y_j$ ~ 1 (strong pre- and postsynaptic activity), so $x_i > w_{ij}$;
weights increase, so we have LTP, as in NMDA channels.
2. $y_j$ ~ 1 but $x_i < w_{ij}$; weights decrease, so we have LTD. A weak input signal
will suffice to unblock the Mg++ ions of the NMDA channel.
A strong postsynaptic activity can also unblock other voltage-dependent
channels and introduce a small amount of Ca++.
3. Activity $y_j$ ~ 0 doesn't give any changes; voltage-dependent channels and NMDA
channels aren't active.
Weight increase is faster for small $w_{ij}$, because then $x_i > w_{ij}$ more often.
This is qualitatively consistent with observations of weight saturation.
Simulations
Select:
hebb_correl.proj, in Chapter 4
Description: Chapter 4.6
Look at Events
Evt Label, and within this FreqEvent is 1 for Right and 0 for Left
Change in weight values: Graph_log
lrate = 0.005, try 0.1
Change p_right from 1 to 0.7 and to 0.5
Change Env_type from One_line to Three_lines and p_right=0.7
Notice that the weights become small and diffuse, because the
conditional probabilities become small when a unit learns entire
categories of images; the output unit contributes to this because it has
low selectivity.
Normalization of weights in CPCA
CPCA weights are not very selective and don't lead to image differentiation:
they lack dynamic range; for typical situations $P(x_i|y_j)$ is small,
but we want it around 0.5.
Solution: renormalization of weights and contrast enhancement.
Normalization: uncorrelated signals should have a weight of 0.5, but in
simulations with seldom appearing signals $x_i$ the weights approach a value of
$\alpha$ ~ 0.1-0.2. Let's factorize the weight change into two terms:
$\Delta w_{ij} = \epsilon\, (x_i - w_{ij})\, y_j = \epsilon\, [(1 - w_{ij})\, x_i y_j + (1 - x_i)(0 - w_{ij})\, y_j]$
The first term causes an increase in weights in the direction of 1, the
second causes a decrease in the direction of 0; if we want to maintain
average weights around 0.5 we must increase the first term, e.g.:
$\Delta w_{ij} = \epsilon\, [(0.5/\alpha - w_{ij})\, x_i y_j + (1 - x_i)(0 - w_{ij})\, y_j]$
The relation is still linear: $w_{ij} = P(x_i|y_j)\, 0.5/\alpha$. The simulator has a
parameter savg_cor in [0,1] determining the degree of normalization.
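The effect of the larger first term can be checked by iterating the expected weight change. A minimal sketch, assuming the receiving unit is active (y = 1), the input is on with probability p, and the upper target in the first term is a free parameter m (m = 1 for plain CPCA, m = 0.5/α for the renormalized rule); the value α = 0.2 and the helper names are hypothetical, and the savg_cor parameter of the simulator is not modeled here.

```python
def renorm_expected_dw(w, p, lrate, m=1.0):
    """Expected change for an active receiver (y = 1) when P(x=1 | y=1) = p.

    m = 1 gives standard CPCA; m = 0.5 / alpha gives the renormalized rule.
    """
    return lrate * ((m - w) * p + (1.0 - p) * (0.0 - w))

def settle(p, m, lrate=0.1, steps=500, w=0.25):
    """Iterate the expected update until the weight stops changing."""
    for _ in range(steps):
        w += renorm_expected_dw(w, p, lrate, m)
    return w

alpha = 0.2                                  # hypothetical expected input activity
w_plain = settle(p=alpha, m=1.0)             # settles at P(x|y) = 0.2
w_renorm = settle(p=alpha, m=0.5 / alpha)    # settles at (0.5 / alpha) * P(x|y) = 0.5

print(round(w_plain, 6), round(w_renorm, 6))
```

An input that is active at its expected rate α thus ends up with a weight of 0.5, leaving room for both stronger and weaker correlations to show.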
Contrast in CPCA
Instead of a linear weight change we want to ignore weak correlations
and strengthen strong correlations – to increase the contrast between
interesting aspects of signals and those that are not.
This increases the simplicity of the connections (the weak ones can be
skipped) and accelerates the learning process, helping the weights
decide what to do.
Contrast enhancement:
instead of a linear weight
change use a sigmoidal one:
$\hat{w}_{ij} = \frac{1}{1 + \left( \frac{w_{ij}}{\theta\, (1 - w_{ij})} \right)^{-\gamma}}$
Two parameters:
gain $\gamma$ and offset $\theta$, where $\theta > 1$
imposes a higher threshold.
Attention: this is a scaling of
individual weights, not of activations!
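The shape of the effective-weight curve can be tabulated with a short function. This is a sketch under an assumption: the sigmoid is written with the offset dividing the odds w/(1-w), a form chosen because it makes an offset above 1 raise the threshold, as stated above. The function name and the sample weights are illustrative.

```python
def effective_weight(w, gain=6.0, offset=1.0):
    """Sigmoidal contrast enhancement of a linear weight w in [0, 1]."""
    if w <= 0.0:
        return 0.0
    if w >= 1.0:
        return 1.0
    return 1.0 / (1.0 + (w / (offset * (1.0 - w))) ** (-gain))

for w in (0.2, 0.5, 0.8):
    print(w,
          round(effective_weight(w, gain=1.0), 4),     # gain 1, offset 1: unchanged
          round(effective_weight(w), 4),               # gain 6: weak down, strong up
          round(effective_weight(w, offset=1.25), 4))  # higher threshold
```

With gain 1 and offset 1 the function returns the weight unchanged; a higher gain pushes weights below the threshold toward 0 and those above it toward 1.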
Simulations
Select:
hebb_correl.proj, in Chapt. 4
Description: Chapt. 4.6
Change Env_type from One_line to Five_lines and p_right=0.7
For these lines CPCA gives identical weights around 0.2.
Change the normalization, setting savg_cor=1
The weights should be around 0.5
The parameter savg_cor allows us to influence the number of features
used by the hidden units.
Contrast: set wt_gain=6 instead of 1, PlotEffWt will show the curve of
effective weights.
Influence on learning: for Three_lines, savg_cor=1
Change wt_off from 1 to 1.25