Connectionist Models: The Briefest Course

Robert M. French
LEAD – CNRS UMR 5022
Dijon, France
What do cows drink?
Symbolic AI:
ISA(cow, mammal)
ISA(mammal, animal)
Rule 1: IF animal(X) AND thirsty(X) THEN lack_water(X)
Rule 2: IF lack_water(X) THEN drink_water(X)
Conclusion: Cows drink water.
What do cows drink?
Connectionism:
[Figure: the nodes COW, DRINK, and MILK, with activation spreading among them within about 100 ms; one label marks "what interests Symbolic AI", another "what interests Connectionism".]
The MILK neurons are activated within about 100 ms, without the word "milk" ever having been heard.
Artificial Neural Networks
“Systems that are deliberately constructed to make use of
some of the organizational principles that are felt to be used
in the human brain.” (Anderson & Rosenfeld, 1990,
Neurocomputing, p. xiii)
The Origin of Connectionist Networks
Major Dates
William James (1892): the idea of a network of associations in the brain
McCulloch & Pitts (1943, 1947): the "logical" neuron
Hebb (1949): The Organization of Behavior: Hebbian learning and the formation of cell assemblies
Hodgkin & Huxley (1952): description of the chemistry of neuron firing
Rochester, Holland, Haibt, & Duda (1956): first real neural-network computer model
Rosenblatt (1958, 1962): the perceptron
Minsky & Papert (1969): bring the walls down on perceptrons
Hopfield (1982, 1984): the Hopfield network, settling to an attractor
Kohonen (1982): unsupervised learning network
Rumelhart & McClelland and the PDP Research Group (1986): backpropagation, etc.
Elman (1990): the simple recurrent network
Hinton (1980 – present): just about everything else...
McCulloch & Pitts (1943, 1947)
[Figure: a unit receiving binary inputs (1, 0, 0), with a threshold T and a single output.]
The McCulloch & Pitts representation of the "essential" neuron was a logic gate (here, an AND gate).
The real neuron was far, far more complex, but they felt that they had captured its essence: neurons were the biological equivalent of logic gates.
Conclusion: Collections of neurons, appropriately wired together, can do logical calculus. Cognition is just a complex logical calculus.
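A McCulloch-Pitts unit is simple enough to write out directly. The following is a minimal sketch (the weights and threshold values are illustrative, not from the slides) showing how a single threshold unit over binary inputs behaves as an AND gate:

```python
def mcculloch_pitts_unit(inputs, weights, threshold):
    """Fire (output 1) iff the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# An AND gate: both inputs must be on for the unit to fire.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mcculloch_pitts_unit([x1, x2], weights=[1, 1], threshold=2))

# Lowering the threshold to 1 turns the same unit into an OR gate.
```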
Hebb (1949)
Connecting changes in neurons to cognition
Hebb asked: What changes at the neuronal level might make possible our
acquisition of high-level (semantic) information?
His answer: Learning rule of synaptic reinforcement (Hebbian learning).
When neuron A fires and is followed
immediately by the firing of neuron B,
the synapse between the two neurons is
strengthened, i.e., the next time A fires, it
will be easier for B to fire.
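Hebb's rule can be written as a simple weight update: the A-to-B connection grows when the two neurons are active together. A minimal sketch (the learning rate and the exact multiplicative form are illustrative assumptions; Hebb stated the principle only verbally):

```python
def hebbian_update(w_ab, a_activity, b_activity, learning_rate=0.1):
    """Strengthen the A->B synapse in proportion to the co-activity of A and B."""
    return w_ab + learning_rate * a_activity * b_activity

w = 0.0
for _ in range(5):                      # A and B repeatedly fire together...
    w = hebbian_update(w, a_activity=1.0, b_activity=1.0)
print(w)                                # ...so the synapse is strengthened (0.5)
```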
Connecting neural function to behavior
[Figure: levels of modeling, from high to low, with the "Hebbian Gap" separating the top level from the neural levels below:]
- High-level models of human cognition and behavior
- Neuronal population coding models
- Low-level models of single neurons
- Even lower-level models of synapses and ion channels
Cell assemblies: Closing the Hebbian Gap
Cell assemblies at the neuronal level give rise to categories at the semantic level.
The formation of cell assemblies involves:
• persistence of activity without external input. Cell assemblies can overlap, e.g., the cell assembly associated with "dog" will overlap with those associated with "wolf", "cat", etc.
• recruitment: creation of a new cell assembly (via Hebbian learning) corresponding to a new concept
• fractionation: creation of new cell assemblies from an old one, corresponding to the refinement of a concept
A Hebbian Cell Assembly
By means of the Hebbian Learning Rule, a circuit of continuously
firing neurons could be learned by the network.
The continuing activation in this cell assembly does not require
external input.
The activation of the neurons in this circuit would correspond to the
perception of a concept.
A Cell Assembly
[Animation: input from the environment activates a closed circuit of neurons; activation circulates around the assembly; the external input is then removed, and the assembly continues to fire on its own.]
Rochester, Holland, Haibt, & Duda (1956)
• First real simulation that attempted to implement the principles outlined by Hebb on real computer hardware.
• Attempted to simulate the emergence of cell assemblies in a small network of 69 neurons. They found that everything became active in their network.
• They decided that they needed to include inhibitory synapses (Hebb had only discussed excitatory synapses). This worked, and cell assemblies did, indeed, form.
• Probably the earliest example in neural-network modeling of a model making a prediction (i.e., that inhibitory synapses are needed to form cell assemblies) that was later confirmed in real brain circuitry.
Rosenblatt (1958, 1962): The Perceptron
• Rosenblatt’s perceptron could learn to associate inputs
with outputs.
• He believed this was how the visual system learned to
associate low-level visual input with higher level
concepts.
• He introduced a learning rule (weight-change
algorithm) that allowed the perceptron to learn
associations.
The elementary perceptron
Consists of:
• two layers of nodes (one layer of weights)
• only feedforward connections
• a threshold function on each output unit
• a linear summation of the weights times inputs
[Figure: an elementary perceptron. Inputs x1 and x2 feed, through weights w1 and w2, a single output unit with threshold T; y is the actual output and t the desired output (the "teacher").]
Activation rule: if Σi wi·xi ≥ Threshold, then y = 1; else y = 0.
The perceptron (Widrow-Hoff) learning rule (weight-change rule) is:
wnew = wold + η·x·(t − y), where 0 < η ≤ 1
η is the learning constant.
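A minimal sketch of the rule in code (the task, learning rate, and initial weights are illustrative assumptions): a single threshold unit trained with the Widrow-Hoff rule on a linearly separable problem such as logical AND converges to a correct set of weights.

```python
import random

def train_perceptron(patterns, eta=0.25, threshold=1.0, epochs=50):
    """Train a single threshold unit with the Widrow-Hoff rule: w += eta * x * (t - y)."""
    n_inputs = len(patterns[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, t in patterns:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
            w = [wi + eta * xi * (t - y) for wi, xi in zip(w, x)]
    return w

# Logical AND is linearly separable, so the perceptron converges.
and_patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_patterns)
for x, t in and_patterns:
    y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 1.0 else 0
    print(x, "->", y, "(target", t, ")")
```

Running the same loop on XOR never converges, which is exactly the limitation discussed below.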
[Figure: pixel inputs xi from an image of two crossed straight lines feed, through weights wi, an output unit representing the character "X".]
This perceptron learns to associate the visual input of two crossed straight lines with the character "X". In other words, the output of the network will be the character "X".
Generalization
[Figure: the same network receiving a degraded image of the crossed lines.]
The real image in the world is degraded, but if the network has already learned to correctly identify the original complete "X", it will recognize the degraded X as being an "X".
Fundamental limitations of the perceptron
Minsky & Papert (1969) showed that the Rosenblatt two-layer perceptron had a fundamental limitation: it could only classify linearly separable sets.
[Figure: two scatter plots of points labeled X and Y. In the first, a straight line can separate the X's from the Y's; in the second, no straight line can.]
The (infamous) XOR problem
• Minsky and Papert showed there were a number of extremely simple patterns that no perceptron could learn, including the logical function XOR.
• Since cognition supposedly required elementary logical operations,
this severely weakened the perceptron’s claim to be able to do general
cognition.
XOR
Input    Output
0 0      0
0 1      1
1 0      1
1 1      0
There is no set of weights w1 and w2 and a threshold T such that the perceptron below can learn the above XOR function.
[Figure: inputs x1 and x2 feed, through weights w1 and w2, a single output unit with threshold T; y is the actual output and t the desired output (the "teacher").]
The activation arriving at the output node is w1x1 + w2x2.
If w1x1 + w2x2 ≥ T, then we output 1, otherwise 0.
But w1x1 + w2x2 = T is a straight line if we consider x1 and x2 to be the axes of a coordinate system.
[Figure: the four XOR input points in the (x1, x2) plane: (0,1) and (1,0) must output 1, while (0,0) and (1,1) must output 0; however the line w1x1 + w2x2 = T is drawn, it fails to separate them.]
No values of w1, w2, and T will form a straight line w1x1 + w2x2 = T with (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.
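One way to see this (a short derivation added here for completeness; the slides state only the conclusion) is to write out the four constraints the XOR table imposes and note that they contradict one another:

```latex
\begin{aligned}
(0,0)\mapsto 0 &: \quad 0 < T \\
(0,1)\mapsto 1 &: \quad w_2 \ge T \\
(1,0)\mapsto 1 &: \quad w_1 \ge T \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 < T
\end{aligned}
```

Adding the middle two constraints gives w1 + w2 ≥ 2T > T (since T > 0), which contradicts the last one; so no such w1, w2, and T exist.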
The Revival of the (Multi-layered) Perceptron:
The Connectionist Revolution (1985)
and the Statistical Nature of Cognition
By the early 1980s, Symbolic AI had hit a wall. "Simple" tasks that humans do (almost) effortlessly (face, word, and speech recognition, retrieving information from incomplete cues, generalizing, etc.) proved to be notoriously hard for symbolic AI.
• Minsky (1967): “Within a generation the problem of creating
‘artificial intelligence’ will be substantially solved.”
• Minsky (1982): “The AI problem is one of the hardest ever
undertaken by science.”
By the early 1980’s the statistical nature of much
of cognition became ever more apparent.
Three factors contributed to the revival of the perceptron:
• the radical failure of AI to achieve the goals announced in
the 1960’s
• the growing awareness of the statistical and “fuzzy” nature
of cognition
• the development of improved perceptrons, capable of
overcoming the linear separability problems brought to
light by Minsky & Papert.
Advantages of Connectionist Models compared to Symbolic AI
• Learning: specifically designed to learn.
• Pattern completion of familiar patterns.
• Generalization: can generalize to novel patterns based on previously learned patterns.
• Retrieval with partial information: can retrieve information in memory based on nearly any attribute of the representation.
• Massive parallelism. The 100-step processing constraint (Feldman & Ballard, 1982): neural hardware is too slow and too unreliable for long sequential models of processing. We can do very complex processing in a few hundred ms, yet transmission across a synapse (~10^-6 in.) takes about 1 ms. Thus, complex tasks must be accomplished in no more than a few hundred serial steps, which is impossible for a strictly sequential model; processing must be massively parallel.
• Graceful degradation: when they are damaged, their performance degrades gradually.
Real Brains and Connectionist Networks
Some characteristics of real brains that serve as the basis of ANN design:
• Neurons receive input from lots of other neurons.
• Massive parallelism: neurons are slow, but there are lots of them.
• Learning involves modifying the strength of synaptic connections.
• Neurons communicate with one another via activation or inhibition.
• Connections in the brain have a clear geometric and topological structure.
• Information is continuously available to the brain.
• Graceful degradation of performance in the face of damage and information overload.
• Control is distributed, not central (i.e., no central executive).
• One primary way of understanding what the brain does is relaxation to attractors.
General principles of all connectionist networks
• a set of processing units
• a state of activation defined over all of the units
• an output function ("squashing function") for each unit, which transforms unit activation into outgoing activation
• a connectivity pattern with two features:
   - the weights of the connections
   - the locations of the connections
• an activation rule for combining the inputs impinging on a unit to produce a total activation for the unit
• a learning rule, by which the connectivity pattern is changed
• an environment in which the system operates (i.e., how the input/output is represented and given to/taken from the system)
Knowledge storage and Learning
• Knowledge storage: Knowledge is stored exclusively in the
pattern of strengths of the connections (weights) between units.
The network stores multiple patterns in the SAME set of
connections.
• Learning: The system learns by automatically adjusting the
strengths of these weights as it receives information from its
environment.
There are no high-level rules programmed into the system. Because all patterns are stored in the same set of connections, generalization, graceful degradation, etc. are relatively easy in connectionist networks. It is also what makes planning, logic, etc. so hard.
Two major classes of networks
• Supervised: Includes all error-driven learning algorithms. The error
between the desired output and the actual output determines how to
change the weights. This error is gradually decreased by the learning
algorithm.
• Unsupervised: There is no error feedback signal. The network
automatically clusters the input into categories.
Example: if the network is presented with 100 patterns, half of
which are different kinds of ellipses and half of which are different
types of rectangles, it would automatically group these patterns into the
two appropriate categories. There is no feedback to tell the network
explicitly “this is a rectangle” or “this is an ellipse.”
So, how did they solve the problem of linear separability?
ANSWER:
i) by adding another "hidden" layer to the perceptron between the input and output layers,
ii) by introducing a differentiable squashing function, and
iii) by discovering a new learning rule (the "generalized delta rule").
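The differentiable squashing function usually meant here is the logistic sigmoid, whose derivative has a conveniently simple form; that is what makes the generalized delta rule workable. (The formulas below are the standard ones, added for reference; the slides name only the ingredients.)

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)
```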
“Concurrent” learning
Learning a series of patterns:
If each pattern in the series is learned to criterion (i.e., completely)
sequentially, the learning of the new patterns will erase the learning
of the previously-learned patterns. This is why concurrent learning
must be used. Otherwise, catastrophic forgetting may occur.
Concurrent learning
One epoch:
- 1st pattern presented to the network: change its weights a little to reduce the error on that pattern;
- 2nd pattern: change its weights a little to reduce the error on that pattern;
- etc.
- last pattern: change its weights a little to reduce the error on that pattern;
- REPEAT until the error for all patterns is below criterion.
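In code, the difference between the two regimes is just the order of the loops. A minimal sketch, where update_weights and error are placeholders standing in for whatever learning rule and error measure the network uses:

```python
def train_concurrently(network, patterns, criterion=0.01, max_epochs=10000):
    """One small weight change per pattern per epoch, cycling until ALL patterns are learned."""
    for _ in range(max_epochs):
        for x, target in patterns:              # every pattern gets a small update each epoch
            network.update_weights(x, target)   # placeholder for the learning rule
        if all(network.error(x, t) < criterion for x, t in patterns):
            break
    return network

# The sequential alternative -- training each pattern to criterion before moving to the
# next -- is what produces catastrophic forgetting of the earlier patterns.
```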
Backpropagation
[Figure: a three-layer network. The input layer (nodes subscripted with k's) receives input from the environment; weights wjk connect it to the hidden layer (nodes subscripted with j's), which forms the hidden-layer representation; weights wij connect the hidden layer to the output layer (nodes subscripted with i's). The actual output is compared with the desired output (the "teacher") to give the error.]
Training of a backpropagation network
i) A feedforward activation pass, with activation "squashed" at the hidden layer.
ii) The output is compared with the desired output (giving the error signal).
iii) This error signal is "backpropagated" through the network to change the network's weights (by gradient descent).
iv) When the overall error is below a predefined criterion, learning stops.
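A compact sketch of steps i)–iv) for the smallest interesting case, XOR (the network size, learning rate, and stopping criterion are illustrative choices, not from the slides). With one hidden layer and the sigmoid squashing function, the network learns the very function that defeated the two-layer perceptron:

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR: the pattern set a two-layer perceptron cannot learn.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

N_HIDDEN = 4
w_jk = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(N_HIDDEN)]  # input -> hidden
b_j = [random.uniform(-1, 1) for _ in range(N_HIDDEN)]
w_ij = [random.uniform(-1, 1) for _ in range(N_HIDDEN)]                      # hidden -> output
b_i = random.uniform(-1, 1)
eta = 0.5  # learning rate

def forward(x):
    # i) feedforward pass, activation "squashed" at the hidden (and output) units
    h = [sigmoid(sum(w * xi for w, xi in zip(w_jk[j], x)) + b_j[j]) for j in range(N_HIDDEN)]
    y = sigmoid(sum(w_ij[j] * h[j] for j in range(N_HIDDEN)) + b_i)
    return h, y

for epoch in range(20000):
    total_error = 0.0
    for x, t in patterns:                      # concurrent learning: one small update per pattern
        h, y = forward(x)
        total_error += (t - y) ** 2
        # ii)-iii) error signal, backpropagated with gradient descent
        delta_out = (t - y) * y * (1 - y)
        delta_hidden = [delta_out * w_ij[j] * h[j] * (1 - h[j]) for j in range(N_HIDDEN)]
        for j in range(N_HIDDEN):
            w_ij[j] += eta * delta_out * h[j]
            for k in range(2):
                w_jk[j][k] += eta * delta_hidden[j] * x[k]
            b_j[j] += eta * delta_hidden[j]
        b_i += eta * delta_out
    if total_error < 0.01:                     # iv) stop when overall error is below criterion
        break

for x, t in patterns:
    print(x, "->", round(forward(x)[1], 2), "(target", t, ")")
```

The hidden layer is what does the work: its units re-represent the four input points so that they become linearly separable for the output unit.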
Backpropagation networks are excellent
function-learners...
...but they also suffer from catastrophic
interference.
[Figure: two panels, "Backpropagation networks" and "Humans", plotting proportion correct (0 to 1.0) on the A-B and A-C lists as a function of learning epochs (networks) or learning trials (humans) on the A-C list. As the A-C list is learned, human recall of the A-B list declines only gradually, whereas the network's recall of the A-B list collapses.]
They can learn to read words aloud
(NetTalk, 1987) ....
... but they have trouble learning sequences.
Much of our cognition involves learning sequences of patterns. Standard BP networks are fine for learning input-output patterns, but they cannot be used effectively to learn sequences of patterns.
Consider the sequence: A B C D E F G H I
For this sequence we could train a network to associate the following input-output pairs:
A → B
B → C
C → D
D → E
E → F
F → G
G → H
H → I
If we give the network A as its "seed", it would produce B on output, which we would feed back into the network to produce C on output, and so on. Thus, we could reproduce the original sequence.
But what about context-dependent sequences?
But what if the sequence were: A B C D E F C H I
Here C is repeated. The technique above would give:
A → B
B → C
C → D
D → E
E → F
F → C
C → H
H → I
But the network could not learn this sequence, since it has no context to distinguish the two different outputs associated with C (for the first occurrence, D; for the second, H).
A "sliding window" solution
Consider a "sliding window" solution to provide the context. Instead of having the network learn single-letter inputs, it will learn two-letter inputs, thus:
AB → C
BC → D
CD → E
DE → F
EF → G
FG → H
GH → I
Now the network is fed AB (here, "A" serves as "context" for "B") as its seed, and it can reproduce the sequence with the repeated C without difficulty. But what if we needed more than one letter's worth of context, as in a sequence like this:
A B C D E B C H I
Now the network needs another context letter... and so on.
Conclusion: the sliding-window technique doesn't work in general.
Elman's solution (1990): The Simple Recurrent Network
[Figure: output units at the top, hidden units in the middle, input units and context units at the bottom; at each time step the hidden-unit activations are copied into the context units, which feed back into the hidden layer on the next step.]
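A minimal sketch of the SRN forward pass (layer sizes and weights are illustrative; no training code is shown): the only difference from a standard feedforward step is that the hidden activations are copied into the context units and fed back in on the next time step.

```python
import math, random

random.seed(1)

N_IN, N_HIDDEN, N_OUT = 4, 3, 4     # illustrative sizes
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
rand_matrix = lambda rows, cols: [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

w_in = rand_matrix(N_HIDDEN, N_IN)           # input -> hidden
w_context = rand_matrix(N_HIDDEN, N_HIDDEN)  # context -> hidden
w_out = rand_matrix(N_OUT, N_HIDDEN)         # hidden -> output

context = [0.0] * N_HIDDEN                   # context units start empty

def srn_step(x):
    """One time step: the hidden layer sees the current input AND a copy of the previous hidden state."""
    global context
    hidden = [sigmoid(sum(w_in[j][k] * x[k] for k in range(N_IN)) +
                      sum(w_context[j][c] * context[c] for c in range(N_HIDDEN)))
              for j in range(N_HIDDEN)]
    output = [sigmoid(sum(w_out[i][j] * hidden[j] for j in range(N_HIDDEN))) for i in range(N_OUT)]
    context = hidden[:]                      # the "copy" connection: context <- hidden
    return output

# Feeding a sequence of one-hot "letters": the same input can now produce different
# outputs depending on what came before, because the context units differ.
sequence = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
for x in sequence:
    print([round(o, 2) for o in srn_step(x)])
```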
SRN Bilingual language learning (French, 1998; French & Jacquet, 2004)
Input to the SRN:
- Two "micro" languages, Alpha & Beta, 12 words each
- An SVO grammar for each language
- Unpredictable language switching
Attempted prediction:
BOY LIFTS TOY MAN SEES PEN GIRL PUSHES BALL BOY PUSHES BOOK FEMME SOULEVE STYLO FILLE PREND STYLO GARÇON TOUCHE LIVRE FEMME POUSSE BALLON FILLE SOULEVE JOUET WOMAN PUSHES TOY....
(Note: absence of markers between sentences and between languages.)
The network tries each time to predict the next element.
We do a cluster analysis of its internal (hidden-unit) representations after it has seen 20,000 sentences.
Clustering of the internal representations formed by the SRN
[Figure: a cluster analysis (dendrogram) of the hidden-unit representations. The Alpha-language words (BOY, GIRL, MAN, WOMAN, LIFTS, SEES, TAKES, PUSHES, TOY, BALL, PEN, BOOK) form one cluster, labeled Alpha; the Beta-language words (GARCON, FILLE, FEMME, HOMME, SOULEVE, POUSSE, PREND, VOIT, JOUET, BALLON, STYLO, LIVRE) form another, labeled Beta.]
N.B. It also works for micro-languages with 768 words each.
Unsupervised learning: Kohonen networks
Kohonen networks cluster inputs in a non-supervised manner.
There is no activation spreading or summing process here: Kohonen networks adjust weight vectors to match input vectors.
[Figure: an input layer of six nodes fully connected by weights (w11, w12, ..., w52, w62) to two output nodes.]
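A minimal sketch of the weight-adjustment idea (a winner-take-all update without the usual neighbourhood function, on made-up two-dimensional inputs): each output node's weight vector is pulled toward the input vectors it wins, so the nodes come to stand for clusters in the input.

```python
import random

random.seed(2)

# Two clusters of 2-D input vectors (illustrative data).
inputs = [(random.gauss(0.2, 0.05), random.gauss(0.8, 0.05)) for _ in range(50)] + \
         [(random.gauss(0.9, 0.05), random.gauss(0.1, 0.05)) for _ in range(50)]

# One weight vector per output node, initialized randomly.
weights = [[random.random(), random.random()] for _ in range(2)]

def closest_node(x):
    """The 'winning' output node is the one whose weight vector best matches the input."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

eta = 0.1
for _ in range(20):                          # a few passes over the data
    for x in random.sample(inputs, len(inputs)):
        win = closest_node(x)
        # Move the winner's weight vector a little closer to this input vector.
        weights[win] = [wi + eta * (xi - wi) for wi, xi in zip(weights[win], x)]

print([[round(w, 2) for w in node] for node in weights])  # each node ends up near one cluster centre
```

There is no teacher anywhere in this loop: the two categories emerge purely from the structure of the inputs.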
The next frontier...
Computational neuroscience using spiking neurons, and variables such as their connection density and their firing timing and synchrony, to better understand human cognitive functions.
We are almost at a point where the population dynamics of large networks of these kinds of simulated neurons can realistically be studied.
Further in the future, neuronal models with Hodgkin-Huxley equations of membrane potentials and neuronal firing will be incorporated into our computational models of cognition.
Ultimately...
Gradually, neural network models and the computers they
run on will become good enough to give us a deep
understanding of neurophysiological processes and their
behavioral counterparts and to make precise predictions
about them.
They will be used to study epilepsy, Alzheimer’s disease,
and the effects of various kinds of stroke, without
requiring the presence of human patients.
They will be, in short, like the models used in all of the
other hard sciences. Neural modeling and neurobiology
will then have achieved a truly symbiotic relationship.