PPT - Sheffield Department of Computer Science


Basics of Neural Nets and Past-Tense model
References to read
Chapter 10 in Copeland’s Artificial Intelligence.
Chapter 2 in Aleksander and Morton ‘Neurons and
Symbols’.
Chapters 1, 3 and 5 in Beale and Jackson, ‘Neural
Computing: an Introduction’.
Chapter 18 in Rich and Knight, ‘Artificial Intelligence’.
The Human Brain
Contains approximately ten thousand million basic
units, called Neurons.
Each neuron is connected to many others.
Neuron is basic unit of the brain.
A stand-alone analogue logical processing unit.
Only basic details of neurons really understood.
Neuron accepts many inputs, which are all added up (in
some fashion).
If enough active inputs received at once, neuron will be
activated, and fire. If not, remains in inactive quiet
state.
Soma is the body of neuron.
Attached to soma are long filaments: dendrites.
Dendrites act as connections through which all the
inputs to the neuron arrive.
Axon: electrically active. Serves as output channel of
neuron.
Axon is non-linear threshold device. Produces pulse,
called action potential when resting potential within
soma rises above some threshold level.
Axon terminates in synapse which couples axon with
dendrite of another cell.
No direct linkage, but a temporary chemical one. The
synapse releases neurotransmitters which chemically
activate gates on the dendrites.
Activating gates, when open, allow charged ions to flow.
These charged ions alter the dendritic potential and
provide a voltage pulse on the dendrite, which is
conducted to the next neuron body/soma.
A single neuron will have many synaptic inputs on its
dendrites, and may have many synaptic outputs
connecting it to other cells.
(Diagram: an axon terminating in a synapse on the dendrite of another cell.)
Learning: occurs when modifications made to effective
coupling between one cell and another at the
synaptic junction.
More neurotransmitters are released, which opens more
gates in dendrite.
i.e. coupling is adjusted to favourably reinforce good
connections.
The human brain: poorly understood, but capable of
immensely impressive tasks.
For example: vision, speech recognition, learning etc.
Also, fault tolerant: distributed processing, many simple
processing elements sharing each job. Therefore can
tolerate some faults without producing nonsense.
Graceful degradation: with continual damage,
performance gradually falls from high level to
reduced level, but without dropping catastrophically
to zero.
(Computers do not exhibit graceful degradation:
intolerant of faults).
Idea behind neural computing: by modelling major
features of the brain and its operation, we can
produce computers that exhibit many of the useful
properties of the brain.
Modelling single neuron
Important features to model
 The output from a neuron is either on or off
 The output depends only on the inputs. A certain
number must be on at any one time in order to make
the neuron fire.
The efficiency of the synapses at coupling the incoming
signal into the cell body can be modelled by having a
multiplicative factor (i.e. a weight) on each of the
inputs to the neuron.
More efficient synapse has correspondingly larger
weight.
Total input = weight on line 1 x input on line 1 +
weight on line 2 x input on line 2 + … +
weight on line n x input on line n (for all n lines)
Basic model: performs weighted sum of inputs,
compares this to internal threshold level, and turns on
if this level exceeded.
This model of neuron proposed in 1943 by McCulloch
and Pitts.
Model of neuron, not a copy: does not have complex
patterns and timings of actual nervous activity in real
neural systems.
Because it is simplified, can implement on digital
computer (Logic gates and neural nets!)
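The weighted-sum-and-threshold unit can be sketched in a few lines. The AND/OR weights and thresholds below are hand-chosen for illustration, not taken from the original McCulloch and Pitts paper:

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) if the weighted sum of
    the inputs reaches the threshold, otherwise stay quiet (0)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Logic gates as threshold units (weights/thresholds chosen by hand):
AND = lambda a, b: mp_neuron([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mp_neuron([a, b], [1, 1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```

Raising the threshold from 1 to 2 is all that separates OR from AND: this is the sense in which logic gates fall out of the model.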
Remember: it is only one more metaphor of the brain!
Learning in simple neurons
Training nets of interconnected units.
Essentially, if a neuron produces incorrect output, we
want to reduce the chances of it happening again. If it
gives the correct output, do nothing.
For example, think of problem of teaching neural net to
tell the difference between a set of handwritten As
and a set of handwritten Bs.
In other words, to output a 1 when an A is presented,
and a 0 when a B is presented.
Start up with random weights on input lines and present
an A.
Neuron performs weighted sum of inputs and compares
this to threshold.
If it exceeds the threshold, output a 1; otherwise output a 0.
If correct, do nothing.
If it outputs a 0 (when A is presented) increase
weighted sum so next time it will output a 1
Do this by increasing the weights.
If it outputs a 1 in response to a B, decrease the
weights so next time output will be 0.
Summary

 Set the weights and thresholds randomly
 Present an input
 Calculate the actual output by taking the threshold
value of the weighted sum of the inputs.
 Alter the weights to reinforce correct decisions, i.e.
reduce the error
Learning is guided by knowing what we want it to
achieve = supervised learning.
The above shows the essentials of the early Perceptron
learning algorithm.
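The A-versus-B procedure above can be sketched as follows. The 4-pixel ‘A’ and ‘B’ patterns are hypothetical stand-ins for real handwritten input, and the learning rate and epoch count are arbitrary choices:

```python
import random

def predict(weights, threshold, x):
    # Weighted sum compared against the threshold: fire or not.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0

def train(patterns, epochs=100, eta=0.5, seed=0):
    """Perceptron rule: if the unit should have fired but didn't,
    raise the weights on active inputs (and lower the threshold);
    if it fired wrongly, do the opposite. If correct, do nothing."""
    rng = random.Random(seed)
    n = len(patterns[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    threshold = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in patterns:
            out = predict(weights, threshold, x)
            if out == target:
                continue                  # correct: leave weights alone
            delta = eta if target == 1 else -eta
            weights = [w + delta * xi for w, xi in zip(weights, x)]
            threshold -= delta            # lower threshold makes firing easier
    return weights, threshold

# Toy 'A' vs 'B' patterns (hypothetical 4-pixel encodings): target 1 for A, 0 for B.
patterns = [([1, 0, 1, 0], 1), ([1, 1, 1, 0], 1),
            ([0, 1, 0, 1], 0), ([0, 1, 1, 1], 0)]
weights, threshold = train(patterns)
print([predict(weights, threshold, x) for x, _ in patterns])
```

These toy classes are linearly separable, so the perceptron convergence theorem guarantees the loop eventually classifies every pattern correctly.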
This early history was called Cybernetics (rather than
AI)
But there are limitations to Perceptron learning.
Exclusive-OR truth table:
A B | A XOR B
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
Consider two propositions, either of
which may be true or false
 Exclusive-or is the relationship between
them when JUST ONE OF THEM is
true.
 It EXCLUDES the case when both are
true, so exclusive-or of the two is…
 False when both are true or both are
false, and true in the other two cases.

But there are limitations to Perceptron learning.
Consider Perceptron trying to find straight line that
separates classes.
In some cases, cannot draw straight line to separate
classes.
Eg XOR (if 0 is FALSE and 1 is TRUE):
Input 0 1 → 1
Input 1 0 → 1
Input 1 1 → 0
Input 0 0 → 0
Cannot separate the two pattern classes by a straight
line: They are linearly inseparable
This failure to solve apparently simple problems like
XOR pointed out by Minsky and Papert in
Perceptrons in 1969.
Stopped research in the area for the next 20 years!
During which time (non-neural) AI got under way.
1986: Rumelhart and McClelland: multi-layer
perceptron.
(Diagram: a feedforward net with two weight layers and three sets of units: input units, hidden units and output units, with a bias unit feeding each of the hidden and output layers.)
Adapted perceptron, with units arranged in layers: an
input layer, an output layer, and a hidden layer.
Added threshold function, and alter learning rule.
New learning rule: backpropagation (also a form of
supervised learning)
Net is shown pattern, and output is compared to desired
output (target).
Weights in the network adjusted, by calculating the
value of the error function for a particular input, and
then backpropagating the error from one layer to the
previous one.
Output weights (weights connected to the output layer)
can be adjusted so that the value of the error function
is reduced.
Less obvious how to adjust the weights for hidden units
(not directly producing an output). These weights are
adjusted in direct proportion to the error in the units
to which they are connected.
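A minimal sketch of this forward-pass-then-backpropagate cycle, assuming a 2-2-1 sigmoid network trained on XOR by gradient descent on a squared-error function (the seed, learning rate and epoch count are arbitrary choices):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR training data: the problem a single-layer perceptron cannot solve.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

rng = random.Random(1)
# One hidden layer of two units; every unit also has a bias weight.
w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
w_o = [rng.uniform(-1, 1) for _ in range(3)]                      # 2 hidden + bias

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    y = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

before = loss()
eta = 0.5
for _ in range(5000):
    for x, t in data:
        h, y = forward(x)
        # Error term at the output, then backpropagated to the hidden layer.
        d_out = (y - t) * y * (1 - y)
        d_hid = [d_out * w_o[i] * h[i] * (1 - h[i]) for i in range(2)]
        for i in range(2):
            w_o[i] -= eta * d_out * h[i]
        w_o[2] -= eta * d_out
        for i in range(2):
            w_h[i][0] -= eta * d_hid[i] * x[0]
            w_h[i][1] -= eta * d_hid[i] * x[1]
            w_h[i][2] -= eta * d_hid[i]

print(round(before, 3), "->", round(loss(), 3))
```

Note the point made in the slides: convergence is not guaranteed, so the sketch only checks that the error function falls, not that XOR is solved perfectly.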
(Diagram: a solution to the XOR problem. Both inputs feed each hidden unit with weight 1; the left hidden unit has threshold 0.5 and the right threshold 1.5. The left hidden unit feeds the output unit (threshold 0.5) with weight 1, the right hidden unit with weight -1. Inputs: 00, 01, 10, 11.)
Right-hand hidden unit detects when both inputs are on,
and ensures the output unit then gets a net input of zero.
With only one of the two inputs on, the right-hand unit's
threshold is never met (its output feeds the negative weight).
When only one of the inputs on, left-hand hidden unit is
on, turning on output unit.
When both inputs are off, hidden units are inactive, and
output unit is off
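The behaviour described above can be checked directly. The weights and thresholds below follow the diagram's values (1s on the input lines, hidden thresholds 0.5 and 1.5, a -1 weight from the right-hand hidden unit, output threshold 0.5), though the exact wiring is a reconstruction:

```python
def step(total, threshold):
    """Binary threshold unit: on iff the weighted sum reaches the threshold."""
    return 1 if total >= threshold else 0

def xor_net(a, b):
    # Left hidden unit fires when at least one input is on;
    # right hidden unit fires only when both are on, and its -1
    # weight then cancels the left unit's contribution at the output.
    left  = step(1 * a + 1 * b, 0.5)
    right = step(1 * a + 1 * b, 1.5)
    return step(1 * left + (-1) * right, 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # reproduces the XOR truth table
```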
BUT learning rule not guaranteed to produce
convergence: can fall into situation where it cannot
learn correct output. = local minimum.
BUT training requires repeated presentations.
Training multi-layer perceptrons, an inexact science: no
guarantee that net will converge on a solution (ie that
it will learn to produce the required output in
response to inputs).
Can involve long training times.
Little guidance about a number of parameters, including
the number of hidden units needed for a particular
task.
Also need to find a good input representation of a
problem. Often need to search for a good
preprocessing method.
Generalisation
Main feature of neural networks: ability to generalise
and to go beyond the patterns they have been trained
on.
Unknown pattern will be classified with others that have
same features.
Therefore learning by example is possible; net trained
on representative set of patterns, and through
generalisation similar patterns will also be classified.
Fault Tolerance
Multi-layer perceptrons are fault-tolerant because each
node contributes to final output. If node or weights
lost, only slight deterioration.
Ie graceful degradation
Brief History of Neural Nets
Connectionism/ Neural Nets/ Parallel Distributed
Processing.
McCulloch and Pitts (1943) Brain-like mechanisms –
showing how artificial neurons could be used to
compute logical functions.
Simplification 1: Neural communication thresholded –
neuron is either active enough to fire, or it is not.
Thus can be thought of as binary computing device
(ON or OFF).
Simplification 2: Synapses – equivalent to weighted
connections. So can add up weighted inputs to an
element, and use binary threshold as output function.
1949: Donald Hebb showed how neural nets could form
a system that exhibits memory and learning.
Learning – a process of altering the strengths of
connection between neurons.
Reinforcing active connections.
Rosenblatt (1962) and the Perceptron. Single layer
net, which can learn to produce an output given an
input.
But connectionist research almost killed off by Minsky
and Papert, by their book called ‘Perceptrons’.
Argument: Perceptrons computationally weak. (certain
problems which cannot be solved by a 1-layer net,
and no learning mechanism for 2 layer net).
But resurgence of interest in neural computation.
- result of new neural network architectures, new
learning algorithms. Ie Backpropagation and 2 layer
nets.
Rumelhart and McClelland and PDP Research group
(1986) 2 books on Parallel Distributed Processing.
Presented variety of NN models – including Past-tense
model (see below).
Huge impact of these volumes partly because they
contain cognitive models, i.e. models of some aspect
of human cognition.
Cognition: thinking, understanding language, memory.
Human abilities that imply our ability to represent the
world.
Best contrasted to behaviour-based approach.
Example applications of Neural Nets
NETtalk; Sejnowski and Rosenberg, 1987: network that
learns to pronounce English text.
Takes text, maps text onto speech phonemes, and then
produces sounds using electronic speech generator.
It is difficult to specify rules to govern translation of text
into speech – many exceptions and complicated
interactions between rules.
For example, ‘x’ pronounced as in ‘box’ or ‘axe’, but
exceptions eg ‘xylophone’.
Connectionist approach: present words and their
pronunciations, and see if net can discover mapping
relationship between them.
203 input units, 80 hidden units, and 26 output units,
corresponding to phonemes (basic sound in
language).
A window seven letters wide is moved over the text, and
net learns to pronounce the middle letter.
Each character is 29 input units: one for each of the 26
letters, and one each for blanks, periods and other
punctuation. (7 x 29 inputs = 203)
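This windowed encoding can be sketched as follows. The ordering of the three non-letter units (blank, then period, then other punctuation) is an assumption for illustration, not taken from the NETtalk paper:

```python
# 29 units per character: 26 letters plus (by assumption here) one unit
# each for blank, period, and other punctuation. 7 characters -> 203 inputs.
UNITS_PER_CHAR = 29

def encode_char(c):
    units = [0] * UNITS_PER_CHAR
    if c.isalpha():
        units[ord(c.lower()) - ord('a')] = 1
    elif c == ' ':
        units[26] = 1
    elif c == '.':
        units[27] = 1
    else:
        units[28] = 1   # other punctuation
    return units

def encode_window(window):
    """One-hot encode a seven-character window of text."""
    assert len(window) == 7
    bits = []
    for c in window:
        bits.extend(encode_char(c))
    return bits

bits = encode_window("shoebox")
print(len(bits))            # 203
print(bits[:29].index(1))   # 18: 's' is the 19th letter (0-indexed 18)
```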
Trained on 1024 word text, after 50 passes NETtalk
learns to perform at 95% accuracy on training set.
Able to generalise to unseen words at level of 78%.
Note here: training vs. test sets
So, for the string ‘SHOEBOX’:
The first of the seven inputs is
 00000000000000000010000000
 because S is the 19th letter (only the 26 letter units are shown),
 and the output will be the zero phoneme
because the E (the middle letter of the window) is silent in SHOEBOX,
i.e. if the zero phoneme is placed first:
 10000000000000000000000000
Particularly influential example: a tape-recorded NETtalk
starting out with poor babbling speech and gradually
improving until the output is intelligible. Sounds like a
child learning to speak. Passed the Breakfast TV test of
real AI.
Cotterell et al 1987: image compression.
Gorman and Sejnowski 1988: classification of sonar
signals (mines versus rocks)
Tesauro and Sejnowski, 1989: playing backgammon.
Le Cun et al, 1989: recognising handwritten
postcodes
Pomerleau, 1989: navigation of car on winding road
Summary: What are Neural Nets?
Important characteristics:
 Large number of very simple neuronlike processing
elements.
 Large number of weighted connections between
these elements.
 Highly parallel.
 Graceful degradation and fault tolerant
Key concepts
Multi-layer perceptron.
Backpropagation, and supervised learning.
Generalisation: nets trained on one set of data, and
then tested on a previously unseen set of data.
Percentage of previously unseen set they get right
shows their ability to generalise.
What does ‘brain-style computing’ mean?
Rough resemblance between units and weights in
Artificial Neural Network (or ANNs) and neurons in
brain and connections between them.
 Individual units in a net are like real neurons.
 Learning in brain similar to modifying connection
strengths.
 Nets and neurons operate in a parallel fashion.
 ANNs store information in a distributed manner as do
brains.
 ANNs and brain degrade gracefully.
 BUT these structures still model logic gates as well
and are not a different kind of non-von Neumann
machine
BUT
Artificial Neural Net account is simplified. Several
aspects of ANNs don’t occur in real brains. Similarly
brain contains many different kinds of neurons,
different cells in different regions.
e.g. not clear that backpropagation has any biological
plausibility. Training with backpropagation needs
enormous numbers of cycles.
Often what is modelled is not the kinds of process that
are likely to occur at neuron level.
For example, if modelling our knowledge of kinship
relationships, unlikely that we have individual
neurons corresponding to ‘Aunt’ etc.
Edelman, 1987 suggests that it may take units ‘in the
order of several thousand neurons to encode
stimulus categories of significance to animals’.
Better to talk of Neurally inspired or Brain-style
computation.
Remember too that (as with Aunt) even the best
systems have nodes pre-coded with artificial notions
like the phonemes (corresponding to the phonetic
alphabet). These cannot be precoded in the brain (as
they are in Sejnowski’s NETtalk) but must
themselves be learned.
Getting closer to real intelligence?
Idea that intelligence is adaptive behaviour.
I.e. an organism that can learn about its environment is
intelligent.
Can contrast this with approach that assumes that
something like playing chess is an example of
intelligent behaviour.
Connectionism still in its infancy:
- still not impressive compared to ants, earthworms or
cockroaches.
But arguably still closer to computation that does occur
in brain than is the case in standard symbolic AI.
Though remember McCarthy’s definition of AI as
common-sense reasoning (esp. of a prelinguistic
child).
And might still be a better approach than the symbolic
one.
Like analogy of climbing a tree to reach the moon – may
be able to perform certain tasks in symbolic AI, but
may never be able to achieve real intelligence.
Ditto with connectionism/ANNs ---both sides use this
argument.
Past-tense learning model
references:
Chapter 18: On learning the past tenses of English
verbs. In McClelland, J.L., Rumelhart, D.E. and the
PDP Research Group (1986) Parallel Distributed
Processing: Explorations in the Microstructure of
Cognition, vol 2: Psychological and Biological
Models, Cambridge, MA: MIT Press/Bradford Books.
Chapter 6: Two simulations of higher cognitive
processes. Bechtel, W. and Abrahamsen, A. (1991)
Connectionism and the mind: An introduction to
parallel processing in networks. Basil Blackwell.
Past-tense model
A model of human ability to learn past-tenses of verbs.
Presented by Rumelhart and McClelland (1986) in their
‘PDP volumes’:
Main impact of these volumes: introduced and
popularised the ideas of Multi-layer Perceptron,
trained by means of Backpropagation
Children learning to speak:
Baby: DaDa
Toddler: Daddy
Very young child: Daddy home!!!!
Slightly older child: Daddy came home!
Older child: Daddy comed home!
Even older child: Daddy came home!
Stages of acquisition in children:
Stage 1: past tense of a few specific verbs, some regular
e.g. looked, needed
Most irregular: came, got, went, took, gave
As if learned by rote (memorised).
Stage 2: evidence of a general rule for the past-tense,
i.e. add ed to the stem of the verb.
And often overgeneralise irregulars,
e.g. camed or comed instead of came.
Also (Berko, 1958) can generate the past tense for an
invented word, e.g. if they use rick to describe an
action, will tend to say ricked when using the word in
the past-tense.
Stage 3: produce correct forms for both regular and irregular
verbs.
Table: Characteristics of 3 stages of past-tense acquisition

Verb Type    Stage 1   Stage 2      Stage 3
Early verbs  Correct   Regularised  Correct
Regular      -         Correct      Correct
Irregular    -         Regularised  Correct
Novel        -         Regularised  Regularised
U-shaped curve – correct past-tense form used for
verbs in Stage 1, errors in Stage 2 (overgeneralising
rule), few errors in Stage 3.
Suggests Stage 2 children have acquired rule, and
Stage 3 children have acquired exceptions to rule.
Aim of Rumelhart and McClelland: to show that
connectionist network could show many of same
learning phenomena as children.
- same stages and same error patterns.
Overview of past-tense NN model
Not a full-blown language processor that learns past-tenses
from full sentences heard in everyday experience.
Simplified: model presented with pairs, corresponding to
root form of word, and phonological structure of
correct past-tense version of that word.
Can test model by presenting root form of word, and
looking at past-tense form it generates.
More detailed account
Input and Output Representation
To capture order information used Wickelfeatures
method of encoding words.
460 inputs:
Wickelphones: represent target phoneme and
immediate context.
e.g. came - #Ka, kAm, aM#
These are coarse-coded onto Wickelfeatures, where 16
wickelfeatures correspond to each wickelphone.
Input and output of net consist of 460 units.
Inputs are ‘standard present’ forms of verbs, outputs are
corresponding past forms, regular or irregular, and all
are in the special ‘wikel’ format.
This is a good example of the need to find a good way
of representing the input: can’t just present words to a
net; have to find a way of encoding those words so
they can be presented as a set of inputs.
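The Wickelphone encoding of ‘came’ can be reproduced with a small helper. The convention of writing the central phoneme in upper case and its context in lower case follows the example above:

```python
def wickelphones(phonemes, boundary="#"):
    """Trigrams over a phoneme string padded with boundary markers;
    the central phoneme is upper case, its two neighbours lower case."""
    padded = [boundary] + list(phonemes) + [boundary]
    out = []
    for i in range(1, len(padded) - 1):
        left, mid, right = padded[i - 1], padded[i], padded[i + 1]
        out.append(left.lower() + mid.upper() + right.lower())
    return out

# 'came' as the phoneme string /kAm/:
print(wickelphones("kam"))  # ['#Ka', 'kAm', 'aM#']
```

Each of these Wickelphones would then be coarse-coded onto 16 Wickelfeatures to form the actual 460-unit input.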
Assessing output: compare the pattern of output
Wickelphone activations to the pattern that the
correct response would have generated.
Hits: a 1 in output when a 1 in target and a 0 in output
when a 0 in target.
False alarms: 1s in the output not in the target.
Misses: 0s in output, not in target.
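Counting hits, false alarms and misses over a pair of binary vectors is straightforward; the toy vectors below are purely illustrative:

```python
def score(output, target):
    """Compare binary output and target vectors unit by unit:
    hits where they agree, false alarms for spurious 1s,
    misses for absent 1s."""
    hits = sum(1 for o, t in zip(output, target) if o == t)
    false_alarms = sum(1 for o, t in zip(output, target) if o == 1 and t == 0)
    misses = sum(1 for o, t in zip(output, target) if o == 0 and t == 1)
    return hits, false_alarms, misses

print(score([1, 0, 1, 1], [1, 1, 1, 0]))  # (2, 1, 1)
```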
Training and Testing
Verb is input, and propagated across weighted
connections – will activate wickelfeatures in output
that correspond to past-tense of verb.
Used perceptron-convergence procedure to train net.
(NB not multi-layer perceptron: no hidden layer, and not
trained with backpropagation. Problem must be
linearly separable).
Target tells output unit what value it should have. When
actual output matches target, no weights adjusted.
When computed output is 0, and target is 1, need to
increase the probability that unit will be active the
next time that pattern presented. All weights from all
active input units increased by small amount eta. Also
threshold reduced by eta.
When computed output is 1 and target is 0, we want to
reduce the likelihood of this happening. All weights
from active units are reduced by eta, and threshold
increased by eta.
The perceptron convergence procedure will find a set of
weights that allows the model to get each output unit
correct, provided such a set of weights exists.
Before training: divided 560 verbs into high frequency
(regular and irregular), medium (regular and irregular)
and low frequency (regular and irregular).

1. Train on 10 high frequency verbs (8 irregular):
Live – lived
Look – looked
Come – came
Get – got
Give – gave
Make – made
Take – took
Go – went
Have – had
Feel – felt

2. After 10 epochs, 410 medium frequency verbs
added (76 irregular), then 190 more epochs
(training cycles).
Net showed a dip in performance on irregular verbs,
which is like Stage 2 in children.
And when the net made errors, these errors were like
children‘s, i.e. adding ‘ed’,
e.g. for come – comed.

3. Tested on 86 low frequency verbs it had not been
trained on.
Got 92% of regular verbs right, 84% of irregular.
Results
With simple network, and no explicit encoding of rules it
could simulate important characteristics of human
children learning English past-tense. Same U-shaped
curve produced for irregular words.
Main point: past-tense forms can be described using a
few general rules, but can be accounted for by
connectionist net which has no explicit rules.
Both regular and irregular words handled by the same
mechanism.
Objectives:
To show that past-tense formation could be carried
out by a net, rather than by a rule system.
To capture the U-shaped function.
Rule-system
Linguists: stress importance of rules in describing
human behaviour.
We know the rules of language, in that we are able to
speak grammatically, or even to make judgements of
whether a sentence is or is not grammatical.
But this does not mean we know the rules in the way we
know the rule ‘i before e except after c’: we may not be
able to state them explicitly.
But has been held (e.g. Pinker, 1984 following
Chomsky), that our knowledge of language is stored
explicitly as rules. Only we cannot describe them
verbally because they are written in a special code
only the language processing system can
understand:
Explicit inaccessible rule view
Alternative view: no explicit inaccessible rules. Our
performance is characterisable by rules, but they are
emergent from the system, and are not explicitly
represented anyway.
e.g. honeycomb: structure could be described by a rule,
but this rule is not explicitly coded. Regular structure
of honeycomb arises from interaction of forces that
wax balls exert on each other when compressed.
Parallel distributed processing view: no explicit (albeit
inaccessible) rules.
Advantages of using NNs to model aspects of human
behaviour.
Neurally plausible, or at least ‘brain-style computing’.
 Learned: not explicitly programmed.
 No explicit rules; permits new explanation of
phenomenon.
 Model both produces the behaviour and fits the data:
errors emerge naturally from the operation of the
model.
Contrast to symbolic models in all 4 respects (above)

Rumelhart and McClelland:
‘…lawful behaviour and judgements may be produced by
a mechanism in which there is no explicit
representation of the rule. Instead, we suggest that
the mechanisms that process language and make
judgements of grammaticality are constructed in such
a way that their performance is characterizable by
rules, but that the rules themselves are not written in
explicit form anywhere in the mechanism.’
Important counter-argument to linguists, who tend to
think that people were applying syntactic rules.
Point: can have syntactic rules that describe language,
but that doesn’t mean that when we speak
syntactically (as if we were following those rules) that
we literally are following rules.
Many philosophers have made a similar point against
the reality of explicit rules --e.g. Wittgenstein.
The ANN approach provides a computational model of
how that might be possible in practice----to have the
same behavioural effect as rules but without there
being any anywhere in the system.
On the other hand, the standard model of science is of
possible rule systems describing the same
phenomenon--that also allows that real rules (in a
brain) could be quite different from the ones we
invent to describe a phenomenon.
Some computer scientists (e.g Charniak) refuse to
accept incomprehensible explanations.
Specific criticisms of the past tense model:
Criticism 1
Performance of model depends on use of the Wickelfeature
representation: and this is an adaptation of standard
linguistic featural analysis, i.e. it relies on a symbolic
input representation (cf. phonemes in NETtalk).
I.e. what’s the contribution of the architecture?
Criticism 2
Pinker and Prince (1988): role of input and U-shaped
curve.
Model’s entry to Stage 2 due to addition of 410 medium
frequency verbs.
This change is more abrupt than is the case with
children--there may be no relation between this
method of partitioning the training data and what
happens to children.
But later research (Plunkett and Marchman 1989) showed
that U-shaped curves can be achieved without abrupt
changes in input. Trained on all examples together
(using a backpropagation net).
Presented more irregular verbs, but still found
regularization, and other Stage 2 phenomena for
certain verbs.
Criticism 3
Nets are not simply exposed to data, so that we can
then examine what they learn.
They are programmed in a sense: decisions have to be
made about several things including
 Training algorithm to be used
 Number of hidden units
 How to represent the task in question
 Input and output representation
 Training examples, and manner of presentation
Criticism 4
At some point after or during learning this kind of thing,
humans become able to articulate the rule.
Eg regular past tenses end in –ed.
Also can control and alter these rules – eg could
pretend to be a younger child and say ‘runned’ even
though she knows it is incorrect (cf. some use
learned and some learnt, lit is UK and lighted US)
Hard to see how such kind of behaviour would emerge
from a set of interconnected neurons.
Conclusions
Although the Past-tense model can be criticised, it is
best to evaluate it in the context of the time (1986)
when it was first presented.
At the time, it provided a tangible demonstration that
 Possible to use neural net to model an aspect of
human learning
 Possible to capture apparently rule-governed
behaviour in a neural net