Stage 2 - Sheffield Department of Computer Science


COM1070: Introduction to
Artificial Intelligence: week 9
Yorick Wilks
Computer Science Department
University of Sheffield
www.dcs.shef.ac.uk/~yorick
Summary: What are Neural Nets?
Important characteristics:
- Large number of very simple neuron-like processing elements.
- Large number of weighted connections between these elements.
- Highly parallel.
- Graceful degradation and fault tolerance.
Key concepts:
Multi-layer perceptron.
Backpropagation, and supervised learning.
Generalisation: nets trained on one set of data, then tested on a previously unseen set of data. The percentage of the previously unseen set they get right shows their ability to generalise.
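The generalisation measure described above can be sketched in a few lines; the toy "net", data, and names below are illustrative, not from the lecture:

```python
# A minimal sketch of measuring generalisation: a trained predictor is
# scored on examples it has never seen, and the fraction it gets right
# is its generalisation performance.

def accuracy(predict, test_set):
    """Fraction of previously unseen examples the net gets right."""
    correct = sum(1 for x, target in test_set if predict(x) == target)
    return correct / len(test_set)

# Toy stand-in for a trained net: always predicts the majority class
# it saw during training.
train = [(1, 'A'), (2, 'A'), (3, 'B')]
targets = [t for _, t in train]
majority = max(set(targets), key=targets.count)
predict = lambda x: majority

test = [(4, 'A'), (5, 'A'), (6, 'B'), (7, 'A')]
print(accuracy(predict, test))  # 0.75
```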
What does ‘brain-style computing’ mean?
Rough resemblance between units and weights in an Artificial Neural Network (ANN) and neurons in the brain and the connections between them:
- Individual units in a net are like real neurons.
- Learning in the brain is similar to modifying connection strengths.
- Nets and neurons operate in a parallel fashion.
- ANNs store information in a distributed manner, as do brains.
- ANNs and brains degrade gracefully.
- BUT these structures can still model logic gates, so they are not a wholly different kind of non-von Neumann machine.
BUT
The Artificial Neural Net account is simplified. Several aspects of ANNs don’t occur in real brains. Similarly, the brain contains many different kinds of neurons, with different cells in different regions.
e.g. it is not clear that backpropagation has any biological plausibility. Training with backpropagation needs enormous numbers of cycles.
Often what is modelled is not the kind of process that is likely to occur at neuron level.
For example, if modelling our knowledge of kinship relationships, it is unlikely that we have individual neurons corresponding to ‘Aunt’ etc.
Edelman (1987) suggests that it may take units ‘in the order of several thousand neurons to encode stimulus categories of significance to animals’.
Better to talk of neurally inspired or brain-style computation.
Remember too that (as with ‘Aunt’) even the best systems have nodes pre-coded with artificial notions like the phonemes (corresponding to the phonetic alphabet). These cannot be precoded in the brain (as they are in Sejnowski’s NETtalk) but must themselves be learned.
Getting closer to real intelligence?
Idea that intelligence is adaptive behaviour, i.e. an organism that can learn about its environment is intelligent.
Can contrast this with approach that assumes that
something like playing chess is an example of
intelligent behaviour.
Connectionism still in its infancy:
- still not impressive compared to ants, earthworms or
cockroaches.
But arguably still closer to computation that does occur
in brain than is the case in standard symbolic AI.
Though remember McCarthy’s definition of AI as
common-sense reasoning (esp. of a prelinguistic
child).
And might still be a better approach than the symbolic
one.
Like analogy of climbing a tree to reach the moon – may
be able to perform certain tasks in symbolic AI, but
may never be able to achieve real intelligence.
Ditto with connectionism/ANNs ---both sides use this
argument.
Past-tense learning model
References:
McClelland, J.L., Rumelhart, D.E. and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 2: Psychological and Biological Models. Cambridge, MA: MIT Press/Bradford Books. Chapter 18: On learning the past tenses of English verbs.
Bechtel, W. and Abrahamsen, A. (1991) Connectionism and the Mind: An Introduction to Parallel Processing in Networks. Basil Blackwell. Chapter 6: Two simulations of higher cognitive processes.
Past-tense model
A model of human ability to learn past-tenses of verbs.
Presented by Rumelhart and McClelland (1986) in their
‘PDP volumes’:
Main impact of these volumes: introduced and
popularised the ideas of Multi-layer Perceptron,
trained by means of Backpropagation
Children learning to speak:
Baby: DaDa
Toddler: Daddy
Very young child: Daddy home!!!!
Slightly older child: Daddy came home!
Older child: Daddy comed home!
Even older child: Daddy came home!
Stages of acquisition in children:
Stage 1: past tense of a few specific verbs, some regular, e.g. looked, needed. Most irregular: came, got, went, took, gave. As if learned by rote (memorised).
Stage 2: evidence of a general rule for the past tense, i.e. add -ed to the stem of the verb. Often overgeneralise irregulars, e.g. camed or comed instead of came. Also (Berko, 1958) can generate the past tense for an invented word: e.g. if they use rick to describe an action, they will tend to say ricked when using the word in the past tense.
Stage 3: produce correct forms for both regular and irregular verbs.
Table: Characteristics of 3 stages of past-tense acquisition

Verb type      Stage 1    Stage 2       Stage 3
Early verbs    Correct    Regularised   Correct
Regular                   Correct       Correct
Irregular                 Regularised   Correct
Novel                     Regularised   Regularised
U-shaped curve – correct past-tense form used for
verbs in Stage 1, errors in Stage 2 (overgeneralising
rule), few errors in Stage 3.
Suggests Stage 2 children have acquired rule, and
Stage 3 children have acquired exceptions to rule.
Aim of Rumelhart and McClelland: to show that
connectionist network could show many of same
learning phenomena as children.
- same stages and same error patterns.
Overview of past-tense NN model
Not a full-blown language processor that learns past tenses from full sentences heard in everyday experience.
Simplified: model presented with pairs, corresponding to
root form of word, and phonological structure of
correct past-tense version of that word.
Can test model by presenting root form of word, and
looking at past-tense form it generates.
More detailed account
Input and Output Representation
To capture order information, the Wickelfeature method of encoding words was used.
460 inputs:
Wickelphones: represent target phoneme and
immediate context.
e.g. came - #Ka, kAm, aM#
These are coarse-coded onto Wickelfeatures, where 16
wickelfeatures correspond to each wickelphone.
Input and output of net consist of 460 units.
Inputs are ‘standard present’ forms of verbs; outputs are the corresponding past forms, regular or irregular; and all are in the special Wickelfeature format.
This is a good example of the need to find a good way of representing the input: you can’t just present words to a net; you have to find a way of encoding those words so they can be presented as a set of inputs.
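The Wickelphone idea above (each phoneme encoded with its immediate left and right context, with ‘#’ marking word boundaries) can be sketched as follows. The phoneme string is illustrative, and this sketch does not reproduce the further coarse-coding onto Wickelfeatures or the exact phonetic notation Rumelhart and McClelland used:

```python
# Sketch of Wickelphone extraction: each phoneme of a word is represented
# as a (left-context, phoneme, right-context) triple, with '#' as the
# word-boundary marker.

def wickelphones(phonemes):
    """Return the context triples for a phoneme string."""
    padded = '#' + phonemes + '#'
    return [padded[i - 1] + padded[i] + padded[i + 1]
            for i in range(1, len(padded) - 1)]

# 'came' as the phoneme string /kAm/:
print(wickelphones('kAm'))  # ['#kA', 'kAm', 'Am#']
```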
Assessing output: compare the pattern of output
Wickelphone activations to the pattern that the
correct response would have generated.
Hits: a 1 in output when a 1 in target and a 0 in output
when a 0 in target.
False alarms: 1s in the output not in the target.
Misses: 0s in output, not in target.
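The scoring scheme just described can be sketched directly; the variable names and toy patterns are mine, not from the model:

```python
# Compare a binary output pattern to its target pattern, counting:
#   hits         - output matches target (1 where target is 1, 0 where 0)
#   false alarms - 1s in the output where the target has 0
#   misses       - 0s in the output where the target has 1

def score(output, target):
    hits = sum(1 for o, t in zip(output, target) if o == t)
    false_alarms = sum(1 for o, t in zip(output, target) if o == 1 and t == 0)
    misses = sum(1 for o, t in zip(output, target) if o == 0 and t == 1)
    return hits, false_alarms, misses

out    = [1, 0, 1, 1, 0]
target = [1, 0, 0, 1, 1]
print(score(out, target))  # (3, 1, 1)
```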
Training and Testing
Verb is input, and propagated across weighted
connections – will activate wickelfeatures in output
that correspond to past-tense of verb.
Used perceptron-convergence procedure to train net.
(NB not multi-layer perceptron: no hidden layer, and not
trained with backpropagation. Problem must be
linearly separable).
Target tells output unit what value it should have. When
actual output matches target, no weights adjusted.
When the computed output is 0 and the target is 1, we need to increase the probability that the unit will be active the next time that pattern is presented. All weights from all active input units are increased by a small amount eta. Also the threshold is reduced by eta.
When computed output is 1 and target is 0, we want to
reduce the likelihood of this happening. All weights
from active units are reduced by eta, and threshold
increased by eta.
The perceptron convergence procedure will find a set of weights that allows the model to get each output unit correct, provided such a set of weights exists.
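The update rule described above can be sketched for a single output unit. The toy OR data, learning rate, and names are illustrative, not from the model (which had 460 input and 460 output units):

```python
# Minimal sketch of the perceptron convergence procedure: a single
# threshold unit whose weights are nudged by a small amount eta whenever
# its output disagrees with the target.

def train(patterns, eta=0.1, epochs=50):
    n = len(patterns[0][0])
    w = [0.0] * n
    theta = 0.0                                   # threshold
    for _ in range(epochs):
        for x, target in patterns:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
            if out == 0 and target == 1:          # too inactive: raise weights
                w = [wi + eta * xi for wi, xi in zip(w, x)]
                theta -= eta                      # and lower the threshold
            elif out == 1 and target == 0:        # too active: lower weights
                w = [wi - eta * xi for wi, xi in zip(w, x)]
                theta += eta                      # and raise the threshold
    return w, theta

# A linearly separable toy problem: logical OR.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, theta = train(data)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
       for x, _ in data])  # [0, 1, 1, 1]
```

Note that, as the lecture says, this rule only converges when the problem is linearly separable; XOR, for example, would never be learned.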
Before training: divided 560 verbs into high frequency (regular and irregular), medium frequency (regular and irregular) and low frequency (regular and irregular).
1. Train on 10 high-frequency verbs (8 irregular):
Live – lived
Look – looked
Come – came
Get – got
Give – gave
Make – made
Take – took
Go – went
Have – had
Feel – felt
2. After 10 epochs, 410 medium-frequency verbs added (76 irregular); 190 more epochs (training cycles). Net showed a dip in performance on irregular verbs, which is like Stage 2 in children. And when the net made errors, these errors were like children‘s, i.e. adding ‘ed’: e.g. for come – comed.
3. Tested on 86 low-frequency verbs it had not been trained on. Got 92% right for regular verbs, 84% for irregular.
Results
With a simple network and no explicit encoding of rules, it could simulate important characteristics of human children learning the English past tense. The same U-shaped curve was produced for irregular words.
Main point: past-tense forms can be described using a
few general rules, but can be accounted for by
connectionist net which has no explicit rules.
Both regular and irregular words handled by the same
mechanism.
Objectives:
To show that past-tense formation could be carried out by a net, rather than by a rule system.
To capture the U-shaped function.
Rule-system
Linguists: stress importance of rules in describing
human behaviour.
We know the rules of language, in that we are able to
speak grammatically, or even to make judgements of
whether a sentence is or is not grammatical.
But this does not mean we know the rules in the way we know the rule ‘i before e except after c’: we may not be able to state them explicitly.
But has been held (e.g. Pinker, 1984 following
Chomsky), that our knowledge of language is stored
explicitly as rules. Only we cannot describe them
verbally because they are written in a special code
only the language processing system can
understand:
Explicit inaccessible rule view
Alternative view: no explicit inaccessible rules. Our performance is characterisable by rules, but they are emergent from the system, and are not explicitly represented anywhere.
e.g. honeycomb: structure could be described by a rule,
but this rule is not explicitly coded. Regular structure
of honeycomb arises from interaction of forces that
wax balls exert on each other when compressed.
Parallel distributed processing view: no explicit (albeit
inaccessible) rules.
Advantages of using NNs to model aspects of human behaviour:
- Neurally plausible, or at least ‘brain-style computing’.
- Learned: not explicitly programmed.
- No explicit rules; permits a new explanation of the phenomenon.
- Model both produces the behaviour and fits the data: errors emerge naturally from the operation of the model.
Contrast to symbolic models in all 4 respects (above).

Rumelhart and McClelland:
‘…lawful behaviour and judgements may be produced by a mechanism in which there is no explicit representation of the rule. Instead, we suggest that the mechanisms that process language and make judgements of grammaticality are constructed in such a way that their performance is characterizable by rules, but that the rules themselves are not written in explicit form anywhere in the mechanism.’
Important counter-argument to linguists, who tend to think that people are applying syntactic rules.
Point: can have syntactic rules that describe language,
but that doesn’t mean that when we speak
syntactically (as if we were following those rules) that
we literally are following rules.
Many philosophers have made a similar point against
the reality of explicit rules --e.g. Wittgenstein.
The ANN approach provides a computational model of how that might be possible in practice: to have the same behavioural effect as rules without there being any rules anywhere in the system.
On the other hand, the standard model of science is of many possible rule systems describing the same phenomenon; that also allows that the real rules (in a brain) could be quite different from the ones we invent to describe a phenomenon.
Some computer scientists (e.g. Charniak) refuse to accept incomprehensible explanations.
Specific criticisms of the model:
Criticism 1
Performance of the model depends on use of the Wickelfeature representation: and this is an adaptation of standard linguistic featural analysis, i.e. it relies on a symbolic input representation (cf. phonemes in NETtalk).
I.e. what’s the contribution of the architecture?
Criticism 2
Pinker and Prince (1988): role of input and U-shaped
curve.
Model’s entry to Stage 2 due to addition of 410 medium
frequency verbs.
This change is more abrupt than is the case with
children--there may be no relation between this
method of partitioning the training data and what
happens to children.
But later research (Plunkett and Marchman, 1989) showed that U-shaped curves can be achieved without abrupt changes in input. Trained on all examples together (using a backpropagation net).
Presented more irregular verbs, but still found
regularization, and other Stage 2 phenomena for
certain verbs.
Criticism 3
Nets are not simply exposed to data, so that we can then examine what they learn.
They are programmed in a sense: decisions have to be made about several things, including:
- Training algorithm to be used
- Number of hidden units
- How to represent the task in question
- Input and output representation
- Training examples, and manner of presentation
Criticism 4
At some point after or during learning this kind of thing, humans become able to articulate the rule, e.g. regular past tenses end in -ed.
Also can control and alter these rules: e.g. a child could pretend to be a younger child and say ‘runned’ even though she knows it is incorrect (cf. some use learned and some learnt; lit is UK and lighted US usage).
Hard to see how such kind of behaviour would emerge
from a set of interconnected neurons.
Conclusions
Although the Past-tense model can be criticised, it is
best to evaluate it in the context of the time (1986)
when it was first presented.
At the time, it provided a tangible demonstration that
 Possible to use neural net to model an aspect of
human learning
 Possible to capture apparently rule-governed
behaviour in a neural net
Contrasting Neural Computing with Symbolic Artificial Intelligence
- Overview of main differences between them.
- Relationship to the brain:
(a) Similarities between Neural Computing and the brain
(b) Differences between brain and Symbolic AI – evidence that the brain does not have a von Neumann architecture.
- Ability to provide an account of thought and cognition:
(a) Argument by symbolicists that only a symbol system can provide an account of cognition
(b) Counter-argument that neural computing (subsymbolic) can also provide an account of cognition
(c) Hybrid account?
Main differences between Connectionism and Symbolic AI
Knowledge: knowledge represented by weights and
activations versus explicit propositions.
Rules: rule-like behaviour without explicit rules versus
explicit rules.
Learning: Connectionist nets trained versus
programmed. But there are now many machine
learning algorithms that are wholly symbolic----both
kinds only work in a specialised domain.
Examinability: can examine a symbolic program to ‘see how it works’. Less easy in the case of Neural Computing – problems with its black-box nature; a set of weights is opaque.
Relationship to the brain: Brain-style computing
versus manipulation of symbols. Different models of
human abilities.
Ability to provide an account of human thought: see
following discussion about need for symbol system to
account for thought.
Applicability to problems: Neural computing more
suited to pattern recognition problems, Symbolic
computing to systems characterisable by rules.
But for a different view, that stresses similarities
between GOFAI and NN approaches see Boden, M.
(1991) Horses of a different colour, In Ramsey, W.,
Stich, S.P. and D.E. Rumelhart, ‘Philosophy and
Connectionist Theory’, Lawrence Erlbaum
Associates: Hillsdale, New Jersey, pp 3-19, where
she points out some of the similarities. See also YW
in Foundations of AI book, on web course list.
Fashions: historical tendency to model brain on
fashionable technology.
mid 17th century: water clocks and hydraulic puppets
popular
Descartes developed hydraulic theory of brain
Early 18th century Leibniz likened brain to a factory.
Freud: relied on electromagnetics and hydraulics in
descriptions of mind.
Sherrington: likened nervous system to telegraph.
Brain also modelled as telephone switchboard.
Might use computer to model human brain; but is
human brain itself a computer?
Differences between Brains and von Neumann
machines
McCulloch and Pitts: simplified account of neurons as
On/Off switch.
In early days seemed that neurons were like flip-flops in
computers.
Flip-flop: can be thought of as tiny switches that can be
either off or on. But now clear that there are
differences:
- rate of firing of neuron important, as well as on/off
feature
- Neuron has enormous number of input and output
connections, compared to logic gates.
speed: neuron much slower. Takes thousandth of a
second to respond, whereas flip-flop can shift
position from 0 to 1 in thousand-millionth of a second.
I.e. brain takes a million times longer.
Thus if the brain were running an AI program, stepping through instructions, it would take at least a thousandth of a second for each instruction.
Brain can extract meaning from sentence, or recognise
visual pattern in about 1/10th second.
So, if this is being accomplished by stepping through
program, program can only be 100 instructions long.
But current AI programs contain 1000s of instructions!
Suggests brain operates in parallel, rather than as
sequential processor.
- Symbol manipulators: most (NOT ALL) are sequential, carrying out instructions in sequence.
- human memories: content-addressable. Access to
memory via its content.
E.g. can retrieve memory via description:
(e.g. could refer to the Turing Test either as ‘Turing Test’, or as ‘assessment of intelligence based on a Victorian parlour game’, and would still access the memory).
But memory in computer has unique address: cannot
get at memory without knowing its address (at the
bottom level that is!).
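The contrast above can be sketched with a toy store; the entries and descriptions are illustrative, not from the lecture:

```python
# Address-style vs content-style retrieval. A computer fetches a memory
# only by its exact key (address); content-addressable recall lets a
# partial description cue the whole memory.

memory = {
    "Turing Test": "assessment of intelligence based on a Victorian parlour game",
}

# Address-style access: the exact key must be known.
print(memory["Turing Test"])

# Content-style access: retrieve entries whose content matches a fragment.
def recall(cue):
    return [key for key, value in memory.items() if cue in value or cue in key]

print(recall("parlour game"))  # ['Turing Test']
```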
- memory distribution
In computer string of symbol tokens exists at specific
physical location in hardware.
But our memories do not seem to function like that.
E.g. Lashley and search for the engram
Trained rats to learn route through maze to food.
Destroyed different areas of brain. As long as only 10
percent destroyed, no loss of memory, regardless of
which area of brain destroyed.
Lashley (1950) ‘…There are no special cells reserved
for special memories… The same neurons which
retain memory traces of one experience must also
participate in countless other activities…’
and conversely a single memory must be stored in
many places across a brain----there was brief fashion
for ‘the brain as a hologram’ because of the way a
hologram stores information.