Neural Coding

Neural Coding
What Kind of Information is Represented in a Neural Network, and How?
Outline
• ANN Basics
• Local -vs- Distributed Representations
• The Emergence/Learning of Salient Features in Neural
Networks
• Feed-Forward Neural Networks Embody Mappings
• Linearly Separable Mappings
• Classification in Spaces that are NOT Linearly Separable
• Coding Heuristics
• Hopfield Networks
• Summary
NeuroPhysiology
[Diagram: a neuron, showing the nucleus, dendrites, axon, and synapses]
• Dense: Human brain has 10^11 neurons.
• Highly Interconnected: Human neurons have 10^4 fan-in.
• Neurons firing: send action potentials (APs) down the axons when
sufficiently stimulated by the SUM of incoming APs along the dendrites.
• Neurons can either stimulate or inhibit other neurons.
• Synapses vary in transmission efficiency.
Development: Formation of basic connection topology
Learning: Fine-tuning of topology + Major synaptic-efficiency changes.
The matrix IS the intelligence!
NeuroComputing
[Diagram: nodes connected by weighted arcs w_i, w_j, w_k]
• Nodes fire when sum (weighted inputs) > threshold.
– Other varieties common: unthresholded linear, sigmoidal, etc.
• Connection topologies vary widely across applications
• Weights vary in magnitude & sign (stimulate or inhibit)
• Learning = Finding proper topology & weights
– Search process in the space of possible topologies & weights
– Most ANN applications assume a fixed topology.
• The matrix IS the learning machine!
Tasks & Architectures
• Supervised Learning
– Feed-Forward networks [diagram: In -> Out]
• Concept Learning: Inputs = properties, Outputs = classification
• Controller Design: Inputs = sensor readings, Outputs = effector actions
• Prediction: Inputs = previous X values, Outputs = predicted future X value
– Learn proper weights via back-propagation
• Unsupervised Learning
– Pattern Recognition
• Hopfield Networks [diagram: In -> clique with excitatory & inhibitory arcs -> Out]
– Data Clustering
• Competitive Networks [diagram: In -> Maxnet (clique with only inhibitory arcs) -> Out]
Node Types
[Diagram: node j with inputs x1 ... xn arriving over weights w1 ... wn]
netj = Σ_{i=1..n} xi·wji,   xj = fT(netj)
• Most ANNs use nodes that sum the weighted inputs.
• But many types of transfer functions, fT, are used:
– Thresholded (Discontinuous)
• Step
• Ramp
– Non-thresholded (Continuous, Differentiable)
• Linear
• Sigmoid
Transfer Functions
[Plots: xj versus netj for the Step, Ramp, Linear, and Sigmoidal transfer functions]
• Step functions are useful in classifier nets, where data partitioning is important.
• Linear & Sigmoidal are everywhere differentiable, thus popular for backprop nets.
• Sigmoidal has most biological plausibility.
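As a minimal sketch (my own illustration, not code from the slides), here is how one such node and the four transfer functions above might be written in Python; the input values, weights, and saturation levels in the example are arbitrary:

```python
import math

def step(net, threshold=0.0):
    """Thresholded, discontinuous: fires +1 when net exceeds the threshold."""
    return 1.0 if net > threshold else -1.0

def ramp(net, lo=-1.0, hi=1.0):
    """Thresholded but piecewise linear between two saturation levels."""
    return max(lo, min(hi, net))

def linear(net):
    """Continuous and differentiable everywhere; output equals the net sum."""
    return net

def sigmoid(net):
    """Continuous, differentiable, bounded in (0, 1); popular in backprop nets."""
    return 1.0 / (1.0 + math.exp(-net))

def node_output(inputs, weights, f_T=sigmoid):
    """Compute x_j = f_T(net_j) for one node with the given incoming weights."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return f_T(net)

# Example: the same weighted sum passed through two different transfer functions.
print(node_output([1.0, -1.0, 0.5], [0.4, 0.2, 0.6], step))
print(node_output([1.0, -1.0, 0.5], [0.4, 0.2, 0.6], sigmoid))
```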
Learning = Weight Adjustment
[Diagram: node i (output xi) feeds node j (output xj) through weight wj,i; zj is the training signal used at node j]
• Generalized Hebbian Weight Adjustment:
– The sign of the weight change = the sign of the correlation
between xi and zj:
∆wji ∝ xi·zj
– zj is:
• xj                    Hopfield networks
• dj - xj               Perceptrons (dj = desired output)
• dj - Σi xi·wji        ADALINEs (dj = desired output)
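A minimal Python sketch of this rule (my own illustration, with an assumed learning rate eta) showing the three choices of zj listed above:

```python
# Generalized Hebbian update: delta_w_ji = eta * x_i * z_j.

def delta_w(x_i, z_j, eta=0.1):
    """Weight change has the sign of the correlation between x_i and z_j."""
    return eta * x_i * z_j

def z_hopfield(x_j):
    return x_j                       # Hopfield networks: z_j = x_j

def z_perceptron(d_j, x_j):
    return d_j - x_j                 # Perceptrons: error between desired and actual output

def z_adaline(d_j, xs, ws_j):
    net_j = sum(x * w for x, w in zip(xs, ws_j))
    return d_j - net_j               # ADALINEs: error measured on the raw net sum

# Example: one perceptron-style update for weight w_ji.
x_i, x_j, d_j = 1.0, -1.0, 1.0
print(delta_w(x_i, z_perceptron(d_j, x_j)))   # positive change: x_i and the error agree
```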
Cellular Automata
[Figure: CA grid at Step N and at Step N+1]
Update rule: If exactly 2 red neighbors, change to red; else change to green.
Distributed Representations: Picture Copying
• Update rule: If an odd number of neighbors are on, turn on, else turn off.
• In CA’s and ANNs, you need to learn to think differently about representation!
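A small Python sketch of the parity rule above (the 16x16 grid, the 4-cell von Neumann neighborhood, and the wrap-around edges are my assumptions, since the slide does not specify them). Iterating the rule produces shifted copies of the initial picture, which is what makes the representation distributed:

```python
def parity_step(grid):
    """One update: a cell turns on iff an odd number of its neighbors are on."""
    n, m = len(grid), len(grid[0])
    nxt = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            neighbors = (grid[(i - 1) % n][j] + grid[(i + 1) % n][j] +
                         grid[i][(j - 1) % m] + grid[i][(j + 1) % m])
            nxt[i][j] = neighbors % 2
    return nxt

# A tiny 3-pixel "picture" in the middle of the grid.
grid = [[0] * 16 for _ in range(16)]
grid[8][8] = grid[9][8] = grid[8][9] = 1

for _ in range(4):                    # after 4 steps: four shifted copies of the picture
    grid = parity_step(grid)
for row in grid:
    print(''.join('#' if cell else '.' for cell in row))
```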
Local -vs- Distributed Representations
• Assume examples/concepts have 3 features:
– Age : {Young, Middle, Old}
– Sex: {Male, Female}
– Marital Status: {Single, Samboer (cohabitant), Married}
[Figure: three small networks coding example concepts such as "Young, Single, Male", "Old, Female, Samboer", and "Young, Married, Female"]
Local: One neuron represents an entire conjunctive concept.
Semi-Local: Together they represent a conjunctive concept, and each neuron represents one or a few conjuncts, i.e. the concept is broken into clean pieces.
Distributed: Together they represent a conjunctive concept, but the individual conjuncts cannot necessarily be localized to single neurons.
Local -vs- Distributed (2)
• Size requirements to represent the whole set of 18 3-feature concepts assuming binary neurons (on/off)
– Local: 3x3x2 = 18
• Instance is EXACTLY 1 of 18 neurons being on.
– Semi-Local: 3+3+2 = 8 (Assume one feature value per neuron)
• Instance is EXACTLY 3 of the 8 neurons being on.
– Distributed: ceil(log2 18) = 5
• Instance is any combination of on/off neurons
• Add 1 bit and DOUBLE the representational capacity, so each
concept can be represented by 2 different codes (redundancy).
• The same neural network (artificial or real) may have different types of
coding in different regions of the network.
Semi-Local => Local
[Figure: semi-local feature nodes (Young, Old, Single, Married, Male, Female) feed one local node through weights such as +5, +3, +1, so that the local node fires for "Young, Married, Female"]
Representational Hierarchies
• In the brain, neurons involved in early processing are often semi-local,
while neurons occurring later along the processing path (i.e. higher-level
neurons) are often local.
• In simpler animals, there appears to be a lot of local coding. In
humans, it is still debatable.
[Figure: processing hierarchy from early semi-local detectors (Dark dot @ {3°,28°}, Line tilted 45° @ {3°,28°}) up to higher-level local detectors (Human Face, Grandma!!)]
Vector Coding
• An organism's sensory apparatus uses vector coding as a representation of its inputs.
• Semi-local coding, since the components of a conjunctive concept are localized to individual neurons.
• A particular color, flavor, sound, etc. = a vector of receptor states (not a single receptor state).
• Combinatorics: n^k possible vector states, where k = # receptors and n = # possible receptor states. Note: n > 2 in many cases.
• The fact that humans are much better at discriminating sensory inputs than actually describing them illustrates the relative density of sensory vector space -vs- the sparseness of language.
[Figure: tongue receptors coding the flavor ``Tyrkisk Pebel´´ as the vector (0.1, 0.8, 0.2, 0.9)]
Comparison of Coding Forms
• Compact Representation: Local (NO!), Distributed (YES!)
• Graceful Degradation (Code works when a few neurons are faulty):
Local (NO!), Distributed (YES! - due to redundancy).
• Binding Problem (How to represent two concepts that occur
simultaneously): Local (EASY! - two active nodes), Distributed
(HARD - but may be possible by quick shifts back and forth between
the 2 activation patterns)
E.g. “Where’s Waldo”: Easy to pick out a human face among a bunch
of round objects, or your mother’s face among a bunch of other faces,
thus indicating that we probably have relatively local codes for these
all-important concepts. But, it’s VERY HARD to find Waldo (i.e. a
generic-faced cartoon man with a red-and-white striped shirt) in a
crowd of several hundred generic cartoon characters wearing all sorts
of colors & patterns. Why? “Red-and-white stripes” is probably not
locally coded in the human brain and hence not quickly/effortlessly
detected. It probably shares neurons with concepts such as “stripe”
“red”, “white”, etc.
• In more complex animals, all 3 coding forms are probably present,
with local for the most salient concepts for that organism.
Species-Specific Saliency
• The key stimuli for an organism are often locally or semi-locally
encoded, with direct connections from the detector neuron(s) to a
motor (action-inducing) neuron.
[Figure: the hawk/goose silhouette]
The movement of this simple pattern resembles a hawk and scares small chickens.
The movement of the reverse pattern resembles a goose and elicits no response
from the chicks.
Fish Dinner
• Three-spined sticklebacks respond to these simple stimuli:
• But not these:
• Salient feature: Red belly!
Toad Turn-ons
[Plots: number of turns per minute and T5(2) firing rate versus length of stimulus, for Worm, Anti-Worm, and Square stimuli]
• The behavioral response (i.e. number of times that it turns around per
minute) of a toad as a function of the length of the stimulus is mirrored
by the firing rates of neurons in the T5(2) region of its brain.
Emergent Salience
• Animal bodies and brains have evolved to maximize the odds of
survival and reproduction (i.e., fitness). Both are tailored to the
survival task at hand.
• Hence salient features will emerge (via evolution and learning) as the
activating conditions for various neurons. When fired, those neurons
will then help to initiate the proper (motor) response to a salient input.
• Similarly, if an ANN is given a task and the ability to adapt (i.e. learn
and/or evolve), the salient features of that task will emerge as the
activating conditions for hidden-layer and output neurons.
• Salient features can then be read off the input weights to those neurons.
• So, the only features that need to be given to the ANN are the very
primitive ones at the input layer. The rest are discovered!
Face Recognition
• Animals differ as to their abilities to discriminate sounds, tastes, smells, colors, etc.
• Humans are very good at discriminating faces, at least faces of the type that they
grow up around.
• Hypothesized # dimensions in face-coding space = 20 (Churchland)
Face Space
[Figure: pg. 28]
Morphing
[Figure: pg. 34]
Choose evenly-spaced points along the vector that
connects the source & target faces
ANN for Face Recognition
• Garrison Cottrell et al. (1991)
• Feed-forward net with backprop learning
[Figure: pg. 40]
Training & Testing
• Training: 64 photos of 11 different faces + 13 non-face photos
• Performance Criteria: Classify each picture as to:
– face or non-face?
– male or female?
– Name?
• Results:
– Training Accuracy: 100%
– Test with same faces but new pictures: 98%
– Test with new faces: 100% (face or non-face?), 81% (male or female?)
– Test with known face but with 20% of picture erased:
• Vector completion: the firing patterns of middle-layer neurons are very similar
to the patterns seen when the non-erased image is presented. Hence, in its
understanding of the pictures, the ANN fills in the missing parts.
• Generally good performance, but erased foreheads caused problems (71%
recognition).
– Holons: Middle-layer nodes represent generic combi-faces instead of individual
features.
Combi-Faces (Holons) at Hidden Nodes
[Figure: hidden nodes A, B, C, each with incoming weights such as +2, +6, +1, +7, -3, +5 from different regions of the input image]
Incoming weights to a node indicate what it “prefers”:
• Likes eyes at positions shown
• Has slight preference for noses right below and between eyes.
• Prefers smiles over frowns
• “Turned on” by sexy movie-star cheek moles
[Figure: Node B's "dream face"; darker color => higher preference]
Similar methods can be used for interpreting the concepts represented by ANN nodes.
Facial Holons
[Figure: the 80 facial holons, pg. 48. Each input case satisfies a subset of the 80 holons, i.e. each input case is a combination of holons.]
• Preferred stimuli: By looking at the signs of the input weights to a hidden node, we can
construct a prototypical input vector that the node would fire on. E.g. if wji > 0, then xi
> 0 is desired, and if wji < 0, then xi < 0 is desired.
• Doing this for each of the 80 hidden nodes of the face net yields an interesting set of
hybrid faces as preferred stimuli.
• Enhanced robustness: since recognition of particular features is now spread over many
hidden nodes/holons, the network can still successfully recognize faces if a node or two
are inoperable.
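A minimal Python sketch (not Cottrell's code) of this sign-reading procedure for a single hidden node; the example weight values echo the figure above:

```python
# Read a hidden node's "preferred stimulus" off the signs of its incoming weights:
# if w_ji > 0 the node prefers x_i > 0, and if w_ji < 0 it prefers x_i < 0.

def preferred_stimulus(incoming_weights):
    """Return a prototypical input vector that maximally excites the node."""
    return [1.0 if w > 0 else (-1.0 if w < 0 else 0.0) for w in incoming_weights]

weights = [2.0, 6.0, 1.0, 7.0, -3.0, 5.0]
print(preferred_stimulus(weights))   # [1.0, 1.0, 1.0, 1.0, -1.0, 1.0]
```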
How Realistic is it?
• Anatomical:
– In the brain, 5 levels of synapses connect the retina to the (known) region of face
coding.
– But those 5 levels perform many other tasks too.
• Functional:
– ANNs trained with many more Asian than Caucasian faces were much better at
discriminating the former than the latter.
• ``They all look alike'' is a result of past experiences and their effects upon the observer's
neural development, not of any objective differences in homogeneity within the different
races.
– Similar ANNs were also trained to recognize emotional states in the faces.
• Results were promising (~80% accuracy on the test phase), but the acting ability of the
student subjects was very poor, so better results can be expected.
• Emotion recognition is a VERY important aspect of human social behavior.
Neural Nets as Mappings
• The main application of feed-forward ANNs is to learn a general function (mapping)
between a particular domain (D) and range (R) when given a set of examples: {(d, r): d
in D, r in R}.
• D and R may contain vectors or scalars.
[Figure: mapping F from Domain {d1, d2, d3, d4} to Range {r1, r2, r3}]
Example set = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}
Goal: Build an ANN that can take ANY element d of D on its input
layer and produce F(d) on its output layer.
Problem: The example set normally represents a very small fraction of
the complete mapping set (which may be infinite).
Sensorimotor Coordination: Mapping Sensations to Actions
[Figure: Senses -> input vector -> Brain -> output vector of desired muscle activation levels -> Muscles]
• Intelligent Physical Behavior: Performance of the proper motor movements in response
to the current sensory stimuli.
– "A large and well-defined brain is just evolution's latest and highest achievement in
sensorimotor coordination, not its earliest or only example…" (Churchland, pg. 95-6)
• Vector processing: Transformation of sensory input vectors into motor output vectors.
• Coordinated Behavior: Proper sequence of muscle activations = proper trajectory in
output-vector space.
[Figure: trajectories in output-vector space labeled Walk, Run, XC Ski]
ANN for Crab Control
• Simple feed-forward net that maps points in visual space to points in claw-angle space.
• 93% accurate.
• Simple, one-shot movement: assumes muscles snap into proper position.
[Figure: pg. 94]
Classification = Mapping
• M: Features => Classes
[Figure: feed-forward net with input features (Weight, Habitat, Max Speed, Coat type), a hidden layer, and output nodes (Bear, Sheep, Horse, Hibernate?)]
Classification
case    x    y    net (-x + y - 5)    class
1       1    3         -3             -1
2      -5    2          2             +1
3       1    3         -3             -1
4       1    9          3             +1
5      -2    4          1             +1
6      -7    2          4             +1
7       5    5         -5             -1
[Figure: a single perceptron with inputs X, Y and a constant input 1, weights Wx = -1, Wy = 1, Wz = -5, thresholded at 0]
The perceptron should compute the proper
class for each input x-y pair. For a single perceptron, this
is only possible when the input vectors are linearly separable.
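A short Python sketch of this perceptron, checked against the cases in the table above (the weights Wx = -1, Wy = 1 and the bias weight Wz = -5 on the constant input are those shown in the figure):

```python
def perceptron(x, y, wx=-1.0, wy=1.0, wz=-5.0):
    """Step-threshold perceptron: class +1 if the weighted sum exceeds 0, else -1."""
    net = wx * x + wy * y + wz * 1.0
    return 1 if net > 0 else -1

# The x-y cases from the table above, with their target classes.
cases = [(1, 3, -1), (-5, 2, +1), (1, 3, -1), (1, 9, +1),
         (-2, 4, +1), (-7, 2, +1), (5, 5, -1)]
for x, y, target in cases:
    print((x, y), perceptron(x, y), target)   # predicted class matches the target
```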
Simple Boolean Functions
True: +1
False: -1
[Figure: single perceptrons (inputs X, Y, plus a constant-1 bias input) computing or, and, ~and, ~or, and not. E.g. and uses weights .5, .5 and bias weight -.8; or uses weights .5, .5 and bias weight .3; the negated functions use weights -.5, -.5 with the opposite bias sign; not uses a single weight -.5 and threshold 0.]
Linear Separability of Booleans
AND: .5x + .5y - .8 > 0 <=> x + y > 1.6 <=> y > -x + 1.6
[Plot: in the (X,Y) square with corners at ±1, the line y = -x + 1.6 separates the single positive AND instance (1,1) from the three negative instances]
OR: .5x + .5y + .3 > 0 <=> x + y > -.6 <=> y > -x - .6
[Plot: the line y = -x - .6 separates the single negative OR instance (-1,-1) from the three positive instances]
XOR:
[Plot: the positive XOR instances (1,-1) and (-1,1) and the negative instances (1,1) and (-1,-1) cannot be separated by any single line]
*Not linearly separable => more than 1 perceptron is needed.
[Figure: a two-layer XOR network, with two "and"-style hidden nodes (weights .5 and -.5, bias weight -.8) feeding an "or" node (weights .5, .5, bias weight .3)]
All boolean functions can be represented by a feedforward ANN with 2 layers or less.
Proof: All boolean functions can be expressed as a
conjunction of disjunctions (CNF) =>
disjunctions = layer 1 & the conjunction = layer 2.
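A Python sketch of the two-layer XOR network, using one consistent reading of the figure's weights (True = +1, False = -1, step threshold at 0); the exact hidden-node weights are my interpretation of the garbled figure:

```python
def step(net):
    return 1 if net > 0 else -1

def xor_net(x, y):
    h1 = step(0.5 * x - 0.5 * y - 0.8)      # fires only for x = +1, y = -1
    h2 = step(-0.5 * x + 0.5 * y - 0.8)     # fires only for x = -1, y = +1
    return step(0.5 * h1 + 0.5 * h2 + 0.3)  # "or" of the two hidden nodes

for x in (-1, 1):
    for y in (-1, 1):
        print(x, y, xor_net(x, y))          # +1 exactly when x != y
```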
Linear Separability of Reals
y < x - 3 <=> y - x + 3 < 0 <=> x - y - 3 > 0
[Plot: points in the (X,Y) plane (roughly -10 to 10 on each axis), with + instances below the line y = x - 3 and - instances above it]
[Figure: a perceptron f(x,y) with inputs X, Y and a constant input 1, and weights 1, -1, -3]
This outputs a 1 for all positive instances, and a -1 for all negative instances.
*When one hyperplane separates all positive from negative examples, then a single perceptron can be the classifier.
Separable by N Hyperplanes
[Plot: the (X,Y) plane (roughly -10 to 10 on each axis) cut by lines L1, L2, L3, with the positive instances falling in three regions a, b, c and the negative instances elsewhere]
L1: y = x
L2: y = -x + 5
L3: y = - 4x + 30
Classification of positive instances:
C1: Above L1 & Below L2
OR
C2: Above L1 & Above L3
OR
C3: Below L1 & Above L2 & Below L3
ANN Component Nodes
[Figure: perceptron nodes 1a and 2a, each with inputs X, Y and a constant input 1, thresholded at 0. Node 1a has weights -1, 1, 0 and node 2a has weights 1, 1, -5.]
1a. Above L1:
y > x <=> y - x > 0
1b. Below L1:
y < x <=> x - y > 0
2a. Above L2:
y > -x+ 5 <=> y + x - 5 > 0
2b. Below L2:
y < -x + 5 <=> -x - y + 5 > 0
3a. Above L3
y > -4x + 30 <=> y + 4x - 30 > 0
3b. Below L3:
y < -4x + 30 <=> - 4x - y + 30 > 0
[Figure: the remaining component nodes, each thresholded at 0: 1b with weights 1, -1, 0; 2b with weights -1, -1, 5; 3a with weights 4, 1, -30; 3b with weights -4, -1, 30]
The Complete ANN
[Figure: the complete network. Inputs X and Y feed a "Hyperplanes" layer of nodes 1a, 1b, 2a, 2b, 3a, 3b (thresholds in parentheses: 0, 0, 5, -5, 30, -30), which feeds an "ANDs" layer of nodes C1 (threshold 1.5), C2 (1.5), and C3 (2.5), which feeds a single OR output node f(x,y) with threshold -1.5.]
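A Python sketch of the three-stage idea (hyperplane tests, then ANDs, then an OR). It uses 0/1 step units and thresholds chosen for convenience rather than the exact weights on the slide, but it classifies the same regions C1, C2, C3:

```python
def step(net, threshold):
    return 1 if net > threshold else 0

def classify(x, y):
    # Layer 1: one step unit per hyperplane test.
    above_L1 = step(y - x, 0)          # y > x
    above_L2 = step(y + x, 5)          # y > -x + 5
    above_L3 = step(y + 4 * x, 30)     # y > -4x + 30
    below_L1 = 1 - above_L1            # "below" as "not above"
    below_L2 = 1 - above_L2
    below_L3 = 1 - above_L3
    # Layer 2: AND units (fire only when all of their inputs are on).
    c1 = step(above_L1 + below_L2, 1.5)
    c2 = step(above_L1 + above_L3, 1.5)
    c3 = step(below_L1 + above_L2 + below_L3, 2.5)
    # Layer 3: OR unit.
    return step(c1 + c2 + c3, 0.5)

print(classify(0, 2))   # above L1 & below L2 -> region C1 -> 1
print(classify(8, 9))   # above L1 & above L3 -> region C2 -> 1
print(classify(2, 4))   # in none of C1, C2, C3 -> 0
```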
A Simpler ANN
Assume: "Below Li" is replaced by "Not Above Li".
[Figure: a simpler network using only the hyperplane nodes 1a (threshold 0), 2a (5), and 3a (30). The AND nodes C1 (threshold 1.5), C2 (1.5), and C3 (2.5) use weight -1 (marked *) on the arcs from the "above" nodes they need negated, and feed the OR output f(x,y) with threshold -1.5.]
Sigmoidals & Linear Separability
• Using a sigmoidal transfer function (which is non-linear) does not
drastically change the nature of linear-separability analysis.
• It just introduces a wider linear separator (a “gray area”) which only
creates problems when points lie within it.
• So, if a set of points are not linearly separable using linear threshold
transfer functions, then adding nonlinear sigmoidal transfer functions
will not help!
[Plot: the sigmoidal transfer function xj versus netj, and the same +/- data in the (X,Y) plane as before]
Linear: x - y - 3 > 0
Sigmoidal: S(x - y - 3)
Given an X, this S outputs higher values for lower values of Y.
Hidden Layer Design Decisions
• Number of Hidden Layers & Nodes
– Too few => Can’t partition data properly
– Too many => Partitions are too detailed => over-specialized for
the training set => Can’t generalize to handle new cases.
[Figure: Points in Space -> (step functions) -> Hyperplanes -> (ANDs) -> Convex Regions -> (ORs) -> Groups of Regions]
Input Encoding for Feed-Forward Networks
• Reals => scaled values in [0 1] or [-1 1]
• Colors => pixel intensities => scaled values in [0 1]
• Symbols => integers => scaled values in [0 1] or [-1 1]
– (small, medium, large) => (.2 .5 .8)
• Number of input nodes per input vector element:
– One node per element
– One node per discrete subrange of the element's possible values
[Figure: a single Age input node x, scaled to [-1 1], feeding node y through weight wyx]
No matter how we choose wyx,
node y is forced to treat old age
inversely to the way it treats youth.
In fact, it must treat all ages in a
linear fashion, since there is only 1
weight relating all ages to y.
Input Encodings (2)
• With discrete classes for an input element, nodes in the next layer are free to treat
different ranges of inputs in different (possibly non-linear) ways, since the
incoming arcs from each input class can have different weights.
• So if wyx1 = 0, wyx2 = 5 and wyx3 = 1, node y is very sensitive to middle age, mildly
sensitive to old age, and insensitive to youth. This would be a useful
discrimination to make when diagnosing job-related stress, for example.
[Figure: Age split into three input nodes, Young (x1), Middle (x2), Old (x3), each in [-1 1], feeding node y through weights wyx1, wyx2, wyx3]
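A small Python sketch contrasting the two encodings. The age subranges and the 0/1 one-hot coding are my own assumptions; the weights 0, 5, 1 are the ones from the example above:

```python
def scaled_age(age, lo=0, hi=100):
    """One node: map age linearly onto [-1, 1]; y must then treat all ages linearly."""
    return 2 * (age - lo) / (hi - lo) - 1

def one_hot_age(age):
    """Three nodes, one per subrange (Young < 30, Middle 30-60, Old > 60)."""
    return [1 if age < 30 else 0,
            1 if 30 <= age <= 60 else 0,
            1 if age > 60 else 0]

w = [0.0, 5.0, 1.0]   # insensitive to youth, very sensitive to middle age, mild for old age
for age in (20, 45, 80):
    x = one_hot_age(age)
    net = sum(xi * wi for xi, wi in zip(x, w))
    print(age, round(scaled_age(age), 2), net)   # net to y is now non-linear in age
```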
[0 1] –vs- [-1 1]
Example:
– I = Yearly Income (scaled to [0 1] or [-1 1])
– C = Credit history (scaled the same way); denotes bad (untrustworthy) or good
– L = Should the person be given a loan: Yes = 1, No = 0 or -1
[Figure: inputs I and C feed node L through weights WLI and WLC]
• Assume L fires (and outputs a 1) if its weighted sum of inputs is >= 1.
• Assume a customer has a bad credit history (i.e. has not paid back a few loans).
• Assume WLC = WLI = +1, which makes intuitive sense, since both should
contribute positively to the loan decision.
• If bad credit => C = 0, then L can still fire if I = 1.
• If bad credit => C = -1, then L cannot fire.
• So by using -1 (instead of 0) as the lower bound, the left end of the scale can
have a strong influence on the excitation (if the connecting weight is negative) or
inhibition (if that weight is positive) of the downstream node. In short, both ends
of the scale have similar (but opposite) effects upon the downstream node.
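A tiny Python sketch of the loan example above, showing why the [-1 1] coding lets bad credit actively inhibit node L:

```python
def loan_fires(income, credit, w_li=1.0, w_lc=1.0):
    """Node L fires when W_LI*I + W_LC*C >= 1."""
    return w_li * income + w_lc * credit >= 1.0

income = 1.0                      # high income
print(loan_fires(income, 0.0))    # bad credit coded as 0 (on [0, 1]) -> True: L still fires
print(loan_fires(income, -1.0))   # bad credit coded as -1 (on [-1, 1]) -> False: L cannot fire
```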
Output Encodings
• Similar to Input encodings
• 1-n encoding a key issue
– More weights to train
– But greater discriminability
• Take account of the range of fT of the output nodes.
– E.g. Sigmoids output values in (0 1)
Mapping Thoughts to Actions in the Brain
• The cerebellum, which controls a good deal of motor activity, has a feed-forward
structure with few backward (i.e., recurrent) connections.
• The cerebrum sends commands to initiate action, which are fed forward from mossy
fibers to granule cells to parallel fibers to Purkinje cells and out to motor neurons.
[Figure: cerebellar circuit. Thought in the cerebral neocortex -> mossy fiber -> granule cell -> parallel fibers -> Purkinje cell -> to motor cortex (Action!). Climbing fibers arrive from the inferior olive. Arrows denote signal direction.]
Distributed Coding in the Motor Cortex
• Cortical area # 4 = The Motor Cortex (M1)
• Pyramidal cells in M1 get inputs from the cortex & thalamus; they
send outputs to motor neurons.
• But pyramidals => motor neurons is a many-to-many (N-to-N) mapping.
• So during any particular movement, MANY pyramidal and motor
neurons are firing. I.e. Movement coding is DISTRIBUTED across the
pyramidal cells.
[Figure: firing rate versus motion angle for pyramidal cells A and B, and the many-to-many connections from pyramidal cells to motor neurons]
Associative-Memory Networks
Input: Pattern (often noisy/corrupted)
Output: Corresponding pattern (complete / relatively noise-free)
Process
1. Load input pattern onto core group of highly-interconnected
neurons.
2. Run core neurons until they reach a steady state.
3. Read output off of the states of the core neurons.
[Figure: associative-memory network with input and output layers connected to a clique of core neurons]
Input: (1 0 1 -1 -1)   =>   Output: (1 -1 1 -1 -1)
Distributed Information Storage & Processing
[Figure: network with weights wi, wj, wk]
Information is stored in the weights, with:
• Concepts/Patterns spread over many weights and nodes.
• Individual weights can hold info for many different concepts.
Hebb’s Rule
Connection Weights ~ Correlations
``When one cell repeatedly assists in firing another, the axon of the first cell
develops synaptic knobs (or enlarges them if they already exist) in contact
with the soma of the second cell.” (Hebb, 1949)
In an associative neural net, if we compare two pattern components (e.g. pixels)
within many patterns and find that they are frequently in:
a) the same state, then the arc weight between their NN nodes should be positive
b) different states, then the arc weight between their NN nodes should be negative
Matrix Memory:
The weights must store the average correlations between all pattern components
across all patterns. A net presented with a partial pattern can then use the correlations
to recreate the entire pattern.
Correlated Field Components
• Each component is a small portion of the pattern field (e.g. a pixel).
• In the associative neural network, each node represents one field component.
• For every pair of components, their values are compared in each of several patterns.
• Set weight on arc between the NN nodes for the 2 components ~ avg correlation.
[Figure: pattern components a and b are compared across many patterns; the arc weight wab between their two nodes is set to their average correlation]
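A minimal Python sketch of such a matrix memory (a standard Hopfield-style construction; the zero self-weights and the simple update schedule are my assumptions). The weights store average correlations across the stored patterns, and a corrupted pattern is completed by running the net to a steady state:

```python
def train(patterns):
    """w_ab = average of x_a * x_b over all stored patterns (no self-connections)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for a in range(n):
            for b in range(n):
                if a != b:
                    w[a][b] += p[a] * p[b] / len(patterns)
    return w

def recall(w, state, steps=5):
    """Repeatedly push each component toward the sign of its weighted input."""
    state = list(state)
    n = len(state)
    for _ in range(steps):
        for a in range(n):
            net = sum(w[a][b] * state[b] for b in range(n))
            if net != 0:
                state[a] = 1 if net > 0 else -1
    return state

patterns = [[1, -1, 1, -1, -1],
            [-1, 1, -1, 1, 1]]
w = train(patterns)
noisy = [1, 1, 1, -1, -1]      # second component corrupted
print(recall(w, noisy))        # -> [1, -1, 1, -1, -1], the stored pattern
```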
Hopfield Nets in the Brain??
• The cerebral cortex is full of recurrent connections, and there is solid evidence for
Hebbian synapse modification there. Hence, the cerebrum is believed to function as an
associative memory.
• Flip-flop figures indicate distributed Hopfield-type coding, since we cannot hold both
perceptions simultaneously (binding problem).
The Necker Cube
[Figure: the Necker cube with vertices A through H]
Which face is closer to the viewer? BCGF or ADHE?
Only one side of the (neural) network can be active at a time.
[Figure: a constraint network of propositions such as Closer(A,B), Closer(C,D), Closer(H,G), Closer(G,H), Convex(A), Convex(G), Hidden(G), Showing(G), linked by excitatory and inhibitory arcs. After Steven Pinker (1997), "How the Mind Works", pg. 107.]
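A toy Python sketch (with parameters of my own choosing, not from the slides) of the flip-flop behavior: two nodes standing for the two interpretations each receive a constant bias and inhibit each other, so only one can stay active at a time:

```python
def settle(a, b, bias=0.5, inhibition=1.0, steps=10):
    """Repeatedly update two mutually inhibitory threshold nodes until they settle."""
    for _ in range(steps):
        a = 1 if bias - inhibition * b > 0 else 0   # "BCGF is closer" node
        b = 1 if bias - inhibition * a > 0 else 0   # "ADHE is closer" node
    return a, b

print(settle(a=1, b=0))   # (1, 0): one interpretation wins and suppresses the other
print(settle(a=0, b=1))   # (0, 1): the opposite interpretation is equally stable
```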
What’s in a Link?
• An implicit coding of the preferences that a node has for upstream values.
• An implicit coding of the correlation between the data elements represented by the two nodes.
[Figures: the weighted holon node and the two-layer boolean network illustrate links as preferences; the Necker-cube network illustrates links as correlations]
Architectures & Node/Link Semantics
• Feedforward Networks & Competitive Networks [diagrams: In -> Out]
– Nodes = Semi-local or local coding of low-level and high-level concepts
– Arcs = Preferred upstream values, i.e. preconditions for concept membership. (The
inter-layer inhibitory arcs in competitive networks embody the control information
that only one node can win/fire.)
• Hopfield Networks [diagram: In -> clique -> Out]
– Nodes = Semi-local or distributed coding for elements of the input pattern
– Arcs = Average correlations (across many patterns) between the input
elements represented by the arc's 2 nodes. The inter-layer nodes are just
for transferring the inputs to the clique.