#### Transcript: Neural Coding

**Neural Coding: What Kind of Information Is Represented in a Neural Network, and How?**

Outline:
- ANN Basics
- Local vs. Distributed Representations
- The Emergence/Learning of Salient Features in Neural Networks
- Feed-Forward Neural Networks Embody Mappings
- Linearly Separable Mappings
- Classification in Spaces that are NOT Linearly Separable
- Coding Heuristics
- Hopfield Networks
- Summary

**NeuroPhysiology** (neurons, synapses, nucleus, axon, dendrites)
- Dense: the human brain has ~10^11 neurons.
- Highly interconnected: human neurons have a fan-in of ~10^4.
- Neurons fire: they send action potentials (APs) down their axons when sufficiently stimulated by the SUM of incoming APs along the dendrites.
- Neurons can either stimulate or inhibit other neurons.
- Synapses vary in transmission efficiency.
- Development: formation of the basic connection topology.
- Learning: fine-tuning of the topology plus major synaptic-efficiency changes.
- The matrix IS the intelligence!

**NeuroComputing**
- Nodes fire when the sum of the weighted inputs exceeds a threshold.
  - Other varieties are common: unthresholded linear, sigmoidal, etc.
- Connection topologies vary widely across applications.
- Weights vary in magnitude and sign (stimulate or inhibit).
- Learning = finding the proper topology and weights.
  - A search process in the space of possible topologies and weights.
  - Most ANN applications assume a fixed topology.
- The matrix IS the learning machine!
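The basic node model above (fire when the weighted input sum exceeds a threshold) can be sketched in a few lines of Python; the weights, inputs, and threshold are illustrative values, not ones from the lecture:

```python
def threshold_neuron(inputs, weights, threshold):
    """Fire (+1) when the weighted input sum exceeds the threshold, else -1."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net > threshold else -1

# Illustrative: two excitatory and one inhibitory incoming connection.
print(threshold_neuron([1, 1, 1], [0.5, 0.5, -0.3], 0.6))   # net = 0.7 -> fires
print(threshold_neuron([1, 0, 1], [0.5, 0.5, -0.3], 0.6))   # net = 0.2 -> silent
```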
**Tasks & Architectures**
- Supervised learning: feed-forward networks (In -> Out).
  - Concept learning: inputs = properties, outputs = classification.
  - Controller design: inputs = sensor readings, outputs = effector actions.
  - Prediction: inputs = previous X values, outputs = predicted future X value.
  - Learn the proper weights via back-propagation.
- Unsupervised learning:
  - Pattern recognition: Hopfield networks (excitatory and inhibitory arcs in the clique).
  - Data clustering: competitive networks (Maxnet: a clique with only inhibitory arcs).

**Node Types**
- Most ANNs use nodes that sum their weighted inputs: net_j = Σ_i w_ji x_i, with output x_j = f_T(net_j).
- But many types of transfer function f_T are used:
  - Thresholded (discontinuous): step, ramp.
  - Non-thresholded (continuous, differentiable): linear, sigmoid.

**Transfer Functions**
- Step functions are useful in classifier nets, where data partitioning is important.
- Linear and sigmoidal functions are everywhere differentiable, and thus popular for backprop nets.
- The sigmoid has the most biological plausibility.

**Learning = Weight Adjustment**
- Generalized Hebbian weight adjustment: the sign of the weight change equals the sign of the correlation between x_i and z_j, i.e. Δw_ji ∝ x_i·z_j, where z_j is:
  - x_j (Hopfield networks)
  - d_j - x_j (perceptrons; d_j = desired output)
  - d_j - Σ_i x_i w_ji (ADALINEs)

**Cellular Automata**
- Example update rule (step N -> step N+1): if a cell has exactly 2 red neighbors, it turns red; otherwise it turns green.

**Distributed Representations: Picture Copying**
- Update rule: if an odd number of neighbors are on, turn on; else turn off.
- In CAs and ANNs, you need to learn to think differently about representation!

**Local vs. Distributed Representations**
- Assume examples/concepts have 3 features:
  - Age: {Young, Middle, Old}
  - Sex: {Male, Female}
  - Marital Status: {Single, Samboer (cohabitant), Married}
- Example concepts: "Young, Single, Male!", "Old, Female, Samboer!", "Old, Female!"
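The four transfer functions named above can be sketched as plain Python functions; the ramp's saturation bounds and the defaults are illustrative choices, not values from the lecture:

```python
import math

def step(net, threshold=0.0):
    """Discontinuous: outputs 1 above the threshold, else 0."""
    return 1.0 if net > threshold else 0.0

def ramp(net, lo=-1.0, hi=1.0):
    """Linear between lo and hi, clipped (saturated) outside that range."""
    return min(max(net, lo), hi)

def linear(net, slope=1.0):
    """Unthresholded linear: output proportional to the net input."""
    return slope * net

def sigmoid(net):
    """Smooth, everywhere-differentiable squashing of net into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))
```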
**Local**: one neuron represents an entire conjunctive concept, e.g. "Young, Married, Female!" or "Samboer!"

**Semi-Local**: together the neurons represent a conjunctive concept, and each neuron represents one or a few conjuncts; i.e., the concept is broken into clean pieces.

**Distributed**: together the neurons represent a conjunctive concept, but the individual conjuncts cannot necessarily be localized to single neurons.

**Local vs. Distributed (2)**
- Size requirements to represent the whole set of 18 three-feature concepts, assuming binary (on/off) neurons:
  - Local: 3x3x2 = 18 neurons. An instance is EXACTLY 1 of the 18 neurons being on.
  - Semi-local: 3+3+2 = 8 neurons (assuming one feature value per neuron). An instance is EXACTLY 3 of the 8 neurons being on.
  - Distributed: ceil(log2 18) = 5 neurons. An instance is any combination of on/off neurons.
    - Add 1 bit and you DOUBLE the representational capacity, so each concept can be represented by 2 different codes (redundancy).
- The same neural network (artificial or real) may use different types of coding in different regions of the network.
- (Figure: a local "Young, Married, Female!" node fed, with positive weights, by semi-local feature nodes for Young/Old, Single/Married, and Male/Female.)

**Semi-Local => Local Representational Hierarchies**
- In the brain, neurons involved in early processing are often semi-local, while neurons occurring later along the processing path (i.e., higher-level neurons) are often local.
- In simpler animals, there appears to be a lot of local coding. In humans, it is still debatable.
- Example pathway: "line tilted 45° at {3°, 28°}" -> "dark dot at {3°, 28°}" -> "human face" -> "Grandma!!"

**Vector Coding**
- An organism's sensory apparatus uses vector coding as a representation of its inputs.
- This is semi-local coding, since the components of a conjunctive concept are localized to individual neurons.
- A particular color, flavor, sound, etc. = a vector of receptor states (not a single receptor state).
- Combinatorics: n^k possible vector states, where k = # receptors and n = # possible states per receptor. Note: n > 2 in many cases.
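The size arithmetic above, and the n^k vector-coding combinatorics, are easy to check (the receptor counts n and k below are illustrative, not from the lecture):

```python
import math

age, marital, sex = 3, 3, 2            # feature-value counts from the lecture
concepts = age * marital * sex         # 18 distinct conjunctive concepts

local       = concepts                        # one neuron per whole concept
semi_local  = age + marital + sex             # one neuron per feature value
distributed = math.ceil(math.log2(concepts))  # binary code over the concept set

print(local, semi_local, distributed)   # 18 8 5

# Vector coding: k receptors, each with n possible states -> n**k patterns.
n, k = 10, 4
print(n ** k)   # 10000 discriminable sensory vectors
```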
The fact that humans are much better at discriminating sensory inputs than at actually describing them illustrates the relative density of sensory vector space versus the sparseness of language. (Figure: a tongue's receptor-activation vector, e.g. (0.1, 0.8, 0.2, 0.9), coding the taste of "Tyrkisk Peber".)

**Comparison of Coding Forms**
- Compact representation: Local (NO!), Distributed (YES!).
- Graceful degradation (the code still works when a few neurons are faulty): Local (NO!), Distributed (YES, due to redundancy).
- Binding problem (how to represent two concepts that occur simultaneously): Local (EASY: two active nodes), Distributed (HARD, but may be possible by quick shifts back and forth between the 2 activation patterns).
  - E.g. "Where's Waldo": it is easy to pick out a human face among a bunch of round objects, or your mother's face among a bunch of other faces, indicating that we probably have relatively local codes for these all-important concepts. But it is VERY HARD to find Waldo (a generic-faced cartoon man in a red-and-white striped shirt) in a crowd of several hundred generic cartoon characters wearing all sorts of colors and patterns. Why? "Red-and-white stripes" is probably not locally coded in the human brain, and hence not quickly/effortlessly detected. It probably shares neurons with concepts such as "stripe", "red", "white", etc.
- In more complex animals, all 3 coding forms are probably present, with local coding reserved for the concepts most salient to that organism.

**Species-Specific Saliency**
- The key stimuli for an organism are often locally or semi-locally encoded, with direct connections from the detector neuron(s) to a motor (action-inducing) neuron.
- The movement of a simple silhouette resembles a hawk and scares young chicks; the movement of the reversed silhouette resembles a goose and elicits no response from the chicks.

**Fish Dinner**
- Three-spined sticklebacks respond to crude dummy stimuli with a red underside, but not to more realistic models without one.
- Salient feature: red belly!
**Toad Turn-ons**
- The behavioral response of a toad (the number of times per minute that it turns toward the stimulus) as a function of stimulus length is mirrored by the firing rates of neurons in the T5(2) region of its brain: "worm" stimuli elicit strong responses, while "anti-worm" and square stimuli do not.

**Emergent Salience**
- Animal bodies and brains have evolved to maximize the odds of survival and reproduction (i.e., fitness). Both are tailored to the survival task at hand.
- Hence salient features will emerge (via evolution and learning) as the activating conditions for various neurons. When fired, those neurons then help to initiate the proper (motor) response to a salient input.
- Similarly, if an ANN is given a task and the ability to adapt (i.e., learn and/or evolve), the salient features of that task will emerge as the activating conditions for hidden-layer and output neurons.
- Salient features can then be read off the input weights to those neurons.
- So the only features that need to be given to the ANN are the very primitive ones at the input layer. The rest are discovered!

**Face Recognition**
- Animals differ in their abilities to discriminate sounds, tastes, smells, colors, etc.
- Humans are very good at discriminating faces, at least faces of the type they grow up around.
- Hypothesized number of dimensions in face-coding space: 20 (Churchland).
- Morphing: choose evenly-spaced points along the vector that connects the source and target faces.

**ANN for Face Recognition**
- Garrison Cottrell et al. (1991): a feed-forward net with backprop learning.

**Training & Testing**
- Training: 64 photos of 11 different faces + 13 non-face photos.
- Performance criteria: classify each picture as to:
  - face or non-face?
  - male or female?
  - name?
- Results:
  - Training accuracy: 100%.
  - Test with the same faces but new pictures: 98%.
  - Test with new faces: 100% (face vs. non-face), 81% (male vs. female).
- Test with a known face but with 20% of the picture erased:
  - Vector completion: the firing patterns of the middle-layer neurons are very similar to the patterns produced when the non-erased image is presented. Hence, in its "understanding" of the pictures, the ANN fills in the missing parts.
  - Generally good performance, but erased foreheads caused problems (71% recognition).
- Holons: middle-layer nodes represent generic combination faces ("combi-faces") instead of individual features.

**Combi-Faces (Holons) at Hidden Nodes**
- The incoming weights to a node indicate what it "prefers"; e.g., a node might:
  - like eyes at particular positions,
  - have a slight preference for noses right below and between the eyes,
  - prefer smiles over frowns,
  - be "turned on" by movie-star cheek moles.
- Visualizing these preferences (darker color => higher preference) yields the node's "dream face". Similar methods can be used to interpret the concepts represented by other ANN nodes.

**Facial Holons**
- Each input case satisfies a subset of the 80 holons; i.e., each input case is a combination of holons.
- Preferred stimuli: by looking at the signs of the input weights to a hidden node, we can construct a prototypical input vector that the node would fire on. E.g., if w_ji > 0, then x_i > 0 is desired, and if w_ji < 0, then x_i < 0 is desired. Doing this for each of the 80 hidden nodes of the face net yields an interesting set of hybrid faces as preferred stimuli.
- Enhanced robustness: since recognition of particular features is now spread over many hidden nodes/holons, the network can still successfully recognize faces if a node or two are inoperable.

**How Realistic Is It?**
- Anatomical:
  - In the brain, 5 levels of synapses connect the retina to the (known) region of face coding.
  - But those 5 levels perform many other tasks too.
- Functional:
  - ANNs trained with many more Asian than Caucasian faces were much better at discriminating the former than the latter.
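The preferred-stimulus construction described above (read off the sign of each incoming weight) can be sketched directly; the hidden-node weights below are illustrative, not taken from the actual face net:

```python
def preferred_stimulus(weights):
    """For each incoming weight, the preferred input has the same sign:
    w_ji > 0 -> x_i = +1 desired; w_ji < 0 -> x_i = -1; w_ji == 0 -> indifferent."""
    return [1 if w > 0 else -1 if w < 0 else 0 for w in weights]

# Illustrative hidden-node weights.
print(preferred_stimulus([0.7, -1.2, 0.0, 2.5]))   # [1, -1, 0, 1]
```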
- "They all look alike" is a result of past experiences and their effects upon the observer's neural development, not of any objective differences in homogeneity within the different races.
- Similar ANNs were also trained to recognize emotional states in the faces.
  - Results were promising (~80% accuracy in the test phase), but the acting ability of the student subjects was very poor, so better results can be expected.
  - Emotion recognition is a VERY important aspect of human social behavior.

**Neural Nets as Mappings**
- The main application of feed-forward ANNs is to learn a general function (mapping) F between a particular domain (D) and range (R) when given a set of examples: {(d, r) : d in D, r in R}.
- D and R may contain vectors or scalars.
- Example set = {(d1, r3), (d2, r1), (d3, r2), (d4, r2)}.
- Goal: build an ANN that can take ANY element d of D on its input layer and produce F(d) on its output layer.
- Problem: the example set normally represents a very small fraction of the complete mapping set (which may be infinite).

**Sensorimotor Coordination: Mapping Sensations to Actions**
- The brain maps an input vector from the senses to an output vector of desired muscle-activation levels.
- Intelligent physical behavior: performance of the proper motor movements in response to the current sensory stimuli.
  - "A large and well-defined brain is just evolution's latest and highest achievement in sensorimotor coordination, not its earliest or only example." (Churchland, pp. 95-6)
- Vector processing: transformation of sensory input vectors into motor output vectors.
- Coordinated behavior (walking, running, cross-country skiing): the proper sequence of muscle activations = the proper trajectory in output-vector space.

**ANN for Crab Control**
- A simple feed-forward net that maps points in visual space to points in claw-angle space; 93% accurate.
- Simple, one-shot movement: assumes the muscles snap into the proper position.

**Classification = Mapping**
- M: Features => Classes.
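Learning such a mapping from examples can be sketched with the perceptron weight-update rule from the earlier table, Δw_ji = η(d_j - x_j)x_i; the 2D training data below is an illustrative linearly separable set (class +1 above the line y = x), not data from the lecture:

```python
def train_perceptron(samples, eta=0.1, epochs=50):
    """Perceptron rule: dw_i = eta * (d - out) * x_i, with a trainable bias weight."""
    w = [0.0, 0.0, 0.0]                      # [w_x, w_y, bias]
    for _ in range(epochs):
        for (x, y), d in samples:
            out = 1 if w[0]*x + w[1]*y + w[2] > 0 else -1
            err = d - out                    # zero when the output is correct
            w[0] += eta * err * x
            w[1] += eta * err * y
            w[2] += eta * err
    return w

# Class +1 above the line y = x, class -1 below it.
data = [((0, 1), 1), ((1, 3), 1), ((-2, 0), 1),
        ((1, 0), -1), ((3, 1), -1), ((0, -2), -1)]
w = train_perceptron(data)
preds = [1 if w[0]*x + w[1]*y + w[2] > 0 else -1 for (x, y), _ in data]
print(preds)   # [1, 1, 1, -1, -1, -1]
```

Because the data is linearly separable, the perceptron convergence theorem guarantees this loop finds a separating weight vector; for XOR-like data it would cycle forever.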
(Figure: a classifier net with inputs Weight, Habitat, Max Speed, and Coat type, a hidden layer, and outputs Bear, Sheep, Horse, and Hibernate?)

**Classification**
- (Table: seven training cases, each an (x, y) input pair labeled +1 or -1, together with candidate perceptron weights W_x, W_y and a threshold.)
- The perceptron should compute the proper class for each input x-y pair. For a single perceptron, this is only possible when the input vectors are linearly separable.

**Simple Boolean Functions** (True = +1, False = -1)
- Single perceptrons can compute and, or, not, ~and (nand), and ~or (nor); e.g., "and" fires when 0.5x + 0.5y - 0.8 > 0, and "or" fires when 0.5x + 0.5y + 0.3 > 0.

**Linear Separability of Booleans**
- AND: 0.5x + 0.5y - 0.8 > 0 <=> x + y > 1.6 <=> y > -x + 1.6. A single line separates the one positive case from the three negative ones.
- OR: 0.5x + 0.5y + 0.3 > 0 <=> x + y > -0.6 <=> y > -x - 0.6. Again a single line suffices.
- XOR: NOT linearly separable => more than one perceptron is needed; e.g., two "and"-style hidden nodes (one detecting x AND ~y, one detecting ~x AND y) feeding an "or" output node.
- All boolean functions can be represented by a feed-forward ANN with 2 layers or less. Proof: all boolean functions can be expressed as a conjunction of disjunctions (CNF) => the disjuncts form layer 1 and the conjunct forms layer 2.

**Linear Separability of Reals**
- y < x - 3 <=> y - x + 3 < 0 <=> x - y - 3 > 0, so a perceptron f(x, y) with weights (1, -1) and threshold 3 outputs +1 for all positive instances and -1 for all negative instances.
- When one hyperplane separates all positive from all negative examples, a single perceptron can be the classifier.

**Separable by N Hyperplanes**
- Three separating lines: L1: y = x; L2: y = -x + 5; L3: y = -4x + 30.
- Classification of positive instances:
  - C1: above L1 AND below L2, OR
  - C2: above L1 AND above L3, OR
  - C3: below L1 AND above L2 AND below L3.

**ANN Component Nodes**
- 1a. Above L1: y > x <=> y - x > 0 (weights -1, 1; threshold 0).
- 1b. Below L1: y < x <=> x - y > 0 (weights 1, -1; threshold 0).
- 2a. Above L2: y > -x + 5 <=> x + y - 5 > 0 (weights 1, 1; threshold 5).
- 2b. Below L2: y < -x + 5 <=> -x - y + 5 > 0 (weights -1, -1; threshold -5).
- 3a. Above L3: y > -4x + 30 <=> 4x + y - 30 > 0 (weights 4, 1; threshold 30).
- 3b. Below L3: y < -4x + 30 <=> -4x - y + 30 > 0 (weights -4, -1; threshold -30).

**The Complete ANN**
- Hyperplane layer: nodes 1a (threshold 0), 1b (0), 2a (5), 2b (-5), 3a (30), 3b (-30), each reading X and Y with the weights above.
- AND layer: C1 = 1a AND 2b (threshold 1.5); C2 = 1a AND 3a (threshold 1.5); C3 = 1b AND 2a AND 3b (threshold 2.5); all weights = 1.
- OR layer: f(x, y) fires when C1 OR C2 OR C3 fires (weights 1, 1, 1; threshold -1.5).

**A Simpler ANN**
- Replace each "Below L_i" node with "Not-Above-L_i": feed the C nodes from 1a, 2a, and 3a only, using weight -1 wherever "below" is required (C1: +1 from 1a, -1 from 2a; C2: +1 from 1a, +1 from 3a; C3: -1 from 1a, +1 from 2a, -1 from 3a), with the same thresholds (1.5, 1.5, 2.5).

**Sigmoidals & Linear Separability**
- Using a sigmoidal transfer function (which is non-linear) does not drastically change the nature of linear-separability analysis.
- It just introduces a wider linear separator (a "gray area"), which only creates problems when points lie within it.
- So if a set of points is not linearly separable using linear threshold transfer functions, then adding nonlinear sigmoidal transfer functions will not help!
- E.g., for the linear classifier x - y - 3 > 0, the sigmoidal version S(x - y - 3) outputs higher values for lower values of Y at a given X.

**Hidden Layer Design Decisions**
- Number of hidden layers and nodes:
  - Too few => the net can't partition the data properly.
  - Too many => the partitions are too detailed => the net over-specializes on the training set => it can't generalize to handle new cases.
- Points in space -> (step functions) -> hyperplanes -> (ANDs) -> convex regions -> (ORs) -> groups of regions.

**Input Encoding for Feed-Forward Networks**
- Reals => scaled values in [0, 1] or [-1, 1].
- Colors => pixel intensities => scaled values in [0, 1].
- Symbols => integers => scaled values, e.g. (small, medium, large) => (.2, .5, .8).
- Number of input nodes per input-vector element:
  - One node per element, or
  - One node per discrete subrange of the element's possible values.
- E.g., with a single Age node x scaled to [-1, 1] and one weight w_yx to node y: no matter how we choose w_yx, node y is forced to treat old age inversely to the way it treats youth. In fact, it must treat all ages in a linear fashion, since there is only 1 weight relating all ages to y.
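A minimal sketch of the hyperplanes-ANDs-OR construction, using the lines L1: y = x, L2: y = -x + 5, and L3: y = -4x + 30 and the thresholds given above (the sample points are illustrative):

```python
def node(inputs, weights, threshold):
    """Threshold unit: +1 if the weighted sum exceeds the threshold, else -1."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else -1

def classify(x, y):
    # Layer 1: half-space detectors for L1, L2, L3.
    a1 = node([x, y], [-1, 1], 0)      # above L1
    b1 = node([x, y], [1, -1], 0)      # below L1
    a2 = node([x, y], [1, 1], 5)       # above L2
    b2 = node([x, y], [-1, -1], -5)    # below L2
    a3 = node([x, y], [4, 1], 30)      # above L3
    b3 = node([x, y], [-4, -1], -30)   # below L3
    # Layer 2: ANDs of half-spaces -> convex regions C1..C3.
    c1 = node([a1, b2], [1, 1], 1.5)
    c2 = node([a1, a3], [1, 1], 1.5)
    c3 = node([b1, a2, b3], [1, 1, 1], 2.5)
    # Layer 3: OR of the three regions.
    return node([c1, c2, c3], [1, 1, 1], -1.5)

print(classify(0, 2), classify(4, 3), classify(2, 4))   # 1 1 -1
```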
**Input Encodings (2)**
- With discrete classes for an input element, nodes in the next layer are free to treat different ranges of inputs in different (possibly non-linear) ways, since the incoming arcs from each input class can have different weights. (Here Age is split into three input nodes: x1 = Young, x2 = Middle, x3 = Old, each in [-1, 1].)
- So if w_yx1 = 0, w_yx2 = 5, and w_yx3 = 1, node y is very sensitive to middle age, mildly sensitive to old age, and insensitive to youth. This would be a useful discrimination to make when diagnosing job-related stress, for example.

**[0 1] vs. [-1 1]**
- Example:
  - I = yearly income (scaled to [0, 1] or [-1, 1]).
  - C = credit history, similarly scaled, denoting bad (untrustworthy) or good.
  - L = should the person be given a loan? Yes = 1; No = 0 or -1.
- Assume L fires (and outputs a 1) if its weighted sum of inputs reaches 1.
- Assume a customer has a bad credit history (i.e., has not paid back a few loans).
- Assume W_LC = W_LI = +1, which makes intuitive sense, since both should contribute positively to the loan decision.
- If bad credit => C = 0, then L can still fire if I = 1.
- If bad credit => C = -1, then L cannot fire.
- So by using -1 (instead of 0) as the lower bound, the left end of the scale can have a strong influence on the excitation (if the connecting weight is negative) or inhibition (if that weight is positive) of the downstream node.
- In short, both ends of the scale have similar (but opposite) effects upon the downstream node.

**Output Encodings**
- Similar to input encodings.
- 1-of-n encoding is a key issue:
  - More weights to train,
  - But greater discriminability.
- Take account of the range of f_T at the output nodes:
  - Sigmoids output values in (0, 1).
  - The hyperbolic tangent (tanh) is similar to the sigmoid but has range (-1, 1).

**Mapping Thoughts to Actions in the Brain**
- The cerebellum, which controls a good deal of motor activity, has a feed-forward structure with few backward (i.e., recurrent) connections.
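The loan example above can be run directly; it shows why a bad credit history encoded as -1 inhibits the decision while 0 merely fails to help:

```python
def fires(inputs, weights, threshold=1.0):
    """Loan node L fires (1) when its weighted input sum reaches the threshold."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

W_LI, W_LC = 1.0, 1.0   # income and credit history, both weighted positively

# Bad credit encoded as 0: high income alone can still trigger the loan.
print(fires([1.0, 0.0], [W_LI, W_LC]))   # 1  (sum = 1.0)

# Bad credit encoded as -1: it actively cancels the income signal.
print(fires([1.0, -1.0], [W_LI, W_LC]))  # 0  (sum = 0.0)
```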
The cerebrum sends commands to initiate action; these are fed forward from mossy fibers to granule cells to parallel fibers to Purkinje cells, and out to motor neurons. (Figure: cerebral neocortex -> mossy fiber -> granule cell -> parallel fibers -> Purkinje cell -> motor cortex; climbing fibers arrive from the inferior olive.)

**Distributed Coding in the Motor Cortex**
- Cortical area #4 = the motor cortex (M1).
- Pyramidal cells in M1 get inputs from the cortex and thalamus; they send outputs to motor neurons.
- But the pyramidal-to-motor-neuron connection is an N-to-N mapping.
- So during any particular movement, MANY pyramidal and motor neurons are firing; i.e., movement coding is DISTRIBUTED across the pyramidal cells. (Figure: firing rate vs. motion angle for pyramidal cells A and B, each feeding many motor neurons.)

**Associative-Memory Networks**
- Input: a pattern (often noisy/corrupted).
- Output: the corresponding pattern (complete and relatively noise-free).
- Process:
  1. Load the input pattern onto a core group of highly-interconnected neurons.
  2. Run the core neurons until they reach a steady state.
  3. Read the output off the states of the core neurons.
- E.g., input (1 0 1 -1 -1) => output (1 -1 1 -1 -1).

**Distributed Information Storage & Processing**
- Information is stored in the weights, with:
  - Concepts/patterns spread over many weights and nodes.
  - Individual weights able to hold information for many different concepts.

**Hebb's Rule: Connection Weights ~ Correlations**
- "When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell." (Hebb, 1949)
- In an associative neural net, if we compare two pattern components (e.g., pixels) within many patterns and find that they are frequently in:
  a) the same state, then the arc weight between their NN nodes should be positive;
  b) different states, then the arc weight should be negative.
- Matrix memory: the weights must store the average correlations between all pattern components across all patterns.
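The matrix-memory idea above, weight between components i and j proportional to their average correlation across the stored patterns, can be sketched as follows; the two bipolar patterns are illustrative:

```python
def hebbian_weights(patterns):
    """w[i][j] = average of p[i] * p[j] over all patterns (+1/-1 components);
    no self-connections (w[i][i] stays 0)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

patterns = [[1, -1, 1, -1], [1, 1, 1, -1]]
w = hebbian_weights(patterns)
print(w[0][2])   # 1.0: components 0 and 2 always agree -> strong positive weight
print(w[0][3])   # -1.0: components 0 and 3 always disagree -> strong negative weight
print(w[0][1])   # 0.0: components 0 and 1 are uncorrelated across the patterns
```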
A net presented with a partial pattern can then use the correlations to recreate the entire pattern.

**Correlated Field Components**
- Each component is a small portion of the pattern field (e.g., a pixel).
- In the associative neural network, each node represents one field component.
- For every pair of components, their values are compared in each of several patterns.
- The weight on the arc between the NN nodes for the 2 components is set proportional to their average correlation.

**Hopfield Nets in the Brain??**
- The cerebral cortex is full of recurrent connections, and there is solid evidence for Hebbian synapse modification there. Hence, the cerebrum is believed to function as an associative memory.
- Flip-flop figures indicate distributed Hopfield-type coding, since we cannot hold both perceptions simultaneously (the binding problem).

**The Necker Cube**
- Which face is closer to the viewer: BCGF or ADHE?
- Only one side of the (neural) network can be active at a time: nodes such as Closer(A,B), Closer(H,G), Convex(A), and Hidden(G) support one interpretation, while Closer(G,H), Closer(C,D), Convex(G), and Showing(G) support the other, with excitatory arcs within each interpretation and inhibitory arcs between them.
- Steven Pinker (1997), "How the Mind Works", pg. 107.

**What's in a Link?**
- In a feed-forward net: an implicit coding of the preferences that a node has for upstream values.
- In an associative net: an implicit coding of the correlation between the data elements represented by the two nodes.

**Architectures & Node/Link Semantics**
- Feed-forward networks and competitive networks:
  - Nodes = semi-local or local coding of low-level and high-level concepts.
  - Arcs = preferred upstream values, i.e., preconditions for concept membership.
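Pattern completion from partial input, as described above, can be sketched with a one-pattern Hopfield-style memory; the recall loop uses the lecture's own example input (1 0 1 -1 -1), where 0 marks an unknown component:

```python
def store(patterns):
    """Hebbian matrix memory: w[i][j] = average of p[i]*p[j]; no self-connections."""
    n = len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) / len(patterns) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def recall(w, state, steps=10):
    """Run the core nodes (each takes the sign of its weighted input sum)
    until the state is steady, then read off the completed pattern."""
    s = list(state)
    for _ in range(steps):
        prev = list(s)
        for i in range(len(s)):
            net = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if net >= 0 else -1
        if s == prev:
            break
    return s

stored = [1, -1, 1, -1, -1]
w = store([stored])
print(recall(w, [1, 0, 1, -1, -1]))   # [1, -1, 1, -1, -1]: the gap is filled in
```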
  - (The intra-layer inhibitory arcs in competitive networks embody the control information that only one node can win/fire.)
- Hopfield networks:
  - Nodes = semi-local or distributed coding of the elements of the input pattern.
  - Arcs = average correlations (across many patterns) between the input elements represented by the arc's 2 nodes.
  - The input-layer nodes are just for transferring the inputs to the clique.