#### Transcript (PPT)

From Biological to Artificial Neural Networks
NPCDS/MITACS Spring School, Montreal, May 23-27, 2006
Helmut Kroger, Laval University

Outline
1. From biology to artificial NNs
2. Perceptron – supervised learning
3. Hopfield model – associative memory
4. Kohonen map – self-organization

References
1. Hertz J., Krogh A., and Palmer R.G. Introduction to the Theory of Neural Computation.
2. Duda R.O., Hart P.E., and Stork D.G. Pattern Classification. John Wiley, 2001.
3. Haykin S. (1999). Neural Networks. Prentice Hall International.
4. Bishop C. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
5. Ripley B.D. Pattern Recognition and Neural Networks. Cambridge University Press, Jan 1996.
6. Mueller B. and Reinhardt J. Neural Networks: An Introduction. Springer-Verlag, Berlin, 1991.
7. McCulloch W.S. & Pitts W. (1943). "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5, 115-137.

Use of NNs: what neural networks are for
- Applications: character recognition, optimization, data mining, …
- Science: neuroscience; physics, mathematics; computer science; …

Biological neural networks
Nerve cells are called neurons; many different types exist, and neurons are extremely complex. There are approximately 10^11 neurons in the brain, and each neuron has about 10^3 connections.

Neural communication
Neurons transport information via electrical action potentials. At the synapse the transmission is mediated by chemical macromolecules (neurotransmitter proteins). (Figure: how two neurons communicate.)

Biology of neurons
Single neurons are highly complex electrochemical devices. Many forms of interneuron communication are now known, acting over many different spatial and temporal scales: local, gaseous, volume signaling, etc. The neuron is a computer. (Figure: axon hillock, input, output.)

From the biology of the brain to information processing
(1) Information is distributed.
(2) The brain works in a deterministic mode as well as in a stochastic mode (generation of nerve signals, synaptic transmission).
(3) Associative memory: retrieval not by checking bit by bit, but by gradual reproduction of the whole.
(4) Architecture of the brain: optimizes information transmission; is stable against errors; obeys physical constraints (oxygen, blood, energy, cooling); Small-World architecture.
(5) 1/f frequency scaling in EEG: fractals, self-similarity. Dynamical origin: model of self-organized criticality (sand-pile analogue). Neural avalanches – do they transmit information? Does the brain work at some critical point?
(6) How is neural connectivity generated? Nerve growth: a random process, guided by chemical markers, enhanced by stochastic resonance.

(Figure slides: layered structure of the neocortex; organization of the neocortex; biological parameters of the neocortex.)

Artificial Neural Networks
A network with interactions: an attempt to mimic the brain. The unit elements are artificial neurons (linear or nonlinear input-output units). Communications are encoded by weights, a measure of how strongly neurons affect each other. Architectures can be feed-forward, feedback, or recurrent.

Firing of a group of neurons: biology vs. model
Inputs x_1, ..., x_n arrive through synapses of strength w_1, ..., w_n, and the output is
y = f( Σ_j w_j x_j − b ),
where b is the threshold (bias) of the unit.
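To make the summation-and-threshold picture above concrete, here is a minimal Python sketch of a single artificial neuron computing y = f( Σ_j w_j x_j − b ); the function names and example numbers are illustrative, not taken from the slides.

```python
import math

def step(a):
    """Threshold (Heaviside-like) activation: fire (+1) or stay silent (-1)."""
    return 1.0 if a >= 0.0 else -1.0

def sigmoid(a):
    """Smooth activation, often used instead of a hard threshold."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(x, w, b, f=step):
    """Single artificial neuron: y = f( sum_j w_j * x_j - b )."""
    a = sum(wj * xj for wj, xj in zip(w, x)) - b
    return f(a)

# Example: two inputs with synaptic strengths 1.0 and 1.0, threshold 1.5
print(neuron_output([1.0, 1.0], [1.0, 1.0], b=1.5))   # +1 (fires)
print(neuron_output([1.0, 0.0], [1.0, 1.0], b=1.5))   # -1 (does not fire)
```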
1. Multi-layer feed-forward network
Biological motivation for the multilayer feed-forward network: it is modeled after the visual cortex, which is multi-layered (6 layers) and dominantly feedforward. There is strong lateral inhibition: neurons in the same layer don't talk to each other. High connectivity between neurons (10^4 connections per neuron) provides the basis for massively parallel computing and high redundancy against errors.

The principal building blocks: input vector x_k, weight matrix w_ik, activation function f, output vector y_i.
Dynamical update rule: y_i(t+1) = f( Σ_j w_ij x_j(t) − θ_i ).
Goal: find weights and an activation function such that for a given input the output is close to the desired output (training supervised by a teacher). (Slide: examples of activation functions.)

The Perceptron (1962)
A simple feedforward network: 1 input layer, 1 output layer. The simplest possible artificial neural network able to do a calculation consists of two inputs and one output. It can be used to classify patterns or to perform basic logical operations such as AND, OR and NOT.
Example: logical AND. Inputs x and y enter with weights 1 and 1; a constant input −1 with weight 1.5 implements the threshold, so the sum is x + y − 1.5 and the output is Θ(x + y − 1.5).

Truth table for logical AND:

| x | y | x AND y |
|---|---|---------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Perceptron learning algorithm
Training a unit consists of adjusting the weight vector and the threshold so that the desired classification is performed. Algorithm (inspired by Hebb's rule): for each training vector x evaluate the output y; take the difference between the desired output t and the current output y; update w_i = w_i + Δw_i with Δw_i = η (t − y) x_i; repeat until y = t.
Rule: if the sum of the weighted inputs Σ_i x_i w_i exceeds a threshold, output 1, else output −1:
output = 1 if Σ_i input_i · weight_i > threshold, and −1 if Σ_i input_i · weight_i < threshold.

Interpretation of weights
The Heaviside function has its threshold at x = 0. The decision boundary is given by
a = w·x + w_0 = w_0 + w_1 x_1 + w_2 x_2 = 0,
thus x_2 = −(w_0 + w_1 x_1)/w_2.
(Slide: look-up tables for the OR function and the XOR function.)

Linear discriminant function: separation of 2 classes
A linear discriminant function is a mapping which partitions feature space using a linear function (a straight line, or hyperplane). In D = 2 dimensions the decision boundary is a straight line. (Figure: two-class data in a two-dimensional feature space, axes Feature 1 and Feature 2; a decision boundary separates Decision Region 1 from Decision Region 2.) Simple form of classifier: "separate two classes using a straight line in feature space."

Separation of K classes
(Figure: single-layer network with inputs x_1, ..., x_d, a constant bias input −1, and outputs y_1, ..., y_k; the weight to output j from input k is w_jk.)
y_j = g( Σ_k w_jk x_k + w_j0 ).
The perceptron can be used to discriminate between k classes by having k output nodes: x is in class C_j if y_j(x) ≥ y_k(x) for all k. The resulting decision boundaries divide the feature space into convex decision regions C_1, C_2, C_3, ...

Network training via gradient descent
A set of training data from known classes is used in conjunction with an error function E(w) (e.g. the squared difference between target t and response y), which must be minimized. Then
w_new = w_old − η ∇E(w), i.e. w_jk ← w_jk − η ∂E/∂w_jk,
where ∇E(w) is a vector representing the gradient and η is the learning rate (small, positive).
1. Move downhill in the direction −∇E(w) (steepest downhill, since ∇E(w) is the direction of steepest increase).
2. Termination is controlled by a chosen stopping criterion.
Moving along the error-function landscape is equivalent to climbing a hill up and down. Problem: when to stop?

Local minima
There is the possibility of multiple local minima. Note: for the single-layer perceptron, E(w) has only a single global minimum, so this is no problem. Gradient descent goes to the closest local minimum. General solution: random restarts from multiple places in weight space (simulated annealing).

Training a multi-layer NN via the back-propagation algorithm. (Slides: Back-Propagation; Back-Propagation cont'd.)
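As an illustration of the gradient-descent update w_new = w_old − η ∇E(w), here is a small Python sketch (names and toy data are assumptions for illustration) that fits a single linear unit by repeatedly stepping against the gradient of the squared error.

```python
import numpy as np

def train_gradient_descent(X, t, eta=0.1, epochs=100):
    """Minimize E(w) = 1/2 * sum_n (t_n - y_n)^2 for a linear unit y = w . x
    by repeated steps w <- w - eta * grad E(w)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        y = X @ w                      # responses for all training vectors
        grad = -(t - y) @ X            # gradient of the squared error w.r.t. w
        w = w - eta * grad             # steepest-descent step
    return w

# Toy data: a constant 1 is appended to each input so the bias is just another weight
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
t = np.array([0., 1., 1., 2.])         # here the target is simply x1 + x2
print(train_gradient_descent(X, t))    # approaches [1, 1, 0]
```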
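Likewise, the perceptron learning rule Δw_i = η (t − y) x_i from earlier in this section can be sketched in a few lines; the code below trains a two-input perceptron on the logical AND truth table (variable names and the learning-rate value are illustrative).

```python
def perceptron_output(x, w, theta):
    """Output 1 if the weighted sum exceeds the threshold, else 0 (0/1 to match the AND table)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0

def train_perceptron(samples, eta=0.5, epochs=20):
    """Adjust weights and threshold until every pattern is classified correctly."""
    w, theta = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in samples:
            y = perceptron_output(x, w, theta)
            # Delta rule: move the weights toward the desired output
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
            theta = theta - eta * (t - y)   # threshold treated as a weight on a constant input
    return w, theta

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND)
for x, t in AND:
    print(x, t, perceptron_output(x, w, theta))   # learned outputs match the targets
```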
Perceptron as classifier
For d-dimensional data the perceptron consists of d weights, a bias, and a thresholding activation function. For 2D data we have inputs x_1, x_2 and a constant input 1, with weights w_1, w_2 and w_0; view the bias as another weight from an input which is constantly on.
1. Weighted sum of the inputs: a = w_0 + w_1 x_1 + w_2 x_2, y = g(a).
2. Pass a through the Heaviside function: T(a) = −1 if a < 0, T(a) = 1 if a ≥ 0. The output in {−1, +1} is the class decision.
If we group the weights as a vector w, the net output y is given by y = g(w · x + w_0).

Failure of the Perceptron
(Figure: footballers and academics plotted by hours in the gym per week, few vs. many, against "successful" vs. "unsuccessful".) Despite the simplicity of their relationship, Academics = Successful XOR Gym, a perceptron would not be able to discriminate between the footballers and the academics, because XOR cannot be represented by a single threshold sigma node. This failure caused the majority of researchers to walk away.

Classification: decision trees
Example of classification: "Which factors determine if a cell is cancerous?" Features such as Color = dark, # nuclei = 1, ..., # tails = 2 lead to the class Healthy or Cancerous. A decision tree for this example:
- Color = dark:
  - # nuclei = 1: # tails = 1 → healthy; # tails = 2 → cancerous
  - # nuclei = 2: cancerous
- Color = light:
  - # nuclei = 1: healthy
  - # nuclei = 2: # tails = 1 → healthy; # tails = 2 → cancerous
Problem: over-fitting.

Feed-forward NN: letter recognition from writing on a PC screen
A letter scanner transforms the written letter into ASCII code. (Figure: the scanned letter is encoded as a binary input vector over a pixel grid.)
SWN topology: a network of 5 layers by 8 neurons (simulation with 5 neurons per layer and 8 layers). The NN was trained with 40 patterns for 50 different runs. Learning 40 patterns: the regular network almost fails to learn; with a few short-cuts the network learns well. The SWN (small-world network) architecture is better than both the regular and the random architecture.

2. Hopfield NN: model of associative memory
Example: restoring corrupted memory patterns (original letter "T"; half of it corrupted; 20% of the "T" corrupted). Use in search machines (AltaVista): search from incomplete or corrupted items. Approximate solution of combinatorial optimization problems: the Travelling Salesman Problem.
Associative memory problem: store a set of patterns x; when presented with a pattern z, the network finds the pattern x closest to z.

Hopfield model
Dynamical variables: neuron i (i = 1, ..., N) is represented by the variable S_i(t). At any time t the network is determined by the state vector {S_i(t), i = 1, ..., N}.
Dynamical update rule: S_i(t+1) = sgn( Σ_j w_ij S_j(t) − θ_i ).
The system evolves from a given state to some stable network state (attractor state, fixed point).

Hopfield nets
1. Every node is connected to every other node.
2. Weights are symmetric (w_ij = w_ji) and w_ii = 0.
3. The flow of information is not unidirectional.
4. The state of the system is given by the node outputs.
Energy function of the Hopfield net (a multidimensional landscape): H = −(1/2) Σ_{i,j} w_ij S_i S_j.

Training of Hopfield nets
Inspired by the biological Hebb's rule (unsupervised):
1. Present the components of the patterns to be stored at the outputs of the corresponding nodes of the net: w_ij = Σ_p v_i^p v_j^p.
2. If two nodes have the same value, make a small positive increment to the internode weight; if they have opposite values, make a small negative decrement.
(Slides: distance of the NN state from the 1st training pattern; phase diagram of the attractor network. Load: α = number of patterns / number of connections per node.)

Character recognition
For a 256 × 256 character we have 65,536 pixels. One input for each pixel is not efficient. Reasons: 1. poor generalisation: the data set would have to be vast to properly constrain all the parameters; 2. it takes a long time to train. Answer: use averages of N^2 pixels, a dimensionality reduction; each average could be a feature.
(Slides: restoration of memory, regular network; restoration of memory, SWN.)
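A minimal sketch of the Hopfield dynamics just described, i.e. Hebbian storage w_ij = Σ_p v_i^p v_j^p with w_ii = 0 followed by the update S_i ← sgn( Σ_j w_ij S_j ), restoring a pattern with 20% of its bits corrupted; the pattern size and random seeds are illustrative choices, not from the slides.

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian storage: w_ij = sum_p v_i^p v_j^p, symmetric, zero diagonal."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for v in patterns:
        W += np.outer(v, v)
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, steps=10):
    """Asynchronous updates S_i <- sgn(sum_j w_ij S_j) until an attractor is reached."""
    S = state.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(S)):
            h = W[i] @ S
            S[i] = 1 if h >= 0 else -1
    return S

# Store one +-1 pattern, corrupt 20% of its bits, and let the network restore it
rng = np.random.default_rng(1)
pattern = rng.choice([-1, 1], size=25)
W = store_patterns(pattern[None, :])
corrupted = pattern.copy()
flip = rng.choice(25, size=5, replace=False)
corrupted[flip] *= -1
restored = recall(W, corrupted)
print(np.array_equal(restored, pattern))   # True: the attractor is the stored pattern
```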
3. Self-organization: topographic maps
Extend the ideas of competitive learning to incorporate the neighborhood around inputs and neurons. We want a nonlinear transformation of the input pattern space onto the output feature space which preserves the neighbourhood relationship between the inputs: a feature map where nearby neurons respond to similar inputs (e.g. place cells, orientation columns, somatosensory cells, etc.). The idea is that neurons selectively tune to particular input patterns in such a way that the neurons become ordered with respect to each other, so that a meaningful coordinate system for different input features is created. In a topographic map, spatial locations are indicative of the intrinsic statistical features of the input patterns: i.e., close in the input implies close in the output.
Example: when the black input in the lower layer is active, the black neuron in the upper layer is the winner; so when the red input in the lower layer is active, we want the upper red neuron to be the winner.

Example: activity-based self-organization (von der Malsburg, 1973)
Incorporation of competitive and cooperative mechanisms to generate feature maps using unsupervised learning. Biologically motivated: how can activity-based learning using highly interconnected circuits lead to an orderly mapping of the visual stimulus space onto the cortical surface (the visual-tectum map)? A 2-layer network (cortical units above, visual space below); each cortical unit is fully connected to visual space via Hebbian units. The interconnections of the cortical units are described by a 'Mexican-hat' function: short-range excitation and long-range inhibition.

Kohonen NN
Example of a grid: mapping 2D to 2D. Knowledge vs. topology: SWN during most of the organization phase.
Problem: the SOM tends to overrepresent regions of low density and underrepresent regions of high density; the weight density behaves as W_x ∝ p(x)^(2/3), where p(x) is the distribution density of the input data. Also, there can be problems when there is no natural 2D ordering of the data.
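As a sketch of the Kohonen map itself (competitive winner selection plus cooperative neighbourhood updates), the following Python code maps 2D inputs onto a 2D grid of neurons; the grid size, learning-rate schedule, and neighbourhood schedule are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def train_som(data, grid=(8, 8), epochs=20, eta0=0.5, sigma0=3.0):
    """Kohonen map: find the best-matching unit for each input and pull the
    winner and its grid neighbours (Gaussian neighbourhood) toward the input."""
    rng = np.random.default_rng(0)
    rows, cols = grid
    W = rng.random((rows, cols, data.shape[1]))          # one weight vector per grid node
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # decay the learning rate and neighbourhood width over time
            frac = step / n_steps
            eta = eta0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 0.5
            # competitive step: the winner is the node whose weights are closest to x
            dists = np.linalg.norm(W - x, axis=-1)
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            # cooperative step: neighbours on the grid move too
            grid_dist2 = np.sum((coords - np.array(winner)) ** 2, axis=-1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            W += eta * h[..., None] * (x - W)
            step += 1
    return W

# 2D -> 2D example: map uniform points in the unit square onto an 8x8 grid
data = np.random.default_rng(2).random((500, 2))
W = train_som(data)
print(W.shape)   # (8, 8, 2): a grid of weight vectors spread over the square
```

Shrinking the neighbourhood width over training first orders the map globally and then fine-tunes it locally, which is the usual way the cooperative mechanism is scheduled in a Kohonen map.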