#### Transcript Document

Neural Networks (類神經網路概論), by 胡興民老師

1) What is a neural network?

A machine designed to model the way in which the brain performs a particular task or function of interest; the network is implemented using electronic components or is simulated in software on a digital computer [1]. Viewed as an adaptive machine, it is a massively parallel, distributed processor made up of simple processing units, which has a propensity for storing experiential knowledge and making it available for use [2].

[1] Adapted from Neural Networks (2nd ed.) by Simon Haykin
[2] Adapted from Aleksander and Morton, 1990

2) Significance of a neural network

Its significance lies in the ability to learn from its environment and to improve its performance through learning. Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment [3]. The type of learning is determined by the manner in which the parameter changes take place.

[3] Adapted from Mendel and McClaren, 1970

3) Five basic learning rules

- Error-correction learning (rooted in optimum filtering)
- Memory-based learning (memorizing the training data explicitly)
- Hebbian learning and competitive learning (these two are inspired by neurobiological considerations)
- Boltzmann learning (based on statistical mechanics)

4) Error-correction learning

[Figure: block diagram of a multilayer feedforward network, highlighting the only neuron in the output layer — an input vector X(n) passes through one or more layers of hidden neurons to output neuron k, whose output yk(n) is subtracted from the desired response dk(n) to produce the error signal ek(n).]

The output yk(n) is compared to a desired response, or target output, dk(n). Consequently, an error signal ek(n) is produced:

ek(n) = dk(n) − yk(n)

The error signal actuates adjustments to the synaptic weights so as to make the output signal come closer to the desired response in a step-by-step manner.
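As an illustrative sketch (not part of the original slides), the error-correction loop for a single linear output neuron can be written in Python/NumPy; the learning-rate value and the toy target function below are assumptions chosen for the example:

```python
import numpy as np

def delta_rule_step(w, x, d, eta=0.1):
    """One error-correction step: e = d - y, then w <- w + eta * e * x."""
    y = w @ x                 # output of a linear neuron, y_k(n)
    e = d - y                 # error signal e_k(n) = d_k(n) - y_k(n)
    return w + eta * e * x, e

# Toy task (assumed): learn the linear map d = 2*x1 - x2 from samples.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(200):
    x = rng.uniform(-1.0, 1.0, size=2)
    d = 2.0 * x[0] - 1.0 * x[1]
    w, e = delta_rule_step(w, x, d)

print(np.round(w, 2))         # converges toward [ 2. -1.]
```

Each step moves the weights a small amount in the direction that reduces the instantaneous error, which is exactly the step-by-step adjustment described above.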
The objective is achieved by minimizing a cost function, or index of performance, ξ(n), defined as the instantaneous value of the error energy:

ξ(n) = ½ [ek(n)]²

The adjustments are continued until the system reaches a steady state. Minimization of ξ(n) is referred to as the delta rule, or Widrow-Hoff rule.

[Figure: signal-flow graph of output neuron k — the elements x1(n), x2(n), …, xj(n), …, xm(n) of X(n), weighted by wk1(n), wk2(n), …, wkj(n), …, wkm(n), sum to the induced local field vk(n), which passes through the activation φ(·) to produce yk(n); yk(n) is combined with dk(n) (applied with weight −1) to yield ek(n).]

Let wkj(n) denote the value of the synaptic weight of neuron k excited by element xj(n) of the signal vector X(n) at time step n. The adjustment Δwkj(n) is defined by

Δwkj(n) = η ek(n) xj(n)

where η is a positive constant that determines the rate of learning as we proceed from one step in the learning process to the next. The updated value of the synaptic weight wkj is determined by

wkj(n+1) = wkj(n) + Δwkj(n)

5) Memory-based learning

There is a set of correctly classified input-output examples {(xi, di)}, i = 1, …, N, where xi is an input vector and di denotes the corresponding desired response. When classification of a test vector xtest (not seen before) is required, the algorithm responds by retrieving and analyzing the training data in a "local neighborhood" of xtest. The local neighborhood is defined by the training examples that lie in the immediate neighborhood of xtest (e.g., the network employs the nearest-neighbor rule). For example, xN′ ∈ {x1, x2, …, xN} is said to be the nearest neighbor of xtest if

min_i d(xi, xtest) = d(xN′, xtest),  i = 1, 2, …, N

where d(xi, xtest) is the Euclidean distance between the vectors xi and xtest. The class associated with the minimum distance, that is, with vector xN′, is reported as the classification of xtest. A variant is the k-nearest neighbor classifier, which proceeds as follows:

(i) Identify the k classified patterns that lie nearest to xtest, for some integer k.
(ii) Assign xtest to the class (hypothesis) that is most frequently represented among the k nearest neighbors of xtest (i.e., use a majority vote to make the classification and to discriminate against outliers, as illustrated in the following figure for k = 3).

[Figure: a scatter of class-0 and class-1 points; the test point d lies closest to an outlier from class 0, but its three nearest neighbors include two points belonging to class 1.]

The point d corresponds to the test vector xtest. With k = 3, the k-nearest neighbor classifier assigns class 1 to xtest even though it lies closest to the outlier.

6) Hebbian learning

(i) The original Hebb rule: if two neurons on either side of a synapse (connection) are activated simultaneously (i.e., synchronously), then the strength of that synapse is selectively increased.
(ii) If two neurons on either side of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated [4].

[4] Stent, 1973; Changeux and Danchin, 1976

Let xj(n) and yk(n) denote the presynaptic and postsynaptic signals, respectively, at time step n. The adjustment (Hebb's hypothesis) is expressed by

Δwkj(n) = η yk(n) xj(n)

Because of the correlational nature of a Hebbian synapse, this rule is sometimes referred to as the activity product rule, as shown in the top curve of the following figure, with the change Δwkj plotted versus the output signal (postsynaptic activity) yk.

[Figure: Δwkj versus postsynaptic activity yk — Hebb's hypothesis is a line through the origin with slope η xj; the covariance hypothesis is a line with slope η(xj − Mx) that crosses zero at the balance point yk = My and reaches its maximum depression −η(xj − Mx)My at yk = 0.]

Covariance hypothesis [5]: xj(n) and yk(n) are replaced by the deviations xj − Mx and yk − My, respectively, where Mx and My denote the time-averaged values of xj and yk. Hence, the adjustment is described by

Δwkj = η(xj − Mx)(yk − My)

The average values Mx and My constitute thresholds, which determine the sign of the synaptic modification.
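As a numerical illustration (mine, not the slides'), the two Hebbian adjustments can be compared directly; the activity values, the time averages, and η below are all assumptions:

```python
# Compare Hebb's hypothesis with the covariance hypothesis for one synapse.
eta = 0.5                 # learning rate (assumed)
x_j, y_k = 0.8, 0.6       # presynaptic / postsynaptic activity (assumed)
M_x, M_y = 0.5, 0.5       # time-averaged activities (assumed)

# Activity product rule: dw = eta * y_k * x_j (grows whenever both fire).
dw_hebb = eta * y_k * x_j

# Covariance rule: dw = eta * (x_j - M_x) * (y_k - M_y); the averages act
# as thresholds that set the sign of the modification.
dw_cov = eta * (x_j - M_x) * (y_k - M_y)

print(round(dw_hebb, 3), round(dw_cov, 3))   # 0.24 0.015
```

With activities on opposite sides of their averages, dw_cov turns negative while dw_hebb never does for positive activities, which is exactly the thresholding role of Mx and My.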
[5] Sejnowski, 1977

7) Competitive learning

The output neurons of a neural network compete among themselves to become active (fired). Whereas in Hebbian learning several output neurons may be active simultaneously, in competitive learning only a single output neuron is active at any one time.

[Figure: architectural graph of a simple competitive learning network — a layer of source nodes x1, x2, x3, x4 with feedforward connections to a single layer of output neurons, and lateral connections among the output neurons.]

The output signal yk of the winning neuron k is set to one; the outputs of all losing neurons are set to zero. That is,

yk = 1 if vk > vj for all j, j ≠ k;  yk = 0 otherwise

where the induced local field vk represents the combined action of all the forward and feedback inputs to neuron k. Suppose each neuron is allotted a fixed (positive) amount of synaptic weight, which is distributed among its input nodes; that is,

Σj wkj = 1 for all k (j = 1, …, N)

The change Δwkj applied to wkj is defined by

Δwkj = η(xj − wkj) if neuron k wins the competition;
Δwkj = 0 if neuron k loses the competition.

[Figure: geometric interpretation of the competitive learning process [6] — the o's represent the input vectors and the x's the synaptic weight vectors of three output neurons; panel (a) shows the initial state of the network and panel (b) the final state. All neurons in the network are assumed to be constrained to have the same Euclidean length (norm), i.e., Σj wkj² = 1 for all k (j = 1, …, N).]

[6] Rumelhart and Zipser, 1985

8) Boltzmann learning

The Boltzmann machine has a recurrent structure and operates in a binary manner: its neurons are either in an "on" state, denoted by +1, or in an "off" state, denoted by −1. Its energy function E is given by

E = −½ Σj Σk wkj xk xj  (j ≠ k)

where xj is the state of neuron j, and the condition j ≠ k means that none of the neurons has self-feedback.
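A minimal sketch of the winner-take-all update Δwkj = η(xj − wkj); the toy clusters and η are assumptions, and the winner is picked by minimum Euclidean distance, which is equivalent to the largest induced local field when all weight vectors share the same norm, as in the constraint above:

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 0.1

# Three well-separated clusters of input vectors (assumed toy data).
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
clusters = [c + 0.2 * rng.standard_normal((50, 2)) for c in centers]
W = np.array([cl[0] for cl in clusters])   # seed each neuron on one sample
X = np.concatenate(clusters)
rng.shuffle(X)

for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))   # winning neuron k
    W[k] += eta * (x - W[k])     # only the winner moves: dw = eta*(x - w)

print(np.round(W, 1))            # each row drifts toward a cluster center
```

This reproduces the geometric picture in the figure: each weight vector migrates from its initial position toward the center of the cluster of inputs it keeps winning.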
The machine operates by choosing a neuron xk at random during the learning process and flipping its state to −xk at some pseudotemperature T with probability

P(xk → −xk) = [1 + exp(−ΔEk / T)]⁻¹

where ΔEk is the energy change resulting from such a flip. If this rule is applied repeatedly, the machine reaches thermal equilibrium.

Let ρkj⁺ denote the correlation between the states of neurons j and k with the network in its clamped condition¹, and let ρkj⁻ denote the correlation in the free-running condition². The change Δwkj is defined by [7]

Δwkj = η(ρkj⁺ − ρkj⁻),  j ≠ k

where both ρkj⁺ and ρkj⁻ range in value from −1 to +1.

Remark 1: Neurons providing an interface between the network and the environment are called visible. In the clamped condition, the visible neurons are all clamped onto specific states, whereas the hidden neurons operate freely.
Remark 2: In the free-running condition, all neurons (visible and hidden) are allowed to operate freely.

[7] Hinton and Sejnowski, 1986

9) Supervised learning (learning with a teacher)

[Figure: block diagram of supervised learning — a vector describing the state of the environment feeds both the teacher and the learning system; the teacher's desired response and the learning system's actual response are subtracted to form an error signal that adjusts the learning system.]

The teacher has knowledge of the environment, with that knowledge represented by a set of input-output examples. The environment is, however, unknown to the network of interest. The learning process takes place under the tutelage of the teacher: the adjustment is carried out iteratively, in a step-by-step fashion, with the aim of eventually making the network emulate the teacher.

10) Learning without a teacher

There are no labeled examples of the function to be learned by the network. Two subdivisions are identified: (i) reinforcement learning / neurodynamic programming, and (ii) unsupervised learning.

(i) Reinforcement learning: the learning of an input-output mapping is performed through continued interaction with the environment in order to minimize a scalar index of performance.
[Figure: block diagram of reinforcement learning — the environment sends a state (input) vector to a critic and to the learning system; the critic converts the primary reinforcement signal from the environment into a heuristic reinforcement signal for the learning system, whose actions feed back to the environment.]

(ii) Unsupervised (self-organized) learning: there is no external teacher or critic; rather, provision is made for a task-independent measure of the quality of the representation that the network is required to learn, and the free parameters of the network are optimized with respect to that measure. Once the network has become tuned to the statistical regularities of the input data, it develops the ability to form internal representations for encoding features of the input and thereby to create new classes automatically.

[Figure: block diagram of unsupervised learning — a vector describing the state of the environment feeds the learning system directly.]

11) Learning tasks

The choice of a particular learning algorithm is influenced by the learning task.

(i) Pattern association takes one of two forms: autoassociation or heteroassociation. In autoassociation, a network is required to store a set of patterns (vectors) by repeatedly presenting them to the network; the network is subsequently presented with a partial description or distorted (noisy) version of an original pattern stored in it, and the task is to retrieve (recall) that particular pattern. Heteroassociation differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns. Autoassociation involves the use of unsupervised learning, whereas the type of learning involved in heteroassociation is supervised.

[Figure: input-output relation of pattern association — an input vector x enters the pattern associator, which produces an output vector y. The key pattern xk acts as a stimulus that not only determines the storage location of the memorized pattern yk but also holds the key (i.e., the address) for its retrieval.]

(ii) Pattern recognition: defined as the process whereby a received pattern/signal is assigned to one of a prescribed number of classes (categories).
[Figure: illustration of the approach to pattern classification — (a) an unsupervised network for feature extraction maps a point x in the m-dimensional observation space to a point y in the q-dimensional feature space, followed by a supervised network for classification into an r-dimensional decision space with classes 1, 2, …, r; (b) the network may take either form.]

12) Associative memory

An associative memory offers the following characteristics:

- The memory is distributed.
- Information is stored in memory by setting up a spatial pattern of neural activities.
- The information contained in a stimulus not only determines its storage location but also an address for its retrieval.
- There may be interactions between stored patterns, and so there is a distinct possibility of errors during the recall process.

The memory is said to perform a distributed mapping that transforms an activity pattern in the input space into another activity pattern in the output space, as illustrated in the diagram.

[Figure: (a) an associative-memory model component of a nervous system and (b) an associative-memory model using artificial neurons — in each, an input layer of m source nodes xk1, xk2, …, xkm connects through synaptic junctions to an output layer of m neurons yk1, yk2, …, ykm.]

The assumption is that each neuron acts as a linear combiner, as depicted in the signal-flow graph of a linear neuron labeled i, whose inputs xk1, xk2, …, xkm are weighted by wi1(k), wi2(k), …, wim(k). An activity pattern xk occurring in the input layer and the activity pattern yk occurring simultaneously in the output layer are written in their expanded forms as

Xk = [xk1, xk2, …, xkm]ᵀ
Yk = [yk1, yk2, …, ykm]ᵀ

The association of vector Xk with memorized vector Yk is described in matrix form as

Yk = W(k) Xk,  k = 1, 2, …, q

where the weight matrix W(k) is determined by the input-output pair (Xk, Yk).
A detailed description is given by

yki = Σj wij(k) xkj,  i = 1, 2, …, m  (j = 1, …, m)

Using matrix notation, yki is expressed in the equivalent form

yki = [wi1(k), wi2(k), …, wim(k)] [xk1, xk2, …, xkm]ᵀ,  i = 1, 2, …, m

or, for all m output neurons at once,

[yk1]   [w11(k) w12(k) … w1m(k)] [xk1]
[yk2] = [w21(k) w22(k) … w2m(k)] [xk2]
[ ⋮ ]   [  ⋮      ⋮         ⋮  ] [ ⋮ ]
[ykm]   [wm1(k) wm2(k) … wmm(k)] [xkm]

The individual presentations of the q pairs of associated patterns Xk → Yk, k = 1, 2, …, q, produce corresponding values of the individual weight matrices W(1), W(2), …, W(q). We define an m-by-m memory matrix as the sum of the weight matrices:

M = Σk W(k),  k = 1, …, q

The matrix M defines the overall connectivity between the input and output layers of the associative memory. In effect, it also represents the total experience gained by the memory, and it can be constructed by the recursion

Mk = Mk−1 + W(k),  k = 1, 2, …, q   …(A)

where the initial value M0 is zero (i.e., the synaptic weights in the memory are all initially zero), and the final value Mq is identically equal to M. When W(k) is added to Mk−1, the increment W(k) loses its distinct identity among the mixture of contributions that form Mk, although the information about the stimuli may not be lost.

Note: as the number q of stored patterns increases, the influence of a new pattern on the memory as a whole is progressively reduced.

Suppose the associative memory has learned the q pattern pairs. We may postulate M̂, denoting an estimate of M in terms of these patterns, as [1]:

M̂ = Σk yk xkᵀ,  k = 1, …, q   …(B)

The term yk xkᵀ represents the outer product of the patterns xk and yk, and is an estimate of the weight matrix W(k) that maps the input pattern xk onto the output pattern yk. Here xkj is the output of source node j in the input layer, and yki is the output of neuron i in the output layer. In the context of wij(k) for the kth association, source node j acts as a presynaptic node and neuron i acts as a postsynaptic node.
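Eqs. (A) and (B) can be checked numerically; the two pattern pairs below are assumptions chosen for illustration:

```python
import numpy as np

# q = 2 assumed pattern pairs in an m = 3 dimensional space.
x = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
y = [np.array([0.0, 2.0, 1.0]), np.array([3.0, 0.0, 1.0])]

# Eq. (B) built up as the recursion of eq. (A): M_k = M_{k-1} + y_k x_k^T,
# starting from M_0 = 0.
M = np.zeros((3, 3))
for xk, yk in zip(x, y):
    M += np.outer(yk, xk)        # the outer product estimates W(k)

print(M @ x[0])                  # recalls the stored pattern y_1
```

Because the two keys here happen to be orthonormal, applying a key to M returns its memorized pattern exactly.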
[1] Anderson, 1972, 1983; Cooper, 1970

The "local" learning process in eq. (B) may be viewed as a generalization of Hebb's postulate of learning. An associative memory so designed is called a correlation matrix memory. Correlation is the basis of learning, association, pattern recognition, and memory recall [2].

Eq. (B) may be reformulated as

M̂ = [y1, y2, …, yq] [x1, x2, …, xq]ᵀ = Y Xᵀ   …(C)

where X = [x1, x2, …, xq] is called the key matrix and Y = [y1, y2, …, yq] is called the memorized matrix.

[2] Eggermont, 1990

Eq. (C) may be reconstructed as the recursion

M̂k = M̂k−1 + yk xkᵀ,  k = 1, 2, …, q   …(D)

[Figure: signal-flow graph of recursion (D) — the outer product yk xkᵀ is added to the unit-delayed (z⁻¹I) matrix M̂k−1 to form M̂k.]

Comparing eqs. (A) and (D), we see that the outer product yk xkᵀ represents an estimate of the weight matrix W(k).

Recall

The problem posed by an associative memory is the addressing and recall of patterns stored in memory. Let a key pattern xj be picked at random and reapplied as a stimulus to the memory, yielding the response

y = M̂ xj   …(E)

Substituting eq. (B) into (E), we get

y = Σk yk xkᵀ xj = Σk (xkᵀ xj) yk = (xjᵀ xj) yj + Σk≠j (xkᵀ xj) yk   …(F)

Let the key vectors x1, x2, …, xq be normalized to have unit energy; i.e.,

Ek = Σl xkl² = xkᵀ xk = 1,  k = 1, 2, …, q

so that (xjᵀ xj) yj = yj and eq. (F) reads y = yj + vj. Here yj represents the "desired" response; it may be viewed as the "signal" component of the actual response y. The term vj = Σk≠j (xkᵀ xj) yk is a "noise vector" responsible for errors on recall.

Since ‖xk‖ = (xkᵀ xk)^½ = Ek^½ = 1, the cosine of the angle between xk and xj is

cos(xk, xj) = xkᵀ xj / (‖xk‖ ‖xj‖) = xkᵀ xj

⇒ cos(xk, xj) = 0 for k ≠ j if xk and xj are orthogonal
⇒ vj = Σk≠j cos(xk, xj) yk = 0; i.e., in such a case, y = yj
⇒ The memory associates perfectly if the key vectors form an orthonormal set; that is, if they satisfy the conditions xkᵀ xj = 1 for k = j and xkᵀ xj = 0 for k ≠ j.

Another question: what is the limit on the storage capacity of the associative memory? The answer lies in the rank, say r, of the memory matrix M. For a rectangular matrix of dimensions l-by-m, we have r ≤ min(l, m).
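The perfect-recall condition can be verified numerically; this is a sketch with assumed random patterns, using a QR factorization merely as a convenient way to manufacture orthonormal keys:

```python
import numpy as np

rng = np.random.default_rng(2)
m, q = 8, 4                      # input dimension and number of pairs (q <= m)

# Orthonormal key vectors (columns of X): x_k^T x_j = 1 if k = j, else 0.
X = np.linalg.qr(rng.standard_normal((m, q)))[0]
Y = rng.standard_normal((m, q))  # arbitrary memorized patterns (columns y_k)

M = Y @ X.T                      # eq. (C): correlation matrix memory

# Recall: the response to each key x_j is the stored y_j, with the noise
# term v_j vanishing because the off-diagonal inner products are zero.
responses = M @ X                # column j equals M x_j
print(np.allclose(responses, Y))
```

With q > m no such orthonormal set exists, which is the rank-based capacity limit discussed above.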
In the case of a correlation matrix memory, M is an m-by-m matrix.
⇒ r ≤ m
⇒ The number of patterns that can be stored can never exceed the dimensionality m of the input space.

What if the key patterns are neither orthogonal nor highly separated? Then a correlation matrix memory may get confused and make errors.

- First, define the community of a set of patterns as the lower bound on the inner products xkᵀ xj of the patterns in the set.
- Second, assume further that xkᵀ xj ≥ λ for k ≠ j. If λ is large enough, the memory may fail to distinguish the response y from that of any other key pattern. Say xj = x0 + v, where v is a stochastic vector. It is then likely that the memory will recognize x0 and associate with it a vector y0, rather than any of the actual pattern pairs used to train it.
⇒ x0 and y0 denote a pair of patterns never seen before.

Two MATLAB programming exercises (solutions in the slides that follow):

(1) In an unsupervised design, classify (0,0), (0,1), (1,0), (1,1) into two classes, i.e., the AND logic operation. Note: initial weight and bias values are given; observe the weight and bias values once the classification is complete.

(2) Classify the inputs of exercise (1) into two classes, but perform the OR logic operation with a supervised design. Note: no initial weight and bias values are set (i.e., both are zero; note the test in the program!). Inspect the weight and bias values once the classification is complete, then change the epochs value in the program and observe how the performance curve and the final weight and bias values vary.

```matlab
close all; clear; clc;
% Exercise (1): AND operation, unsupervised design %
net = newp([-2 2; -2 2], 1);
net.IW{1,1} = [2 2];
net.b{1} = [-3];
% input vectors %
p1 = [0; 0]; a1 = sim(net, p1)
p2 = [0; 1]; a2 = sim(net, p2)
% a sequence of two input vectors %
p3 = {[0;0] [0;1]}; a3 = sim(net, p3)
% a sequence of four input vectors %
p4 = {[0;0] [0;1] [1;0] [1;1]};
% simulated output and final weights/bias %
a4 = sim(net, p4)
net.IW{1,1}
net.b{1}
% decision boundary and weight vector %
xx = -.5:.1:2; yy = -net.b{1}/2 - xx;
x2 = 0:.1:1.2; y2 = x2;
plot(xx, yy, 0, 0, 'o', 0, 1, 'o', 1, 0, 'o', 1, 1, 'x', x2, y2);
legend('boundary')
gtext('W')
```

```matlab
close all; clear; clc;
% Exercise (2): OR operation, supervised design; try changing the epochs value %
net = newp([-2 2; -2 2], 1);
testp = [1 1 1 1];
net.IW{1,1}
net.b{1}
while any(testp ~= 0)   % repeat until every input is classified correctly
    net.trainParam.epochs = 4;
    p = [[0;0] [0;1] [1;0] [1;1]];
    t = [0 1 1 1];
    net = train(net, p, t);
    % the new weights and bias are %
    net.IW{1,1}
    net.b{1}
    % simulate the trained network for each of the inputs %
    a = sim(net, p)
    % the errors for the various inputs (err avoids shadowing the built-in error) %
    err = a - t
    testp = err;
end
% decision boundary and weight vector %
xx = -1:.1:2; yy = -net.b{1} - xx;
x2 = xx - .2; y2 = yy - .2;
x3 = 0:.01:.7; y3 = x3;
plot(x2, y2, 0, 0, 'o', 0, 1, 'x', 1, 0, 'x', 1, 1, 'x', x3, y3, '->')
legend('boundary')
gtext('weight')
```
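For readers without the Neural Network Toolbox, the supervised exercise (2) can be mirrored in plain Python; `newp`'s hard-limit neuron and the perceptron rule Δw = e·p, Δb = e are re-implemented by hand here, so this is a sketch rather than a translation of the toolbox behavior:

```python
import numpy as np

def hardlim(v):
    """Hard-limit activation used by perceptron neurons: 1 if v >= 0."""
    return 1.0 if v >= 0 else 0.0

p = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # the four input vectors (columns)
t = np.array([0.0, 1.0, 1.0, 1.0])     # OR targets

w = np.zeros(2)                        # weights and bias start at zero,
b = 0.0                                # as in the exercise

# Perceptron rule: for each input, a = hardlim(w.p + b) and e = t - a;
# then w <- w + e*p and b <- b + e; repeat until every input is correct.
while True:
    errors = 0
    for j in range(p.shape[1]):
        e = t[j] - hardlim(w @ p[:, j] + b)
        if e != 0.0:
            w += e * p[:, j]
            b += e
            errors += 1
    if errors == 0:
        break

print(w, b)   # a separating line for OR; here w = [1. 1.], b = -1.0
```

Since OR is linearly separable, the perceptron convergence theorem guarantees this loop terminates, just as the MATLAB `while` loop does.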