#### Transcript Blind Source Separation by Independent Components Analysis

Blind Source Separation by Independent Components Analysis Professor Dr. Barrie W. Jervis School of Engineering Sheffield Hallam University England [email protected] The Problem • Temporally independent unknown source signals are linearly mixed in an unknown system to produce a set of measured output signals. • It is required to determine the source signals. • Methods of solving this problem are known as Blind Source Separation (BSS) techniques. • In this presentation the method of Independent Components Analysis (ICA) will be described. • The arrangement is illustrated in the next slide. Arrangement for BSS by ICA s1 s2 A x1 x2 W u1 u2 g(.) y1=g1(u1) y2=g2(u2) : : sn : xn : un yn=gn(un) Neural Network Interpretation • • • • • The si are the independent source signals, A is the linear mixing matrix, The xi are the measured signals, W A-1 is the estimated unmixing matrix, The ui are the estimated source signals or activations, i.e. ui si, • The gi(ui) are monotonic nonlinear functions (sigmoids, hyperbolic tangents), • The yi are the network outputs. Principles of Neural Network Approach • Use Information Theory to derive an algorithm which minimises the mutual information between the outputs y=g(u). • This minimises the mutual information between the source signal estimates, u, since g(u) introduces no dependencies. • The different u are then temporally independent and are the estimated source signals. Cautions I • The magnitudes and signs of the estimated source signals are unreliable since – the magnitudes are not scaled – the signs are undefined because magnitude and sign information is shared between the source signal vector and the unmixing matrix, W. • The order of the outputs is permutated compared wiith the inputs Cautions II • Similar overlapping source signals may not be properly extracted. • If the number of output channels number of source signals, those source signals of lowest variance will not be extracted. This is a problem when these signals are important. Information Theory I • If X is a vector of variables (messages) xi which occur with probabilities P(xi), then the average information content of a stream of N messages is N H ( X ) P ( x ) log2 P ( x ) bits i 1 and is known as the entropy of the random variable, X. Information Theory II • Note that the entropy is expressible in terms of probability. • Given the probability density distribution (pdf) of X we can find the associated entropy. • This link between entropy and pdf is of the greatest importance in ICA theory. Information Theory III • The joint entropy between two random variables X and Y is given by H ( X ,Y ) P ( x, y ) log2 P ( x, y ) x ,y • For independent variables H ( X ,Y ) H ( X ) H (Y ) iff P ( x, y ) P ( x )P ( y ) Information Theory IV • The conditional entropy of Y given X measures the average uncertainty remaining about y when x is known, and is H (Y | X ) P ( x, y ) log2P ( y | x ) x ,y • The mutual information between Y and X is I (Y , X ) H (Y ) H ( X ) H (Y , X ) H (Y ) H (Y | X ) • In ICA, X represents the measured signals, which are applied to the nonlinear function g(u) to obtain the outputs Y. Bell and Sejnowski’s ICA Theory (1995) •Aim to maximise the amount of mutual information between the inputs X and the outputs Y of the neural network. I (Y , X ) H (Y ) H (Y | X ) (Uncertainty about Y when X is unknown) • Y is a function of W and g(u). • Here we seek to determine the W which produces the ui si, assuming the correct Differentiating: I (Y , X ) H (Y ) H (Y | X ) H (Y ) w w w w (=0, since it did not come through W from X.) So, maximising this mutual information is equivalent to maximising the joint output entropy, H ( y1,...y N ) H ( y1 ) .... H ( y N ) I ( y1,...y N ) which is seen to be equivalent to minimising the mutual information between the outputs and hence the ui, as desired. The Functions g(u) • The outputs yi are amplitude bounded random variables, and so the marginal entropies H(yi) are maximum when the yi are uniformly distributed - a known statistical result. • With the H(yi) maximised, I(Y,X) = 0, and the yi uniformly distributed, the nonlinearity gi(ui) has the form of the cumulative distribution function of the probability density function of the si, - a proven result. Pause and review g(u) and W • W has to be chosen to maximise the joint output entropy H(Y,X), which minimises the mutual information between the estimated source signals, ui. • The g(u) should be the cumulative distribution functions of the source signals, si. • Determining the g(u) is a major problem. One input and one output • For a monotonic nonlinear function, g(x), fx ( x ) fy ( y ) y x • Also H ( y ) E ln fy ( y ) fy ( y ) ln fy ( y )dy • Substituting: y H ( y ) E ln E ln fx ( x ) x (we only need to maximise this term) (independent of W) • A stochastic gradient ascent learning rule is adopted to maximise H(y) by assuming H y w ln w w x 1 y y x w x • Further progress requires knowledge of g(u). Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e. y 1 1 e • Also assume u wx w 0 u Learning Rule: 1 input, 1 output Hence, we find: y wy 1 y x y y 1 y 1 wx 1 2y w x from which : 1 w x 1 2y , and w w 0 1 2y Learning Rule: N inputs, N outputs • Need fx x fy ( y ) , J where J is the Jacobian y1 y1 x x 1 N J det y n y N xN x1 • Assuming g(u) is sigmoidal again, we obtain: W W 1 2yx T 1 T , and w 0 1 2y, where 1 is the unit vector • The network is trained until the changes in the weights become acceptably small at each iteration. • Thus the unmixing matrix W is found. The Natural Gradient • The computation of the inverse matrix W is time-consuming, and may be avoided by rescaling the entropy gradient by multiplying it by WWT ( 1) T 1 • Thus, for a sigmoidal g(u) we obtain H ( y) W 1 (1 2y)uT W W • This is the natural gradient, introduced by Amari (1998), and now widely adopted. The nonlinearity, g(u) • We have already learnt that the g(u) should be the cumulative probability densities of the individual source distributions. • So far the g(u) have been assumed to be sigmoidal, so what are the pdfs of the si? • The corresponding pdfs of the si are superGaussian. Super- and sub-Gaussian pdfs P si Gaussian Super-Gaussian Sub-Gaussian si * Note: there are no mathematical definitions of super- and sub-Gaussians Super- and sub-Gaussians Super-Gaussians: •kurtosis (fourth order central moment, measures the flatness of the pdf) > 0. •infrequent signals of short duration, e.g. evoked brain signals. Sub-Gaussians •kurtosis < 0 •signals mainly “on”, e.g. 50/60 Hz electrical mains supply, but also eye blinks. Kurtosis • Kurtosis = 4th order central moment = mx ( 4 ) E x mx 4 E ui4 E u 2 i 2 3 and is seen to be calculated from the current estimates of the source signals. • To separate the independent sources, information about their pdfs such as skewness (3rd. moment) and flatness (kurtosis) is required. • First and 2nd. moments (mean and variance) are insufficient. A more generalised learning rule • Girolami (1997) showed that tanh(ui) and -tanh(ui) could be used for super- and sub-Gaussians respectively. • Cardoso and Laheld (1996) developed a stability analysis to determine whether the source signals were to be considered super- or sub-Gaussian. • Lee, Girolami, and Sejnowski (1998) applied these findings to develop their extended infomax algorithm for super- and sub-Gaussians using a kurtosis-based switching rule. Extended Infomax Learning Rule • With super-Gaussians modelled as p(u ) N(0,1) sec h (u ) 2 and sub-Gaussians as a Pearson mixture model 1 p(u ) N , 2 N , 2 2 the new extended learning rule is W 1 K tanhuuT uuT W, ki 1, super Gaussian, ki 1, sub Gaussian. Switching Decision W 1 K tanhuu uu W, T T ki 1, super Gaussian, ki 1, sub Gaussian. and the ki are the elements of the N-dimensional diagonal matrix, K, and ki sign E sec h 2 (ui )E ui2 E tanh( ui )ui • Modifications of the formula for ki exist, but in our experience the extended algorithm has been unsatisfactory. Reasons for unsatisfactory extended algorithm 1) Initial assumptions about super- and sub-Gaussian distributions may be too inaccurate. 2) The switching criterion may be inadequate. Alternatives • Postulate vague distributions for the source signals which are then developed iteratively during training. • Use an alternative approach, e.g, statistically based, JADE (Cardoso). Summary so far • We have seen how W may be obtained by training the network, and the extended algorithm for switching between super- and sub-Gaussians has been described. • Alternative approaches have been mentioned. • Next we consider how to obtain the source signals knowing W and the measured signals, x. Source signal determination • The system is: si unknown Mixing matrix A xi Unmixing matrix W measured g(u) uisi yi estimated •Hence U=W.x and x=A.S where AW-1, and US. •The rows of U are the estimated source signals, known as activations (as functions of time). •The rows of x are the time-varying measured signals. Source Signals U W M xN M xM Channel number . x M xN u11 u12 u1N w11 w12 w1M x11 x12 x1N u u w x x w MN N1 MM M 1 MN M1 Time, or sample number Expressions for the Activations u11 w11x11 w12 x21 w1M xM 1 u12 w11x12 w12 x22 w1M xM 2 • We see that consecutive values of u are obtained by filtering consecutive columns of x by the same row of W. u21 w 21x11 w 22 x21 w 23 x31 w 2N xN1 • The ith row of u is the ith row of w by the columns of x. Procedure • Record N time points from each of M sensors, where N 5M. • Pre-process the data, e.g. filtering, trend removal. • Sphere the data using Principal Components Analysis (PCA). This is not essential but speeds up the computation by first removing first and second order moments. • Compute the ui si. Include desphering. • Analyse the results. Optional Procedures I • The contribution of each activation at a sensor may be found by “back-projecting” it to the sensor. x W 1s 1 1 w11 w12 w1M1 0 0 0 0 1 1 1 x w 21 w 22 w 2M s21 s22 s23 s2N 0 0 0 0 1 1 1 x21 w 21 .0 w 22 .s21 w 2M1 .0 w 22 .s21 1 x22 w 22 .s22 , 1 x2M w 22 s2 N Optional Procedures II • A measured signal which is contaminated by artefacts or noise may be extracted by “backprojecting” all the signal activations to the measurement electrode, setting other activations to zero. (An artefact and noise removal method). 1 1 w11 w12 w1M1 s11 s12 1 1 1 x w 21 w 22 w 2M s21 s22 0 0 s13 s1N s23 s2N 0 0 1 1 1 1 x21 w 21 .s11 w 22 .s21 w 2M1 .0 w 21 .s11 w 22 .s21 1 1 x22 w 21 .s12 w 22 .s22 , 1 1 x2N w 21 .s1N w 22 .s2N Current Developments • Overcomplete representations - more signal sources than sensors. • Nonlinear mixing. • Nonstationary sources. • General formulation of g(u). Conclusions • It has been shown how to extract temporally independent unknown source signals from their linear mixtures at the outputs of an unknown system using Independent Components Analysis. • Some of the limitations of the method have been mentioned. • Current developments have been highlighted.