Blind Source Separation by Independent Components Analysis
Professor Dr. Barrie W. Jervis
School of Engineering
Sheffield Hallam University
• Temporally independent unknown source signals are linearly mixed in an unknown system to produce a set of measured output signals.
• It is required to determine the source signals.
• Methods of solving this problem are known as Blind Source Separation (BSS).
• In this presentation the method of Independent Components Analysis (ICA) will be described.
• The arrangement is illustrated in the next slide.
Arrangement for BSS by ICA
Neural Network Interpretation
• The si are the independent source signals,
• A is the linear mixing matrix,
• The xi are the measured signals,
• W ≈ A⁻¹ is the estimated unmixing matrix,
• The ui are the estimated source signals or activations, i.e. ui ≈ si,
• The gi(ui) are monotonic nonlinear
functions (sigmoids, hyperbolic tangents),
• The yi are the network outputs.
Principles of the Neural Network Approach
• Use Information Theory to derive an
algorithm which minimises the mutual
information between the outputs y=g(u).
• This minimises the mutual information
between the source signal estimates, u,
since g(u) introduces no dependencies.
• The different u are then temporally independent and are the estimated source signals.
• The magnitudes and signs of the estimated
source signals are unreliable since
– the magnitudes are not scaled
– the signs are undefined
because magnitude and sign information is
shared between the source signal vector
and the unmixing matrix, W.
• The order of the outputs is permuted compared with the inputs.
• Similar overlapping source signals may not
be properly extracted.
• If the number of output channels is less than the number of source signals, those source signals of lowest variance will not be extracted. This is a problem when these signals are of interest.
Information Theory I
• If X is a vector of variables (messages) xi which occur with probabilities P(xi), then the average information content of a stream of N messages is

  H(X) = −Σi P(xi) log2 P(xi) bits

and is known as the entropy of the random variable, X.
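As a quick numerical illustration, the entropy formula can be evaluated directly (a minimal sketch using NumPy; the two example distributions are made up for illustration):

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum P(x) log2 P(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with P(x) = 0 contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))   # fair coin: 1 bit
print(entropy_bits([0.25] * 4))   # four equiprobable messages: 2 bits
```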
Information Theory II
• Note that the entropy is expressible in terms of the probabilities P(xi).
• Given the probability density function (pdf) of X we can find the associated entropy.
• This link between entropy and pdf is of the
greatest importance in ICA theory.
Information Theory III
• The joint entropy between two random
variables X and Y is given by
  H(X,Y) = −Σx Σy P(x,y) log2 P(x,y)
• For independent variables
  H(X,Y) = H(X) + H(Y) iff P(x,y) = P(x)P(y)
Information Theory IV
• The conditional entropy of Y given X
measures the average uncertainty remaining
about y when x is known, and is
  H(Y|X) = −Σx Σy P(x,y) log2 P(y|x)
• The mutual information between Y and X is
  I(Y,X) = H(Y) + H(X) − H(Y,X) = H(Y) − H(Y|X)
• In ICA, X represents the measured signals,
which are applied to the nonlinear function g(u)
to obtain the outputs Y.
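These identities can be checked numerically on a small discrete joint distribution (an illustrative sketch; the 2×2 joint probability table is made up):

```python
import numpy as np

def H(p):
    """Entropy in bits of a (possibly joint) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative joint distribution P(x, y): rows index x, columns index y.
Pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

I = H(Py) + H(Px) - H(Pxy)        # I(Y,X) = H(Y) + H(X) - H(Y,X)
H_Y_given_X = H(Pxy) - H(Px)      # H(Y|X) = H(Y,X) - H(X)
print(np.isclose(I, H(Py) - H_Y_given_X))  # the two forms of I(Y,X) agree
```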
Bell and Sejnowski’s ICA Theory (1995)
• Aim: to maximise the amount of mutual information between the inputs X and the outputs Y of the neural network:

  I(Y,X) = H(Y) − H(Y|X)

(H(Y) is the uncertainty about Y when X is unknown.)
• Y is a function of W and g(u).
• Here we seek to determine the W which produces the ui ≈ si, assuming the correct g(u).

  ∂I(Y,X)/∂W = ∂H(Y)/∂W − ∂H(Y|X)/∂W = ∂H(Y)/∂W

(∂H(Y|X)/∂W = 0, since that part of Y did not come through W from X.)
So, maximising this mutual information is equivalent to maximising the joint output entropy,

  H(y1, ..., yN) = H(y1) + ... + H(yN) − I(y1, ..., yN)

which is seen to be equivalent to minimising the mutual information between the outputs and hence the ui, as desired.
The Functions g(u)
• The outputs yi are amplitude bounded random
variables, and so the marginal entropies H(yi) are
maximum when the yi are uniformly distributed
- a known statistical result.
• With the H(yi) maximised, I(y1, ..., yN) = 0, and the yi uniformly distributed, the nonlinearity gi(ui) has the form of the cumulative distribution function of the probability density function of the si
- a proven result.
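This CDF property can be illustrated with the probability integral transform: passing samples through their own cumulative distribution function yields a uniformly distributed output, which has maximum entropy for a bounded variable. A sketch assuming unit-Laplace sources (an illustrative choice of pdf):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.laplace(size=100_000)         # samples from an assumed source pdf

def laplace_cdf(u):
    """CDF of the unit Laplace density 0.5*exp(-|u|)."""
    return 0.5 + 0.5 * np.sign(u) * (1.0 - np.exp(-np.abs(u)))

y = laplace_cdf(u)                    # g(u) chosen as the source CDF
counts, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
print(np.round(counts / len(u), 2))   # every bin close to 0.1: y is uniform
```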
Pause and review g(u) and W
• W has to be chosen to maximise the joint output entropy H(Y), which minimises the mutual information between the estimated source signals, ui.
• The g(u) should be the cumulative distribution functions of the source signals, si.
• Determining the g(u) is a major problem.
One input and one output
• For a monotonic nonlinear function, g(x), the output pdf is

  fy(y) = fx(x) / |∂y/∂x|

• Also

  H(y) = −E[ln fy(y)] = −∫ fy(y) ln fy(y) dy

so that

  H(y) = E[ln |∂y/∂x|] − E[ln fx(x)]

(we only need to maximise the first term; the second is independent of W)
• A stochastic gradient ascent learning rule is adopted to maximise H(y):

  Δw ∝ ∂H/∂w = ∂/∂w (ln |∂y/∂x|) = (∂y/∂x)⁻¹ ∂(∂y/∂x)/∂w

• Further progress requires knowledge of g(u). Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e.

  y = 1 / (1 + e^(−u))

• Also assume u = wx + w0.
Learning Rule: 1 input, 1 output
Hence, we find:
  ∂y/∂x = wy(1 − y)
  ∂/∂w (∂y/∂x) = y(1 − y)(1 + wx(1 − 2y))

from which:

  Δw ∝ 1/w + x(1 − 2y), and
  Δw0 ∝ 1 − 2y
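The ∂y/∂x expression above can be checked against a numerical derivative (a minimal sketch; the weight and bias values are arbitrary):

```python
import numpy as np

w, w0 = 1.7, 0.3                                   # arbitrary scalar weight and bias
g = lambda x: 1.0 / (1.0 + np.exp(-(w * x + w0)))  # y = g(wx + w0), sigmoidal

x = 0.4
y = g(x)
analytic = w * y * (1.0 - y)                       # dy/dx = w y (1 - y)
numeric = (g(x + 1e-6) - g(x - 1e-6)) / 2e-6       # central difference
print(abs(analytic - numeric) < 1e-8)              # True
```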
Learning Rule: N inputs, N outputs
  fy(y) = fx(x) / |J|

where J is the Jacobian, the determinant of the matrix of partial derivatives ∂yi/∂xj.
• Assuming g(u) is sigmoidal again, we obtain:

  ΔW ∝ [Wᵀ]⁻¹ + (1 − 2y)xᵀ
  Δw0 ∝ 1 − 2y

where 1 is the unit vector.
• The network is trained until the changes in the weights become acceptably small at each iteration.
• Thus the unmixing matrix W is found.
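A single batch update under this rule might be sketched as follows (illustrative sizes and random data; W here is just an arbitrary well-conditioned starting matrix, not a trained estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 3, 500                                    # channels, batch size (illustrative)
W = rng.normal(size=(N, N)) + 2.0 * np.eye(N)    # current unmixing estimate
x = rng.laplace(size=(N, B))                     # batch of measured signals

u = W @ x                                        # activations
y = 1.0 / (1.0 + np.exp(-u))                     # sigmoidal g(u)
# Batch form of the rule: dW ∝ [W^T]^-1 + (1 - 2y) x^T, outer products averaged
dW = np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ x.T / B
W_new = W + 0.01 * dW                            # one gradient-ascent step
```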
The Natural Gradient
• The computation of the inverse matrix [Wᵀ]⁻¹ is time-consuming, and may be avoided by rescaling the entropy gradient, multiplying it by WᵀW. For a sigmoidal g(u) we obtain

  ΔW ∝ (∂H(y)/∂W) WᵀW = [I + (1 − 2y)uᵀ] W
• This is the natural gradient, introduced by
Amari (1998), and now widely adopted.
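The equivalence between the rescaled ordinary gradient and the natural-gradient form can be verified in code, since [Wᵀ]⁻¹WᵀW = W and xᵀWᵀ = uᵀ (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 3, 400
W = rng.normal(size=(N, N)) + 2.0 * np.eye(N)    # arbitrary well-conditioned W
x = rng.laplace(size=(N, B))
u = W @ x
y = 1.0 / (1.0 + np.exp(-u))

grad = np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ x.T / B  # ordinary entropy gradient
nat = (np.eye(N) + (1.0 - 2.0 * y) @ u.T / B) @ W      # [I + (1 - 2y) u^T] W

print(np.allclose(grad @ W.T @ W, nat))  # True: they differ only by the W^T W rescaling
```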
The nonlinearity, g(u)
• We have already learnt that the g(u) should be the cumulative distribution functions of the individual source distributions.
• So far the g(u) have been assumed to be
sigmoidal, so what are the pdfs of the si?
• The corresponding pdfs of the si are super-Gaussian.
Super- and sub-Gaussian pdfs
* Note: there are no mathematical definitions of
super- and sub-Gaussians
Super- and sub-Gaussians
Super-Gaussian:
• kurtosis (fourth order central moment, measures the flatness of the pdf) > 0.
• infrequent signals of short duration, e.g. evoked brain signals.
Sub-Gaussian:
• kurtosis < 0.
• signals mainly “on”, e.g. 50/60 Hz electrical mains supply, but also eye blinks.
• Kurtosis = 4th order central moment:

  mx(4) = E[(x − mx)⁴]
and is seen to be calculated from the current
estimates of the source signals.
• To separate the independent sources, information about their pdfs such as skewness (3rd moment) and flatness (kurtosis) is required.
• First and 2nd moments (mean and variance) are not sufficient.
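The sign convention used here is that of excess kurtosis (the 4th central moment normalised by the squared variance, minus 3, so that a Gaussian scores zero). A quick illustration with an assumed super-Gaussian (Laplace) and sub-Gaussian (uniform) sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    """4th central moment / variance^2, minus 3; zero for a Gaussian."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    return np.mean((x - m) ** 4) / np.var(x) ** 2 - 3.0

lap = rng.laplace(size=200_000)          # super-Gaussian: peaked, heavy-tailed
uni = rng.uniform(-1, 1, size=200_000)   # sub-Gaussian: flat, bounded
print(excess_kurtosis(lap) > 0, excess_kurtosis(uni) < 0)  # True True
```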
A more generalised learning rule
• Girolami (1997) showed that tanh(ui) and −tanh(ui) could be used for super- and sub-Gaussians respectively.
• Cardoso and Laheld (1996) developed a stability
analysis to determine whether the source signals
were to be considered super- or sub-Gaussian.
• Lee, Girolami, and Sejnowski (1998) applied these
findings to develop their extended infomax
algorithm for super- and sub-Gaussians using a
kurtosis-based switching rule.
Extended Infomax Learning Rule
• With super-Gaussians modelled as

  p(u) ∝ N(0,1) sech²(u)

and sub-Gaussians as a Pearson mixture model

  p(u) = ½ [N(μ, σ²) + N(−μ, σ²)]

the new extended learning rule is

  ΔW ∝ [I − K tanh(u)uᵀ − uuᵀ] W,
    ki = +1: super-Gaussian,
    ki = −1: sub-Gaussian,

where the ki are the elements of the N-dimensional diagonal matrix, K, and

  ki = sign( E[sech²(ui)] E[ui²] − E[tanh(ui)ui] )
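The switching statistic for ki is estimated from the current activations, as stated above; a sketch applying it row-wise to assumed super- and sub-Gaussian activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_switch(u):
    """k_i = sign( E[sech^2(u_i)] E[u_i^2] - E[tanh(u_i) u_i] ), per row of u."""
    sech2 = 1.0 / np.cosh(u) ** 2
    return np.sign(sech2.mean(axis=1) * (u ** 2).mean(axis=1)
                   - (np.tanh(u) * u).mean(axis=1))

u = np.vstack([rng.laplace(size=100_000),           # super-Gaussian activations
               rng.uniform(-2, 2, size=100_000)])   # sub-Gaussian activations
k = k_switch(u)   # +1 for the super-Gaussian row, -1 for the sub-Gaussian row
```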
• Modifications of the formula for ki exist, but in our experience the extended algorithm has been unsatisfactory.
Reasons for unsatisfactory extended algorithm
1) Initial assumptions about super- and sub-Gaussian
distributions may be too inaccurate.
2) The switching criterion may be inadequate.
Possible solutions:
• Postulate vague distributions for the source signals which are then developed iteratively during training.
• Use an alternative approach, e.g. statistically based JADE (Cardoso).
Summary so far
• We have seen how W may be obtained by training the network, and the extended algorithm for switching between super- and sub-Gaussians has been described.
• Alternative approaches have been mentioned.
• Next we consider how to obtain the source signals
knowing W and the measured signals, x.
Source signal determination
• The system is:
• Hence U = W·x and x = A·S, where A ≈ W⁻¹ and U ≈ S.
• The rows of U are the estimated source signals, known as activations (as functions of time).
• The rows of x are the time-varying measured signals.
• In matrix form, (M×N) = (M×M)·(M×N):

  [ u11 u12 … u1N ]   [ w11 w12 … w1M ] [ x11 x12 … x1N ]
  [  ⋮           ⋮ ] = [  ⋮           ⋮ ] [  ⋮           ⋮ ]
  [ uM1 …     uMN ]   [ wM1 …     wMM ] [ xM1 …     xMN ]

  (columns index time, or sample number)
Expressions for the Activations
  u11 = w11·x11 + w12·x21 + … + w1M·xM1
  u12 = w11·x12 + w12·x22 + … + w1M·xM2

• We see that consecutive values of u are obtained by filtering consecutive columns of x by the same row of W, e.g.

  u21 = w21·x11 + w22·x21 + w23·x31 + … + w2M·xM1

• The ith row of u is formed by multiplying the ith row of W into the columns of x.
Procedure
• Record N time points from each of M sensors, where N ≥ 5M.
• Pre-process the data, e.g. filtering, trend removal.
• Sphere the data using Principal Components
Analysis (PCA). This is not essential but speeds
up the computation by first removing first and
second order moments.
• Compute the ui ≈ si. Include desphering.
• Analyse the results.
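The whole procedure might be sketched end-to-end on a toy two-channel problem (synthetic Laplacian sources, an arbitrary mixing matrix, and the natural-gradient infomax rule from the earlier slides; all sizes and learning rates are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 2, 10_000
S = rng.laplace(size=(M, T))                   # unknown super-Gaussian sources
A = np.array([[0.9, 0.5], [0.3, 0.8]])         # unknown mixing matrix
X = A @ S                                      # measured signals (T >= 5M)

# Sphere with PCA: zero mean, identity covariance.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Xc))
sphere = E @ np.diag(d ** -0.5) @ E.T
Z = sphere @ Xc

# Train W with the natural-gradient infomax rule, sigmoidal g(u).
W, lr = np.eye(M), 0.01
for _ in range(50):                            # epochs
    for t in range(0, T, 50):                  # mini-batches
        z = Z[:, t:t + 50]
        u = W @ z
        y = 1.0 / (1.0 + np.exp(-u))
        W += lr * (np.eye(M) + (1.0 - 2.0 * y) @ u.T / z.shape[1]) @ W

U = W @ sphere @ Xc    # estimated activations (sphering folded into the unmixing)
```

Up to scale, sign, and permutation, the rows of U should then match the rows of S.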
Optional Procedures I
• The contribution of each activation at a
sensor may be found by “back-projecting” it
to the sensor.
  x = W⁻¹ŝ

where ŝ is the activation matrix with every row except the chosen one (say row 2) set to zero:

  ŝ = [  0    0   …   0  ]
      [ s21  s22  …  s2N ]
      [  0    0   …   0  ]

With w̃ij denoting the elements of W⁻¹, the back-projected contribution at sensor i is then

  xit = w̃i2·s2t
Optional Procedures II
• A measured signal which is contaminated by artefacts or noise may be extracted by “back-projecting” all the signal activations to the measurement electrode, setting the other (artefact and noise) activations to zero. (An artefact and noise removal method.)
For example, retaining the signal activations (say rows 1 and 2) and zeroing the rest:

  x = W⁻¹ŝ,   ŝ = [ s11 s12 … s1N ]
                  [ s21 s22 … s2N ]
                  [  0   0  …  0  ]

so that the cleaned signal at sensor 2 is

  x2t = w̃21·s1t + w̃22·s2t
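Zeroing an activation before back-projection, as described above, is equivalent to subtracting that activation's contribution from the full reconstruction (a sketch with made-up activations and an arbitrary stand-in for W⁻¹):

```python
import numpy as np

rng = np.random.default_rng(3)
M, T = 3, 1_000
U = rng.laplace(size=(M, T))       # estimated activations; treat row 0 as artefact
W_inv = rng.normal(size=(M, M))    # stand-in for W^-1 (the estimated mixing matrix)

U_clean = U.copy()
U_clean[0, :] = 0.0                # set the artefact activation to zero
X_clean = W_inv @ U_clean          # back-project the remaining activations

# Equivalent view: the artefact's contribution removed from the full projection.
X_full = W_inv @ U
print(np.allclose(X_clean, X_full - np.outer(W_inv[:, 0], U[0])))  # True
```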
Current Developments
• Overcomplete representations - more signal sources than sensors.
• Nonlinear mixing.
• Nonstationary sources.
• General formulation of g(u).
Conclusions
• It has been shown how to extract temporally independent unknown source signals from their linear mixtures at the outputs of an unknown system using Independent Components Analysis.
• Some of the limitations of the method have been described.
• Current developments have been highlighted.