Blind Source Separation by Independent Components Analysis
Professor Dr. Barrie W. Jervis
School of Engineering
Sheffield Hallam University
England
[email protected]
The Problem
• Temporally independent unknown source
signals are linearly mixed in an unknown
system to produce a set of measured output
signals.
• It is required to determine the source
signals.
• Methods of solving this problem are known
as Blind Source Separation (BSS)
techniques.
• In this presentation the method of
Independent Components Analysis (ICA)
will be described.
• The arrangement is illustrated in the next
slide.
Arrangement for BSS by ICA
[Figure: block diagram of the BSS/ICA arrangement. Sources s1, …, sn enter the unknown mixing matrix A to give the measured signals x1, …, xn; the unmixing matrix W produces the activations u1, …, un; the nonlinearities g(.) give the outputs y1 = g1(u1), …, yn = gn(un).]
Neural Network Interpretation
• The si are the independent source signals,
• A is the linear mixing matrix,
• The xi are the measured signals,
• W ≈ A⁻¹ is the estimated unmixing matrix,
• The ui are the estimated source signals or activations, i.e. ui ≈ si,
• The gi(ui) are monotonic nonlinear functions (sigmoids, hyperbolic tangents),
• The yi are the network outputs.
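A minimal sketch of this arrangement in Python/NumPy; the two sources, the mixing matrix A and the sigmoidal g are hypothetical choices for illustration only, not taken from the presentation:

```python
# Minimal sketch of the arrangement: s -> A -> x -> W -> u, y = g(u).
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                        # number of time samples

s = np.vstack([rng.laplace(size=N),               # s1: a super-Gaussian source
               np.sign(rng.standard_normal(N))])  # s2: a sub-Gaussian (binary) source

A = np.array([[1.0, 0.6],                         # unknown linear mixing matrix
              [0.4, 1.0]])
x = A @ s                                         # measured signals

W = np.linalg.inv(A)                              # ideal unmixing matrix, W ~ A^-1;
                                                  # ICA must estimate W blindly from x
u = W @ x                                         # activations, u_i ~ s_i
y = 1.0 / (1.0 + np.exp(-u))                      # outputs y_i = g_i(u_i), sigmoidal g
```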
Principles of Neural Network
Approach
• Use Information Theory to derive an
algorithm which minimises the mutual
information between the outputs y=g(u).
• This minimises the mutual information
between the source signal estimates, u,
since g(u) introduces no dependencies.
• The different u are then temporally
independent and are the estimated source
signals.
Cautions I
• The magnitudes and signs of the estimated
source signals are unreliable since
– the magnitudes are not scaled
– the signs are undefined
because magnitude and sign information is
shared between the source signal vector
and the unmixing matrix, W.
• The order of the outputs is permuted compared with the inputs.
Cautions II
• Similar overlapping source signals may not
be properly extracted.
• If the number of output channels is less than the number of source signals, the source signals of lowest variance will not be extracted. This is a problem when those signals are important.
Information Theory I
• If X is a vector of variables (messages) xi
which occur with probabilities P(xi), then
the average information content of a stream
of N messages is
$$H(X) = -\sum_{i=1}^{N} P(x_i)\log_2 P(x_i)\ \text{bits}$$
and is known as the entropy of the random variable,
X.
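A short illustrative sketch (not from the slides) of this entropy for a discrete distribution:

```python
# Entropy in bits of a discrete random variable X with probabilities P(x_i):
#   H(X) = -sum_i P(x_i) log2 P(x_i)
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete probability vector p (entries sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # terms with P = 0 contribute nothing
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))        # 1.0 bit  (fair coin)
print(entropy_bits([0.25] * 4))        # 2.0 bits (four equally likely messages)
```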
Information Theory II
• Note that the entropy is expressible in terms
of probability.
• Given the probability density function (pdf) of X we can find the associated entropy.
• This link between entropy and pdf is of the
greatest importance in ICA theory.
Information Theory III
• The joint entropy between two random
variables X and Y is given by
$$H(X,Y) = -\sum_{x,y} P(x,y)\log_2 P(x,y)$$
• For independent variables
$$H(X,Y) = H(X) + H(Y)\quad\text{iff}\quad P(x,y) = P(x)P(y)$$
Information Theory IV
• The conditional entropy of Y given X
measures the average uncertainty remaining
about y when x is known, and is
$$H(Y\mid X) = -\sum_{x,y} P(x,y)\log_2 P(y\mid x)$$
• The mutual information between Y and X is
$$I(Y,X) = H(Y) + H(X) - H(Y,X) = H(Y) - H(Y\mid X)$$
• In ICA, X represents the measured signals,
which are applied to the nonlinear function g(u)
to obtain the outputs Y.
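As an illustration of these definitions, a sketch computing I(Y, X) from a discrete joint probability table; the table itself is an assumed example:

```python
# Mutual information I(Y, X) = H(Y) + H(X) - H(Y, X) from a discrete joint
# probability table pxy (rows index x, columns index y).
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(pxy):
    hx = entropy_bits(pxy.sum(axis=1))      # marginal entropy H(X)
    hy = entropy_bits(pxy.sum(axis=0))      # marginal entropy H(Y)
    hxy = entropy_bits(pxy.ravel())         # joint entropy H(X, Y)
    return hx + hy - hxy

# Independent variables: P(x, y) = P(x) P(y), so I(Y, X) is (numerically) zero.
pxy = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(pxy))
```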
Bell and Sejnowski’s ICA Theory (1995)
•Aim to maximise the amount of mutual
information between the inputs X and the outputs
Y of the neural network.
$$I(Y,X) = H(Y) - H(Y\mid X)$$
(Uncertainty about Y when X is unknown)
• Y is a function of W and g(u).
• Here we seek to determine the W which produces the ui ≈ si, assuming the correct g(u).
Differentiating:
$$\frac{\partial I(Y,X)}{\partial W} = \frac{\partial H(Y)}{\partial W} - \frac{\partial H(Y\mid X)}{\partial W} = \frac{\partial H(Y)}{\partial W}$$
(since ∂H(Y|X)/∂W = 0: that term did not come through W from X.)
So, maximising this mutual information is
equivalent to maximising the joint output entropy,
$$H(y_1,\ldots,y_N) = H(y_1) + \cdots + H(y_N) - I(y_1,\ldots,y_N)$$
which is seen to be equivalent to minimising the
mutual information between the outputs and hence
the ui, as desired.
The Functions g(u)
• The outputs yi are amplitude bounded random
variables, and so the marginal entropies H(yi) are
maximum when the yi are uniformly distributed
- a known statistical result.
• With the H(yi) maximised, the mutual information between the outputs equal to zero, and the yi uniformly distributed, the nonlinearity gi(ui) has the form of the cumulative distribution function of the probability density function of the si
- a proven result.
Pause and review g(u) and W
• W has to be chosen to maximise the joint output entropy H(y1,...,yN), which minimises the mutual information between the estimated source signals, ui.
• The g(u) should be the cumulative
distribution functions of the source signals,
si.
• Determining the g(u) is a major problem.
One input and one output
• For a monotonic nonlinear function, g(x),
$$f_y(y) = \frac{f_x(x)}{\left|\partial y/\partial x\right|}$$
• Also
$$H(y) = -E\left[\ln f_y(y)\right] = -\int f_y(y)\ln f_y(y)\,dy$$
• Substituting:
$$H(y) = E\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] - E\left[\ln f_x(x)\right]$$
(we only need to maximise the first term; the second is independent of W)
• A stochastic gradient ascent learning rule is adopted to maximise H(y), using
$$\Delta w \propto \frac{\partial H}{\partial w} = \frac{\partial}{\partial w}\ln\left|\frac{\partial y}{\partial x}\right| = \left(\frac{\partial y}{\partial x}\right)^{-1}\frac{\partial}{\partial w}\left(\frac{\partial y}{\partial x}\right)$$
• Further progress requires knowledge of g(u). Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e.
$$y = \frac{1}{1 + e^{-u}}$$
• Also assume $u = wx + w_0$.
Learning Rule: 1 input, 1 output
Hence, we find:
$$\frac{\partial y}{\partial x} = wy(1-y)$$
$$\frac{\partial}{\partial w}\left(\frac{\partial y}{\partial x}\right) = y(1-y)\left(1 + wx(1-2y)\right)$$
from which:
$$\Delta w \propto \frac{1}{w} + x(1-2y), \qquad \Delta w_0 \propto 1 - 2y$$
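A sketch of this one-input, one-output rule in Python; the input signal, initial weights and learning rate are illustrative assumptions:

```python
# Sketch of the 1-input, 1-output stochastic gradient ascent rule above:
#   dw ∝ 1/w + x(1 - 2y),  dw0 ∝ 1 - 2y,  with y = 1 / (1 + exp(-(w x + w0)))
import numpy as np

rng = np.random.default_rng(1)
x_data = rng.laplace(size=5000)        # a single (super-Gaussian) input signal
w, w0, lr = 0.1, 0.0, 0.01             # initial weight, bias, learning rate

for x in x_data:
    y = 1.0 / (1.0 + np.exp(-(w * x + w0)))
    w += lr * (1.0 / w + x * (1.0 - 2.0 * y))
    w0 += lr * (1.0 - 2.0 * y)
```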
Learning Rule: N inputs, N outputs
• Need
$$f_y(\mathbf{y}) = \frac{f_x(\mathbf{x})}{|J|},$$
where J is the Jacobian
$$J = \det\begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_N} \\ \vdots & & \vdots \\ \dfrac{\partial y_N}{\partial x_1} & \cdots & \dfrac{\partial y_N}{\partial x_N} \end{bmatrix}$$
• Assuming g(u) is sigmoidal again, we obtain:
$$\Delta W \propto \left[W^{T}\right]^{-1} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^{T}, \qquad \Delta \mathbf{w}_0 \propto \mathbf{1} - 2\mathbf{y},$$
where 1 is the unit vector.
• The network is trained until the changes in the
weights become acceptably small at each
iteration.
• Thus the unmixing matrix W is found.
The Natural Gradient
• The computation of the inverse matrix [Wᵀ]⁻¹ is time-consuming, and may be avoided by rescaling the entropy gradient, multiplying it by WᵀW.
• Thus, for a sigmoidal g(u) we obtain
$$\Delta W \propto \frac{\partial H(\mathbf{y})}{\partial W}W^{T}W = \left[I + (\mathbf{1} - 2\mathbf{y})\mathbf{u}^{T}\right]W$$
• This is the natural gradient, introduced by
Amari (1998), and now widely adopted.
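A sketch of batch training with this natural-gradient update for a sigmoidal g(u); the function name, learning rate, block size and number of passes are illustrative assumptions:

```python
# Sketch of batch infomax training with the natural-gradient update:
#   dW ∝ [I + (1 - 2y) u^T] W
import numpy as np

def infomax_natural_gradient(x, lr=0.01, n_passes=100, block=256):
    """Estimate an unmixing matrix W from measured signals x (M x N)."""
    M, N = x.shape
    W = np.eye(M)
    for _ in range(n_passes):
        for start in range(0, N - block + 1, block):
            xb = x[:, start:start + block]
            u = W @ xb                           # current activation estimates
            y = 1.0 / (1.0 + np.exp(-u))         # sigmoidal network outputs
            grad = (np.eye(M) + (1.0 - 2.0 * y) @ u.T / block) @ W
            W += lr * grad                       # no matrix inverse required
    return W
```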
The nonlinearity, g(u)
• We have already learnt that the g(u) should be the cumulative distribution functions of the individual source distributions.
• So far the g(u) have been assumed to be sigmoidal, so what are the pdfs of the si?
• The corresponding pdfs of the si are super-Gaussian.
Super- and sub-Gaussian pdfs
[Figure: sketch of P(si) versus si, comparing Gaussian, super-Gaussian and sub-Gaussian pdfs.]
* Note: there is no single agreed mathematical definition of super- and sub-Gaussian distributions.
Super- and sub-Gaussians
Super-Gaussians:
• kurtosis (fourth-order central moment, which measures the flatness of the pdf) > 0
• infrequent signals of short duration, e.g. evoked brain signals.
Sub-Gaussians:
• kurtosis < 0
• signals mainly “on”, e.g. 50/60 Hz electrical mains supply, but also eye blinks.
Kurtosis
• Kurtosis = 4th order central moment,
$$m_x(4) = E\left[(x - m_x)^4\right],$$
and the normalised kurtosis,
$$\mathrm{kurt}(u_i) = \frac{E\left[u_i^4\right]}{\left(E\left[u_i^2\right]\right)^2} - 3,$$
is seen to be calculated from the current estimates of the source signals.
• To separate the independent sources, information about their pdfs such as skewness (3rd moment) and flatness (kurtosis) is required.
• First and 2nd moments (mean and variance) are insufficient.
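A sketch of the normalised kurtosis computed from samples of an estimated source; the test distributions are illustrative:

```python
# Normalised kurtosis: kurt(u) = E[u^4] / (E[u^2])^2 - 3
#   (> 0 super-Gaussian, < 0 sub-Gaussian, ~ 0 Gaussian)
import numpy as np

def kurtosis(u):
    u = np.asarray(u, dtype=float)
    u = u - u.mean()                              # use central moments
    return float(np.mean(u ** 4) / np.mean(u ** 2) ** 2 - 3.0)

rng = np.random.default_rng(2)
print(kurtosis(rng.laplace(size=100_000)))        # ~ +3   (super-Gaussian)
print(kurtosis(rng.uniform(-1, 1, 100_000)))      # ~ -1.2 (sub-Gaussian)
print(kurtosis(rng.standard_normal(100_000)))     # ~ 0    (Gaussian)
```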
A more generalised learning rule
• Girolami (1997) showed that tanh(ui) and -tanh(ui)
could be used for super- and sub-Gaussians
respectively.
• Cardoso and Laheld (1996) developed a stability
analysis to determine whether the source signals
were to be considered super- or sub-Gaussian.
• Lee, Girolami, and Sejnowski (1998) applied these
findings to develop their extended infomax
algorithm for super- and sub-Gaussians using a
kurtosis-based switching rule.
Extended Infomax Learning Rule
• With super-Gaussians modelled as
$$p(u) \propto N(0,1)\,\mathrm{sech}^2(u)$$
and sub-Gaussians as a Pearson mixture model,
$$p(u) = \tfrac{1}{2}\left[N(\mu,\sigma^2) + N(-\mu,\sigma^2)\right],$$
the new extended learning rule is
$$\Delta W \propto \left[I - K\tanh(\mathbf{u})\mathbf{u}^{T} - \mathbf{u}\mathbf{u}^{T}\right]W,\qquad k_i = +1:\ \text{super-Gaussian},\quad k_i = -1:\ \text{sub-Gaussian}.$$
Switching Decision
$$\Delta W \propto \left[I - K\tanh(\mathbf{u})\mathbf{u}^{T} - \mathbf{u}\mathbf{u}^{T}\right]W,\qquad k_i = +1:\ \text{super-Gaussian},\quad k_i = -1:\ \text{sub-Gaussian},$$
where the ki are the elements of the N-dimensional diagonal matrix, K, and
$$k_i = \mathrm{sign}\left(E\left[\mathrm{sech}^2(u_i)\right]E\left[u_i^2\right] - E\left[\tanh(u_i)\,u_i\right]\right).$$
• Modifications of the formula for ki exist, but in
our experience the extended algorithm has been
unsatisfactory.
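Despite that caveat, as an illustration of the update and switching rule just given, here is a sketch of a single extended-infomax step; the function name, learning rate and batching are illustrative assumptions:

```python
# Sketch of one extended-infomax batch update with the switching rule above:
#   dW ∝ [I - K tanh(u) u^T - u u^T] W,
#   k_i = sign( E[sech^2(u_i)] E[u_i^2] - E[tanh(u_i) u_i] )
import numpy as np

def extended_infomax_step(W, xb, lr=0.01):
    """One update of W from a block of measured signals xb (M x T)."""
    M, T = xb.shape
    u = W @ xb                                  # current activation estimates
    sech2 = 1.0 / np.cosh(u) ** 2
    k = np.sign(np.mean(sech2, axis=1) * np.mean(u ** 2, axis=1)
                - np.mean(np.tanh(u) * u, axis=1))   # +1 super-, -1 sub-Gaussian
    grad = (np.eye(M) - np.diag(k) @ np.tanh(u) @ u.T / T - u @ u.T / T) @ W
    return W + lr * grad
```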
Reasons for unsatisfactory extended algorithm
1) Initial assumptions about super- and sub-Gaussian
distributions may be too inaccurate.
2) The switching criterion may be inadequate.
Alternatives
• Postulate vague distributions for the source signals
which are then developed iteratively during training.
• Use an alternative, statistically based approach, e.g. JADE (Cardoso).
Summary so far
• We have seen how W may be obtained by training
the network, and the extended algorithm for
switching between super- and sub-Gaussians has
been described.
• Alternative approaches have been mentioned.
• Next we consider how to obtain the source signals
knowing W and the measured signals, x.
Source signal determination
• The system is:
[Figure: signal chain. Unknown sources si → unknown mixing matrix A → measured signals xi → unmixing matrix W → estimated activations ui ≈ si → g(u) → outputs yi.]
•Hence U = W.x and x = A.S, where A ≈ W⁻¹ and U ≈ S.
•The rows of U are the estimated source signals,
known as activations (as functions of time).
•The rows of x are the time-varying measured signals.
Source Signals
$$\underbrace{U}_{M\times N} \;=\; \underbrace{W}_{M\times M}\;\cdot\;\underbrace{x}_{M\times N}$$
$$\begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1N} \\ \vdots & & & \vdots \\ u_{M1} & \cdots & & u_{MN} \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1M} \\ \vdots & & & \vdots \\ w_{M1} & \cdots & & w_{MM} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ \vdots & & & \vdots \\ x_{M1} & \cdots & & x_{MN} \end{bmatrix}$$
(rows index the channel number; columns index time, or sample number)
Expressions for the Activations
$$u_{11} = w_{11}x_{11} + w_{12}x_{21} + \cdots + w_{1M}x_{M1}$$
$$u_{12} = w_{11}x_{12} + w_{12}x_{22} + \cdots + w_{1M}x_{M2}$$
• We see that consecutive values of u are obtained by filtering consecutive columns of x by the same row of W.
$$u_{21} = w_{21}x_{11} + w_{22}x_{21} + w_{23}x_{31} + \cdots + w_{2M}x_{M1}$$
• The ith row of U is the ith row of W multiplied by the columns of x.
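A small sketch (hypothetical sizes and random values) confirming this row-wise interpretation:

```python
# The i-th row of U is the i-th row of W applied to the columns of x.
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 5
W = rng.standard_normal((M, M))       # an unmixing matrix
x = rng.standard_normal((M, N))       # measured signals, one row per channel

U = W @ x                             # all activations at once
u_row2 = W[1, :] @ x                  # 2nd row of W filtering the columns of x
assert np.allclose(U[1, :], u_row2)
```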
Procedure
• Record N time points from each of M sensors, where N ≥ 5M.
• Pre-process the data, e.g. filtering, trend removal.
• Sphere the data using Principal Components
Analysis (PCA). This is not essential but speeds
up the computation by first removing first and
second order moments.
• Compute the ui ≈ si. Include desphering.
• Analyse the results.
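A sketch of this procedure in outline; sphere() and separate() are hypothetical helper names, and the trainer argument stands for any weight-update routine, e.g. the natural-gradient sketch shown earlier:

```python
# Sphere the data with PCA, estimate the unmixing matrix on the sphered data,
# then fold the sphering matrix back in ("desphering") so that the activations
# refer to the original measurements.
import numpy as np

def sphere(x):
    """Remove first and second order moments: return z = S (x - mean), and S."""
    xc = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(xc))
    S = E @ np.diag(1.0 / np.sqrt(d)) @ E.T          # sphering (whitening) matrix
    return S @ xc, S

def separate(x, trainer):
    z, S = sphere(x)                                 # pre-processing / sphering
    W_sph = trainer(z)                               # unmixing matrix for sphered data
    W = W_sph @ S                                    # desphering: overall unmixing matrix
    u = W @ (x - x.mean(axis=1, keepdims=True))      # activations, u_i ~ s_i
    return u, W

# e.g. u, W = separate(x, infomax_natural_gradient)
```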
Optional Procedures I
• The contribution of each activation at a
sensor may be found by “back-projecting” it
to the sensor.
$$\mathbf{x} = W^{-1}\mathbf{s}$$
For example, keeping only the second activation and setting the others to zero,
$$\mathbf{x} = \begin{bmatrix} w_{11}^{-1} & w_{12}^{-1} & \cdots & w_{1M}^{-1} \\ w_{21}^{-1} & w_{22}^{-1} & \cdots & w_{2M}^{-1} \\ \vdots & & & \vdots \end{bmatrix} \begin{bmatrix} 0 & 0 & \cdots & 0 \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2N} \\ 0 & 0 & \cdots & 0 \end{bmatrix}$$
so that
$$x_{21} = w_{21}^{-1}\cdot 0 + w_{22}^{-1}\, s_{21} + \cdots + w_{2M}^{-1}\cdot 0 = w_{22}^{-1}\, s_{21},$$
$$x_{22} = w_{22}^{-1}\, s_{22},\ \ldots,\ x_{2N} = w_{22}^{-1}\, s_{2N}.$$
Optional Procedures II
• A measured signal which is contaminated by artefacts or noise may be extracted by “back-projecting” all the signal activations to the measurement electrode, setting the other activations to zero. (An artefact and noise removal method.)
For example, keeping the first two (signal) activations and setting the rest to zero,
$$\mathbf{x} = \begin{bmatrix} w_{11}^{-1} & w_{12}^{-1} & \cdots & w_{1M}^{-1} \\ w_{21}^{-1} & w_{22}^{-1} & \cdots & w_{2M}^{-1} \\ \vdots & & & \vdots \end{bmatrix} \begin{bmatrix} s_{11} & s_{12} & s_{13} & \cdots & s_{1N} \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2N} \\ 0 & 0 & \cdots & 0 \end{bmatrix}$$
so that
$$x_{21} = w_{21}^{-1}\, s_{11} + w_{22}^{-1}\, s_{21} + \cdots + w_{2M}^{-1}\cdot 0 = w_{21}^{-1}\, s_{11} + w_{22}^{-1}\, s_{21},$$
$$x_{22} = w_{21}^{-1}\, s_{12} + w_{22}^{-1}\, s_{22},\ \ldots,\ x_{2N} = w_{21}^{-1}\, s_{1N} + w_{22}^{-1}\, s_{2N}.$$
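A sketch of back-projection as used in both optional procedures; back_project() is a hypothetical helper, and `keep` lists the activations retained:

```python
# Back-projection: reconstruct the sensor signals from a chosen subset of
# activations, x = W^{-1} @ (activations with the unwanted rows set to zero).
import numpy as np

def back_project(W, u, keep):
    """Contribution at the sensors of only the activations listed in `keep`."""
    u_masked = np.zeros_like(u)
    u_masked[keep, :] = u[keep, :]       # keep chosen activations, zero the rest
    return np.linalg.inv(W) @ u_masked   # back-project to the measurement channels

# e.g. remove an artefact found in activation 0, keeping all the others:
# x_clean = back_project(W, u, keep=[i for i in range(u.shape[0]) if i != 0])
```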
Current Developments
• Overcomplete representations - more signal
sources than sensors.
• Nonlinear mixing.
• Nonstationary sources.
• General formulation of g(u).
Conclusions
• It has been shown how to extract temporally
independent unknown source signals from
their linear mixtures at the outputs of an
unknown system using Independent
Components Analysis.
• Some of the limitations of the method have
been mentioned.
• Current developments have been highlighted.