Connection between Multilayer Perceptrons and Regression Using Independent Component Analysis
Aapo Hyvärinen and Ella Bingham
Preliminary version appeared in Proc. ICANN'99
Summarized by Seong-woo Chung
2001.9.14
Introduction
Express the observed random variables x1, x2, …, xq as linear combinations of unknown component variables s1, s2, …, sn (n ≥ q for a nonsingular joint density):

x = As
The variables in x are divided into two parts, observed and missing: the first k variables form the vector of observed variables xo = (x1, …, xk)T, and the remaining variables form the vector of missing variables xm = (xk+1, …, xq)T
( xo )   ( Ao )
(    ) = (    ) s
( xm )   ( Am )

i.e., xo = Aos and xm = Ams
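As a small numerical illustration of this partitioned model (the dimensions, the choice of Laplace sources, and all variable names below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

q = n = 4   # q observed mixtures, n sources (square, invertible case)
k = 3       # the first k mixtures are observed, the rest are missing

# Independent unit-variance Laplace sources (scale 1/sqrt(2) gives variance 1)
s = rng.laplace(scale=1 / np.sqrt(2), size=n)

A = rng.standard_normal((q, n))   # random mixing matrix
x = A @ s                         # the ICA model x = As

# Partition x, and the rows of A, into observed and missing parts
x_o, x_m = x[:k], x[k:]
A_o, A_m = A[:k, :], A[k:, :]
assert np.allclose(x_o, A_o @ s) and np.allclose(x_m, A_m @ s)
```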
Introduction (Continued)
The problem is to predict xm for a given observation of xo
The regression function x̂m is conventionally defined as the conditional expectation

x̂m = E{xm | xo}
Model the joint density of x by ICA, and then, for a given
sample of incomplete data, predict the missing values in xm
using the conditional expectation, which is well defined
once the ICA model has been estimated
E{xm | xo} = Am ∫_{Aos = xo} s p(s) ds

where the integral runs over the set {s : Aos = xo} of source values consistent with the observation, with p(s) suitably normalized on this set
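For a two-dimensional toy case this integral can be evaluated directly, since {s : Aos = xo} is a line in source space. A minimal sketch, assuming unit-variance Laplace sources (the grid range, step count, and all names are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two sources, one observed mixture, one missing mixture
A = rng.standard_normal((2, 2))
A_o, A_m = A[:1, :], A[1:, :]

def p(s):
    # Joint density of two independent unit-variance Laplace sources
    return np.prod(np.exp(-np.sqrt(2) * np.abs(s)) / np.sqrt(2), axis=-1)

def regress_missing(x_o, t=np.linspace(-30, 30, 20001)):
    # Particular solution s0 of A_o s = x_o, plus the null-space direction v:
    # every s in the constraint set is s0 + t*v, so the conditional mean of s
    # is a 1-D integral (the constant line element cancels in the ratio).
    s0 = np.linalg.lstsq(A_o, np.atleast_1d(x_o), rcond=None)[0]
    v = np.array([-A_o[0, 1], A_o[0, 0]])
    v /= np.linalg.norm(v)
    line = s0 + t[:, None] * v
    w = p(line)                                     # unnormalized weights
    s_hat = (w[:, None] * line).sum(0) / w.sum()    # E{s | x_o}
    return (A_m @ s_hat).item()                     # E{x_m | x_o}

s_true = rng.laplace(scale=1 / np.sqrt(2), size=2)
x = A @ s_true
print("predicted x_m:", regress_missing(x[0]), "  true x_m:", x[1])
```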
Regression by ICA and by an MLP: The connection
Denote the probability density of si by pi, and define the componentwise nonlinearity gi(u) = u + c p′i(u)/pi(u), where c is a constant. Then

E{xm | xo} ≈ Am g(AoT xo)
The regression function for data modeled by ICA is thus given by the output of an MLP with one hidden layer
The weight vectors of the MLP are simple functions of the
mixing matrix, and the nonlinear activation functions of
the MLP are functions of the probability densities of the si
The vector AoTxo can be interpreted as an initial linear estimate of s
The nonlinear aspect of g(·) consists largely of thresholding these linear estimates to obtain ŝ = g(AoTxo)
The final linear layer is basically a linear reconstruction of the form x̂m = Amŝ
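Read as code, the three stages look as follows; a sketch under our assumptions (Laplace sources, a small constant c that the slides do not fix, and whitened xo so that AoT acts as an approximate inverse of Ao):

```python
import numpy as np

def g(u, c=0.1):
    # g_i(u) = u + c * p_i'(u)/p_i(u); for unit-variance Laplace sources
    # p'(u)/p(u) = -sqrt(2)*sign(u), so g shrinks u toward zero.
    return u - c * np.sqrt(2.0) * np.sign(u)

def mlp_regression(x_o, A_o, A_m, c=0.1):
    y = A_o.T @ x_o          # linear layer: initial linear estimate of s
    s_hat = g(y, c)          # hidden layer: componentwise shrinkage
    return A_m @ s_hat       # output layer: linear reconstruction of x_m
```

With the identity in place of g this would reduce to ordinary linear regression; the prior on s enters only through the shrinkage.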
Simulation
The simulation data are 100-dimensional and there are 101,000 data samples
The independent components, generated according to some probability density, are mixed using a randomly generated n×n mixing matrix
The mixtures x are divided into observed (xo) and missing (xm) parts
The dimensionality of xo is 99 and the dimensionality of xm is 1
The variables in xo are uncorrelated and their variance is set to one
The data are split into a training set of size 100,000 and a test set of size 1,000
ICA estimation on the training set gives estimates of the source signals s and the mixing matrix A
The value of the missing variable xm is then predicted either by numerical integration or by the MLP approximation, as sketched below
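A scaled-down sketch of this pipeline (10 dimensions instead of 100, Laplace sources, scikit-learn's FastICA as the estimator, and the MLP-form predictor with an assumed constant c; none of these specifics come from the paper):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
n, n_train, n_test = 10, 20000, 1000

S = rng.laplace(scale=1 / np.sqrt(2), size=(n_train + n_test, n))
A = rng.standard_normal((n, n))
X = S @ A.T                                  # rows follow x = As
X_train, X_test = X[:n_train], X[n_train:]

ica = FastICA(n_components=n, whiten="unit-variance", random_state=0)
ica.fit(X_train)
A_hat = ica.mixing_                          # estimated mixing matrix
A_o, A_m = A_hat[:-1, :], A_hat[-1, :]       # last coordinate is missing

Xc = X_test - ica.mean_                      # FastICA centers the data
x_o, x_m_true = Xc[:, :-1], Xc[:, -1]

c = 0.1                                      # assumed shrinkage constant
y = x_o @ A_o                                # batch version of A_o^T x_o
x_m_pred = (y - c * np.sqrt(2) * np.sign(y)) @ A_m
print("RMSE:", np.sqrt(np.mean((x_m_pred - x_m_true) ** 2)))
```

Since A here is random rather than orthogonal, AoT only roughly inverts Ao; in the paper's setup xo is whitened, which makes the initial linear estimate of s more accurate.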
Simulation – Strongly Supergaussian Data

p(s) = 3 / (2(1 + |s|)^4)
f'(s) = 4 sign(s) / (1 + |s|), where f(s) = −log p(s)
Simulation – Laplace Distributed Data

p(s) = exp(−√2 |s|) / √2
f'(s) = √2 sign(s)
Simulation – Very Weakly Supergaussian Data

p(s) = 1 / (2 cosh^2(s))
f'(s) = 2 tanh(s)
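A quick numerical check that each of the three density/score pairs is internally consistent, i.e. that p integrates to one and that f'(s) matches the derivative of −log p(s) (the grid range and step are arbitrary choices of ours):

```python
import numpy as np

densities = {
    "strongly supergaussian": (
        lambda s: 3.0 / (2.0 * (1.0 + np.abs(s)) ** 4),
        lambda s: 4.0 * np.sign(s) / (1.0 + np.abs(s)),
    ),
    "Laplace": (
        lambda s: np.exp(-np.sqrt(2.0) * np.abs(s)) / np.sqrt(2.0),
        lambda s: np.sqrt(2.0) * np.sign(s),
    ),
    "very weakly supergaussian": (
        lambda s: 1.0 / (2.0 * np.cosh(s) ** 2),
        lambda s: 2.0 * np.tanh(s),
    ),
}

s = np.linspace(-20, 20, 400001)
ds = s[1] - s[0]
for name, (p, fprime) in densities.items():
    mass = p(s).sum() * ds                     # should be close to 1
    num = np.gradient(-np.log(p(s)), s)        # numerical (-log p)'
    err = np.abs(num - fprime(s))[1:-1].max()  # ignore the grid endpoints
    print(f"{name}: mass={mass:.4f}  max |f' error|={err:.4f}")
```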
Conclusion
Approximation
If the distributions of the independent components are close to Gaussian, the approximation gives excellent results
If they are strongly supergaussian, the approximation is
less accurate but still quite reasonable in the range we
experimented with
Regression
The stronger the supergaussianity, the better the quality
of the regression
In contrast, for weakly supergaussian components, ICA
regression does not really explain the data
Discussion
Regression by ICA is computationally demanding, due to the integration
The integration may be approximated by the computationally simple procedure of computing the outputs of an MLP; a toy comparison is sketched at the end of this section
The output of each hidden-layer neuron corresponds to the estimate of one of the independent components
Choosing the nonlinearity amounts to estimating the probability densities of the independent components
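To make the cost contrast concrete, a toy comparison under the same assumptions as the earlier sketches (orthogonal 2×2 mixing, Laplace sources, an assumed constant c; the timing is only indicative):

```python
import time
import numpy as np

rng = np.random.default_rng(3)

A = np.linalg.qr(rng.standard_normal((2, 2)))[0]  # orthogonal mixing
A_o, A_m = A[:1, :], A[1:, :]

def by_integration(x_o, t=np.linspace(-30, 30, 20001)):
    # Grid integration along the constraint line {s : A_o s = x_o}
    s0 = A_o.T @ np.atleast_1d(x_o)       # valid since A_o has a unit row
    v = np.array([-A_o[0, 1], A_o[0, 0]])
    line = s0 + t[:, None] * v
    w = np.exp(-np.sqrt(2) * np.abs(line)).prod(axis=1)
    return (A_m @ ((w[:, None] * line).sum(0) / w.sum())).item()

def by_mlp(x_o, c=0.1):
    # Two matrix-vector products and a componentwise nonlinearity
    y = A_o.T @ np.atleast_1d(x_o)
    return (A_m @ (y - c * np.sqrt(2) * np.sign(y))).item()

for f in (by_integration, by_mlp):
    t0 = time.perf_counter()
    val = f(0.8)
    print(f"{f.__name__}: {val:+.4f} in {time.perf_counter() - t0:.6f} s")
```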
(C) 2001, SNU CSE Biointelligence Lab