Information Theory and Learning
Tony Bell
Helen Wills Neuroscience Institute
University of California at Berkeley
One input, one output, deterministic.
Infomax: match the input distribution to the non-linearity.
Gradient descent learning rule to maximise the transferred information (the deterministic, sensory-only case).
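A sketch of the equations that accompanied this slide, assuming the standard Bell & Sejnowski single-unit Infomax formulation (the slide's own notation is not preserved in the transcript):

y = g(wx + w_0), \qquad g(u) = \frac{1}{1 + e^{-u}}

For a deterministic mapping, maximising the transferred information I(x, y) = H(y) - H(y|x) reduces to maximising H(y) = E[\ln|\partial y/\partial x|] + H(x), giving the learning rule

\Delta w \propto \frac{\partial}{\partial w} \ln\left|\frac{\partial y}{\partial x}\right| = \frac{1}{w} + x(1 - 2y), \qquad \Delta w_0 \propto 1 - 2y.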
Examples of score functions: LOGISTIC and LAPLACIAN.
In stochastic gradient algorithms (online training), we dispense with the ensemble averages, giving a rule for a single training example and a Laplacian 'prior'.
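A hedged reconstruction of the missing score-function examples, using their standard forms:

LOGISTIC: \; p(u) \propto g(u)\,(1 - g(u)) \;\Rightarrow\; f(u) = \frac{\partial}{\partial u}\ln p(u) = 1 - 2g(u)
LAPLACIAN: \; p(u) \propto e^{-|u|} \;\Rightarrow\; f(u) = -\,\mathrm{sign}(u)

Dropping the ensemble average, the online rule for a single training example and a Laplacian 'prior' would read \Delta w \propto \frac{1}{w} - \mathrm{sign}(u)\,x.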
Same theory for multiple dimensions: fire vectors into the unit hypercube uniformly.
The key quantity is the absolute determinant of the Jacobian matrix, measuring how 'stretchy' the mapping is, for square or overcomplete transforms.
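A sketch of the multi-dimensional equations, following the standard Infomax/ICA derivation (assumed, since the slide equations are images):

\mathbf{u} = W\mathbf{x}, \qquad \mathbf{y} = g(\mathbf{u}), \qquad H(\mathbf{y}) = E\left[\ln\left|\det\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right|\right] + H(\mathbf{x})

\Delta W \propto \frac{\partial H(\mathbf{y})}{\partial W} = (W^{\top})^{-1} + E\left[\mathbf{f}(\mathbf{u})\,\mathbf{x}^{\top}\right], \qquad f_i(u_i) = \frac{\partial}{\partial u_i}\ln p_i(u_i),

where the slope of each output non-linearity is interpreted as a density model p_i(u_i).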
Undercomplete transformations are not invertible, and require a more complex formula.
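One plausible form of that formula (an assumption; the slide equation itself is not recoverable): for an undercomplete W the volume factor of the mapping is \sqrt{\det(WW^{\top})} rather than |\det W|, giving

\Delta W \propto (WW^{\top})^{-1} W + E\left[\mathbf{f}(\mathbf{u})\,\mathbf{x}^{\top}\right].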
Same theory for multiple dimensions: fire vectors into the unit hypercube uniformly.
Post-multiplying this gradient by a positive-definite transform rescales it optimally (called the Natural Gradient - Amari), giving a pleasantly simple form.
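The natural-gradient form, as in Amari's ICA rule (assumed to be what the slide shows): post-multiplying by the positive-definite matrix W^{\top}W gives

\Delta W \propto \left(I + E\left[\mathbf{f}(\mathbf{u})\,\mathbf{u}^{\top}\right]\right) W,

which avoids the matrix inversion in (W^{\top})^{-1}.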
Decorrelation is not enough: it only constrains E[uu^T] to a diagonal matrix.
f gives higher-order statistics, through its Taylor expansion.
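To make the point concrete (standard facts, not from the slide): decorrelation only fixes the second-order statistics, E[\mathbf{u}\mathbf{u}^{\top}] = D with D diagonal. The logistic score f(u) = 1 - 2g(u) = -\tanh(u/2) = -\tfrac{u}{2} + \tfrac{u^{3}}{24} - \cdots contains higher odd powers of u, so the learning rule is driven by higher-order cross-moments such as E[u_i^{3} u_j], not just correlations.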
Infomax/ICA on image patches: learn co-ordinates for natural scenes.
In this linear generative model, we want u = s: recover the independent sources.
After training, we calculate A = W^{-1} and plot its columns. For 16x16 image patches, we get 256 basis functions.
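The generative picture being assumed here (standard linear ICA; notation mine):

\mathbf{x} = A\mathbf{s} \quad \text{(linear generative model)}, \qquad \mathbf{u} = W\mathbf{x} = WA\,\mathbf{s},

so u = s (up to permutation and scaling of the sources) when W \to A^{-1}; the plotted basis functions are the columns of A = W^{-1}.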
[Figures: bases learned with f from a logistic density, f from a Laplacian density, and f from a Gaussian density.]
But this does not actually make the neurons independent.
Many joint densities p(u1, u2) are decorrelated but still radially symmetric: they factorise in polar co-ordinates, but not in Cartesian co-ordinates (the joint density depends only on the radius, instead of splitting into p(u1)p(u2)), unless they're Gaussian.
This happens when cells have similar position, spatial frequency, and orientation selectivity, but different phase.
Dependent filters can combine to make non-linear complex cells (oriented but phase-insensitive).
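A concrete illustration (mine, not from the slide): the radially symmetric density p(u_1, u_2) \propto \exp(-\sqrt{u_1^2 + u_2^2}) has E[u_1 u_2] = 0, so it is decorrelated, yet it does not factorise as p(u_1)\,p(u_2); the two outputs remain statistically dependent.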
'Dependent' Component Analysis.
First, the maximum likelihood framework. What we have been doing is: Infomax = Maximum Likelihood = Minimum KL Divergence.
We are fitting a model to the data, or equivalently, minimising the KL divergence between the data distribution and the model.
But a much more general model is the 'energy-based' model (Hinton), in which the log density is a sum of functions on subsets of the outputs.
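A hedged reconstruction of the models referred to here, using the standard ICA likelihood and Hinton's energy-based form:

p(\mathbf{x} \mid W) = |\det W| \prod_i q_i(u_i), \qquad \mathbf{u} = W\mathbf{x},

and maximising this likelihood is equivalent to minimising D_{\mathrm{KL}}\!\left[\hat{p}(\mathbf{x}) \,\|\, p(\mathbf{x} \mid W)\right], where \hat{p} is the data distribution. The energy-based generalisation replaces the factorised q with

q(\mathbf{u}) = \frac{1}{Z}\exp\!\Big(-\sum_j E_j(\mathbf{u}_{S_j})\Big),

a sum of energy functions on subsets S_j of the outputs.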
'Dependent' Component Analysis.
For the completely general model, the learning rule has a second term which reduces to -I (the identity) in the case of ICA. Unfortunately this term involves an intractable integral over the model q.
Nonetheless, we can still work with all dependency models which are non-loopy hypergraphs. Learn as before, but with a modified score function.
[Figure: a loopy hypergraph.]
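A plausible form of that learning rule, written from the description above (the slide equation itself is not recoverable):

\Delta W \propto \left( E_{\text{data}}\!\left[\mathbf{f}(\mathbf{u})\,\mathbf{u}^{\top}\right] - E_{q}\!\left[\mathbf{f}(\mathbf{u})\,\mathbf{u}^{\top}\right] \right) W, \qquad \mathbf{f}(\mathbf{u}) = \nabla_{\mathbf{u}} \ln q(\mathbf{u}).

For a normalised ICA model the second expectation equals -I by integration by parts, recovering the natural-gradient rule above; in general it is an intractable integral (the gradient of the partition function) over the model q.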
For example, we can split the space into subspaces such
that the cells are independent between subspaces and
dependent within the subspaces. Eg: for 4 cells:
[Diagram: the 4 cells, 1-4, grouped into subspaces.]
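One common choice for the within-subspace model and its modified score, along the lines of Hyvarinen & Hoyer's independent subspace analysis (assumed, not from the slide):

q(\mathbf{u}_S) \propto \exp\!\left(-\|\mathbf{u}_S\|\right), \qquad f_i(\mathbf{u}) = \frac{\partial}{\partial u_i}\ln q(\mathbf{u}_S) = -\frac{u_i}{\|\mathbf{u}_S\|} \quad (i \in S),

so cells within a subspace are coupled through the shared radius \|\mathbf{u}_S\| = \sqrt{\sum_{i \in S} u_i^2}, while different subspaces remain independent.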
We now show a sequence of symmetry-breaking occurring as we train this model on images, moving from one big 256-dimensional hyperball down to 64 four-dimensional hyperballs:
[Figures: bases learned with a logistic density and 1, 2, 4, 8, 16, 32, and 64 subspaces.]
Topographic ICA
Arrange the cells in a 2D map with a statistical model q constructed from overlapping subsets. This is a loopy hypergraph, an un-normalised model, but it still gives a nice result.
The hyperedges of our hypergraph are the overlapping 4x4 neighbourhoods of the map.
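A rough sketch of the corresponding unnormalised model (my paraphrase of topographic ICA; the exact form on the slide is not recoverable):

\ln q(\mathbf{u}) = -\sum_j G\!\Big(\sum_{i \in N_j} u_i^{2}\Big) - \ln Z,

where the N_j are the overlapping 4x4 neighbourhoods of the map and G is a smooth non-linearity such as G(e) = \sqrt{e}.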
That was from Hyvarinen & Hoyer.
Here’s one from Osindero & Hinton.
Well, we did get somewhere:
Conclusion.
We seem to have an information-theoretic explanation of some
properties of area V1 of visual cortex:
-simple cells (Olshausen & Field, Bell & Sejnowski)
-complex cells (Hyvarinen & Hoyer)
-topographic maps with singularities (Hyvarinen & Hoyer)
-colour receptive fields (Doi & Lewicki)
-direction sensitivity (van Hateren & Ruderman)
But we are stuck on:
-the gradient of the partition function
-still working with rate models, not spiking neurons
-no top-down feedback
-no sensory-motor (all passive world modeling)
References.
The references for all the work in these 3 talks will be
forwarded separately. If you don't have access to them, email me at
[email protected] and I'll send them to you.