Transcript slides

How to win big by thinking straight about
relatively trivial problems
Tony Bell
University of California at Berkeley
Density Estimation

Make the model $q(x)$ like the reality $p(x)$ by minimising the Kullback-Leibler Divergence:

$$D_{KL}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx$$

by gradient descent in a parameter $w$ of the model:

$$\Delta w \;\propto\; -\frac{\partial D_{KL}(p \,\|\, q)}{\partial w} \;=\; \left\langle \frac{\partial \log q(x)}{\partial w} \right\rangle_p$$

THIS RESULT IS COMPLETELY GENERAL.
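A minimal sketch of this result in code (the Gaussian model, the synthetic data and the learning rate are illustrative assumptions, not from the talk): descending the KL divergence over samples from the reality p is the same as ascending the data-averaged log-likelihood of the model q.

    import numpy as np

    # Minimal sketch (illustrative, not from the talk): minimising KL(p || q_w)
    # over samples from the 'reality' p is the same as following the gradient
    # < d log q_w(x) / dw >_p, i.e. gradient ascent on average log-likelihood.
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=0.5, size=5000)    # samples from p

    mu, log_sigma = 0.0, 0.0                            # parameters w of the model q
    lr = 0.05
    for _ in range(2000):
        sigma = np.exp(log_sigma)
        d_mu = np.mean((data - mu) / sigma**2)                  # <d log q / d mu>_p
        d_log_sigma = np.mean((data - mu)**2 / sigma**2 - 1.0)  # <d log q / d log sigma>_p
        mu += lr * d_mu                                 # gradient ascent = KL descent
        log_sigma += lr * d_log_sigma

    print(mu, np.exp(log_sigma))                        # approaches 2.0 and 0.5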
The passive case ($\partial p / \partial w = 0$)

For a general model distribution written in the 'Energy-based' form:

$$q(x) = \frac{e^{-E(x)}}{Z}, \qquad E(x)\ \text{the energy}, \qquad Z = \int e^{-E(x)}\,dx\ \text{the partition function (or zeroth moment...)}$$

the gradient evaluates in the simple 'Boltzmann-like' form:

$$\Delta w \;\propto\; -\left\langle \frac{\partial E}{\partial w} \right\rangle_p \;+\; \left\langle \frac{\partial E}{\partial w} \right\rangle_q$$

learn on data while awake (the $\langle\cdot\rangle_p$ term), unlearn on samples from the model while asleep (the $\langle\cdot\rangle_q$ term).
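A hedged sketch of the 'Boltzmann-like' form for a toy energy model (the quadratic energy, the data and the learning rate are my assumptions): lower the energy of awake data, raise it on sleep-phase samples from the model.

    import numpy as np

    # Hedged sketch (not from the talk): energy model q(x) = exp(-E(x))/Z with
    # E(x) = 0.5 * w * x**2, so q is a zero-mean Gaussian with precision w.
    # Update:  dw  ~  <dE/dw>_q  -  <dE/dw>_p
    rng = np.random.default_rng(1)
    data = rng.normal(0.0, 2.0, size=10000)       # 'awake' samples from reality p

    w, lr = 1.0, 0.1
    for _ in range(500):
        sleep = rng.normal(0.0, 1.0 / np.sqrt(w), size=10000)  # samples from model q
        dE_dw_p = 0.5 * np.mean(data**2)          # awake term: learn on data
        dE_dw_q = 0.5 * np.mean(sleep**2)         # asleep term: unlearn on model samples
        w += lr * (dE_dw_q - dE_dw_p)             # Boltzmann-like gradient step

    print(w)    # approaches 1 / var(p) = 0.25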
The single-layer case
Shaping Density
Many problems solved by modeling in the transformed space.

Linear Transform: $u = Wx$, so the model density is shaped through the transform: $q(x) = q_u(Wx)\,|\det W|$

Learning Rule (Natural Gradient), for a non-loopy hypergraph:

$$\Delta W \;\propto\; \left(I - \varphi(u)\,u^{T}\right) W$$

The Score Function

$$\varphi(u) = -\frac{\partial \log q_u(u)}{\partial u}$$

is the important quantity.
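A hedged sketch of the natural-gradient rule and the score function at work (my assumptions: two Laplacian sources, a random square mixing matrix, and tanh as a smooth stand-in for the sparse-prior score):

    import numpy as np

    # Hedged sketch: natural-gradient learning rule  dW ~ (I - phi(u) u^T) W
    # with u = W x, applied to a toy blind source separation problem.
    rng = np.random.default_rng(2)
    n, T = 2, 20000
    S = rng.laplace(size=(n, T))                  # independent sources
    A = rng.normal(size=(n, n))                   # unknown mixing
    X = A @ S                                     # observed mixtures

    W = np.eye(n)
    lr = 0.02
    for _ in range(1000):
        U = W @ X                                 # the transformed space u = W x
        phi = np.tanh(U)                          # score function of a sparse model density
        W += lr * (np.eye(n) - (phi @ U.T) / T) @ W   # natural-gradient step

    print(W @ A)   # roughly a scaled permutation matrix once the sources separate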
Conditional Density Modeling
To model a conditional density p(y|x), use the same rules applied to the joint and the marginal: since $q(y|x) = q(x,y)/q(x)$,

$$\Delta w \;\propto\; \left\langle \frac{\partial \log q(x,y)}{\partial w} - \frac{\partial \log q(x)}{\partial w} \right\rangle_p$$
This little known fact has hardly ever been exploited.
It can be used instead of regression everywhere.
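A hedged sketch in code (the Gaussian joint model and the synthetic data are my assumptions) of using density estimation in place of regression: estimate the joint and the marginal, then read off the full conditional, which carries a regression line and its predictive spread.

    import numpy as np

    # Hedged sketch: model the joint q(x, y) and the marginal q(x); their ratio
    # is a full conditional density q(y | x), not just a point estimate.
    rng = np.random.default_rng(3)
    x = rng.normal(size=5000)
    y = 1.5 * x + rng.normal(scale=0.3, size=5000)   # the 'reality' p(x, y)

    # density-estimate the joint and the marginal (here by maximum likelihood)
    mean = np.array([x.mean(), y.mean()])
    cov = np.cov(np.vstack([x, y]))

    # conditional q(y | x) = q(x, y) / q(x) for a Gaussian joint:
    slope = cov[0, 1] / cov[0, 0]
    intercept = mean[1] - slope * mean[0]
    cond_var = cov[1, 1] - cov[0, 1]**2 / cov[0, 0]

    print(slope, intercept, np.sqrt(cond_var))   # ~1.5, ~0.0, ~0.3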
Independent Components, Subspaces and Vectors
• ICA
• ISA
• IVA
• DCA (ie: score function hard to get at due to Z)

IVA used for audio-separation in real room:
Score functions derived from sparse factorial and radial densities:
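A hedged sketch, in code, of the kind of score functions meant here (the radial density $q(u_k) \propto e^{-\|u_k\|}$ is my illustrative assumption): the radial score depends only on each vector's amplitude, so the phases inside the vector stay free.

    import numpy as np

    # Hedged sketch: in IVA each source k is a vector u_k of complex frequency
    # components.  A radial density makes the score depend only on the vector's
    # amplitude, leaving the within-vector phase dependencies untouched.
    def radial_score(U):
        """Score phi(u_k) = u_k / ||u_k|| for each column vector u_k.

        U: complex array, shape (n_freqs, n_sources); column k is u_k.
        """
        norms = np.linalg.norm(U, axis=0, keepdims=True)   # amplitude of each vector
        return U / (norms + 1e-12)                         # amplitude-driven, phase-blind

    # sparse factorial score for comparison (treats every component independently)
    def factorial_score(U):
        return U / (np.abs(U) + 1e-12)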
Results on real-room source separation:
Why does IVA work on this problem?
Because the score function, and thus the learning, is sensitive only to the amplitude of each complex vector, which represents the correlations of amplitudes of the frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases within that vector. Thus all phase structure (ie: higher-order statistical structure) is confined within each vector and removed between vectors.
It’s a simple trick, just relaxing the independence assumptions
in a way that fits speech. But we can do much more:
• build conditional models across frequency components
• make models for data that is even more structured:
Video is [time x space x colour]
Many experiments are [time x sensor x task-condition x trial]
[Figure: EEG data panels labelled by channel group (channels 1-16, 17-32, 33-48) and time window (0-8 and 0-1)]
The big picture.
Behind this effort is an attempt to explore something called
“The Levels Hypothesis”, which is the idea that in biology, in the brain,
in nature, there is a kind of density estimation taking place across scales.
To explore this idea, we have a twofold strategy:
1. EMPIRICAL/DATA ANALYSIS:
Build algorithms that can probe the EEG across scales, ie: across frequencies
2. THEORETICAL:
Formalise mathematically the learning process in such systems.
A Multi-Level View of Learning
LEVEL      UNIT         DYNAMICS                              LEARNING
ecology    society      predation, symbiosis                  natural selection
society    organism     behaviour                             sensory-motor learning
organism   cell         spikes                                synaptic plasticity (= STDP)
cell       protein      direct, voltage, Ca, 2nd messenger    molecular change
protein    amino acid   molecular forces                      gene expression, protein recycling

(Timescale increases going up the levels.)
LEARNING at a LEVEL is CHANGE IN INTERACTIONS between its UNITS,
implemented by INTERACTIONS at the LEVEL beneath, and by extension
resulting in CHANGE IN LEARNING at the LEVEL above.
Interactions=fast
Learning=slow
Separation of timescales allows INTERACTIONS at one LEVEL
to be LEARNING at the LEVEL above.
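A hedged toy of this separation of timescales (all dynamics here are invented for illustration): fast 'interaction' variables, averaged through a much slower variable, become a learned quantity at the level above.

    import numpy as np

    # Hedged toy: fast interactions (x1, x2 updated every step) accumulate,
    # through a much slower variable w (eps << 1), into 'learning' above.
    rng = np.random.default_rng(4)
    eps = 1e-3                       # slow/fast timescale ratio
    w = 0.0                          # slow variable: a 'weight' at the level above
    x2 = 0.0
    for t in range(100000):
        x1 = rng.normal() + 0.5      # fast interaction variables (correlated)
        x2 = 0.9 * x2 + 0.1 * x1 + 0.05 * rng.normal()
        w += eps * (x1 * x2 - w)     # slow leaky average of the fast interaction

    print(w)                         # ~ long-run average of x1*x2: the fast
                                     # interactions have become a learned quantity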
1. Infomax between Layers (eg: V1 density-estimates Retina)
[Figure: x = retina, y = V1]
• square (in ICA formalism)
• feedforward
• information flows within a level
• predicts independent activity
• only models outside input
This SHIFT in looking at the problem
alters the question so that if it is
answered, we have an unsupervised
theory of ‘whole brain learning’.
2. Infomax between Levels (eg: synapses density-estimate spikes)
[Figure: t = all neural spikes, y = all synaptic readout, via synapses and dendrites (the synaptic weights)]
• overcomplete
• includes all feedback
• information flows between levels
• arbitrary dependencies
• models input and intrinsic activity
[Figure: pdf of all spike times → pdf of all synaptic 'readouts']
If we can make this readout pdf uniform, then we have a model constructed from all synaptic and dendritic causality.
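A hedged sketch of the 'make this pdf uniform' idea (the data and the two candidate model densities are my assumptions): pushing data through the cumulative distribution of a model density gives a uniform readout exactly when the model matches the data, so residual non-uniformity measures the remaining mismatch.

    import numpy as np
    from scipy.stats import norm

    # Hedged sketch: a matched model's CDF uniformises the data; a mismatched
    # model's CDF does not.
    rng = np.random.default_rng(5)
    data = rng.normal(loc=1.0, scale=2.0, size=20000)     # stand-in for spike data

    def non_uniformity(u):
        """KS-style distance between the readout u (in [0,1]) and uniform."""
        u = np.sort(u)
        grid = (np.arange(len(u)) + 0.5) / len(u)
        return np.abs(u - grid).max()

    good = norm(loc=1.0, scale=2.0).cdf(data)    # matched model: readout ~ uniform
    bad = norm(loc=0.0, scale=1.0).cdf(data)     # mismatched model: not uniform

    print(non_uniformity(good))   # close to 0
    print(non_uniformity(bad))    # clearly larger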
Formalisation of the problem:
IF
p is the ‘data’ distribution,
q is the 'model' distribution,
w is a synaptic weight, and
I(y,t) is the spike-synapse mutual information,
THEN if we were doing classical Infomax, we would use the gradient:
(1)
BUT if one’s actions can change the data, THEN an extra term appears:
(2)
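A hedged reconstruction of the two gradients from the document's own KL-divergence setup (my notation and assumption, not necessarily the talk's exact expressions):

    % (1) passive / classical case: the data distribution p does not depend on w
    \Delta w \;\propto\; -\frac{\partial}{\partial w} D_{KL}(p \,\|\, q)
             \;=\; \left\langle \frac{\partial \log q(x)}{\partial w} \right\rangle_p

    % (2) actions change the data: p depends on w, and an extra term appears
    \Delta w \;\propto\; \left\langle \frac{\partial \log q(x)}{\partial w} \right\rangle_p
             \;-\; \int \frac{\partial p(x)}{\partial w}\,\log\frac{p(x)}{q(x)}\,dx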
It is easier to live in a world where one can change the world to fit the model, as well as changing one's model to fit the world; therefore (2) must be easier than (1). This is what we are now researching.