
Neural Information Coding: Biological
Foundations, Analysis Methods and
Models
Stefano Panzeri & Alberto Mazzoni
Italian Institute of Technology
Genova, Italy
Contributors
Nikos Logothetis
Christoph Kayser
Yusuke Murayama
Cesare Magri
Robin Ince
Nicolas Brunel
Neural Coding 1:
biological foundations
Neuronal Coding
• How do neurons encode and transmit sensory
information?
• What is the language (“code”) used by the neurons to
transmit information to one another?
Why is it important for BMI to understand how
to decode a neural signal?
Nicolelis and Lebedev (2009) Nature Reviews Neuroscience
What is a code?
A code is a transformation of a message into another alphabet
For example, you can use your fingers to represent numbers
Use only thumb to code 0 or 1
2 fingers are used to encode 3 different
numbers by the total amount of extended
fingers
2 fingers are used with a more complex
code to encode 4 different numbers.
Here the position of the extended finger
is also used to signal – this increases
the capacity to encode information
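The capacity of the two finger codes above can be checked by enumeration. This is only a minimal sketch: the names `count_code` and `position_code` are illustrative labels, not terms from the slides.

```python
from itertools import product
from math import log2

# Every configuration of two fingers: 0 = folded, 1 = extended.
configurations = list(product([0, 1], repeat=2))

# Count code: only the total number of extended fingers matters.
count_code = {sum(c) for c in configurations}

# Position code: which finger is extended matters as well.
position_code = set(configurations)

for name, code in [("count", count_code), ("position", position_code)]:
    print(f"{name} code: {len(code)} messages, {log2(len(code)):.2f} bits")
```

Using the position of the extended finger adds one distinguishable message with the same two fingers.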
There are only 10 kinds of people in the world: those who know the binary code and those who don’t.
Morse code
Telegrams were transmitted as electrical signals over wires using Morse
code.
Morse code is a correspondence between patterns of electrical signals and
the characters of the English alphabet.
There are two types of “symbols”: long (_) and short (.) electrical pulses.
Not only the length of each individual signal matters, but also the timing between
signals (longer pauses separate characters that belong to different words)
The neuronal code is a sequence of spikes
Somatic electrode –
subthreshold membrane
potential plus Action Potentials
(spikes)
Extracellular electrode
Axonal electrode –
subthreshold membrane
potentials are attenuated and
only spikes propagate over long
distances
The synaptic vesicles in the axon terminal release neurotransmitters only when
an action potential arrives from the presynaptic neuron
Thus, neurons communicate information only through action potentials, not
through subthreshold membrane fluctuations.
Thus, the neuronal code consists of a time series of stereotyped action
potentials
… but from a BMI perspective, the code is
LFP
spikes
ECoG
EEG
fMRI
Spike trains encode stimuli
Single Neuron Variability
[Figure: spike trains of a single neuron on Trial 1, Trial 2, Trial 3 and Trial 4]
Decoding brain activity
Arieli et al (1996) Science
A response to an individual trial can be modeled as a sum of
two components: the reproducible response and the ongoing
network fluctuations
The effect of a stimulus may be likened to the additional
ripples caused by throwing a stone into a wavy sea
Response to multiple presentations of the same movie clip
Single unit activity - 10 secs
Local Field Potential - 10 secs
The code must be composed of features that are DIFFERENT from
scene to scene and SIMILAR across trials
The dictionary for neural code is noisy…
stimulus
s()
response
r(t)
Thus, the neuronal dictionary is probabilistic!
Probabilistic dictionary
[Figure: probability distribution of neural responses (x-axis: Response (spikes/s); y-axis: Probability), shown next to the distribution of barometer readings predicting rain and the distribution of barometer readings when it does not rain (x-axis: Pressure (inches Hg); y-axis: Frequency of occurrence)]
Neural Coding 2:
analysis methods
Single trial analysis toolboxes
(Information Theory and/or Decoding)
Decoding: predicting the most likely external
correlate from neural activity
Information: quantifies the total knowledge (in units
of bits) about external correlates that can be gained
from neural activity
(1 bit = doubling the knowledge)
Quian Quiroga & Panzeri (2009) Nature Reviews Neuroscience
What is Information Theory?
Information theory is a branch of mathematics that deals with
measures of information and their application to the study of
communication, statistics, and complexity.
It originally arose out of communication theory and is sometimes used
to mean the mathematical theory that underlies communication
systems and communication in the presence of noise.
Based on the pioneering work of Claude Shannon
Shannon, C. E. (1948). A mathematical theory of communication. Bell
Sys. Tech. Journal 27:379-423, 623-656.
TM Cover and JA Thomas, Elements of Information Theory, John Wiley 2006
PE Latham and Y Roudi (2008) Mutual Information – Scholarpedia
article (freely accessible)
R Quian Quiroga and S Panzeri (2009) Extracting information from
neuronal populations: information theory and decoding approaches.
Nature Reviews Neurosci
Entropy of a random variable S
Suppose we want to gain information about the stimulus
If there is only one possible stimulus, then no information is gained by knowing that
stimulus. The amount of information obtained from knowing something is related to
its INITIAL UNCERTAINTY
Therefore the first step to define information is to quantify uncertainty.
The entropy H of a random variable s is a simple function of its probability
distribution P(s):
$$H(S) = -\sum_{s} P(s) \log_2 P(s)$$
H(S) measures the amount of uncertainty inherent to a random variable, i.e. the
amount of information necessary to describe it completely.
$$H(\mathrm{GuessWho}) = -\sum_{i=1}^{24} \frac{1}{24} \log_2 \frac{1}{24} = \log_2 24 \approx 4.585$$
Warning about Guess Who
I will use Guess Who as the stupidest application of the information equations. But there is
no noise and no biological variability in the game. Decoding neuronal activity is more like
playing a version of the game in which the faces are probabilistic (X has a hat only P(X)%
of the times you look), and you are playing against a guy who lies and talks in a language
you don’t really know well during a noisy Skype conference.

Entropy and description length
H(S) has a very concrete interpretation: Suppose s is chosen randomly from the
distribution P(s) , and someone who knows the distribution is asked to guess which
was chosen. If the guesser uses the optimal question-asking strategy -- which is to
divide the probability in half on each guess by asking questions like "is s greater than
.. ?", then the average number of yes/no questions it takes to guess lies between
H(S) and H(S)+1.
This gives quantitative meaning to "uncertainty": it is the number of yes/no questions
it takes to guess a random variable, given knowledge of the underlying distribution
and using the optimal question-asking strategy
H(Guess Who)=4.585
So you really should win with 5 questions….
Demonstration is left to the audience
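The claim that the average number of yes/no questions lies between H(S) and H(S)+1 can be checked numerically. The sketch below assumes 24 equiprobable faces, and questions of the form "is the index below m?" that split the remaining candidates as evenly as possible. The function names are illustrative.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def questions_needed(n_candidates, target):
    """Count the yes/no questions asked until a single candidate is left."""
    low, high = 0, n_candidates  # the target index lies in [low, high)
    asked = 0
    while high - low > 1:
        middle = (low + high) // 2  # split the remaining candidates in half
        asked += 1
        if target < middle:
            high = middle
        else:
            low = middle
    return asked

n_faces = 24
h = entropy([1 / n_faces] * n_faces)
counts = [questions_needed(n_faces, t) for t in range(n_faces)]
average = sum(counts) / n_faces
print(f"H = {h:.3f} bits, average questions = {average:.3f}, worst case = {max(counts)}")
```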
Entropy of the stimulus distribution
$$H(S) = -\sum_{s} P(s) \log_2 P(s)$$
H(S) expresses the uncertainty about the
stimulus (or equivalently, the information
needed to specify the stimulus) prior to
the observation of a neural response
Residual entropy (equivocation) of the stimulus
after observing the neural response
$$H(S|R) = -\sum_{r} P(r) \sum_{s} P(s|r) \log_2 P(s|r)$$
H(S|R) expresses the residual
uncertainty about the stimulus
after the observation of a
neural response.
“Are you a woman?”
P(yes)=5/24; P(no)=19/24;
P(face|yes)= 0 or 1/5; P(face|no)= 0 or 1/19
H(Guess Who|woman) = (5/24) log2(5) + (19/24) log2(19) ≈ 3.85;
Questions left for optimal strategy: about 4 on average
Shannon’s Information
$$I(S;R) = H(S) - H(S|R)$$
$$I(R;S) = H(R) - H(R|S)$$
Shannon’s Information: it quantifies the average reduction of uncertainty
(= information gained) after a single-trial observation of the neural response
Measured in bits. 1 bit = reduction of uncertainty by a factor of 2 (like a correct
answer to a yes/no question)
I(Guess Who; woman) = 4.585 - 3.847 ≈ 0.74 bits;
I(Real World; woman) = 1 bit;
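The Guess Who numbers can be reproduced with a few lines. This is a sketch under stated assumptions: 24 equiprobable faces, and a yes/no question that splits them into groups of 5 and 19. The variable names are illustrative.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

n_faces = 24
group_sizes = {"yes": 5, "no": 19}  # assumed split produced by the question

# Prior uncertainty H(S): all faces are equally likely.
h_prior = entropy([1 / n_faces] * n_faces)

# Residual uncertainty H(S|R) = sum_r P(r) H(S|r); after the answer,
# the faces left in the group are still equally likely.
h_residual = sum(
    (size / n_faces) * entropy([1 / size] * size)
    for size in group_sizes.values()
)

information = h_prior - h_residual

# The answer is a deterministic function of the face, so I(S;R) equals H(R).
h_answer = entropy([size / n_faces for size in group_sizes.values()])

print(f"H(S) = {h_prior:.3f}, H(S|R) = {h_residual:.3f}, I(S;R) = {information:.3f}")
print(f"H(R) = {h_answer:.3f}")
```

A question that split the faces 12/12 would give exactly 1 bit, which is why unbalanced questions are less informative on average.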
Information key equations
$$I(S;R) = H(S) - H(S|R) = H(R) - H(R|S) = I(R;S)$$
$$I(S;R) = \sum_{s} P(s) \sum_{r} P(r|s) \log_2 \frac{P(r|s)}{P(r)}$$
R          Stim 1   Stim 2
Trial 1    10       5
Trial 2    9        3
Trial 3    8        5
Trial 4    10       4
I(S;R) = 1 bit

R          Stim 1   Stim 2
Trial 1    7        5
Trial 2    5        3
Trial 3    8        5
Trial 4    4        4
I(S;R) = 0.045 bit
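As an illustration of the key equation, the sketch below evaluates it with probabilities estimated as raw frequencies, using the first example table (Stim 1: 10, 9, 8, 10; Stim 2: 5, 3, 5, 4). The function name and the choice to weight each stimulus by its share of trials are assumptions made for this example.

```python
from collections import Counter
from math import log2

def plugin_information(responses_by_stimulus):
    """I(S;R) in bits from raw frequencies; P(s) is each stimulus' share of trials."""
    n_total = sum(len(r) for r in responses_by_stimulus.values())
    response_counts = Counter()
    for responses in responses_by_stimulus.values():
        response_counts.update(responses)
    info = 0.0
    for responses in responses_by_stimulus.values():
        p_s = len(responses) / n_total
        for r, count in Counter(responses).items():
            p_r_given_s = count / len(responses)
            p_r = response_counts[r] / n_total
            info += p_s * p_r_given_s * log2(p_r_given_s / p_r)
    return info

first_table = {"Stim 1": [10, 9, 8, 10], "Stim 2": [5, 3, 5, 4]}
print(plugin_information(first_table))
```

The responses to the two stimuli never overlap here, so every response identifies its stimulus and the result equals H(S).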
Information analysis with dynamic stimuli
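[Figure: responses across trials to a dynamic stimulus (scale bar: 1 sec), with stimulus sections s_A and s_B and the response distributions P(r), P(r|s_A) and P(r|s_B)]
$$I(S;R) = \sum_{r,s} P(s) P(r|s) \log_2 \frac{P(r|s)}{P(r)}$$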
Information about which section of the dynamic stimulus elicited the considered
neural response.
Since this procedure makes no assumptions about which movie feature is being encoded, it
quantifies information about all possible visual attributes in the movie
Data processing inequality
$$I(S; f(R)) \le I(S;R)$$
This means that you cannot increase information by decimating,
squaring or taking the logarithm of your data. It does not mean that
you cannot filter out noise (or extract spikes from raw data), because
$$I(S; R_1 - R_2) \;?\; I(S; R_1)$$
since
$$R_1 - R_2 \neq f(R_1)$$
Combining different responses
$$I(S;R_1,R_2) = \sum_{s} P(s) \sum_{r_1,r_2} P(r_1,r_2|s) \log_2 \frac{P(r_1,r_2|s)}{P(r_1,r_2)}$$
R1         Stim 1   Stim 2   Stim 3
Trial 1    10       5        4
Trial 2    9        3        5
Trial 3    8        5        5
Trial 4    10       4        3

R2         Stim 1   Stim 2   Stim 3
Trial 1    3        1        9
Trial 2    1        0        8
Trial 3    0        3        8
Trial 4    2        2        9
$$I(S;R_1,R_2) \ge \max(I(S;R_1), I(S;R_2))$$
Redundancy
$$\mathrm{RED}(S;R_1,R_2) = I(S;R_2) + I(S;R_1) - I(S;R_1,R_2)$$
R1         Stim 1   Stim 2   Stim 3
Trial 1    10       5        7
Trial 2    9        3        5
Trial 3    8        5        6
Trial 4    10       4        7

R2         Stim 1   Stim 2   Stim 3
Trial 1    3        1        9
Trial 2    1        0        8
Trial 3    0        3        8
Trial 4    2        2        9
Redundancy>0
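The redundancy of the two responses in the tables above can be evaluated directly from raw frequencies. This is only a sketch: with four trials per stimulus the estimates are far from reliable, and no correction is applied. The identity I(S;R) = H(S) + H(R) - H(S,R) is used for convenience, and the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Entropy in bits of the empirical distribution of a list of samples."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def information(stimuli, responses):
    """I(S;R) = H(S) + H(R) - H(S,R), all estimated from raw frequencies."""
    return entropy(stimuli) + entropy(responses) - entropy(list(zip(stimuli, responses)))

# Tables from the Redundancy slide, read row by row (4 trials x 3 stimuli).
stimuli = [1, 2, 3] * 4
r1 = [10, 5, 7, 9, 3, 5, 8, 5, 6, 10, 4, 7]
r2 = [3, 1, 9, 1, 0, 8, 0, 3, 8, 2, 2, 9]

i1 = information(stimuli, r1)
i2 = information(stimuli, r2)
i12 = information(stimuli, list(zip(r1, r2)))  # joint response (r1, r2)
redundancy = i1 + i2 - i12

print(f"I(S;R1) = {i1:.3f}, I(S;R2) = {i2:.3f}, I(S;R1,R2) = {i12:.3f}")
print(f"RED = {redundancy:.3f}")
```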
Decoding (stimulus reconstruction)
srs
p
Stimulus presented
decoding
Stimulus predicted
Quian Quiroga & Panzeri Nature Reviews Neurosci 2009
Decoding algorithm using Bayes rule
$$P(s'|r) = \frac{P(s') P(r|s')}{P(r)}$$
To predict the stimulus s^p that generated a given response r,
we can choose the stimulus that maximizes the posterior
probability of having caused r
$$s^p = \arg\max_{s'} P(s'|r)$$
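A minimal sketch of such a decoder is given below, assuming discrete responses and probabilities estimated as frequencies from training trials. The function names and the toy data are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

def fit_probabilities(stimuli, responses):
    """Estimate P(s) and P(r|s) as frequencies over the training trials."""
    counts = defaultdict(Counter)
    for s, r in zip(stimuli, responses):
        counts[s][r] += 1
    n_total = len(stimuli)
    prior = {s: sum(c.values()) / n_total for s, c in counts.items()}
    likelihood = {
        s: {r: k / sum(c.values()) for r, k in c.items()}
        for s, c in counts.items()
    }
    return prior, likelihood

def decode(response, prior, likelihood):
    """Return the s' maximizing P(s') P(r|s'); P(r) is the same for all s'."""
    scores = {s: prior[s] * likelihood[s].get(response, 0.0) for s in prior}
    # If the response was never seen in training, all scores are 0 and the
    # choice is arbitrary (the first stimulus is returned).
    return max(scores, key=scores.get)

# Hypothetical training data: spike counts for two stimuli.
train_stimuli = ["A"] * 4 + ["B"] * 4
train_responses = [10, 9, 8, 10, 5, 3, 5, 4]
prior, likelihood = fit_probabilities(train_stimuli, train_responses)

print(decode(10, prior, likelihood), decode(5, prior, likelihood))
```

In practice the decoder should be tested on trials that were not used to estimate the probabilities.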
Decoding using
clustering algorithms
Divide the response space into
regions. When the response
falls in a given region, assign it
to a stimulus
Optimal boundaries are
difficult to find, ask Iannis….
Decoding using neural networks
Kjaer et al (1994) J Comput Neurosci
Information in the confusion matrix
$$I(S;S^p) = \sum_{s,s^p} P(s,s^p) \log_2 \frac{P(s,s^p)}{P(s) P(s^p)}$$
[Figure: confusion matrix; axes: Stimulus presented s, Stimulus predicted s^p]
I(S;S^p) quantifies the information
that can be gained by a receiver
that knows the true P(s|r) but only
uses the information about which is
the most likely stimulus
Because of the data processing inequality:
$$I(S;S^p) \le I(S;R)$$
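The information in a confusion matrix can be computed as sketched below, assuming the matrix holds trial counts with rows for the presented stimulus and columns for the predicted stimulus. The function name and the two example matrices are hypothetical.

```python
from math import log2

def confusion_information(confusion_counts):
    """I(S;S^p) in bits; rows = presented stimulus, columns = predicted stimulus."""
    total = sum(sum(row) for row in confusion_counts)
    p_presented = [sum(row) / total for row in confusion_counts]
    p_predicted = [sum(col) / total for col in zip(*confusion_counts)]
    info = 0.0
    for i, row in enumerate(confusion_counts):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p_joint = count / total
                info += p_joint * log2(p_joint / (p_presented[i] * p_predicted[j]))
    return info

perfect = [[10, 0], [0, 10]]  # every trial decoded correctly
chance = [[5, 5], [5, 5]]     # predictions unrelated to the stimulus

print(confusion_information(perfect), confusion_information(chance))
```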
Advantages of Shannon’s information
• Takes into account all possible ways in which a variable can carry
information about another variable
• More complete than simply decoding
(quantifies all sources of knowledge)
• Makes no assumption about the relation between stimulus and
response (we do not need to specify which features of the
stimulus activate the neuron and how they activate it)
Quian Quiroga & Panzeri (2009) Nature Reviews Neuroscience
Neurons may convey information in other ways than
reporting the most likely stimulus
If the receiver only uses the information about which is the
most likely stimulus, it may miss out on important
information even if it operates on the true P(s|r).
Quian Quiroga & Panzeri Nature Reviews Neurosci 2009
Advantages of information theory
• It quantifies single-trial stimulus discriminability on a
meaningful scale (bits)
• It is a principled measure of the correlation on a single
trial basis between neuronal responses and stimuli
• It works even in non-linear situations
• It makes no assumption about the relation between
stimulus and response – we do not need to specify
which features of the stimulus activate the neuron and
how they activate it
• Since it takes into account all ways in which a neuron
can reduce ignorance, it bounds the performance of any
biological decoder. Thus it can be used to explore which
spike train features are best for stimulus decoding,
without the limitations coming from committing to a
specific decoding algorithm
Advantages of decoding techniques
• Decoding algorithms are much easier to implement and
compute
• Calculations are robust even with limited amounts of
data
• They allow testing specific hypotheses on how downstream
systems may interpret the messages of other neurons
• More direct application to BMI
Neural Coding 3:
BIAS
The “plug-in” information estimation
The most direct way to compute information and entropies is the “plug-in” method:
estimate the response probabilities P(r|s) and P(r) as the experimental histogram of
the frequency of each response across the available trials and then plug these
empirical probability estimates into the entropy equations.
$$H(S|R) = -\sum_{r} P(r) \sum_{s} P(s|r) \log_2 P(s|r)$$
The limited sampling bias
The problem with the plug-in estimation is that these probabilities are not known but have to be measured
experimentally from the available data.
The estimated probabilities are subject to statistical error and necessarily fluctuate around their true values. The
significance of these finite sampling fluctuations is that they lead to both systematic error (bias) and statistical
error (variance) in estimates of entropies and information. These errors, particularly the bias, constitute the key
practical problem in the use of information with neural data.
If not corrected, bias can lead to serious misinterpretations of neural coding data.
Panzeri, S., Senatore, R., Montemurro, M. A.,
and Petersen, R. S. (2007). Correcting for the
sampling bias problem in spike train
information measures. J Neurophysiol 98,
1064-1072. Review paper
S Panzeri, C Magri, L Carraro (2008)
Sampling bias – Scholarpedia article (freely
accessible)
Statistics of information values:
bias and variance due to limited sampling
The bias of a given functional F of a probability distribution P is defined as the
difference between the trial-averaged value of F when the probability distributions are
computed from N trials only and the value of F computed with the true probability
distributions (obtained from an infinite number of trials).
Ince et al (2010) Frontiers Neurosci
Panzeri et al (2007) J Neurophysiol
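The definition of bias can be illustrated with a small simulation. The sketch below is a toy case: responses are drawn independently of the stimulus, so the true information is 0 bits and the trial-averaged plug-in value is itself the bias. The function names and all parameter values are assumptions chosen for the example.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    """Entropy in bits of the empirical distribution of a list of samples."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def plugin_information(stimuli, responses):
    """Plug-in I(S;R) = H(S) + H(R) - H(S,R)."""
    return entropy(stimuli) + entropy(responses) - entropy(list(zip(stimuli, responses)))

def mean_plugin_information(trials_per_stimulus, n_stimuli=2, n_responses=5,
                            repetitions=200, seed=0):
    """Average the plug-in information over repeated simulated experiments."""
    rng = random.Random(seed)
    stimuli = [s for s in range(n_stimuli) for _ in range(trials_per_stimulus)]
    total = 0.0
    for _ in range(repetitions):
        # Responses do not depend on the stimulus: the true information is 0.
        responses = [rng.randrange(n_responses) for _ in stimuli]
        total += plugin_information(stimuli, responses)
    return total / repetitions

for n in (5, 20, 80, 320):
    print(n, round(mean_plugin_information(n), 4))
```

The averaged estimate stays above zero and shrinks as the number of trials per stimulus grows.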
The Information Bias – Asymptotic expansion
$$\mathrm{BIAS}(I) = \langle I(P_N) \rangle_N - I(P_\infty) = \frac{C_1}{2N} + \frac{C_2}{N^2} + \dots$$
This expansion converges only if N is large (large enough
that each response with non-zero probability is observed
several times)
Note: All coefficients are non-negative
Miller, 1955
Treves & Panzeri 1995
Panzeri & Treves, Network 1996
Victor, Neur. Comp. 2000
Paninski, Neur. Comp. 2003
The Information Bias – Leading coefficient
$$H(R|S) = -\sum_{s} P(s) \sum_{r} P(r|s) \log_2 P(r|s)$$
$$\mathrm{Bias}(H(R|S)) = -\frac{1}{2N \ln(2)} \sum_{s} (R_s - 1)$$
R_s = size of the support of P(r|s), i.e. the number of responses with probability > 0
of being observed upon
presentation of s
$$H(R) = -\sum_{r} P(r) \log_2 P(r)$$
$$\mathrm{Bias}(H(R)) = -\frac{1}{2N \ln(2)} (R - 1)$$
R = number of responses with probability > 0 of
being observed upon
presentation of any s
Entropies are negatively biased – H(R|S) far more biased
than H(R)
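The leading bias terms can be evaluated as sketched below. Two assumptions are made: N is the total number of trials, and R_s and R are approximated by counting the distinct responses actually observed, which can underestimate the true support. The data are hypothetical.

```python
from math import log

def leading_bias_terms(responses_by_stimulus):
    """Leading-order bias (bits) of H(R|S), H(R) and I = H(R) - H(R|S)."""
    n_total = sum(len(r) for r in responses_by_stimulus.values())
    # Naive support sizes: number of distinct responses observed.
    support_per_stimulus = [len(set(r)) for r in responses_by_stimulus.values()]
    support_overall = len(set().union(*map(set, responses_by_stimulus.values())))
    scale = 1.0 / (2.0 * n_total * log(2))
    bias_h_r_given_s = -scale * sum(rs - 1 for rs in support_per_stimulus)
    bias_h_r = -scale * (support_overall - 1)
    bias_information = bias_h_r - bias_h_r_given_s
    return bias_h_r_given_s, bias_h_r, bias_information

# Hypothetical spike counts with overlapping response ranges.
data = {"Stim 1": [3, 4, 5, 4, 6, 5], "Stim 2": [4, 5, 6, 5, 7, 6]}
bias_h_r_given_s, bias_h_r, bias_information = leading_bias_terms(data)
print(bias_h_r_given_s, bias_h_r, bias_information)
```

Both entropies come out negatively biased, with H(R|S) more so, which leaves an upward bias in the information.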
Procedures to alleviate or eliminate the bias
• Bootstrapping and subtracting
• Quadratic Extrapolation
• Panzeri-Treves correction
• Shuffling (for 2 dimensions or more)
The bootstrapping and subtracting method
One way to estimate the bias is to randomly “bootstrap”
stimulus-response data pairs, i.e. to pair each response with a randomly chosen stimulus. The resulting information Iboot
should be zero.
In practice, because of finite sampling, we will find Iboot > 0. The
mean of the distribution of Iboot is equal to the bias of Iboot.
           Stim 1   Stim 2   Stim 3
Trial 1    4        3        1
Trial 2    3        2        1
Trial 3    3        2        1
Trial 4    3        3        2
Trial 5    5        2        1
Tot        18       12       6

boot
           Stim 1   Stim 2   Stim 3
Trial 1    1        3        4
Trial 2    3        1        2
Trial 3    1        3        2
Trial 4    3        3        2
Trial 5    1        5        2
Tot        9        15       12
The bootstrapping and subtracting method
$$\mathrm{Bias}(I) = \frac{1}{2N \ln(2)} \left[ \sum_{s} (R_s - 1) - (R - 1) \right]$$
$$\mathrm{Bias}(I_{boot}) = \frac{1}{2N \ln(2)} \left[ \sum_{s} (R_s^{boot} - 1) - (R^{boot} - 1) \right]$$
Bootstrapping increases the support of P(r,s)
Bias of Iboot is typically higher than the bias of I
[Figure: P(r,s) for bootstrapped and real data; axes: r, s]
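The bootstrap-and-subtract procedure can be sketched as follows. It takes the responses from the example table with five trials and three stimuli, and breaks the stimulus-response pairing by shuffling the responses across all trials. The function names, the number of shuffles and the seed are assumptions made for this example.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    """Entropy in bits of the empirical distribution of a list of samples."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def plugin_information(stimuli, responses):
    """Plug-in I(S;R) = H(S) + H(R) - H(S,R)."""
    return entropy(stimuli) + entropy(responses) - entropy(list(zip(stimuli, responses)))

def bootstrap_corrected_information(stimuli, responses, n_shuffles=200, seed=0):
    """Return (raw estimate, mean Iboot, raw estimate minus mean Iboot)."""
    rng = random.Random(seed)
    raw = plugin_information(stimuli, responses)
    shuffled = list(responses)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroys any stimulus-response relation
        total += plugin_information(stimuli, shuffled)
    bias_estimate = total / n_shuffles
    return raw, bias_estimate, raw - bias_estimate

# Example table read row by row: 5 trials x 3 stimuli.
stimuli = [1, 2, 3] * 5
responses = [4, 3, 1, 3, 2, 1, 3, 2, 1, 3, 3, 2, 5, 2, 1]
raw, bias_estimate, corrected = bootstrap_corrected_information(stimuli, responses)
print(raw, bias_estimate, corrected)
```

Since the bias of Iboot is typically higher than the bias of I, the subtraction tends to overcorrect.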
The Quadratic Extrapolation
$$I_N = I_{true} + \frac{C_1}{2N} + \frac{C_2}{N^2} + \dots$$
-) Collect N data
-) Divide them into blocks of N/2, N/4 …
-) Compute the average information for N, N/2, N/4 ... data and
fit the dependence of I on N to the above quadratic expression
-) Estimate the true (N=∞) value from the best-fit expression
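The steps above can be sketched as follows, with two simplifications. Only three data sizes (N, N/2, N/4) are used, so the quadratic in 1/N passes exactly through the three averages and no least-squares fit is needed. Blocks are random disjoint subsets of the trials. The function names and the simulated data are hypothetical.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def plugin_information(stimuli, responses):
    return entropy(stimuli) + entropy(responses) - entropy(list(zip(stimuli, responses)))

def block_average(stimuli, responses, n_blocks, rng):
    """Average plug-in information over n_blocks disjoint random subsets of trials."""
    order = list(range(len(stimuli)))
    rng.shuffle(order)
    size = len(order) // n_blocks
    values = []
    for b in range(n_blocks):
        idx = order[b * size:(b + 1) * size]
        values.append(plugin_information([stimuli[i] for i in idx],
                                         [responses[i] for i in idx]))
    return sum(values) / n_blocks

def extrapolate(sizes, values):
    """Value at 1/N = 0 of the quadratic in 1/N through the given points."""
    xs = [1.0 / n for n in sizes]
    estimate = 0.0
    for i, (xi, yi) in enumerate(zip(xs, values)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (0.0 - xj) / (xi - xj)  # Lagrange basis evaluated at 0
        estimate += yi * weight
    return estimate

# Hypothetical data: 2 stimuli, 64 trials each, overlapping response ranges.
rng = random.Random(0)
stimuli = [s for s in (0, 1) for _ in range(64)]
responses = [2 * s + rng.randrange(4) for s in stimuli]
n = len(stimuli)
values = [block_average(stimuli, responses, k, rng) for k in (1, 2, 4)]
print(values, extrapolate([n, n // 2, n // 4], values))
```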
The Panzeri-Treves bias estimate
Correlation and bias
With multi-dimensional codes, there is extra noise
added by the correlations between the responses
... + Very bad sampling, because the number of parameters scales exponentially with the response dimension L
1) Compute Hind(R|S)
with independent
probabilities for each
dimension
2) Compute Hsh(R|S): shuffle
the trial order randomly at fixed time and
stimulus condition, then compute
the entropy of the shuffled distribution
R1, R2     Stim 1   Stim 2
Trial 1    10, 3    6, 4
Trial 2    8, 5     8, 3
Trial 3    9, 4     7, 5

R1, R2     Stim 1   Stim 2
Trial 1    10, 5    7, 4
Trial 2    9, 4     8, 5
Trial 3    8, 3     6, 3
SH
$$I_{sh}(S;R) = H(R) - H_{ind}(R|S) + H_{sh}(R|S) - H(R|S)$$
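A minimal sketch of the I_sh computation follows, assuming each trial's response is a tuple with one entry per dimension and all entropies are raw plug-in estimates. H_ind(R|s) is taken as the sum of the single-dimension entropies, and H_sh(R|s) uses one random shuffle of each dimension across trials of the same stimulus. The function name, the seed and the simulated data are hypothetical.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    """Entropy in bits of the empirical distribution of a list of samples."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def shuffled_information(responses_by_stimulus, seed=0):
    """I_sh = H(R) - H_ind(R|S) + H_sh(R|S) - H(R|S), in bits."""
    rng = random.Random(seed)
    n_total = sum(len(t) for t in responses_by_stimulus.values())
    all_responses = [r for trials in responses_by_stimulus.values() for r in trials]
    h_r = entropy(all_responses)
    h_r_given_s = h_ind = h_sh = 0.0
    for trials in responses_by_stimulus.values():
        p_s = len(trials) / n_total
        columns = [list(col) for col in zip(*trials)]  # one list per dimension
        h_r_given_s += p_s * entropy(trials)
        # Independent dimensions: the joint entropy is the sum of the marginals.
        h_ind += p_s * sum(entropy(col) for col in columns)
        for col in columns:
            rng.shuffle(col)  # shuffle trial order within this stimulus
        h_sh += p_s * entropy(list(zip(*columns)))
    return h_r - h_ind + h_sh - h_r_given_s

# Hypothetical correlated two-dimensional responses, 200 trials per stimulus.
rng = random.Random(1)
data = {}
for s in (0, 1):
    trials = []
    for _ in range(200):
        shared = rng.randrange(2)  # common fluctuation shared by both dimensions
        trials.append((s + shared + rng.randrange(2), s + shared + rng.randrange(2)))
    data[s] = trials
print(shuffled_information(data))
```

With one-dimensional responses the shuffle changes nothing and I_sh reduces to the plain plug-in estimate.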
The Shuffling subtraction method
$$\mathrm{Bias}(I) = \frac{1}{2N \ln(2)} \left[ \sum_{s} (R_s - 1) - (R - 1) \right]$$
$$\mathrm{Bias}(I_{sh}) = \frac{1}{2N \ln(2)} \left[ \sum_{s} (R_{s,sh} - 1) - (R_{sh} - 1) \right]$$
Shuffling increases the support of P(r,s)
Bias of Ish is typically higher than the bias of I
[Figure: P(r,s) for shuffled and real data; axes: r, s]
Sampling rules of thumb for I(S;R)
The previous results suggest rules of thumb for data
sampling (i.e. the number of trials per stimulus Ns)
necessary to obtain unbiased information estimates:
If no bias correction is used:
Ns >> number of possible responses R
If we use an appropriate bias correction, then the
constraint on data becomes much less severe:
Ns ≥ number of possible responses R
Submitted last week……
We are always trying to improve!!!
Toolbox
www.ibtb.org