Information Theory Metrics
Giancarlo Schrementi
Expected Value
$EV = \sum_{i \in X} p(i)\,v(i)$
• Example: Die Roll
  (1/6)*1 + (1/6)*2 + (1/6)*3 + (1/6)*4 + (1/6)*5 + (1/6)*6 = 3.5
The equation gives you the value that you
can expect an outcome to take, on average.
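As a quick check of the die-roll arithmetic, here is a minimal sketch in Python (my own illustration, not from the original slides) that computes the expected value directly from the p(i)·v(i) template:

```python
# Expected value of a fair six-sided die: EV = sum of p(i) * v(i).
outcomes = [1, 2, 3, 4, 5, 6]
ev = sum((1 / 6) * v for v in outcomes)
print(ev)  # 3.5
```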
A Template Equation
$EV = \sum_{i=1}^{N} p(i)\,v(i)$
• Expected Value forms a template for many
of the equations in Information Theory
• Notice it has three parts: a summation over
all possible events, the probability of an event,
and the value of that event.
• It can be thought of as a weighted average

Entropy as an EV
$H(X) = -\sum_{i \in X} p(i)\,\lg(p(i))$
• Notice it has the same three parts
• The value here is the negative log base 2 of the
probability, which can be seen as the amount
of information that an event transmits.
• Since it's a logarithm, less likely
occurrences are more informative than highly
likely occurrences.
• So Entropy can be thought of as the expected
amount of information that we will receive.
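A minimal sketch of this calculation, assuming the probabilities are given as a plain list that sums to one (the helper name is mine, not from the slides):

```python
import math

def entropy(probs):
    # H(X) = -sum over i of p(i) * lg(p(i)); terms with p(i) = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin transmits 1 bit per flip; a heavily biased coin transmits less,
# matching the idea that unlikely events are the informative ones.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # ~0.47
```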

Mutual Information
$I(X;Y) = \sum_{i,j \in X,Y} p(i,j)\,\lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)$
• Our three friends show up again
• The value this time is the log of the joint
probability divided by the product of the
two marginal probabilities.
• This equation is definitely an expected
value equation but what does that value
tell us?
MI’s Value
$\lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)$
• The bottom tells us the probability of the two events
occurring if they are independent.
• The top is the actual probability of the two events
occurring, which is going to be different if they are
dependent in some way.
• Dividing one probability by the other tells us how
much more likely one is than the other.
• Thus the value tells us how much more likely the joint
event is than it would be if the two were independent,
giving us a measure of dependence.
• The lg scales the value to be in bits of information
that one event tells us about the next event.

Mutual Information as an EV
$I(X;Y) = \sum_{i,j \in X,Y} p(i,j)\,\lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)$
• Mutual Information can be looked at as the
expected number of bits that one event tells
us about another event in the distribution.
This can be thought of as how many fewer
bits are needed to encode the next event
because of your knowledge of the prior event.
• An MI of zero means that every event is
independent of every other event.
• It is also symmetric in X and Y and is always
non-negative.
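To make the "expected bits one event tells us about another" reading concrete, here is a small sketch (my own illustration, not from the slides) that computes I(X;Y) from a joint distribution given as a dictionary:

```python
import math

def mutual_information(joint):
    # I(X;Y) = sum over (i, j) of p(i, j) * lg( p(i, j) / (p(i) * p(j)) ).
    # `joint` maps (i, j) pairs to probabilities that sum to 1.
    px, py = {}, {}
    for (i, j), p in joint.items():  # marginals p(i) and p(j)
        px[i] = px.get(i, 0.0) + p
        py[j] = py.get(j, 0.0) + p
    return sum(p * math.log2(p / (px[i] * py[j]))
               for (i, j), p in joint.items() if p > 0)

# Two perfectly dependent coins share 1 bit; independent coins share 0.
print(mutual_information({('H', 'H'): 0.5, ('T', 'T'): 0.5}))          # 1.0
print(mutual_information({(a, b): 0.25 for a in 'HT' for b in 'HT'}))  # 0.0
```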
Kullback-Leibler Divergence
$KL(p,q) = \sum_{x} p(x)\,\lg\!\left(\frac{p(x)}{q(x)}\right)$
• Once again it looks like an EV equation.
• P and Q are two distributions and p(x) is the
probability of event x in P.
• The division in the value part tells us how much
more or less likely an event is in distribution P than it
would be in Q.
• The lg scales this to be in bits, but what does that
tell us?
What does KL tell us?
$KL(p,q) = \sum_{x} p(x)\,\lg\!\left(\frac{p(x)}{q(x)}\right)$
• The value part tells us what the informational
content difference is between event x in
distribution P and event x in distribution Q.

• So KL tells us the expected difference in
informational content of events in the two
distributions.
• This can be thought of as the difference in
number of bits needed to encode event X in
the two distributions.
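A minimal sketch of the KL calculation, assuming the two distributions are dictionaries over the same events and that q(x) > 0 wherever p(x) > 0 (the function and variable names are mine):

```python
import math

def kl_divergence(p, q):
    # KL(p, q) = sum over x of p(x) * lg( p(x) / q(x) ), in bits.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))  # ~0.74 extra bits when encoding p with a code built for q
print(kl_divergence(p, p))  # 0.0 when the distributions match
```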
Other KL Observations
$KL(p,q) = \sum_{x} p(x)\,\lg\!\left(\frac{p(x)}{q(x)}\right)$

• As it is written above, it is not symmetric and
does not obey the triangle inequality, but it is
non-negative.
• If p and q are the same then the equation
results in zero.
• Kullback and Leibler actually define it as the
sum of the equation above plus its
counterpart where p and q are switched. It then
becomes symmetric.
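A sketch of that symmetrized form, using the same dictionary conventions as the earlier example (the names are mine):

```python
import math

def kl(p, q):
    # One-directional divergence, as written above.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def symmetric_kl(p, q):
    # Kullback and Leibler's symmetrized version: KL(p, q) + KL(q, p).
    return kl(p, q) + kl(q, p)

p, q = {'a': 0.5, 'b': 0.5}, {'a': 0.9, 'b': 0.1}
print(symmetric_kl(p, q) == symmetric_kl(q, p))  # True: order no longer matters
```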
Comparing MI and KL
$I(X;Y) = \sum_{i,j \in X,Y} p(i,j)\,\lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)$
$KL(p,q) = \sum_{x} p(x)\,\lg\!\left(\frac{p(x)}{q(x)}\right)$

• The two equations are noticeably similar.
• In fact you can express MI as the KL divergence
between p(i,j) and p(i)p(j).
• This tells us that MI is computing the divergence
between the true distribution and the distribution
where each event is completely independent,
essentially computing KL with respect to a
reference point.
• KL divergence gives us a measure of how different
two distributions are; MI gives us a measure of how
different a distribution is from what it would be if its
events were independent.
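The identity mentioned above is easy to verify numerically: compute KL between the joint distribution and the product of its marginals and you get I(X;Y). A sketch under the same dictionary conventions as before:

```python
import math

def kl(p, q):
    # KL divergence in bits between two distributions over the same events.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# A joint distribution over two coins and the "everything independent" version of it.
joint = {('H', 'H'): 0.4, ('H', 'T'): 0.1, ('T', 'H'): 0.1, ('T', 'T'): 0.4}
px = {'H': 0.5, 'T': 0.5}
py = {'H': 0.5, 'T': 0.5}
independent = {(i, j): px[i] * py[j] for i in px for j in py}
print(kl(joint, independent))  # equals I(X;Y) for this joint distribution (~0.28 bits)
```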

Sample Application:
Chunk Parser
• Chunk parsers attempt to divide a sentence into
its logical/grammatical units. In this case
recursively, by splitting the sentence into two
chunks and then splitting those two chunks into
two chunks and so on.
• MI and KL are both measures that can tell us how
statistically related two words are to each other.
• If two words are highly related to each other, there
is not likely to be a chunk boundary between them;
likewise, if they are largely unrelated, then between
them is a good place to posit a chunk boundary
(a rough sketch of this procedure follows below).
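A rough sketch of the recursion described above, assuming some word-to-word relatedness score (a stand-in for the point-wise MI or KL defined on the next slide) has already been estimated from a corpus; the function names and toy scores are hypothetical:

```python
def chunk(words, relatedness):
    # Recursively split at the boundary where adjacent words are least related.
    if len(words) < 2:
        return words
    boundary = min(range(1, len(words)),
                   key=lambda i: relatedness(words[i - 1], words[i]))
    return [chunk(words[:boundary], relatedness),
            chunk(words[boundary:], relatedness)]

# Example with a toy relatedness function (purely illustrative scores):
toy = {('the', 'dog'): 3.0, ('dog', 'chased'): 0.5, ('chased', 'the'): 0.4,
       ('the', 'cat'): 2.5}
print(chunk('the dog chased the cat'.split(),
            lambda a, b: toy.get((a, b), 0.0)))
```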
MI and KL in this Task
$MI(\langle xy \rangle) = \lg\frac{p(\langle xy \rangle)}{p(x)\,p(y)}$
$KL(\langle xy \rangle) = \lg\frac{p(y \mid x)}{p(y)}$
• Here we are only concerned with the values at a point,
not the expected value over all points, so the
summations and the weight are missing, giving us a
'point-wise' calculation.
• These calculations are across a bigram <xy> in a text,
where x and y are two words.
• p(<xy>) is the probability of the bigram occurring.
• Note in the KL equation what P and Q are. Q is p(y),
which can be thought of as the prior, and P is p(y|x),
which can be thought of as the posterior. This gives
you the information difference that knowledge of x
brings to the probability of the right word y.
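A sketch of how these point-wise scores might be estimated from raw bigram counts, using plain relative frequencies with no smoothing; everything here is my illustration rather than the original implementation:

```python
import math
from collections import Counter

def pointwise_scores(words):
    # Returns {(x, y): (MI(<xy>), KL(<xy>))} for every bigram <xy> in the token list.
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_uni, n_bi = len(words), len(words) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi                      # p(<xy>)
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        p_y_given_x = c / unigrams[x]        # p(y | x)
        scores[(x, y)] = (math.log2(p_xy / (p_x * p_y)),   # MI(<xy>)
                          math.log2(p_y_given_x / p_y))    # KL(<xy>)
    return scores

# e.g. pointwise_scores("the dog chased the cat".split())
```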
Results of the Comparison
$MI(\langle xy \rangle) = \lg\frac{p(\langle xy \rangle)}{p(x)\,p(y)}$
$KL(\langle xy \rangle) = \lg\frac{p(y \mid x)}{p(y)}$
• First, note that these equations can result in negative
numbers if the top probability is lower than the bottom one.
Also, MI is no longer symmetrical because p(<xy>) is
different from p(<yx>).

• This asymmetry is useful for language analysis because
most languages are direction dependent and so we want our
metrics to be sensitive to that.
• They both provide an accurate dependence measure, but
the KL measure provides more exaggerated values the
further it gets from zero. This is because p(y|x) can get much
higher than p(<xy>) will ever be.
• This makes KL more useful in this context, in that
relationships are more striking and thus easier to
distinguish from noise.
One Final Variant
$\sum_{y \in Yx} p(\langle yx \rangle \mid x)\,\lg\frac{p(\langle yx \rangle)}{p(x)\,p(y)}$
$\sum_{y \in xY} p(\langle xy \rangle \mid x)\,\lg\frac{p(\langle xy \rangle)}{p(x)\,p(y)}$
• These two equations are variants of mutual
information designed to tell you how much a word
tells you about the words that could occur to
its left or to its right.
• Notice that it's an expected value over all the
bigrams in which the word occurs on the right or
the left, weighted by the conditional probability
of that bigram given the word in question.
• This can be used to give you an estimate of the
handedness of a word or how much the word
restricts what can occur to the left or the right of it.
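A sketch of these variants for a single target word, again with plain relative-frequency estimates; the weighting p(bigram | word) is approximated by that bigram's share of the bigrams containing the word on the relevant side (the names and details are my assumptions):

```python
import math
from collections import Counter

def handedness(words, target):
    # Returns (left, right): the expected point-wise MI between `target` and
    # the words occurring immediately to its left and to its right.
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_uni, n_bi = len(words), len(words) - 1

    def side(pairs):
        total = sum(pairs.values())
        if total == 0:
            return 0.0
        return sum((c / total) *                      # weight: p(bigram | target)
                   math.log2((c / n_bi) /
                             ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
                   for (x, y), c in pairs.items())

    left = {b: c for b, c in bigrams.items() if b[1] == target}   # <y target>
    right = {b: c for b, c in bigrams.items() if b[0] == target}  # <target y>
    return side(left), side(right)
```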
References
• S. Kullback and R. A. Leibler. On information and sufficiency.
Annals of Mathematical Statistics 22(1):79-86, March 1951.
• Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues,
Giancarlo Schrementi. "On Statistical Parameter Setting."
Proceedings of the First Workshop on Psycho-computational
Models of Human Language Acquisition (COLING-2004).
Geneva, Switzerland. August 28-29, 2004.
• Damir Ćavar, Paul Rodrigues, Giancarlo Schrementi. "Syntactic
Parsing Using Mutual Information and Relative Entropy."
Proceedings of the Midwest Computational Linguistics
Colloquium (MCLC). Bloomington, IN, USA. July 2004.
• http://en.wikipedia.org/wiki/Mutual_information
• http://en.wikipedia.org/wiki/Kullback-Leibler_divergence