Language and Information


Handout #1
SI 760 / EECS 597 / Ling 702
Language and Information
Winter 2004
Course Information
• Instructor: Dragomir R. Radev
([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: TBA
• Course page:
http://www.si.umich.edu/~radev/LNI-winter2004/
• Class meets on Thursdays, 1-4 PM in 412 WH
Introduction
Demos
• Google (www.google.com)
• AskJeeves (www.ask.com)
• OneAcross (www.oneacross.com)
• Systran (www.altavista.com)
• NewsInEssence (www.newsinessence.com)
• Also NSIR, IONAUT, Vivísimo, …
The Shannon game
• http://math.ucsd.edu/~crypto/java/ENTROPY/
• http://www.nightgarden.com/shannon.htm
• http://graphics.stanford.edu/~liyiwei/project/textSynthesis/textSynthesisDemoJava.html
• http://www.teamten.com/lawrence/projects/markov/
• Additional readings:
– http://home1.gte.net/deleyd/random/abramson.html
– http://www.cs.bell-labs.com/cm/cs/pearls/sec153.html
What this course is about
• Quantitative processing of textual data
(especially large corpora such as the Web)
• Connection with other courses:
– EECS 595/LING 541/SI 660 Natural Language
Processing
– SI 650 Information Retrieval
Syllabus (I)
• 1. The computational study of Language. Linguistic
Fundamentals.
• 2. Mathematical and Probabilistic Fundamentals.
Descriptive Statistics. Measures of central tendency.
The z score. Hypothesis testing.
• 3. Information theory. Entropy, joint entropy,
conditional entropy. Relative entropy and mutual
information. Chain rules. The entropy of English.
• 4. Working with corpora. N-grams.
• 5. Language models. Hidden Markov Models. Noisy
channel models. Applications to Part-of-speech tagging
and other problems.
Syllabus (II)
• 6. Cluster analysis. Distributional clustering.
• 7. Collocations. Syntactic criteria for collocability.
• 8. Literary detective work. The statistical analysis of
writing style.
• 9. Text summarization. Cross-document structure
theory.
• 10. Lexical semantics. WordNet
Syllabus (III)
• 11. Information Extraction. Question Answering.
• 12. Word sense disambiguation.
• 13. Lexical acquisition.
• 14. Paraphrase acquisition.
• 15. Possible additional topics: Text alignment. Statistical machine translation. Discourse segmentation.
Grading
• Assignments (25%)
– The assignments will involve analysis of real textual
data using both manual and automated techniques.
• Project (30%)
– Programming project or research paper.
• Survey paper (15%)
• Final (30%)
– A mixture of short-answer and essay-type questions.
Projects
Each student will be responsible for designing and
completing a research project that demonstrates the ability
to use concepts from the class in addressing a practical
problem. A significant part of the final grade will depend
on the project assignment. Students will need to submit a
project proposal, a progress report, and the project itself.
Students can elect to do a project on an assigned topic, or
to select a topic of their own.
The final version of the project will be put on the
World Wide Web, and will be defended in front of the class
at the end of the semester.
Readings
• Required books
– Manning and Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
– Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, 1998.
• Reference readings
– Jurafsky and Martin. Speech and Language Processing. Prentice Hall, 2000.
– Cover and Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
• Additional handouts (articles, documentation, tutorials)
Main Research Forums
• Conferences: ACL, SIGIR, HLT/NAACL, COLING,
EACL/NAACL, AMTA/MT Summit, ICSLP/Eurospeech
• Journals: Computational Linguistics, Natural Language
Engineering, Information Retrieval, Information
Processing and Management, ACM Transactions on
Information Systems, ML, AI, JAIR, TALIP, etc.
• University centers: Columbia, CMU, UMass, MIT, UPenn,
USC/ISI, JHU, Stanford, Brown, Michigan, Maryland,
Edinburgh, Cambridge, Saarbrücken, Kyoto, and many
others
• Industrial research sites: IBM, Google, Microsoft, AT&T,
Bell Labs, PARC, SRI, BBN, MITRE
Tutorial 1:
Computational Linguistics
Why is language processing difficult?
• Ambiguous words:
– ball, board, plant
– fly, rent, tape
– address, resent, entrance, number
• Ambiguous sentences:
– Hijack a test for Putin (CNN, 03/15/2001)
– Prague battles flood waters
(http://news.bbc.co.uk/1/hi/world/europe/2192288.stm)
– U.S. eyes return to the moon
(http://www.cnn.com/2003/TECH/space/12/04/us.moon/index.html)
Pressure cooker-like machine proposed for mad cows
Monday, January 12, 2004 Posted: 10:21 AM EST (1521 GMT)
NEW YORK (Reuters) -- The bodies of dead cattle infected with mad
cow disease are usually burned to destroy the misshapen proteins
suspected of causing the brain-wasting ailment -- although there are
doubts whether this is safe, cost-effective or environmentally
sound.
But an Indiana-based company, set up by two professors from Albany
Medical College, now claims to have an effective alternative. You don't
have to go further than your kitchen sink to understand the science.
Their company, Waste Reduction by Waste Reduction Inc., says that by
using the kinds of chemicals that go into a drain-clearing product such as
Drano, they can safely break down the suspected disease-causing
proteins, known as prions.
Prions are misshaped proteins believed to cause bovine spongiform
encephalopathy, or mad cow disease. They eat at the brain tissue of
cattle by forcing proteins performing other jobs to take their shape,
resulting in a chain reaction.
Syntactic categories
• Substitution test:
Joseph eats {Chinese | hot | fresh | vegetarian} food.
• Open (lexical) and closed (functional) categories:
  – open: no-fly-zone, yadda yadda yadda
  – closed: the, in
Morphology
The dog chased the yellow bird.
• Parts of speech: eight (or so) general types
• Inflection (number, person, tense…)
• Derivation (adjective-adverb, noun-verb)
• Compounding (separate words or single word)
• Part-of-speech tagging
• Morphological analysis (prefix, root, suffix, ending)
Part of Speech Tags
Brown corpus - 79 tags
NN   /* singular noun */
IN   /* preposition */
AT   /* article */
NP   /* proper noun */
JJ   /* adjective */
,    /* comma */
NNS  /* plural noun */
CC   /* conjunction */
RB   /* adverb */
VB   /* un-inflected verb */
VBN  /* verb +en (taken, looked (passive, perfect)) */
VBD  /* verb +ed (took, looked (past tense)) */
CS   /* subordinating conjunction */
Jabberwocky (Lewis Carroll)
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
Nouns
• Nouns: dog, tree, computer, idea
• Nouns vary in number (singular, plural),
gender (masculine, feminine, neuter), case
(nominative, genitive, accusative, dative)
• Latin: filius (m), filia (f), filium (object)
German: Mädchen
• Clitics (‘s)
Pronouns
• Pronouns: she, ourselves, mine
• Pronouns vary in person, gender, number, case (in
English: nominative, accusative, possessive, 2nd
possessive, reflexive)
Joe bought him an ice cream.
Joe bought himself an ice cream.
• Anaphors: herself, each other
Determiners and Adjectives
• Articles: the, a
• Demonstratives: this, that
• Adjectives: describe properties
• Attributive and predicative adjectives
• Agreement: in gender, number
• Comparative and superlative (derivative and periphrastic)
• Positive form
Verbs
• Actions, activities, and states (throw, walk, have)
• English: four verb forms
• tenses: present, past, future
• other inflection: number, person
• gerunds and infinitive
• aspect: progressive, perfective
• voice: active, passive
• participles, auxiliaries
• irregular verbs
• French and Finnish: many more inflections than English
Other Parts of Speech
• Adverbs, prepositions, particles
• phrasal verbs (the plane took off, take it off)
• particles vs. prepositions (she ran up a bill/hill)
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: if, because, that, although
• Interjections: Ouch!
Phrase-structure Grammars
Alice bought Bob flowers.
Bob bought Alice flowers.
• Constituent order (SVO, SOV)
• imperative forms
• sentences with auxiliary verbs
• interrogative sentences
• declarative sentences
• start symbol and rewrite rules
• context-free view of language
Sample Phrase-structure
Grammar
S   → NP VP
NP  → AT NNS
NP  → AT NN
NP  → NP PP
VP  → VP PP
VP  → VBD
VP  → VBD NP
PP  → IN NP

AT  → the
NNS → drivers
NNS → teachers
NNS → lakes
VBD → drank
VBD → ate
VBD → saw
IN  → in
IN  → of
NN  → cake
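As a rough illustration (assuming the NLTK toolkit is installed, which the course does not require), the grammar above can be encoded and used to parse a sentence it generates:

```python
# A minimal sketch: encode the sample phrase-structure grammar and parse with it.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> AT NNS | AT NN | NP PP
VP -> VP PP | VBD | VBD NP
PP -> IN NP
AT -> 'the'
NNS -> 'drivers' | 'teachers' | 'lakes'
VBD -> 'drank' | 'ate' | 'saw'
IN -> 'in' | 'of'
NN -> 'cake'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the teachers ate the cake".split()):
    print(tree)   # prints the phrase-structure tree(s) licensed by the grammar
```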
Phrase-structure Grammars
• Local dependencies
• Non-local dependencies
• Subject-verb agreement
The students who wrote the best essays were given a reward.
• wh-extraction
Should Derek read a magazine?
Which magazine should Derek read?
• Empty nodes
Phrase Structure Ambiguity
• Grammars are used for generating and parsing sentences
• Parses
• Syntactic ambiguity
• Attachment ambiguity: Visiting relatives can be boring. The children ate the cake with a spoon.
• High vs. low attachment
• Garden path sentences: The horse raced past the barn fell.
• Is the book on the table red?
Ungrammaticality vs. Semantic
Abnormality
* Slept children the.
# Colorless green ideas sleep furiously.
# The cat barked.
Semantics and Pragmatics
• Lexical semantics and compositional semantics
• Hypernyms, hyponyms, antonyms, meronyms and
holonyms (part-whole relationship, tire is a
meronym of car), synonyms, homonyms
• Senses of words, polysemous words
• Homophony (bass).
• Collocations: white hair, white wine
• Idioms: to kick the bucket
Discourse Analysis
• Anaphoric relations:
1. Mary helped Peter get out of the car. He thanked her.
2. Mary helped the other passenger out of the car.
The man had asked her for help because of his foot
injury.
• Information extraction problems (entity cross-referencing)
Hurricane Hugo destroyed 20,000 Florida homes.
At an estimated cost of one billion dollars, the disaster
has been the most costly in the state’s history.
Pragmatics
• The study of how knowledge about the world and
language conventions interact with literal
meaning.
• Speech acts
• Research issues: resolution of anaphoric relations,
modeling of speech acts in dialogues
Other Research Areas
• Linguistics is traditionally divided into phonetics,
phonology, morphology, syntax, semantics, and
pragmatics.
• Sociolinguistics: interactions of social organization and
language.
• Historical linguistics: change over time.
• Linguistic typology
• Language acquisition
• Psycholinguistics: real-time production and perception of
language
Tutorial 2:
Mathematical Foundations
Probability Spaces
• Probability theory: predicting how likely it is that
something will happen
• basic concepts: experiment (trial), basic outcomes, sample space Ω
• discrete and continuous sample spaces
• for NLP: mostly discrete spaces
• events
• Ω is the certain event while ∅ is the impossible event
• event space: all possible events
Probability Spaces
• Probabilities: numbers between 0 and 1
• Probability function (distribution): distributes a probability mass of 1 throughout the sample space Ω.
• Example: coin is tossed three times. What is the
probability of 2 heads?
• Uniform distribution
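A minimal sketch of the three-tosses example above, enumerating the sample space directly:

```python
# Enumerate the 8 equally likely outcomes of three coin tosses
# and count those with exactly two heads.
from itertools import product

outcomes = list(product("HT", repeat=3))                 # the sample space
favorable = [o for o in outcomes if o.count("H") == 2]
print(len(favorable), "/", len(outcomes))                # 3 / 8
print("P(2 heads) =", len(favorable) / len(outcomes))    # 0.375
```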
Conditional Probability and
Independence
• Prior and posterior probability
• P(A|B) = P(A ∩ B) / P(B)
[Diagram: sample space Ω containing events A and B with overlap A ∩ B]
Conditional Probability and
Independence
• The chain rule:
  P(A1 ∩ … ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) … P(An | A1 ∩ … ∩ An-1)
• This rule is used in many ways in statistical NLP, most notably in Markov models.
• Two events are independent when P(A ∩ B) = P(A)P(B)
• Unless P(B) = 0, this is equivalent to saying that P(A) = P(A|B)
• If two events are not independent, they are considered dependent
Bayes’ Theorem
• Bayes’ theorem is used to calculate P(A|B) given
P(B|A).
P(B|A) = P(A|B)P(B) / P(A) = P(B ∩ A) / P(A)
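A small illustrative sketch of Bayes' theorem with made-up numbers (the events and probabilities below are hypothetical, not from the course):

```python
# B = "document is relevant", A = "document contains the word 'entropy'".
p_B = 0.1               # prior P(B)            (hypothetical value)
p_A_given_B = 0.8       # P(A|B)                (hypothetical value)
p_A_given_notB = 0.05   # P(A|not B)            (hypothetical value)

# Total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))   # 0.64
```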
Random Variables
• Simply a function:
X: Ω → Rⁿ
• The numbers are generated by a stochastic process with a
certain probability distribution.
• Example: the discrete random variable X that is the sum of
the faces of two randomly thrown dice.
• Probability mass function (pmf) which gives the
probability that the random variable has different numeric
values:
P(x) = P(X = x) = P(Ax) where Ax = {ω ∈ Ω : X(ω) = x}
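A minimal sketch of the two-dice example: computing the pmf of the sum by enumerating the 36 equally likely outcomes:

```python
# pmf of X = sum of the faces of two fair dice.
from itertools import product
from collections import Counter

counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: n / 36 for x, n in sorted(counts.items())}
for x, p in pmf.items():
    print(x, round(p, 3))      # e.g. P(X=7) = 6/36
print(sum(pmf.values()))       # 1.0 -- the probability mass sums to one
```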
Random Variables
• If a random variable X is distributed according to the pmf p(x), then we write X ~ p(x)
• For a discrete random variable, we have that:
Σ p(xi) = Σ P(Axi) = P(Ω) = 1
Measures of Central Tendency
• Mode: the most frequent score in a data set
• Median: central score of the distribution
• Mean: average of all scores
Examples
• Split “Moby Dick” into 135 files (“pages”).
• Occurrences of the word “the” in the first
15 pages:
Data: 17 125 99 300 80 36 43 65 78 259 62 36 40 120 45
Mean: 93.67
Median: 65
Mode: 36
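The same statistics can be recomputed with Python's standard statistics module (a sketch, not part of the course materials):

```python
# Mean, median, and mode of the per-page counts of "the" listed above.
import statistics

data = [17, 125, 99, 300, 80, 36, 43, 65, 78, 259, 62, 36, 40, 120, 45]
print(round(statistics.mean(data), 2))   # 93.67
print(statistics.median(data))           # 65
print(statistics.mode(data))             # 36
```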
Expectation and Variance
• Expectation = mean (average) of a random variable.
• If X is a random variable with a pmf p(x), such that Σ |x| p(x) < ∞, then the expectation is:
  E(X) = Σ x p(x)
• Example: rolling one die (see the sketch below)
• Variance = measure of whether the values of the random variable tend to be consistent over trials or to vary a lot.
  Var(X) = E((X − E(X))²) = E(X²) − E²(X)
• Standard deviation = square root of variance
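A small sketch of the one-die example, computing E(X) and Var(X) directly from the definitions:

```python
# Expectation and variance of one fair six-sided die.
faces = range(1, 7)
p = 1 / 6
E  = sum(x * p for x in faces)                 # E(X) = 3.5
E2 = sum(x * x * p for x in faces)             # E(X^2)
var = E2 - E ** 2                              # Var(X) = E(X^2) - E(X)^2
print(E, round(var, 3), round(var ** 0.5, 3))  # 3.5, 2.917, 1.708
```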
Expectation and Variance
• Composition of functions:
  E(g(Y)) = Σ g(y) p(y)
• Examples:
  – If g(Y) = aY + b, then E(g(Y)) = aE(Y) + b
  – E(X+Y) = E(X) + E(Y)
  – E(XY) = E(X)E(Y), if X and Y are independent
Joint and Conditional
Distributions
• Joint (multivariate) probability distributions:
p(x,y) = P(X = x , Y = y)
• Marginal pmf:
  pX(x) = Σy p(x,y)
  pY(y) = Σx p(x,y)
• If X and Y are independent:
  p(x,y) = pX(x) pY(y)
Joint and Conditional
Distributions
• Conditional pmf in terms of the joint distribution:
  pX|Y(x|y) = p(x,y) / pY(y),  for y such that pY(y) > 0
Determining P
• Estimation
• Example: “The cow chewed its cud”
• Relative frequency
• Parametric approach (doesn’t work for distribution of words in newspaper articles in a particular topic category)
• Non-parametric approach
The Binomial Distribution
• The number r of successes out of n trials given that
the probability of success in any single trial is p:
B(r; n, p) = (n choose r) p^r (1 − p)^(n − r),   where   (n choose r) = n! / ((n − r)! r!)
• Example: tossing a (possibly weighted) coin n
times.
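A minimal sketch of the binomial pmf using Python's math.comb, applied to the earlier two-heads-in-three-tosses question:

```python
# B(r; n, p) = C(n, r) p^r (1-p)^(n-r)
from math import comb

def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binom_pmf(2, 3, 0.5))   # 0.375, i.e. 3/8
print(binom_pmf(2, 3, 0.7))   # the same count with a weighted coin
```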
Pascal’s Triangle
            1
           1 1
          1 2 1
         1 3 3 1
        1 4 6 4 1
      1 5 10 10 5 1
(rows give the binomial coefficients in the expansion of (p + q)^n)
The Normal Distribution
• Describes a continuous distribution
n(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
• Standard normal distribution: when μ = 0 and σ = 1
• In statistics, normal distribution is often used to
approximate the binomial distribution. It should only be
used when np(1-p) > 5
Skewed Normal Distributions
• Positively skewed (most of the data is below the
mean)
• Negatively skewed (the opposite)
• Bimodal distributions
• In corpus analysis: the number of letters in a word
or the length of a verse in syllables is usually
positively skewed
• Lognormal distributions
Central Limit Theorem
When samples are repeatedly drawn from a population, the means of the samples are normally distributed around the population mean. This occurs whether or not the underlying distribution itself is normal.
Measures of Variability
• Variance = Σ (x − m)² / (N − 1)
• Range
• Standard deviation is the square root of the variance
• Semi inter-quartile range (25% - 75% range): Michigan
SAT scores (1180-1380)
Data: 17 125 99 300 80 36 43 65 78 259 62 36 40 120 45
Mean: 93.67
Median: 65
Variance: 6729.52
Standard Deviation: 82.03
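A sketch recomputing the sample variance and standard deviation of the page counts above (statistics.variance and statistics.stdev use the N − 1 denominator):

```python
# Sample variance and standard deviation of the per-page counts of "the".
import statistics

data = [17, 125, 99, 300, 80, 36, 43, 65, 78, 259, 62, 36, 40, 120, 45]
print(round(statistics.variance(data), 2))   # 6729.52
print(round(statistics.stdev(data), 2))      # 82.03
```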
z-score
• A measure of how far a value is from the mean, in
terms of standard deviations
• Example: m = 93, s = 82. Let’s consider a page with 144 occurrences of the word “the”. The z-score for that page is:
z = (144-93)/82 = 0.62
• Using the table on pages 258-259 of Oakes, we
find that the new page is at the 26th percentile
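A sketch of the z-score computation, using the error function to get the upper-tail area of the standard normal (the proportion of pages expected to exceed this count):

```python
# z-score of a page with 144 occurrences, given m = 93 and s = 82.
from math import erf, sqrt

mean, sd, x = 93, 82, 144
z = (x - mean) / sd
upper_tail = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # 1 - Phi(z)
print(round(z, 2))            # 0.62
print(round(upper_tail, 2))   # about 0.27 of pages lie above this z-score
```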
Hypothesis Testing
• If two data sets are both normally distributed, and the
means and standard deviations are known
• Example: Francis and Kucera reported that the mean
sentence length in government documents is 25.48 words,
while in the Present-Day Edited American English corpus,
the mean length is only 19.27 words
Hypotheses
• Null hypothesis: that the difference can be
explained in terms of chance and natural
variability
• Statistical significance: when there is less than 5%
chance that the null hypothesis holds
T-testing
• Tests the difference between two groups for
normally-distributed interval data
• The t-test is normally used with small samples:
less than 30 items
• The one-sample study compares a sample mean
with an established population
Tobs = (x - m) / stderr
Example 1
• Mixed corpus: 2.5 verbs per sentence with 1.2
standard deviation
• Scientific corpus: 3.5 verbs per sentence with 1.6
standard deviation
• number of sentences in the scientific corpus: 100
• standard error in scientific corpus: 3.5/10
• observed value of t = (3.5-2.5)/0.35 = 2.86
Example 1 (Cont’d)
• Number of degrees of freedom: in the example, 99
• Use table on page 260 of Oakes
• Find value: 1.671
• The observed value of t is larger, therefore the null hypothesis can be rejected
Tests for Difference
Tobs = (x̄1 − x̄2) / stderr
stderr² = s1²/n1 + s2²/n2
Control (n=8): 10  5  3  6  4  4  7  9
Test (n=7):     8  1  2  1  3  4  2
Example 2
stderr = √(2.27 × 2.27 / 7 + 2.21 × 2.21 / 8) = √(0.736 + 0.611) = √1.347 = 1.161
t = (6 − 3) / 1.161 = 2.584
Example 2 (Cont’d)
• Number of degrees of freedom:
7 + 8 - 2 = 13
• critical value of significance at the 5 per cent level
is 2.16
• Since the observed value is greater than 2.16, we
can reject the null hypothesis
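A sketch reproducing Example 2, taking the group standard deviations 2.27 and 2.21 as given on the slide:

```python
# Two-sample t statistic for the control and test groups above.
from math import sqrt

control = [10, 5, 3, 6, 4, 4, 7, 9]   # n = 8, mean 6
test    = [8, 1, 2, 1, 3, 4, 2]       # n = 7, mean 3
m1, m2 = sum(control) / len(control), sum(test) / len(test)

stderr = sqrt(2.27**2 / 7 + 2.21**2 / 8)   # SDs taken from the slide
t = (m1 - m2) / stderr
df = len(control) + len(test) - 2
print(round(m1, 1), round(m2, 1))      # 6.0 3.0
print(round(stderr, 2), round(t, 2))   # 1.16 2.59 (the slide's 1.161 and 2.584, up to rounding)
print(df)                              # 13
```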
Parametric and Non-parametric
Tests
• Four scales of measurement: ratio, interval,
ordinal, nominal
• parametric tests (e.g., t-test): interval or ratio-scored dependent variables; assumes independent observations; usually normal distributions only
• non-parametric tests: mostly for frequencies and
rank-ordered scales; any type of distributions; less
powerful than parametric tests
Chi-square Test
• Relationship between the frequencies in a
display table
• Null hypothesis: no difference in
distribution (all distributions are equal)
χ² = Σ (O − E)² / E
Special cases
• When the number of degrees of freedom is 1, as in
a 2x2 contingency table, Yates’s correction factor
is used.
• If O > E, add 0.5 to O, otherwise, subtract 0.5
from O.
• If E < 5, results are not reliable.
Two-dimensional Contingency
Table
           X = yes   X = no
Y = yes       a         b
Y = no        c         d

Expected value = (row total × column total) / grand number of items

χ² = N (|ad − bc| − N/2)² / ((a+b)(c+d)(a+c)(b+d))
Third Person Singular Reference (O)
                        Japanese   English   Total
Ellipsis                   104         0       104
Central pronouns            73       314       387
Non-central pronouns        12        28        40
Names                      314       291       605
Common NPs                 205       174       379
Total                      708       807      1515
Third Person Singular Reference (E)
                        Japanese   English   Total
Ellipsis                  48.6      55.4       104
Central pronouns         180.9     206.1       387
Non-central pronouns      18.7      21.3        40
Names                    282.7     322.3       605
Common NPs               177.1     201.9       379
Total                      708       807      1515
(O-E)2/E for the Two Languages
                        Japanese   English
Ellipsis                  63.2      55.4
Central pronouns          64.4      56.5
Non-central pronouns       2.4       2.1
Names                      3.5       3.0
Common NPs                 4.4       3.9

Σ = 258.8;  df = (5−1) × (2−1) = 4  →  different at the 0.001 level
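A sketch recomputing the chi-square statistic for the 5 × 2 table of observed counts above:

```python
# Chi-square for the Japanese vs. English reference-form table.
observed = {
    "Ellipsis":             (104,   0),
    "Central pronouns":     ( 73, 314),
    "Non-central pronouns": ( 12,  28),
    "Names":                (314, 291),
    "Common NPs":           (205, 174),
}
col_totals = [sum(row[i] for row in observed.values()) for i in (0, 1)]  # 708, 807
grand = sum(col_totals)                                                   # 1515

chi2 = 0.0
for row in observed.values():
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / grand     # expected = row total x col total / N
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (2 - 1)
print(round(chi2, 1), df)   # about 258.8 with 4 degrees of freedom
```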
Rank Correlation
• Pearson – continuous data
• Spearman’s rank correlation coefficient – non-continuous variables
  r = 1 − (6 Σd²) / (N (N² − 1))
Example
 S     X      Y     X'   Y'   d   d²
 1    894   80.2    2    5    3    9
 2   1190   86.9    1    2    1    1
 3    350   75.7    6    6    0    0
 4    690   80.8    4    4    0    0
 5    826   84.5    3    3    0    0
 6    449   89.3    5    1    4   16

r = 1 − (6 × 26) / (6 (6² − 1)) = 0.257 ≈ 0.3
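A sketch reproducing the rank correlation from the raw X and Y values (there are no ties, so the simple formula applies):

```python
# Spearman's rank correlation for the six (X, Y) pairs above.
def ranks(values):
    # rank 1 = largest value (no ties in this data)
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

x = [894, 1190, 350, 690, 826, 449]
y = [80.2, 86.9, 75.7, 80.8, 84.5, 89.3]
rx, ry = ranks(x), ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # 26
n = len(x)
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(rx, ry, d2, round(rho, 3))   # ranks as in the table, d^2 = 26, rho = 0.257
```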
Linear Regression
• Dependent and independent variables
• Regression: used to predict the behavior of
the dependent variable
• Needed: mX, mY, and b, the slope of the regression of Y on X
  b = (N ΣXY − ΣX ΣY) / (N ΣX² − (ΣX)²)
  Y’ = mY + b(X − mX)
Example
Section    X     Y     X²     XY
   1      22    20    484    440
   2      49    24   2401   1176
   3      80    42   6400   3360
   4      26    22    676    572
   5      40    23   1600    920
   6      54    26   2916   1404
   7      91    55   8281   5005
TOTAL    362   212  22758  12877
Example (Cont’d)
b = ((7 × 12877) − (362 × 212)) / ((7 × 22758) − (362 × 362)) = (90139 − 76744) / (159306 − 131044) = 13395 / 28262 = 0.474
a = 5.775
Y’ = 5.775 + 0.474 X
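A sketch reproducing the regression coefficients from the raw data in the table:

```python
# Least-squares slope and intercept for the seven (X, Y) pairs above.
X = [22, 49, 80, 26, 40, 54, 91]
Y = [20, 24, 42, 22, 23, 26, 55]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
a = sum_y / n - b * sum_x / n                                  # intercept
print(round(b, 3), round(a, 3))   # 0.474, 5.775 -> Y' = 5.775 + 0.474 X
```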
Tutorial 3:
Information Theory
Entropy
• Let p(x) be the probability mass function of a
random variable X, over a discrete set of symbols
(or alphabet) X:
p(x) = P(X=x), x ∈ X
• Example: throwing two coins and counting heads
and tails
• Entropy (self-information): is the average
uncertainty of a single random variable:
Information theoretic measures
• Claude Shannon (information theory):
“information = unexpectedness”
• Series of events (messages) with associated
probabilities: pi (i = 1 .. n)
• Goal: to measure the information content,
H(p1, …, pn) of a particular message
• Simplest case: the messages are words
• When pi is low, the word is more informative (more unexpected)
Properties of information content
• H is a continuous function of the pi
• If all p are equal (pi = 1/n), then H is a
monotone increasing function of n
• if a message is broken into two successive
messages, the original H is a weighted sum
of the resulting values of H
Example
p1 = 1/2, p2 = 1/3, p3 = 1/6
• Only function satisfying all three properties is the entropy function:
  H = − Σi pi log2 pi
Example (cont’d)
H = − (1/2 log2 1/2 + 1/3 log2 1/3 + 1/6 log2 1/6)
  = 1/2 log2 2 + 1/3 log2 3 + 1/6 log2 6
  = 1/2 + 1.585/3 + 2.585/6
  = 1.46

Alternative formula for H:
  H = Σi pi log2 (1/pi)
Another example
• Example:
  – No tickets left:                P = 1/2
  – Matinee shows only:             P = 1/4
  – Eve. show, undesirable seats:   P = 1/8
  – Eve. show, orchestra seats:     P = 1/8
Example (cont’d)
H = - (1/2 log 1/2 + 1/4 log 1/4 + 1/8 log 1/8 + 1/8 log 1/8)
H = − ((1/2 × −1) + (1/4 × −2) + (1/8 × −3) + (1/8 × −3))
H = 1.75 (bits per symbol)
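A minimal sketch computing the entropy of the two example distributions above:

```python
# Entropy H = -sum p log2 p over the outcomes of a distribution.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(round(entropy([1/2, 1/3, 1/6]), 2))   # 1.46
print(entropy([1/2, 1/4, 1/8, 1/8]))        # 1.75 bits per symbol
```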
Characteristics of Entropy
• When one of the messages has a probability
approaching 1, then entropy decreases.
• When all messages have the same
probability, entropy increases.
• Maximum entropy: when P = 1/n (H = ??)
• Relative entropy: ratio of actual entropy to
maximum entropy
• Redundancy: 1 - relative entropy
Entropy examples
• Letter frequencies in Simplified Polynesian:
P(1/8), T(1/4), K(1/8), A(1/4), I (1/8), U (1/8)
• What is H(P)?
• What is the shortest code that can be designed to
describe simplified Polynesian?
• What is the entropy of a weighted coin? Draw a
diagram.
Joint entropy and conditional entropy
• The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:
  H(X,Y) = − Σx Σy p(x,y) log2 p(x,y)
• The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information is needed to communicate Y given that the other party knows X:
  H(Y|X) = − Σx Σy p(x,y) log2 p(y|x)
Connection between joint and
conditional entropies
• There is a chain rule for entropy (note that the
products in the chain rules for probabilities have
become sums because of the log):
H (X,Y) = H(X) + H(Y|X)
H (X1,…,Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1,…,Xn-1)
Simplified Polynesian revisited
        p      t      k
 a     1/16   3/8    1/16    1/2
 i     1/16   3/16    0      1/4
 u      0     3/16   1/16    1/4
       1/8    3/4    1/8
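A sketch that recovers the marginal, joint, and conditional entropies from the table above and checks the chain rule:

```python
# Joint distribution of consonants (p, t, k) and vowels (a, i, u).
from math import log2

joint = {
    ("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
    ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
    ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16,
}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_c = {c: sum(p for (cc, v), p in joint.items() if cc == c) for c in "ptk"}
p_v = {v: sum(p for (c, vv), p in joint.items() if vv == v) for v in "aiu"}

H_C, H_V = H(p_c.values()), H(p_v.values())
H_CV = H(joint.values())                    # joint entropy H(C, V)
print(round(H_C, 3), round(H_V, 3), round(H_CV, 3))
print(round(H_CV - H_C, 3), "= H(V|C)")     # chain rule: H(C,V) = H(C) + H(V|C)
```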
Mutual information
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X) – H(X|Y) = H(Y) – H(Y|X) = I(X;Y)
• Mutual information: reduction in
uncertainty of one random variable due to
knowing about another, or the amount of
information one random variable contains
about another.
Mutual information and entropy
[Diagram: H(X,Y) decomposed into H(X|Y), I(X;Y), and H(Y|X)]
• I(X;Y) is 0 iff two variables are independent
• For two dependent variables, mutual information grows
not only with the degree of dependence, but also according
to the entropy of the variables
Formulas for I(X;Y)
I(X;Y) = H(X) – H(X|Y) = H(X) + H(Y) – H(X,Y)
I(X;Y) = Σx,y p(x,y) log2 [ p(x,y) / (p(x)p(y)) ]

Since H(X|X) = 0, note that H(X) = H(X) − H(X|X) = I(X;X)

I(x;y) = log2 [ p(x,y) / (p(x)p(y)) ]  : pointwise mutual information
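An illustrative sketch of pointwise mutual information on a toy set of word pairs (all counts below are made up for the example, echoing the earlier "white wine" / "white hair" collocations):

```python
# Pointwise mutual information: log2 [ p(x,y) / (p(x)p(y)) ].
from math import log2

pair_counts = {("white", "wine"): 30, ("white", "hair"): 15, ("white", "dog"): 5}
word_counts = {"white": 1_000, "wine": 200, "hair": 300, "dog": 600}
N = 100_000                      # hypothetical corpus size

def pmi(x, y):
    p_xy = pair_counts[(x, y)] / N
    p_x, p_y = word_counts[x] / N, word_counts[y] / N
    return log2(p_xy / (p_x * p_y))

for x, y in pair_counts:
    print(x, y, round(pmi(x, y), 2))   # higher PMI = stronger association
```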