Transcript 791-03-Math
COMP 791A: Statistical Language Processing
Mathematical Essentials
Chap. 2
1
Motivations
Statistical NLP aims to do statistical inference
for the field of NL
Statistical inference consists of:
taking some data (generated in accordance with some
unknown probability distribution)
then making some inference about this distribution
Ex. of statistical inference: language modeling
how to predict the next word given the previous words
to do this, we need a model of the language
probability theory helps us find such a model
2
Notions of Probability Theory
Probability theory
deals with predicting how likely it is that
something will happen
Experiment (or trial)
the process by which an observation is made
Ex. tossing a coin twice
3
Sample Spaces and events
Sample space Ω :
set of all possible basic outcomes of an experiment
Coin toss: Ω = {head, tail}
Tossing a coin twice: Ω = {HH, HT, TH, TT}
Uttering a word: |Ω| = vocabulary size
Every observation (element in Ω) is a basic outcome or
sample point
An event A is a set of basic outcomes with A ⊆ Ω
Ω is then the certain event
Ø is the impossible (or null) event
Example - rolling a die:
Sample space Ω = {1, 2, 3, 4, 5, 6}
Event A that an even number occurs A = {2, 4, 6}
4
Events and Probability
The probability of an event A is denoted p(A)
also called the prior probability
i.e. the probability before we consider any additional
knowledge
Example: experiment of tossing a coin 3 times
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
event of getting two or more tails:
A = {HTT, THT, TTH, TTT}
P(A) = |A|/|Ω| = ½ (assuming uniform distribution)
event of getting all heads:
A = {HHH}
P(A) = |A|/|Ω| = ⅛
5
Probability Properties
A probability function P (or probability distribution):
Distributes a probability mass of 1 over the sample space Ω
P(A) ∈ [0,1] for every event A
P(Ω) = 1
For disjoint events Ai (i.e. Ai ∩ Aj = Ø for all i ≠ j):
P(∪i Ai) = Σi P(Ai)
Immediate consequences:
P(Ø ) = 0
P(Ā) = 1 - P(A)
A ⊆ B ==> P(A) ≤ P(B)
Σ_{a∈Ω} P(a) = 1
6
Joint probability
Joint probability of A and B:
P(A,B) = P(A ∩ B)
(Venn diagram: events A and B overlapping in A ∩ B inside the sample space Ω)
7
Conditional probability
Prior (or unconditional) probability
Probability of an event before any evidence is obtained
P(A) = 0.1
P(rain today) = 0.1
i.e. Your belief about A given that you have no evidence
Posterior (or conditional) probability
Probability of an event given that all we know is B (some
evidence)
P(A|B) = 0.8
P(rain today| cloudy) = 0.8
i.e. Your belief about A given that all you know is B
8
Conditional probability (con’t)
P(A | B) = P(A, B) / P(B)
9
Chain rule
P(A | B) = P(A, B) / P(B)   so P(A, B) = P(A | B) × P(B)
P(B | A) = P(A, B) / P(A)   so P(A, B) = P(B | A) × P(A)
With 3 events, the probability that A, B and C occur is:
P(A, B, C) = P(A) × P(B | A) × P(C | A, B)
The probability that A occurs
Times the probability that B occurs, assuming that A occurred
Times the probability that C occurs, assuming that A and B
have occurred
With multiple events, we can generalize to the Chain rule:
P(A1, A2, A3, A4, ..., An)
= P(A1 ∩ A2 ∩ ... ∩ An)
= P(A1) × P(A2|A1) × P(A3|A1,A2) × ... × P(An|A1,A2,A3,…,An-1)
(important to NLP)
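To make the decomposition concrete, here is a minimal Python sketch (not from the course; the toy model and its probabilities are invented for illustration) that applies the chain rule to a short word sequence:

def sequence_probability(words, cond_prob):
    # P(w1,...,wn) = product over i of P(w_i | w_1,...,w_{i-1})  (chain rule)
    prob = 1.0
    for i, w in enumerate(words):
        prob *= cond_prob(w, tuple(words[:i]))
    return prob

# Hypothetical toy conditional probabilities P(word | history)
toy_model = {
    ("Mary", ()): 0.01,
    ("is", ("Mary",)): 0.3,
    ("reading", ("Mary", "is")): 0.05,
}
cond_prob = lambda w, h: toy_model.get((w, h), 1e-6)
print(sequence_probability(["Mary", "is", "reading"], cond_prob))  # 0.01 * 0.3 * 0.05 = 1.5e-04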
10
Bayes’ theorem
given that : P(A, B) = P(A | B) × P(B)
and : P(A, B) = P(B | A) × P(A)
then : P(A | B) × P(B) = P(B | A) × P(A)
or : P(A | B) = P(B | A) × P(A) / P(B)
11
So?
we typically want to know: P(Cause | Effect)
ex: P(Disease | Symptoms)
ex: P(linguistic phenomenon | linguistic observations)
But this information is hard to gather
However P(Effect | Cause) is easier to gather
(from training data)
So
P(Cause | Effect) = P(Effect | Cause) × P(Cause) / P(Effect)
12
Example
Rare syntactic construction occurs in 1/100,000
sentences
A system identifies sentences with such a
construction, but it is not perfect
If sentence has the construction -->
system identifies it 95% of the time
If sentence does not have the construction -->
system says it does 0.5% of the time
Question:
if the system says that sentence S has the
construction… what is the probability that it is right?
13
Example (con’t)
What is P(sentence has the construction | the system says yes) ?
Let:
cons = sentence has the construction
yes = system says yes
not_cons = sentence does not have the construction
we have:
P(cons) = 1/100,000 = 0.00001
P(yes | cons) = 95% = 0.95
P(yes | not_cons) = 0.5% = 0.005
P(cons | yes) = P(yes | cons) × P(cons) / P(yes)
P(yes) = ?
P(B) = P(B|A) P(A) + P(B|Ā) P(Ā)
P(yes) = P(yes | cons) × P(cons) + P(yes | not_cons) × P(not_cons)
= 0.95 × 0.00001 + 0.005 × 0.99999 ≈ 0.00501
14
Example (con’t)
So:
P(cons | yes) = P(yes | cons) × P(cons) / P(yes)
= (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999)
≈ 0.002 (0.2%)
So
for only about 1 sentence out of 500 for which the system says yes,
is it actually right!!!
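A minimal Python sketch (an illustration, not part of the course materials) that reproduces this computation:

p_cons = 0.00001           # P(cons)
p_yes_given_cons = 0.95    # P(yes | cons)
p_yes_given_not = 0.005    # P(yes | not_cons)

# total probability: P(yes) = P(yes|cons)P(cons) + P(yes|not_cons)P(not_cons)
p_yes = p_yes_given_cons * p_cons + p_yes_given_not * (1 - p_cons)
# Bayes' theorem: P(cons | yes) = P(yes | cons) P(cons) / P(yes)
p_cons_given_yes = p_yes_given_cons * p_cons / p_yes
print(round(p_cons_given_yes, 4))   # 0.0019, i.e. roughly 0.2%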
15
Statistical Independence vs.
Statistical Dependence
How likely are we to get heads in a coin toss,
given that it is raining today?
A: having a head in a coin toss
B: raining today
Some variables are independent…
How likely is the word “ambulance” to appear,
given that we’ve seen “car accident”?
Words in text are not independent
16
Independent events
Two events A and B are independent:
if the occurrence of one of them does not
influence the occurrence of the other
i.e. A is independent of B if P(A) = P(A|B)
If A and B are independent, then:
P(A,B) = P(A|B) x P(B) (by chain rule)
= P(A) x P(B) (by independence)
In NLP, we often assume independence of
variables
17
Bayes’ Theorem revisited
(a golden rule in statistical NLP)
If we are interested in which event B is most likely to occur
given an observation A
we can choose the B with the largest P(B|A)
argmax_B P(B | A) = argmax_B [ P(A | B) × P(B) ] / P(A)
P(A)
is a normalization constant (to ensure values in 0…1)
is the same for all possible Bs (and is hard to gather anyway)
so we can drop it
So Bayesian reasoning:
argmax_B P(A | B) × P(B)
In NLP:
argmax_language_event P(observation | language_event) × P(language_event)
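As a sketch (the categories and numbers below are hypothetical, not from the slides), Bayesian decision in Python is just an argmax over P(A | B) × P(B):

priors = {"sports": 0.3, "politics": 0.7}              # hypothetical P(B)
likelihoods = {"sports": 0.002, "politics": 0.0001}    # hypothetical P(A | B) for the observed text
# P(A) is the same for every candidate B, so it can be dropped from the argmax
best = max(priors, key=lambda b: likelihoods[b] * priors[b])
print(best)   # "sports": 0.002*0.3 = 6e-4 beats 0.0001*0.7 = 7e-5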
18
Application of Bayesian Reasoning
Diagnostic systems:
P(Disease | Symptoms)
Categorization:
P(Category of object| Features of object)
Text classification: P(sports-news | words in text)
Character recognition: P(character | bitmap)
Speech recognition: P(words | signals)
Image processing: P(face-person | image)
…
19
Random Variables
A random variable X is a function
X: Ω --> R^n (typically n = 1)
Example – tossing 2 dice:
Ω = {(1,1), (1,2), (1,3), …, (6,6)}
X: Ω --> Rx assigns to each point in Ω the sum of the 2 dice
X(1,1) = 2, X(1,2) = 3, …, X(6,6) = 12
Rx = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
A random variable X is discrete if:
X: Ω --> S where S is a countable subset of R
In particular, if X: Ω --> {0,1}
then X is called a Bernoulli trial.
A random variable X is continuous if:
X: Ω --> S where S is a continuum of numbers
20
Probability distribution of an RV
Let X be a finite random variable
Rx= {x1, x2, x3,… xn}
A probability mass function f gives the
probability of X at the different points in Rx
f(xk) = P(X=xk) = p(xk)
p(xk) ≥ 0
Σk p(xk) = 1
X:      x1      x2      x3      …      xn
p(X):   p(x1)   p(x2)   p(x3)   …      p(xn)
21
Example: Tossing 2 dice
X = sum of the faces
X: Ω --> S
Ω = {(1,1), (1,2), (1,3), …, (6,6)}
S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
X:      2      3      4      5      6      7      8      9      10     11     12
p(X):   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
X = maximum of the faces
X: Ω --> S
Ω = {(1,1), (1,2), (1,3), …, (6,6)}
S = {1, 2, 3, 4, 5, 6}
X:      1      2      3      4      5      6
p(X):   1/36   3/36   5/36   7/36   9/36   11/36
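Both tables can be derived by brute-force enumeration of the 36 equally likely outcomes; a minimal Python sketch (an illustration, not course code):

from collections import Counter
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))       # Ω for two dice, |Ω| = 36
pmf_sum = Counter(a + b for a, b in outcomes)         # X = sum of the faces
pmf_max = Counter(max(a, b) for a, b in outcomes)     # X = maximum of the faces
print({x: f"{c}/36" for x, c in sorted(pmf_sum.items())})
print({x: f"{c}/36" for x, c in sorted(pmf_max.items())})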
22
Expectation
The expectation (μ) is the mean (or average or
expected value) of a random variable X
E(X) = Σ_i p(x_i) × x_i
Intuitively, it is:
the weighted average of the outcomes
where each outcome is weighted by its probability
ex: the average sum of the dice
If X and Y are 2 random variables on the same
sample space, then:
E(X+Y) = E(X) + E(Y)
23
Example
The expectation of the sum of the faces on two dice? (the
average sum of the dice)
If equiprobable… (2+3+4+5+…+12)/11
But the sums are not equiprobable
SUM (xi):     2      3      4      5      6      7      8      9      10     11     12
p(SUM=xi):    1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

E(X) = Σ_i x_i p(x_i) = 2×(1/36) + 3×(2/36) + 4×(3/36) + … + 11×(2/36) + 12×(1/36) = 252/36 = 7
Or more simply:
E(SUM)=E(Die1+Die2)=E(Die1)+E(Die2)
Each face on 1 die is equiprobable
E(Die) = (1+2+3+4+5+6)/6 = 3.5
E(SUM) = 3.5 + 3.5 = 7
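The same result falls out of direct enumeration; a small Python check (an illustration, not course code):

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
# every outcome has probability 1/36, so E(SUM) is just the plain average
print(sum(a + b for a, b in outcomes) / len(outcomes))   # 7.0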
24
Variance and standard deviation
The variance of a random variable X is a
measure of whether the values of the RV tend
to be consistent over trials or to vary a lot
var(X) = Σ_i p(x_i) (x_i − E(X))²
The standard deviation of X is the square root
of the variance
σ_X = √var(X)
Both measure the weighted “spread” of the
values xi around the mean E(X)
25
Example
What is the variance of the sum of the
faces on two dice?
SUM (xi):     2      3      4      5      6      7      8      9      10     11     12
p(SUM=xi):    1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

var(SUM) = Σ_i p(x_i) (x_i − E(SUM))²
= (1/36)(2−7)² + (2/36)(3−7)² + (3/36)(4−7)² + (4/36)(5−7)² + (5/36)(6−7)² + … + (2/36)(11−7)² + (1/36)(12−7)²
≈ 5.83
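A small Python check of the 5.83 figure (an illustration, not course code):

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
mean = sum(a + b for a, b in outcomes) / len(outcomes)                # E(SUM) = 7
var = sum((a + b - mean) ** 2 for a, b in outcomes) / len(outcomes)   # each outcome has weight 1/36
print(round(var, 2))   # 5.83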
26
Back to NLP
What is the probability that someone says
the sentence: “Mary is reading a book.”?
In general, for language events, the probability
function P is unknown
Language event:   x1   x2   x3   …
p:                ?    ?    ?    …
We need to estimate P (or a model M of the
language) by looking at a sample of data (training
set)
2 approaches:
Frequentist statistics
Bayesian statistics (not covered in this course)
27
Frequentist Statistics
To estimate P, we use the relative frequency of
the outcome in a sample of data
i.e. the proportion of times a certain outcome o occurs.
f_o = C(o) / N
Where C(o) is the number of times o occurs in N trials
For N--> ∞ the relative frequency stabilizes to some
number: the estimate of the probability function
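A minimal Python sketch of relative-frequency estimation (the tiny corpus below is hypothetical):

from collections import Counter

corpus = "the cat sat on the mat near the door".split()   # hypothetical sample of N = 9 tokens
counts = Counter(corpus)                                   # C(o) for each outcome o
N = len(corpus)
relative_freq = {word: c / N for word, c in counts.items()}
print(relative_freq["the"])   # 3/9 ≈ 0.333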
Two approaches to estimate the probability
function:
Parametric (assuming a known distribution)
Non-parametric (distribution-free)… not covered in this course
28
Parametric Methods
Assume that some phenomenon in language is
modeled by a well-known family of distributions
(ex. binomial, normal)
The advantages:
we have an explicit probabilistic model of the process
by which the data was generated
determining a particular probability distribution within
the family requires only the specification of a few
parameters (so, less training data)
But:
Our assumption on the probability distribution may be
wrong…
29
Non-Parametric Methods
No assumption is made about the
underlying distribution of the data
For ex, we can simply estimate P
empirically by counting a large number of
random events
But: because we use less prior information
(no assumption on the distribution), more
training data is needed
30
Standard Distributions
Many applications give rise to the same basic form of a
probability distribution - but with different
parameters.
Discrete Distributions:
the binomial distribution (2 outcomes)
the multinomial distribution (more than 2 outcomes)
…
Continuous Distributions:
the normal distribution (Gaussian)
…
31
Binomial Distribution (discrete)
Built from a series of Bernoulli trials (the case n = 1 is the Bernoulli distribution)
Each trial has only two outcomes (success or failure)
The probability of success is the same for each trial
The trials are independent
There are a fixed number of trials
Distribution has 2 parameters:
nb of trials n
probability of success p in 1 trial
Ex: Flipping a coin 10 times and counting the number of heads that
occur
Can only get a head or a tail (2 outcomes)
For each flip there is the same chance of getting a head (same prob.)
The coin flips do not affect each other (independence)
There are 10 coin flips (n = 10)
32
Examples
(Plots of two binomial distributions:)
b(n,p) = B(10, 0.1): nb trials = 10, Prob(head) = 0.1
b(n,p) = B(10, 0.7): nb trials = 10, Prob(head) = 0.7
33
Binomial probability function
let:
n = nb of trials
p = probability of success in any trial
r = nb of successes out of the n trials
P(r) the probability of having r successes in n trials
P(r) = C(n,r) × p^r × (1−p)^(n−r)     (0 ≤ r ≤ n)
where:
C(n,r) = n! / ((n−r)! r!) is the number of ways of having r successes in n trials
p^r is the probability of having r successes
(1−p)^(n−r) is the probability of having n−r failures
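A minimal Python version of this probability function (a sketch, not course code):

from math import comb

def binomial_pmf(r, n, p):
    # P(r successes in n trials) = C(n,r) * p^r * (1-p)^(n-r)
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

print(round(binomial_pmf(5, 10, 0.5), 3))   # ≈ 0.246: 5 heads in 10 fair coin flips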
34
Example
What is the probability of rolling higher
than 4 exactly twice in 3 dice rolls?
1st   2nd   3rd   Probability
≤4    ≤4    ≤4    4/6 × 4/6 × 4/6
≤4    ≤4    >4    4/6 × 4/6 × 2/6
≤4    >4    ≤4    4/6 × 2/6 × 4/6
≤4    >4    >4    4/6 × 2/6 × 2/6  *
>4    ≤4    ≤4    2/6 × 4/6 × 4/6
>4    ≤4    >4    2/6 × 4/6 × 2/6  *
>4    >4    ≤4    2/6 × 2/6 × 4/6  *
>4    >4    >4    2/6 × 2/6 × 2/6

p(r = 2) = (4/6 × 2/6 × 2/6) + (2/6 × 4/6 × 2/6) + (2/6 × 2/6 × 4/6) = 2/27 + 2/27 + 2/27 = 2/9

With the binomial formula:
n trials = 3
p = probability of success in 1 trial = 2/6
r = successes = 2
p(r = 2) = C(3,2) × (2/6)² × (4/6) = 3 × (4/36) × (4/6) = 2/9
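For a quick check in Python (same arithmetic as the binomial_pmf sketch given earlier, which is an assumed helper, not course code):

from math import comb
print(round(comb(3, 2) * (2/6) ** 2 * (4/6), 4))   # 0.2222 = 2/9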
35
Properties of binomial distribution
B(n,p)
Mean E(X) = μ = np
Ex:
Flipping a coin 10 times
E(head) = 10 x ½ = 5
Variance σ2= np(1-p)
Ex:
Flipping a coin 10 times
σ2 = 10 x ½ ( ½ ) = 2.5
36
Binomial distribution in NLP
Works well for tossing a coin
But, in NLP we do not always have complete independence from one
trial to the next
Consecutive sentences are not independent
Consecutive POS tags are not independent
So, binomial distribution in NLP is an approximation (but a fair one)
When we count how many times something is present or absent
And we ignore the possibility of dependencies between one trial
and the next
Then, we implicitly use the binomial distribution
Ex:
Count how many sentences contain the word “the”
Assume each sentence is independent
Count how many times a verb is used as transitive
Assume each occurrence of the verb is independent of the others…
37
Normal Distribution (continuous)
Also known as Gaussian distribution (or Bell curve)
to model a random variable X on an infinite sample space
(ex. height, length…)
X is a continuous random variable if there is a function f(x)
defined on the real line R = (-∞, +∞) such that:
f is non-negative f(x) ≥ 0
The area under the curve of f is one
∫_{−∞}^{+∞} f(x) dx = 1
The probability that X lies in the interval [a,b] is equal to the
area under f between x=a and x=b: P(a ≤ X ≤ b) = ∫_a^b f(x) dx
38
Normal Distribution (con’t)
has 2 parameters:
mean μ
standard deviation σ
p(x) = 1/(σ√(2π)) × e^(−(x−μ)² / (2σ²))
(Plots: n(μ,σ) = n(0,1), i.e. μ = 0, σ = 1; and n(μ,σ) = n(1.5,2), i.e. μ = 1.5, σ = 2)
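A minimal Python version of this density (a sketch, not course code):

from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    # p(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2 / (2*sigma^2))
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(round(normal_pdf(0.0), 4))            # 0.3989: peak of n(0, 1)
print(round(normal_pdf(1.5, 1.5, 2.0), 4))  # 0.1995: peak of n(1.5, 2) is lower and wider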
39
The standard normal distribution
if μ = 0 and σ = 1, then it is called the standard normal
distribution Z
P(0 ≤ X ≤ 1) = ∫_0^1 f(x) dx ≈ 34.1%
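A quick numeric check of the 34.1% figure (a sketch using the error function, with Φ(x) = ½(1 + erf(x/√2))):

from math import erf, sqrt
# P(0 <= Z <= 1) = Phi(1) - Phi(0)
print(round(0.5 * (1 + erf(1 / sqrt(2))) - 0.5, 4))   # 0.3413, i.e. about 34.1%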
40
Frequentist vs Bayesian Statistics
Assume we toss a coin 10 times, and get 8 heads:
Frequentists will conclude (from the observations) that a
head comes up 8/10 of the time -- the Maximum Likelihood Estimate (MLE)
if we look at the coin, we would be reluctant to accept
8/10… because we have prior beliefs
Bayesian statisticians will use an a-priori probability
distribution (their belief)
will update the beliefs when new evidence comes in (a
sequence of observations)
by calculating the Maximum A Posteriori (MAP)
distribution.
The MAP probability becomes the new prior probability and
the process repeats on each new observation
41
Essential Information Theory
Developed by Shannon in the 40s
To maximize the amount of information
that can be transmitted over an imperfect
communication channel (the noisy channel)
Notion of entropy (informational content):
How informative is a piece of information?
ex. How informative is the answer to a question
If you already have a good guess about the answer, the
actual answer is less informative… low entropy
42
Entropy - intuition
Ex: Betting 1$ on the flip of a coin
If the coin is fair:
Expected gain is ½ (+1) + ½ (-1) = 0$
So you'd be willing to pay up to 1$ for advance information (1$ - 0$ average win)
If the coin is rigged:
P(head) = 0.99
P(tail) = 0.01
Expected gain is 0.99(+1) + 0.01(-1) = 0.98$, assuming you bet on head (!)
So you'd be willing to pay up to 2¢ for advance information (1$ - 0.98$ average win)
Entropy of the fair coin (1$) > entropy of the rigged coin (0.02$)
43
Entropy
Let X be a discrete RV
Entropy (or self-information)
H(X) = −Σ_{i=1}^{n} p(x_i) log₂ p(x_i)
measures the amount of information in a RV
average uncertainty of a RV
the average length of the message needed to transmit an
outcome xi of that variable
the size of the search space consisting of the possible values
of a RV and its associated probabilities
measured in bits
Properties:
H(X) ≥ 0
If H(X) = 0 then it provides no new information
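A minimal Python version of the entropy formula (a sketch, not course code):

from math import log2

def entropy(probs):
    # H(X) = -sum_i p(x_i) * log2 p(x_i); terms with p = 0 contribute 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0 bit (fair coin)
print(round(entropy([0.99, 0.01]), 2))  # 0.08 bits (rigged coin)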
44
Example: The coin flip
Fair coin: H(X) = −Σ_{i=1}^{n} p(x_i) log₂ p(x_i) = −(½ log₂ ½ + ½ log₂ ½) = 1 bit
Rigged coin: H(X) = −Σ_{i=1}^{n} p(x_i) log₂ p(x_i) = −(99/100 log₂ 99/100 + 1/100 log₂ 1/100) ≈ 0.08 bits
(Plot: entropy of the coin as a function of P(head), maximal at 1 bit when P(head) = ½)
45
Example: Simplified Polynesian
In simplified Polynesian, we have 6 letters with
frequencies:
letter:      p     t     k     a     i     u
frequency:   1/8   1/4   1/8   1/4   1/8   1/8
The per-letter entropy is
H(P) = −Σ_{i∈{p,t,k,a,i,u}} P(i) log₂ P(i)
= −(1/8 log₂ 1/8 + 1/4 log₂ 1/4 + 1/8 log₂ 1/8 + 1/4 log₂ 1/4 + 1/8 log₂ 1/8 + 1/8 log₂ 1/8)
= 2.5 bits
We can design a code that on average takes 2.5bits to
transmit a letter
letter:   p     t    k     a    i     u
code:     100   00   101   01   110   111
Can be viewed as the average nb of yes/no questions you
need to ask to identify the outcome (ex: is it a ‘t’? Is it a ‘p’?)
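A quick check of the 2.5-bit figure in Python (a sketch, assuming the letter frequencies above):

from math import log2
freqs = [1/8, 1/4, 1/8, 1/4, 1/8, 1/8]     # p, t, k, a, i, u
print(-sum(p * log2(p) for p in freqs))    # 2.5 bits per letter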
46
Entropy in NLP
Entropy is a measure of uncertainty
The more we know about something the lower its
entropy
So if a language model captures more of the
structure of the language, then its entropy
should be lower
in NLP, language models are compared by using
their entropy.
ex: given 2 grammars and a corpus, we use entropy to
determine which grammar better matches the corpus.
47