Lecture 3: Basic Probability
(Chapter 2 of Manning and Schütze)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
National Cheng Kung University
2014/02/24
Motivation
• Statistical NLP aims to do statistical inference.
• Statistical inference consists of taking some data
(generated according to some unknown
probability distribution) and then making some
inferences about this distribution.
• An example of statistical inference is the task of
language modeling, namely predicting the next
word given a window of previous words. To do
this, we need a model of the language.
• Probability theory helps us to find such a model.
Probability Terminology
• Probability theory deals with predicting how likely it is
that something will happen.
• The process by which an observation is made is called
an experiment or a trial (e.g., tossing a coin twice).
• The collection of basic outcomes (or sample points)
for our experiment is called the sample space.
• An event is a subset of the sample space.
• Probabilities are numbers between 0 and 1, where 0
indicates impossibility and 1, certainty.
• A probability function/distribution distributes a
probability mass of 1 throughout the sample space.
Experiments & Sample Spaces
• Set of possible basic outcomes of an experiment:
sample space
– coin toss (Ω = {head, tail})
– tossing a coin 2 times (Ω = {HH, HT, TH, TT})
– dice roll (Ω = {1, 2, 3, 4, 5, 6})
– missing word (|Ω| = vocabulary size)
• Discrete (countable) versus continuous (uncountable)
• Every observation/trial is a basic outcome or sample
point.
– Event A is a set of basic outcomes with A ⊆ Ω, and A ∈ 2^Ω
(the event space)
– Ω is then the certain event, ∅ is the impossible event
Events and Probability
• The probability of event A is denoted p(A)
– also called the prior probability
– the probability before we consider any additional
knowledge
• Example Experiment: toss coin three times
– Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
• cases with two or more tails:
– A = {HTT, THT, TTH, TTT}
– P(A) = |A|/|Ω| = 4/8 = 1/2 (assuming uniform distribution)
• all heads:
– A = {HHH}
– P(A) = |A|/|Ω| = 1/8
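A quick sketch (not from the original slides) that checks these numbers by enumerating the sample space; the events are the ones defined above.

```python
from itertools import product

# Sample space for three coin tosses: Omega = {HHH, HHT, ..., TTT}
omega = ["".join(seq) for seq in product("HT", repeat=3)]

# Event A: outcomes with two or more tails
A = [o for o in omega if o.count("T") >= 2]

# Under the uniform distribution, P(A) = |A| / |Omega|
print(len(A) / len(omega))               # 0.5
print(omega.count("HHH") / len(omega))   # P(all heads) = 0.125
```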
Probability Properties
• Basic properties:
– P: 2^Ω → [0, 1]
– P(Ω) = 1
– For disjoint events: P(∪i Ai) = Σi P(Ai)
• [NB: axiomatic definition of probability: take the
above three conditions as axioms]
• Immediate consequences:
– P(∅) = 0, P(Ā) = 1 - P(A), A ⊆ B ⇒ P(A) ≤ P(B)
– Σa∈Ω P(a) = 1
Joint Probability
• Joint probability of A and B:
P(A,B) = P(A ∩ B)
– 2-dimensional table (|A|×|B|) with a value in every cell
giving the probability of that specific pair occurring.
Conditional Probability
• Sometimes we have partial knowledge about the
outcome of an experiment, then the conditional
(or posterior) probability can be helpful. If we
know that event B is true, then we can determine
the probability that A is true given this
knowledge: P(A|B) = P(A,B) / P(B)
Conditional and Joint Probabilities
• P(A|B) = P(A,B)/P(B) ⇒ P(A,B) = P(B)P(A|B)
• P(B|A) = P(A,B)/P(A) ⇒ P(A,B) = P(A)P(B|A)
• Chain rule:
P(A1,..., An) = P(A1)P(A2|A1)P(A3|A1,A2)…P(An|A1,..., An-1)
– The chain rule is used in language modeling.
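Because the chain rule is what makes language modeling tractable, here is a minimal illustrative sketch: it estimates bigram probabilities from a toy corpus (an invented example, not from the lecture) and multiplies the conditional probabilities of a sentence.

```python
from collections import Counter

# Toy corpus; a real language model would be trained on far more text.
corpus = "the cow chewed its cud . the cow slept .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    # Conditional probability P(word | prev) estimated by relative frequency
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    # Chain rule with a first-order Markov assumption:
    # P(w1,...,wn) ~ P(w1) * prod_i P(wi | w_{i-1})
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= p_next(word, prev)
    return p

print(p_sentence(["the", "cow", "slept"]))   # 0.2 * 1.0 * 0.5 = 0.1
```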
Bayes Rule
• Since P(A,B) = P(B,A), i.e., P(A ∩ B) = P(B ∩ A),
and P(A,B) = P(A|B)P(B) = P(B|A)P(A):
– P(A|B) = P(A,B)/P(B) = [P(B|A)P(A)]/P(B)
– P(B|A) = P(A,B)/P(A) = [P(A|B)P(B)]/P(A)
Example
• S: have stiff neck, M: have meningitis (腦膜炎)
• P(S|M) = 0.5, P(M) = 1/50,000, P(S)=1/20
• I have stiff neck, should I worry?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
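The calculation can be reproduced directly; this short sketch just plugs the slide's numbers into Bayes' rule.

```python
# Bayes' rule with the numbers from the slide.
p_s_given_m = 0.5        # P(S | M): stiff neck given meningitis
p_m = 1 / 50000          # P(M): prior probability of meningitis
p_s = 1 / 20             # P(S): probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```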
Independence
• Two events A and B are independent of each
other if P(A|B) = P(A)
• Example:
– two coin tosses
– weather today and on March 4th, 2012
• If A and B are independent, then we compute
P(A,B) from P(A) and P(B) as:
– P(A,B) = P(A|B)P(B) = P(A)P(B)
• Two events A and B are conditionally independent of each
other given C if:
– P(A|B,C) = P(A|C)
A Golden Rule
(of Statistical NLP)
• If we are interested in which event is most
likely given A, we can use Bayes rule, max
over all B:
– argmaxB P(B|A) = argmaxB P(A,B) / P(A)
= argmaxB [P(A|B)P(B)] / P(A)
= argmaxB P(A|B)P(B)
– P(A) is the same for each candidate B
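A minimal sketch of this rule: given a table of candidates with assumed values of P(A|B) and P(B) (the numbers are invented for illustration), the most likely B is the one maximizing their product.

```python
candidates = {
    # B: (P(A|B), P(B)) -- illustrative values only
    "B1": (0.7, 0.1),
    "B2": (0.2, 0.6),
    "B3": (0.5, 0.3),
}

# argmax_B P(A|B) P(B); P(A) can be ignored since it is constant over B
best = max(candidates, key=lambda b: candidates[b][0] * candidates[b][1])
print(best)   # "B3": 0.5 * 0.3 = 0.15 beats 0.07 and 0.12
```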
Random Variables (RV)
• Random variables (RV) allow us to talk about the
probabilities of numerical values that are related to the
event space (with a specific numeric range)
• An RV is a function X: Ω → Q (events → values)
– in general Q ⊆ R^n, typically Q ⊆ R
– easier to handle real numbers than real-world events
– An RV is discrete if Q is a countable subset of R;
an indicator RV (or Bernoulli trial) if Q = {0, 1}
• Define a probability mass function (pmf) for RV X:
– pX(x) = P(X=x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}
– often just p(x)
Example
• Suppose we define a discrete RV X that is the sum of
the faces of two dice, then Q = {2, …, 11, 12} with the
pmf as follows:
– P(X=2) = 1/36
– P(X=3) = 2/36
– P(X=4) = 3/36
– P(X=5) = 4/36
– P(X=6) = 5/36
– P(X=7) = 6/36
– P(X=8) = 5/36
– P(X=9) = 4/36
– P(X=10) = 3/36
– P(X=11) = 2/36
– P(X=12) = 1/36
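This pmf can be checked by enumerating the 36 equally likely outcomes, as in the following short sketch.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# X = sum of the faces of two dice; count how often each sum occurs
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
for x, p in pmf.items():
    # Fractions reduce automatically, so 2/36 prints as 1/18, etc.
    print(f"P(X={x}) = {p}")
```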
Expectation and Variance
• The expectation is the mean or average of a RV,
defined as:
E(X) = Σx x p(x)
• The variance of a RV is a measure of whether the
values of the RV tend to vary over trials:
Var(X) = E((X - E(X))^2) = E(X^2) - E^2(X)
• The standard deviation (s) is the square root of the
variance.
Examples
• What is the expectation of the
sum of the numbers on two
dice?
2 * P(X=2) = 2 * 1/36 = 1/18
3 * P(X=3) = 3 * 2/36 = 3/18
4 * P(X=4) = 4 * 3/36 = 6/18
5 * P(X=5) = 5 * 4/36 = 10/18
6 * P(X=6) = 6 * 5/36 = 15/18
7 * P(X=7) = 7 * 6/36 = 21/18
8 * P(X=8) = 8 * 5/36 = 20/18
9 * P(X=9) = 9 * 4/36 = 18/18
10 * P(X=10) = 10 * 3/36 = 15/18
11 * P(X=11) = 11 * 2/36 = 11/18
12 * P(X=12) = 12 * 1/36 = 6/18
E(SUM) = 126/18 = 7
• Or more simply:
• E(SUM) = E(D1+D2)
= E(D1) + E(D2)
• E(D1) = E(D2)
= 1* 1/6 + 2* 1/6 + … + 6* 1/6
= 1/6 (1 + 2 + 3+ 4 + 5 + 6)
= 21/6
• Hence,
E(SUM) = 21/6 + 21/6 = 7
Examples
• Var(X) = E((X - E(X))^2)
= E(X^2 - 2X·E(X) + E^2(X))
= E(X^2) - 2E(X·E(X)) + E^2(X)
= E(X^2) - 2E^2(X) + E^2(X)
= E(X^2) - E^2(X)
• E(SUM^2) = 329/6 and E^2(SUM) = 7^2 = 49
• Hence, Var(SUM) = 329/6 - 294/6 = 35/6
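A short sketch that verifies both results numerically, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

outcomes = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]
n = len(outcomes)                                   # 36 equally likely outcomes

e_x = sum(Fraction(x, n) for x in outcomes)         # E(X)
e_x2 = sum(Fraction(x * x, n) for x in outcomes)    # E(X^2)
var = e_x2 - e_x ** 2                               # Var(X) = E(X^2) - E^2(X)

print(e_x, e_x2, var)   # 7, 329/6, 35/6
```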
Joint and Conditional
Distributions for RVs
• Joint pmf for two RVs X and Y is
p(x,y) = P(X=x, Y=y)
• Marginal pmfs are calculated as
pX(x) = Σy p(x, y) and pY(y) = Σx p(x, y)
• If X and Y are independent then
p(x,y) = pX(x) * pY(y)
• Define conditional pmf using joint distribution:
pX|Y(x|y) = p(x,y)/pY(y) if pY(y) > 0
• Chain rule:
– p(w,x,y,z) = p(w)p(x|w)p(y|w,x)p(z|w,x,y)
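A small sketch of these definitions on an invented 2x3 joint table: marginals sum over the other variable, and the conditional pmf divides by the relevant marginal.

```python
# Invented joint pmf over X in {x1, x2} and Y in {y1, y2, y3}
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.20, ("x1", "y3"): 0.10,
    ("x2", "y1"): 0.30, ("x2", "y2"): 0.10, ("x2", "y3"): 0.20,
}

def p_x(x):
    # Marginal pX(x) = sum_y p(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y(y):
    # Marginal pY(y) = sum_x p(x, y)
    return sum(p for (_, yi), p in joint.items() if yi == y)

def p_x_given_y(x, y):
    # Conditional pmf pX|Y(x|y) = p(x, y) / pY(y), defined when pY(y) > 0
    return joint[(x, y)] / p_y(y)

print(p_x("x1"), p_y("y2"), p_x_given_y("x1", "y2"))   # 0.4, 0.3, 0.666...
```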
Estimating Probability Functions
• What is the probability that sentence “The cow chewed
its cud” will be uttered? Unknown, so P must be
estimated from a sample of data.
• An important measure for estimating P is the
relative frequency of the outcome, i.e., the proportion of
times an outcome u occurs:
fu = C(u) / N
– C(u) is the number of times u occurs in N trials.
– For N → ∞, the relative frequency tends to stabilize around some
number, the probability estimate.
• Two different approaches:
– Parametric (assume distribution)
– Non-parametric (distribution free)
Parametric Methods
• Assume that the language phenomenon is
acceptably modeled by one of the well-known
standard distributions (e.g., binomial, normal).
• Once we assume an explicit probabilistic model of
the process by which the data was generated,
determining a particular probability distribution
within the family requires only the specification of a
few parameters, which requires less training data
(i.e., only a small number of parameters need to be
estimated).
Non-parametric Methods
• No assumption is made about the
underlying distribution of the data.
• For example, simply estimating P empirically
by counting a large number of random
events is a distribution-free method.
• Given less prior information, more training
data is needed.
Estimating Probability
• Example: Toss coin three times
– Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– count cases with exactly two tails: A = {HTT, THT, TTH}
• Run an experiment 1000 times (i.e., 3000 tosses)
• Counted: 386 cases with two tails (HTT, THT, or TTH)
• Estimate of p(A) = 386 / 1000 = .386
– Run again: 373, 399, 382, 355, 372, 406, 359
• p(A) = .379 (weighted average) or simply 3032 / 8000
• Uniform distribution assumption: p(A) = 3/8 = .375
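A simulation in the same spirit, written as a sketch; the random seed is arbitrary, so the counts will not match the slide's exactly.

```python
import random

random.seed(0)   # arbitrary seed for reproducibility
N = 1000         # number of three-toss experiments

# Count runs with exactly two tails (event A = {HTT, THT, TTH})
count = sum(1 for _ in range(N)
            if sum(random.choice("HT") == "T" for _ in range(3)) == 2)

print(count / N)   # relative-frequency estimate of p(A); close to 3/8 = 0.375
```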
Standard Probabilistic Distribution
• Discrete distributions:
– Binomial: P(X = k | p, n) = C(n, k) p^k (1 - p)^(n-k)
– Multinomial: P(X1 = k1, …, Xm = km) = [n! / (k1! … km!)] p1^k1 … pm^km
– Geometric: P(X = k) = p (1 - p)^(k-1)
– Poisson: P(X = k | λ) = e^(-λ) λ^k / k!
• Continuous distributions:
– Normal: N(x | μ, σ) = [1 / (σ√(2π))] e^(-(x-μ)^2 / (2σ^2))
– Exponential: f(x | λ) = λ e^(-λx)
– Gamma: f(x | α, λ) = [λ^α / Γ(α)] x^(α-1) e^(-λx)
Standard Distributions: Binomial
• Series of trials with only two outcomes, 0 or 1,
with each trial being independent from all the
other outcomes.
• Number r of successes out of n trials given that
the probability of success in any trial is p:
b(r; n, p) = C(n, r) p^r (1 - p)^(n-r)
• C(n, r) counts how many possibilities there are for
choosing r objects out of n, i.e., n! / ((n-r)! r!)
Binomial Distribution
• Works well for tossing a coin. However, for
linguistic corpora one never has complete
independence from one sentence to the next:
approximation.
• Use it when counting whether something has a
certain property or not (assuming independence).
• Actually quite common in SNLP: e.g., look
through a corpus to find out the estimate of the
percentage of sentences that have a certain
word in them or how often a verb is used as
transitive or intransitive.
• Expectation is n p, variance is n p (1-p)
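A brief sketch of how the binomial pmf from the previous slide is applied to such counts; the values of n, p, and r below are illustrative assumptions, not corpus figures.

```python
from math import comb

def binom_pmf(r, n, p):
    # b(r; n, p) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 100, 0.1   # e.g., 100 sentences, 10% chance a sentence contains the word
print(binom_pmf(10, n, p))        # probability of exactly 10 such sentences
print(n * p, n * p * (1 - p))     # expectation np and variance np(1-p)
```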
Standard Distributions: Normal
• The Normal (Gaussian) distribution is a
continuous distribution with two parameters:
mean μ and standard deviation σ.
– Standard normal if μ = 0 and σ = 1.
n(x; μ, σ) = [1 / (σ√(2π))] e^(-(x-μ)^2 / (2σ^2))
Basic Formula
(Illustration: sample space Ω partitioned into hypotheses h1, …, h6, with events x and y.)
P(x) = Σh P(x, h)
P(x | y) = Σh P(x, h | y)
P(x, h | y) = P(h | y) P(x | y, h)
* P(x | y) = Σh P(h | y) P(x | y, h)
P(x | y) = Σh P(h | y) P(x | h)   (when x is conditionally independent of y given h)
Homework 2
• Based on the following formula, please give an example
of a research idea or application.
P(x | y) = Σh P(h | y) P(x | y, h)
Zipf's law
(Figure: word frequency plotted against rank.)
• Zipf's law, an empirical law formulated using
mathematical statistics, refers to the fact that many types
of data studied in the physical and social sciences can
be approximated with a Zipfian distribution,
one of a family of related discrete power law probability
distributions.
• The law is named after the linguist George Kingsley Zipf
(pronounced /ˈzɪf/) who first proposed it (Zipf 1935,
1949), though J.B. Estoup appears to have noticed the
regularity before Zipf.
Zipf's law
(Figure: word frequency plotted against rank.)
• Zipf's law is most easily observed by
plotting the data on a log-log graph,
with the axes being log(rank order)
and log(frequency).
– For example, if "the" is the most frequent word,
occurring 69,971 times, it would appear at x = log(1),
y = log(69971). The data conform to Zipf's law
to the extent that the plot is linear.
• Formally, let:
– N be the number of elements;
– k be their rank;
– s be the value of the exponent
characterizing the distribution.
• Zipf's law then predicts that the normalized frequency of
the element of rank k is f(k; s, N) = (1/k^s) / Σn=1..N (1/n^s).
Zipf PMF for N = 10 on a log-log scale.
The horizontal axis is the index k .
(Note that the function is only defined at
integer values of k. The connecting
lines do not indicate continuity.)
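A small sketch of this check: count word frequencies, rank them, and inspect the frequency-rank relationship. The embedded toy text is only a stand-in so the sketch runs as-is; in practice one would use a large corpus.

```python
from collections import Counter
from math import log

text = ("the cow chewed its cud and the farmer watched the cow while the dog "
        "slept near the barn and the cat watched the dog").split()
freqs = sorted(Counter(text).values(), reverse=True)

for rank, freq in enumerate(freqs, start=1):
    # Under Zipf's law (s ~ 1), freq * rank stays roughly constant, and the
    # points (log(rank), log(freq)) fall roughly on a straight line.
    print(rank, freq, freq * rank, round(log(rank), 2), round(log(freq), 2))
```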
Frequentist Statistics
• s: sequence of observations
• M: model (a distribution plus parameters)
• For fixed model M, Maximum likelihood
estimate:
argmaxM P(s | M)
• Probability expresses something about
the world with no prior belief!
Bayesian Statistics I: Bayesian
Updating
• Assume that the data are coming in sequentially
and are independent.
• Given an a-priori probability distribution, we can
update our beliefs when a new datum comes in
by calculating the Maximum A Posteriori (MAP)
distribution.
• The MAP probability becomes the new prior and
the process repeats on each new datum.
argmaxM P(M | s) = argmaxM [P(s | M) P(M)] / P(s)
= argmaxM P(s | M) P(M)
• P(s) is a normalizing constant
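A minimal sketch of sequential updating in a toy setting: two candidate coin-bias models with an invented prior and data stream (none of this comes from the lecture).

```python
models = {0.3: 0.5, 0.7: 0.5}    # prior P(M) over two models of P(heads)
data = ["H", "H", "T", "H"]      # observations arriving one at a time

for obs in data:
    # Likelihood of the new datum under each model
    likelihood = {m: (m if obs == "H" else 1 - m) for m in models}
    # Unnormalized posterior P(s|M)P(M); P(s) is only a normalizing constant
    posterior = {m: likelihood[m] * models[m] for m in models}
    z = sum(posterior.values())
    models = {m: p / z for m, p in posterior.items()}   # posterior becomes the new prior
    print(obs, models)

print(max(models, key=models.get))   # MAP model after all the data
```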
Bayesian Statistics II: Bayesian
Decision Theory
• Bayesian Statistics can be used to
evaluate which model or family of
models better explains some data.
• We define two different models of the
event and calculate the likelihood
ratio between these two models.
Bayesian Decision Theory
• Suppose we have 2 models M1 and M2; we
want to evaluate which model better explains
some new data.
P(M1 | s) / P(M2 | s) = [P(s | M1) P(M1)] / [P(s | M2) P(M2)]
– if P(M1 | s) / P(M2 | s) > 1, i.e., P(M1 | s) > P(M2 | s),
– then M1 is the more likely model, otherwise M2
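A sketch of this comparison for two invented coin models with equal priors; the data string is illustrative.

```python
def likelihood(seq, p_heads):
    # P(s | M) for an i.i.d. coin model with the given heads probability
    prob = 1.0
    for outcome in seq:
        prob *= p_heads if outcome == "H" else 1 - p_heads
    return prob

s = "HHTHHHTH"
p_m1, p_m2 = 0.5, 0.5   # equal priors P(M1), P(M2)

# Posterior ratio P(M1|s)/P(M2|s) = [P(s|M1)P(M1)] / [P(s|M2)P(M2)]
ratio = (likelihood(s, 0.7) * p_m1) / (likelihood(s, 0.5) * p_m2)
print(ratio)   # > 1 favours M1 (the biased-coin model), otherwise M2
```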
Essential Information Theory
• Developed by Shannon in the 1940s.
• Goal is to maximize the amount of information
that can be transmitted over an imperfect
communication channel.
• Wished to determine the theoretical maxima for
data compression (entropy H) and transmission
rate (channel capacity C).
• If a message is transmitted at a rate slower than
C, then the probability of transmission errors can
be made as small as desired.
The Noisy Channel Model
• Goal: encode the message in such a way that it
occupies minimal space while still containing
enough redundancy to be able to detect and
correct errors.
W (Message) → Encoder → X (Input to channel) → Channel p(y|x) → Y (Output from channel) → Decoder → Ŵ (Attempt to reconstruct the message based on the output)
The Noisy Channel Model
• Want to optimize a communication across a
channel in terms of throughput and accuracy
– the communication of messages in the presence of
noise in the channel.
• There is a duality between compression
(achieved by removing all redundancy) and
transmission accuracy (achieved by adding
controlled redundancy so that the input can be
recovered in the presence of noise).
The Noisy Channel Model
• Channel capacity: rate at which one can transmit
information through the channel with an arbitrary
low probability of being unable to recover the
input from the output. For a memoryless
channel:
C = max over p(X) of I(X; Y)
I(X; Y): mutual information
• We reach a channel’s capacity if we manage to
design an input code X whose distribution p(X)
maximizes I between input and output.
Mutual Information
• I(X; Y) is called the mutual information
between X and Y.
• I(X; Y)= H(X)-H(X|Y)= H(Y)-H(Y|X)
– H(X): entropy of X, X: discrete RV, p(X)
• It is the reduction in uncertainty of one
random variable due to knowing about
another, or, in other words, the amount of
information one random variable contains
about another.
Entropy
• Entropy (or self-information) is the average uncertainty
of a single RV.
• Let p(x) = P(X=x), where x ∈ X, then:
H(p) = H(X) = - Σx∈X p(x) log2 p(x) = Σx∈X p(x) log2 (1/p(x))   [in bits (length)]
• In other words, entropy measures the amount of
information in a random variable measured in bits.
It is also the average length of the message needed to
transmit an outcome of that variable using the optimal
code.
• An optimal code sends a message of probability p(x) in
-log2 p(x) bits.
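A short sketch of the entropy formula, applied to a fair coin, a weighted coin, and a determinate outcome.

```python
from math import log2

def entropy(pmf):
    # H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing
    return sum(-p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))    # ~0.47 bits: a weighted coin is more predictable
print(entropy([1.0, 0.0]))    # 0 bits: a determinate outcome carries no information
```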
Entropy (cont.)
H(X) = - Σx∈X p(x) log2 p(x)
= Σx∈X p(x) [- log2 p(x)]
= Σx∈X p(x) log2 (1/p(x))
= E[ log2 (1/p(X)) ]
• Entropy is non-negative: H(X) ≥ 0
– H(X) = 0 ⟺ P(X = x) = 1 for some x
– when the value of X is determinate, it provides no
new information
Entropy of a Weighted Coin
(one toss)
The Limits
• When H(X) = 0?
– if a result of an experiment is known ahead of time
– necessarily:
∃x ∈ Ω: p(x) = 1 and ∀y ∈ Ω, y ≠ x: p(y) = 0
• Upper bound?
– for |Ω| = n: H(X) ≤ log2 n
– attained by the uniform distribution p(x) = 1/n for all x ∈ Ω
• nothing can be more uncertain than the uniform
distribution
• Entropy increases with message length.
“Coding” Interpretation of Entropy
• The least (average) number of bits needed to
encode a message (string, sequence, series,...)
gives H(X).
• Compression algorithms
– do well on data with repeating patterns
(easily predictable = low entropy)
– their results have high entropy: compressing
already-compressed data gains (almost) nothing
Coding: Example
• Experience: some characters are more common,
some (very) rare
– use more bits for the rare, and fewer bits for the frequent
– suppose: p(‘a’) = 0.3, p(‘b’) = 0.3, p(‘c’) = 0.3,
the rest: p(x) ≈ .0004
– code: ‘a’ ~ 00, ‘b’ ~ 01, ‘c’ ~ 10, rest: 11b1b2b3b4b5b6b7b8
– code acbbécbaac => 00 10 01 01 11 00001111 10 01 00 00 10
(a c b b é c b a a c)
– number of bits used: 28
(vs. 80 using “naive” 8-bit uniform coding)
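A quick sketch that re-counts the bits for this example. The 8-bit pattern chosen here for the rare character is an assumption; only its length matters for the count.

```python
code = {"a": "00", "b": "01", "c": "10"}

def encode(ch):
    # Rare characters: prefix 11 followed by an assumed 8-bit byte value
    return code.get(ch, "11" + format(ord(ch) & 0xFF, "08b"))

message = "acbbécbaac"
bits = "".join(encode(ch) for ch in message)
print(len(bits))   # 28, versus 8 * 10 = 80 with a uniform 8-bit code
```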
Joint Entropy
• The joint entropy of a pair of discrete
random variables X, Y, p(x, y) is the
amount of information needed on average
to specify both their values.
H(X, Y) = - Σx∈X Σy∈Y p(x, y) log2 p(x, y)
Conditional Entropy
• The conditional entropy of a discrete random variable
Y given another X, for X, Y, p(x,y) expresses how much
extra information is needed to supply on average for
communicating Y given that the other party knows X.
(Recall that H(X) = E(log2(1/p(x))); weights are not
conditional)
H(Y | X) = Σx∈X p(x) H(Y | X = x)
= - Σx∈X p(x) Σy∈Y p(y | x) log2 p(y | x)
= - Σx∈X Σy∈Y p(x, y) log2 p(y | x)
= - E( log2 p(Y | X) )
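A sketch computing both quantities from a small invented joint pmf, following the definitions above.

```python
from math import log2

joint = {
    ("x1", "y1"): 0.25, ("x1", "y2"): 0.25,
    ("x2", "y1"): 0.40, ("x2", "y2"): 0.10,
}

# Joint entropy H(X,Y) = -sum p(x,y) log2 p(x,y)
h_xy = sum(-p * log2(p) for p in joint.values() if p > 0)

# Marginal p(x), needed for p(y|x) = p(x,y)/p(x)
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional entropy H(Y|X) = -sum p(x,y) log2 p(y|x)
h_y_given_x = sum(-p * log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

print(h_xy, h_y_given_x)   # ~1.86 and ~0.86 bits for this table
```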
Entropy and Language
• Entropy is measure of uncertainty. The more we
know about something the lower the entropy.
• If a language model captures more of the
structure of the language than another model,
then its entropy should be lower.
• Entropy can be thought of as a matter of how
surprised we will be to see the next word given
previous words we have already seen.
Perplexity
• A measure related to the notion of cross entropy
and used in the speech recognition community
is called the perplexity.
• Perplexity(x1n, m) = 2^H(x1n, m)
• A perplexity of k means that you are as
surprised on average as you would have been if
you had to guess between k equi-probable
choices at each step.
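A small sketch making the "k equiprobable choices" reading concrete.

```python
from math import log2

def perplexity(pmf):
    h = sum(-p * log2(p) for p in pmf if p > 0)   # entropy in bits
    return 2 ** h                                 # perplexity = 2^H

print(perplexity([0.25] * 4))            # 4.0: like guessing among 4 equal choices
print(perplexity([0.7, 0.1, 0.1, 0.1]))  # ~2.6: easier than 4 equal choices
```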
Chain Rule for Entropy
• The product became a sum due to the log.
H(X, Y) = H(X) + H(Y | X)
H(X1, …, Xn) = H(X1) + H(X2 | X1) + … + H(Xn | X1, …, Xn-1)
Mutual Information
• By the chain rule for entropy, we have
H(X,Y) = H(X)+ H(Y|X) = H(Y)+H(X|Y)
• Therefore, H(X)-H(X|Y)=H(Y)-H(Y|X)=I(X; Y)
• I(X; Y) is called the mutual information
between X and Y.
• It is the reduction in uncertainty of one
random variable due to knowing about
another, or, in other words, the amount of
information one random variable contains
about another.
Relationship between I and H
(Diagram: H(X, Y) drawn as the union of H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts and I(X; Y) is the overlap.)
I(X; Y) = H(X) - H(X | Y) = H(X) + H(Y) - H(X, Y)
= Σx p(x) log2 (1/p(x)) + Σy p(y) log2 (1/p(y)) + Σx,y p(x, y) log2 p(x, y)
= Σx,y p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
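A sketch computing I(X; Y) from an invented joint table in the two equivalent ways given above.

```python
from math import log2

joint = {
    ("x1", "y1"): 0.25, ("x1", "y2"): 0.25,
    ("x2", "y1"): 0.40, ("x2", "y2"): 0.10,
}
p_x = {"x1": 0.50, "x2": 0.50}   # marginals of the joint table above
p_y = {"y1": 0.65, "y2": 0.35}

# I(X;Y) = sum p(x,y) log2 [ p(x,y) / (p(x)p(y)) ]
i_xy = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)

# I(X;Y) = H(X) - H(X|Y)
h_x = sum(-p * log2(p) for p in p_x.values())
h_x_given_y = sum(-p * log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

print(i_xy, h_x - h_x_given_y)   # the two computations agree (~0.073 bits)
```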
Mutual Information (cont)
I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
• I(X; Y) is symmetric, non-negative measure of
the common information of two variables.
• Some see it as a measure of dependence
between two variables, but better to think of it as
a measure of independence.
– I(X; Y) is 0 only when X and Y are independent:
H(X|Y)=H(X)
• H(X)=H(X)-H(X|X)=I(X; X)
Why entropy is called self-information.
Mutual Information (cont)
• Don’t confuse with pointwise mutual information,
I(x, y) = log2 [ p(x, y) / (p(x) p(y)) ]
• which has some problems…
The Noisy Channel Model
• Goal: encode the message in such a way that it
occupies minimal space while still containing
enough redundancy to be able to detect and
correct errors.
W (Message) → Encoder → X (Input to channel) → Channel p(y|x) → Y (Output from channel) → Decoder → Ŵ (Attempt to reconstruct the message based on the output)
The Noisy Channel Model
• Want to optimize a communication across a
channel in terms of throughput and accuracy
– the communication of messages in the presence of
noise in the channel.
• There is a duality between compression
(achieved by removing all redundancy) and
transmission accuracy (achieved by adding
controlled redundancy so that the input can be
recovered in the presence of noise).
The Noisy Channel Model
• Channel capacity: rate at which one can
transmit information through the channel
with an arbitrary low probability of being
unable to recover the input from the output.
For a memoryless channel:
C = max over p(X) of I(X; Y)
• We reach a channel’s capacity if we
manage to design an input code X whose
distribution p(X) maximizes I between
input and output.
Language and the Noisy
Channel Model
• In language we can’t control the encoding phase;
we can only decode the output to give the most
likely input. Determine the most likely input given
the output!
I → Noisy Channel p(o|i) → O → Decoder → Î
Î = argmaxi p(i | o) = argmaxi [p(i) p(o | i)] / p(o) = argmaxi p(i) p(o | i)
The Noisy Channel Model
Î = argmaxi p(i | o) = argmaxi [p(i) p(o | i)] / p(o) = argmaxi p(i) p(o | i)
• p(i) is the language model, the distribution of
sequence of ‘words’ in the input language
• p(o|i) is the channel model (probability)
• This is used in machine translation, optical
character recognition, speech recognition, etc.
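A minimal sketch of this argmax in a spelling-correction flavour; the candidates and probabilities are invented for illustration and do not come from the lecture.

```python
observed = "acress"   # hypothetical channel output (a misspelling)
candidates = {
    # i: (p(i) from the language model, p(o|i) from the channel model)
    "actress": (1e-5, 0.003),
    "across":  (3e-5, 0.001),
    "acres":   (2e-5, 0.002),
}

# argmax_i p(i) p(o|i); p(o) is constant and can be ignored
best = max(candidates, key=lambda i: candidates[i][0] * candidates[i][1])
print(observed, "->", best)
```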
The Noisy Channel Model
Relative Entropy: Kullback-Leibler
Divergence
• For 2 probability functions, p(x) and q(x), their
relative entropy is:
D(p || q) = Σx∈X p(x) log2 [ p(x) / q(x) ] = Ep[ log2 (p(X) / q(X)) ]
• The relative entropy, also called the Kullback-Leibler divergence, is a measure of how different
two probability distributions are (over the same event
space).
• The KL divergence between p and q can also be
seen as the average number of bits that are wasted
by encoding events from a distribution p with a code
based on the not-quite-right distribution q.
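A short sketch of the divergence and its asymmetry, on two invented distributions over the same three-outcome event space.

```python
from math import log2

def kl(p, q):
    # D(p||q) = sum_x p(x) log2(p(x)/q(x)); 0 log 0 is taken as 0.
    # Assumes q(x) > 0 wherever p(x) > 0 (otherwise the divergence is infinite).
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.50, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl(p, q), kl(q, p))   # note D(p||q) != D(q||p): not a symmetric distance
```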
Comments on Relative
Entropy
• Goal: minimize relative entropy D(p||q) to have a
probabilistic model as accurate as possible.
• Conventions:
– 0 log 0 = 0
– p log (p/0) = ∞ (for p > 0)
• Distance?
– not symmetric: D(p||q) ≠ D(q||p)
– does not satisfy the triangle inequality
– But can be useful to think of it as distance
Mutual Information and
Relative Entropy
• Random variables X, Y; pXY(x, y), pX(x), pY(y)
• Mutual information:
I(X; Y) = D( p(x, y) || p(x) p(y) ) = Σx,y p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
• I(X,Y) measures how much (our knowledge of)
Y contributes (on average) to easing the
prediction of X
• Or how p(x, y) deviates from independence
(p(x)p(y))
Entropy and Language
• Entropy is measure of uncertainty. The more we
know about something the lower the entropy.
• If a language model captures more of the
structure of the language than another model,
then its entropy should be lower.
• Entropy can be thought of as a matter of how
surprised we will be to see the next word given
previous words we have already seen.
The Entropy of English
• We can model English using n-gram
models (also known as Markov chains).
• These models assume limited memory,
i.e., we assume that the next word
depends only on the previous k ones
(kth order Markov approximation).
• What is the Entropy of English?
– First order: 4.03 bits
– Second order: 2.8 bits
– Shannon’s experiment: 1.3 bits
Perplexity
• A measure related to the notion of cross entropy
and used in the speech recognition community
is called the perplexity.
• Perplexity(x1n, m) = 2^H(x1n, m)
• A perplexity of k means that you are as
surprised on average as you would have been if
you had to guess between k equi-probable
choices at each step.
Homework 3
• Please collect three web news articles, and then
compute the entropies of the word distributions for
one, two, and three articles, respectively.