Probability for linguists
John Goldsmith
What is probability theory?




It is the quantitative theory of evidence.
It derives, historically, from the study of
games of chance – gambling – and of the
insurance business.
To this day, the basic examples of probability
are often presented in terms that grow out of
gambling: toss of a die, the draw of a card.
Alas.
Knowledge of language?



Knowledge of language is the long-term
knowledge that speakers bring to bear in the
analysis, and the generation, of specific
utterances in particular contexts.
Long-term knowledge?
Understanding a specific utterance is a task
that combines knowledge of the sounds
that were uttered and long-term knowledge
of the language in question.
But before we can get to that…

The basic game plan of all of probability
theory is to



define a universe (“sample space” or “universe of
possible outcomes”)
define a method of dividing up “probability mass”
(whose total weight is 1.0) over all of the sample
space.
That’s it!
Again:
For this discrete case, probability theory obliges
us to:
1. Establish the universe of outcomes that we
are interested in;
2. Establish a means for distributing probability
mass over the outcomes, in such a way that
the total amount of probability distributed
sums to 1.0 – no more, and no less.
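To make the recipe concrete, here is a minimal Python sketch of the two steps, using a made-up four-outcome sample space (the outcomes and numbers are purely illustrative, not from these slides):

```python
# A toy probabilistic model: a sample space plus a distribution of
# probability mass that sums to 1.0.

# Step 1: establish the universe of outcomes.
sample_space = ["a", "b", "c", "d"]

# Step 2: distribute probability mass over the outcomes.
distribution = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Every probability is non-negative, and the total mass is exactly 1.0.
assert all(p >= 0 for p in distribution.values())
assert abs(sum(distribution.values()) - 1.0) < 1e-9
```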
The role of a model


In every interesting case, the probability
assigned to the basic members of the sample
space is calculated according to some model
that we devise – and therein lies our
contribution.
Let’s turn to some examples.
Letters



Let our sample space be the 26 letters of the Latin
alphabet.
To make a probabilistic model, we need an
assignment of probability to each letter a…z:
prob(a) ≥ 0;
and the sum of all of these probabilities is 1:
$\sum_{i=1}^{26} \text{pr}(l_i) = 1$
Why?



Probabilities are our eventual contact with
reality, and with observations.
Probabilities are our model’s way of making
quantitative predictions.
We will (eventually) evaluate our model by
seeing how good its probabilities are.
Counts, frequencies, and probabilities



Counts are integral – at least they are for
now; eventually we drop that assumption. For
now, they are the result of empirical
measurement.
Frequency (relative or proportional) is the
ratio of a count of a subset to the count of the
whole universe.
Probability is a parameter in a probabilistic
model.
Distribution

A distribution is a set of non-negative
numbers which sum to 1.
Brief digression: some niceties

The sample space Ω need not be finite or discrete,
and we actually care about a large set F of subsets
of Ω. The sets in F are called events, and we require
that F be a σ-algebra (so F is closed under
complementation, countable union and
intersection). A probability measure P on (Ω, F) is a
function P such that P(Ω) = 1, P(Ø) = 0, and the
probability of a countable set of pairwise disjoint
members of F is the sum of their probabilities:

$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$
Uniform distribution


A uniform distribution is one in which all
elements in the sample space have the same
probability.
If there are 26 letters in our sample space,
each letter is assigned probability 1/26 by the
uniform distribution.
Uniform distribution (2)



If the sample space is rolls of a die, and the
distribution is uniform, then each die-roll has
probability equal to 1/6.
Sample space can be complex – e.g.,
sequences of 3 rolls of a die. In that case,
there are 6^3 = 216 members, and each has
probability 1/216 if the distribution is uniform.
If the sample space is N rolls of a die, then
there are 6^N members, each with prob 1/6^N.
More examples

If the sample space is all sequences of 1, 2
or 3 rolls of a die, then the size of the sample
space is 6 + 6^2 + 6^3 = 258, and if the
distribution is uniform, each has probability
1/258.
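A small sketch checking this arithmetic (the variable names are just illustrative):

```python
faces = 6

# sequences of exactly 3 rolls: 6**3 = 216 outcomes, each 1/216 under
# the uniform distribution
size_three_rolls = faces ** 3
p_three_rolls = 1 / size_three_rolls

# sequences of 1, 2 or 3 rolls: 6 + 36 + 216 = 258 outcomes, each 1/258
size_one_to_three = faces + faces ** 2 + faces ** 3
p_one_to_three = 1 / size_one_to_three

print(size_three_rolls, p_three_rolls)     # 216 0.00463...
print(size_one_to_three, p_one_to_three)   # 258 0.00387...
```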
Words in the sample space

Suppose our sample space has 1,000
elements, and each is associated with 1 of
the 1,000 most frequent words in the Brown
corpus; each is also associated with the
number (between 1 and 1,000) giving the
rank of that word.
Probability distribution



We could associate a uniform distribution to that
sample space; or we could assign a probability
to each word based on its frequency in a
corpus.
Let N = total number of words in the corpus
(1,160,322 in Brown corpus).
frequency of word_i = count(word_i) / 1,160,322
E.g., frequency(“the”) = 69,681 / 1,160,322 = 0.0600…
Is it clear that the sum of all these frequencies is
necessarily 1.0?
$\frac{\text{count}(word_1)}{\text{Total count}} + \frac{\text{count}(word_2)}{\text{Total count}} + \cdots + \frac{\text{count}(word_N)}{\text{Total count}} = \sum_{i=1}^{N}\frac{\text{count}(word_i)}{\text{Total count}} = 1$
Sequences
When we think about sequences, we see that
the probability assigned to each element
(=each sequence) is assigned through a
model, not directly.
In the simplest case, the probability of a
sequence is the product of the probabilities of
the individual elements.
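A sketch of that simplest case (the distribution here is a made-up toy, not the Brown Corpus):

```python
def sequence_prob(sequence, prob):
    # The probability of a sequence is the product of the probabilities
    # of its individual elements.
    p = 1.0
    for element in sequence:
        p *= prob[element]
    return p

toy_prob = {"the": 0.06, "woman": 0.0002, "arrived": 0.00005}
print(sequence_prob(["the", "woman", "arrived"], toy_prob))   # 6e-10
```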
Random variable (r.v.)


A random variable is a function from our
probability space (the sample space,
endowed with a probability distribution) to the
real numbers.
Example: Our sample space has 1,000
elements, and each is associated with 1 of
the 1,000 most frequent words in the Brown
corpus; our r.v. R maps each word to the
number (between 1 and 1,000) giving the
rank of that word.
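A sketch of such a random variable, using a short illustrative prefix of the frequency-ranked word list:

```python
# The random variable R: a function from the sample space (words) to the
# real numbers (here, frequency ranks).
words_by_rank = ["the", "of", "and", "to", "a"]   # illustrative prefix of the top-1,000 list

def R(word):
    return words_by_rank.index(word) + 1   # rank 1 = most frequent

print(R("the"), R("a"))   # 1 5
```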
Sample space: words



Suppose we replace our usual die with one
with 1,000 faces, and each face has a
number (1-1,000) and one of the 1,000 most
frequent words from the Brown corpus on it
(1=the, 2=of, etc.). Now we choose numbers
at random: for example, 320 990 646 94 756
which translates into: whether designed
passed must southern.
For the moment, assume a uniform
distribution…
We’re more interested in the probability our
model assigns to sentences that already exist.
For example,
(1) In the beginning was the word
Since these words are in the Brown corpus, each
has probability 1/1000, and the probability of (1)
is
$\left(\frac{1}{10^{3}}\right)^{6} = \frac{1}{10^{3 \cdot 6}} = \frac{1}{10^{18}} = 10^{-18}$
Any sequence of 6 words in the top 1,000 words
from the Brown Corpus will have this probability. Is
this good, or bad?
We could increase our vocabulary to the entire
47,885 words of the Brown Corpus. Then
sentence (1) would have probability
$\frac{1}{47{,}885^{6}} \approx (2.1 \times 10^{-5})^{6} \approx 8.6 \times 10^{-29}$
Or: use frequency as our probability


pr (the) = 0.0600, as above
pr (leaders) = 107 / 1,160,322 ≈ 0.000 092
(ranked 1,000 in the corpus)
Be clear on the difference
between frequencies
and probabilities.
Word    Count    Frequency
the     69681    0.060053
of      36372    0.031346
and     28506    0.024567
to      26080    0.022477
a       23170    0.019969
in      21261    0.018323
'       10716    0.009235
that    10704    0.009225
is      10085    0.008692
was      9786    0.008434
he       9729    0.008385
What is the probability of the sentence
“The woman arrived” ?
What is the sample space? Strings of 3
words from the Brown Corpus.
Probability of each word is its frequency in
the Brown Corpus.
pr (“the”) = 0.060 080;
pr (“woman”) = 0.000 212 030;
pr (“arrived”) = 0.000 053 482
Some notation



S = “the woman arrived”
S[1] = “the”, S[2] = “woman”, S[3] = “arrived”
pr (S) = pr (S[1] = “the” & S[2] = “woman” &
S[3] = “arrived”)
Stationary model


For all sentences S, all words w_n and all
positions i and j:
prob( S[i] = w_n ) = prob( S[j] = w_n ).
Is this true?
Back to the sentence
Pr (“the woman arrived”) = $6.008020 \times 10^{-2} \times 2.122030 \times 10^{-5} \times 5.348207 \times 10^{-5} \approx 6.818 \times 10^{-11}$
“in the beginning was the word”




We calculated that this sentence’s probability
was 1 in 10^18 in the sample space of all
sentences of exactly 6 words, selected from
the 1,000 most frequent words in the Brown
Corpus.
It is 8.6 × 10^-29 in the sample space of all six-word
sentences built from the Brown vocabulary.
And in the frequency-based distribution: 1.84 × 10^-14
in           0.018322785
the          0.060080206
beginning    0.000139743
was          0.008439816
the          0.060080206
word         0.000236356
PRODUCT      1.84368E-14
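The PRODUCT row can be checked by multiplying the six frequencies (a small sketch, using the figures in the table above):

```python
import math

freqs = [0.018322785,   # in
         0.060080206,   # the
         0.000139743,   # beginning
         0.008439816,   # was
         0.060080206,   # the
         0.000236356]   # word

print(math.prod(freqs))   # roughly 1.84e-14, the PRODUCT value above
```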


We prefer a model that scores better, by
assigning a higher probability, on sentences
that actually and already exist;
we prefer that model to any other model that
assigns a lower probability to the actual
corpus.
Some win, some lose

In order for a model to assign higher
probability to actual and existing sentences,
it must assign less probability mass to other
sentences (since the total amount of
probability mass that it has at its disposal to
assign totals up to 1.000, and no more). So of
course it assigns lower probability to a lot of
unobserved strings.
Word order?

Alas, word order counts for nothing, so far.
Probability mass

It is sometimes helpful to think of a distribution as a
way of sharing an abstract goo called probability
mass around all of the members of the universe of
basic outcomes (also known as the sample space).
Think of there being 1 kilogram of goo, and it is cut
up and assigned to the various members of the
universe. None can have more than 1.0 kg, and
none can have a negative amount, and the total
amount must add up to 1.0 kg. And we can modify
the model by moving probability mass from one
outcome to another if we so choose.
Conditional probability


Sometimes we want to shift the universe of
discussion to a more restricted sub-universe
– this is always a case of having additional
information, or at least of acting as if we had
additional information.
Universe = sample space.


We look at our universe of outcomes, with its
probability mass spread out over the set of
outcomes, and we say, let us consider only a
sub-universe, and ignore all possibilities
outside of that sub-universe. We then must
ask: how do we have to change the
probabilities inside that sub-universe so as to
ensure that the probabilities inside it add up
to 1.0 (to make it a distribution)?
Answer: Divide probabilities by total amount
of probability mass in the sub-universe.
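A minimal sketch of that renormalization step, with a fair die as the toy sub-universe example:

```python
def condition(distribution, sub_universe):
    # Keep only the outcomes in the sub-universe, and divide each
    # probability by the total mass found inside the sub-universe.
    mass = sum(distribution[x] for x in sub_universe)
    return {x: distribution[x] / mass for x in sub_universe}

die = {face: 1 / 6 for face in range(1, 7)}
print(condition(die, {2, 4, 6}))   # each even face now has probability 1/3
```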
How do we get the new information?
Peek and see the color of a card.
Maybe we know the word will be a noun.
Maybe we know what the last word was.
Generally: we have to pick an outcome, and we
have some case-particular information that
bears on the choice.
Cards: pr (Queen of Hearts|red)



Pr (Queen Hearts) = 1/52;
Pr(red card)= ½;
Pr (Queen Hearts, given Red) = $\frac{1/52}{1/2} = \frac{2}{52} = \frac{1}{26}$
Definition of probability of A, given B:
$\text{prob}(A \mid B) = \frac{\text{prob}(A \text{ and } B)}{\text{prob}(B)}$
Guessing a word, given knowledge of
previous word:



pr( S[i] = w_j given that S[i-1] = w_k ), which is
usually written in this way:
pr( S[i] = w_j | S[i-1] = w_k )
Pr (“the”) is high, but not if the preceding
word is “a”.
Almost nothing about language or about
English in particular has crept in. The fact
that we have considered conditioning our
probabilities of a word based on what word
preceded is entirely arbitrary; we could just
as well look at the conditional probability of
words conditioned on what word follows, or
even conditioned on what the word was two
words to the left.
Table 2: Top of the Brown Corpus for words following "the":
Rank   Word     Count   Frequency
0      first    664     0.009494
1      same     629     0.008994
2      other    419     0.005991
3      most     419     0.005991
4      new      398     0.005691
5      world    393     0.005619
6      united   385     0.005505
7      state    292     0.004175
8      two      267     0.003818
9      only     260     0.003718
10     time     250     0.003575
11     way      239     0.003417
12     old      234     0.003346
13     last     223     0.003189
14     house    216     0.003089
15     man      214     0.003060
16     next     210     0.003003
17     end      206     0.002946
18     fact     194     0.002774
19     whole    190     0.002717
One of the most striking things is how few
nouns, and how many adjectives, there are
among the most frequent words here -- that's
probably not what you would have guessed.
None of them is very high in frequency;
none accounts for even 1 percent of the total.
Table 3: Top of the Brown Corpus for words following "of".
Total count 36388
Rank   Word     Count   Count / 36,388
1      the      9724    0.26723
2      a        1473    0.04048
3      his      810     0.02226
4      this     553     0.0152
5      their    342     0.0094
6      course   324     0.0089
7      these    306     0.00841
8      them     292     0.00802
9      an       276     0.00758
10     all      256     0.00704
11     her      252     0.00693
12     our      251     0.0069
13     its      229     0.00629
14     it       205     0.00563
15     that     156     0.00429
16     such     140     0.00385
17     those    135     0.00371
18     my       128     0.00352
19     which    124     0.00341
Among the words after "of", one word is over
25%: "the". So not all words are equally
helpful in helping to guess what the next word
is.
Table 4: Top of the Brown Corpus for words preceding "the".
Total count 69936
Rank   Word    Count   Count / 69,936
1      of      9724    0.13904
2      .       6201    0.08867
3      in      6027    0.08618
4      ,       3836    0.05485
5      to      3485    0.04983
6      on      2469    0.0353
7      and     2254    0.03223
8      for     1850    0.02645
9      at      1657    0.02369
10     with    1536    0.02196
11     from    1415    0.02023
12     that    1397    0.01998
13     by      1349    0.01929
14     is      799     0.01142
15     as      766     0.01095
16     into    675     0.00965
17     was     533     0.00762
18     all     430     0.00615
19     when    418     0.00598
20     but     389     0.00556
Note the prepositions.
Table 5: Top of the Brown Corpus for words 2 to the right of
"the".
Total count 69936
Rank   Word     Count   Count / 69,936
1      of       10861   0.1553
2      .        4578    0.06546
3      ,        4437    0.06344
4      and      2473    0.03536
5      to       1188    0.01699
6      '        1106    0.01581
7      in       1082    0.01547
8      is       1049    0.015
9      was      950     0.01358
10     that     888     0.0127
11     for      598     0.00855
12     were     386     0.00552
13     with     370     0.00529
14     on       368     0.00526
15     states   366     0.00523
16     had      340     0.00486
17     are      330     0.00472
18     as       299     0.00428
19     at       287     0.0041
20     or       284     0.00406
If you know a word is "the", then the
probability that the word-after-next is "of" is
greater than 15% -- which is quite a bit.
Exercise:
What do you think the probability distribution is
for the 10th word after "the"? What are the
two most likely words? Why?
Bayes’ Rule: the, given that preceding is of
How do we calculate what the probability is that
the nth word of a sentence is “the” if the n-1st
word is “of”? We count the number of
occurrences of “the” that follow “of”, and divide
by the total number of “of”s.
 Total number of "of":
36,341
 Total number of "of the":
9,724
 p ( S[i] = the | S[i-1] = of ) =
9724 / 36341 = 0.267
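A sketch of that calculation, using the counts quoted above:

```python
count_of = 36341        # occurrences of "of"
count_of_the = 9724     # occurrences of "of the"

p_the_given_prev_of = count_of_the / count_of
print(p_the_given_prev_of)   # about 0.2676, the 0.267 quoted above
```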
Of, given that following is the:
What is the probability that the nth word is "of",
if the n+1st word is "the"? We count the
number of occurrences of "of the", and divide
by the total number of "the": that is,
 p ( S[i] = "of" | S[i+1] = "the" ) =
9,724 / 69,903 = 0.139
Bayes’ Rule:
The relationship between
pr ( A | B ), "the probability of A given B", and
pr ( B | A ), "the probability of B given A":

$\text{pr}(A \mid B) = \text{pr}(B \mid A)\,\frac{\text{pr}(A)}{\text{pr}(B)}$

This follows from the definition of conditional probability:
pr( A and B ) = pr( A | B ) * pr( B )
pr( A and B ) = pr( B | A ) * pr( A )
so pr( A | B ) * pr( B ) = pr( B | A ) * pr( A ), and dividing both sides by pr( B ) gives the rule above.

Applied to our case (using the stationary assumption, so that the probability that "the" follows "of" does not depend on the position in the sentence):

$p(S[i] = \text{"of"} \mid S[i+1] = \text{"the"}) = \frac{p(S[i] = \text{"the"} \mid S[i-1] = \text{"of"}) \cdot p(S[i-1] = \text{"of"})}{p(S[i+1] = \text{"the"})}$
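A numerical check of Bayes' rule with these counts (N is the corpus size quoted earlier; the other counts are the ones used in the last two calculations):

```python
N = 1160322                                  # total words in the Brown Corpus
count_of, count_the, count_of_the = 36341, 69903, 9724

p_the_given_prev_of = count_of_the / count_of    # about 0.267
p_of = count_of / N
p_the = count_the / N

# Bayes' rule: p("of" | next word is "the")
p_of_given_next_the = p_the_given_prev_of * p_of / p_the
print(round(p_of_given_next_the, 3))             # 0.139, as computed directly above
```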
The joy of logarithms






Probabilities get too small too fast.
Logs help.
Log probabilities are very big and negative.
We make them positive.
Positive log probabilities (a.k.a. inverse log probabilities, ILP)
ILP’s have a meaning in terms of the ideal compressed
length of a symbol in a compression scheme.
Notation: {p} = -1 * log p.
More standard notation: p̃ (a tilde over p).


It's important to be comfortable with the
notation, so that you see easily that the
preceding equation can be written as follows:
 4
 4
log  prob(S[i])   log prob(S[i])
 i 1
 i 1
Base 2 logs
If 2^x is y, then log2(y) is x.
2^3 is 8, so log(8) = 3;
2^-3 is 1/8, so log(1/8) = -3.
Adding log probs in unigram model



The probability of a sentence S in the unigram
model is the product of the probabilities of its
words, so the log probability of a sentence in the
unigram model is the sum of the log probabilities of
its words.
Thus the longer the sentence gets, the larger its
positive log probability gets. In a sense that is reasonable -- the
longer the sentence, the less likely it is.
But we might also be interested in the average log
probability of the sentence: the total log probability of
the sentence divided by the number of words, that is,
the average log probability per word:

$\frac{1}{N}\sum_{i=1}^{N}\{S[i]\}$
This is the entropy -- especially if we're talking about
averaging over not just one sentence, but a
large, representative sample, so that we can
say it's (approximately) the entropy of the
language, not just of some particular
sentence.
$\sum_{i=1}^{N}\{S[i]\}$
Observe carefully that this is a sum in which
we sum over the successive words of the
sentence. When i is 1, we are considering the
first word, which might be the, and when i is
10, the tenth word might be the as well.
It makes sense to re-order the summing of the
log probabilities -- because the sum is the
same regardless of the order in which you
add numbers -- so that all the identical words
are together.
This means that we can rewrite the sum of the
log probabilities as a sum over words in the
vocabulary (or the dictionary -- a list where
each distinct word occurs only once), and
multiply the log probability by the number of
times it is present in the entire sum. Thus:
$\sum_{i=1}^{N}\{S[i]\} = \sum_{j=1}^{V}\text{count}(word_j)\,\{word_j\}$

(The sum on the left is over the words in the string; the sum on the right is over the vocabulary.)

If we've kept track all along of how many
words there are in this corpus (calling this
"N"), then if we divide this calculation by N,
we get, on the left, the average inverse log
probability, and, on the right:
$\sum_{j=1}^{V}\frac{\text{count}(word_j)}{N}\,\{word_j\}$

where count(word_j)/N is the frequency of word_j.
So we can rewrite this as:

$\sum_{j=1}^{V}\text{prob}(word_j)\,\{word_j\} = -\sum_{j=1}^{V}\text{prob}(word_j)\,\log\,\text{prob}(word_j)$
The fundamental goal of analysis is to
maximize the probability of the observed
data.
If we restrict ourselves at first to the
unigram model, then it is not difficult to
prove – and it is important to recognize –
that the maximum probability for a given
corpus is obtained by the distribution
whose word-probabilities coincide
precisely with the observed frequencies.
We can assign a probability to a corpus with
any distribution.
If there are words in the corpus which do not
get a positive value in the distribution, then
the corpus will receive a total probability of
zero, but that is not an impossible situation.
We refer to the set which gets a non-zero
probability as the support of the distribution.
Computational linguists may say that they are
concerned with making sure that all words
are in the support of their probability
distribution.
Suppose we built a distribution for the
words of a corpus randomly.
To make this slightly more concrete, let's
say that these probabilities form the
distribution Q, composed of a set of values
q(wordi), for each word in the corpus (and
possibly other words as well).
Even this randomly assigned distribution
would (mathematically) assign a probability
to the corpus.
$\prod_{i=1}^{N} q(S[i]) = \prod_{j=1}^{V} q(word_j)^{\,\text{count}(word_j)}$

(The product on the left is over the words in the string; the product on the right is over the vocabulary.)
If we now switch to thinking about the log
probability, any particular word which occurs
k times in the corpus will contribute k times
its log probability to the entire sum which
gives us the (positive, or inverse) log
probability:
$\sum_{j=1}^{V}\text{count}(word_j)\,\{word_j\}$

(sum over the vocabulary, where here $\{word_j\} = -\log q(word_j)$)
What should be clear by now is that we can
use any distribution to assign a probability to
a corpus. We could even use the uniform
distribution, which assigns the same
probability to each word.
Now we can better understand the idea that
we may use, for a given corpus, the distribution
whose probabilities are defined exactly by the
frequencies of the words in that corpus.
It is a mathematical fact that this "empirical
distribution" assigns the highest probability to
the corpus, and this turns out to be an
extremely important property.
Important: you should convince yourself now
that if this is true, then the empirical
distribution also assigns the lowest entropy to
the corpus.
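A toy illustration of that fact (a sketch, with an invented eight-word corpus): the empirical distribution yields a smaller total inverse log probability, i.e. a higher probability, than a uniform distribution over the same vocabulary:

```python
import math
from collections import Counter

corpus = "the cat saw the dog and the cat".split()
counts = Counter(corpus)
N = len(corpus)

empirical = {w: c / N for w, c in counts.items()}
uniform = {w: 1 / len(counts) for w in counts}

def inverse_log_prob(corpus, dist):
    # total positive log probability assigned to the corpus
    return sum(-math.log2(dist[w]) for w in corpus)

print(inverse_log_prob(corpus, empirical))   # about 17.2 bits
print(inverse_log_prob(corpus, uniform))     # about 18.6 bits (a lower probability)
```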
It follows from what we have just said that if there is
a "true" probability distribution for English, it will
assign a lower probability to any given corpus than
the empirical distribution based on that corpus does,
and that the empirical distribution based on one corpus
C1 will assign a lower probability to a different
corpus C2 than C2's own empirical distribution does.
Putting that in terms of entropy (that is, taking the
positive log of the probabilities that we have just
mentioned, and dividing by N, the number of words
in the corpus), we may say that the "true" probability
distribution for English assigns a larger entropy to a
corpus C than C's own empirical distribution, and
C1's empirical distribution assigns a higher entropy to a
different corpus C2 than C2's own empirical
distribution does.
When we calculate this formula, weighting one
distribution (like an observed frequency
distribution) by the log probabilities of some
other distribution D2, we call that the cross-entropy;
and if we calculate the difference between
the cross-entropy and the usual (self)
entropy, we also say that we are calculating
the Kullback-Leibler (or "KL") divergence
between the two distributions.
Mathematically, if the probability assigned to
word_j by D1 is expressed as D1(word_j) (and
likewise for D2), then the KL divergence is:

$\sum_{j=1}^{V}\left[D_1(word_j)\log D_1(word_j) - D_1(word_j)\log D_2(word_j)\right] = \sum_{j=1}^{V} D_1(word_j)\,\log\frac{D_1(word_j)}{D_2(word_j)}$
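A sketch of this formula (D1 and D2 here are toy distributions over a three-word vocabulary):

```python
import math

def kl_divergence(D1, D2):
    # sum over the vocabulary of D1(w) * log( D1(w) / D2(w) )
    return sum(D1[w] * math.log2(D1[w] / D2[w]) for w in D1 if D1[w] > 0)

D1 = {"the": 0.5, "of": 0.25, "and": 0.25}    # e.g. an empirical distribution
D2 = {"the": 1/3, "of": 1/3, "and": 1/3}      # e.g. a uniform model
print(kl_divergence(D1, D2))                  # non-negative; 0 only if D1 == D2
```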
The quantity –log2 freq (.) is of great
importance; we can refer to it as the
information content, or the optimal
compressed length, because it is always
possible to develop a compression scheme
by which a symbol w, emitted with probability
prob (w), is represented by a placeholder of
length -1*log2 prob(w) bits.
Probability of a corpus on a uniletter
model
$\prod_{i=1}^{26}\left(\frac{[l_i]}{N}\right)^{[l_i]}$

And if we add 10 e’s:

$\left(\frac{[e]+10}{N+10}\right)^{[e]+10}\prod_{i=1,\,l_i\neq e}^{26}\left(\frac{[l_i]}{N+10}\right)^{[l_i]}$

where $N = \sum_i [l_i]$.
Let’s calculate the difference of these two
log probabilities, or the log of their ratio:

$\frac{\text{new probability}}{\text{old probability}} = \frac{\left(\frac{[e]+10}{N+10}\right)^{[e]+10}\prod_{i=1,\,l_i\neq e}^{26}\left(\frac{[l_i]}{N+10}\right)^{[l_i]}}{\prod_{i=1}^{26}\left(\frac{[l_i]}{N}\right)^{[l_i]}} = \left(\frac{N}{N+10}\right)^{N}\cdot\frac{([e]+10)^{10}\,([e]+10)^{[e]}}{(N+10)^{10}\,[e]^{[e]}}$
Now we take the inverse logarithm, which is
the difference of their information
content:

$-\log\left[\left(\frac{N}{N+10}\right)^{N}\frac{([e]+10)^{10}([e]+10)^{[e]}}{(N+10)^{10}[e]^{[e]}}\right] > 0$

$= N\log\frac{N_{new}}{N_{old}} \;-\; 10\log\,\text{prob}_{new}(e) \;-\; [e]\log\frac{[e]_{new}}{[e]_{old}}$

The first term is the global change in the information content of all of the old symbols; it is positive.
The second term is 10 times the information content of an [e]; we’re subtracting a negative number.
The third term is the change in the information content of each of the old e’s; we’re subtracting a positive number.
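A numerical check of this decomposition on a made-up letter-count table (the counts are purely illustrative):

```python
import math

counts = {"e": 40, "t": 30, "a": 20, "o": 10}    # illustrative letter counts [l_i]
N = sum(counts.values())                          # old total, here 100

def inverse_log_prob(cnts, total):
    # positive log probability of the corpus under its own uniletter model
    return sum(-c * math.log2(c / total) for c in cnts.values())

new_counts = dict(counts, e=counts["e"] + 10)     # add 10 e's
direct = inverse_log_prob(new_counts, N + 10) - inverse_log_prob(counts, N)

# the three-term decomposition above
prob_new_e = new_counts["e"] / (N + 10)
decomposed = (N * math.log2((N + 10) / N)
              - 10 * math.log2(prob_new_e)
              - counts["e"] * math.log2(new_counts["e"] / counts["e"]))

print(round(direct, 6), round(decomposed, 6))     # the two values agree
```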
Mutual information
The mutual information between two
variables is a measure of the relationship
between their joint occurrence and their
individual occurrences.
In the case we’re looking at, we might be
interested in the variables specifying the
letters at adjacent (ith and i+1th) locations in a
string.
Mutual information
The mutual information between two letters is
the log of the ratio of the observed frequency
of the pair to the frequency that we would
predict if the data had no structure, which is
the product of the frequencies of the letters:
the log of the observed to the expected-if-no-structure:
MI(X,Y) = $\log\frac{\text{freq}(XY)}{\text{freq}(X)\,\text{freq}(Y)} = \log\,\text{freq}(XY) - \log\,\text{freq}(X) - \log\,\text{freq}(Y)$

This is a measure of the stickiness between X and Y in the data: they can attract (MI > 0) or repel (MI < 0).