Transcript Document

SOME BASIC NOTIONS OF
PROBABILITY THEORY
Università di Venezia
29 September 2003
What probability theory is for



Suppose that we have a fair die, with six faces, and that we keep
throwing (‘casting’) it. For what proportion of the throws will we get
– a particular value (say, 4)?
– an even value?
– a value greater than or equal to 3?
PROBABILITY THEORY was developed to give us a vocabulary to talk
about the LIKELIHOOD of certain EVENTS
Where an EVENT is any result of a TRIAL (‘experiment’):
– Getting the value 4 when casting a die,
– Getting a value greater than 3,
– But also: winning a race, getting a ‘tail’ result when flipping a coin,
encountering a certain word
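A small illustration (not part of the original slides): a Python sketch that estimates these proportions by simulating throws of a fair die. The function name and the number of throws are arbitrary choices.

    import random

    def estimate_proportions(num_throws=100000):
        # Simulate throws of a fair six-sided die and count how often
        # each of the three events above occurs.
        hits = {"value 4": 0, "even value": 0, "value >= 3": 0}
        for _ in range(num_throws):
            v = random.randint(1, 6)          # one throw of the die
            hits["value 4"] += (v == 4)
            hits["even value"] += (v % 2 == 0)
            hits["value >= 3"] += (v >= 3)
        return {event: count / num_throws for event, count in hits.items()}

    print(estimate_proportions())
    # Expected: roughly 1/6, 1/2 and 2/3 respectively.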
EVENTS and OUTCOMES

More precisely, we call every possible result of a trial an OUTCOME
– E.g., any of the numbers on the die, such as 4, constitutes an outcome

An EVENT is defined as a set of possible OUTCOMES (possibly just one):
– E1 = {4}
– E2 = {2,4,6}
– E3 = {3,4,5,6}
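A minimal sketch (added here, not in the original) of representing outcomes and events as Python sets; an event is simply a subset of the possible outcomes.

    # The six possible outcomes of one throw of the die
    outcomes = {1, 2, 3, 4, 5, 6}

    E1 = {4}             # getting the value 4
    E2 = {2, 4, 6}       # getting an even value
    E3 = {3, 4, 5, 6}    # getting a value greater than or equal to 3

    # Every event is a set of outcomes
    assert all(e <= outcomes for e in (E1, E2, E3))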
SAMPLE SPACES

The SAMPLE SPACE is the set of all possible outcomes:
– For the case of a die, the sample space is S = {1,2,3,4,5,6}

Another example:
– Writing down a word is a TRIAL,
– The word written down is an OUTCOME,
– EVENTS which result from this trial are: writing that particular word, writing that word in uppercase letters, etc.
– The set of all possible spellings is the SAMPLE SPACE

(NB: sometimes the sample space is NOT finite)
Probability Functions


The likelihood of an event is indicated using a PROBABILITY FUNCTION
The probability of an event E is specified by a function P(E), with values between 0 and 1
– P(E) = 1: the event is CERTAIN to occur
– P(E) = 0: the event is certain NOT to occur

Example: in the case of die casting,
– P(E’ = ‘getting as a result a number between 1 and 6’) = P({1,2,3,4,5,6}) = 1
– P(E’’ = ‘getting as a result 7’) = 0
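A sketch of such a probability function for the fair-die case (ours, not from the slides): with equally likely outcomes, P(E) = |E| / |S|.

    from fractions import Fraction

    S = {1, 2, 3, 4, 5, 6}   # sample space for one throw of a fair die

    def P(event):
        # Probability of an event (a set of outcomes) when all outcomes are equally likely
        return Fraction(len(event & S), len(S))

    print(P({1, 2, 3, 4, 5, 6}))   # 1   -> certain to occur
    print(P({7}))                  # 0   -> certain not to occur
    print(P({4}))                  # 1/6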
Probabilities and relative frequencies




In the case of a die, we know all of the possible outcomes ahead
of time, and we also know a priori what the likelihood of a certain
outcome is. But in many other situations in which we would like to
estimate the likelihood of an event, this is not the case.
For example, suppose that we would like to bet on horses rather
than on dice. Harry is a race horse: we do not know ahead of time
how likely it is for Harry to win. The best we can do is to
ESTIMATE P(WIN) using the RELATIVE FREQUENCY of the
outcome `Harry wins’
Suppose Harry raced 100 times, and won 20 races overall. Then
– P(WIN) = NUMBER OF WINS / TOTAL NUMBER OF RACES = 20/100 = .2
– P(LOSE) = .8
The use of probabilities we are interested in (estimating the
probability of certain sequences of words) is of this type
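A tiny sketch (variable names are ours) of estimating P(WIN) by relative frequency from Harry's race record:

    races_run = 100
    races_won = 20

    p_win = races_won / races_run    # relative frequency of the outcome `Harry wins'
    p_lose = 1 - p_win

    print(p_win, p_lose)             # 0.2 0.8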
Conjunctions of events


We are often interested in the probability of TWO events happening:
– When throwing a die TWICE, the probability of getting a 6 both times
– The probability of finding a sequence of two words: `the’ and `car’
We use the notation A&B to indicate the conjunction of two events, and P(A&B) to indicate the probability of such a conjunction
– Because events are SETS, the probability is often also written as P(A ∩ B)
We use the same notation with WORDS: P(‘the’ & ‘car’)
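An illustration (not from the slides) of estimating the probability of a conjunction: the probability of getting a 6 on both of two throws, by simulating pairs of throws.

    import random

    def estimate_p_double_six(num_trials=200000):
        # Each trial is two throws of a fair die; count the trials in which both show 6.
        both_six = sum(
            1
            for _ in range(num_trials)
            if random.randint(1, 6) == 6 and random.randint(1, 6) == 6
        )
        return both_six / num_trials

    print(estimate_p_double_six())   # close to 1/36, i.e. about 0.028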
Prior probability vs. conditional probability




The prior probability P(WIN) is the likelihood of an event occurring
irrespective of anything else we know about the world
Often, however, we DO have additional information that can help
us make a more informed guess about the likelihood of a certain
event
E.g., take again the case of Harry the horse. Suppose we know
that it was raining during 30 of the races that Harry raced, and that
Harry won 15 of these races. Intuitively, the probability of Harry
winning when it’s raining is .5 - HIGHER than the probability of
Harry winning overall
– We can make a more informed guess
We indicate the probability of an event A happening given that we
know that event B happened as well – the CONDITIONAL
PROBABILITY of A given B – as P(A|B)
Conditional probability

Conditional probability is DEFINED as follows:
P(A|B) = P(A&B) / P(B)

Intuitively, you RESTRICT the range of trials in consideration to those in which event B took place as well (most easily seen when thinking in terms of relative frequency)
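A sketch (the record format is an assumption of ours) that computes P(A|B) exactly as the definition says, by restricting attention to the trials in which B also took place:

    def conditional_probability(trials, a, b):
        # trials: a list of sets, each set holding the events observed in one trial
        # P(A|B) = P(A&B) / P(B), estimated from relative frequencies
        trials_with_b = [t for t in trials if b in t]           # restrict to trials where B happened
        if not trials_with_b:
            raise ValueError("P(B) is zero; conditional probability undefined")
        trials_with_a_and_b = [t for t in trials_with_b if a in t]
        return len(trials_with_a_and_b) / len(trials_with_b)

    # Toy race record matching the slides: 100 races, 30 rainy (15 of them won), 5 dry wins
    races = ([{"RAIN", "WIN"}] * 15 + [{"RAIN"}] * 15
             + [{"WIN"}] * 5 + [set()] * 65)
    print(conditional_probability(races, "WIN", "RAIN"))        # 0.5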
Example

Consider the case of Harry the horse again:
P(WIN|RAIN) = P(WIN & RAIN) / P(RAIN)

Where:
– P(WIN&RAIN) = 15/100 = .15
– P(RAIN) = 30/100 = .30

This gives:
P(WIN|RAIN) = 0.15 / 0.3 = 0.5

(in agreement with our intuitions)
The chain rule

The definition of conditional probability can be rewritten as:
– P(A&B) = P(A|B) P(B)
– P(A&B) = P(B|A) P(A)

These equations generalize to the so-called CHAIN RULE:
– P(w1,w2,w3,….wn) = P(w1) P(w2|w1) P(w3|w1,w2) …. P(wn|w1 …. wn-1)

The chain rule plays an important role in statistical NLE:
– P(the big dog) = P(the) P(big|the) P(dog|the big)
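A sketch of the chain rule applied to the example phrase; the conditional probabilities below are purely made-up numbers for illustration.

    # Invented probabilities for the words of `the big dog'
    p_the = 0.06                   # P(the)
    p_big_given_the = 0.01         # P(big | the)
    p_dog_given_the_big = 0.02     # P(dog | the, big)

    # Chain rule: P(the big dog) = P(the) * P(big|the) * P(dog|the big)
    p_the_big_dog = p_the * p_big_given_the * p_dog_given_the_big
    print(p_the_big_dog)           # 1.2e-05 with these made-up numbers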
Independence




Additional information does not always help. For example,
knowing the color of a die usually doesn’t help us predict the
result of a throw; knowing the name of the jockey’s girlfriend
doesn’t help predict how well the horse he rides will do in a
race; etc. When this is the case, we say that two events are
INDEPENDENT
The notion of independence is defined in probability theory using
the definition of conditional probability
Consider again the basic form of the chain rule:
– P(A&B) = P(A|B) P(B)
We say that two events are INDEPENDENT if:
– P(A&B) = P(A) P(B)
– equivalently (when P(B) > 0), P(A|B) = P(A)
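A small check (our example, with made-up counts) of whether two events look independent, by comparing the relative frequency of A&B with the product of the relative frequencies of A and B:

    def looks_independent(n_total, n_a, n_b, n_a_and_b, tolerance=0.01):
        # Two events look independent if P(A&B) is (approximately) P(A) * P(B)
        p_a = n_a / n_total
        p_b = n_b / n_total
        p_a_and_b = n_a_and_b / n_total
        return abs(p_a_and_b - p_a * p_b) < tolerance

    # Die colour (red) vs. getting a 6, out of 600 throws: 300 throws of the red die,
    # 100 sixes overall, 50 of which happened on the red die
    print(looks_independent(600, 300, 100, 50))    # True: 50/600 = (300/600) * (100/600)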
Bayes’ Theorem


Bayes’ Theorem is a pretty trivial consequence of the
definition of conditional probability, but it is very useful
in that it allows us to use one conditional probability to
compute another
We already saw that the definition of conditional
probability can be rewritten equivalently as:
– P(A&B) = P(A|B) P(B)
– P(A&B) = P(B|A) P(A)
Since both right-hand sides equal P(A&B), equating them and dividing by P(A) gives Bayes’ theorem:

P(B|A) = P(A|B) P(B) / P(A)
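A quick check of the theorem on the Harry example (our worked numbers): knowing P(WIN|RAIN), P(RAIN) and P(WIN), we can recover P(RAIN|WIN).

    def bayes(p_a_given_b, p_b, p_a):
        # P(B|A) = P(A|B) * P(B) / P(A)
        return p_a_given_b * p_b / p_a

    # From the earlier slides: P(WIN|RAIN) = 0.5, P(RAIN) = 0.3, P(WIN) = 0.2
    print(bayes(0.5, 0.3, 0.2))   # 0.75 = P(RAIN|WIN): 15 of Harry's 20 wins were in the rain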
Statistical NLE





What’s the connection between this and natural language?
A number of NL interpretation (and generation) tasks can be formulated in
terms of CHOICE BETWEEN ALTERNATIVES: choosing the most likely
– continuation of a certain sentence
– POS tag or meaning for a word
– parse for a sentence
In all of these cases, we can formalize `likelihood’ using probabilities, and
choose the alternative with THE HIGHEST PROBABILITY
Tomorrow we will see the first (and simplest) example of this: choosing
the most likely next word
This task can be viewed as the task of choosing the w that maximizes:
P(w | W1 …. WN-1)
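A minimal sketch of this choice (the candidate words and their conditional probabilities are invented):

    # Hypothetical conditional probabilities P(w | "the big") for a few candidate words
    candidates = {"dog": 0.02, "house": 0.015, "idea": 0.001}

    # Choose the w that maximizes P(w | w1 ... wN-1)
    best_word = max(candidates, key=candidates.get)
    print(best_word)   # dog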
Using corpora to estimate probabilities


But where do we get these probabilities? Idea:
estimate them by RELATIVE FREQUENCY.
The simplest method: Maximum Likelihood
Estimate (MLE). Count the number of words N in
a corpus, then count how many times C(W1..Wn) a given
sequence is encountered:

P(W1..Wn) = C(W1..Wn) / N
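A sketch of the MLE estimate on a toy corpus (both the corpus and the target sequence are made up):

    def mle_estimate(corpus_tokens, sequence):
        # C(W1..Wn): how many times the sequence occurs in the corpus
        n = len(sequence)
        count = sum(
            1
            for i in range(len(corpus_tokens) - n + 1)
            if corpus_tokens[i:i + n] == sequence
        )
        # N: the total number of words in the corpus
        return count / len(corpus_tokens)

    corpus = "the big dog saw the big cat near the big dog".split()
    print(mle_estimate(corpus, ["the", "big"]))   # 3/11, about 0.27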
Readings



– Krenn and Samuelsson, The Linguist’s Guide to Statistics (on the Web site)
– The Statistics Glossary
– Further reading: Manning and Schuetze, chapter 2