PowerPoint Transcript

PMAI 2003/04
Targil 1: Basic Probability
(see further details in handouts)
Random Variables and Probability
A random variable X describes a value that is the result of a
process which cannot be determined in advance. The suit of a
card we draw from a deck is a random variable.
The Sample space S of a random variable is the set of all
possible values of the variable. An event is a subset of S.
A random variable is not necessarily random in the true sense
of the word. Usually we deal with random variables which
obey some law of nature, or a background probability. Still,
we cannot determine its value, only the chance of
encountering a certain value. The probability function p(x)
defines this chance.
Random Variables and Probability (2)
Probability is our way of formally describing our world and
evaluating uncertainties. We define our world as a set of
random variables and divide it into elementary events, or
states. The states are mutually exclusive. A probability space
holds the following properties:
\sum_{x} p(x) = 1, \qquad 1 \ge p(x) \ge 0
For a continuous random variable, p(x) is the distribution
density function:
\int_{x} p(x)\,dx = 1
Note: p(x) can be greater than 1 because it is not a probability value.
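For instance (a minimal Python sketch, not from the original handout), the uniform density on [0, 0.5] equals 2 everywhere on its support, yet it still integrates to 1:

# Uniform density on [0, 0.5]: p(x) = 2 on the support, 0 elsewhere.
def p(x):
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

# Crude Riemann-sum integration over [0, 1].
n = 100000
dx = 1.0 / n
total = sum(p((i + 0.5) * dx) * dx for i in range(n))
print(total)  # ~1.0, even though p(x) = 2 > 1 on the support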
Expectation, Variance and Deviation
The moments of a random variable define important
characteristics of random variables:
M_n = \langle x^n \rangle = \sum_{x} x^n\,p(x)

The first moment is the expectation: M_1 = \mu = \langle x \rangle = \sum_{x} x\,p(x)
Note: The expectation has a misleading name and is not always the value we
expect to see most. In the case of the number on a die the expectation is 3.5,
which is not a value we will ever see! The expectation is a weighted average.
The variance is defined by Var[x] = \langle x^2 \rangle - \langle x \rangle^2 = M_2 - M_1^2.
The standard deviation \sigma = Var[x]^{1/2} evaluates the “spread
factor” of x in relation to the mean.
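As an illustrative sketch (Python; not part of the original handout), the expectation, variance and standard deviation of a fair six-sided die follow directly from these definitions:

# Fair six-sided die: p(x) = 1/6 for x in 1..6.
values = range(1, 7)
p = 1.0 / 6.0

m1 = sum(x * p for x in values)      # first moment (expectation), 3.5
m2 = sum(x * x * p for x in values)  # second moment
variance = m2 - m1 ** 2              # Var[x] = M2 - M1^2, ~2.9167
std_dev = variance ** 0.5            # sigma, ~1.7078
print(m1, variance, std_dev)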
Conditional probability and Bayes’ Rule
The knowledge that a certain event occurred can change the
probability that another event will occur. P(x|y) denotes the
conditional probability that x will happen given that y has
happened. Bayes’ rule states that:
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = P(B \mid A)\,\frac{P(A)}{P(B)}
The complete probability formula states that
P(A) = P(A|B)P(B) + P(A|¬B)P(¬B), or in the more general case:

P(A) = \sum_{b} P(A \mid b)\,P(b)
Note: P(A) = \alpha P(A|B) + (1-\alpha) P(A|¬B), with \alpha = P(B), mirrors our intuition that the
unconditional probability of an event lies somewhere between its
conditional probabilities under two opposing assumptions.
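A small numeric sketch (Python; the probabilities are made-up illustrative values, not from the slides) combining the complete probability formula with Bayes’ rule:

# Hypothetical numbers for illustration only.
p_B = 0.3             # P(B)
p_A_given_B = 0.9     # P(A|B)
p_A_given_notB = 0.2  # P(A|~B)

# Complete probability formula: P(A) = P(A|B)P(B) + P(A|~B)P(~B).
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' rule: P(B|A) = P(A|B)P(B) / P(A).
p_B_given_A = p_A_given_B * p_B / p_A
print(p_A, p_B_given_A)  # 0.41, ~0.659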
Independence between random variables
Two random variables X and Y are independent if knowledge about
X does not change the uncertainty about Y and vice versa. Formally,
for a probability distribution P:

Ind(X;Y) ⟺ P(X|Y) = P(X) (and symmetrically)

• If Ind(X;Y) holds then P(X,Y) = P(X)P(Y)
• If Ind(X;Y,Z) holds then P(X,Y,Z) = P(X)P(Y,Z)
• Ind(X;Y,Z) ⟹ Ind(X;Y) and Ind(X;Z), but not the other way around!
• If Ind(X;Y|Z) holds then P(X,Y|Z) = P(X|Z)P(Y|Z), or equivalently
  P(X|Y,Z) = P(X|Z)
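As a minimal sketch (Python; the joint table is an arbitrary illustrative example, not from the slides), independence can be verified by checking that the joint distribution factors into its marginals:

# Joint distribution P(X, Y); numbers chosen so that X and Y are independent.
joint = {
    (0, 0): 0.12, (0, 1): 0.28,
    (1, 0): 0.18, (1, 1): 0.42,
}

# Marginals P(X) and P(Y).
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# Ind(X;Y) holds iff P(x, y) = P(x)P(y) for every pair.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
                  for x in (0, 1) for y in (0, 1))
print(independent)  # True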
The importance of conditional probability:
Simpson’s Paradox
The following table describes the effectiveness of a certain drug on a population:
The ratio of recovery for the whole population increases
from 40/50 to 105/90. Surprisingly, the ratio of recovery
decreased for males as well as for females. How can this be?
The paradox lies in ignoring the context, or the condition under
which the results were given. If we had written the facts in
terms of correct conditional probabilities (in a 50% male
population) instead of absolute numbers, we would get:

P(Recovery|Drug) = \frac{1}{2}\cdot\frac{15}{55} + \frac{1}{2}\cdot\frac{90}{140} \approx 0.4578
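A quick recomputation of this conditioning step (Python sketch, using only the two per-sex ratios quoted above; the transcript does not say which ratio belongs to which sex):

# Per-sex recovery rates within the drug group, as quoted on the slide.
rate_sex_a = 15 / 55
rate_sex_b = 90 / 140

# Condition on a 50% male population: weight each per-sex rate by 1/2.
p_recovery_given_drug = 0.5 * rate_sex_a + 0.5 * rate_sex_b
print(round(p_recovery_given_drug, 4))  # 0.4578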
Stating your facts carefully:
The Three Prisoners Paradox
Three prisoners A,B and C have been tried for murder and are
expecting their sentence in the morning. They are told that
only one of them will be convicted and that the others will go
free. They also know that the guard knows which one of them
is to be convicted. At night, prisoner A calls the guard and asks
him to pass a letter to one of his innocent friends (there is at
least one, so the guard’s action should not change anything).
After a while, A asks the guard to tell him the identity of the
prisoner who received the letter. Again, the guard agrees to
tell him it is B as it does not appear to change anything. A, in
his bunk, thinks: “Before I talked to the guard my chances of
being executed were 1/3. Now that I know that B is innocent
my chances of dying have gone from 33% to 50%. I made
sure I did not ask anything about my own fate. What did I
do wrong?”
Stating your facts carefully:
The Three Prisoners Paradox (2)
Let us try and work out the problem in probabilistic terms. Let
G_A denote the event of A being guilty and I_B the event of B being
innocent. P(I_B | G_A) = 1:

P(G_A \mid I_B) = P(I_B \mid G_A)\,\frac{P(G_A)}{P(I_B)} = \frac{1/3}{2/3} = \frac{1}{2}
Worse than that, we would have gotten the same answer if the
guard had said that C is innocent. How can this be?
The paradox comes from not defining the context correctly. I_B
was concluded from I_B’ = “The guard will declare B innocent”.
P(I_B’) = 1/2 and not 2/3. Rewriting the equations we get:

P(G_A \mid I_B') = P(I_B' \mid G_A)\,\frac{P(G_A)}{P(I_B')} = \frac{(1/2)\cdot(1/3)}{1/2} = \frac{1}{3}
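A Monte Carlo sketch of the corrected argument (Python; not part of the original handout) that estimates P(G_A | guard names B):

import random

trials = 200_000
a_guilty_and_guard_says_b = 0
guard_says_b = 0

for _ in range(trials):
    guilty = random.choice("ABC")      # the convicted prisoner, chosen uniformly
    if guilty == "A":
        named = random.choice("BC")    # guard may name either innocent friend
    elif guilty == "B":
        named = "C"                    # guard must name the innocent prisoner
    else:
        named = "B"
    if named == "B":
        guard_says_b += 1
        if guilty == "A":
            a_guilty_and_guard_says_b += 1

# P(A guilty | guard names B): close to 1/3, not 1/2.
print(a_guilty_and_guard_says_b / guard_says_b)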
Bayes’ rule and inference
Bayes’ rule is fundamental in learning when viewed in terms of
evidence and hypothesis:

P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)} = \frac{P(e \mid H)\,P(H)}{P(e \mid H)\,P(H) + P(e \mid \neg H)\,P(\neg H)}

Here P(H|e) is the posterior probability, P(e|H) the likelihood,
and P(H) the prior probability.
This allows us to update our belief in a hypothesis based on
prior belief and in response to evidence.
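A small numeric sketch of such an update (Python; the prior and likelihoods are hypothetical values, not from the slides):

# Hypothetical values for illustration.
prior_H = 0.1          # P(H): prior belief in the hypothesis
p_e_given_H = 0.8      # P(e|H): likelihood of the evidence under H
p_e_given_notH = 0.3   # P(e|~H)

# P(e) by the complete probability formula, then Bayes' rule.
p_e = p_e_given_H * prior_H + p_e_given_notH * (1 - prior_H)
posterior_H = p_e_given_H * prior_H / p_e
print(posterior_H)  # ~0.229: the evidence raises our belief from 0.1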
Binomial and Poisson distributions
Assume we perform n independent experiments (Bernoulli trials)
with a probability p for “success” and 1-p for “failure”.
The distribution of the number of “successes” k, called the
binomial distribution, is given by:
p_n(k) = \binom{n}{k} p^k (1-p)^{n-k}

E(k) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = np

Var[k] = np(1-p)
In the limit where n → ∞, p → 0, p·n → λ (constant), we
approximate p_n(k) by the Poisson distribution:

p_n(k) = \frac{\lambda^k}{k!} e^{-\lambda}

E[k] = Var[k] = \lambda
The Poisson distribution is especially important in physical
processes in time (or space), such as the firing rate of a
neuron, etc.
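A quick comparison sketch (Python, standard library only; the parameters are arbitrary illustrative choices) of the binomial probabilities against their Poisson approximation for large n and small p:

import math

n, p = 1000, 0.003   # large n, small p
lam = n * p          # lambda = np = 3

for k in range(6):
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k / math.factorial(k) * math.exp(-lam)
    print(k, round(binom, 5), round(poisson, 5))
# The two columns agree closely.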
The Normal distribution
The one-dimensional normal distribution, characterized by
the expectation μ and standard deviation σ, is:

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
This distribution is a good approximation for the binomial
distribution where n is large (and np >> 0) and in the
“vicinity” of μ.
The normal distribution has desirable computational properties
which we will meet later in the course. For example, the
distribution of a sum of Gaussians (variables that distribute
normally) is itself a normal distribution with simply
calculated expectation and variance.
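A simulation sketch (Python; the parameters are arbitrary) of the claim that a sum of independent Gaussians is again Gaussian, with expectations and variances simply adding:

import random
import statistics

# Two independent Gaussians: N(mu1, sigma1^2) and N(mu2, sigma2^2).
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 1.5

samples = [random.gauss(mu1, sigma1) + random.gauss(mu2, sigma2)
           for _ in range(100_000)]

print(statistics.mean(samples))      # ~ mu1 + mu2 = -2.0
print(statistics.variance(samples))  # ~ sigma1^2 + sigma2^2 = 6.25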
Important inequalities
Markov’s inequality: let x be a positive random variable
with an expectation μ = E(x). Then for every ε > 0:

P(x \ge \varepsilon) \le \frac{E(x)}{\varepsilon}
The intuition is that as ε moves further away from the expectation,
the probability of a sample with x ≥ ε decreases significantly:

E(x) = \int_0^{\infty} x\,P(x)\,dx = \int_0^{\varepsilon} x\,P(x)\,dx + \int_{\varepsilon}^{\infty} x\,P(x)\,dx \ge \int_{\varepsilon}^{\infty} x\,P(x)\,dx \ge \varepsilon \int_{\varepsilon}^{\infty} P(x)\,dx = \varepsilon\,P(x \ge \varepsilon)
Substituting y = (x-μ)^2 we get Chebyshev’s inequality: for
every ε > 0:

P(|x - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}
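An empirical check of both bounds (Python sketch; the exponential distribution is chosen only as a convenient positive example):

import random

# Positive random variable: exponential with rate 1, so E(x) = 1 and Var(x) = 1.
n = 200_000
samples = [random.expovariate(1.0) for _ in range(n)]
mu, sigma2, eps = 1.0, 1.0, 3.0

# Markov: P(x >= eps) <= E(x)/eps.
markov_lhs = sum(1 for x in samples if x >= eps) / n
print(markov_lhs, "<=", mu / eps)

# Chebyshev: P(|x - mu| >= eps) <= sigma^2/eps^2.
cheb_lhs = sum(1 for x in samples if abs(x - mu) >= eps) / n
print(cheb_lhs, "<=", sigma2 / eps ** 2)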
Important inequalities (2)
Substituting y = \frac{1}{n}\sum_{i=1}^{n} x_i in Chebyshev’s inequality we get:

var(y) = E\left[(y - E(y))^2\right] = \frac{1}{n^2}\sum_{i=1}^{n} var(x_i) = \frac{\sigma^2}{n}

\Rightarrow \quad P\left(\left|\frac{1}{n}\sum_{i=1}^{n} x_i - E(x)\right| \ge \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2} \xrightarrow[n \to \infty]{} 0
Does this remind you of anything?
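A minimal simulation of this concentration (Python; fair-coin flips chosen only as an example): as n grows, the sample mean settles around E(x) = 0.5, in line with the bound above (the law of large numbers).

import random

# Average of n fair-coin flips (Bernoulli with p = 0.5).
for n in (10, 100, 1000, 10_000, 100_000):
    mean = sum(random.randint(0, 1) for _ in range(n)) / n
    print(n, mean)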
Chernoff’s inequality is used to bound deviations from the
expectation for a sum of Bernoulli variables: S = X_1 + … + X_m
where P(X_i = 1) = p. The additive bound (Hoeffding) is:
P S  pm  (1   ) pm  e
2 m 2
and the multiplictive bounds:
PS / pm  1     e
 mp 2 / 3
PS / pm  1     e
 mp 2 / 2
2
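An empirical sketch of the additive bound (Python; m, p and ε are arbitrary illustrative choices):

import math
import random

m, p, eps = 1000, 0.3, 0.05
trials = 10_000

# Estimate P(S - pm >= eps*m) for S = X_1 + ... + X_m, X_i ~ Bernoulli(p).
count = 0
for _ in range(trials):
    s = sum(1 for _ in range(m) if random.random() < p)
    if s - p * m >= eps * m:
        count += 1

empirical = count / trials
hoeffding_bound = math.exp(-2 * m * eps ** 2)
print(empirical, "<=", hoeffding_bound)  # the bound e^{-5} ~ 0.0067 holds comfortably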