Review of probability theory


580.691 Learning Theory
Reza Shadmehr
Review of Probability Theory
Suggested reading:
Hoel PG, Port SC, and Stone CJ (1971) Introduction to Probability Theory,
Houghton-Mifflin publishers, Boston, pp. 14-128.
Pierre Simon de Laplace (1749-1827), A Philosophical Essay on Probabilities (1814):
“Probability is the ratio of the number of favorable
cases to that of all cases possible.”
Suppose we throw a coin twice. What is the probability that
we will throw exactly one head?
There are four equally possible cases that might arise:
1. One head and one tail.
2. One tail and one head.
3. Two tails.
4. Two heads.
So there are two cases that give exactly one head. The probability that we seek is 2/4 = 1/2.
Laplace firmly believed that, in reality, every event is fully determined by general laws of the universe. But nature is complex and we are
woefully ignorant of her ways; we must therefore calculate probabilities to compensate for our limitations. Events, in other words, are
probable only relative to our meager knowledge. In an epigram that has defined strict determinism ever since, Laplace boasted that if
anyone could provide a complete account of the position and motion of every particle in the universe at any single moment, then total
knowledge of nature's laws would permit a full determination of all future history. Laplace directly links the need for a theory of probability to
human ignorance of nature's deterministic ways. He writes: "So it is that we owe to the weakness of the human mind one of the most
delicate and ingenious of mathematical theories, the science of chance or probability." (Analytical Theory of Probabilities, as cited by
Stephen J. Gould, Dinosaurs in a Haystack, p. 27.)
“If events are independent of one another, the probability of their combined
existence is the product of their respective probabilities.”
Suppose we throw two dice at once. The probability of getting "snake eyes" (two ones)
is (1/6)(1/6) = 1/36.
“The probability that a simple event in the same circumstances will occur consecutively
a given number of times is equal to the probability of this simple event raised to the
power indicated by this number.”
“Suppose that an incident be transmitted to us by twenty witnesses in such a manner
that the first has transmitted it to the second, the second to the third, and so on.
Suppose again that the probability of each testimony be equal to the fraction 9/10.
That of the incident resulting from the testimonies will be less than 1/8. Many
historical events reputed as certain would be at least doubtful if they were submitted
to this test.”
In symbols, for a binary variable:
$x \in \{0,1\}, \qquad P(x=1) = q, \qquad P(x=0) = 1-q$
$P(x^{(1)}=1,\; x^{(2)}=1) = P(x^{(1)}=1)\,P(x^{(2)}=1) = q^2, \qquad \text{where } P(x^{(1)}=1) = q$
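A quick numerical check of the twenty-witness example above (a minimal Python sketch; the 9/10 reliability and the independence of the transmissions come from the quote):

```python
# Laplace's chain of twenty witnesses: each transmits the story
# faithfully with probability 9/10, independently of the others.
q = 9 / 10
n_witnesses = 20
p_chain = q ** n_witnesses  # independent probabilities multiply
print(f"P(story survives {n_witnesses} witnesses) = {p_chain:.4f}")  # ~0.1216 < 1/8
```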
“When two events depend upon each other, the probability of the compound
event is the product of the probability of the first event and the probability
that, this event having occurred, the second will occur.”
Suppose we have three urns, labeled A, B, and C. Two of the urns contain only white
balls, and one contains only black balls. We take one ball from urn C. What is
the probability that it is white?
Writing 1 for a white urn and 0 for the black urn, the equally likely configurations are
(A,B,C) = (1,1,0), (0,1,1), or (1,0,1).
The probability of picking a white ball from urn C is 2/3 (urn C is white in two of the three configurations).
When a white ball has been drawn from urn C, the probability of drawing a white ball
from urn B is 1/2 (only the configurations (0,1,1) and (1,0,1) remain, and B is white in one of them).
Therefore, the probability of drawing two white balls from urns B and C is (2/3)(1/2) = 1/3.
$P(b=1, c=1) = P(b=1 \mid c=1)\,P(c=1)$
In general:
$p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$
Bayes' rule:
$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}, \qquad p(y) = \int p(y \mid x)\,p(x)\,dx$
Example: Suppose that in a group of people, 40% are male and 60% are
female, and that 50% of the males and 30% of the females smoke. Find the
probability that a smoker is male.
Let $x \in \{M, F\}$ and $y \in \{S, N\}$ (smoker or nonsmoker).
$P(x=M) = 0.4, \qquad P(x=F) = 0.6$
$P(y=S \mid x=M) = 0.5, \qquad P(y=S \mid x=F) = 0.3$
We want $P(x=M \mid y=S)$:
$P(x=M \mid y=S) = \frac{P(y=S \mid x=M)\,P(x=M)}{P(y=S)}$
$P(y=S) = P(y=S \mid x=M)\,P(x=M) + P(y=S \mid x=F)\,P(x=F) = (0.5)(0.4) + (0.3)(0.6) = 0.38$
$P(x=M \mid y=S) = \frac{0.20}{0.38} \approx 0.53$
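The same computation as a minimal Python sketch (the variable names are ours):

```python
# Bayes' rule: P(male | smoker) from the priors and conditionals above.
p_m, p_f = 0.4, 0.6        # P(x=M), P(x=F)
p_s_m, p_s_f = 0.5, 0.3    # P(y=S|x=M), P(y=S|x=F)
p_s = p_s_m * p_m + p_s_f * p_f   # marginal P(y=S) = 0.38
p_m_s = p_s_m * p_m / p_s         # posterior P(x=M|y=S)
print(f"P(smoker) = {p_s:.2f}, P(male | smoker) = {p_m_s:.2f}")  # 0.38, 0.53
```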
Expected value and variance of scalar random variables
$E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx = \sum_x x\,P(x) = \bar{x}$
$E[ax] = a\,E[x]$
$\mathrm{var}(x) = E\left[(x - E[x])^2\right] = \int_{-\infty}^{\infty} (x - \bar{x})^2\,p(x)\,dx$
$E\left[(x - \bar{x})^2\right] = E\left[x^2 - 2x\bar{x} + \bar{x}^2\right] = E[x^2] - 2\bar{x}\,E[x] + \bar{x}^2 = E[x^2] - 2\bar{x}^2 + \bar{x}^2 = E[x^2] - \bar{x}^2$
$\mathrm{var}(ax) = a^2\,\mathrm{var}(x)$
Mean squared error
$E\left[(x - r)^2\right] = E\left[x^2 - 2xr + r^2\right] = E[x^2] - 2E[x]\,r + r^2 = \mathrm{var}(x) + E[x]^2 - 2E[x]\,r + r^2 = \underbrace{\mathrm{var}(x)}_{\text{variance}} + \underbrace{\left(E[x] - r\right)^2}_{\text{bias}^2}$
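A numerical sanity check of this variance-bias decomposition (a sketch; the Gaussian sample and the reference point r are arbitrary choices):

```python
import numpy as np

# Verify E[(x-r)^2] = var(x) + (E[x]-r)^2 on a large sample.
# With sample moments (ddof=0) the identity holds exactly.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1_000_000)
r = 0.5
mse = np.mean((x - r) ** 2)
decomposed = x.var() + (x.mean() - r) ** 2
print(f"MSE = {mse:.4f}, var + bias^2 = {decomposed:.4f}")
```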
Binomial distribution and discrete random variables
Suppose a random variable can take only one of two values (e.g., 0 and 1, or
success and failure). Such trials are termed Bernoulli trials.
$x \in \{0,1\}, \qquad P(x=1) = q, \qquad P(x=0) = 1-q$
Probability density (or distribution): $p(x) = q^x (1-q)^{1-x}$
For a sequence of trials $\mathbf{x} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, the probability of a specific sequence of successes and failures is
$P(\mathbf{x}) = q^{x^{(1)}} (1-q)^{1-x^{(1)}}\, q^{x^{(2)}} (1-q)^{1-x^{(2)}} \cdots q^{x^{(N)}} (1-q)^{1-x^{(N)}} = q^n (1-q)^{N-n}$
where $n = \sum_{i=1}^{N} x^{(i)}$ is the number of times the trial succeeded. Since $\frac{N!}{n!\,(N-n)!}$ distinct sequences contain exactly $n$ successes,
$p(n) = \binom{N}{n} q^n (1-q)^{N-n} = \frac{N!}{n!\,(N-n)!}\, q^n (1-q)^{N-n}$
$E[n] = Nq, \qquad \mathrm{var}(n) = Nq(1-q)$
Why?
If we just do one trial, then $n \in \{0,1\}$ with $p(n=0) = 1-q$ and $p(n=1) = q$:
$E[n] = 0\,(1-q) + 1\,q = q$
$E[n^2] = 0^2\,(1-q) + 1^2\,q = q$
$\mathrm{var}(n) = E[n^2] - E[n]^2 = q - q^2 = q(1-q)$
If we do N independent trials, $x \in \{0,1\}$, $\mathbf{x} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, and $n = \sum_{i=1}^{N} x^{(i)}$:
$E[n] = \sum_{i=1}^{N} E[x^{(i)}] = Nq$
$\mathrm{var}(n) = \sum_{i=1}^{N} \mathrm{var}(x^{(i)}) = Nq(1-q)$
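A simulation check of these two results (a sketch; the values of N and q are arbitrary):

```python
import numpy as np

# Simulate many binomial experiments and compare to E[n] = Nq, var(n) = Nq(1-q).
rng = np.random.default_rng(1)
N, q = 50, 0.3
n = rng.binomial(N, q, size=500_000)
print(f"mean: {n.mean():.3f} vs Nq = {N*q}")            # ~15.0
print(f"var:  {n.var():.3f} vs Nq(1-q) = {N*q*(1-q)}")  # ~10.5
```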
Example of Binomial distribution
Suppose a machine produces light bulbs, and each bulb is defective with probability
0.1%. In a box of 200 bulbs, what is the probability that no bulbs are defective?
$x \in \{0,1\}, \qquad P(x=0) = 0.999, \qquad P(x=1) = 0.001$
$\mathbf{x} = \{x^{(1)}, x^{(2)}, \dots, x^{(200)}\}$
$P(\mathbf{x} = \{0, 0, \dots, 0\}) = 0.999^{200} \approx 0.82$
In the same box, what is the probability that no more than two bulbs are defective?
$N = 200, \qquad q = 0.001$
$P(n \le 2) = P(n=0) + P(n=1) + P(n=2)$
$P(n=0) \approx 0.82$
$P(n=1) = \frac{N!}{n!\,(N-n)!}\, q^n (1-q)^{N-n} \approx 0.164$
$P(n=2) \approx 0.016$
$P(n \le 2) \approx 0.99$
We notice that because a defect is a rare event, the distribution of n (i.e., number
of defects) has its peak at zero and then declines very rapidly.
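The same numbers can be reproduced with the binomial formula (a sketch using only Python's standard library):

```python
from math import comb

# P(n defects) for N = 200 bulbs with defect probability q = 0.001.
N, q = 200, 0.001

def p_binom(n):
    return comb(N, n) * q**n * (1 - q)**(N - n)

print(f"P(n=0)  = {p_binom(0):.3f}")                          # ~0.819
print(f"P(n<=2) = {sum(p_binom(k) for k in range(3)):.4f}")   # ~0.999
```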
Example of Binomial distribution: Blindsight
John C. Marshall & Peter W. Halligan, Nature 336, 766 – 767, 1988
The patient, P.S., had sustained right cerebral damage and failed overtly to process information in
the hemispace contralateral to lesion. In common with most patients who manifest left-sided neglect,
P.S. has a left homonymous hemianopia. Nonetheless, her neglect persists despite free movement of
the head and eyes and is thus not a direct consequence of sensory loss in the left visual field. P.S.
was presented simultaneously with two line drawings of a house, in one of which the left side was
on fire. She judged that the drawings were identical; yet when asked to select which house she
would prefer to live in, she reliably chose the house that was not burning.
She was shown 17 examples of the house, and on 14 trials she picked the one that was
not on fire. How do we know whether this is "reliably" different from chance?
$x^{(i)} \in \{0,1\}$
$P(x^{(i)} = 1) = 14/17 = q, \qquad P(x^{(i)} = 0) = 3/17 = 1-q$
$E[x^{(i)}] = P(x=1)\,(1) + P(x=0)\,(0) = q \approx 0.82$
$\mathrm{var}(x^{(i)}) = E\left[(x - E[x])^2\right] = E[x^2] - E[x]^2 = P(x=1)\,(1^2) + P(x=0)\,(0^2) - q^2 = q - q^2 \approx 0.15$
Let $y = \sum_{i=1}^{N} x^{(i)}$ be the number of times she picked the house that was not on fire:
$E[y] = \sum_{i=1}^{N} E[x^{(i)}] = Nq = 14$
$\mathrm{var}(y) = \sum_{i=1}^{N} \mathrm{var}(x^{(i)}) = N(q - q^2) \approx 2.5$
For chance performance $z$ (picking each house with probability 0.5):
$E[z] = 0.5N = 8.5, \qquad \mathrm{var}(z) = (0.5 - 0.5^2)\,N = 4.25$
Her performance of 14 correct choices was $(14 - 8.5)/\sqrt{4.25} \approx 2.7$ SD away from chance.
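The comparison to chance as a short sketch; the exact binomial tail probability is our addition, a stricter check than the SD argument above:

```python
from math import comb, sqrt

# How unlikely are 14 or more "correct" choices out of 17 under chance (q = 0.5)?
N, k = 17, 14
p_tail = sum(comb(N, j) for j in range(k, N + 1)) / 2**N   # exact P(y >= 14)
z = (k - 0.5 * N) / sqrt(0.25 * N)                         # the SD argument above
print(f"z = {z:.2f} SD above chance, exact P(y >= {k}) = {p_tail:.4f}")  # ~2.67, ~0.0064
```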
Poisson distribution and its relation to the binomial distribution
Suppose we do N Bernoulli trials and, on average, observe $\lambda$ successes:
$x \in \{0,1\}, \qquad P(x=1) = \frac{\lambda}{N}, \qquad P(x=0) = 1 - \frac{\lambda}{N}$
$p(n) = \frac{N!}{n!\,(N-n)!} \left(\frac{\lambda}{N}\right)^n \left(1 - \frac{\lambda}{N}\right)^{N-n} = \frac{N(N-1)(N-2)\cdots(N-n+1)}{N^n}\, \frac{\lambda^n}{n!} \left(1 - \frac{\lambda}{N}\right)^{N} \left(1 - \frac{\lambda}{N}\right)^{-n}$
As $N \to \infty$,
$\frac{N(N-1)(N-2)\cdots(N-n+1)}{N^n} \to 1, \qquad \left(1 - \frac{\lambda}{N}\right)^{N} \to \exp(-\lambda), \qquad \left(1 - \frac{\lambda}{N}\right)^{-n} \to 1$
so that
$\lim_{N \to \infty} p(n) = \frac{\lambda^n}{n!} \exp(-\lambda)$
If n is a random variable distributed as a Poisson with parameter $\lambda$, then:
$E[n] = \lambda, \qquad \mathrm{var}(n) = \lambda$
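A numerical illustration of the limit (a sketch; the values of λ and N are arbitrary):

```python
from math import comb, exp, factorial

# Compare the binomial pmf with q = lam/N to its Poisson limit.
lam, N = 2.0, 1000
q = lam / N
for n in range(5):
    p_binom = comb(N, n) * q**n * (1 - q)**(N - n)
    p_pois = lam**n / factorial(n) * exp(-lam)
    print(f"n={n}: binomial {p_binom:.5f}  poisson {p_pois:.5f}")
```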
Continuous random variables: Normal distribution
Scalar random variable:
$x \sim N(\mu, \sigma^2)$
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
$E[x] = \int x\,p(x)\,dx = \mu$
$E\left[(x - \mu)^2\right] = \int (x - \mu)^2\, p(x)\,dx = \sigma^2$
A normal distribution has about 95% of its area in the range $\mu - 2\sigma \le x \le \mu + 2\sigma$.
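A sampling check of the 2σ statement (a sketch; μ and σ are arbitrary):

```python
import numpy as np

# Fraction of normal samples within two standard deviations of the mean.
rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)
frac = np.mean(np.abs(x - mu) <= 2 * sigma)
print(f"P(|x - mu| <= 2 sigma) ~ {frac:.4f}")  # ~0.9545
```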
Continuous random variables: Normal distribution
Vector random variable:
$\mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T \sim N(\boldsymbol{\mu}, C)$
$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n |C|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T C^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
$E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} = \boldsymbol{\mu}$
$E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T p(\mathbf{x})\, d\mathbf{x} = C$
To take the expected value of a vector or a matrix, we take the expected value of the individual elements. When x is a vector, the variance is expressed in terms of a covariance matrix C, where $\rho_{ij}$ corresponds to the degree of correlation between variables $x_i$ and $x_j$:
$c_{ij} = \mathrm{cov}(x_i, x_j) = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]$
$C = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \cdots & \rho_{2n}\sigma_2\sigma_n \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1n}\sigma_1\sigma_n & \rho_{2n}\sigma_2\sigma_n & \cdots & \sigma_n^2 \end{pmatrix}$
$\rho_{ij} = \frac{\mathrm{cov}(x_i, x_j)}{\sigma_i \sigma_j} = \frac{\mathrm{cov}(x_i, x_j)}{\sqrt{\mathrm{var}(x_i)\,\mathrm{var}(x_j)}}$
$\mathbf{x} = [x_1\ x_2]^T, \qquad p(\mathbf{x}) = N(\boldsymbol{\mu}, C), \qquad C = \begin{pmatrix} 1 & -1 \\ -1 & 3 \end{pmatrix}$
$r^2 = (\mathbf{x} - \boldsymbol{\mu})^T C^{-1} (\mathbf{x} - \boldsymbol{\mu})$
Figure: ellipses representing regions of constant probability density in the $(x_1, x_2)$ plane; moving outward, the data fall inside the successive ellipses with 25%, 50%, and 75% probability.
Observations about the data:
1. Variance of $x_2$ is greater than variance of $x_1$.
2. $x_1$ and $x_2$ have a negative correlation.
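Such 25/50/75% contours can be computed because, for a two-dimensional Gaussian, $r^2$ follows a chi-square distribution with 2 degrees of freedom, so the probability mass inside the ellipse $r^2 \le c$ is $1 - \exp(-c/2)$. A minimal sketch:

```python
import numpy as np

# r^2 = (x-mu)^T C^{-1} (x-mu) is chi-square with 2 degrees of freedom,
# so P(r^2 <= c) = 1 - exp(-c/2). Invert to find c for each contour.
for p in (0.25, 0.50, 0.75):
    c = -2 * np.log(1 - p)
    print(f"{p:.0%} of the data fall inside the ellipse r^2 <= {c:.3f}")
```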
Expected value of powers of random variables
$E[x] = \bar{x}, \qquad \mathrm{var}(x) = \sigma^2$
$\mathrm{var}(x) = E[x^2] - \bar{x}^2 \quad\Rightarrow\quad E[x^2] = \bar{x}^2 + \sigma^2$
The second-moment identity holds for any random variable; the third- and fourth-moment expressions below hold for a normally distributed x:
$E[x^3] = \bar{x}^3 + 3\bar{x}\sigma^2$
$E[x^4] = \bar{x}^4 + 6\bar{x}^2\sigma^2 + 3\sigma^4$
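A sampling check of the third- and fourth-moment formulas (a sketch; assumes x is Gaussian, per the note above):

```python
import numpy as np

# Verify E[x^3] and E[x^4] for a Gaussian against the closed forms above.
rng = np.random.default_rng(3)
m, s = 1.5, 0.7
x = rng.normal(m, s, size=2_000_000)
print(f"E[x^3]: {np.mean(x**3):.4f} vs {m**3 + 3*m*s**2:.4f}")
print(f"E[x^4]: {np.mean(x**4):.4f} vs {m**4 + 6*m**2*s**2 + 3*s**4:.4f}")
```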
Sum of two random variables
$E[x + y] = E[x] + E[y]$
$\mathrm{var}(x + y) = E\left[(x + y - E[x+y])^2\right] = E\left[\left((x - \bar{x}) + (y - \bar{y})\right)^2\right]$
$= E\left[(x - \bar{x})^2\right] + E\left[(y - \bar{y})^2\right] + 2E\left[(x - \bar{x})(y - \bar{y})\right]$
$= \mathrm{var}(x) + \mathrm{var}(y) + 2\,\mathrm{cov}(x, y)$
$\mathrm{cov}(x, y) = E\left[(x - \bar{x})(y - \bar{y})\right] = E\left[xy - x\bar{y} - \bar{x}y + \bar{x}\bar{y}\right] = E[xy] - E[x]\,\bar{y} - \bar{x}\,E[y] + \bar{x}\bar{y} = E[xy] - \bar{x}\bar{y}$
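A quick numerical check (a sketch; the correlated pair is generated arbitrarily):

```python
import numpy as np

# Verify var(x+y) = var(x) + var(y) + 2 cov(x, y) on correlated samples.
rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = 0.6 * x + rng.normal(size=1_000_000)   # correlated with x
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(f"var(x+y) = {lhs:.4f}, var + var + 2cov = {rhs:.4f}")
```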
Variance of scalar and vector random variables
$\mathrm{var}(\mathbf{x}) = E\left[(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^T\right] = E[\mathbf{x}\mathbf{x}^T] - \bar{\mathbf{x}}\bar{\mathbf{x}}^T$
$\mathrm{cov}(\mathbf{x}, \mathbf{y}) = E\left[(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{y} - \bar{\mathbf{y}})^T\right] = E[\mathbf{x}\mathbf{y}^T] - \bar{\mathbf{x}}\bar{\mathbf{y}}^T$
$\mathrm{cov}(A\mathbf{x}, B\mathbf{y}) = E\left[A(\mathbf{x} - \bar{\mathbf{x}})\left(B(\mathbf{y} - \bar{\mathbf{y}})\right)^T\right] = E\left[A(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{y} - \bar{\mathbf{y}})^T B^T\right] = A\,E\left[(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{y} - \bar{\mathbf{y}})^T\right]B^T = A\,\mathrm{cov}(\mathbf{x}, \mathbf{y})\,B^T$
$\mathrm{cov}(\mathbf{x}, \mathbf{y}) = \mathrm{cov}(\mathbf{y}, \mathbf{x})^T$
$\mathrm{var}(\mathbf{x} + \mathbf{y}) = \mathrm{var}(\mathbf{x}) + \mathrm{cov}(\mathbf{x}, \mathbf{y}) + \mathrm{cov}(\mathbf{y}, \mathbf{x}) + \mathrm{var}(\mathbf{y})$
$\mathrm{var}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}^T\,\mathrm{var}(\mathbf{x})\,\mathbf{a}$
$\mathrm{var}(A\mathbf{x}) = E\left[A(\mathbf{x} - \bar{\mathbf{x}})\left(A(\mathbf{x} - \bar{\mathbf{x}})\right)^T\right] = E\left[A(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^T A^T\right] = A\,E\left[(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^T\right]A^T = A\,\mathrm{var}(\mathbf{x})\,A^T$
The variance of a vector random variable is a symmetric, positive semidefinite matrix (positive definite unless some linear combination of the elements of x has zero variance).
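A final numerical check of $\mathrm{var}(A\mathbf{x}) = A\,\mathrm{var}(\mathbf{x})\,A^T$ (a sketch; the matrix A and the sampled covariance are arbitrary choices):

```python
import numpy as np

# Verify var(Ax) = A var(x) A^T using sample covariance matrices.
rng = np.random.default_rng(5)
x = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 2.0]], size=500_000)
A = np.array([[2.0, 1.0], [0.0, 3.0]])
lhs = np.cov((x @ A.T).T)     # var(Ax) estimated from transformed samples
rhs = A @ np.cov(x.T) @ A.T   # A var(x) A^T
print(np.round(lhs, 3))
print(np.round(rhs, 3))       # the two matrices agree
```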