Transcript Document
Lecture 2
Probability
and what it has to do with
data analysis
Abstraction
Random variable, x
it has no set value, until you ‘realize’ it
its properties are described by a probability, P
One way to think about it
[Figure: a pot containing an infinite number of x's, described by p(x); drawing one x from the pot "realizes" x]
Describing P
If x can take on only discrete values, say (1, 2, 3, 4, or 5), then a table would work:

x    1      2      3      4      5
P    10%    30%    40%    15%    5%

15% probability that x=4
Probabilities should sum to 100%
Sometimes you see probabilities written as fractions, instead of percentages:

x    1       2       3       4       5
P    0.10    0.30    0.40    0.15    0.05

0.15 probability that x=4
Probabilities should sum to 1
And sometimes you see probabilities plotted as a histogram
[Figure: bar chart of P(x) vs. x for x = 1, ..., 5, vertical axis running from 0.0 to 0.5; the bar at x=4 has height 0.15]
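A minimal sketch (in Python, assuming numpy and matplotlib are available) of building this table and plotting it as a histogram:

    # Minimal sketch: tabulate the discrete distribution above and plot it as a histogram.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1, 2, 3, 4, 5])
    P = np.array([0.10, 0.30, 0.40, 0.15, 0.05])

    assert np.isclose(P.sum(), 1.0)   # probabilities should sum to 1

    plt.bar(x, P, width=0.8)
    plt.xlabel('x')
    plt.ylabel('P(x)')
    plt.ylim(0, 0.5)
    plt.show()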
If x can take on any value, then use a smooth
function (or “distribution”) p(x) instead of a
table
[Figure: p(x) vs. x, with the area under the curve between x1 and x2 shaded]
probability that x is between x1 and x2 is proportional to this area
mathematically: P(x1 < x < x2) = ∫_{x1}^{x2} p(x) dx
[Figure: p(x) vs. x, with the entire area under the curve shaded]
Probability that x is between -∞ and +∞ is 100%, so total area = 1
Mathematically: ∫_{-∞}^{+∞} p(x) dx = 1
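A quick numerical sanity check of this normalization, sketched in Python; the unit Gaussian below is just an assumed example density:

    # Minimal sketch: numerically verify that a pdf integrates to (approximately) 1.
    import numpy as np

    x = np.linspace(-10, 10, 100001)                  # wide enough to capture essentially all the area
    p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)      # example pdf: normal with mean 0, sigma 1

    dx = x[1] - x[0]
    total_area = (p * dx).sum()                       # simple Riemann-sum approximation of the integral
    print(total_area)                                 # ~1.0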
One reason why all this is relevant …
Any measurement of data that contains noise is treated as a random variable, d
and …
the distribution p(d) embodies both the 'true value' of the datum being measured and the measurement noise
and …
all quantities derived from a random variable are themselves random variables
so …
the algebra of random variables allows you to understand how measurement noise affects inferences made from the data
Basic Description of Distributions
want two basic numbers
1) something that describes what x’s commonly occur
2) something that describes the variability of the x’s
1) something that describes what x's commonly occur
that is, where the distribution is centered

Mode
the x at which the distribution has its peak
the most-likely value of x
[Figure: p(x) vs. x with the peak marked at x = xmode]
The most popular car in the US is the Honda CR-V
[Photo: a Honda CR-V]
But the next car you see on the highway will probably not be a Honda CR-V
[Photo: highway traffic. Where's a CR-V?]
But modes can be deceptive …
100 realizations of x
Sure, the 1-2 range has the most counts,
but most of the measurements are bigger
than 2!
bin   0-1   1-2   2-3   3-4   4-5   5-6   6-7   7-8   8-9   9-10
N     3     18    11    8     11    14    8     7     11    9

[Figure: histogram of the 100 realizations over 0 <= x <= 10, with the peak (xmode) in the 1-2 bin]
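The same point as a short Python sketch, using the bin counts above:

    # Minimal sketch: the modal bin vs. where most of the probability actually lies.
    import numpy as np

    edges = np.arange(0, 11)                                  # bin edges 0, 1, ..., 10
    counts = np.array([3, 18, 11, 8, 11, 14, 8, 7, 11, 9])    # 100 realizations total

    modal_bin = np.argmax(counts)
    print('modal bin:', edges[modal_bin], '-', edges[modal_bin + 1])   # 1 - 2

    frac_above_2 = counts[2:].sum() / counts.sum()
    print('fraction of realizations above 2:', frac_above_2)          # 0.79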
Median
50% chance x is smaller than xmedian
50% chance x is bigger than xmedian
[Figure: p(x) vs. x split into two 50% areas at x = xmedian; there is no special reason the median needs to coincide with the peak]
Expected value or 'mean'
the value you would get if you took the mean of lots of realizations of x
Let's examine a discrete distribution, for simplicity ...
[Figure: bar chart of P(x) for x = 1, 2, 3]
Hypothetical table of 140 realizations of x

x    1     2     3     Total
N    20    80    40    140
mean = [ 20×1 + 80×2 + 40×3 ] / 140
     = (20/140)×1 + (80/140)×2 + (40/140)×3
     = P(1)×1 + P(2)×2 + P(3)×3
     = Σi P(xi) xi

by analogy, for a smooth distribution

Expected (or mean) value of x:
E(x) = ∫_{-∞}^{+∞} x p(x) dx
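The same calculation as a short Python sketch:

    # Minimal sketch: mean of a discrete distribution, two equivalent ways.
    import numpy as np

    x = np.array([1, 2, 3])
    N = np.array([20, 80, 40])                    # counts from the hypothetical table of 140 realizations

    mean_from_counts = (N * x).sum() / N.sum()    # [20*1 + 80*2 + 40*3] / 140
    P = N / N.sum()                               # empirical probabilities P(x)
    mean_from_P = (P * x).sum()                   # sum_i P(x_i) x_i

    print(mean_from_counts, mean_from_P)          # both ~2.14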
2) something that describes the variability of the x's
that is, the width of the distribution

Here's a perfectly sensible way to define the width of a distribution …
[Figure: p(x) with a central interval W50 containing 50% of the area, and 25% of the area in each tail]
… it's not used much, though
Width of a distribution
Here's another way …
[Figure: p(x) plotted together with the parabola [x-E(x)]², whose minimum sits at x = E(x)]
… multiply and integrate
The idea: if the distribution is narrow, then most of the probability lines up with the low spot of the parabola.
But if it is wide, then some of the probability lines up with the high parts of the parabola.
[Figure: the product [x-E(x)]² p(x); compute its total area]

Variance:  σ² = ∫_{-∞}^{+∞} [x - E(x)]² p(x) dx

σ = √variance is a measure of width …
[Figure: p(x) with an interval of half-width σ centered on E(x)]
… we don't immediately know its relationship to area, though
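A numerical sketch of this definition, again using a unit Gaussian as an assumed example density:

    # Minimal sketch: variance as the p(x)-weighted integral of the parabola [x - E(x)]^2.
    import numpy as np

    x = np.linspace(-10, 10, 100001)
    p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)    # example pdf: normal, mean 0, sigma 1
    dx = x[1] - x[0]

    Ex = (x * p * dx).sum()                          # expected value E(x)
    var = ((x - Ex)**2 * p * dx).sum()               # variance sigma^2
    print(Ex, var)                                   # ~0.0 and ~1.0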
the Gaussian or normal distribution

p(x) = [1 / (√(2π) σ)] exp{ -(x - x̄)² / (2σ²) }

where σ² is the variance and x̄ is the expected value
Memorize me!
Examples of Normal Distributions
[Figure: two normal curves p(x), one with x̄ = 1, σ = 1 and one with x̄ = 3, σ = 0.5]
Properties of the normal distribution
Expectation = Median = Mode = x̄
95% of the probability lies within 2σ of the expected value
[Figure: normal p(x) with the interval from x̄-2σ to x̄+2σ shaded; it contains 95% of the area]
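A short Python sketch that evaluates the formula above and checks the 2σ property numerically (the values x̄ = 3, σ = 0.5 come from the example distributions above):

    # Minimal sketch: evaluate the normal pdf and check the "95% within 2 sigma" property.
    import numpy as np

    def normal_pdf(x, xbar, sigma):
        # p(x) = [1 / (sqrt(2 pi) sigma)] exp{ -(x - xbar)^2 / (2 sigma^2) }
        return np.exp(-(x - xbar)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    xbar, sigma = 3.0, 0.5
    x = np.linspace(xbar - 2 * sigma, xbar + 2 * sigma, 100001)
    dx = x[1] - x[0]
    prob_within_2sigma = (normal_pdf(x, xbar, sigma) * dx).sum()
    print(prob_within_2sigma)                    # ~0.95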
Again, why all this is relevant …
Inference depends on data …
You use a measurement, d, to deduce the value of some underlying parameter of interest, m.
e.g. use measurements of travel time, d, to deduce the seismic velocity, m, of the earth
The model parameter, m, depends on the measurement, d, so m is a function of d, m(d)
so …
if the data, d, is a random variable, then so is the model parameter, m
All inferences made from uncertain data are themselves uncertain
Model parameters are described by a distribution, p(m)
Functions of a random variable
any function of a random variable is itself a random variable
Special case of a linear relationship and a normal distribution:
Normal p(d) with mean d̄ and variance σd²
Linear relationship m = a d + b
gives Normal p(m) with mean a d̄ + b and variance a² σd²
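A simulation sketch of this rule; the particular numbers d̄ = 5, σd = 2, a = 3, b = 1 are made up for illustration:

    # Minimal sketch: a linear function m = a*d + b of a normal random variable d
    # is normal with mean a*dbar + b and variance a^2 * sigma_d^2.
    import numpy as np

    rng = np.random.default_rng(0)
    dbar, sigma_d = 5.0, 2.0                        # assumed example values
    a, b = 3.0, 1.0

    d = rng.normal(dbar, sigma_d, size=1_000_000)   # many realizations of d
    m = a * d + b

    print(m.mean(), a * dbar + b)                   # both ~16
    print(m.var(), a**2 * sigma_d**2)               # both ~36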
multivariate distributions
Example
Liberty Island is inhabited by both pigeons and seagulls
40% of the birds are pigeons and 60% of the birds are gulls
50% of pigeons are white and 50% are tan
100% of gulls are white

Two variables
species s takes two values: pigeon p and gull g
color c takes two values: white w and tan t
Of 100 birds,
20 are white pigeons
20 are tan pigeons
60 are white gulls
0 are tan gulls
What is the probability that a bird has species s and color c?
(a random bird, that is)

P(s,c)     c=w     c=t
s=p        20%     20%
s=g        60%     0%

Note: the sum of all boxes is 100%
This is called the Joint Probability and is written P(s,c)
Two continuous variables, say x1 and x2, have a joint probability distribution, written p(x1, x2), with
∫∫ p(x1, x2) dx1 dx2 = 1
You would contour a joint probability distribution and it would look something like
[Figure: contour plot of p(x1, x2) in the (x1, x2) plane]
What is the probability that a bird has color c?
Start with P(s,c) and sum the columns to get P(c)

P(s,c)     c=w     c=t
s=p        20%     20%
s=g        60%     0%

P(c)       80%     20%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls)
What is the probability that a bird has species s?
Start with P(s,c) and sum the rows to get P(s)

P(s,c)     c=w     c=t     P(s)
s=p        20%     20%     40%
s=g        60%     0%      60%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls)
These operations make sense with distributions, too
[Figure: contour plot of p(x1, x2), with the marginal curves p(x1) and p(x2) drawn along the axes]
p(x1) = ∫ p(x1,x2) dx2    the distribution of x1 (irrespective of x2)
p(x2) = ∫ p(x1,x2) dx1    the distribution of x2 (irrespective of x1)
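The discrete version of these sums, as a short Python sketch using the bird table above:

    # Minimal sketch: marginal probabilities as sums over rows/columns of the joint table.
    import numpy as np

    # Joint probability P(s,c): rows are species (p, g), columns are color (w, t).
    P_sc = np.array([[0.20, 0.20],
                     [0.60, 0.00]])

    P_c = P_sc.sum(axis=0)    # sum over species -> P(c) = [0.80, 0.20]
    P_s = P_sc.sum(axis=1)    # sum over color   -> P(s) = [0.40, 0.60]
    print(P_c, P_s)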
Given that a bird is species s, what is the probability that it has color c?

P(c|s)     c=w      c=t
s=p        50%      50%
s=g        100%     0%

(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls)
Note: all rows sum to 100%
This is called the Conditional Probability of c given s and is written P(c|s)
similarly …
Given that a bird is color c, what is the probability that it has species s?

P(s|c)     c=w     c=t
s=p        25%     100%
s=g        75%     0%

So 25% of white birds are pigeons
(Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls)
Note: all columns sum to 100%
This is called the Conditional Probability of s given c and is written P(s|c)
Beware!

P(c|s)    c=w     c=t          P(s|c)    c=w     c=t
s=p       50%     50%          s=p       25%     100%
s=g       100%    0%           s=g       75%     0%
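A sketch of how both conditionals come from the same joint table (and why they differ), continuing the array from the previous sketch:

    # Minimal sketch: P(c|s) and P(s|c) from the joint table P(s,c); they are not the same.
    import numpy as np

    P_sc = np.array([[0.20, 0.20],     # rows: species (p, g); columns: color (w, t)
                     [0.60, 0.00]])

    P_s = P_sc.sum(axis=1)             # P(s)
    P_c = P_sc.sum(axis=0)             # P(c)

    P_c_given_s = P_sc / P_s[:, None]  # divide each row by P(s):    rows sum to 1
    P_s_given_c = P_sc / P_c[None, :]  # divide each column by P(c): columns sum to 1

    print(P_c_given_s)                 # [[0.50, 0.50], [1.00, 0.00]]
    print(P_s_given_c)                 # [[0.25, 1.00], [0.75, 0.00]]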
A lot of errors occur from confusing the two:

Probability that, if you have pancreatic cancer, you will die from it: 90%
Probability that, if you die, you will have died of pancreatic cancer: 1.4%

[Photo: actor Patrick Swayze, a pancreatic cancer victim]
note
P(s,c) = P(s|c) P(c)

P(s,c)   w     t          P(s|c)   w     t          P(c)   w     t
p        20%   20%   =    p        25%   100%   ×          80%   20%
g        60%   0%         g        75%   0%

e.g. 25% of 80% is 20%
and
P(s,c) = P(c|s) P(s)

P(s,c)   w     t          P(c|s)   w      t          P(s)
p        20%   20%   =    p        50%    50%    ×   p     40%
g        60%   0%         g        100%   0%         g     60%

e.g. 50% of 40% is 20%
Note that since
P(s,c) = P(s|c) P(c) = P(c|s) P(s)
then
P(s) = Σc P(s,c) = Σc P(s|c) P(c)
and
P(c) = Σs P(s,c) = Σs P(c|s) P(s)

Continuous versions:
p(s) = ∫ p(s,c) dc = ∫ p(s|c) p(c) dc
and
p(c) = ∫ p(s,c) ds = ∫ p(c|s) p(s) ds
Also, since
P(s,c) = P(s|c) P(c) = P(c|s) P(s)
then
P(s|c) = P(c|s) P(s) / P(c)
and
P(c|s) = P(s|c) P(c) / P(s)
… which is called Bayes Theorem
In this example
bird color is the observable, the "data", d
bird species is the "model parameter", m

P(c|s), "color given species", or P(d|m), is "making a prediction based on the model"
Given a pigeon, what's the probability that it's white?

P(s|c), "species given color", or P(m|d), is "making an inference from the data"
Given a white bird, what's the probability that it's a pigeon?
Bayes Theorem with data d and model m
P(m|d) = P(d|m) P(m) / P(d)
       = P(d|m) P(m) / Σi P(d|mi) P(mi)
Bayesian Inference: Interpret P(m) as our
knowledge of m before measuring d. Then
P(m|d) is our updated state of knowledge
after measuring d.
Example of Bayesian Inference
Scenario: A body of a man is brought to the morgue. The coroner wants to know, "did the man die of pancreatic cancer?" Thus there is one model parameter, m, which takes one of two values: Y (he died of pancreatic cancer) and N (he didn't). Before examining the body, the best estimate of P(m) that can be made is P(Y)=0.014 and P(N)=0.986, the rate of death by pancreatic cancer in the general population. Now the coroner performs a test for pancreatic cancer, giving one datum, d, and it's positive, + (as contrasted to negative, -). But the test is not perfect. It has a non-zero rate of both false positives and false negatives, as quantified by the conditional distribution:

P(d|m)    m=Y      m=N
d=+       0.995    0.005    (0.005 is the false-positive rate: didn't have cancer but tested +)
d=-       0.01     0.99     (0.01 is the false-negative rate: did have cancer but tested -)
P(Y|+) = P(+|Y) P(Y) / [ P(+|Y) P(Y) + P(+|N) P(N) ]
       = 0.995×0.014 / [ 0.995×0.014 + 0.005×0.986 ]
       = 0.74 or 74%

A 74% chance that the person died of pancreatic cancer is not all that conclusive!
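The same update as a short Python sketch:

    # Minimal sketch: Bayes Theorem, P(Y|+) = P(+|Y) P(Y) / [P(+|Y) P(Y) + P(+|N) P(N)].
    P_Y, P_N = 0.014, 0.986          # prior: rate of death by pancreatic cancer in the population
    P_pos_given_Y = 0.995            # probability of a positive test if he died of it
    P_pos_given_N = 0.005            # probability of a positive test if he didn't (false positive)

    P_Y_given_pos = (P_pos_given_Y * P_Y) / (P_pos_given_Y * P_Y + P_pos_given_N * P_N)
    print(P_Y_given_pos)             # ~0.74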
Why Bayes Theorem is important
It provides a framework for relating
making a prediction from the model, P(d|m)
to
making an inference from the data, P(m|d)
Bayes Theorem also implies that the joint distribution of
data and model parameters
p(d, m)
is the fundamental quantity
If you know p(d, m), you know everything there is to
know …
Expectation, Variance, and Covariance of a multivariate distribution

The expectation is computed by first reducing the distribution to one dimension
[Figure: contour plot of p(x1,x2); the marginal p(x1) is plotted along the x1 axis and its expectation gives x̄1; the marginal p(x2) is plotted along the x2 axis and its expectation gives x̄2]
The variance is also computed by first reducing the distribution to one dimension
[Figure: contour plot of p(x1,x2); the variance of the marginal p(x1) gives σ1², and the variance of the marginal p(x2) gives σ2²]
Note that in this distribution
if x1 is bigger than x̄1, then x2 is bigger than x̄2, and
if x1 is smaller than x̄1, then x2 is smaller than x̄2
This is a positive correlation
[Figure: contour plot of an elongated p(x1,x2) tilted so that x1 and x2 increase together, with the expected values (x̄1, x̄2) marked]

Conversely, in this distribution
if x1 is bigger than x̄1, then x2 is smaller than x̄2, and
if x1 is smaller than x̄1, then x2 is bigger than x̄2
This is a negative correlation
[Figure: contour plot of an elongated p(x1,x2) tilted the other way, with the expected values (x̄1, x̄2) marked]
This correlation can be quantified by multiplying the distribution by a four-quadrant function
[Figure: the function (x1 - x̄1)(x2 - x̄2), positive in two opposite quadrants and negative in the other two, centered on (x̄1, x̄2)]
and then integrating. The function (x1 - x̄1)(x2 - x̄2) works fine:

C = ∫∫ (x1 - x̄1)(x2 - x̄2) p(x1,x2) dx1 dx2

called the "covariance"
Note that the matrix C with elements

Cij = ∫∫ (xi - x̄i)(xj - x̄j) p(x1,x2) dx1 dx2

has diagonal elements σxi², the variance of xi, and off-diagonal elements cov(xi,xj), the covariance of xi and xj:

       σ1²           cov(x1,x2)    cov(x1,x3)
C =    cov(x1,x2)    σ2²           cov(x2,x3)
       cov(x1,x3)    cov(x2,x3)    σ3²
The "vector of means" x̄ and the "covariance matrix" Cx of a multivariate distribution summarize a lot – but not everything – about the distribution
Functions of a set of random variables, x
a set of N random variables arranged in a vector, x

Special Case: the linear function y = M x
the expectation of y is
ȳ = M x̄
Memorize!
the covariance of y is
Cy = M Cx Mᵀ
Memorize!
Note that these rules work regardless of the distribution of x:
if y is linearly related to x, y = M x
then
ȳ = M x̄          (rule for means)
Cy = M Cx Mᵀ     (rule for propagating error)
Memorize!
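A simulation sketch of these two rules; the particular M, x̄, and Cx below are made-up illustration values (the sampling uses a multivariate normal only for convenience, since the rules hold for any distribution):

    # Minimal sketch: verify ybar = M xbar and Cy = M Cx M^T by simulation.
    import numpy as np

    rng = np.random.default_rng(0)

    M = np.array([[1.0, 2.0],
                  [0.0, 3.0]])                     # assumed linear relationship y = M x
    xbar = np.array([1.0, -1.0])                   # assumed vector of means
    Cx = np.array([[2.0, 0.5],
                   [0.5, 1.0]])                    # assumed covariance matrix of x

    x = rng.multivariate_normal(xbar, Cx, size=1_000_000)    # realizations of x (one per row)
    y = x @ M.T                                               # y = M x for each realization

    print(y.mean(axis=0), M @ xbar)                # rule for means
    print(np.cov(y, rowvar=False), M @ Cx @ M.T)   # rule for propagating error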