Transcript Document

Bayesian Wrap-Up
(probably)
5 minutes of math...
• Marginal probabilities
• If you have a joint PDF:
• ... and want to know about the probability of
just one RV (regardless of what happens to the
others)
• Marginal PDF of H or W: $f(H) = \int f(H, W)\, dW$ or $f(W) = \int f(H, W)\, dH$
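A minimal sketch of what marginalization looks like numerically, for a discrete joint distribution (the grid of probabilities below is invented for illustration):

```python
import numpy as np

# Hypothetical discrete joint PMF over two RVs (rows: H values, cols: W values).
# The numbers are made up; they just need to sum to 1.
joint = np.array([[0.10, 0.05, 0.05],
                  [0.20, 0.25, 0.05],
                  [0.05, 0.15, 0.10]])

# Marginalize: sum out the variable you don't care about.
p_H = joint.sum(axis=1)   # marginal over H (sum over W)
p_W = joint.sum(axis=0)   # marginal over W (sum over H)

print(p_H, p_H.sum())     # each marginal still sums to 1
print(p_W, p_W.sum())
```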
5 minutes of math...
• Conditional probabilities
• Suppose you have a joint PDF, f(H,W)
• Now you get to see one of the values, e.g.,
H=“183cm”
• What’s your probability estimate of W, given
this new knowledge?
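In symbols, using the same f(H, W) as above, this is the standard definition of the conditional density:

$$ f(W \mid H = 183) = \frac{f(H = 183,\, W)}{f(H = 183)} $$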
5 minutes of math...
• From the cond. prob. rule, it’s 2 steps to Bayes’ rule (the two steps are sketched below):
• (Often helps algebraically to think of “given
that” operator, “|”, as a division operation)
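The two steps, written out with the same H/W example as the previous slides:

$$ f(W \mid H) = \frac{f(H, W)}{f(H)} \quad\text{and}\quad f(H, W) = f(H \mid W)\, f(W) $$
$$ \Rightarrow \quad f(W \mid H) = \frac{f(H \mid W)\, f(W)}{f(H)} $$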
Everything’s random...
• Basic Bayesian viewpoint:
• Treat (almost) everything as a random variable
• Data/independent var: X vector
• Class/dependent var: Y
• Parameters: Θ
• E.g., mean, variance, correlations,
multinomial params, etc.
• Use Bayes’ Rule to assess probabilities of
classes (see the sketch below)
• Allows us to say: “It is very unlikely that the
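A toy sketch of using Bayes’ rule to turn class-conditional likelihoods and class priors into class probabilities; the distributions and numbers here are invented for illustration:

```python
from scipy.stats import norm

# Hypothetical class priors p(Y) and Gaussian class-conditional likelihoods p(x | Y).
priors = {"cat": 0.7, "dog": 0.3}
likelihoods = {"cat": norm(4.0, 1.0),
               "dog": norm(7.0, 2.0)}

x = 5.5                                           # one observed data value

# Unnormalized posteriors p(Y | x) ∝ p(x | Y) p(Y)
unnorm = {y: likelihoods[y].pdf(x) * priors[y] for y in priors}
evidence = sum(unnorm.values())                   # p(x), the normalizer
posterior = {y: v / evidence for y, v in unnorm.items()}
print(posterior)
```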
Uncertainty over params
• Maximum likelihood treats parameters as
(unknown) constants
• Job is just to pick the constants so as to
maximize data likelihood
• Full-blown Bayesian modeling treats params as
random variables
• PDF over parameter variables tells us how
certain/uncertain we are about the location
of that parameter
• Also allows us to express prior beliefs
(probabilities) about params
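In symbols, the contrast between those two views:

$$ \hat{\Theta}_{\text{ML}} = \arg\max_{\Theta}\; p(X \mid \Theta) \qquad\text{vs.}\qquad p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)} $$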
Example: Coin flipping
• Have a “weighted” coin -- want to figure out
θ=Pr[heads]
• Maximum likelihood:
• Flip coin a bunch of times, measure #heads;
#tails
• Use estimator to return a single value for θ
• Bayesian (MAP):
• Start w/ distribution over what θ might be
• Flip coin a bunch of times, measure #heads; #tails
• Combine that prior with the likelihood of the observed flips (Bayes’ rule) to get a posterior over θ, and report the θ that maximizes it (sketched below)
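A minimal sketch of both estimators; the Beta prior on θ is an assumption for illustration (the slides don’t specify a particular prior):

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7
flips = rng.random(50) < true_theta          # simulate 50 flips of a weighted coin
heads, tails = flips.sum(), (~flips).sum()

# Maximum likelihood: a single constant, the empirical fraction of heads.
theta_ml = heads / (heads + tails)

# Bayesian MAP with a hypothetical Beta(a, b) prior over theta (a = b = 5,
# i.e. a prior belief that the coin is roughly fair). The posterior is
# Beta(a + heads, b + tails); its mode is the MAP estimate.
a, b = 5.0, 5.0
theta_map = (a + heads - 1) / (a + b + heads + tails - 2)

print(f"ML: {theta_ml:.3f}   MAP: {theta_map:.3f}")
```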
Example: Coin flipping
[Plots: the distribution over θ after 0, 1, 5, 10, 20, 50, and 100 flips total]
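A sketch that reproduces the spirit of that plot sequence; the Beta prior here is my own choice, since the slides don’t name a specific prior:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
true_theta = 0.75
n_flips = [0, 1, 5, 10, 20, 50, 100]
flips = rng.random(max(n_flips)) < true_theta

a0, b0 = 2.0, 2.0                      # hypothetical Beta prior over theta
thetas = np.linspace(0, 1, 201)
for n in n_flips:
    heads = int(flips[:n].sum())
    tails = n - heads
    post = beta(a0 + heads, b0 + tails)           # posterior after n flips
    mode = thetas[np.argmax(post.pdf(thetas))]    # roughly the MAP estimate
    print(f"{n:3d} flips: posterior Beta({a0+heads:.0f},{b0+tails:.0f}), mode ≈ {mode:.2f}")
```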
How does it work?
• Think of parameters as just another kind of
random variable
• Now your data distribution is $p(X \mid \Theta)$
• This is the generative distribution
• A.k.a. observation distribution, sensor
model, etc.
• What we want is some model of the parameter as
a function of the data, $p(\Theta \mid X)$
• Get there with Bayes’ rule: $p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)}$
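The same rule with each factor labeled, since the next few slides walk through the parts one at a time:

$$ \underbrace{p(\Theta \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid \Theta)}^{\text{generative dist.}}\;\; \overbrace{p(\Theta)}^{\text{parameter prior}}}{\underbrace{p(X)}_{\text{data prior (evidence)}}} $$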
What does that mean?
• Let’s look at the parts:
• Generative distribution
• Describes how data is generated by the
underlying process
• Usually easy to write down (well, easier than
the other parts, anyway)
• Same old PDF/PMF we’ve been working
with
• Can be used to “generate” new samples of
data that “look like” your training data
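A tiny illustration of “generating” new samples once a generative distribution has been chosen; the Gaussian model and the training numbers here are purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Suppose the generative distribution is p(x | Θ) = Normal(mu, sigma^2),
# with Θ = (mu, sigma) fit to some (hypothetical) training data.
train = np.array([4.8, 5.1, 5.4, 4.9, 5.2])
mu, sigma = train.mean(), train.std()

# "Generate" new samples that should look like the training data.
new_samples = rng.normal(mu, sigma, size=5)
print(new_samples)
```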
What does that mean?
• The parameter prior or a priori distribution:
• Allows you to say “this value of Θ is more
likely than that one is...”
• Allows you to express beliefs/assumptions/
preferences about the parameters of the
system
• Also takes over when the data is sparse
(small N)
• In the limit of large data, prior should “wash
out”, letting the data dominate the estimate
What does that mean?
• The data prior:
• Expresses the probability of seeing data set
X independent of any particular model
• Huh?
What does that mean?
• The data prior:
• Expresses the probability of seeing data set
X independent of any particular model
• Can get it from the joint data/parameter
model: $p(X) = \int p(X \mid \Theta)\, p(\Theta)\, d\Theta$
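A discrete sketch of that marginalization, with a coin-flip likelihood and a made-up grid prior over θ:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical data set: 7 heads out of 10 flips.
heads, n = 7, 10

# Discretized prior p(theta) on a grid (made up, slightly favoring fair coins).
thetas = np.linspace(0.01, 0.99, 99)
prior = np.exp(-0.5 * ((thetas - 0.5) / 0.2) ** 2)
prior /= prior.sum()

# p(X) = sum over theta of p(X | theta) p(theta) -- the "data prior" / evidence.
likelihood = binom.pmf(heads, n, thetas)
p_X = np.sum(likelihood * prior)
print(p_X)
```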
What does that mean?
• Finally, the posterior (or a posteriori)
distribution, $p(\Theta \mid X)$:
• Lit., “from what comes after” or “after the
fact” (Latin)
• Essentially, “What we believe about the
parameter after we look at the data”
• As compared to the “prior” or “a priori” (lit.,
“from what is before” or “before the fact”)
parameter distribution, $p(\Theta)$
Exercise
• Suppose you want to estimate the average
airspeed of an unladen (African) swallow
• Let’s say that airspeeds of individual swallows,
x, are Gaussianly distributed with mean μ and
variance 1: $x \sim N(\mu, 1)$
• Let’s say, also, that we think the mean μ is
“around” 50 kph, but we’re not sure exactly
what it is. Our uncertainty (variance) is 10: $\mu \sim N(50, 10)$
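One way this exercise could be worked numerically, a sketch using the standard Gaussian conjugate update (the airspeed observations below are invented):

```python
import numpy as np

# Model from the exercise: x_i ~ N(mu, 1), prior mu ~ N(50, 10).
prior_mean, prior_var = 50.0, 10.0
noise_var = 1.0

# Hypothetical observed airspeeds (kph), invented for illustration.
x = np.array([48.3, 51.2, 49.7, 50.9, 47.8])
n = len(x)

# Standard Gaussian-Gaussian conjugate update for the posterior over mu.
post_var = 1.0 / (1.0 / prior_var + n / noise_var)
post_mean = post_var * (prior_mean / prior_var + x.sum() / noise_var)

print(f"posterior over mu: N({post_mean:.2f}, {post_var:.3f})")
```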