Lecture transcript
Flipping A Biased Coin
Suppose you have a coin with an unknown bias,
θ ≡ P(head).
You flip the coin multiple times and observe the
outcome.
From observations, you can infer the bias of the coin
Maximum Likelihood Estimate
Sequence of observations
HTTHTTTH
Maximum likelihood estimate?
θ = 3/8
What about this sequence?
TTTTTHHH
What assumption makes order unimportant?
Independent Identically Distributed (IID) draws
The Likelihood
P(Head | θ) = θ
P(Tail | θ) = 1 − θ
Independent events →
P(HHTHTT… | θ) = θ^{N_H} (1 − θ)^{N_T}
Related to binomial distribution:
P(N_H, N_T | θ) = (N_H + N_T choose N_H) θ^{N_H} (1 − θ)^{N_T}
N_H and N_T are sufficient statistics
How to compute max likelihood solution?
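A quick worked answer to the last question (standard derivation, added here; not on the original slide): maximize the log-likelihood
log P(d | θ) = N_H log θ + N_T log(1 − θ)
Setting the derivative to zero, N_H/θ − N_T/(1 − θ) = 0, gives
θ_ML = N_H / (N_H + N_T)
e.g., for HTTHTTTH, θ_ML = 3/8, matching the estimate quoted earlier.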
Bayesian Hypothesis Evaluation:
Two Alternatives
Two hypotheses
h0: θ=.5
h1: θ=.9
(h stands for hypothesis, not head!)
Role of priors diminishes as number of flips increases
Note the weirdness that each hypothesis has an associated probability, and each hypothesis specifies a probability: probabilities of probabilities!
Setting a prior to zero → narrowing the hypothesis space
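As a concrete illustration (added; not from the original slides), a minimal Python sketch of the two-hypothesis update under an assumed 50/50 prior:

# Posterior over two hypotheses about the coin's bias, given a flip sequence.
# Assumes a 50/50 prior over h0 (theta = 0.5) and h1 (theta = 0.9).
thetas = {"h0": 0.5, "h1": 0.9}
prior = {"h0": 0.5, "h1": 0.5}
flips = "HTTHTTTH"  # example sequence from the earlier slide

def likelihood(theta, flips):
    n_h = flips.count("H")
    n_t = flips.count("T")
    return theta ** n_h * (1 - theta) ** n_t

unnorm = {h: prior[h] * likelihood(thetas[h], flips) for h in thetas}
z = sum(unnorm.values())                        # normalization term P(data)
posterior = {h: unnorm[h] / z for h in unnorm}
print(posterior)  # h0 dominates: theta = 0.9 fits 3 heads in 8 flips poorly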
Bayesian Hypothesis Evaluation:
Many Alternatives
11 hypotheses
h0: θ=0.0
h1: θ=0.1
…
h10: θ=1.0
Uniform priors
P(hi) = 1/11
[Figure: bar charts of the posterior P(h_i | data) over the 11 models (x-axis: model θ, y-axis: probability, 0–0.25), one panel per update step: priors, trial 1: H, trial 2: T, trial 3: T, trial 4: H, trial 5: T, trial 6: T, trial 7: T]
MATLAB Code
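The MATLAB listing from the original slide is not reproduced in this transcript; below is a minimal Python sketch of what it presumably demonstrated: updating a uniform prior over the 11 candidate values of θ after each flip.

import numpy as np

# Discrete hypothesis space: theta in {0.0, 0.1, ..., 1.0}, uniform prior.
thetas = np.linspace(0.0, 1.0, 11)
posterior = np.full(11, 1.0 / 11)

for flip in "HTTHTTT":                       # flip sequence shown in the figure panels
    like = thetas if flip == "H" else 1.0 - thetas
    posterior = posterior * like             # Bayes rule: prior times likelihood ...
    posterior = posterior / posterior.sum()  # ... then renormalize
    print(flip, np.round(posterior, 3))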
Infinite Hypothesis Spaces
Consider all values of θ, 0 ≤ θ ≤ 1
Inferring θ is just like any other sort of Bayesian inference
Likelihood is as before: P(d | θ) = θ^{N_H} (1 − θ)^{N_T}
Normalization term: P(d) = ∫₀¹ P(d | θ) p(θ) dθ
With uniform priors on θ:
p(θ | d) = θ^{N_H} (1 − θ)^{N_T} / ∫₀¹ θ^{N_H} (1 − θ)^{N_T} dθ
[Figure: posterior density p(θ | data) over θ ∈ [0, 1] (x-axis: model θ, y-axis: density, 0–3), one panel per update step: priors, trial 1: H, trial 2: T, trial 3: T, trial 4: H, trial 5: T, trial 6: T, trial 7: T]
Infinite Hypothesis Spaces
With uniform priors on θ, the posterior
p(θ | d) = θ^{N_H} (1 − θ)^{N_T} / B(N_H + 1, N_T + 1)
is a beta distribution: Beta(N_H + 1, N_T + 1)
Beta Distribution
Beta(α, β) = (1 / B(α, β)) x^{α − 1} (1 − x)^{β − 1}
B(α, β) = Γ(α) Γ(β) / Γ(α + β)
If α and β are integers,
B(α, β) = (α − 1)! (β − 1)! / (α + β − 1)!
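A small added check (not on the original slide) that the density and the integer-factorial form of B(α, β) agree, using scipy; the parameter values are arbitrary:

from math import factorial
from scipy.special import beta as beta_fn    # B(alpha, beta)
from scipy.stats import beta as beta_dist

a, b = 3, 5   # example integer parameters

# Normalizer: Gamma-function form vs. factorial form for integer arguments.
print(beta_fn(a, b))
print(factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1))

# Density at a point, via scipy and via the formula on the slide.
x = 0.3
print(beta_dist.pdf(x, a, b))
print(x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b))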
Incorporating Priors
Suppose we have a Beta prior
p(θ) = Beta(V_H, V_T) = (1 / B(V_H, V_T)) θ^{V_H − 1} (1 − θ)^{V_T − 1}
Can compute posterior analytically
p(θ | d) ∝ θ^{N_H + V_H − 1} (1 − θ)^{N_T + V_T − 1}
p(θ | d) = [(N_H + V_H + N_T + V_T − 1)! / ((N_H + V_H − 1)! (N_T + V_T − 1)!)] θ^{N_H + V_H − 1} (1 − θ)^{N_T + V_T − 1}
p(θ | d) = Beta(N_H + V_H, N_T + V_T)
Posterior is also Beta distributed
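A quick numerical sanity check (added; the counts are arbitrary examples): gridding prior × likelihood and normalizing reproduces the Beta(N_H + V_H, N_T + V_T) density.

import numpy as np
from scipy.stats import beta

n_h, n_t = 3, 2   # observed counts (example values)
v_h, v_t = 2, 2   # imaginary counts defining the Beta prior (example values)

theta = np.linspace(0.0005, 0.9995, 1000)
d_theta = theta[1] - theta[0]
unnorm = beta.pdf(theta, v_h, v_t) * theta**n_h * (1 - theta)**n_t  # prior x likelihood
grid_post = unnorm / (unnorm.sum() * d_theta)                       # normalize numerically

exact_post = beta.pdf(theta, n_h + v_h, n_t + v_t)                  # conjugate result
print(np.max(np.abs(grid_post - exact_post)))  # small: numerical integration error only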
Imaginary Counts
p(θ) = Beta(V_H, V_T) = (1 / B(V_H, V_T)) θ^{V_H − 1} (1 − θ)^{V_T − 1}
V_H and V_T can be thought of as the outcome of coin flipping experiments, either in one's imagination or in past experience
Equivalent sample size = V_H + V_T
The larger the equivalent sample size, the more
confident we are about our prior beliefs…
And the more evidence we need to overcome priors.
Regularization
Suppose we flip the coin once and get a tail, i.e.,
N_T = 1, N_H = 0
What is the maximum likelihood estimate of θ?
What if we toss in imaginary counts, V_H = V_T = 1?
i.e., effective N_T = 2, N_H = 1
What if we toss in imaginary counts, V_H = V_T = 2?
i.e., effective N_T = 3, N_H = 2
Imaginary counts smooth estimates to avoid bias from small data sets
This is an issue in text processing: some words don't appear in the training corpus
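Working out the numbers on this slide (added arithmetic):
θ_ML = N_H / (N_H + N_T) = 0/1 = 0, i.e., the model claims heads are impossible after a single flip
With V_H = V_T = 1: estimate = (0 + 1) / (0 + 1 + 1 + 1) = 1/3
With V_H = V_T = 2: estimate = (0 + 2) / (0 + 2 + 1 + 2) = 2/5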
Prediction Using Posterior
Given some sequence of n coin flips (e.g., HTTHH),
what’s the probability of heads on the next flip?
expectation of a beta distribution:
E_{Beta(α, β)}[θ] = α / (α + β)
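Putting the pieces together for the example sequence HTTHH (added worked example, assuming a uniform Beta(1, 1) prior): N_H = 3, N_T = 2, so the posterior is Beta(3 + 1, 2 + 1) = Beta(4, 3), and
P(head on next flip) = E[θ | d] = 4 / (4 + 3) = 4/7 ≈ 0.57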
Summary So Far
Beta prior on θ
p(θ) = Beta(V_H, V_T)
Binomial likelihood for observations
Beta posterior on θ
p(θ | d) = Beta(N_H + V_H, N_T + V_T)
Conjugate priors
The Beta distribution is the conjugate prior of a binomial or
Bernoulli distribution
Conjugate Mixtures
If a distribution Q is a conjugate prior for likelihood R,
then so is a distribution that is a mixture of Q’s.
E.g., mixture of Betas
p(θ) = 0.5 Beta(θ | 20, 20) + 0.5 Beta(θ | 30, 10)
After observing 20 heads and 10 tails:
p(θ | D) = 0.346 Beta(θ | 40, 30) + 0.654 Beta(θ | 50, 20)
Example from Murphy (Fig 5.10)
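A short Python sketch (added; not from the slides) of how the new mixture weights arise, via each component's marginal likelihood of the observed counts:

import numpy as np
from scipy.special import betaln   # log B(a, b), for numerical stability

n_h, n_t = 20, 10                      # observed heads and tails
components = [(20, 20), (30, 10)]      # Beta parameters of the prior mixture
prior_w = np.array([0.5, 0.5])

# Marginal likelihood of the counts under component k: B(a + N_H, b + N_T) / B(a, b)
log_ml = np.array([betaln(a + n_h, b + n_t) - betaln(a, b) for a, b in components])
post_w = prior_w * np.exp(log_ml - log_ml.max())
post_w /= post_w.sum()
print(post_w)   # approximately [0.346, 0.654], matching the slide (Murphy, Fig 5.10)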
Dirichlet-Multinomial Model
We’ve been talking about the Beta-Binomial model
Observations are binary, 1-of-2 possibilities
What if observations are 1-of-K possibilities?
K-sided dice
K English words
K nationalities
Multinomial RV
Variable X with values x1, x2, … xK
P(X = x_k) = θ_k
Σ_{k=1}^{K} θ_k = 1
Likelihood, given N_k observations of x_k:
P(d | θ) ∝ Π_{k=1}^{K} θ_k^{N_k}
Analogous to binomial draw
θ specifies a probability mass function (pmf)
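As with the coin (added note, standard result), maximizing this likelihood subject to Σ_k θ_k = 1 gives θ_ML,k = N_k / Σ_j N_j; e.g., for a 3-sided die with counts (2, 5, 3), the estimates are (0.2, 0.5, 0.3).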
Dirichlet Distribution
The conjugate prior of a multinomial likelihood
p(θ | α) = (1 / B(α)) Π_{k=1}^{K} θ_k^{α_k − 1} for θ in the K-dimensional probability simplex,
0 otherwise
where B(α) = Π_k Γ(α_k) / Γ(Σ_k α_k)
Dirichlet is a distribution over probability mass
functions (pmfs)
Compare {α_k} to V_H and V_T
From Frigyik, Kapila, & Gupta (2010)
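A minimal Python sketch (added; the counts and hyperparameters are arbitrary examples) of the Dirichlet-multinomial update and the resulting predictive probabilities:

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet hyperparameters (K-dimensional analogue of V_H, V_T)
counts = np.array([2, 5, 3])        # N_k: observed counts for each of the K outcomes

post_alpha = alpha + counts                  # conjugacy: posterior is Dirichlet(alpha + N)
predictive = post_alpha / post_alpha.sum()   # P(next observation = x_k)
print(predictive)                            # smoothed version of the raw frequencies 0.2, 0.5, 0.3

theta_sample = np.random.dirichlet(post_alpha)   # draw one pmf from the posterior
print(theta_sample, theta_sample.sum())          # components sum to 1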
Hierarchical Bayes
Consider a generative model for the multinomial
One of K alternatives is chosen by drawing alternative k with probability θ_k
But when we have uncertainty in the {θ_k}, we must draw a pmf θ from a Dirichlet with parameters {α_k}
{α_k}: hyperparameters
{θ_k}: parameters of the multinomial
Hierarchical Bayes
Whenever you have a parameter you don’t know,
instead of arbitrarily picking a value for that
parameter, pick a distribution.
Weaker assumption than selecting a parameter value.
Requires hyperparameters (hyper^n-parameters), but results are typically less sensitive to hyper^n-parameters than to hyper^(n−1)-parameters
Example Of Hierarchical Bayes:
Modeling Student Performance
Collect data from S students on performance on N test
items.
There is variability from student-to-student and from
item-to-item
[Figure: student distribution and item distribution]
Item-Response Theory
Parameters for
Student ability
Item difficulty
P(correct) = logistic(Ability_s − Difficulty_i)
Need different ability parameters for each student,
difficulty parameters for each item
But can we benefit from the fact that students in the
population share some characteristics, and likewise for
items?
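A minimal sketch (added; the names and numbers are illustrative, not from the lecture) of the item-response prediction:

import numpy as np

def p_correct(ability, difficulty):
    # Probability a student answers an item correctly under the logistic IRT model.
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

abilities = np.array([-1.0, 0.0, 1.5])          # one ability parameter per student (illustrative)
difficulties = np.array([-0.5, 0.0, 0.5, 2.0])  # one difficulty parameter per item (illustrative)

# S x N matrix of predicted probabilities of a correct response.
probs = p_correct(abilities[:, None], difficulties[None, :])
print(np.round(probs, 2))

Hierarchical Bayes would then place shared population-level priors on the ability and difficulty parameters, which is what the closing question points toward.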