Dirichlet Prior Sieves in Finite Normal Mixtures

Paper by: Hemant Ishwaran and Mahmoud Zarepour
Presented by: John Paisley
Dirichlet Distribution
Reparameterize: Dir(\alpha_1, ..., \alpha_k) = Dir(\alpha g_1, ..., \alpha g_k), where \alpha = \sum_i \alpha_i
Written this way, \alpha is a scalar (the concentration) and g = (g_1, ..., g_k) is a pmf
How do you draw from a Dirichlet
distribution?
• Two finite exact methods
– Using Gamma Distributions
– Using “exact” stick-breaking
• With \pi_1 = v_1, \pi_i = v_i \prod_{j<i}(1 - v_j), and \pi_k getting the rest of the stick
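As a sketch of the two finite exact methods above, assuming the Dir(\alpha g) parameterization from earlier (the function names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_via_gammas(alpha, g):
    """Draw pi ~ Dir(alpha * g) by normalizing independent Gamma variates."""
    x = rng.gamma(shape=alpha * np.asarray(g, dtype=float))
    return x / x.sum()

def dirichlet_via_sticks(alpha, g):
    """'Exact' finite stick-breaking for Dir(alpha * g):
    v_i ~ Beta(alpha*g_i, alpha * sum_{j>i} g_j),
    pi_i = v_i * prod_{j<i}(1 - v_j), and pi_k gets the rest of the stick."""
    g = np.asarray(g, dtype=float)
    k = len(g)
    pi = np.empty(k)
    stick = 1.0
    for i in range(k - 1):
        v = rng.beta(alpha * g[i], alpha * g[i + 1:].sum())
        pi[i] = v * stick
        stick *= 1.0 - v
    pi[-1] = stick  # the remainder of the stick
    return pi
```

Both return a length-k pmf; the two methods give draws from the same distribution.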
How do you draw from a Dirichlet
distribution? (2)
• Two converging methods
– Polya Urn
• Start with \alpha balls in an urn, colored according to g. Repeatedly draw a
ball with replacement and add another ball of the same color as the one
drawn back into the urn. As the number of draws approaches infinity, the
empirical color distribution in the urn converges to a draw from the DD
• Sethuraman’s Stick-Breaking
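A minimal simulation of the urn scheme just described, treating \alpha·g as the initial (possibly fractional) ball mass per color (a sketch; the helper name is my own):

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_urn(alpha, g, n_draws=50_000):
    """Approximate a Dir(alpha * g) draw via the Polya urn:
    start with mass alpha*g_i on color i; each draw adds one unit
    of mass to the color drawn. Returns the empirical pmf of the urn."""
    mass = alpha * np.asarray(g, dtype=float)
    for _ in range(n_draws):
        i = rng.choice(len(mass), p=mass / mass.sum())
        mass[i] += 1.0
    return mass / mass.sum()
```

The returned pmf only converges to a DD draw as `n_draws` grows, which is why this counts as a converging rather than exact method.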
Sethuraman’s Stick-Breaking
• To draw G ~ Dir(\alpha g), Sethuraman’s construction draws
v_i ~ Beta(1, \alpha) iid, sets \pi_i = v_i \prod_{j<i}(1 - v_j), draws
Y_i ~ g iid, and forms G = \sum_i \pi_i \delta_{Y_i}
• Sethuraman’s discovery allows one to draw from
a Dirichlet distribution by drawing the weights
and the components independently of one
another. The final \pi values themselves are not independent.
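Truncated at a finite number of breaks, the weight part of this construction can be sketched as follows (in Sethuraman's result the number of sticks is infinite; the truncation here is my own simplification):

```python
import numpy as np

rng = np.random.default_rng(2)

def sethuraman_sticks(alpha, n_sticks):
    """Truncated Sethuraman stick-breaking: v_i ~ Beta(1, alpha) iid,
    pi_i = v_i * prod_{j<i}(1 - v_j). Returns the weights and the
    leftover (truncation-error) stick mass."""
    v = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    pi = v * remaining
    leftover = remaining[-1] * (1.0 - v[-1])  # stick not yet allocated
    return pi, leftover
```

Pairing these weights with atoms Y_i drawn iid from g gives the (truncated) draw G = \sum_i \pi_i \delta_{Y_i}.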
Dirichlet Process
Movie: In the DP, we enforce that G_0(B_i) = 1/k. What does that look like as a
function of k?
I don’t have the math proof, but I claim that these regions converge to points, and a
uniform draw of a region (Dirichlet component) is the same as drawing a point
from G_0.
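One toy check of that claim (not a proof), assuming G_0 is uniform on [0, 1] so that equal-width bins give G_0(B_i) = 1/k:

```python
import numpy as np

rng = np.random.default_rng(3)

def uniform_region_draws(k, n):
    """Partition [0,1] into k equal-width bins B_i (so G_0(B_i) = 1/k for
    uniform G_0), pick a bin uniformly at random, and return its midpoint.
    As k grows, the bins shrink toward points and these midpoint draws
    approach direct draws from G_0."""
    bins = rng.integers(0, k, size=n)
    return (bins + 0.5) / k
```

For large k the empirical distribution of these draws is indistinguishable from Uniform[0, 1].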
Letting k go to infinity
• Using a uniform prior, the breaks
will be infinitely small, and since the probability of drawing
the same Y twice is zero with a uniform G_0(B_i), you are left
with G_0 in the limit.
• For \alpha less than infinity, you have the Dirichlet
process (obvious when looking at this written in the stick-breaking representation)
• Also, notice that the “exact” methods are now
impractical, but the two “infinite” methods can be used to
approximate the draw.
This Paper: How DD approximates
DP for mixture modeling
• After parameters are drawn, the mathematical forms are exactly the
same (with truncated DP)
• For DD, \pi comes from a DD
• For DP, \pi comes from stick-breaking
• \thetas are drawn iid from G_0 in each case
• The only question is: how does \pi differ between the two?
• Answer: under the DD prior we assume that draws of Y are never duplicated,
which is where we differ from the DP. However, if we set k large enough
and \alpha small enough for the DD, there can be a good probability
that most of the stick (defined by the truncation error) will be allocated
before duplicate copies of Y are drawn. When this happens, we have
coincidentally drawn from a DP. Increasing k and/or reducing \alpha
increases the probability of this happening.
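The two weight priors can be compared side by side; this sketch uses the symmetric choice g_i = 1/k, which gives the Dir(\alpha/k, ..., \alpha/k) sieve (function names are my own):

```python
import numpy as np

rng = np.random.default_rng(4)

def dd_weights(alpha, k):
    """Weights under the finite Dirichlet sieve: pi ~ Dir(alpha/k, ..., alpha/k)."""
    return rng.dirichlet(np.full(k, alpha / k))

def dp_weights(alpha, k):
    """Weights under a DP truncated at k sticks (the last stick takes
    whatever remains, so the weights sum to one)."""
    v = rng.beta(1.0, alpha, size=k)
    v[-1] = 1.0  # allocate the rest of the stick to the final component
    return v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

# With small alpha, both priors put most of the stick on a handful of components.
```

In both cases the \thetas are then drawn iid from G_0, as the slide notes.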
Summary
• DP is a sparseness promoting prior on the
weights with a mathematical relation to the
atoms that makes everything logical.
• The DD, when \alpha < k, is also a sparseness-promoting
prior on the weights, but there is no
strict mathematical relation to the components,
making it more “ad hoc” (but not really)
• In practice, however, they both function to obtain
the same exact ends.
Aside
• For the generation of a DD mixture model, if we were to also draw
the Y’s “just for fun,” then on the occasions when the Y’s are all
unique, that specific draw is exactly the same as a draw from a DP
(as I understand it).
• Using Sethuraman’s definition, and given a specific truncation error
for the stick, we can get an expected number of necessary breaks
(to meet that error) as a function of alpha, which is also the number
of Y’s we need to draw, call it N.
• We can then calculate the probability of drawing duplicate Y’s:
we are drawing N balls uniformly with replacement from the k components,
and we can find the probability that all N are unique.
Therefore, we can determine the probability that a draw from a DD
mixture model is also a draw from a DP mixture model.
• This paper says you can use a DD to approximate a DP. What this
value would do is give you a probability, as a function of \alpha and N,
that measures how close you are to a DP. Clearly when N is
greater than the number of components, that probability is zero.
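This calculation can be sketched directly (helper names are my own; the expected leftover stick after N breaks is (\alpha/(1+\alpha))^N, since E[1 - v_i] = \alpha/(1+\alpha) when v_i ~ Beta(1, \alpha)):

```python
import math

def breaks_needed(alpha, eps):
    """Smallest N such that the expected leftover stick
    (alpha/(1+alpha))**N falls below the truncation error eps."""
    return math.ceil(math.log(eps) / math.log(alpha / (1.0 + alpha)))

def prob_all_unique(k, n):
    """Probability that n iid uniform draws from k atoms are all distinct
    (the birthday problem); zero whenever n > k."""
    if n > k:
        return 0.0
    p = 1.0
    for i in range(n):
        p *= (k - i) / k
    return p

# P(a DD mixture draw is also a DP draw, up to truncation error eps)
# is then approximately prob_all_unique(k, breaks_needed(alpha, eps)).
```

Consistent with the last bullet, `prob_all_unique` returns exactly zero once N exceeds the number of components k.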