High-Dimensional Unsupervised
Selection and Estimation of a Finite
Generalized Dirichlet Mixture model
Based on Minimum Message Length
by Nizar Bouguila and Djemel Ziou
Discussion led by Qi An
Duke University Machine Learning Group
Outline
• Introduction
• The generalized Dirichlet mixture
• The minimum message length (MML)
criterion
• Fisher information matrix and priors
• Density estimation and model selection
• Experimental results
• Conclusions
Introduction
• How to determine the number of components in
a mixture model for high-dimensional data?
– Stochastic and resampling (Slow)
• Implementation of model selection criteria
• Fully Bayesian way
– Deterministic (Fast)
• Approximate Bayesian criteria
• Information/coding theory concepts
– Minimum message length (MML)
– Akaike’s information criterion (AIC)
The generalized Dirichlet
distribution
• A d dimensional generalized Dirichlet distribution
is defined to be
d
where
X
i 1
i
 1 and 0  X i  1 , i  0 , i  0,  i  i  i 1  i 1
It can be reduced to the Dirichlet distribuiton when
i  i 1  i 1
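As a concrete reference, here is a minimal Python sketch of the log-density above; `gdd_logpdf` and its argument names are mine, not the paper's.

```python
import numpy as np
from scipy.special import gammaln

def gdd_logpdf(x, alpha, beta):
    """Log-density of a d-dimensional generalized Dirichlet distribution,
    evaluated at x with 0 < x_i < 1 and sum(x) < 1."""
    x, alpha, beta = np.asarray(x), np.asarray(alpha), np.asarray(beta)
    d = x.size
    # gamma_i = beta_i - alpha_{i+1} - beta_{i+1} for i < d, and gamma_d = beta_d - 1
    gamma = np.empty(d)
    gamma[:-1] = beta[:-1] - alpha[1:] - beta[1:]
    gamma[-1] = beta[-1] - 1.0
    # per-factor normalizing constants and the nested tails 1 - (x_1 + ... + x_i)
    log_norm = gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
    tail = 1.0 - np.cumsum(x)
    return np.sum(log_norm + (alpha - 1.0) * np.log(x) + gamma * np.log(tail))

# example: a point on the 3-dimensional simplex
print(gdd_logpdf([0.2, 0.3, 0.1], alpha=[2.0, 3.0, 1.5], beta=[5.0, 4.0, 2.0]))
```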
The generalized Dirichlet
distribution
For the generalized Dirichlet distribution, the moments have closed forms (the first of which is given below).

The GDD has a more general covariance structure than the DD, and it is conjugate to the multinomial distribution.
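The moment equations did not survive transcription; for reference, the first moment of the GDD (Connor and Mosimann, 1969) is

$$\mathbb{E}[X_i] \;=\; \frac{\alpha_i}{\alpha_i+\beta_i}\,\prod_{j=1}^{i-1} \frac{\beta_j}{\alpha_j+\beta_j}$$

Unlike the Dirichlet, whose covariances are always negative, the GDD's covariances need not all be negative, which is what "more general covariance structure" refers to.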
GDD vs. Gaussian
• The GDD has a smaller number of parameters to
estimate, so the estimation can be more accurate.
• The GDD is defined on the support [0,1] and can
be extended to a compact support [A,B], which is
more appropriate for the nature of the data.
Beta distribution:

$$f(u) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha-1}(1-u)^{\beta-1}$$

Beta type-II distribution:

$$f(v) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, v^{\alpha-1}(1+v)^{-(\alpha+\beta)}$$

They are equal if we set u = v/(1+v).
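A quick numerical check of this change of variables, using SciPy's standard beta and betaprime densities (a sketch; the variable names are mine):

```python
import numpy as np
from scipy.stats import beta, betaprime

a, b = 2.5, 4.0
u = np.linspace(0.05, 0.95, 10)
v = u / (1.0 - u)                     # inverse of u = v / (1 + v)
# change of variables: f_U(u) = f_V(v) * |dv/du|, with dv/du = 1/(1-u)^2
lhs = beta.pdf(u, a, b)
rhs = betaprime.pdf(v, a, b) / (1.0 - u) ** 2
print(np.allclose(lhs, rhs))          # True: the two families coincide
```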
A GDD mixture model
A generalized Dirichlet mixture model with M components is

$$p(X\mid\Theta) = \sum_{j=1}^{M} p_j\, p(X\mid\alpha_j), \qquad p_j \ge 0,\; \sum_{j=1}^{M} p_j = 1,$$

where p(X|α_j) takes the form of the GDD.
The MML criterion
• The message length is defined as minus
the logarithm of the posterior probability.
• After placing an explicit prior over the
parameters, the message length for a
mixture of distributions is given as

$$\mathrm{MessLen} \simeq \underbrace{-\log h(\Theta)}_{\text{prior}} \;\underbrace{-\,\log p(\mathcal{X}\mid\Theta)}_{\text{likelihood}} \;+\; \underbrace{\tfrac{1}{2}\log|F(\Theta)|}_{\text{Fisher information}} \;+\; \underbrace{\tfrac{N_p}{2}\big(1+\log \kappa_{N_p}\big)}_{\text{optimal quantization constant}}$$

where $N_p$ is the number of free parameters and $\kappa_{N_p}$ is the optimal quantization lattice constant.
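A minimal sketch of evaluating this expression, assuming the four ingredients have already been computed; `message_length` and the κ ≈ 1/12 approximation to the quantization constant are my assumptions, not the paper's exact implementation:

```python
import numpy as np

def message_length(log_prior, log_lik, log_det_fisher, n_params):
    """Wallace-Freeman (MML87) two-part message length.

    log_prior      : log h(Theta), the prior at the estimate
    log_lik        : log p(X | Theta), the data log-likelihood
    log_det_fisher : log |F(Theta)|, log-determinant of the Fisher matrix
    n_params       : Np, the number of free parameters
    """
    kappa = 1.0 / 12.0   # common 1-D approximation to the lattice constant
    return (-log_prior - log_lik + 0.5 * log_det_fisher
            + 0.5 * n_params * (1.0 + np.log(kappa)))
```

Model selection then amounts to fitting the mixture for each candidate M and keeping the M with the smallest message length.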
Fisher Information matrix
• The Fisher information matrix is the expected
value of the Hessian of the minus logarithm of the
likelihood:

$$F(\Theta) = -\,\mathbb{E}\left[\frac{\partial^2 \log p(\mathcal{X}\mid\Theta)}{\partial\Theta\,\partial\Theta^{T}}\right]$$

where the expectation is taken with respect to $p(\mathcal{X}\mid\Theta)$.
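As a concrete instance of this definition, the Fisher matrix of a single Beta(α, β) factor has a closed form in trigamma functions; this sketch is illustrative and not taken from the paper:

```python
import numpy as np
from scipy.special import polygamma

def beta_fisher(a, b):
    """Fisher information of a Beta(a, b) density: minus the expected Hessian
    of the log-likelihood, here in closed form via the trigamma function."""
    tri = lambda z: polygamma(1, z)   # trigamma = psi'
    off_diag = -tri(a + b)            # cross term d^2/(da db)
    return np.array([[tri(a) - tri(a + b), off_diag],
                     [off_diag, tri(b) - tri(a + b)]])

# log|F| is the quantity that enters the message length
print(np.log(np.linalg.det(beta_fisher(2.0, 3.0))))
```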
Prior distribution
• Assume independence between the different
components, so the prior factorizes as

$$h(\Theta) = h(P)\prod_{j=1}^{M} h(\alpha_j)$$

where $P = (p_1, \ldots, p_M)$ are the mixture weights and the $\alpha_j$ are the GDD parameters.

Place a Dirichlet distribution and a generalized Dirichlet distribution
on P and α, respectively, with parameters set to 1.
Message length
• After obtaining the Fisher information and specifying the
prior distribution, the message length can be expressed in
closed form by substituting both into the MML expression above.
Estimation and selection algorithm
• The authors use an EM algorithm to
estimate the mixture parameters (a minimal
sketch follows below).
• To overcome the computational cost and the
local-maxima problem, they implement a
fairly sophisticated initialization algorithm.
• The whole algorithm is summarized on the
next slide.
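A minimal sketch of the EM loop, reusing the hypothetical `gdd_logpdf` from earlier. The paper's M-step uses dedicated Newton-Raphson updates of the GDD parameters, which this sketch replaces with a generic numerical optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def em_gdd_mixture(X, alphas, betas, weights, n_iter=20):
    """EM for an M-component GDD mixture (sketch; assumes gdd_logpdf is defined).

    X       : (N, d) array of points on the simplex
    alphas  : (M, d) GDD alpha parameters, one row per component
    betas   : (M, d) GDD beta parameters
    weights : (M,) mixing proportions p_j
    """
    N, d = X.shape
    M = weights.size
    for _ in range(n_iter):
        # E-step: responsibilities r_ij proportional to p_j * p(x_i | alpha_j, beta_j)
        log_r = np.array([[np.log(weights[j]) + gdd_logpdf(X[i], alphas[j], betas[j])
                           for j in range(M)] for i in range(N)])
        log_r -= log_r.max(axis=1, keepdims=True)          # numerical stabilization
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form update of the weights ...
        weights = r.sum(axis=0) / N
        # ... and numerical update of each component's GDD parameters
        for j in range(M):
            def nll(theta):
                a, b = np.exp(theta[:d]), np.exp(theta[d:])  # log-parametrization keeps a, b > 0
                return -sum(r[i, j] * gdd_logpdf(X[i], a, b) for i in range(N))
            theta0 = np.log(np.concatenate([alphas[j], betas[j]]))
            res = minimize(nll, theta0, method="Nelder-Mead")
            alphas[j], betas[j] = np.exp(res.x[:d]), np.exp(res.x[d:])
    return alphas, betas, weights
```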
Experimental results
The correct numbers of mixture components are 5, 6, and 7, respectively
Experimental results
• Web mining:
– Training with multiple classes of labels
– Use the posterior probability $p(c\mid X) \propto p(c)\,p(X\mid c)$ to
predict the label of a testing sample (sketch below)
– Use the frequencies of the top 200 words as features
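A sketch of this Bayes-rule prediction step, assuming one fitted GDD mixture per class and reusing `gdd_logpdf` from earlier (all names here are mine):

```python
import numpy as np
from scipy.special import logsumexp

def mixture_loglik(x, alphas, betas, weights):
    """Log-density of a fitted GDD mixture at x."""
    return logsumexp([np.log(w) + gdd_logpdf(x, a, b)
                      for a, b, w in zip(alphas, betas, weights)])

def predict_label(x, class_models, class_priors):
    """Bayes rule: pick the class c maximizing log p(c) + log p(x | c).
    class_models is a list of (alphas, betas, weights) tuples, one per class."""
    scores = [np.log(pc) + mixture_loglik(x, *model)
              for model, pc in zip(class_models, class_priors)]
    return int(np.argmax(scores))
```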
Conclusions
• An MML-based criterion is proposed to select the
number of components in generalized Dirichlet
mixtures.
• Full dimensionality of the data is used.
• Generalized Dirichlet mixtures allow more
modeling flexibility than mixtures of Gaussians.
• The results clearly indicate that the MML and
LEC (Laplace empirical criterion) model selection
methods outperform the other methods.