Applying Finite Mixture Models



Applying Finite Mixture Models
Presenter: Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
Topics
• Introduction
• Application of EM algorithm
• Examples of normal mixtures
• Robust mixture modeling
• Number of components in a mixture model
• Mixtures of nonnormal components
• Mixture models for failure-time data
• Mixture software
1.1 Flexible Method of Modeling
• Astronomy
• Genetics
• Biology
• Marketing
• Economics
• Medicine
• Engineering
• Psychiatry
1.2 Initial Approach to Mixture Analysis
• Classic paper of Pearson (1894)
Figure 1: Plot of forehead to body length data
on 1000 crabs and of the fitted one-component
(dashed line) and two-component (solid line)
normal mixture models.
1.3 Basic Definition
We let Y_1, …, Y_n denote a random sample of size n, where Y_j is a p-dimensional random vector with probability density function f(y_j),

f(y_j) = \sum_{i=1}^{g} π_i f_i(y_j),   (1)

where the f_i(y_j) are densities and the π_i are nonnegative quantities that sum to one.
1.4 Interpretation of Mixture Models
An obvious way of generating a random vector Y_j with the g-component mixture density f(Y_j), given by (1), is as follows.
Let Z_j be a categorical random variable taking on the values 1,…,g with probabilities π_1, …, π_g, respectively, and suppose that the conditional density of Y_j given Z_j = i is f_i(y_j) (i=1,…,g). Then the unconditional density of Y_j (that is, its marginal density) is given by f(y_j).
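As a quick illustration of this two-stage construction, here is a minimal Python/NumPy sketch; the function name is ours, and the illustrative parameters are borrowed from Example 2.1 later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pis, component_samplers):
    """Generate Y_1,...,Y_n from a g-component mixture by first drawing the
    categorical label Z_j with probabilities pi_1,...,pi_g and then drawing
    Y_j | Z_j = i from the ith component density f_i."""
    z = rng.choice(len(pis), size=n, p=pis)              # latent labels Z_j
    y = np.array([component_samplers[i]() for i in z])   # Y_j given Z_j = i
    return y, z

# Two univariate normal components in proportions 0.8 and 0.2
samplers = [lambda: rng.normal(0.0, 1.0), lambda: rng.normal(2.0, 1.0)]
y, z = sample_mixture(50, [0.8, 0.2], samplers)
```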
1.5 Shapes of Some Univariate Normal Mixtures
Consider

f(y_j) = π_1 φ(y_j; μ_1, σ²) + π_2 φ(y_j; μ_2, σ²),   (5)

where

φ(y_j; μ, σ²) = (2π)^{-1/2} σ^{-1} exp{-(y_j - μ)²/(2σ²)}   (6)

denotes the univariate normal density with mean μ and variance σ².
Figure 2: Plot of a mixture density of two univariate normal components in equal proportions with common variance σ² = 1 (panels: Δ = 1, 2, 3, 4).
Figure 3: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance (panels: Δ = 1, 2, 3, 4).
1.6 Parametric Formulation of Mixture Model
In many applications, the component densities f_i(y_j) are specified to belong to some parametric family. In this case, the component densities f_i(y_j) are specified as f_i(y_j; θ_i), where θ_i is the vector of unknown parameters in the postulated form for the ith component density in the mixture. The mixture density f(y_j) can then be written as
1.6 cont.

f(y_j; Ψ) = \sum_{i=1}^{g} π_i f_i(y_j; θ_i),   (7)

where the vector Ψ containing all the parameters in the mixture model can be written as

Ψ = (π_1, …, π_{g-1}, ξ^T)^T,   (8)

where ξ is the vector containing all the parameters in θ_1, …, θ_g known a priori to be distinct.
In practice, the components are often taken to belong to the normal family, leading to normal mixtures. In the case of multivariate normal components, we have that

f_i(y_j; θ_i) = φ(y_j; μ_i, Σ_i),

where

φ(y_j; μ_i, Σ_i) = (2π)^{-p/2} |Σ_i|^{-1/2} exp{-½ (y_j - μ_i)^T Σ_i^{-1} (y_j - μ_i)}   (9)

denotes the multivariate normal density with mean (vector) μ_i and covariance matrix Σ_i (i=1,…,g).
In this case, the vector Ψ of unknown parameters is given by

Ψ = (π_1, …, π_{g-1}, ξ^T)^T,

where ξ consists of the elements of the component means μ_1, …, μ_g and the distinct elements of the component-covariance matrices Σ_1, …, Σ_g.
In the case of normal homoscedastic components, where the component-covariance matrices Σ_i are restricted to being equal,

Σ_i = Σ   (i=1,…,g),   (10)

ξ consists of the elements of the component means μ_1, …, μ_g and the distinct elements of the common component-covariance matrix Σ.
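A minimal sketch of evaluating the normal mixture density (7) with components (9), using SciPy's multivariate normal density; the function name and arguments are our own illustration, not part of any mixture package.

```python
import numpy as np
from scipy.stats import multivariate_normal

def normal_mixture_density(y, pis, mus, sigmas):
    """Evaluate f(y_j; Psi) = sum_i pi_i phi(y_j; mu_i, Sigma_i) at each
    row of y for given mixing proportions, means, and covariance matrices."""
    y = np.atleast_2d(y)
    return sum(pi * multivariate_normal.pdf(y, mean=mu, cov=sig)
               for pi, mu, sig in zip(pis, mus, sigmas))
```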
1.7 Identifiability of Mixture Distributions
In general, a parametric family of densities f(y_j; Ψ) is identifiable if distinct values of the parameter Ψ determine distinct members of the family of densities

{f(y_j; Ψ) : Ψ ∈ Ω},

where Ω is the specified parameter space; that is,

f(y_j; Ψ) = f(y_j; Ψ*)   (11)

1.7 cont.
if and only if

Ψ = Ψ*.   (12)

Identifiability for mixture distributions is defined slightly differently. To see why this is necessary, suppose that f(y_j; Ψ) has two component densities, say, f_i(y; θ_i) and f_h(y; θ_h), that belong to the same parametric family. Then (11) will still hold when the component labels i and h are interchanged in Ψ.
1.8 Estimation of Mixture Distributions
• In the 1960s, the fitting of finite mixture models
by maximum likelihood had been studied in a
number of papers, including the seminal papers
by Day (1969) and Wolfe (1965, 1967, 1970).
• However, it was the publication of the seminal
paper of Dempster, Laird, and Rubin (1977) on
the EM algorithm that greatly stimulated interest
in the use of finite mixture distributions to model
heterogeneous data.
1.8 Cont.
This is because the fitting of mixture models by maximum likelihood is a classic example of a problem that is simplified considerably by the EM algorithm's conceptual unification of maximum likelihood (ML) estimation from data that can be viewed as being incomplete.
1.9 Mixture Likelihood Approach to Clustering
Suppose that the purpose of fitting the finite
mixture model (7) is to cluster an observed
random sample y1,…,yn into g components.
This problem can be viewed as wishing to
infer the associated component labels z1,…,zn
of these feature data vectors. That is, we wish
to infer the zj on the basis of the feature data
yj.
1.9 Cont.
After we fit the g-component mixture model to obtain the estimate Ψ̂ of the vector of unknown parameters in the mixture model, we can give a probabilistic clustering of the n feature observations y_1,…,y_n in terms of their fitted posterior probabilities of component membership. For each y_j, the g probabilities τ_1(y_j; Ψ̂), …, τ_g(y_j; Ψ̂) give the estimated posterior probabilities that this observation belongs to the first, second, …, and gth component, respectively, of the mixture (j=1,…,n).
1.9 Cont.
We can give an outright or hard clustering of these data by assigning each y_j to the component of the mixture to which it has the highest posterior probability of belonging. That is, we estimate the component-label vector z_j by ẑ_j, where ẑ_ij = (ẑ_j)_i is defined by

ẑ_ij = 1 if i = arg max_h τ_h(y_j; Ψ̂), and ẑ_ij = 0 otherwise,   (14)

for i=1,…,g; j=1,…,n.
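Both the probabilistic and the outright clusterings follow directly from the fitted model; a minimal sketch (our own helper names) applies Bayes' theorem to get the posterior probabilities and then the arg max rule (14).

```python
import numpy as np

def posterior_probs(y, pis, component_pdfs):
    """tau[j, i]: fitted posterior probability that observation y_j arose
    from component i, i.e. pi_i f_i(y_j) / f(y_j)."""
    dens = np.column_stack([pdf(y) for pdf in component_pdfs])   # f_i(y_j)
    weighted = dens * np.asarray(pis)
    return weighted / weighted.sum(axis=1, keepdims=True)

def hard_clustering(tau):
    """Outright clustering (14): assign y_j to the component with the
    highest fitted posterior probability."""
    return np.argmax(tau, axis=1)
```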
1.10 Testing for the Number of Components
In some applications of mixture models,
there is sufficient a priori information for
the number of components g in the mixture
model to be specified with no uncertainty.
For example, this would be the case where
the components correspond to externally
existing groups in which the feature vector
is known to be normally distributed.
1.10 Cont.
However, on many occasions, the number
of components has to be inferred from the
data, along with the parameters in the
component densities. If, say, a mixture
model is being used to describe the
distribution of some data, the number of
components in the final version of the
model may be of interest beyond matters of
a technical or computational nature.
2. Application of EM algorithm
2.1 Estimation of Mixing Proportions
Suppose that the density of the random vector Y_j has the g-component mixture form

f(y_j; Ψ) = \sum_{i=1}^{g} π_i f_i(y_j),   (15)

where Ψ = (π_1, …, π_{g-1})^T is the vector containing the unknown parameters, namely the g-1 mixing proportions π_1, …, π_{g-1}, since

π_g = 1 - \sum_{i=1}^{g-1} π_i.
2.1 cont.
In order to pose this problem as an incomplete-data one, we now introduce as the unobservable or missing data the vector

z = (z_1^T, …, z_n^T)^T,   (18)

where z_j is the g-dimensional vector of zero-one indicator variables as defined above. If these z_ij were observable, then the MLE of π_i is simply given by

\sum_{j=1}^{n} z_ij / n   (i=1,…,g).   (19)
2.1 Cont.
The EM algorithm handles the addition of the unobservable data to the problem by working with Q(Ψ; Ψ^(k)), which is the current conditional expectation of the complete-data log likelihood given the observed data. On defining the complete-data vector x as

x = (y^T, z^T)^T,   (20)
2.1 Cont.
the complete-data log likelihood for Ψ has the multinomial form

log L_c(Ψ) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_ij log π_i + C,   (21)

where

C = \sum_{i=1}^{g} \sum_{j=1}^{n} z_ij log f_i(y_j)

does not depend on Ψ.
2.1 Cont.
As (21) is linear in the unobservable data z_ij, the E-step (on the (k+1)th iteration) simply requires the calculation of the current conditional expectation of Z_ij given the observed data y, where Z_ij is the random variable corresponding to z_ij. Now

E_{Ψ^(k)}(Z_ij | y) = pr_{Ψ^(k)}{Z_ij = 1 | y} = τ_ij^(k),   (22)
2.1 Cont.
where, by Bayes' theorem,

τ_ij^(k) = τ_i(y_j; Ψ^(k)) = π_i^(k) f_i(y_j) / f(y_j; Ψ^(k))   (23)

for i=1,…,g; j=1,…,n. The quantity τ_i(y_j; Ψ^(k)) is the posterior probability that the jth member of the sample with observed value y_j belongs to the ith component of the mixture.
2.1 Cont.
The M-step on the (k+1)th iteration simply requires replacing each z_ij by τ_ij^(k) in (19) to give

π_i^(k+1) = \sum_{j=1}^{n} τ_ij^(k) / n   (24)

for i=1,…,g.
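Putting the E-step (23) and the M-step (24) together for the case of known component densities gives the very short EM loop sketched below; the function name and the stopping rule on the log likelihood are our own assumptions, not part of the slides.

```python
import numpy as np

def em_mixing_proportions(y, component_pdfs, pis_init, max_iter=500, tol=1e-8):
    """EM for the mixing proportions only: the component densities f_i are
    taken as known, so each iteration is the E-step (23) then the M-step (24)."""
    pis = np.asarray(pis_init, dtype=float)
    dens = np.column_stack([pdf(y) for pdf in component_pdfs])   # f_i(y_j), fixed
    old_ll = -np.inf
    for _ in range(max_iter):
        weighted = dens * pis                     # pi_i^(k) f_i(y_j)
        mix = weighted.sum(axis=1)                # f(y_j; Psi^(k))
        tau = weighted / mix[:, None]             # E-step, (23)
        pis = tau.mean(axis=0)                    # M-step, (24)
        ll = np.log(mix).sum()                    # log L(Psi^(k))
        if ll - old_ll < tol:
            break
        old_ll = ll
    return pis, ll
```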
2.2 Example 2.1:
Synthetic Data Set 1
We generated a random sample of n=50 observations y_1,…,y_n from a mixture of two univariate normal densities with means μ_1 = 0 and μ_2 = 2 and common variance σ² = 1, in proportions π_1 = 0.8 and π_2 = 0.2.
Table 1: Results of EM Algorithm for Example on Estimation of Mixing Proportions

Iteration k    π_1^(k)     log L(Ψ^(k))
0              0.50000     -91.87811
1              0.68421     -85.55353
2              0.70304     -85.09035
3              0.71792     -84.81398
4              0.72885     -84.68609
5              0.73665     -84.63291
6              0.74218     -84.60978
7              0.74615     -84.58562
...            ...         ...
27             0.75743     -84.58562
2.3 Univariate Normal Component Densities
The normal mixture model to be fitted is thus

f(y_j; Ψ) = \sum_{i=1}^{g} π_i f_i(y_j; θ_i),   (28)

where

f_i(y_j; θ_i) = φ(y_j; μ_i, σ²) = (2πσ²)^{-1/2} exp{-(y_j - μ_i)²/(2σ²)}.
2.3 Cont.
The complete-data log likelihood function for Ψ is given by (21), but where now

C = \sum_{i=1}^{g} \sum_{j=1}^{n} z_ij log f_i(y_j; θ_i)
  = -½ n log(2π) - ½ \sum_{i=1}^{g} \sum_{j=1}^{n} z_ij {log σ² + (y_j - μ_i)²/σ²}.
2.3 Cont.
The E-step is the same as before, requiring the calculation of (23).
The M-step now requires the computation of not only (24), but also the values μ_1^(k+1), …, μ_g^(k+1) and σ²^(k+1) that, along with π_1^(k+1), …, π_{g-1}^(k+1), maximize Q(Ψ; Ψ^(k)).
2.3 Cont.
Now

\sum_{j=1}^{n} z_ij y_j / \sum_{j=1}^{n} z_ij   (29)

and

\sum_{i=1}^{g} \sum_{j=1}^{n} z_ij (y_j - μ_i)² / n   (30)

are the MLEs of μ_i and σ², respectively, if the z_ij were observable.
2.3 Cont.
As log L_c(Ψ) is linear in the z_ij, it follows that the z_ij in (29) and (30) are replaced by their current conditional expectations τ_ij^(k), which here are the current estimates τ_i(y_j; Ψ^(k)) of the posterior probabilities of membership of the components of the mixture, given by

τ_i(y_j; Ψ^(k)) = π_i^(k) f_i(y_j; θ_i^(k)) / f(y_j; Ψ^(k)).
2.3 Cont.
This yields

μ_i^(k+1) = \sum_{j=1}^{n} τ_ij^(k) y_j / \sum_{j=1}^{n} τ_ij^(k)   (i=1,…,g)   (31)

and

σ²^(k+1) = \sum_{i=1}^{g} \sum_{j=1}^{n} τ_ij^(k) (y_j - μ_i^(k+1))² / n,   (32)

and π_i^(k+1) is given by (24).
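The complete EM cycle (23), (24), (31), (32) for univariate normal components with a common variance can be sketched as follows; the code is our own illustration, and the starting values must be supplied by the user (see Section 2.5).

```python
import numpy as np
from scipy.stats import norm

def em_univariate_normal(y, pis, mus, sigma2, max_iter=200):
    """EM for a univariate normal mixture with a common variance sigma2,
    alternating the E-step (23) with the M-step updates (24), (31), (32)."""
    y = np.asarray(y, dtype=float)
    pis = np.asarray(pis, dtype=float)
    mus = np.asarray(mus, dtype=float)
    n = len(y)
    for _ in range(max_iter):
        # E-step: posterior probabilities tau[j, i]
        dens = norm.pdf(y[:, None], loc=mus, scale=np.sqrt(sigma2))
        weighted = dens * pis
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step
        n_i = tau.sum(axis=0)
        pis = n_i / n                                        # (24)
        mus = (tau * y[:, None]).sum(axis=0) / n_i           # (31)
        sigma2 = (tau * (y[:, None] - mus) ** 2).sum() / n   # (32)
    return pis, mus, sigma2
```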
2.4 Multivariate Component Densities
For multivariate normal components with a common covariance matrix Σ, the corresponding M-step updates are

μ_i^(k+1) = \sum_{j=1}^{n} τ_ij^(k) y_j / \sum_{j=1}^{n} τ_ij^(k)   (i=1,…,g)   (34)

and

Σ^(k+1) = \sum_{i=1}^{g} \sum_{j=1}^{n} τ_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / n.   (35)
2.4 Cont.
In the case of normal components with arbitrary covariance matrices, equation (35) is replaced by

Σ_i^(k+1) = \sum_{j=1}^{n} τ_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / \sum_{j=1}^{n} τ_ij^(k)   (i=1,…,g).   (36)
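A sketch of the corresponding M-step for multivariate normal components with unrestricted covariance matrices, i.e. updates (24), (34), and (36), given the posterior probabilities from the E-step; the function and argument names are ours.

```python
import numpy as np

def m_step_multivariate(y, tau):
    """One M-step for heteroscedastic multivariate normal components:
    tau[j, i] are the current posterior probabilities tau_ij^(k)."""
    n, g = tau.shape
    n_i = tau.sum(axis=0)                         # effective component sizes
    pis = n_i / n                                 # (24)
    mus = (tau.T @ y) / n_i[:, None]              # (34)
    sigmas = []
    for i in range(g):
        d = y - mus[i]                            # deviations from updated mean
        sigmas.append((tau[:, i, None] * d).T @ d / n_i[i])   # (36)
    return pis, mus, np.array(sigmas)
```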
2.5 Starting Values for EM Algorithm
The EM algorithm is started from some initial value Ψ^(0) of Ψ. Hence in practice we have to specify a value for Ψ^(0).
An alternative approach is to perform the first E-step by specifying a value τ_j^(0) for τ(y_j; Ψ) for each j (j=1,…,n), where

τ(y_j; Ψ) = (τ_1(y_j; Ψ), …, τ_g(y_j; Ψ))^T

is the vector containing the g posterior probabilities of component membership for y_j.
2.5 Cont.
The latter is usually undertaken by setting τ_j^(0) = z_j^(0) for j=1,…,n, where

z^(0) = (z_1^(0)T, …, z_n^(0)T)^T

defines an initial partition of the data into g groups. For example, an ad hoc way of initially partitioning the data in the case of, say, a mixture of g=2 normal components with the same covariance matrices, would be to plot the data for selections of two of the p variables, and then draw a line that divides the bivariate data into two groups that have a scatter that appears normal.
2.5 Cont.
For higher dimensional data, an initial value z(0)
for z might be obtained through the use of some
clustering algorithm, such as k-means or, say, an
hierarchical procedure if n is not too large.
Another way of specifying an initial partition z(0)
of the data is to randomly divide the data into g
groups corresponding to the g components of the
mixture model.
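Either kind of initial partition can be converted into starting values τ_j^(0) = z_j^(0) for the first M-step; the sketch below uses scikit-learn's k-means as one convenient (assumed) choice alongside a purely random split, with our own helper names.

```python
import numpy as np
from sklearn.cluster import KMeans

def tau_from_partition(labels, g):
    """Convert an initial partition z^(0) into zero-one starting values
    tau_j^(0) = z_j^(0) for the first M-step."""
    tau0 = np.zeros((len(labels), g))
    tau0[np.arange(len(labels)), labels] = 1.0
    return tau0

def initial_partitions(y, g, seed=0):
    """Two of the initializations discussed above: a k-means partition and
    a random division of the data into g groups."""
    kmeans_labels = KMeans(n_clusters=g, n_init=10, random_state=seed).fit(y).labels_
    random_labels = np.random.default_rng(seed).integers(0, g, size=len(y))
    return kmeans_labels, random_labels
```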
2.6 Example 2.2:
Synthetic Data Set 2
2.7 Example 2.3:
Synthetic Data Set 3
True values, initial values, and estimates obtained by EM for Example 2.3:

Parameter   True Value       Initial Value   Estimate by EM
π1          0.333            0.333           0.294
π2          0.333            0.333           0.337
π3          0.333            0.333           0.370
μ1          (0, -2)^T        (-1, 0)^T       (-0.154, -1.961)^T
μ2          (0, 0)^T         (0, 0)^T        (0.360, 0.115)^T
μ3          (0, 2)^T         (1, 0)^T        (-0.004, 2.027)^T
Σ1          diag(2, 0.2)     diag(1, 1)      [1.961 -0.016; -0.016 0.218]
Σ2          diag(2, 0.2)     diag(1, 1)      [2.346 -0.553; -0.553 0.218]
Σ3          diag(2, 0.2)     diag(1, 1)      [2.339 0.042; 0.042 0.206]
Figure 7
Figure 8
2.8 Provision of Standard Errors
One way of obtaining standard errors of the estimates of the parameters in a mixture model is to approximate the covariance matrix of Ψ̂ by the inverse of the observed information matrix, which is given by the negative of the Hessian matrix of the log likelihood evaluated at the MLE. It is important to emphasize that estimates of the covariance matrix of the MLE based on the expected or observed information matrices are guaranteed to be valid inferentially only asymptotically.
2.8 Cont.
In particular for mixture models, it is well
known that the sample size n has to be very
large before the asymptotic theory of
maximum likelihood applies.
Hence we shall now consider a resampling approach, the bootstrap, to this problem. Standard error estimation of Ψ̂ may be implemented according to the bootstrap as follows:
Step 1
2.8 Cont.
A new set of data, y*, called the bootstrap sample, is generated according to F̂, an estimate of the distribution function of Y formed from the original observed data y. That is, in the case where y contains the observed values of a random sample of size n, y* consists of the observed values of the random sample

Y_1*, …, Y_n* ~ i.i.d. F̂,   (40)
2.8 Cont.
where the estimate F̂ (now denoting the distribution function of a single observation Y_j) is held fixed at its observed value.
Step 2
The EM algorithm is applied to the bootstrap observed data y* to compute the MLE for this data set, Ψ̂*.
Step 3
2.8 Cont.
The bootstrap covariance matrix of Ψ̂* is given by

cov*(Ψ̂*) = E*[{Ψ̂* - E*(Ψ̂*)}{Ψ̂* - E*(Ψ̂*)}^T],   (41)

where E* denotes expectation over the bootstrap distribution specified by F̂.
2.8 Cont.
The bootstrap covariance matrix can be approximated by Monte Carlo methods. Steps (1) and (2) are repeated independently a number of times (say, B) to give B independent realizations of Ψ̂*, denoted by Ψ̂*_1, …, Ψ̂*_B.
2.8 Cont.
Then (41) can be approximated by the sample covariance matrix of these B bootstrap replications to give

cov*(Ψ̂*) ≈ \sum_{b=1}^{B} (Ψ̂*_b - Ψ̄*)(Ψ̂*_b - Ψ̄*)^T / (B - 1),   (42)

where

Ψ̄* = \sum_{b=1}^{B} Ψ̂*_b / B.   (43)
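The Monte Carlo approximation (42)-(43) can be sketched as below; `fit_em` stands for whatever routine returns the fitted parameter vector for a data set (an assumed user-supplied function), and the nonparametric bootstrap (resampling the observed y_j) is used for F̂.

```python
import numpy as np

def bootstrap_standard_errors(y, fit_em, B=100, seed=0):
    """Approximate the bootstrap covariance matrix (41) by the sample
    covariance (42)-(43) of B replicated EM estimates, and return the
    implied standard errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    psi_stars = []
    for _ in range(B):
        y_star = y[rng.integers(0, n, size=n)]   # Step 1: bootstrap sample from F-hat
        psi_stars.append(fit_em(y_star))         # Step 2: EM applied to y*
    psi_stars = np.array(psi_stars)
    cov = np.cov(psi_stars, rowvar=False)        # (42), with mean (43)
    return np.sqrt(np.diag(cov)), cov
```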
3 Examples of Normal Mixtures
3.1 Basic Model in Genetics
3.2 Example 3.1:
PTC Sensitivity Data
We report in Table 3, the results of Jones and
McLachlan (1991) who fitted a mixture of
three normal components to data on
phenylthiocarbamide (PTC) sensitivity for
three groups of people.
Table 3: Fit of Mixture Model to Three Data Sets

Parameter              Data Set 1    Data Set 2    Data Set 3
p_A                    0.572(.027)   0.626(.025)   0.520(.026)
μ1                     2.49(.15)     1.62(.14)     1.49(.09)
μ2                     9.09(.18)     8.09(.15)     7.47(.47)
μ3                     10.37(.28)    8.63(.50)     9.08(.08)
σ1²                    1.34(.29)     1.44(.28)     0.34(.09)
σ2²                    2.07(.39)     1.19(.22)     6.23(2.06)
σ3²                    0.57(.33)     0.10(.18)     0.48(.10)
Test statistic:
-2 log λ (σ2² = σ3²)   3.60          0.00          6.87
-2 log λ (HWE)         3.76          58.36         1.06
3.3 Example 3.2:
Screening for Hemochromatosis
We consider the case study of McLaren et al.
(1998) on the screening for hemochromatosis.
3.3 Cont.
Studies have suggested that mean transferrin
saturation values for heterozygotes are higher
than among unaffected subjects, but lower
than homozygotes. Since the distribution of
transferrin saturation is known to be well
approximated by a single normal distribution
in unaffected subjects, the physiologic models
used in the study of McLaren et al. (1998)
were a single normal component and a
mixture of two normal components.
Table 4: Transferrin Saturation Results Expressed as Mean Percentage ± SD.

           Asymptomatic Individuals                          Individuals Identified by Pedigree Analysis
Sex        Postulated Unaffected   Postulated Heterozygotes  Known Heterozygotes   Known Homozygotes
Male       24.1 ± 6.0              37.3 ± 7.7                37.1 ± 17.0           82.7 ± 14.4
Female     22.5 ± 6.4              37.6 ± 10.4               32.5 ± 15.3           75.3 ± 19.3
Figure 9: Plot of the densities of the mixture of two
normal heteroscedastic components fitted to some
transferrin values on asymptomatic Australians.
3.4 Example 3.3:
Crab Data
Figure 10: Plot of Crab Data
3.4 Cont.
Progress of fit to Crab Data
Figure 11: Contours of the fitted component densities
on the 2nd & 3rd variates for the blue crab data set.
3.5 Choice of Local Maximizer
The choice of root of the likelihood equation
in the case of homoscedastic components is
straightforward in the sense that the MLE
exists as the global maximizer of the
likelihood function. The situation is less
straightforward in the case of heteroscedastic
components as the likelihood function is
unbounded.
3.5 Cont.
But assuming the univariate result of
Hathaway (1985) extends to the case of
multivariate normal components, then the
constrained global maximizer is consistent
provided the true value of the parameter vector Ψ belongs to the parameter space constrained so that the component generalized variances are not too disparate; for example,

|Σ_h| / |Σ_i| ≥ C > 0   (1 ≤ h ≠ i ≤ g).   (46)
3.5 Cont
If we wish to proceed in the heteroscedastic
case by the prior imposition of a constraint
of the form (46), then there is the problem
of how small the lower bound C must be to
ensure that the constrained parameter space
contains the true value of the parameter vector Ψ.
3.5 Cont.
Therefore, to avoid having to specify a value for C beforehand, we prefer where possible to fit the normal mixture model without any constraints on the component-covariance matrices Σ_i. It thus means we have to be careful to check that the EM algorithm has actually converged and is not on its way to a singularity, which exists since the likelihood is unbounded for unequal component-covariance matrices.
3.5 Cont
Even if we can be sure that the EM
algorithm has converged to a local
maximizer, we have to be sure that it is
not a spurious solution that deserves to be
discarded. After these checks, we can take
the MLE of Ψ to be the root of the
likelihood equation corresponding to the
largest of the remaining local maxima
located.
3.6 Choice of Model for Component-Covariance Matrices
A normal mixture model without restrictions on
the component-covariance matrices may be
viewed as too general for many situations in
practice. At the same time, though, we are
reluctant to impose the homoscedastic condition
Σ_i = Σ (i=1,…,g), as we have noted in our analyses
that the imposition of the constraint of equal
component-covariance matrices can have a
marked effect on the resulting estimates and the
implied clustering. This was illustrated in
Example 3.3.
3.7 Spurious Local Maximizers
In practice, consideration has to be given to
the problem of relatively large local
maxima that occur as a consequence of a
fitted component having a very small (but
nonzero) variance for univariate data or
generalized variance (the determinant of the
covariance matrix) for multivariate data.
3.7 Cont.
Such a component corresponds to a cluster
containing a few data points either
relatively close together or almost lying in
a lower dimensional subspace in the case
of multivariate data. There is thus a need to
monitor the relative size of the fitted
mixing proportions and of the component
variances for univariate observations, or of
the generalized component variances for
multivariate data, in an attempt to identify
these spurious local maximizers.
3.8 Example 3.4:
Synthetic Data Set 4
Table 5: Local Maximizers for Synthetic Data Set 4.

Local Max.    log L     π1      μ1       μ2      σ1²         σ2²     σ1²/σ2²
ψ1            -170.56   0.157   -0.764   1.359   0.752       1.602   0.4696
ψ2            -165.94   0.020   -2.161   1.088   5.22×10⁻⁹   2.626   1.97×10⁻⁹
ψ3 (binned)   -187.63   0.205   -0.598   1.400   0.399       1.612   0.2475
Figure 12: Histogram of Synthetic Data Set 4 for fit ψ2 of the normal mixture density.
Figure 13: Histogram of Synthetic Data Set 4 for fit ψ1 of the normal mixture density.
Figure 14: Histogram of Synthetic Data Set 4 for fit ψ3 of the normal mixture density.
3.9 Example 3.5: Galaxy Data Set
Figure 15: Plot of fitted six-component normal mixture
density for galaxy data set
Table 6: A Six-Component Normal Mixture Solution for the Galaxy Data Set.

Component i    πi       μi       σi²
1              0.085    9.7101   0.178515
2              0.024    16.127   0.001849
3              0.037    33.044   0.849564
4              0.425    22.920   1.444820
5              0.024    26.978   0.000300
6              0.404    19.790   0.454717
4.1 Mixtures of t Distributions

f(y; μ, Σ, ν) = Γ((ν + p)/2) |Σ|^{-1/2} / [(πν)^{p/2} Γ(ν/2) {1 + δ(y, μ; Σ)/ν}^{(ν+p)/2}],   (49)

where

δ(y, μ; Σ) = (y - μ)^T Σ^{-1} (y - μ)   (50)

denotes the Mahalanobis squared distance between y and μ (with Σ as covariance matrix).
4.2 ML Estimation
Here

u_ij^(k) = (ν_i^(k) + p) / {ν_i^(k) + δ(y_j, μ_i^(k); Σ_i^(k))}.

The updated estimates of μ_i and Σ_i (i=1,…,g) are given by

μ_i^(k+1) = \sum_{j=1}^{n} τ_ij^(k) u_ij^(k) y_j / \sum_{j=1}^{n} τ_ij^(k) u_ij^(k)   (53)
4.2 Cont.
and

Σ_i^(k+1) = \sum_{j=1}^{n} τ_ij^(k) u_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / \sum_{j=1}^{n} τ_ij^(k).   (54)
4.2 Cont.
It follows that ν_i^(k+1) is a solution of the equation

{ψ(½ν_i) - log(½ν_i) - 1} - (1/n_i^(k)) \sum_{j=1}^{n} τ_ij^(k) (log u_ij^(k) - u_ij^(k)) - {ψ((ν_i^(k) + p)/2) - log((ν_i^(k) + p)/2)} = 0,   (55)

where n_i^(k) = \sum_{j=1}^{n} τ_ij^(k) and ψ(·) is the digamma function.
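A sketch of the weight computation and the weighted updates (53)-(54) for a single t component; the degrees-of-freedom equation (55) would be solved separately with a one-dimensional root finder, which we omit, and the helper names are our own.

```python
import numpy as np

def t_weights(y, mu, sigma, nu):
    """E-step weights u_ij^(k) for one t component: observations far from
    mu in Mahalanobis distance delta (50) are down-weighted."""
    d = y - mu
    delta = np.einsum('ij,jk,ik->i', d, np.linalg.inv(sigma), d)   # (50)
    return (nu + y.shape[1]) / (nu + delta)

def t_component_m_step(y, tau_i, u_i):
    """Weighted mean and scale updates (53) and (54) for the ith component,
    given posterior probabilities tau_i and weights u_i."""
    w = tau_i * u_i
    mu_new = (w[:, None] * y).sum(axis=0) / w.sum()                # (53)
    d = y - mu_new
    sigma_new = (w[:, None] * d).T @ d / tau_i.sum()               # (54)
    return mu_new, sigma_new
```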
Example 4.1: Noisy Data Set
A sample of 100 points was simulated from a three-component bivariate normal mixture with means

μ1 = (0, 3)^T,  μ2 = (3, 0)^T,  μ3 = (-3, 0)^T

and specified component-covariance matrices Σ1, Σ2, and Σ3.
Example 4.1: Noisy Data Set
To this simulated sample, 50 noise points were added from a uniform distribution over the range -10 to 10 on each variate.
4.1 True Solution
4.1 Normal Mixture Solution
4.1 Normal + Uniform Solution
4.1 t Mixture Solution
4.1 Comparison of Results
True Solution
Normal Mixture Solution
Normal + Uniform Mixture
t Mixture Solution
5. Number of Components
in a Mixture Model
Testing for the number of components g in
a mixture is an important but very difficult
problem which has not been completely
resolved. We have seen that finite mixture
distributions are employed in the modeling
of data with two main purposes in mind.
5.1 Cont
One is to provide an appealing semiparametric
framework in which to model unknown
distributional shapes, as an alternative to, say,
the kernel density method.
The other is to use the mixture model to
provide a model-based clustering. In both
situations, there is the question of how many
components to include in the mixture.
5.1 Cont.
In the former situation of density
estimation, the commonly used criteria of
AIC and BIC would appear to be adequate
for choosing the number of components g
for a suitable density estimate.
5.2 Order of a Mixture Model
A mixture density with g components might
be empirically indistinguishable from one
with either fewer than g components or
more than g components. It is therefore
sensible in practice to approach the question
of the number of components in a mixture
model in terms of an assessment of the
smallest number of components in the
mixture compatible with the data.
5.2 Cont.
To this end, the true order g_0 of the g-component mixture model

f(y; Ψ) = \sum_{i=1}^{g} π_i f_i(y; θ_i)   (57)

is defined to be the smallest value of g such that all the components f_i(y; θ_i) are different and all the associated mixing proportions π_i are nonzero.
5.3 Adjusting for Effect of Skewness on LRT
We now consider the effect of skewness on
hypothesis tests for the number of
components in normal mixture models. The
Box-Cox (1964) transformation can be
employed initially in an attempt to obtain
normal components. Hence to model some
univariate data y1,…,yn by a two-component
mixture distribution, the density of Yj is
taken to be
5.3 Cont.

f(y_j; Ψ) = {π_1 φ(y_j^(λ); μ_1, σ_1²) + π_2 φ(y_j^(λ); μ_2, σ_2²)} y_j^{λ-1},   (58)

where

y_j^(λ) = (y_j^λ - 1)/λ   if λ ≠ 0,
        = log y_j         if λ = 0,   (59)
5.3 Cont.
and where the last term on the RHS of (58) corresponds to the Jacobian of the transformation from y_j^(λ) to y_j.
Gutierrez et al. (1995) adopted this mixture
model of transformed normal components in
an attempt to identify the number of
underlying physical phenomena behind
tomato root initiation. The observation yj
corresponds to the inverse proportion of the
jth lateral root which expresses GUS
(j=1,…,40).
Figure 20: Kernel density estimate and normal Q-Q plot of the y_j. From Gutierrez et al. (1995).
Figure 21: Kernel density estimate and normal Q-Q plot of the y_j^{-1}. From Gutierrez et al. (1995).
5.4 Example 5.1: 1872 Hidalgo Stamp Issue of Mexico
Izenman and Sommer (1988) considered
the modeling of the distribution of stamp
thickness for the printing of a given stamp
issue from different types of paper. Their
main concern was the application of the
nonparametric approach to identify
components by the resulting placement of
modes in the density estimate.
5.4 Cont.
The specific example of a philatelic
mixture, the 1872 Hidalgo issue of Mexico,
was used as a particularly graphic
demonstration of the combination of a
statistical investigation and extensive
historical data to reach conclusions
regarding the mixture components.
Figure 22: g=7 (Izenman & Sommer)
Figure 23: g=3
Figure 24: g=7 (Basford et al.)
Figure 25: g=4 (equal variances)
Figure 26: g=5 (equal variances)
Figure 27: g=7 (equal variances)
Figure 28: g=8 (equal variances)
Table 7: Value of the Log Likelihood for g=1 to 9 Normal Components.

Number of Components    Unrestricted Variances    Equal Variances
1                       1350                      1350.3
2                       1484                      1442.6
3                       1518                      1475.7
4                       1521                      1487.4
5                       1527                      1489.5
6                       1535                      1512.9
7                       1538                      1525.3
8                       1544                      1535.1
9                       1552                      1536.5
5.6 Likelihood Ratio Test Statistic
An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, -2 log λ.
Suppose we wish to test the null hypothesis

H_0: g = g_0   (60)

versus

H_1: g = g_1   (61)

for some g_1 > g_0.
5.6 Cont.
Usually, g1=g0+1 in practice as it is
common to keep adding components until
the increase in the log likelihood starts to
fall away as g exceeds some threshold. The
value of this threshold is often taken to be
the g0 in H0. Of course it can happen that
the log likelihood may fall away for some
intermediate values of g only to increase
sharply at some larger value of g, as in
Example 5.1.
5.6 Cont.
We let Ψ̂_i denote the MLE of Ψ calculated under H_i (i=0,1). Then the evidence against H_0 will be strong if λ is sufficiently small, or equivalently, if -2 log λ is sufficiently large, where

-2 log λ = 2{log L(Ψ̂_1) - log L(Ψ̂_0)}.   (62)
5.7 Bootstrapping the LRTS
McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

H_0: g = g_0   versus   H_1: g = g_1   (63)

for a specified value of g_0.
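A sketch of that resampling procedure; `fit_mixture`, `loglik`, and `simulate` are assumed user-supplied helpers (fit a g-component model, evaluate its log likelihood, and generate a sample of size n from a fitted model), and the P-value convention (1 + count)/(B + 1) is one common choice rather than anything prescribed in the talk.

```python
import numpy as np

def bootstrap_lrt_pvalue(y, fit_mixture, loglik, simulate, g0, g1, B=99, seed=0):
    """Bootstrap assessment of the LRTS for H0: g = g0 versus H1: g = g1:
    the observed -2 log lambda (62) is compared with its distribution over
    B samples generated from the fitted g0-component model."""
    rng = np.random.default_rng(seed)
    psi0, psi1 = fit_mixture(y, g0), fit_mixture(y, g1)
    lrts_obs = 2 * (loglik(psi1, y) - loglik(psi0, y))
    lrts_boot = []
    for _ in range(B):
        y_star = simulate(psi0, len(y), rng)          # data generated under H0
        p0, p1 = fit_mixture(y_star, g0), fit_mixture(y_star, g1)
        lrts_boot.append(2 * (loglik(p1, y_star) - loglik(p0, y_star)))
    return (1 + np.sum(np.asarray(lrts_boot) >= lrts_obs)) / (B + 1)
```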
5.8 Application to Three Real Data Sets
Figure 29a: Acidity data set.
Figure 29b: Enzyme data set.
Figure 29c: Galaxy data set.
Table 9: P-Values Using Bootstrap LRT.

            P-Value for g (versus g+1)
Data Set    g=1    g=2    g=3    g=4    g=5    g=6
Acidity     0.01   0.08   0.44   -      -      -
Enzyme      0.01   0.02   0.06   0.39   -      -
Galaxy      0.01   0.01   0.01   0.04   0.02   0.22
5.9 Akaike’s Information Criterion
Akaike’s Information Criterion (AIC) selects the model that minimizes

-2 log L(Ψ̂) + 2d,   (65)

where d is equal to the number of parameters in the model.
5.10 Bayesian Information Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) is given by

-2 log L(Ψ̂) + d log n   (66)

as the penalized log likelihood criterion to be minimized in model selection, including the present situation for the number of components g in a mixture model.
5.11 Integrated Classification Likelihood Criterion

-2 log L(Ψ̂) + 2 EN(τ̂) + d log n,   (67)

where

EN(τ) = -\sum_{i=1}^{g} \sum_{j=1}^{n} τ_ij log τ_ij
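All three criteria can be computed directly from a fitted model; a small sketch with our own helper name, where smaller values favour the model.

```python
import numpy as np

def information_criteria(loglik, tau, d, n):
    """AIC (65), BIC (66), and the ICL-type criterion (67) for a fitted
    mixture, given the maximized log likelihood, the fitted posterior
    probabilities tau[j, i], the number of parameters d, and sample size n."""
    entropy = -np.sum(tau * np.log(np.clip(tau, 1e-300, None)))   # EN(tau-hat)
    aic = -2.0 * loglik + 2.0 * d
    bic = -2.0 * loglik + d * np.log(n)
    icl = -2.0 * loglik + 2.0 * entropy + d * np.log(n)
    return aic, bic, icl
```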
6. Mixtures of Non-normal Components
We first consider the case of mixed feature
variables, where some are continuous and
some are categorical. We shall outline the
use of the location model for the component
densities, as in Jorgensen and Hunt (1996),
Lawrence and Krzanowski (1996), and Hunt
and Jorgensen (1999).
6.1 Cont.
The ML fitting of commonly used components,
such as the binomial and Poisson, can be
undertaken within the framework of a mixture
of generalized linear models (GLMs). This
mixture model also has the capacity to handle
the regression case, where the random variable
Yj for the jth entity is allowed to depend on the
value xj of a vector x of covariates. If the first
element of x is taken to be one, then we can
specialize this model to the nonregression
situation by setting all but the first element in
the vector of regression coefficients to zero.
6.1 Cont.
One common use of mixture models with
discrete data is to handle overdispersion in
count data. For example, in medical research,
data are often collected in the form of counts,
corresponding to the number of times that a
particular event of interest occurs. Because of
their simplicity, one-parameter distributions
for which the variance is determined by the
mean are often used, at least in the first
instance to model such data. Familiar
examples are the Poisson and binomial distributions, which are members of the one-parameter exponential family.
6.1 Cont.
However, there are many situations, where
these models are inappropriate, in the sense
that the mean-variance relationship implied
by the one-parameter distribution being
fitted is not valid. In most of these
situations, the data are observed to be
overdispersed; that is, the observed sample
variance is larger than that predicted by inserting the sample mean into the mean-variance relationship. This phenomenon is called overdispersion.
6.1 Cont.
These phenomena are also observed with
the fitting of regression models, where the
mean (say, of the Poisson or the binomial
distribution), is modeled as a function of
some covariates. If this dispersion is not
taken into account, then using these models
may lead to biased estimates of the
parameters and consequently incorrect
inferences about the parameters.
6.1 Cont.
Concerning mixtures for multivariate
discrete data, a common application arises
in latent class analyses, in which the feature
variables (or response variables in a
regression context) are taken to be
independent in the component distributions.
This latter assumption allows mixture
models in the context of a latent class
analysis to be fitted within the above
framework of mixtures of GLMs.
6.2 Mixed Continuous and Categorical Variables
We consider now the problem of fitting a mixture model

f(y_j; Ψ) = \sum_{i=1}^{g} π_i f_i(y_j; θ_i)   (70)

to some data y = (y_1^T, …, y_n^T)^T, where some of the feature variables are categorical.
6.2 Cont.
The simplest way to model the component
densities of these mixed feature variables is
to proceed on the basis that the categorical
variables are independent of each other and
of the continuous feature variables, which
are taken to have, say, a multivariate
normal distribution. Although this seems a
crude way in which to proceed, it often
does well in practice as a way of clustering
mixed feature data. In the case where there
are data of known origin available, this
procedure is known as the naive Bayes
classifier.
6.2 Cont.
We can refine this approach by adopting the location model. Suppose that p_1 of the p feature variables in Y_j are categorical, where the qth categorical variable takes on m_q distinct values (q=1,…,p_1).
Then there are m = \prod_{q=1}^{p_1} m_q distinct patterns of these p_1 categorical variables.
6.2 Cont.
With the location model, the p_1 categorical variables are replaced by a single multinomial random variable Y_j^(1) with m cells; that is, (Y_j^(1))_s = 1 if the realizations of the p_1 categorical variables in Y_j correspond to the sth pattern.
6.2 Cont.
Any associations between the original categorical variables are converted into relationships among the resulting multinomial cell probabilities. The location model assumes further that, conditional on (y_j^(1))_s = 1 and membership of the ith component of the mixture model, the distribution of the p - p_1 continuous feature variables is normal with mean μ_is and covariance matrix Σ_i, which is the same for all cells s.
6.2 Cont.
The intent in MULTIMIX is to divide the
feature vector into as many subvectors as
possible that can be taken to be
independently distributed. The extreme form would be to take all p feature variables to be independent, and then to include correlation structure only where necessary.
6.3 Example 6.1:
Prostate Cancer Data
To illustrate the approach adopted in
MULTIMIX, we report in some detail a case
study of Hunt and Jorgensen (1999).
They considered the clustering of patients on
the basis of pretrial covariates alone for some
prostate cancer clinical trial data. This data set
was obtained from a randomized clinical trial
comparing four treatments for n=506 patients
with prostatic cancer grouped on clinical
criteria into Stages 3 and 4 of the disease.
6.3 Cont.
As reported by Byar and Green (1980), Stage
3 represents local extension of the disease
without evidence of distant metastasis, while
Stage 4 represents distant metastasis as evidenced by elevated acid phosphatase, X-ray evidence, or both.
Table 10: Pretreatment Covariates

Covariate                                   Abbreviation   Number of Levels (if Categorical)
Age                                         Age            -
Weight index                                WtI            -
Performance rating                          PF             4
Cardiovascular disease history              HX             2
Systolic blood pressure                     SBP            -
Diastolic blood pressure                    DBP            -
Electrocardiogram code                      EKG            7
Serum hemoglobin                            HG             -
Size of primary tumor                       SZ             -
Index of tumor stage & histologic grade     SG             -
Serum prostatic acid phosphatase            AP             -
Bone metastases                             BM             2
Table 11: Models and Fits.

Model   Variable Groups               No. of Parameters d   Log Lik + 11386.265
[ind]   -                             55                    0.000
[2]     {SBP,DBP}                     57                    117.542
[3,2]   {BM,WtI,HG},{SBP,DBP}         63                    149.419
[5]     {BM,WtI,HG,SBP,DBP}           75                    169.163
[9]     Complement of {PF,HX,EKG}     127                   237.092
Table 12: Clusters and Outcomes for Treated and Untreated Patients.

                            Outcome
Patient Group               Alive   Prostate Dth.   Cardio Dth.   Other Dth.
Untreated Patients
Cluster 1   Stage 3         39      18              37            33
            Stage 4         3       4               3             3
Cluster 2   Stage 3         1       4               2             3
            Stage 4         14      49              18            6
Treated Patients
Cluster 1   Stage 3         50      3               52            20
            Stage 4         4       0               1             3
Cluster 2   Stage 3         1       6               3             1
            Stage 4         25      37              22            10
6.4 Generalized Linear Model
With the generalized linear model (GLM) approach originally proposed by Nelder and Wedderburn (1972), the log density of the (univariate) variable Y_j has the form

log f(y_j; θ_j, κ) = m_j κ^{-1} {θ_j y_j - b(θ_j)} + c(y_j; κ),   (71)

where θ_j is the natural or canonical parameter, κ is the dispersion parameter, and m_j is the prior weight.
6.4 Cont.
The mean and variance of Y_j are given by

E(Y_j) = μ_j = b'(θ_j)

and

var(Y_j) = κ b''(θ_j),

respectively, where the prime denotes differentiation with respect to θ_j.
6.4 Cont.
In a GLM, it is assumed that

η_j = h(μ_j) = x_j^T β,

where x_j is a vector of covariates or explanatory variables on the jth response y_j, β is a vector of unknown parameters, and h(·) is a monotonic function known as the link function.
If the dispersion parameter κ is known, then the distribution (71) is a member of the (regular) exponential family with natural or canonical parameter θ_j.
6.4 Cont.
The variance of Y_j is the product of two terms, the dispersion parameter κ and the variance function b''(θ_j), which is usually written in the form

V(μ_j) = ∂μ_j / ∂θ_j.

So-called natural or canonical links occur when η_j = θ_j, which gives respectively the log and logit functions for the Poisson and binomial distributions.
6.4 Cont.
The likelihood equation for β can be expressed as

\sum_{j=1}^{n} m_j w(μ_j) (y_j - μ_j) η'(μ_j) x_j = 0,   (73)

where η'(μ_j) = ∂η_j/∂μ_j and w(μ_j) is the weight function defined by

w(μ_j) = 1 / [{η'(μ_j)}² V(μ_j)].

It can be seen that, for fixed κ, the likelihood equation for β does not depend on κ.
6.4 Cont.
The likelihood equation (73) can be solved iteratively using Fisher's method of scoring, which for a GLM is equivalent to using iteratively reweighted least squares (IRLS); see Nelder and Wedderburn (1972). On the (k+1)th iteration, we form the adjusted response variable ỹ_j as

ỹ_j^(k) = η(μ_j^(k)) + (y_j - μ_j^(k)) η'(μ_j^(k)).   (74)
6.4 Cont.
These n adjusted responses are then regressed on the covariates x_1,…,x_n using weights m_1 w(μ_1^(k)), …, m_n w(μ_n^(k)). This produces an updated estimate β^(k+1) for β, and hence updated estimates μ_j^(k+1) for the μ_j, for use in the right-hand side of (74) to update the adjusted responses, and so on. This process is repeated until changes in the estimates are sufficiently small.
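For the Poisson distribution with its canonical log link, η'(μ) = 1/μ and w(μ) = μ, so the IRLS cycle specializes to the short sketch below (our own code, with unit prior weights assumed).

```python
import numpy as np

def irls_poisson(X, y, max_iter=25, tol=1e-8):
    """IRLS for a Poisson GLM with log link: at each iteration, regress the
    adjusted response (74) on the covariates with working weights mu."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                  # adjusted response, (74)
        W = mu                                   # m_j w(mu_j) with m_j = 1
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```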
6.5 Mixtures of GLMs
For a mixture of g component distributions of GLMs in proportions π_1, …, π_g, we have that the density of the jth response variable Y_j is given by

f(y_j; Ψ) = \sum_{i=1}^{g} π_i f(y_j; θ_ij, κ_i),   (75)

where, for a fixed dispersion parameter κ_i,

log f(y_j; θ_ij, κ_i) = m_j κ_i^{-1} {θ_ij y_j - b_i(θ_ij)} + c_i(y_j; κ_i)   (76)
6.5 Cont.
for the ith component GLM. We let μ_ij be the mean of Y_j, h_i(μ_ij) the link function, and

η_ij = h_i(μ_ij) = β_i^T x_j

the linear predictor (i=1,…,g).
A common model for expressing the ith mixing proportion π_i as a function of x is the logistic. Under this model, we have, corresponding to the jth observation y_j with covariate vector x_j,
6.5 Cont.

π_ij = π_i(x_j; α) = exp(ω_i^T x_j) / {1 + \sum_{h=1}^{g-1} exp(ω_h^T x_j)},   (78)

where ω_g = 0 and α = (ω_1^T, …, ω_{g-1}^T)^T contains the logistic regression coefficients. The first element of x_j is usually taken to be one, so that the first element of each ω_i is an intercept.
The EM algorithm of Dempster et al. (1977) can be applied to obtain the MLE of Ψ as in the case of a finite mixture of arbitrary distributions.
6.5 Cont.
As the E-step is essentially the same as for arbitrary component densities, we move straight to the M-step.
If β_1, …, β_g have no elements in common a priori, then (82) reduces to solving

\sum_{j=1}^{n} τ_i(y_j; Ψ^(k)) w(μ_ij) (y_j - μ_ij) η_i'(μ_ij) x_j = 0   (83)

separately for each β_i to produce β_i^(k+1) (i=1,…,g).
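As one concrete instance, the sketch below fits a mixture of Poisson regressions (log link, covariate-independent mixing proportions) by alternating the usual E-step with a τ-weighted IRLS solve of (83) for each β_i; it is our own illustration of the scheme, not code from any of the packages cited later.

```python
import numpy as np

def em_poisson_regression_mixture(X, y, betas, pis, em_iter=50, irls_iter=5):
    """EM for a g-component mixture of Poisson regressions: the E-step forms
    tau_ij from the component Poisson densities; the M-step updates the mixing
    proportions and solves a tau-weighted form of (83) by IRLS."""
    betas = [np.asarray(b, dtype=float) for b in betas]
    pis = np.asarray(pis, dtype=float)
    for _ in range(em_iter):
        mus = np.exp(X @ np.array(betas).T)             # n x g component means
        logdens = y[:, None] * np.log(mus) - mus        # Poisson log density up to log(y!)
        w = np.exp(logdens - logdens.max(axis=1, keepdims=True)) * pis
        tau = w / w.sum(axis=1, keepdims=True)          # E-step
        pis = tau.mean(axis=0)                          # M-step for the proportions
        for i, beta in enumerate(betas):                # M-step for each beta_i
            for _ in range(irls_iter):
                eta = X @ beta
                mu = np.exp(eta)
                z = eta + (y - mu) / mu                 # adjusted response (74)
                W = tau[:, i] * mu                      # tau-weighted working weights
                XtW = X.T * W
                beta = np.linalg.solve(XtW @ X, XtW @ z)
            betas[i] = beta
    return pis, betas
```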
6.6 A General ML Analysis of Overdispersion in a GLM
In an extension to a GLM for overdispersion,
a random effect Uj can be introduced
additively into a GLM on the same scale as
the linear predictor, as proposed by Aitkin
(1996). This extension in a two-level
variance component GLM has been
considered recently by Aitkin (1999).
6.6 Cont.
For an unobservable random effect u_j for the jth response on the same scale as the linear predictor, we have that

η_j = β^T x_j + σ u_j,

where u_j is the realization of a random variable U_j distributed N(0,1) independently for j=1,…,n.
6.6 Cont.
The (marginal) log likelihood is thus

log L(Ψ) = \sum_{j=1}^{n} log ∫_{-∞}^{∞} f(y_j; β, σ, u) φ(u) du.   (84)

The integral in (84) does not exist in closed form except for a normally distributed response y_j.
6.6 Cont.
Following the development in Anderson
and Hinde (1988), Aitkin (1996, 1999)
suggested that it be approximated by
Gaussian quadrature, whereby the integral
over the normal distribution of U is
replaced by a finite sum of g Gaussian
quadrature mass-points ui with masses pi;
the ui and pi are given in standard
references, for example, Abramowitz and
Stegun (1964).
6.6 Cont.
The log likelihood so approximated thus has the form of that for a g-component mixture model,

\sum_{j=1}^{n} log \sum_{i=1}^{g} π_i f(y_j; β, σ, u_i),

where the masses π_1, …, π_g correspond to the (known) mixing proportions, and the corresponding mass points u_1, …, u_g to the (known) parameter values.
6.6 Cont.
The linear predictor for the jth response in the ith component of the mixture is

η_ij = β^T x_j + σ u_i   (i=1,…,g).

Hence in this formulation, u_i is an intercept term.
6.6 Cont.
Aitkin (1996, 1999) suggested treating the
masses p1,…,pg as g unknown mixing
proportions and the mass points u1,…,ug
as g unknown values of a parameter. This
g-component mixture model is then fitted
using the EM algorithm, as described
above.
6.6 Cont.
In this framework, since now u_i is also unknown, we can drop the scale parameter σ and define the linear predictor for the jth response in the ith component of the mixture as

η_ij = β^T x_j + u_i.

Thus u_i acts as an intercept parameter for the ith component. One of the u_i parameters will be aliased with the intercept term β_0; alternatively, the intercept can be removed from the model.
6.7 Example 6.2:
Fabric Faults Data
We report the analysis by Aitkin (1996), who
fitted a Poisson mixture regression model to
a data set on some fabric faults.
Table 14: Results of Fitting Mixtures of Poisson Regression Models.

g    β0       β1 (SE)         μ1 (π1)         μ2 (π2)          μ3 (π3)          Deviance
2    -2.979   0.800 (0.201)   0.609 (0.203)   -0.156 (0.797)   -                49.364
3    -2.972   0.799 (0.201)   0.611 (0.202)   -0.154 (0.711)   -0.165 (0.087)   49.364
7.1 Mixture Models for Failure-Time Data
It is only in relatively recent times that the
potential of finite mixture models has
started to be exploited in survival and
reliability analyses.
7.1 Cont.
In the analysis of failure-time data, it is often
necessary to consider different types of
failure. For simplicity of exposition, we shall
consider the case of g=2 different types of
failure or causes, but the results extend to an
arbitrary number g.
7.1 Cont.
An item is taken to have failed with the
occurrence of the first failure from either
cause, and we observe the time T to failure
and the type of failure i, (i=1,2).
In the case where the study terminates before
failure occurs, T is the censoring time and the
censoring indicator δ is set equal to zero to
indicate that the failure time is right-censored.
7.1 Cont.
The traditional approach to the modeling of
the distribution of failure time in the case of
competing risks is to postulate the existence
of so-called latent failure times, T1 and T2,
corresponding to the two causes and to
proceed with the modeling of T=min (T1,T2)
on the basis that the two causes are
independent of each other.
7.1 Cont.
An alternative approach is to adopt a two-component mixture model, whereby the survival function of T is modeled as

S(t; x) = π_1(x) S_1(t; x) + π_2(x) S_2(t; x),   (85)

where the ith component survival function S_i(t; x) denotes the conditional survival function given that failure is due to the ith cause, and π_i(x) is the probability of failure from the ith cause (i=1,2).
7.1 Cont.
Here x is a vector of covariates associated with the item. It is common to assume that the mixing proportions π_i(x) have the logistic form

π_1(x; β) = 1 - π_2(x; β) = exp(a + b^T x) / {1 + exp(a + b^T x)},   (86)

where β = (a, b^T)^T is the vector of logistic regression coefficients.
7.2 ML Estimation for Mixtures of Survival Functions
The log likelihood for Ψ that can be formed from the observed data y is given by

log L(Ψ) = \sum_{j=1}^{n} [ I_{[1]}(δ_j) log{π_1(x_j; β) f_1(t_j; θ_1, x_j)} + I_{[2]}(δ_j) log{π_2(x_j; β) f_2(t_j; θ_2, x_j)} + I_{[0]}(δ_j) log S(t_j; Ψ, x_j) ],   (88)

where I_{[h]}(δ_j) is the indicator function that equals one if δ_j = h (h=0,1,2).
7.3 Example 7.1:
Heart-Valve Data
To illustrate the application of mixture models
for competing risks in practice, we consider
the problem studied in Ng et al. (1999). They
considered the use of the two-component
mixture model (85) to estimate the probability
that a patient aged x years would undergo a
rereplacement operation after having his/her
native aortic valve replaced by a xenograft
prosthesis.
7.3 Cont.
At the time of the initial replacement
operation, the surgeon has the choice of
using either a mechanical valve or a
biologic valve, such as a xenograft (made
from porcine valve tissue) or an allograft
(human donor valve). Modern day
mechanical valves are very reliable, but a
patient must take blood-thinning drugs
for the rest of his/her life to avoid
thromboembolic events.
7.3 Cont.
On the other hand, biologic valves have a
finite working life, and so have to be
replaced if the patient were to live for a
sufficiently long enough time after the
initial replacement operation. Thus
inferences about the probability that a
patient of a given age will need to
undergo a rereplacement operation can be
used to assist a heart surgeon in deciding
on the type of valve to be used in view of
the patient's age.
Figure 30: Estimated probability of reoperation
at a given age of patient.
Figure 31: Conditional probability of reoperation
(xenograft valve) for specified age of patient.
7.4 Conditional Probability of a Reoperation
As a patient can avoid a reoperation by
dying first, it is relevant to consider the
conditional probability of a reoperation
within a specified time t after the initial
operation given that the patient does not die
without a reoperation during this period.
7.5 Long-term Survivor Model
In some situations where the aim is to
estimate the survival distribution for a
particular type of failure, a certain fraction
of the population, say π_1, may never experience this type of failure. This is characterized by the overall survival curve levelling off at a nonzero probability. In
some applications, the surviving fractions
are said to be “cured.”
7.5 Cont.
It is assumed that an entity or individual has probability π_2 = 1 - π_1 of failing from the cause of interest and probability π_1 of never experiencing failure from this cause.
In the usual framework for this problem, it
is assumed further that the entity cannot fail
from any other cause during the course of
the study (that is, during follow-up).
7.5 Cont.
We let T be the random variable denoting the
time to failure, where T= denotes the event
that the individual will not fail from the
cause of interest. The probability of this latter
event is p1.
7.5 Cont.
The unconditional survival function of T can then be expressed as

S(t) = π_1 + π_2 S_2(t),   (92)

where S_2(t) denotes the conditional survival function for failure from the cause of interest. The mixture model (92), with the first component having mass one at T = ∞, can be regarded as a nonstandard mixture model.
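The nonstandard mixture (92) is simple to evaluate; a tiny sketch follows, with an assumed exponential form for S_2(t) purely for illustration.

```python
import numpy as np

def long_term_survivor_sf(t, pi1, surv2):
    """Unconditional survival function (92): a proportion pi1 never fails,
    the remainder follow the conditional survival function S2."""
    return pi1 + (1.0 - pi1) * surv2(t)

# Example: 30% long-term survivors and S2(t) = exp(-t)
s = long_term_survivor_sf(np.linspace(0.0, 5.0, 6), 0.3, lambda t: np.exp(-t))
```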
8. Mixture Software
8.1 EMMIX
McLachlan, Peel, Adams, and Basford
http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
EMMIX for Windows
http://www.maths.uq.edu.au/~gjm/EMMIX_Demo/emmix.html
Other Software
• AUTOCLASS
• BINOMIX
• C.A.MAN
• MCLUST/EMCLUST
• MGT
• MIX
• MIXBIN
Other Software(cont.)
• Program for Gompertz Mixtures
• MPLUS
• MULTIMIX
• NORMIX
• SNOB
• Software for Flexible Bayesian Modeling
and Markov Chain Sampling
Hoping to see you all in Cairns in 2004!