Applying Finite Mixture Models

Presenter: Geoff McLachlan
Topics
Introduction
Application of EM algorithm
Examples of normal mixtures
Robust mixture modeling
Number of components in a mixture model
Number of nonnormal components
Mixture models for failure-time data
Mixture software
1.1 Flexible Method of Modeling
Astronomy
Biology
Genetics
Medicine
Psychiatry
Economics
Engineering
Marketing
1.2 Initial Approach to Mixture
Analysis
Classic paper of Pearson (1894)
Figure 1: Plot of forehead to body length data on 1000 crabs and of the fitted one-component (dashed line) and two-component (solid line) normal mixture models.
1.3 Basic Definition
We let Y_1, …, Y_n denote a random sample of size n, where Y_j is a p-dimensional random vector with probability density function f(y_j),

f(y_j) = ∑_{i=1}^g π_i f_i(y_j)    (1)

where the f_i(y_j) are densities and the π_i are nonnegative quantities that sum to one.
1.4 Interpretation of Mixture
Models
An obvious way of generating a random vector Y_j with the g-component mixture density f(y_j), given by (1), is as follows. Let Z_j be a categorical random variable taking on the values 1, …, g with probabilities π_1, …, π_g, respectively, and suppose that the conditional density of Y_j given Z_j = i is f_i(y_j) (i = 1, …, g). Then the unconditional density of Y_j (that is, its marginal density) is given by f(y_j).
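This generative view translates directly into a simulation recipe: draw Z_j from the categorical distribution with probabilities π_1, …, π_g, then draw Y_j from the component indicated by Z_j. A minimal sketch (illustrative only; the normal components and parameter values are arbitrary assumptions, not taken from the presentation):

```python
import numpy as np

def sample_mixture(n, pi, mu, sigma, rng=None):
    """Draw n observations from a g-component univariate normal mixture
    by first sampling the latent component labels Z_j."""
    rng = np.random.default_rng(rng)
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    z = rng.choice(len(pi), size=n, p=pi)        # latent component labels Z_j
    y = rng.normal(loc=mu[z], scale=sigma[z])    # draw Y_j | Z_j = i from f_i
    return y, z

# Example: a two-component mixture with arbitrary parameter values.
y, z = sample_mixture(1000, pi=[0.7, 0.3], mu=[0.0, 3.0], sigma=[1.0, 0.5], rng=0)
```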
1.5 Shapes of Some Univariate
Normal Mixtures
Consider

f(y_j) = π_1 φ(y_j; μ_1, σ²) + π_2 φ(y_j; μ_2, σ²)    (5)

where

φ(y_j; μ, σ²) = (2π)^(-1/2) σ^(-1) exp{-½ (y_j - μ)²/σ²}    (6)

denotes the univariate normal density with mean μ and variance σ².
Figure 2: Plot of a mixture density of two univariate normal components in equal proportions with common variance σ² = 1 (panels D = 1, 2, 3, 4).

Figure 3: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance σ² (panels D = 1, 2, 3, 4).
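The shapes in Figures 2 and 3 can be reproduced by evaluating (5) on a grid of y values. In the sketch below (illustrative only), D is assumed to denote the separation between the two component means in units of the common standard deviation, with μ_1 = 0 and μ_2 = D:

```python
import numpy as np
from scipy.stats import norm

def two_component_density(y, pi1, delta, sigma=1.0):
    """Density (5) with mu1 = 0, mu2 = delta and common variance sigma^2."""
    return pi1 * norm.pdf(y, 0.0, sigma) + (1.0 - pi1) * norm.pdf(y, delta, sigma)

y = np.linspace(-4.0, 8.0, 400)
curves = {(pi1, D): two_component_density(y, pi1, D)
          for pi1 in (0.5, 0.75)        # Figure 2 vs Figure 3 proportions
          for D in (1, 2, 3, 4)}        # the four panels
```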
1.6 Parametric Formulation of
Mixture Model
In many applications, the component densities f_i(y_j) are specified to belong to some parametric family. In this case, the component densities f_i(y_j) are specified as f_i(y_j; θ_i), where θ_i is the vector of unknown parameters in the postulated form for the ith component density in the mixture. The mixture density f(y_j) can then be written as
1.6 cont.
f(y_j; Ψ) = ∑_{i=1}^g π_i f_i(y_j; θ_i)    (7)

where the vector Ψ containing all the parameters in the mixture model can be written as

Ψ = (π_1, …, π_{g-1}, ξ^T)^T    (8)

where ξ is the vector containing all the parameters in θ_1, …, θ_g known a priori to be distinct.
1.7 Identifiability of Mixture
Distributions
In general, a parametric family of densities f(y_j; Ψ) is identifiable if distinct values of the parameter Ψ determine distinct members of the family of densities

{ f(y_j; Ψ) : Ψ ∈ Ω },

where Ω is the specified parameter space; that is,

f(y_j; Ψ) = f(y_j; Ψ*)    (11)
1.7 cont.
if and only if

Ψ = Ψ*.    (12)

Identifiability for mixture distributions is defined slightly differently. To see why this is necessary, suppose that f(y_j; Ψ) has two component densities, say, f_i(y; θ_i) and f_h(y; θ_h), that belong to the same parametric family. Then (11) will still hold when the component labels i and h are interchanged in Ψ.
1.8 Estimation of Mixture
Distributions
In the 1960s, the fitting of finite mixture
models by maximum likelihood had been
studied in a number of papers, including
the seminal papers by Day (1969) and
Wolfe (1965, 1967, 1970).
However, it was the publication of the
seminal paper of Dempster, Laird, and
Rubin (1977) on the EM algorithm that
greatly stimulated interest in the use of
finite mixture distributions to model
heterogeneous data.
1.8 Cont.
This is because the fitting of mixture
models by maximum likelihood is a
classic example of a problem that is
simplified considerably by the EM's
conceptual unification of maximum
likelihood (ML) estimation from data
that can be viewed as being
incomplete.
1.9 Mixture Likelihood
Approach to Clustering
Suppose that the purpose of fitting the finite
mixture model (7) is to cluster an observed
random sample y1,…,yn into g components.
This problem can be viewed as wishing to
infer the associated component labels z1,…,zn
of these feature data vectors. That is, we wish
to infer the zj on the basis of the feature data
yj.
1.9 Cont.
After we fit the g-component mixture model to obtain the estimate Ψ̂ of the vector of unknown parameters in the mixture model, we can give a probabilistic clustering of the n feature observations y_1, …, y_n in terms of their fitted posterior probabilities of component membership. For each y_j, the g probabilities τ_1(y_j; Ψ̂), …, τ_g(y_j; Ψ̂) give the estimated posterior probabilities that this observation belongs to the first, second, …, and gth component, respectively, of the mixture (j = 1, …, n).
1.9 Cont.
We can give an outright or hard clustering of
these data by assigning each yj to the
component of the mixture to which it has the
highest posterior probability of belonging.
That is, we estimate the component-label vector z_j by ẑ_j, where ẑ_ij = (ẑ_j)_i is defined by

ẑ_ij = 1,  if i = arg max_h τ_h(y_j; Ψ̂),
ẑ_ij = 0,  otherwise,    (14)

for i = 1, …, g; j = 1, …, n.
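Given the n × g matrix of fitted posterior probabilities, the outright clustering (14) is simply a row-wise arg max. A minimal sketch (the function and argument names are illustrative):

```python
import numpy as np

def hard_clustering(tau):
    """tau: (n, g) array of fitted posterior probabilities of component membership.
    Returns the component labels and the zero-one indicator matrix z_hat of (14)."""
    labels = np.argmax(tau, axis=1)                       # i maximizing tau_h(y_j)
    z_hat = np.zeros_like(tau, dtype=int)
    z_hat[np.arange(tau.shape[0]), labels] = 1
    return labels, z_hat
```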
1.10 Testing for the Number of
Components
In some applications of mixture models,
there is sufficient a priori information for
the number of components g in the mixture
model to be specified with no uncertainty.
For example, this would be the case where
the components correspond to externally
existing groups in which the feature vector
is known to be normally distributed.
1.10 Cont.
However, on many occasions, the number
of components has to be inferred from the
data, along with the parameters in the
component densities. If, say, a mixture
model is being used to describe the
distribution of some data, the number of
components in the final version of the
model may be of interest beyond matters of
a technical or computational nature.
2. Application of EM algorithm
2.1 Estimation of Mixing Proportions
Suppose that the density of the random vector Y_j has the g-component mixture form

f(y_j; Ψ) = ∑_{i=1}^g π_i f_i(y_j),    (15)

where Ψ = (π_1, …, π_{g-1})^T is the vector containing the unknown parameters, namely the g - 1 mixing proportions π_1, …, π_{g-1}, since

π_g = 1 - ∑_{i=1}^{g-1} π_i.
2.1 cont.
In order to pose this problem as an incomplete-data one, we now introduce as the unobservable or missing data the vector

z = (z_1^T, …, z_n^T)^T,    (18)

where z_j is the g-dimensional vector of zero-one indicator variables as defined above. If these z_ij were observable, then the MLE of π_i is simply given by

π̂_i = ∑_{j=1}^n z_ij / n    (i = 1, …, g),    (19)
2.1 Cont.
The EM algorithm handles the addition of the unobservable data to the problem by working with Q(Ψ; Ψ^(k)), which is the current conditional expectation of the complete-data log likelihood given the observed data. On defining the complete-data vector x as

x = (y^T, z^T)^T,    (20)
2.1 Cont.
the complete-data log likelihood for Ψ has the multinomial form

log L_c(Ψ) = ∑_{i=1}^g ∑_{j=1}^n z_ij log π_i + C,    (21)

where

C = ∑_{i=1}^g ∑_{j=1}^n z_ij log f_i(y_j)

does not depend on Ψ.
2.1 Cont.
As (21) is linear in the unobservable data z_ij, the E-step (on the (k+1)th iteration) simply requires the calculation of the current conditional expectation of Z_ij given the observed data y, where Z_ij is the random variable corresponding to z_ij. Now

E_{Ψ^(k)}(Z_ij | y) = pr_{Ψ^(k)}{Z_ij = 1 | y} = τ_ij^(k),    (22)
2.1 Cont.
where, by Bayes' theorem,

τ_ij^(k) = τ_i(y_j; Ψ^(k))
         = π_i^(k) f_i(y_j) / f(y_j; Ψ^(k))    (23)

for i = 1, …, g; j = 1, …, n. The quantity τ_i(y_j; Ψ^(k)) is the posterior probability that the jth member of the sample with observed value y_j belongs to the ith component of the mixture.
2.1 Cont.
The M-step on the (k+1)th iteration simply requires replacing each z_ij by τ_ij^(k) in (19) to give

π_i^(k+1) = ∑_{j=1}^n τ_ij^(k) / n    (24)

for i = 1, …, g.
2.2 Example 2.1:
Synthetic Data Set 1
We generated a random sample of n = 50 observations y_1, …, y_n from a mixture of two univariate normal densities with means μ_1 = 0 and μ_2 = 2 and common variance σ² = 1 in proportions π_1 = 0.8 and π_2 = 0.2.
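In this example the component densities are completely specified, so only π_1 has to be estimated. A sketch of the E-step (23) and M-step (24) on data generated as described above (the seed is arbitrary, so the resulting estimates will differ from those reported in Table 1):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, pi1_true = 50, 0.8
z = rng.random(n) < pi1_true
y = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

f1, f2 = norm.pdf(y, 0.0, 1.0), norm.pdf(y, 2.0, 1.0)   # known component densities

pi1 = 0.5                                                # starting value pi_1^(0)
for k in range(100):
    tau1 = pi1 * f1 / (pi1 * f1 + (1 - pi1) * f2)        # E-step, eq. (23)
    pi1_new = tau1.mean()                                # M-step, eq. (24)
    if abs(pi1_new - pi1) < 1e-8:
        pi1 = pi1_new
        break
    pi1 = pi1_new

log_lik = np.sum(np.log(pi1 * f1 + (1 - pi1) * f2))
print(round(pi1, 5), round(log_lik, 5))
```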
Table 1: Results of EM Algorithm for Example on Estimation of Mixing Proportions.

Iteration k    π_1^(k)     log L(Ψ^(k))
0              0.50000     -91.87811
1              0.68421     -85.55353
2              0.70304     -85.09035
3              0.71792     -84.81398
4              0.72885     -84.68609
5              0.73665     -84.63291
6              0.74218     -84.60978
7              0.74615     -84.58562
⋮              ⋮           ⋮
27             …           …
2.3 Univariate Normal
Component Densities
The normal mixture model to be fitted is thus

f(y_j; Ψ) = ∑_{i=1}^g π_i f_i(y_j; θ_i),    (28)

where

f_i(y_j; θ_i) = φ(y_j; μ_i, σ²)
             = (2πσ²)^(-1/2) exp{-½ (y_j - μ_i)²/σ²}.
2.3 Cont.
The complete-data log likelihood function for Ψ is given by (21), but where now

C = ∑_{i=1}^g ∑_{j=1}^n z_ij log f_i(y_j; θ_i)
  = -½ n log(2π) - ½ ∑_{i=1}^g ∑_{j=1}^n z_ij {log σ² + (y_j - μ_i)²/σ²}.
2.3 Cont.
The E-step is the same as before, requiring the calculation of (23). The M-step now requires the computation of not only (24), but also the values μ_1^(k+1), …, μ_g^(k+1) and σ^(k+1)² that, along with π_1^(k+1), …, π_{g-1}^(k+1), maximize Q(Ψ; Ψ^(k)). Now

∑_{j=1}^n z_ij y_j / ∑_{j=1}^n z_ij   and   ∑_{i=1}^g ∑_{j=1}^n z_ij (y_j - μ_i)² / n    (29)

are the MLEs of μ_i and σ², respectively, if the z_ij were observable.
2.3 Cont.
As log L_c(Ψ) is linear in the z_ij, it follows that the z_ij in (29) and (30) are replaced by their current conditional expectations τ_ij^(k), which here are the current estimates τ_i(y_j; Ψ^(k)) of the posterior probabilities of membership of the components of the mixture, given by

τ_i(y_j; Ψ^(k)) = π_i^(k) f_i(y_j; θ_i^(k)) / f(y_j; Ψ^(k)).
2.3 Cont.
This yields

μ_i^(k+1) = ∑_{j=1}^n τ_ij^(k) y_j / ∑_{j=1}^n τ_ij^(k)    (i = 1, …, g)    (31)

and

σ^(k+1)² = ∑_{i=1}^g ∑_{j=1}^n τ_ij^(k) (y_j - μ_i^(k+1))² / n,    (32)

and π_i^(k+1) is given by (24).
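Equations (23), (24), (31), and (32) together define one complete EM iteration for this model. A compact sketch of a single iteration (illustrative, not the EMMIX implementation):

```python
import numpy as np
from scipy.stats import norm

def em_step(y, pi, mu, sigma2):
    """One EM iteration for a g-component univariate normal mixture
    with common variance sigma2; returns updated (pi, mu, sigma2)."""
    y = np.asarray(y, dtype=float)
    dens = pi * norm.pdf(y[:, None], mu, np.sqrt(sigma2))     # (n, g): pi_i f_i(y_j)
    tau = dens / dens.sum(axis=1, keepdims=True)              # E-step, eq. (23)
    n_i = tau.sum(axis=0)
    pi_new = n_i / len(y)                                     # eq. (24)
    mu_new = (tau * y[:, None]).sum(axis=0) / n_i             # eq. (31)
    sigma2_new = (tau * (y[:, None] - mu_new) ** 2).sum() / len(y)   # eq. (32)
    return pi_new, mu_new, sigma2_new
```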
2.4 Multivariate Component
Densities
In the case of multivariate normal components with a common covariance matrix Σ, the M-step updates are

μ_i^(k+1) = ∑_{j=1}^n τ_ij^(k) y_j / ∑_{j=1}^n τ_ij^(k)    (i = 1, …, g)    (34)

Σ^(k+1) = ∑_{i=1}^g ∑_{j=1}^n τ_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / n.    (35)
2.4 Cont.
In the case of normal components with arbitrary covariance matrices, equation (35) is replaced by

Σ_i^(k+1) = ∑_{j=1}^n τ_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / ∑_{j=1}^n τ_ij^(k)    (i = 1, …, g).    (36)
2.5 Starting Values for EM
Algorithm
The EM algorithm is started from some initial value of Ψ, Ψ^(0). Hence in practice we have to specify a value for Ψ^(0). An alternative approach is to perform the first E-step by specifying a value τ_j^(0) for τ(y_j; Ψ) for each j (j = 1, …, n), where

τ(y_j; Ψ) = (τ_1(y_j; Ψ), …, τ_g(y_j; Ψ))^T

is the vector containing the g posterior probabilities of component membership for y_j.
2.5 Cont.
The latter is usually undertaken by setting τ_j^(0) = z_j^(0) for j = 1, …, n, where

z^(0) = (z_1^(0)T, …, z_n^(0)T)^T

defines an initial partition of the data into g groups. For example, an ad hoc way of initially partitioning the data in the case of, say, a mixture of g = 2 normal components with the same covariance matrices, would be to plot the data for selections of two of the p variables, and then draw a line that divides the bivariate data into two groups that have a scatter that appears normal.
2.5 Cont.
For higher dimensional data, an initial value z^(0) for z might be obtained through the use of some clustering algorithm, such as k-means or, say, a hierarchical procedure if n is not too large.
Another way of specifying an initial partition z^(0) of the data is to randomly divide the data into g groups corresponding to the g components of the mixture model.
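A sketch of turning either a k-means partition or a random partition into the matrix of initial posterior probabilities τ_j^(0) (scikit-learn's KMeans is assumed to be available; it is used here only as a convenient clustering routine):

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_tau(Y, g, method="kmeans", rng=None):
    """Return an (n, g) matrix tau^(0) of zero-one posterior probabilities
    defined by an initial partition of the rows of Y into g groups."""
    n = Y.shape[0]
    rng = np.random.default_rng(rng)
    if method == "kmeans":
        labels = KMeans(n_clusters=g, n_init=10, random_state=0).fit_predict(Y)
    else:                                   # random partition into g groups
        labels = rng.integers(0, g, size=n)
    tau0 = np.zeros((n, g))
    tau0[np.arange(n), labels] = 1.0
    return tau0
```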
2.6 Example 2.2:
Synthetic Data Set 2
2.7 Example 2.3:
Synthetic Data Set 3
True values, initial values, and estimates obtained by the EM algorithm:

Parameter    True Value       Initial Value    Estimate by EM
π_1          0.333            0.333            0.294
π_2          0.333            0.333            0.337
π_3          0.333            0.333            0.370
μ_1          (0, -2)^T        (-1, 0)^T        (-0.154, -1.961)^T
μ_2          (0, 0)^T         (0, 0)^T         (0.360, 0.115)^T
μ_3          (0, 2)^T         (1, 0)^T         (-0.004, 2.027)^T
Σ_1          diag(2, 0.2)     diag(1, 1)       (1.961, -0.016; -0.016, 0.218)
Σ_2          diag(2, 0.2)     diag(1, 1)       (2.346, -0.553; -0.553, 0.218)
Σ_3          diag(2, 0.2)     diag(1, 1)       (2.339, 0.042; 0.042, 0.206)
Figure 7
Figure 8
2.8 Provision of Standard Errors
One way of obtaining standard errors of the estimates of the parameters in a mixture model is to approximate the covariance matrix of Ψ̂ by the inverse of the observed information matrix, which is given by the negative of the Hessian matrix of the log likelihood evaluated at the MLE. It is important to emphasize that estimates of the covariance matrix of the MLE based on the expected or observed information matrices are guaranteed to be valid inferentially only asymptotically.
2.8 Cont.
In particular for mixture models, it is well
known that the sample size n has to be
very large before the asymptotic theory of
maximum likelihood applies.
Hence we shall now consider a resampling approach, the bootstrap, to this problem. Standard error estimation of Ψ̂ may be implemented according to the bootstrap as follows.
Step 1
2.8 Cont.
A new set of data, y*, called the bootstrap sample, is generated according to F̂, an estimate of the distribution function of Y formed from the original observed data y. That is, in the case where y contains the observed values of a random sample of size n, y* consists of the observed values of the random sample

Y_1*, …, Y_n* ~ i.i.d. F̂,    (40)
2.8 Cont.
where the estimate F̂ (now denoting the distribution function of a single observation Y_j) is held fixed at its observed value.
Step 2
The EM algorithm is applied to the bootstrap observed data y* to compute the MLE for this data set, Ψ̂*.
2.8 Cont.
Step 3
The bootstrap covariance matrix of Ψ̂* is given by

cov*(Ψ̂*) = E*[{Ψ̂* - E*(Ψ̂*)}{Ψ̂* - E*(Ψ̂*)}^T],    (41)

where E* denotes expectation over the bootstrap distribution specified by F̂.
2.8 Cont.
The bootstrap covariance matrix can be approximated by Monte Carlo methods. Steps (1) and (2) are repeated independently a number of times (say, B) to give B independent realizations of Ψ̂*, denoted by Ψ̂_1*, …, Ψ̂_B*.
2.8 Cont.
Then (41) can be approximated by the sample covariance matrix of these B bootstrap replications to give

cov*(Ψ̂*) ≈ ∑_{b=1}^B (Ψ̂_b* - Ψ̄*)(Ψ̂_b* - Ψ̄*)^T / (B - 1),    (42)

where

Ψ̄* = ∑_{b=1}^B Ψ̂_b* / B.    (43)
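Steps 1-3 can be sketched as a parametric bootstrap, with scikit-learn's GaussianMixture standing in for the EM fit (an illustrative substitute, not the software used in the presentation). Standard errors of the mixing proportions are then the square roots of the diagonal of (42):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_se(Y, g, B=100, seed=0):
    """Parametric bootstrap standard errors for the mixing proportions
    of a g-component normal mixture fitted to the data Y (n x p)."""
    fit = GaussianMixture(n_components=g, n_init=5, random_state=seed).fit(Y)
    reps = []
    for b in range(B):
        Yb, _ = fit.sample(len(Y))                       # Step 1: draw y* from F_hat
        fit_b = GaussianMixture(n_components=g, n_init=5,
                                random_state=seed + b + 1).fit(Yb)   # Step 2: EM on y*
        # Sort components by first mean to guard against label switching across runs.
        order = np.argsort(fit_b.means_[:, 0])
        reps.append(fit_b.weights_[order])
    reps = np.array(reps)                                # Step 3: sample covariance (42)
    return reps.std(axis=0, ddof=1)
```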
3 Examples of Normal Mixtures
3.1 Basic Model in Genetics
3.2 Example 3.1:
PTC Sensitivity Data
We report in Table 3 the results of Jones and McLachlan (1991), who fitted a mixture of three normal components to data on phenylthiocarbamide (PTC) sensitivity for three groups of people.
Table 3: Fit of Mixture Model to Three Data Sets.

Parameter              Data Set 1     Data Set 2     Data Set 3
p_A                    0.572 (.027)   0.626 (.025)   0.520 (.026)
μ_1                    2.49 (.15)     1.62 (.14)     1.49 (.09)
μ_2                    9.09 (.18)     8.09 (.15)     7.47 (.47)
μ_3                    10.37 (.28)    8.63 (.50)     9.08 (.08)
σ_1²                   1.34 (.29)     1.44 (.28)     0.34 (.09)
σ_2²                   2.07 (.39)     1.19 (.22)     6.23 (2.06)
σ_3²                   0.57 (.33)     0.10 (.18)     0.48 (.10)
Test statistic:
-2 log λ (σ_2²=σ_3²)   3.60           0.00           6.87
-2 log λ (HWE)         3.76           58.36          1.06
3.3 Example 3.2:
Screening for Hemochromatosis
We consider the case study of McLaren et
al. (1998) on the screening for
hemochromatosis.
3.3 Cont.
Studies have suggested that mean transferrin saturation values for heterozygotes are higher than those for unaffected subjects, but lower than those for homozygotes. Since the distribution of transferrin saturation is known to be well approximated by a single normal distribution in unaffected subjects, the physiologic models used in the study of McLaren et al. (1998) were a single normal component and a mixture of two normal components.
Table 4: Transferrin Saturation Results Expressed as Mean Percentage ± SD.

          Asymptomatic Individuals                            Individuals Identified by Pedigree Analysis
Sex       Postulated Unaffected   Postulated Heterozygotes    Known Heterozygotes    Known Homozygotes
Male      24.1 ± 6.0              37.3 ± 7.7                  37.1 ± 17.0            82.7 ± 14.4
Female    22.5 ± 6.4              37.6 ± 10.4                 32.5 ± 15.3            75.3 ± 19.3
Figure 9: Plot of the densities of the mixture of two
normal heteroscedastic components fitted to some
transferrin values on asymptomatic Australians.
3.4 Example 3.3: Crab Data
Figure 10: Plot of Crab Data
3.4 Cont.
Progress of fit to Crab Data
Figure 11: Contours of the fitted component
densities on the 2nd & 3rd variates for the blue crab
data set.
3.5 Choice of Local Maximizer
The choice of root of the likelihood
equation in the case of homoscedastic
components is straightforward in the sense
that the MLE exists as the global
maximizer of the likelihood function. The
situation is less straightforward in the case
of heteroscedastic components as the
likelihood function is unbounded.
3.5 Cont.
But assuming the univariate result of Hathaway (1985) extends to the case of multivariate normal components, then the constrained global maximizer is consistent provided the true value of the parameter vector Ψ belongs to the parameter space constrained so that the component generalized variances are not too disparate; for example,

|Σ_h| / |Σ_i| ≥ C > 0    (1 ≤ h ≠ i ≤ g).    (46)
3.5 Cont
If we wish to proceed in the heteroscedastic case by the prior imposition of a constraint of the form (46), then there is the problem of how small the lower bound C must be to ensure that the constrained parameter space contains the true value of the parameter vector Ψ.
3.5 Cont
Therefore, to avoid having to specify a value for C beforehand, we prefer where possible to fit the normal mixture without any constraints on the component covariances Σ_i. It thus means we have to be careful to check that the EM algorithm has actually converged and is not on its way to a singularity, which exists since the likelihood is unbounded for unequal component-covariance matrices.
3.5 Cont
Even if we can be sure that the EM
algorithm has converged to a local
maximizer, we have to be sure that it is
not a spurious solution that deserves to be
discarded. After these checks, we can take
the MLE of  to be the root of the
likelihood equation corresponding to the
largest of the remaining local maxima
located.
3.6 Choice of Model for
Component-Covariance Matrices
A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice. At the same time, though, we are reluctant to impose the homoscedastic condition Σ_i = Σ (i = 1, …, g), as we have noted in our analyses that the imposition of the constraint of equal component-covariance matrices can have a marked effect on the resulting estimates and the implied clustering. This was illustrated in Example 3.3.
3.7 Spurious Local Maximizers
In practice, consideration has to be given to
the problem of relatively large local
maxima that occur as a consequence of a
fitted component having a very small (but
nonzero) variance for univariate data or
generalized variance (the determinant of the
covariance matrix) for multivariate data.
3.7 Cont.
Such a component corresponds to a cluster
containing a few data points either
relatively close together or almost lying in
a lower dimensional subspace in the case
of multivariate data. There is thus a need to
monitor the relative size of the fitted
mixing proportions and of the component
variances for univariate observations, or of
the generalized component variances for
multivariate data, in an attempt to identify
these spurious local maximizers.
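Such monitoring can be automated in a simple screen: flag any fitted component whose mixing proportion is tiny or whose generalized variance is very small relative to the largest one. A sketch (the thresholds are arbitrary illustrations, not recommendations from the presentation):

```python
import numpy as np

def flag_spurious(weights, covariances, min_prop=0.01, max_ratio=1e6):
    """Flag components with very small mixing proportions or with generalized
    variances (determinants of the component-covariance matrices) that are
    very small relative to the largest one."""
    gen_var = np.array([np.linalg.det(np.atleast_2d(S)) for S in covariances])
    ratio = gen_var.max() / gen_var
    return (np.asarray(weights) < min_prop) | (ratio > max_ratio)
```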
3.8 Example 3.4:
Synthetic Data Set 4
Table 5: Local Maximizers for Synthetic Data Set 4.

Local Max.      log L      π_1     μ_1      μ_2     σ_1²         σ_2²     σ_1²/σ_2²
Ψ̂_1            -170.56    0.157   -0.764   1.359   0.752        1.602    0.4696
Ψ̂_2            -165.94    0.020   -2.161   1.088   5.22×10⁻⁹    2.626    1.97×10⁻⁹
Ψ̂_3 (binned)   -187.63    0.205   -0.598   1.400   0.399        1.612    0.2475
Figure 12: Histogram of Synthetic Data Set 4 for fit Ψ̂_2 of the normal mixture density.
Figure 13: Histogram of Synthetic Data Set 4 for fit Ψ̂_1 of the normal mixture density.
Figure 14: Histogram of Synthetic Data Set 4 for fit Ψ̂_3 of the normal mixture density.
3.9 Example 3.5: Galaxy Data Set
Figure 15: Plot of fitted six-component
normal mixture density for galaxy data set
Table 6: A Six-Component Normal Mixture Solution for the Galaxy Data Set.

Component i    π_i      μ_i       σ_i²
1              0.085    9.710     0.178515
2              0.024    16.127    0.001849
3              0.037    33.044    0.849564
4              0.425    22.920    1.444820
5              0.024    26.978    0.000300
6              0.404    19.790    0.454717
4.1 Mixtures of t Distributions
The component densities are now taken to be multivariate t distributions. The p-dimensional t density with location μ, positive definite scale matrix Σ, and ν degrees of freedom is

f(y; μ, Σ, ν) = Γ((ν+p)/2) |Σ|^(-1/2) / [ (πν)^(p/2) Γ(ν/2) {1 + δ(y, μ; Σ)/ν}^((ν+p)/2) ],    (49)

where

δ(y, μ; Σ) = (y - μ)^T Σ^(-1) (y - μ).    (50)
4.2 ML Estimation
where

u_ij^(k) = (ν_i^(k) + p) / (ν_i^(k) + δ(y_j, μ_i^(k); Σ_i^(k))).

The updated estimates of μ_i and Σ_i (i = 1, …, g) are given by

μ_i^(k+1) = ∑_{j=1}^n τ_ij^(k) u_ij^(k) y_j / ∑_{j=1}^n τ_ij^(k) u_ij^(k)    (53)
4.2 Cont.
and

Σ_i^(k+1) = ∑_{j=1}^n τ_ij^(k) u_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / ∑_{j=1}^n τ_ij^(k).    (54)
4.2 Cont.
( k 1)

It follows that i is a solution of the
equation
{y( 12 i )  log(12 i )  1
n
 n( k )  t (logu
1
i
j1
 y(
(k)
ij
(i k )  p
2
(k)
ij
)  log(
u )
(i k )  p
2
(k)
j
)}  0
Where n   t and y() is the
Digamma function.
(k)
i
n
(k)
j1 ij
(55)
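The weights u_ij^(k) downweight observations that are far from the current component location, which is what gives the t mixture its robustness. A sketch of computing these weights and the weighted mean update (53) for a single component (illustrative only):

```python
import numpy as np

def t_weights(Y, mu, Sigma, nu):
    """u_j = (nu + p) / (nu + delta(y_j, mu; Sigma)) for one t component."""
    p = Y.shape[1]
    diff = Y - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)  # eq. (50)
    return (nu + p) / (nu + delta)

def update_mean(Y, tau_i, u_i):
    """Weighted mean update (53) for the i-th component."""
    w = tau_i * u_i
    return (w[:, None] * Y).sum(axis=0) / w.sum()
```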
Example 4.1: Noisy Data Set
A sample of 100 points was simulated from a mixture of three normal components with means

μ_1 = (0, 3)^T,   μ_2 = (3, 0)^T,   μ_3 = (-3, 0)^T

and covariance matrices

Σ_1 = (2, 0.5; 0.5, 0.5),   Σ_2 = (1, 0; 0, 0.1),   Σ_3 = (2, -0.5; -0.5, 0.5).
Example 4.1: Noisy Data Set
To this simulated sample, 50 noise points were added from a uniform distribution over the range -10 to 10 on each variate.
4.1 True Solution
4.1 Normal Mixture Solution
4.1 Normal + Uniform Solution
4.1 t Mixture Solution
4.1 Comparison of Results
True Solution
Normal Mixture Solution
Normal + Uniform Mixture
t Mixture Solution
5. Number of Components in a
Mixture Model
Testing for the number of components g in
a mixture is an important but very difficult
problem which has not been completely
resolved. We have seen that finite mixture
distributions are employed in the modeling
of data with two main purposes in mind.
5.1 Cont
One
is to provide an appealing
semiparametric framework in which to
model unknown distributional shapes, as
an alternative to, say, the kernel density
method.
The other is to use the mixture model to
provide a model-based clustering. In both
situations, there is the question of how
many components to include in the
mixture.
5.1 Cont.
In the former situation of density
estimation, the commonly used criteria of
AIC and BIC would appear to be adequate
for choosing the number of components g
for a suitable density estimate.
5.2 Order of a Mixture Model
A mixture density with g components might
be empirically indistinguishable from one
with either fewer than g components or
more than g components. It is therefore
sensible in practice to approach the question
of the number of components in a mixture
model in terms of an assessment of the
smallest number of components in the
mixture compatible with the data.
5.2 Cont.
To this end, the true order g_0 of the g-component mixture model

f(y; Ψ) = ∑_{i=1}^g π_i f_i(y; θ_i)    (57)

is defined to be the smallest value of g such that all the components f_i(y; θ_i) are different and all the associated mixing proportions π_i are nonzero.
5.3 Adjusting for Effect of
Skewness on LRT
We now consider the effect of skewness on hypothesis tests for the number of components in normal mixture models. The Box-Cox (1964) transformation can be employed initially in an attempt to obtain normal components. Hence to model some univariate data y_1, …, y_n by a two-component mixture distribution, the density of Y_j is taken to be
5.3 Cont.
f (y; ) 
()
j
()
j
1
j
{p1(y ; m1, s )  p2(y ;m2 , s )}y
2
1
2
2
where
( y  1) /  if   0,

log
y
if


0
,
j

()
y
()
(58)
(59)
5.3 Cont.
and where the last term on the RHS of (58) corresponds to the Jacobian of the transformation from y_j^(λ) to y_j.
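The transformed-mixture density (58)-(59) is straightforward to evaluate for a trial value of λ (for positive y_j). A sketch with illustrative parameter names:

```python
import numpy as np
from scipy.stats import norm

def boxcox(y, lam):
    """Box-Cox transform, eq. (59)."""
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def transformed_mixture_density(y, lam, pi1, mu, sigma):
    """Density (58): two-component normal mixture on the Box-Cox scale,
    multiplied by the Jacobian y^(lam - 1)."""
    yl = boxcox(y, lam)
    mix = pi1 * norm.pdf(yl, mu[0], sigma[0]) + (1 - pi1) * norm.pdf(yl, mu[1], sigma[1])
    return mix * y ** (lam - 1.0)
```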
Gutierrez et al. (1995) adopted this mixture
model of transformed normal components in
an attempt to identify the number of
underlying physical phenomena behind
tomato root initiation. The observation yj
corresponds to the inverse proportion of the j
th lateral root which expresses GUS
(j=1,…,40).
Figure 20: Kernel density estimate and normal Q-Q plot of the y_j. From Gutierrez et al. (1995).
Figure 21: Kernel density estimate and normal Q-Q plot of the y_j^(-1). From Gutierrez et al. (1995).
5.4 Example 5.1: 1872 Hidalgo
Stamp Issue of Mexico
Izenman and Sommer (1988) considered
the modeling of the distribution of stamp
thickness for the printing of a given stamp
issue from different types of paper. Their
main concern was the application of the
nonparametric approach to identify
components by the resulting placement of
modes in the density estimate.
5.4 Cont.
The specific example of a philatelic
mixture, the 1872 Hidalgo issue of Mexico,
was used as a particularly graphic
demonstration of the combination of a
statistical investigation and extensive
historical data to reach conclusions
regarding the mixture components.
Figure 22: g=7 (Izenman & Sommer).
Figure 23: g=3.
Figure 24: g=7 (Basford et al.).
Figure 25: g=4 (equal variances).
Figure 26: g=5 (equal variances).
Figure 27: g=7 (equal variances).
Figure 28: g=8 (equal variances).
Table 7: Value of the Log Likelihood for g=1 to 9 Normal Components.

Number of      Unrestricted    Equal
Components     Variances       Variances
1              1350            1350.3
2              1484            1442.6
3              1518            1475.7
4              1521            1487.4
5              1527            1489.5
6              1535            1512.9
7              1538            1525.3
8              1544            1535.1
9              1552            1536.5
5.6 Likelihood Ratio Test
Statistic
An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, -2 log λ. Suppose we wish to test the null hypothesis

H_0: g = g_0    (60)

versus

H_1: g = g_1    (61)

for some g_1 > g_0.
5.6 Cont.
Usually, g_1 = g_0 + 1 in practice, as it is common to keep adding components until the increase in the log likelihood starts to fall away as g exceeds some threshold; this threshold is often taken to be the value g_0 specified under H_0. Of course it can happen that the log likelihood may fall away for some intermediate values of g only to increase sharply at some larger value of g, as in Example 5.1.
5.6 Cont.
We let Ψ̂_i denote the MLE of Ψ calculated under H_i (i = 0, 1). Then the evidence against H_0 will be strong if λ is sufficiently small, or equivalently, if -2 log λ is sufficiently large, where

-2 log λ = 2{log L(Ψ̂_1) - log L(Ψ̂_0)}.    (62)
5.7 Bootstrapping the LRTS
McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

H_0: g = g_0   versus   H_1: g = g_1    (63)

for a specified value of g_0.
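The resampling assessment can be sketched as follows: compute -2 log λ on the original data, repeatedly draw bootstrap samples from the fitted g_0-component model, recompute the statistic on each, and report the proportion of replications at least as large as the observed value. GaussianMixture is used purely as an illustrative stand-in for the EM fit:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrts(Y, g0, g1, seed=0):
    """-2 log lambda for H0: g = g0 versus H1: g = g1, as in (62)."""
    fit = lambda g, s: GaussianMixture(n_components=g, n_init=5, random_state=s).fit(Y)
    m0, m1 = fit(g0, seed), fit(g1, seed)
    return 2.0 * (m1.score(Y) - m0.score(Y)) * len(Y), m0   # score() is mean log lik

def bootstrap_pvalue(Y, g0, B=99, seed=0):
    """Bootstrap P-value of the LRTS for testing g0 versus g0 + 1 components."""
    stat_obs, m0 = lrts(Y, g0, g0 + 1, seed)
    stats = []
    for b in range(B):
        Yb, _ = m0.sample(len(Y))                        # data generated under H0
        stats.append(lrts(Yb, g0, g0 + 1, seed + b + 1)[0])
    return (1 + sum(s >= stat_obs for s in stats)) / (B + 1)
```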
5.8 Application to Three Real Data Sets
Figure 29a: Acidity data set.
Figure 29b: Enzyme data set.
Figure 29c: Galaxy data set.
Table 9: P-Values Using Bootstrap LRT.

               P-Value for g (versus g+1)
Data Set       g=1     g=2     g=3     g=4     g=5     g=6
Acidity        0.01    0.08    0.44    -       -       -
Enzyme         0.01    0.02    0.06    0.39    -       -
Galaxy         0.01    0.01    0.01    0.04    0.02    0.22
5.9 Akaike’s Information
Criterion
Akaike's Information Criterion (AIC) selects the model that minimizes

-2 log L(Ψ̂) + 2d,    (65)

where d is equal to the number of parameters in the model.
5.10 Bayesian Information
Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) uses

-2 log L(Ψ̂) + d log n    (66)

as the penalized criterion to be minimized in model selection, including the present situation for the number of components g in a mixture model.
5.11 Integrated Classification
Likelihood Criterion
ˆ )  2EN(tˆ )  d logn
 2 log L(
(67)
where
g
n
EN( t)   tij log tij
i 1 j1
6. Mixtures of Nonnormal
Components
We first consider the case of mixed feature
variables, where some are continuous and
some are categorical. We shall outline the
use of the location model for the component
densities, as in Jorgensen and Hunt (1996),
Lawrence and Krzanowski (1996), and Hunt
and Jorgensen (1999).
6.1 Cont.
The ML fitting of commonly used
components, such as the binomial and
Poisson, can be undertaken within the
framework of a mixture of generalized
linear models (GLMs). This mixture
model also has the capacity to handle the
regression case, where the random variable
Yj for the j th entity is allowed to depend
on the value xj of a vector x of covariates. If
the first element of x is taken to be one,
then we can specialize this model to the
nonregression situation by setting all but
the first element in the vector of regression
coefficients to zero.
6.1 Cont
One common use of mixture models with
discrete data is to handle overdispersion in
count data. For example, in medical
research, data are often collected in the
form of counts, corresponding to the
number of times that a particular event of
interest occurs.
Because of their
simplicity, one-parameter distributions for
which the variance is determined by the
mean are often used, at least in the first
instance to model such data. Familiar
examples are the Poisson and binomial
distributions, which are members of the
one-parameter exponential family.
6.1 Cont
However, there are many situations where these models are inappropriate, in the sense that the mean-variance relationship implied by the one-parameter distribution being fitted is not valid. In most of these situations, the data are observed to be overdispersed; that is, the observed sample variance is larger than that predicted by inserting the sample mean into the mean-variance relationship. This phenomenon is called overdispersion.
6.1 Cont
These phenomena are also observed with the fitting of regression models, where the mean (say, of the Poisson or the binomial distribution) is modeled as a function of some covariates. If this overdispersion is not taken into account, then using these models may lead to biased estimates of the parameters and consequently incorrect inferences about the parameters.
6.1 Cont
Concerning mixtures for multivariate
discrete data, a common application arises
in latent class analyses, in which the
feature variables (or response variables in a
regression context) are taken to be
independent in the component distributions.
This latter assumption allows mixture
models in the context of a latent class
analysis to be fitted within the above
framework of mixtures of GLMs.
6.2 Mixed Continuous and
Categorical Variables
We consider now the problem of fitting a mixture model

f(y_j; Ψ) = ∑_{i=1}^g π_i f_i(y_j; θ_i)    (70)

to some data y = (y_1^T, …, y_n^T)^T, where some of the feature variables are categorical.
6.2 Cont.
The simplest way to model the component
densities of these mixed feature variables
is to proceed on the basis that the
categorical variables are independent of
each other and of the continuous feature
variables, which are taken to have, say, a
multivariate normal distribution. Although
this seems a crude way in which to proceed,
it often does well in practice as a way of
clustering mixed feature data. In the case
where there are data of known origin
available, this procedure is known as the
naive Bayes classifier.
6.2 Cont.
We can refine this approach by adopting the location model. Suppose that p_1 of the p feature variables in Y_j are categorical, where the qth categorical variable takes on m_q distinct values (q = 1, …, p_1). Then there are

m = ∏_{q=1}^{p_1} m_q

distinct patterns of these p_1 categorical variables. With the location model, the p_1 categorical variables are replaced by a single multinomial random variable Y_j^(1) with m cells; that is, (Y_j^(1))_s = 1 if the realizations of the p_1 categorical variables in Y_j correspond to the sth pattern.
6.2 Cont.
Any associations between the original categorical variables are converted into relationships among the resulting multinomial cell probabilities. The location model assumes further that, conditional on (y_j^(1))_s = 1 and membership of the ith component of the mixture model, the distribution of the p - p_1 continuous feature variables is normal with mean μ_is and covariance matrix Σ_i, which is the same for all cells s.
6.2 Cont.
The intent in MULTIMIX is to divide the feature vector into as many subvectors as possible that can be taken to be independently distributed. The extreme form would be to take all p feature variables to be independent, with correlation structure included only where necessary.
6.3 Example 6.1:
Prostate Cancer Data
To illustrate the approach adopted in
MULTIMIX, we report in some detail a
case study of Hunt and Jorgensen (1999).
They considered the clustering of patients
on the basis of pretrial covariates alone
for some prostate cancer clinical trial
data. This data set was obtained from a
randomized clinical trial comparing four
treatments for n=506 patients with
prostatic cancer grouped on clinical
criteria into Stages 3 and 4 of the disease.
6.3 Cont.
As reported by Byar and Green (1980),
Stage 3 represents local extension of
the disease without evidence of distant
metastasis, while Stage 4 represents
distant metastasis as evidenced by
elevated acid phosphatase, X-ray
evidence, or both.
Table 10: Pretreatment Covariates.

Covariate                                    Abbreviation    Number of Levels (if Categorical)
Age                                          Age
Weight index                                 WtI
Performance rating                           PF              4
Cardiovascular disease history               HX              2
Systolic blood pressure                      SBP
Diastolic blood pressure                     DBP
Electrocardiogram code                       EKG             7
Serum hemoglobin                             HG
Size of primary tumor                        SZ
Index of tumor stage & histologic grade      SG
Serum prostatic acid phosphatase             AP
Bone metastases                              BM              2
Table 11: Models and Fits.

Model    Variable Groups               No. of Parameters d    Log Lik + 11386.265
[ind]    -                             55                     0.000
[2]      {SBP,DBP}                     57                     117.542
[3,2]    {BM,WtI,HG},{SBP,DBP}         63                     149.419
[5]      {BM,WtI,HG,SBP,DBP}           75                     169.163
[9]      Complement of {PF,HX,EKG}     127                    237.092
Table 12: Clusters and Outcomes for Treated and Untreated Patients.

                                         Outcome
Patient Group             Alive    Prostate Dth.    Cardio Dth.    Other Dth.
Untreated Patients
Cluster 1   Stage 3       39       18               37             33
            Stage 4       3        4                3              3
Cluster 2   Stage 3       1        4                2              3
            Stage 4       14       49               18             6
Treated Patients
Cluster 1   Stage 3       50       3                52             20
            Stage 4       4        0                1              3
Cluster 2   Stage 3       1        6                3              1
            Stage 4       25       37               22             10
6.4 Generalized Linear Model
With the generalized linear model (GLM) approach originally proposed by Nelder and Wedderburn (1972), the log density of the (univariate) variable Y_j has the form

log f(y_j; θ_j, κ) = m_j κ^(-1) {θ_j y_j - b(θ_j)} + c(y_j; κ),    (71)

where θ_j is the natural or canonical parameter, κ is the dispersion parameter, and m_j is the prior weight.
6.4 Cont.
The mean and variance of Y_j are given by

E(Y_j) = μ_j = b′(θ_j)

and

var(Y_j) = κ b′′(θ_j),

respectively, where the prime denotes differentiation with respect to θ_j.
6.4 Cont.
In a GLM, it is assumed that

η_j = h(μ_j) = x_j^T β,

where x_j is a vector of covariates or explanatory variables on the jth response y_j, β is a vector of unknown parameters, and h(·) is a monotonic function known as the link function. If the dispersion parameter κ is known, then the distribution (71) is a member of the (regular) exponential family with natural or canonical parameter θ_j.
6.4 Cont.
The variance of Y_j is the product of two terms, the dispersion parameter κ and the variance function b′′(θ_j), which is usually written in the form

V(μ_j) = ∂μ_j / ∂θ_j.

So-called natural or canonical links occur when η_j = θ_j; these are respectively the log and logit functions for the Poisson and binomial distributions.
6.4 Cont.
The likelihood equation for β can be expressed as

∑_{j=1}^n m_j w(μ_j)(y_j - μ_j) η′(μ_j) x_j = 0,    (73)

where η′(μ_j) = ∂η_j/∂μ_j and w(μ_j) is the weight function defined by

w(μ_j) = 1 / [{η′(μ_j)}² V(μ_j)].

It can be seen that for fixed κ, the likelihood equation for β is independent of κ.
6.4 Cont.
The likelihood equation (73) can be solved iteratively using Fisher's method of scoring, which for a GLM is equivalent to using iteratively reweighted least squares (IRLS); see Nelder and Wedderburn (1972). On the (k+1)th iteration, we form the adjusted response variable ỹ_j as

ỹ_j = η(μ_j^(k)) + (y_j - μ_j^(k)) η′(μ_j^(k)).    (74)
6.4 Cont.
These n adjusted responses are then regressed on the covariates x_1, …, x_n using weights m_1 w(μ_1^(k)), …, m_n w(μ_n^(k)). This produces an updated estimate β^(k+1) for β, and hence updated estimates μ_j^(k+1) for the μ_j, for use in the right-hand side of (74) to update the adjusted responses, and so on. This process is repeated until changes in the estimates are sufficiently small.
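For a Poisson GLM with the canonical log link, η = log μ, so η′(μ) = 1/μ, V(μ) = μ, and the working weight m_j w(μ_j) reduces to m_j μ_j. A minimal IRLS sketch along the lines of (73)-(74) (illustrative, not a production GLM fitter):

```python
import numpy as np

def irls_poisson(X, y, m=None, n_iter=25, tol=1e-8):
    """Fit a Poisson log-link GLM by iteratively reweighted least squares.
    X: (n, q) design matrix (include a column of ones for an intercept)."""
    n, q = X.shape
    m = np.ones(n) if m is None else np.asarray(m, dtype=float)   # prior weights m_j
    beta = np.zeros(q)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = m * mu                                   # m_j w(mu_j) for the log link
        y_adj = eta + (y - mu) / mu                  # adjusted response, eq. (74)
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_adj))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```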
6.5 Mixtures of GLMs
For a mixture of g component distributions of GLMs in proportions π_1, …, π_g, we have that the density of the jth response variable Y_j is given by

f(y_j; Ψ) = ∑_{i=1}^g π_i f(y_j; θ_ij, κ_i),    (75)

where, for a fixed dispersion parameter κ_i,

log f(y_j; θ_ij, κ_i) = m_j κ_i^(-1) {θ_ij y_j - b_i(θ_ij)} + c_i(y_j; κ_i)    (76)
6.5 Cont.
for the ith component GLM, where we let μ_ij be the mean of Y_j, h_i(μ_ij) the link function, and

η_ij = h_i(μ_ij) = β_i^T x_j

the linear predictor (i = 1, …, g).
A common model for expressing the ith mixing proportion π_i as a function of x is the logistic. Under this model, we have, corresponding to the jth observation y_j with covariate vector x_j,
6.5 Cont.
π_ij = π_i(x_j; ω)
     = exp(ω_i^T x_j) / {1 + ∑_{h=1}^{g-1} exp(ω_h^T x_j)},    (78)

where ω_g = 0 and ω = (ω_1^T, …, ω_{g-1}^T)^T contains the logistic regression coefficients. The first element of x_j is usually taken to be one, so that the first element of each ω_i is an intercept. The EM algorithm of Dempster et al. (1977) can be applied to obtain the MLE of Ψ as in the case of a finite mixture of arbitrary distributions.
6.5 Cont.
As the E-step is essentially the same as for arbitrary component densities, we move straight to the M-step. If the β_1, …, β_g have no elements in common a priori, then (82) reduces to solving

∑_{j=1}^n τ_i(y_j; Ψ^(k)) w(μ_ij)(y_j - μ_ij) η_i′(μ_ij) x_j = 0    (83)

separately for each β_i to produce β_i^(k+1) (i = 1, …, g).
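Equation (83) is the single-GLM score (73) with each observation additionally weighted by its posterior probability τ_ij^(k), so the M-step for β_i can reuse the IRLS scheme with those weights. An illustrative fragment for Poisson components (prior weights m_j are taken to be one):

```python
import numpy as np

def weighted_irls_poisson(X, y, w_obs, n_iter=25):
    """Poisson log-link IRLS with per-observation weights w_obs
    (here the posterior probabilities tau_ij^(k)); solves (83) for one component."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        W = w_obs * mu                                  # tau_ij * w(mu_ij) for log link
        y_adj = X @ beta + (y - mu) / mu
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y_adj))
    return beta

def m_step(X, y, tau):
    """M-step: update beta_i for each of the g components (columns of tau)."""
    return np.array([weighted_irls_poisson(X, y, tau[:, i]) for i in range(tau.shape[1])])
```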
6.6 A General ML Analysis of
Overdispersion in a GLM
In an extension to a GLM for
overdispersion, a random effect Uj
can be introduced additively into a
GLM on the same scale as the linear
predictor, as proposed by Aitkin
(1996). This extension in a two-level
variance component GLM has been
considered recently by Aitkin (1999).
6.6 Cont.
For an unobservable random effect u_j for the jth response on the same scale as the linear predictor, we have that

η_j = β^T x_j + σ u_j,

where u_j is a realization of a random variable U_j distributed N(0,1) independently of the jth response Y_j (j = 1, …, n).
6.6 Cont.
The (marginal) log likelihood is thus

log L(Ψ) = ∑_{j=1}^n log ∫_{-∞}^{∞} f(y_j; β, σ, u) φ(u) du.    (84)

The integral in (84) does not exist in closed form except for a normally distributed response y_j. Following the development in Anderson and Hinde (1988), Aitkin (1996, 1999) suggested that it be approximated by Gaussian quadrature, whereby the integral over the normal distribution of U is replaced by a finite sum of g Gaussian quadrature mass points u_i with masses π_i; the u_i and π_i are given in standard references, for example, Abramowitz and Stegun (1964).
6.6 Cont.
The log likelihood so approximated thus has the form of that for a g-component mixture model,

∑_{j=1}^n log ∑_{i=1}^g π_i f(y_j; β, σ, u_i),

where the masses π_1, …, π_g correspond to the (known) mixing proportions, and the corresponding mass points u_1, …, u_g to the (known) parameter values. The linear predictor for the jth response in the ith component of the mixture is

η_ij = β^T x_j + σ u_i    (i = 1, …, g).
6.6 Cont.
Hence in this formulation, u_i is an intercept term.
Aitkin (1996, 1999) suggested treating the masses π_1, …, π_g as g unknown mixing proportions and the mass points u_1, …, u_g as g unknown values of a parameter. This g-component mixture model is then fitted using the EM algorithm, as described above.
6.6 Cont.
In this framework, since now u_i is also unknown, we can drop the scale parameter σ and define the linear predictor for the jth response in the ith component of the mixture as

η_ij = β^T x_j + u_i.

Thus u_i acts as an intercept parameter for the ith component. One of the u_i parameters will be aliased with the intercept term β_0; alternatively, the intercept can be removed from the model.
6.7 Example 6.2:
Fabric Faults Data
We report the analysis by Aitkin (1996), who
fitted a Poisson mixture regression model to
a data set on some fabric faults.
Table 14: Results of Fitting Mixtures of Poisson Regression Models.

g    β_0       β_1 (SE)         m_1 (π_1)        m_2 (π_2)         m_3 (π_3)         Deviance
2    -2.979    0.800 (0.201)    0.609 (0.203)    -0.156 (0.797)    -                 49.364
3    -2.972    0.799 (0.201)    0.611 (0.202)    -0.154 (0.711)    -0.165 (0.087)    49.364
7.1 Mixture Models for Failure-Time Data
It is only in relatively recent times that the potential of finite mixture models has started to be exploited in survival and reliability analyses.
7.1 Cont.
In the analysis of failure-time data, it is often necessary to consider different types of failure. For simplicity of exposition, we shall consider the case of g = 2 different types of failure or causes, but the results extend to an arbitrary number g. An item is taken to have failed with the occurrence of the first failure from either cause, and we observe the time T to failure and the type of failure i (i = 1, 2). In the case where the study terminates before failure occurs, T is the censoring time and the censoring indicator δ is set equal to zero to indicate that the failure time is right-censored.
7.1 Cont.
The traditional approach to the modeling of
the distribution of failure time in the case of
competing risks is to postulate the existence
of so-called latent failure times, T1 and T2,
corresponding to the two causes and to
proceed with the modeling of T=min (T1,T2)
on the basis that the two causes are
independent of each other.
7.1 Cont.
An alternative approach is to adopt a two-component mixture model, whereby the survival function of T is modeled as

S(t; x) = π_1(x) S_1(t; x) + π_2(x) S_2(t; x),    (85)

where the ith component survival function S_i(t; x) denotes the conditional survival function given that failure is due to the ith cause, and π_i(x) is the probability of failure from the ith cause (i = 1, 2).
7.1 Cont.
Here x is a vector of covariates associated with the item. It is common to assume that the mixing proportions π_i(x) have the logistic form,

π_1(x; β) = 1 - π_2(x; β)
          = exp(a + b^T x) / {1 + exp(a + b^T x)},    (86)

where β = (a, b^T)^T is the vector of logistic regression coefficients.
7.2 ML Estimation for Mixtures
of Survival Functions
The log likelihood for Ψ that can be formed from the observed data y is given by

log L(Ψ) = ∑_{j=1}^n [ I_[1](δ_j) log{π_1(x_j; β) f_1(t_j; θ_1, x_j)}
           + I_[2](δ_j) log{π_2(x_j; β) f_2(t_j; θ_2, x_j)}
           + I_[0](δ_j) log S(t_j; Ψ, x_j) ],    (88)

where I_[h](δ_j) is the indicator function that equals one if δ_j = h (h = 0, 1, 2).
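A sketch of evaluating the log likelihood (88) for a two-component model with logistic mixing proportions and, as an illustrative assumption, Weibull component densities that do not depend on the covariates (a simplification, not the model of Ng et al.):

```python
import numpy as np
from scipy.stats import weibull_min

def log_lik_mixture_survival(t, delta, x, a, b, theta1, theta2):
    """Log likelihood (88). t: failure/censoring times; delta: 0 if censored, else
    the failure cause (1 or 2); x: (n, q) covariate matrix;
    theta_i = (shape_i, scale_i) of the assumed Weibull component distributions."""
    eta = a + x @ b
    pi1 = np.exp(eta) / (1.0 + np.exp(eta))             # logistic mixing proportion (86)
    pi2 = 1.0 - pi1
    f1 = weibull_min.pdf(t, theta1[0], scale=theta1[1])
    f2 = weibull_min.pdf(t, theta2[0], scale=theta2[1])
    S1 = weibull_min.sf(t, theta1[0], scale=theta1[1])
    S2 = weibull_min.sf(t, theta2[0], scale=theta2[1])
    S = pi1 * S1 + pi2 * S2                              # mixture survival function (85)
    ll = np.where(delta == 1, np.log(pi1 * f1),
         np.where(delta == 2, np.log(pi2 * f2), np.log(S)))
    return ll.sum()
```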
7.3 Example 7.1: Heart-Valve Data
To illustrate the application of mixture models for competing risks in practice, we consider the problem studied in Ng et al. (1999). They considered the use of the two-component mixture model (85) to estimate the probability that a patient aged x years would undergo a rereplacement operation after having his/her native aortic valve replaced by a xenograft prosthesis.
7.3 Cont.
At the time of the initial replacement operation, the surgeon has the choice of using either a mechanical valve or a biologic valve, such as a xenograft (made from porcine valve tissue) or an allograft (human donor valve). Modern day mechanical valves are very reliable, but a patient must take blood-thinning drugs for the rest of his/her life to avoid thromboembolic events.
7.3 Cont.
On the other hand, biologic valves have a finite working life, and so have to be replaced if the patient were to live for a sufficiently long time after the initial replacement operation. Thus inferences about the probability that a patient of a given age will need to undergo a rereplacement operation can be used to assist a heart surgeon in deciding on the type of valve to be used in view of the patient's age.
Figure 30: Estimated probability of reoperation at a given age of patient.
Figure 31: Conditional probability of
reoperation (xenograft valve) for specified
age of patient.
7.4 Conditional Probability of a
Reoperation
As a patient can avoid a reoperation by
dying first, it is relevant to consider the
conditional probability of a reoperation
within a specified time t after the initial
operation given that the patient does not
die without a reoperation during this period.
7.5 Long-term Survivor Model
In some situations where the aim is to estimate the survival distribution for a particular type of failure, a certain fraction of the population, say π_1, may never experience this type of failure. This is characterized by the overall survival curve levelling off at a nonzero probability. In some applications, the surviving fractions are said to be "cured."
7.5 Cont.
It is assumed that an entity or individual has probability π_2 = 1 - π_1 of failing from the cause of interest and probability π_1 of never experiencing failure from this cause. In the usual framework for this problem, it is assumed further that the entity cannot fail from any other cause during the course of the study (that is, during follow-up). We let T be the random variable denoting the time to failure, where T = ∞ denotes the event that the individual will not fail from the cause of interest. The probability of this latter event is π_1.
7.5 Cont.
The unconditional survival function of T can then be expressed as

S(t) = π_1 + π_2 S_2(t),    (92)

where S_2(t) denotes the conditional survival function for failure from the cause of interest. The mixture model (92), with the first component having mass one at T = ∞, can be regarded as a nonstandard mixture model.
8. Mixture Software
8.1 EMMIX
McLachlan, Peel, Adams, and Basford
 http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
Other Software
AUTOCLASS
BINOMIX
C.A.MAN
MCLUST/EMCLUST
MGT
MIX
MIXBIN
Other Software (cont.)
Program for Gompertz Mixtures
MPLUS
MULTIMIX
NORMIX
SNOB
Software for Flexible Bayesian Modeling
and Markov Chain Sampling