7. Decision Trees and Decision Rules
Download
Report
Transcript 7. Decision Trees and Decision Rules
國立雲林科技大學
National Yunlin University of Science and Technology
Mixture model clustering for mixed data
with missing information
Advisor :Dr. Hsu
Graduate: Yu Cheng Chen
Author: Lynette Hunt, Murray Jorgensen
Computation statistics & Data Analysis, 2002
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Introduction
The Mixture approach to Clustering Data
Application
Discussion
Personal Opinion
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation
Missing observations are frequently seen in data sets.
Specimen may be damaged result.
Expensive test may only be administered to a random sub-sample of the
items.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
We need to implement some technique when the data
to be clustered are incomplete.
Extends mixture likelihood approach to analyse data
with mixed categorical and continuous attributes and
where some of the data are missing at random.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
Data are described as ‘missing at random’
when the probability that a variable is missing for a
particular individual may depend on the values of the
observed variables, but not for on the value of the missing
variable.
The distribution of the missing data does not depend on
the missing data.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
Rubin(1976) showed the process that causes the missing
data can be ignored when making likelihood-based about
the parameter of the data if the data are ‘missing at random’.
The EM algorithms of Dempster et al . is a general iterative
procedure maximum likelihood estimation in incomplete
data problems.
Little and Schluchter(1985) present maximum likelihood
procedure using the EM algorithms for the general location
model with missing data.
Intelligent Database Systems Lab
The Mixture approach to Clustering Data
Suppose p attributes are measured on n individuals.
Let xi,…, xn be the observed values of a random sample from a
mixture of K populations in known proportions, π1,…,πk
Let the density of xi in the kth group be fk(xi; θk), where θk is the
parameter vector for group k.
Let ψ=(θ’, π’)’, where π=(π1,…,πk)’, θ=(θ1,…, θk)’
K
f ( xi ; ) k f k ( xi ; k )
k 1
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
The Mixture approach to Clustering Data
In EM algorihm of Dempster et al., the ‘missing’ data
are the unobserved indicators of group membership.
Let the vector of indicator variables, zi=(zi1,…,zik)
1 if individual i group k
zik
0
if
individual
i
group
k
zˆik pr (individual i group k | x i ;ˆ)
ˆ k f k ( xi ;ˆk )
K
k 1ˆ k ( xi ;ˆk )
for k=1,…K; and xi is assigned to group k if
zik > zik’ , k != k’
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
The Mixture approach to Clustering Data
The latent class model is a finite mixture model for data where each of
the p attributes is discrete.
Suppose that the jth attribute can take on 1,…,M1 and let λkjm be the
probability that for individuals from group k, the jth attribute has level m.
Then, individual I belonging to group k is defined as
p
f k ( xi , k ) kjxij
j 1
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Multimix
Jorgensen and Hunt(1996) Hunt and Jorgensen(1999)
proposed a general class of mixture models to include
data having continuous and categorical attributes.
By partitioning the observational vector xi such that
xi ( xi1 | ... | xil | ... | xiL )'
If individual I belongs to group k, we can write
L
f k ( xi ) f kl ( xil )
l 1
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Multimix
Discrete distribution:
where xil is a one-dimensional discrete attribute taking values
1,…Ml with probabilities λklM1
Multivariate Normal distribution:
x
where il is a pl-dimensional vector with a Npl(μkl,∑kl)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Graphical models
A alternative way of looking at these multivariate models within
the framework of graphical models.
The graph of a model contains vertices and edges
vertex corresponding to each variable.
Edges shows the independence of corresponding vertices.
Latent class models for p variable are represented by a graph
on p+1 vertices corresponding to the variables plus 1
categorical variable indicating the cluster.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
We put forward a method for mixture model
clustering based on the assumption that the data are
missing at random.
We write the observation vector xi in the form
(xobs,i ,xmiss,i)
xobs,i is the observed attributes for observation i
xmiss,i is the missing attributes for observation i
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
The E step of the EM algorithm require the calculation of Q(ψ,
ψ(t))=E{ LC(ψ)|xobs; ψ(t)}, the expectation of the
complete data log-likelihood conditional on the
observed data and the current value of the parameters.
We
calculate Q(ψ, ψ(t)) by replace zik with
zik zik( t ) E ( zik | X obs,i ; (t ) )
k f k (x obs,i ; k(t) )
(t)
f
(x
;
k 1 k k obs,i k )
K
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
The remaining calculations in the E step require the
calculation of the expected value of the complete data
sufficient statistics for each partition cell l.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
For multivariate normal partition cells,
Eliminating one cluster at a time
Calculate the between-cluster entropy based on remaining clusters
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
Sweep is usefulness in maximum likelihood
estimation for multivariate missing data problems.
We form the augmented covariance matrix Al using
the current estimates of the parameters for group k in
cell l
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
Sweeping on the elements of Al corresponding to the
observed xij in cell l, yields the conditional distribution
of the missing xij’ on the observed xij in the cell.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
The new parameter estimates θ(t+1) of parameters are
estimated form the complete data sufficient statistic.
Mixing proportion:
ˆ
( t 1)
k
1 n (t )
zik
n i 1
for k 1,..., k
Discrete distribution parameters:
klm
1
n k
zik ilm
n
i 1
for k 1,..., K, m 1,..., M l
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Missing data
Multivariate Normal parameters:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Application
Prostate cancer clinical trial data of Byar and
Green(1980).
The data were obtained from a randomized clinical
trial comparing 4 treatments for 506 patients with
prostatic cancer.
There are 12 pre-trial covariates measured on each
patient, 7 variables may be taken to be continuous, 4
to be discrete and 1 variable (SG) is an index. We
treat SG as a continuous variable.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Application
1/3 individual have at least one of pre-trial covariates
missing, giving a total of 62 missing values.
As only approximately 1% of the data are missing.
Missing values were created by assigning each
attribute of each individual a random digit generated
from the discrete[0,1], respectively,
as .10, .15, .20, .25 and .30.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Application
The data set reported in detail here had 1870values
recorded as missing.
Separate data into two clusters.
We regard the data as a random sample from the
distribution
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Application
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Discussion
The multimix approach allows to clustering of mixed
finite data containing both types of variables.
The finite mixture model leads itself well into coping
with missing values.
The approach implemented in this paper works well for
mixed data set that had a very large amount of missing
data.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Personal Opinion
…
Intelligent Database Systems Lab