Classifier based on mixture of density tree

CSE Department, The Chinese University of Hong Kong
Huang, Kaizhu
Basic Problem

Given a dataset
$$\{(X_1, C), (X_2, C), \ldots, (X_{N-1}, C), (X_N, C)\}$$
Here $X_i$ stands for a training instance and $C$ stands for its class label. Assuming we have $m$ classes, we estimate the posterior probability
$$P(C_i \mid X), \quad i = 1, 2, \ldots, m \qquad (1)$$
The classifier is then denoted by:
$$\arg\max_i P(C_i \mid X)$$
The key point is: how can we estimate the posterior probability (1)?
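As a minimal illustrative sketch of this decision rule (not from the slides; it assumes the class posteriors have already been estimated by some density model):

```python
import numpy as np

def classify(posteriors):
    """Return arg max_i P(C_i | X), i.e. the decision rule built on (1).

    posteriors: length-m array with the estimated values P(C_1|X), ..., P(C_m|X).
    """
    return int(np.argmax(posteriors))

# Example: with posteriors [0.1, 0.7, 0.2] the predicted class index is 1.
print(classify([0.1, 0.7, 0.2]))
```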
Density estimation problem

Given a dataset $D = \{X_1, X_2, \ldots, X_{N-1}, X_N\}$,
where each $X_i$ is an instance of an $m$-variable vector $\{v_1, v_2, \ldots, v_{m-1}, v_m\}$.
The goal is to find a joint distribution $P(v_1, v_2, \ldots, v_{m-1}, v_m)$ which maximizes the negative entropy (the log-likelihood)
$$Q = \arg\max_{P} \sum_{i=1}^{N} \log P(X_i)$$
Interpretation
$$\sum_{i=1}^{N} \log P(X_i) \approx N \sum_{x \in \Omega} P(x) \log P(x)$$
Define $\Omega$ as the state space.
According to information theory, the negative of this quantity measures how many bits are needed to describe $D$ using the probability distribution $P$; finding the maximizing $P$ is therefore equivalent to finding the well-known MDL (minimum description length) solution.
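To make the link explicit (a small added step using the empirical distribution $\hat{P}$ of $D$, a symbol not on the original slide):
$$\sum_{i=1}^{N} \log P(X_i) = N \sum_{x \in \Omega} \hat{P}(x) \log P(x) = -N\, H(\hat{P}, P),$$
where $H(\hat{P}, P) = -\sum_{x} \hat{P}(x) \log P(x)$ is the cross-entropy, i.e. the expected number of bits per sample needed to encode $D$ with a code built from $P$. Maximizing the log-likelihood thus minimizes the total description length $N\, H(\hat{P}, P)$.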
Graphical Density estimation

Naive Bayesian Network (NB)
Tree Augmented Naive Bayesian network (TANB)
Chow-Liu tree network (CL)
Mixture of trees network (MT)
[Diagram: overview of the models: NB, TANB, Chow-Liu, EM, Mixture of trees]
Naive Bayesian Network

Given the problem in Slide 3, we make the following assumption about the variables:
All the variables are independent, given the class label.
Then the joint distribution P can be written as:
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_i P(v_i \mid C)$$
Structure of NB
1. With this structure, it is easy to estimate the joint distribution, since we can obtain $P(v_i \mid C)$ by the following simple counting:
$$P(v_i = k \mid C_j) = \frac{N_{v_i = k}}{N_{C_j}},$$
assuming $v_i$ is a discrete variable and $k$ is one of $v_i$'s values, where $N_{v_i = k}$ counts the training instances of class $C_j$ with $v_i = k$ and $N_{C_j}$ counts all instances of class $C_j$.
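A minimal sketch of this counting estimate for discrete features (the function name is illustrative, and no smoothing is applied, matching the slide's formula):

```python
import numpy as np

def estimate_nb(X, y):
    """Estimate P(v_i = k | C_j) by counting, as in the formula above.

    X: 2-D array of shape (N, m) with discrete feature values.
    y: length-N array of class labels.
    Returns a dict: cond_prob[(j, i, k)] = P(v_i = k | C_j).
    """
    X, y = np.asarray(X), np.asarray(y)
    cond_prob = {}
    for j in np.unique(y):
        Xj = X[y == j]                         # instances of class C_j
        n_cj = len(Xj)                         # N_{C_j}
        for i in range(X.shape[1]):
            for k in np.unique(X[:, i]):
                n_vik = np.sum(Xj[:, i] == k)  # N_{v_i = k} within class C_j
                cond_prob[(j, i, k)] = n_vik / n_cj
    return cond_prob
```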
Chow-Liu Tree network

Given the problem in Slide 3, we make the following assumption about the variables:
Each variable $v_i$ has a direct dependence relationship with just one other variable, and is conditionally independent of the other variables given the class label.
Thus the joint distribution can be written as:
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_{l_i} \mid v_{l_{j(i)}}, C),$$
$$0 \le j(i) < i, \text{ where } (l_1, l_2, \ldots, l_m) \text{ is an unknown permutation of the integers } 1, 2, \ldots, m.$$
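A minimal sketch of evaluating such a tree-factored joint, assuming the tree is already given as a parent map (the names `tree_log_prob`, `parents`, `cond_prob` are illustrative, and the class label is folded into the probability tables for brevity):

```python
import numpy as np

def tree_log_prob(x, parents, cond_prob, root_prob):
    """Evaluate log P(v_1, ..., v_m) under a tree factorization like the one above.

    x: length-m sequence of discrete values (x[i] is the value of v_i).
    parents: dict mapping a node index i to its parent index (the root is absent).
    cond_prob: dict mapping (i, x_i, x_parent) -> P(v_i = x_i | v_parent = x_parent).
    root_prob: dict mapping x_root -> P(v_root = x_root).
    """
    logp = 0.0
    for i, xi in enumerate(x):
        if i in parents:
            logp += np.log(cond_prob[(i, xi, x[parents[i]])])
        else:
            logp += np.log(root_prob[xi])
    return logp
```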
Example of CL methods
Fig. 2 is an example of a CL tree where:
1. $v_3$ is conditionally dependent only on $v_4$, and conditionally independent of the other variables: $P(v_3 \mid v_4, B) = P(v_3 \mid v_4)$
2. $v_5$ is conditionally dependent only on $v_4$, and conditionally independent of the other variables: $P(v_5 \mid v_4, B) = P(v_5 \mid v_4)$
3. $v_2$ is conditionally dependent only on $v_3$, and conditionally independent of the other variables: $P(v_2 \mid v_3, B) = P(v_2 \mid v_3)$
4. $v_1$ is conditionally dependent only on $v_3$, and conditionally independent of the other variables: $P(v_1 \mid v_3, B) = P(v_1 \mid v_3)$
Huang,Kaizhu
$$\begin{aligned}
P(v_1, v_2, v_3, v_4, v_5)
&= P(v_1, v_2, v_3, v_5 \mid v_4)\, P(v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_1, v_2, v_5 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_1, v_2, v_3, v_4)\, P(v_1, v_2 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1, v_2 \mid v_3, v_4) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1, v_2 \mid v_3) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1 \mid v_2, v_3)\, P(v_2 \mid v_3) \\
&= P(v_4)\, P(v_3 \mid v_4)\, P(v_5 \mid v_4)\, P(v_1 \mid v_3)\, P(v_2 \mid v_3)
\end{aligned}$$
CL Tree
• The key point about the CL tree is that we use a product of two-dimensional (pairwise) distributions to approximate the high-dimensional distribution.
How, then, can we find the best product of pairwise distributions to approximate the high-dimensional distribution optimally?
$$P(v_1, v_2, \ldots, v_{m-1}, v_m \mid C) = \prod_{i=1}^{m} P(v_{l_i} \mid v_{l_{j(i)}}, C),$$
$$0 \le j(i) < i, \text{ where } (l_1, l_2, \ldots, l_m) \text{ is an unknown permutation of the integers } 1, 2, \ldots, m.$$
CL tree algorithm
1. Obtain $P(v_i \mid v_j)$ and $P(v_i, v_j)$ for each pair $(v_i, v_j)$ by a counting process.
2. Calculate the mutual information
$$I(v_i, v_j) = \sum_{v_i, v_j} P(v_i, v_j) \log \frac{P(v_i, v_j)}{P(v_i) P(v_j)}$$
3. Use a maximum spanning tree algorithm to find the optimal tree structure, where the edge weight between two nodes $v_i, v_j$ is $I(v_i, v_j)$.
This CL algorithm was proved to be optimal in [1].
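A minimal sketch of steps 1-3 for discrete data (the function name `chow_liu_edges` and the use of SciPy's spanning-tree routine are illustrative choices, not from the slides); the maximum spanning tree is obtained by running a minimum spanning tree on negated mutual-information weights:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X):
    """Estimate pairwise mutual information by counting, then take a
    maximum-weight spanning tree over those weights.

    X: 2-D array of shape (N, m) with non-negative integer feature values.
    Returns a list of (i, j) edges of the learned tree.
    """
    N, m = X.shape
    mi = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            # Joint and marginal distributions of (v_i, v_j) by counting.
            joint = np.zeros((X[:, i].max() + 1, X[:, j].max() + 1))
            for a, b in zip(X[:, i], X[:, j]):
                joint[a, b] += 1.0 / N
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            nz = joint > 0
            mi[i, j] = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pi, pj)[nz]))
    # Maximum spanning tree = minimum spanning tree on negated weights.
    mst = minimum_spanning_tree(-mi)
    return list(zip(*mst.nonzero()))
```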
Mixture of tree (MT)

A mixture of trees model is defined to be a distribution of the form:
$$Q(v) = \sum_{k=1}^{l} \lambda_k T^k(v),$$
$$\text{with } \lambda_k \ge 0,\ k = 1, \ldots, l; \qquad \sum_{k=1}^{l} \lambda_k = 1.$$
$T^k(v)$ is a CL tree, and $l$ is the number of component trees.
An MT can be viewed as containing an unobserved choice variable $z$, which takes values $k \in \{1, \ldots, l\}$.
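A minimal sketch of evaluating the mixture (the name `mixture_log_prob` is illustrative; each component could, for example, be the `tree_log_prob` sketch shown earlier):

```python
import numpy as np

def mixture_log_prob(x, lambdas, tree_log_probs):
    """Evaluate log Q(v) = log sum_k lambda_k * T^k(v) for a mixture of trees.

    lambdas: length-l array of positive mixture weights summing to 1.
    tree_log_probs: list of l callables, each returning log T^k(x).
    """
    log_terms = np.array([np.log(lam) + f(x) for lam, f in zip(lambdas, tree_log_probs)])
    # Log-sum-exp for numerical stability.
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))
```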
Difference between MT & CL
• $z$ can be any integer variable; in particular, when the unobserved variable $z$ is the class label, the MT turns into the multi-CL tree.
• CL is a supervised learning algorithm, which has to train one tree for each class.
• MT is an unsupervised learning algorithm, which treats the class variable as part of the training data.
Optimization problem of MT

Given a data set of observations
$$D = \{x^1, x^2, \ldots, x^N\}$$
we are required to find the mixture of trees $Q$ that satisfies
$$Q = \arg\max_{Q'} \sum_{i=1}^{N} \log Q'(X_i) \qquad (1)$$
This optimization problem for a mixture model can be solved by the EM (Expectation Maximization) method.
We maximize (7) with respect to $\lambda_k$ and $T^k$ under the constraint
$$\sum_{k=1}^{l} \lambda_k = 1;$$
we can then obtain the update equation for $\lambda_k$.
As for the second term of (7), it is in fact a CL procedure, so we can maximize it by finding a CL tree based on the data weighted by the posterior probabilities of the choice variable $z$.
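A rough sketch of one EM iteration along these lines (the helper `fit_chow_liu_tree(X, weights)`, which fits a CL tree to posterior-weighted data, and the `log_prob` method on tree objects are assumptions, not defined in the slides):

```python
import numpy as np

def em_step(X, lambdas, trees, fit_chow_liu_tree):
    """One EM iteration for a mixture of trees (sketch).

    X: sequence of N instances.
    lambdas: length-l array of mixture weights.
    trees: list of l tree models, each with a log_prob(x) method.
    fit_chow_liu_tree: callable (X, weights) -> new tree model (assumed helper).
    """
    N, l = len(X), len(lambdas)
    # E-step: posterior responsibility of tree k for instance i.
    log_r = np.array([[np.log(lambdas[k]) + trees[k].log_prob(X[i])
                       for k in range(l)] for i in range(N)])
    log_r -= np.max(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update the mixture weights, then refit each tree to the
    # posterior-weighted data (the CL procedure mentioned above).
    new_lambdas = r.mean(axis=0)
    new_trees = [fit_chow_liu_tree(X, r[:, k]) for k in range(l)]
    return new_lambdas, new_trees
```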
MT in classifiers
1. In the training phase:
Train the MT model on the training data domain $\{c\} \times V$, where $C$ is the class label domain and $V$ is the input domain.
2. In the testing phase:
A new instance $x \in V$ is classified by picking the most likely value of the class variable given the settings of the other variables:
$$c(x) = \arg\max_{x_c} Q(x_c, x)$$
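A minimal sketch of this testing rule (the name `classify_with_mt` is illustrative; `mt_log_prob` could be built from the `mixture_log_prob` sketch shown earlier):

```python
def classify_with_mt(x, classes, mt_log_prob):
    """Classify x by maximizing the mixture-of-trees joint over the class value.

    classes: iterable of possible class values x_c.
    mt_log_prob: callable (x_c, x) -> log Q(x_c, x).
    """
    scores = {c: mt_log_prob(c, x) for c in classes}
    return max(scores, key=scores.get)
```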
Multi-CL in handwritten digit recognition

1. Feature extraction
The four basic configurations above are rotated in the four cardinal directions and applied to the characters in the six overlapped zones shown in the following figure.
Multi-CL in handwritten digit recognition
So we have 4 × 4 × 6 = 96-dimensional features.
Multi-CL in handwritten digit recognition
1. For a given pattern, we calculate the probability that this pattern belongs to each class (for digits we have 10 classes: 0, 1, 2, …, 9).
2. We choose the class with the maximum probability as the classification result.
Here the probability that the pattern "2" belongs to class 2 is the maximum, so we classify it as the digit 2.
Discussion
1. When all of the component trees have the same structure, the MT becomes a TANB model.
2. When $z$ is the class label, the MT becomes a CL model.
3. The MT is a generalization of the CL and Naive Bayesian models, so its performance is expected to be better than NB, TANB, and CL.