
Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks
Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 2001.
Summarized by Kyu-Baek Hwang
Abstract

- Feature selection for unsupervised learning of Gaussian networks
  - What does unsupervised learning mean for Bayesian networks?
  - Which features are good for the learning task?
- Assessment of the relevance of each feature to the learning process
  - How is the threshold for cutting off irrelevant features determined?
- Accelerates learning while still obtaining reasonable models
  - Two artificial data sets
  - Two benchmark data sets from the UCI repository
Unsupervised Learning of Conditional Gaussian Networks

- Data clustering → learning a probabilistic graphical model from unlabeled data
- Cluster membership → a hidden variable
- Conditional Gaussian networks (CGNs)
  - The cluster variable is an ancestor of all the other variables.
  - The joint probability distribution over all the other variables, given the cluster membership, is multivariate Gaussian.
- Feature selection in classification vs. feature selection in clustering
  - In clustering, all the features must eventually be considered in order to describe the domain.
Conditional Gaussian Distribution

- Data clustering
  - X = (Y, C) = (Y1, …, Yn, C)
- Conditional Gaussian distribution (a small sketch follows below)
  - The pdf of Y given C = c is multivariate normal,
    f(y | C = c) = N(y; μ(c), Σ(c)),
    whenever p(c) = p(C = c) > 0 and Σ(c) is positive definite.
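For concreteness, a minimal Python sketch of this definition; the two-cluster means, covariances, and prior p(C) below are made-up illustrative values, not taken from the paper:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-cluster conditional Gaussian model over Y = (Y1, Y2):
# for each value c of the cluster variable C, Y | C = c ~ N(mu[c], sigma[c]).
p_c = np.array([0.6, 0.4])                                # p(C = c) > 0 for both clusters
mu = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]   # positive definite covariances

def joint_density(y, c):
    """p(c) * f(y | C = c) for the conditional Gaussian distribution."""
    return p_c[c] * multivariate_normal.pdf(y, mean=mu[c], cov=sigma[c])

print(joint_density(np.array([0.5, -0.2]), c=0))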
Conditional Gaussian Networks

- Factorization of the conditional Gaussian distribution (written out below)
  - The conditional independencies among all the variables are encoded by the network structure s.
  - Each node has a local probability distribution given its parents.
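For concreteness, a sketch of this factorization in the standard linear-Gaussian parameterization; the symbols μ_i(c), b_ji(c), and v_i(c) are the usual CGN local parameters, named here for illustration rather than copied from the slides:

\[
f(y \mid c) = \prod_{i=1}^{n} f\bigl(y_i \mid \mathrm{pa}(y_i), c\bigr),
\qquad
f\bigl(y_i \mid \mathrm{pa}(y_i), c\bigr)
= \mathcal{N}\!\Bigl(\mu_i(c) + \sum_{y_j \in \mathrm{pa}(y_i)} b_{ji}(c)\,\bigl(y_j - \mu_j(c)\bigr),\; v_i(c)\Bigr)
\]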
An Example of CGNs
[Figure: a CGN structure in which the cluster variable C is a parent of every predictive variable]
Learning CGNs from Data

- Incomplete data set d
  [Figure: an N × (n + 1) data matrix over the n observed variables (O) and the hidden cluster variable (H)]
- Structural EM algorithm
Structural EM Algorithm
- Relaxed version (sketched below): the cluster posteriors p(c | y_i, ŝ_l) are computed from the current best model ŝ_l and reused when evaluating candidate structures s_h.
- Candidate structures are compared by their expected score on the completed data.
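A minimal sketch of the parametric E and M steps that sit inside such a loop, using a plain Gaussian mixture as a stand-in for a CGN; the structural search over candidate structures is omitted, and all names are illustrative:

import numpy as np
from scipy.stats import multivariate_normal

def em_step(Y, pi, mu, sigma):
    """One EM iteration for a Gaussian mixture: the E-step computes the cluster
    posteriors p(c | y_i) (the 'completions'); the M-step re-estimates parameters."""
    N, K = len(Y), len(pi)
    resp = np.zeros((N, K))
    for c in range(K):                                   # E-step
        resp[:, c] = pi[c] * multivariate_normal.pdf(Y, mean=mu[c], cov=sigma[c])
    resp /= resp.sum(axis=1, keepdims=True)
    Nk = resp.sum(axis=0)                                # M-step
    pi = Nk / N
    mu = [(resp[:, c] @ Y) / Nk[c] for c in range(K)]
    sigma = [((Y - mu[c]).T * resp[:, c]) @ (Y - mu[c]) / Nk[c] + 1e-6 * np.eye(Y.shape[1])
             for c in range(K)]
    return pi, mu, sigma, resp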
Scoring Metrics for the Structural Search

- The log marginal likelihood of the expected complete data
Feature Selection

- Large databases
  - Many instances
  - Many attributes
  - → Dimensionality reduction is required
- Select features based on some criterion.
  - The criterion differs according to the purpose of learning:
    learning speed, accurate predictions, and the comprehensibility of the learned models.
- Non-exhaustive search (the full space contains 2^n feature subsets)
  - Sequential selection (forward or backward); a forward-selection sketch follows this list.
  - Evolutionary, population-based, randomized search based on EDAs (estimation of distribution algorithms)
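A minimal sketch of sequential forward selection; a Gaussian-mixture held-out log-likelihood is used here as the selection criterion purely for illustration, which is an assumption rather than the paper's procedure:

import numpy as np
from sklearn.mixture import GaussianMixture

def forward_selection(X_train, X_test, n_clusters=3, max_features=5):
    """Greedy forward selection: repeatedly add the feature that most improves
    the held-out log-likelihood of a Gaussian mixture fitted on the selected set."""
    selected, remaining = [], list(range(X_train.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            gm = GaussianMixture(n_components=n_clusters, random_state=0).fit(X_train[:, cols])
            scores.append(gm.score(X_test[:, cols]))     # mean per-sample log-likelihood
        f_best = remaining[int(np.argmax(scores))]
        if max(scores) <= best_score:
            break                                        # no improvement: stop
        best_score = max(scores)
        selected.append(f_best)
        remaining.remove(f_best)
    return selected

Backward selection works symmetrically, starting from the full attribute set and removing the least useful feature at each step.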
Wrapper and Filter

- Wrapper
  - Feature subsets are tailored to the performance function of the learning process,
    e.g., predictive accuracy on the test data set.
- Filter
  - Based on intrinsic properties of the data set,
    e.g., the correlation between the class label and each attribute (a small illustration follows below) → supervised learning.
- Two problems in unsupervised learning
  - Absence of the class label → a different criterion is needed for feature selection.
  - No standard, accepted performance task → multiple predictive accuracy or class prediction.
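For the supervised filter case mentioned above, a minimal sketch of the per-attribute criterion (absolute Pearson correlation with a numeric class label); this is a generic illustration, not the paper's relevance measure:

import numpy as np

def filter_scores(X, y):
    """Absolute Pearson correlation between each attribute and the class label."""
    y = y - y.mean()
    Xc = X - X.mean(axis=0)
    corr = (Xc * y[:, None]).sum(axis=0) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(y) + 1e-12)
    return np.abs(corr)

# Keep the attributes whose score exceeds a chosen cutoff, e.g. 0.1.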
Feature Selection in Learning CGNs

- Data analysis (clustering) → description, not prediction
  - All the features are eventually necessary for the description.
- CGN learning with many features is a time-consuming task; hence the three-stage scheme sketched below:
  - Preprocessing: feature selection
  - Learning: CGNs over the selected features
  - Postprocessing: the remaining features are added back as conditionally independent given the cluster membership
- The goal → how to measure relevance
  - Fast learning time
  - Accuracy → log-likelihood of the test data
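A minimal sketch of this three-stage scheme, with sklearn's GaussianMixture standing in for the CGN learner and independent per-cluster Gaussians for the features added back in postprocessing; both substitutions are assumptions made for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_with_reduction(X, relevant, n_clusters=3):
    """Preprocess: keep only the 'relevant' columns.  Learn: fit the clustering
    model on them.  Postprocess: model every excluded column as an independent
    Gaussian per cluster, using the inferred cluster memberships."""
    gm = GaussianMixture(n_components=n_clusters, random_state=0).fit(X[:, relevant])
    resp = gm.predict_proba(X[:, relevant])              # soft cluster memberships
    excluded = [j for j in range(X.shape[1]) if j not in relevant]
    extra = {}
    for j in excluded:
        w = resp / resp.sum(axis=0)                      # per-cluster weights
        means = (w * X[:, [j]]).sum(axis=0)
        vars_ = (w * (X[:, [j]] - means) ** 2).sum(axis=0)
        extra[j] = (means, vars_)                        # per-cluster mean and variance
    return gm, extra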
Relevance

- Features that exhibit low correlation with the rest of the features can be considered irrelevant to the learning process.
  - They are modeled as conditionally independent of the rest, given the cluster membership.
- A first attempt at such relevance assessment in the continuous domain
Relevance Measure

- The relevance measure (a small sketch follows below)
  - Null hypothesis (edge exclusion test): the edge between Y_i and Y_j can be excluded, i.e., their partial correlation is zero.
  - Test statistic based on r^2_{ij|rest}, the squared sample partial correlation of Y_i and Y_j given the remaining variables.
  - Obtained from the maximum likelihood estimates (MLEs) of the elements of the inverse variance (precision) matrix.
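A minimal sketch of reading the sample partial correlations off the MLE of the inverse covariance matrix, together with an edge exclusion deviance of the form -N log(1 - r^2_{ij|rest}); that deviance form follows the standard edge exclusion test literature and is stated here as an assumption:

import numpy as np

def partial_correlations(Y):
    """Sample partial correlations r_{ij|rest} from the MLE precision matrix W."""
    N = Y.shape[0]
    S = np.cov(Y, rowvar=False, bias=True)     # MLE of the covariance matrix
    W = np.linalg.inv(S)                       # MLE of the inverse covariance matrix
    d = np.sqrt(np.diag(W))
    r = -W / np.outer(d, d)                    # r_{ij|rest} = -w_ij / sqrt(w_ii * w_jj)
    np.fill_diagonal(r, 1.0)
    deviance = -N * np.log(1.0 - r ** 2 + np.eye(len(d)))  # edge exclusion statistic
    return r, deviance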
Graphical Gaussian Models (1/2)

\[
f(x;\Sigma) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\{-\tfrac{1}{2}x^{T}\Sigma^{-1}x\},
\qquad \Omega = \Sigma^{-1},\quad w = \mathrm{vec}(\Omega)
\]

\[
l(x;w) = c + \tfrac{1}{2}\log|\Omega| - \tfrac{1}{2}\,\mathrm{tr}(\Omega\,xx^{T})
       = c + \tfrac{1}{2}\log|\Omega| - \tfrac{1}{2}\,w^{T}Js,
\qquad s = \mathrm{vec}(xx^{T})
\]

J: the diagonal matrix of coefficients

\[
U(w) = \frac{\partial l}{\partial w}
     = \tfrac{1}{2}\,\frac{\partial \log|\Omega|}{\partial w} - \tfrac{1}{2}\,Js
     = \tfrac{1}{2}\,J(\sigma - s),
\qquad \sigma = \mathrm{vec}(\Sigma)
\]
Graphical Gaussian Models (2/2)

\[
\frac{\partial U(w)}{\partial w^{T}} = -\tfrac{1}{2}\,J\,\frac{\partial \sigma}{\partial w^{T}},
\qquad \sigma:\ \text{the mean-value mapping}
\]

So,

\[
I^{-1} = 2\Bigl(J\,\frac{\partial \sigma}{\partial w^{T}}\Bigr)^{-1},
\qquad
\mathrm{cov}(\hat{w}_{ij},\hat{w}_{rs}) = \frac{1}{N}\,(w_{ir}w_{js} + w_{is}w_{jr})
\]

\[
r_{12|\mathrm{rest}} = -\hat{w}_{12}\,(\hat{w}_{11}\hat{w}_{22})^{-1/2}
\]
Relevance Threshold

- Distribution of the test statistic under the null hypothesis (a simple illustration follows below)
  - G(x): the pdf of a χ²_1 random variable
- A 5 percent test is used.
- Solving the resulting equation for the threshold → numerical optimization
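As a minimal illustration of such a cutoff, the simplest choice is the 95th percentile of the asymptotic χ²_1 distribution; the paper derives a sharper approximation to the null distribution, so the value below is only a rough stand-in:

from scipy.stats import chi2

# 95th percentile of the chi-square distribution with 1 degree of freedom (~3.84):
# edges (features) whose exclusion deviance falls below this value are treated
# as candidates for removal.
threshold = chi2.ppf(0.95, df=1)
print(threshold)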
Learning Scheme
Experimental Settings

- Model specifications
  - Tree augmented Naïve Bayes (TANB) models
  - Predictive attributes may have, at most, one other predictive attribute as a parent.
- An example
  [Figure: a TANB structure with the cluster variable C as parent of all predictive attributes]
Data Sets

- Synthetic data sets (4000 training : 1000 test cases)
  - TANB model with 25 (15:14[-1, 1]) attributes, (0, 4, 8), 1; C: uniform, (0, 1)
  - TANB model with 30 (15:14[-1, 1]) attributes, (0, 4, 8), 2; C: uniform, (0, 5)
- Waveform (artificial data, 4000 training : 1000 test cases)
  - 3 clusters, 40 attributes; the last 19 are noise attributes
- Pima
  - 768 cases (700 training : 68 test)
  - 8 attributes
Performance Criteria

- The log marginal likelihood of the training data
- The multiple predictive accuracy
  - A probabilistic version of the standard multiple predictive accuracy (a held-out log-likelihood sketch follows below)
- Runtime
  - 10 independent runs for the synthetic data sets and the waveform data
  - 50 independent runs for the Pima data
  - On a Pentium 366 machine
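A minimal sketch of the held-out log-likelihood criterion, again with a Gaussian mixture standing in for the learned CGN; the train/test roles mirror the splits listed in the Data Sets slide:

from sklearn.mixture import GaussianMixture

def heldout_loglik(train, test, n_clusters=3):
    """Fit the clustering model on training data and report the total
    log-likelihood it assigns to the held-out test data."""
    gm = GaussianMixture(n_components=n_clusters, random_state=0).fit(train)
    return gm.score(test) * len(test)      # score() is the mean per-sample log-likelihood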
Relevance Ranking
Likelihood Plots for Synthetic Data
Likelihood Plots for Real Data
Runtime
Automatic Dimensionality Reduction
Conclusions and Future Work

- Relevance assessment for feature selection in unsupervised learning in the continuous domain
- Reasonable learning performance
- Extension to the categorical domain
- The redundant-feature problem
- Relaxation of the model structure
- More realistic data sets