Probabilistic Clustering-Projection Model for Discrete Data


Probabilistic Clustering-Projection Model for Discrete Data
Shipeng Yu (1,2), Kai Yu (2), Volker Tresp (2), Hans-Peter Kriegel (1)
(1) Institute for Computer Science, University of Munich
(2) Siemens Corporate Technology, Munich, Germany
October 2005
Outline

- Motivation
- Previous Work
- The PCP Model
- Learning in PCP Model
- Experiments
- Conclusion and Future Work
Motivation

- We model discrete data in this work
  - A fundamental problem for data mining and machine learning
  - In "bag-of-words" document modelling: document-word pairs
  - In collaborative filtering: item-rating pairs

[Figure: example document-word occurrence matrix, documents as rows, words as columns]

        w1  w2  ...  wV
  d1     2   0  ...   1
  d2     0   1  ...   2
  ...
  dD     1   1  ...   0

- Properties
  - The data can be described as a big matrix with integer entries
  - The data matrix is normally very sparse (>90% of the entries are zeros)
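As a concrete (illustrative) sketch of this kind of data, here is a tiny sparse count matrix in Python using SciPy; the numbers are toy values, not from the talk:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-word count matrix: rows are documents, columns are words,
# entries are occurrence counts; most entries are zero.
counts = csr_matrix(np.array([
    [2, 0, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 0, 0],
]))

sparsity = 1.0 - counts.nnz / np.prod(counts.shape)
print(f"{sparsity:.0%} of the entries are zeros")
```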
Data Clustering

- Goal: Group similar documents together
- For continuous data: distance-based similarity (k-means)
  - Iteratively minimize a distance-based cost function
  - Equivalent to a Gaussian mixture model
- For discrete data: occurrence-based similarity
  - Similar documents should have similar occurrences of words
  - No Gaussianity holds for discrete data

[Figure: the document-word count matrix again, with documents grouped by similar word occurrences]
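For the continuous-data case above, a small illustrative sketch (scikit-learn assumed) showing k-means and its Gaussian-mixture reading on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three synthetic Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

# k-means iteratively minimizes the within-cluster squared distances ...
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ... which corresponds to a Gaussian mixture with equal spherical covariances.
gmm = GaussianMixture(n_components=3, covariance_type='spherical', random_state=0).fit(X)
gmm_labels = gmm.predict(X)
```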
Data Projection

- Goal: Find a low-dimensional feature mapping
- For continuous data: Principal Component Analysis
  - Find orthogonal dimensions to explain the data covariance
- For discrete data: topic detection
  - Topics explain the co-occurrences of words
  - Topics are not orthogonal, but independent

[Figure: the document-word matrix mapped onto a low-dimensional topic space z1 ... zK]
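A short illustrative sketch of both kinds of projection on toy data (scikit-learn assumed; its LDA implementation stands in for a generic topic model):

```python
import numpy as np
from sklearn.decomposition import PCA, LatentDirichletAllocation

rng = np.random.default_rng(0)

# Continuous data: PCA finds orthogonal directions explaining the covariance.
X = rng.normal(size=(50, 20))                      # 50 points, 20 continuous features
X_low = PCA(n_components=5).fit_transform(X)

# Discrete data: a topic model maps documents onto K topics that explain
# word co-occurrences; the topics are not orthogonal.
counts = rng.poisson(0.3, size=(50, 200))          # toy document-word count matrix
doc_topics = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(counts)
print(X_low.shape, doc_topics.shape)
```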
Projection versus Clustering

- They are normally modelled separately
- But why not jointly?
  - More informative projection → better document clusters
  - Better clustering structure → better projection for words
  - There should be a stable situation

[Figure: document clusters and word topics z1 ... zK sitting jointly on the same document-word matrix]

- And how? The PCP model
  - Well-defined generative model for the data
  - Standard ways for learning and inference
  - Generalizable to new data
Previous Work for Discrete Data

- Projection model
  - PLSI [Hofmann 99]
    - First topic model
    - Not a well-defined generative model
  - LDA [Blei et al 03]
    - State-of-the-art topic model
    - Generalizes PLSI with a Dirichlet prior
    - No clustering effect is modelled
- Clustering model
  - NMF [Lee & Seung 99]
    - Factorizes the data matrix
    - Can be explained as a clustering model
    - No projection of words is directly modelled
- Joint projection-clustering model
  - Two-sided clustering [Hofmann & Puzicha 98]: same problem as PLSI
  - Discrete-PCA [Buntine & Perttu 03]: similar to LDA in spirit
  - TTMM [Keller & Bengio 04]: lacks a full Bayesian explanation
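As an aside, a minimal sketch of the NMF-as-clustering reading mentioned above (scikit-learn assumed, toy counts): the factorization X ≈ WH yields document loadings W whose largest entry per row can be read as a soft cluster assignment, while H holds per-factor word weights.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(50, 200)).astype(float)  # toy document-word counts

nmf = NMF(n_components=4, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(counts)      # document loadings (factor memberships per document)
H = nmf.components_                # factor-word weights

clusters = W.argmax(axis=1)        # the "clustering" reading of the factorization
```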
PCP Model: Overview

- Probabilistic Clustering-Projection Model
  - A probabilistic model for discrete data
  - A clustering model using projected features
  - A projection model with structural data
- Learning in the PCP model: variational EM
  - Exactly equivalent to iteratively performing clustering and projection operations
  - Guaranteed convergence
PCP Model: Sampling Process

[Figure: graphical sampling process. α → Dirichlet → cluster centers θ1 ... θM; λ → Dirichlet → cluster weights π → Multinomial → each document's cluster θd; the document's cluster → Multinomial → word topics z_{d,1} ... z_{d,Nd}; topics together with the projection weights β → Multinomial → words w_{d,1} ... w_{d,Nd}. V words, D documents, M clusters, K topics.]

- Clustering model using projected features
- Projection model with structural data
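A minimal NumPy sampler for this generative story as read from the slide; the sizes, hyperparameter values, and the fixed document length are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, M, K = 1000, 50, 4, 10      # words, documents, clusters, topics (illustrative sizes)
alpha, lam = 0.5, 1.0             # symmetric Dirichlet hyperparameters (assumed values)
doc_len = 80                      # words per document (fixed here for simplicity)

beta = rng.dirichlet(np.full(V, 0.1), size=K)    # projection weights: K topic-word distributions
theta = rng.dirichlet(np.full(K, alpha), size=M) # cluster centers: M distributions over topics
pi = rng.dirichlet(np.full(M, lam))              # cluster weights

docs = []
for d in range(D):
    c_d = rng.choice(M, p=pi)                          # sample the document's cluster
    z = rng.choice(K, size=doc_len, p=theta[c_d])      # sample a topic for every word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # sample words from the chosen topics
    docs.append(w)
```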
PCP Model: Plate Model

[Figure: plate diagram distinguishing model parameters, latent variables, and observations; the likelihood factors into a clustering model part and a projection model part]
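The likelihood equation on this slide did not survive extraction. Under my reading of the sampling process above (a hedged reconstruction, not the slide's own formula), the complete-data likelihood would factor as:

```latex
p(\mathbf{W}, \mathbf{c}, \mathbf{z}, \pi, \boldsymbol{\theta} \mid \alpha, \lambda, \beta)
  = p(\pi \mid \lambda)\,
    \prod_{m=1}^{M} p(\theta_m \mid \alpha)\,
    \prod_{d=1}^{D} \Big[ p(c_d \mid \pi)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_{c_d})\, p(w_{d,n} \mid z_{d,n}, \beta) \Big]
```

where, roughly, the factors involving π, θ and c belong to the clustering model and the word terms governed by β to the projection model.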
Learning in PCP Model

- We are interested in the posterior distribution
  - The integral is intractable
- Variational EM learning
  - Approximate the posterior with a variational distribution q
    (a product of Dirichlet, Multinomial, Multinomial and Dirichlet factors with free variational parameters)
  - Minimize the KL-divergence D_KL(q || p̂)
  - Variational E-step: minimize w.r.t. the variational parameters
  - Variational M-step: minimize w.r.t. the model parameters
  - Iterate until convergence
Update Equations

[Update equations omitted: clustering updates and projection updates]

- The equations separate into clustering updates and projection updates
- Variational EM learning corresponds to iteratively performing clustering and projection until convergence (see the sketch below)
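A schematic of that alternating structure in Python; the two update functions are deliberately stubs standing in for the actual variational update equations of the next slides, so this only illustrates the control flow:

```python
import numpy as np

def clustering_updates(counts, resp, phi, beta):
    # Stand-in for the clustering updates: soft cluster assignments P(c_d = m),
    # cluster centers, and cluster weights.
    return resp

def projection_updates(counts, resp, phi, beta):
    # Stand-in for the projection updates: word projections P(z = k) and
    # the topic-word projection matrix beta.
    return phi, beta

def pcp_variational_em(counts, M=4, K=10, n_iter=50):
    """Alternate clustering and projection steps, as the slide describes."""
    D, V = counts.shape
    rng = np.random.default_rng(0)
    resp = rng.dirichlet(np.ones(M), size=D)   # soft cluster assignments per document
    phi = rng.dirichlet(np.ones(K), size=V)    # per-word topic responsibilities (simplified)
    beta = rng.dirichlet(np.ones(V), size=K)   # topic-word projection matrix
    for _ in range(n_iter):
        resp = clustering_updates(counts, resp, phi, beta)       # "clustering step"
        phi, beta = projection_updates(counts, resp, phi, beta)  # "projection step"
    return resp, phi, beta
```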
Clustering Updates

- Update the soft cluster assignments P(c_d = m)
- Update the cluster centers
- Update the cluster weights

[Update equations omitted; the slide annotates them with a sufficient projection term, prior terms, and likelihood terms]
Projection Updates

- Update the word projections P(z_{d,n} = k)
- Update the projection matrix

[Update equations omitted; the slide annotates them with a sufficient clustering term and an empirical estimate]
PCP Learning Algorithm

[Figure: algorithm diagram in which the clustering updates and the projection updates exchange sufficient clustering and projection terms until convergence]
Experiments

- Methodology
  - Document Modelling: compare model generalization
  - Word Projection: evaluate the topic space
  - Document Clustering: evaluate clustering results
- Data sets
  - 5 categories of Reuters-21578: 3948 docs, 7665 words
  - 4 categories of 20Newsgroup: 3888 docs, 8396 words
- Preprocessing (see the sketch below)
  - Stemming and stop-word removal
  - Keep only words that occur in at least 5 documents
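A hedged preprocessing sketch with scikit-learn; the exact 20Newsgroup category names are my assumption for the Car/Bike/Baseball/Hockey groups of the case study, and stemming is omitted for brevity:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mapping of the Car/Bike/Baseball/Hockey groups to 20Newsgroup categories.
cats = ['rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']
raw = fetch_20newsgroups(subset='all', categories=cats,
                         remove=('headers', 'footers', 'quotes'))

# Stop-word removal and a minimum document frequency of 5;
# stemming (e.g. with nltk.stem.PorterStemmer) would be applied to raw.data first.
vectorizer = CountVectorizer(stop_words='english', min_df=5)
counts = vectorizer.fit_transform(raw.data)   # sparse document-word count matrix
labels = raw.target                           # category labels, used in the later sketches
print(counts.shape)
```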
Case Study

- Run on a 4-group subset of the 20Newsgroup data: Car, Bike, Baseball, Hockey
Exp1: Document Modelling

- Goal: Evaluate generalization performance
- Methods to compare
  - PLSI: a "pseudo" form for generalization
  - LDA: state-of-the-art method
- Metric: perplexity
  Perp(D_test) = exp( - Σ_d ln p(w_d) / Σ_d |w_d| )
- 90% of the data for training, 10% for testing
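The perplexity metric is easy to compute once per-document log-likelihoods ln p(w_d) are available from a trained model; a small sketch with made-up numbers:

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Perp(D_test) = exp( - sum_d ln p(w_d) / sum_d |w_d| ).

    log_likelihoods: array of ln p(w_d) for each held-out document
    doc_lengths:     array of word counts |w_d| per document
    """
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))

# Example with made-up numbers for two test documents.
print(perplexity(np.array([-350.0, -410.0]), np.array([80, 95])))
```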
Exp2: Word Projection

- Goal: Evaluate the projection matrix β
- Methods to compare: PLSI, LDA
- Train SVMs on the 10-dimensional space after projection
- Test classification accuracy on held-out data

[Figures omitted: classification accuracy on Reuters and Newsgroup]
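A sketch of this evaluation protocol, reusing counts and labels from the preprocessing sketch above; scikit-learn's LDA stands in for the learned projection, since PCP itself is not implemented here:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Project the count matrix onto a 10-dimensional topic space.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
features = lda.fit_transform(counts)

# Train an SVM on the projected features and test on held-out documents.
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.1, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))
```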
Exp3: Document Clustering

- Goal: Evaluate clustering of documents
- Methods to compare
  - NMF: factorization for clustering
  - LDA + k-means: clustering on the projected space
- Metric: normalized mutual information
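A sketch of the LDA + k-means baseline and the NMI metric, reusing the 10-dimensional features and labels from the previous sketch (scikit-learn assumed):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# LDA + k-means baseline: cluster the projected features, then compare the
# predicted clusters against the true category labels with normalized MI.
pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(normalized_mutual_info_score(labels, pred))
```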
Conclusion

- PCP is a well-defined generative model
- PCP models clustering and projection jointly
- Learning in PCP corresponds to an iterative process of clustering and projection
- PCP learning guarantees convergence
- Future work
  - Large-scale experiments
  - Build a probabilistic model with more factors
Thank you!
Questions?