Probabilistic Clustering-Projection Model for Discrete Data
Shipeng Yu (1,2), Kai Yu (2), Volker Tresp (2), Hans-Peter Kriegel (1)
(1) Institute for Computer Science, University of Munich
(2) Siemens Corporate Technology, Munich, Germany
October 2005
Outline
Motivation
Previous Work
The PCP Model
Learning in PCP Model
Experiments
Conclusion and Future Work
Motivation
In this work we model discrete data
A fundamental problem in data mining and machine learning
In “bag-of-words” document modelling: document-word pairs
In collaborative filtering: item-rating pairs
Documents vs. words (entries are word occurrence counts):

        w1   w2   ...   wV
  d1     2    0   ...    1
  d2     0    1   ...    2
  ...
  dD     1    1   ...    0

Properties
The data can be described as a big matrix with integer entries
The data matrix is normally very sparse (typically >90% of the entries are zero)
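As a minimal illustration of this representation (a sketch assuming Python with scipy; the toy counts below are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-word count matrix: rows are documents, columns are words,
# entries are occurrence counts. Real corpora are far larger and sparser,
# so a compressed sparse row format is the natural choice.
counts = csr_matrix(np.array([
    [2, 0, 1],   # d1
    [0, 1, 2],   # d2
    [1, 1, 0],   # dD
]))

sparsity = 1.0 - counts.nnz / np.prod(counts.shape)
print(f"{sparsity:.0%} of the entries are zero")
```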
Data Clustering
Goal: Group similar documents together
For continuous data: Distance-based similarity (k-means)
Iteratively minimize a distance-based cost function
Equivalent to a Gaussian mixture model
For discrete data: Occurrence-based similarity
[Document-word count matrix, as shown above]
Similar documents should have similar occurrences of words
Gaussian assumptions do not hold for discrete data
Data Projection
Goal: Find a low-dimensional feature mapping
For continuous data: Principal Component Analysis
Find orthogonal dimensions to explain data covariance
For discrete data: Topic detection
[Document-word matrix mapped to a low-dimensional topic space z1, ..., zK]
Topics explain the co-occurrences of words
Unlike principal components, topics are not orthogonal, but independent
Projection versus Clustering
They are normally modelled separately
But why not jointly?
A more informative projection leads to better document clusters
A better clustering structure leads to a better projection for words
There should be a stable situation
[Document-word matrix with document clusters and topics z1, ..., zK modelled jointly]
And how? PCP Model
Well-defined generative model for the data
Standard ways for learning and inference
Generalizable to new data
Previous Work for Discrete Data
Projection models
PLSI [Hofmann 99]
First topic model
Not a well-defined generative model
LDA [Blei et al 03]
State-of-the-art topic model
Generalizes PLSI with a Dirichlet prior
No clustering effect is modelled
Clustering models
NMF [Lee & Seung 99]
Factorizes the data matrix
Can be interpreted as a clustering model
No projection of words is directly modelled
Joint projection-clustering models
Two-sided clustering [Hofmann & Puzicha 98]: same problem as PLSI
Discrete-PCA [Buntine & Perttu 03]: similar to LDA in spirit
TTMM [Keller & Bengio 04]: lacks a full Bayesian explanation
PCP Model: Overview
Probabilistic Clustering-Projection Model
A probabilistic model for discrete data
A clustering model using projected features
A projection model for structured data
Learning in PCP model: Variational EM
Exactly equivalent to iteratively performing
clustering and projection operations
Guaranteed convergence
PCP Model: Sampling Process
[Graphical model of the sampling process, reconstructed from the figure labels; V words, D documents, M clusters, K topics]
Sample cluster weights π from Dirichlet(λ)
Sample each cluster center θ_1, ..., θ_M from Dirichlet(α); each center is a distribution over the K topics
For each document d = 1, ..., D: sample its cluster c_d from Multinomial(π)
For each word position n = 1, ..., N_d: sample a topic z_{d,n} from Multinomial(θ_{c_d}), then sample the word w_{d,n} from Multinomial(β_{z_{d,n}}), where β holds the topic-word projection weights
Clustering model using projected features; projection model for structured data
PCP Model: Plate Model
[Plate model: model parameters α, λ, β; latent variables π, θ, c, z; observations w]
The likelihood factorizes into a clustering model and a projection model
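A hedged reconstruction of that likelihood, using the notation of the sampling process above (the grouping into the two labelled sub-models follows the plate; the exact form should be checked against the paper):

```latex
p(\mathbf{w}, \mathbf{z}, \mathbf{c}, \boldsymbol{\theta}, \boldsymbol{\pi}
  \mid \alpha, \lambda, \boldsymbol{\beta})
= \underbrace{p(\boldsymbol{\pi} \mid \lambda)
    \prod_{m=1}^{M} p(\boldsymbol{\theta}_m \mid \alpha)
    \prod_{d=1}^{D} p(c_d \mid \boldsymbol{\pi})}_{\text{clustering model}}
  \;
  \underbrace{\prod_{d=1}^{D} \prod_{n=1}^{N_d}
    p(z_{d,n} \mid \boldsymbol{\theta}_{c_d})\,
    p(w_{d,n} \mid z_{d,n}, \boldsymbol{\beta})}_{\text{projection model}}
```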
Learning in PCP Model
We are interested in the posterior distribution
The integral is intractable
Variational EM learning
Approximate the posterior with a variational distribution
q factorizes into Dirichlet factors (for the cluster weights π and the cluster centers θ_m) and Multinomial factors (for the cluster assignments c_d and the word topics z_{d,n}); see the sketch below
Minimize the KL divergence D_KL(q || p) to the true posterior
Variational E-step: Minimize w.r.t. variational parameters
Variational M-step: Minimize w.r.t. model parameters
Iterate until convergence
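A sketch of the factorized q suggested by these factor types (the variational parameter symbols η, γ, φ, ψ are illustrative, not necessarily the paper's notation):

```latex
q(\boldsymbol{\pi}, \boldsymbol{\theta}, \mathbf{c}, \mathbf{z})
= \underbrace{q(\boldsymbol{\pi} \mid \boldsymbol{\eta})}_{\text{Dirichlet}}
  \prod_{m=1}^{M}
  \underbrace{q(\boldsymbol{\theta}_m \mid \boldsymbol{\gamma}_m)}_{\text{Dirichlet}}
  \prod_{d=1}^{D}
  \underbrace{q(c_d \mid \boldsymbol{\phi}_d)}_{\text{Multinomial}}
  \prod_{d=1}^{D} \prod_{n=1}^{N_d}
  \underbrace{q(z_{d,n} \mid \boldsymbol{\psi}_{d,n})}_{\text{Multinomial}}
```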
Update Equations
The update equations separate into clustering updates and projection updates
Variational EM learning thus corresponds to iteratively performing clustering and projection until convergence
Clustering Updates
Update soft cluster assignments P(c_d = m): combines the sufficient projection term with a prior term
Update cluster centers: a prior term plus a likelihood term
Update cluster weights: a likelihood term
Projection Updates
Update word projections P(z_{d,n} = k), using the sufficient clustering term
Update the projection matrix β from an empirical estimate
PCP Learning Algorithm
[Algorithm flowchart: the clustering updates pass the sufficient clustering term to the projection updates, which pass the sufficient projection term back; the two blocks iterate until convergence; see the sketch below]
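A structural sketch of this loop in Python (the helpers `initialize`, `clustering_updates`, `projection_updates`, and `variational_bound` are hypothetical stand-ins for the paper's update equations, not its actual implementation):

```python
import numpy as np

def pcp_variational_em(counts, M, K, tol=1e-4, max_iter=100):
    """Alternate clustering and projection updates until the variational
    lower bound stops improving. All helpers are hypothetical stand-ins."""
    state = initialize(counts, M, K)   # variational and model parameters
    prev_bound = -np.inf
    for _ in range(max_iter):
        # Clustering updates: soft assignments, cluster centers and weights,
        # consuming the sufficient projection term from the previous pass.
        state = clustering_updates(counts, state)
        # Projection updates: word-topic posteriors and projection matrix,
        # consuming the sufficient clustering term just computed.
        state = projection_updates(counts, state)
        bound = variational_bound(counts, state)
        if bound - prev_bound < tol:   # the bound never decreases, so
            break                      # convergence is guaranteed
        prev_bound = bound
    return state
```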
Experiments
Methodology
Document Modelling: compare model generalization
Word Projection: evaluate the topic space
Document Clustering: evaluate clustering results
Data sets
5 categories of Reuters-21578: 3948 docs, 7665 words
4 categories of 20Newsgroup: 3888 docs, 8396 words
Preprocessing
Stemming and stop-word removal
Keep only words that occur in at least 5 documents
Case Study
Run on a 4-group subset of the 20Newsgroup data: Car, Bike, Baseball, Hockey
Exp1: Document Modelling
Goal: Evaluate generalization performance
Methods to compare
PLSI: A “pseudo” form for generalization
LDA: State-of-the-art method
Metric: perplexity
Perp(D_test) = exp(−Σ_d ln p(w_d) / Σ_d |w_d|); lower is better (see the sketch below)
90% for training and 10% for testing
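A sketch of that computation (assuming `log_probs[d]` holds ln p(w_d) for each test document and `doc_lengths[d]` its word count |w_d|):

```python
import numpy as np

def perplexity(log_probs, doc_lengths):
    # Perp(D_test) = exp(-sum_d ln p(w_d) / sum_d |w_d|);
    # lower perplexity means better generalization.
    return np.exp(-np.sum(log_probs) / np.sum(doc_lengths))
```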
Exp2: Word Projection
Goal: Evaluate the projection matrix β
Methods to compare: PLSI, LDA
We train SVMs on the 10-dimensional space after projection
Test classification accuracy on held-out data (see the sketch below)
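A sketch of this evaluation with scikit-learn (assuming `X_topics` holds the 10-dimensional projected documents and `y` their category labels; the split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X_topics: (n_docs, 10) documents in topic space; y: category labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_topics, y, test_size=0.2, random_state=0)

clf = SVC(kernel="linear").fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```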
[Accuracy plots: Reuters, Newsgroup]
Exp3: Document Clustering
Goal: Evaluate clustering for documents
Methods to compare
NMF: factorize the data matrix for clustering
LDA + k-means: cluster documents in the projected topic space
Metric: normalized mutual information
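This metric is available in scikit-learn, so the evaluation could look like (assuming `true_labels` are the known categories and `cluster_labels` the inferred assignments):

```python
from sklearn.metrics import normalized_mutual_info_score

# NMI is 1 for a perfect match with the true categories,
# 0 for statistically independent clusterings.
nmi = normalized_mutual_info_score(true_labels, cluster_labels)
print(f"NMI = {nmi:.3f}")
```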
Conclusion
PCP is a well-defined generative model
PCP models clustering and projection jointly
Learning in PCP corresponds to an iterative
process of clustering and projection
PCP learning guarantees convergence
Future work
Large-scale experiments
Build a probabilistic model with more factors
Thank you!
Questions?