Transcript pptx

Unsupervised Learning
Reading:
Chapter 8 from Introduction to Data Mining by Tan,
Steinbach, and Kumar, pp. 487-515, 532-541, 546-552
(http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
Unsupervised learning
= No labels on training examples!
Main approach: Clustering
Example: Optdigits data set
Optdigits features
[Figure: an 8x8 Optdigits digit image, with its pixel features labeled f1, f2, ..., f64]
x = (f1, f2, ..., f64)
  = (0, 2, 13, 16, 16, 16, 2, 0, 0, ...)
Etc.
Partitional Clustering of Optdigits
[Figure: Optdigits instances plotted along Feature 1, Feature 2, and Feature 3, grouped into disjoint clusters; the actual clustering is done in the full 64-dimensional feature space]
Hierarchical Clustering of Optdigits
[Figure: Optdigits instances plotted along Feature 1, Feature 2, and Feature 3, grouped into nested clusters of clusters; the actual clustering is done in the full 64-dimensional feature space]
Issues for clustering algorithms
• How to measure distance between pairs of instances?
• How many clusters to create?
• Should clusters be hierarchical? (I.e., clusters of clusters)
• Should clustering be “soft”? (I.e., an instance can belong to
different clusters, with “weighted belonging”)
Most commonly used (and simplest)
clustering algorithm:
K-Means Clustering
Adapted from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials
K-means clustering algorithm
Typically, use the mean of the points in a cluster as the centroid.
Distance metric: chosen by the user. For numerical attributes, the L2 (Euclidean) distance is often used:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

Here the centroid of a cluster refers to the mean of the points in the cluster.
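To make the procedure concrete, here is a minimal NumPy sketch of K-means under the assumptions above (Euclidean distance, mean-of-points centroids). The function and variable names are illustrative, not from the slides.

```python
# Minimal K-means sketch: assign points to the nearest centroid, then move
# each centroid to the mean of its assigned points, until convergence.
import numpy as np

def kmeans(X, K, n_iters=100, rng=np.random.default_rng(0)):
    # Initialize centroids by picking K distinct training instances.
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid under L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid = mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```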
Example: Image segmentation by K-means clustering by color
From http://vitroz.com/Documents/Image%20Segmentation.pdf
K=5, RGB space
K=10, RGB space
Clustering text documents
• A text document is represented as a feature vector of word frequencies
• Similarity between two documents is measured by the cosine of the angle between their corresponding feature vectors.
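As a small illustration (a sketch, not from the slides), the cosine similarity between two word-frequency vectors can be computed as follows; the two example "documents" are made up.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word-frequency vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d1 = np.array([2, 0, 1, 0, 3])    # word counts over a 5-word vocabulary
d2 = np.array([1, 1, 0, 0, 2])
print(cosine_similarity(d1, d2))  # about 0.87: fairly similar documents
```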
Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over
two million articles.
Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the
Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018029
Exercise 1
How to evaluate clusters produced by K-means?
• Unsupervised evaluation
• Supervised evaluation
Unsupervised Cluster Evaluation
We don't know the classes of the data instances.
Let C denote a clustering (i.e., the set of K clusters produced by a clustering algorithm), let c denote a cluster in C, and let |c| denote the number of elements in c.
We want to minimize the distance between the elements of c and the centroid μ_c, i.e., maximize the coherence of each cluster c by minimizing its mean square error (mse):

mse(c) = \frac{1}{|c|} \sum_{x \in c} d(x, \mu_c)^2

\text{average mse}(C) = \frac{\sum_{c \in C} mse(c)}{K}

Note: The assigned reading uses sum square error rather than mean square error.
Unsupervised Cluster Evaluation
We don't know the classes of the data instances.
We also want to maximize the pairwise separation of the clusters. That is, maximize the Mean Square Separation (mss), the average squared distance between pairs of distinct cluster centroids:

\text{mss}(C) = \frac{1}{K(K-1)/2} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} d(\mu_i, \mu_j)^2
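A minimal sketch of both unsupervised measures, assuming the definitions above (and that no cluster is empty); the function names are illustrative.

```python
import numpy as np

def average_mse(X, labels, centroids):
    # Mean over clusters of the mean squared distance to the cluster centroid.
    K = len(centroids)
    mses = [np.mean(np.sum((X[labels == k] - centroids[k]) ** 2, axis=1))
            for k in range(K)]
    return np.mean(mses)

def mean_square_separation(centroids):
    # Average squared distance over all pairs of distinct centroids.
    K = len(centroids)
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    return np.mean([np.sum((centroids[i] - centroids[j]) ** 2) for i, j in pairs])
```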
Exercises 2-3
Supervised Cluster Evaluation
Suppose we know the classes of the data instances
Entropy of a cluster: the degree to which a cluster consists of objects of a single class.

entropy(c_i) = -\sum_{j=1}^{|Classes|} p_{i,j} \log_2 p_{i,j}

where p_{i,j} is the probability that a member of cluster i belongs to class j:

p_{i,j} = \frac{m_{i,j}}{m_i}

where m_{i,j} is the number of instances in cluster i with class j, and m_i is the number of instances in cluster i.
Mean entropy of a clustering: the size-weighted average entropy over all clusters in the clustering,

\text{mean entropy}(C) = \sum_{i=1}^{K} \frac{m_i}{m} \, entropy(c_i)

where m_i is the number of instances in cluster i and m is the total number of instances in the dataset. We want to minimize mean entropy.
Entropy Example
Suppose there are 3 classes: 1, 2, 3.

Cluster 1: 1 2 1 3 1 1 3
Cluster 2: 2 3 3 3 2 3
Cluster 3: 1 1 3 2 2 3 2

entropy(c_1) = -\left( \frac{4}{7}\log_2\frac{4}{7} + \frac{1}{7}\log_2\frac{1}{7} + \frac{2}{7}\log_2\frac{2}{7} \right) = 1.37

entropy(c_2) = -\left( 0 + \frac{2}{6}\log_2\frac{2}{6} + \frac{4}{6}\log_2\frac{4}{6} \right) = 0.91

entropy(c_3) = -\left( \frac{2}{7}\log_2\frac{2}{7} + \frac{3}{7}\log_2\frac{3}{7} + \frac{2}{7}\log_2\frac{2}{7} \right) = 1.54

\text{mean entropy}(C) = \frac{7}{20}(1.37) + \frac{6}{20}(0.91) + \frac{7}{20}(1.54) \approx 1.29
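The example above can be checked with a short sketch (illustrative code, not from the slides):

```python
import numpy as np

def cluster_entropy(class_labels):
    # Entropy of one cluster, given the class labels of its members.
    _, counts = np.unique(class_labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

clusters = [[1, 2, 1, 3, 1, 1, 3], [2, 3, 3, 3, 2, 3], [1, 1, 3, 2, 2, 3, 2]]
m = sum(len(c) for c in clusters)
mean_entropy = sum(len(c) / m * cluster_entropy(np.array(c)) for c in clusters)
print(mean_entropy)  # about 1.29
```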
Exercise 4
Adapted from Bing Liu, UIC
http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
Issues for K-means
• The algorithm is only applicable if the mean is defined.
– For categorical data, use K-modes: the centroid is represented by the most frequent values.
• The user needs to specify K.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very different values.
Issues for K-means: Problems with outliers
Dealing with outliers
• One method is to remove, during the clustering process, data points that are much farther away from the centroids than other data points.
– Expensive
– Not always a good idea!
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data points, the
chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification
Issues for K-means (cont …)
• The algorithm is sensitive to initial seeds.
[Figure: two poorly chosen initial seeds (+) produce a poor clustering]
• If we use different seeds: good results
[Figure: two well-chosen initial seeds (+) produce a good clustering]
Often can improve K-means results by doing several random restarts.
Often useful to select instances from the data as initial seeds.
Issues for K-means (cont …)
• The K-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
Other Issues
• What if a cluster is empty?
– Choose a replacement centroid
• At random, or
• From cluster that has highest mean square error
• How to choose K ?
• The assigned reading discusses several methods for
improving a clustering with “postprocessing”.
Choosing the K in K-Means
• Hard problem! Often no “correct” answer for unlabeled data
• Many proposed methods! Here are a few:
• Try several values of K, see which is best, via cross-validation.
– Metrics: mean square error, mean square separation, penalty for too
many clusters [why?]
• Start with K = 2. Then try splitting each cluster.
– New means are one sigma away from cluster center in direction of greatest
variation.
– Use similar metrics to above.
• “Elbow” method:
– Plot average mse (or SSE) vs. K. Choose the K at which the SSE (or other metric) stops decreasing abruptly – the “elbow” of the curve.
– However, sometimes there is no clear “elbow”.
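A minimal sketch of the elbow method using scikit-learn's KMeans (its inertia_ attribute is the SSE of the fitted clustering); the range of K values is illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_values=range(1, 11)):
    # SSE (inertia) for each K; look for the K where the curve bends.
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in k_values]
    plt.plot(list(k_values), sse, marker='o')
    plt.xlabel('K')
    plt.ylabel('SSE')
    plt.show()
```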
Homework 5
Quiz 4 Review
Soft Clustering
with Gaussian Mixture Models
Soft Clustering with Gaussian mixture models
• A “soft”, generative version of K-means clustering.
• Given: Training set S = {x1, ..., xN}, and K.
• Assumption: Data is generated by sampling from a “mixture” (linear combination) of K Gaussians.
Gaussian Mixture Models
Assumptions
• K clusters
• Each cluster is modeled by a Gaussian distribution with a certain mean and
standard deviation (or covariance). [This contrasts with K-means, in
which each cluster is modeled only by a mean.]
• Assume that each data instance we have was generated by the following
procedure:
1. Select cluster ci with probability P(ci) = πi
2. Sample point from ci’s Gaussian distribution
Mixture of three Gaussians
(one dimensional data)
p(x) = \pi_1 N(x \mid \mu_1, \sigma_1) + \pi_2 N(x \mid \mu_2, \sigma_2) + \pi_3 N(x \mid \mu_3, \sigma_3)
where \pi_1 + \pi_2 + \pi_3 = 1
Clustering via finite Gaussian mixture models
• Clusters: Each cluster will correspond to a single Gaussian. Each point x
 S will have some probability distribution over the K clusters.
• Goal: Given the data, find the Gaussians! (And their probabilities πi .)
I.e., find parameters {θK} of these K Gaussians such that P(S | {θK}) is maximized.
• This is called a Maximum Likelihood method.
– S is the data
– {θK} is the “hypothesis” or “model”
– P(S | {θK}) is the “likelihood”.
General form of one-dimensional (univariate)
Gaussian Mixture Model
p(x) = \sum_{i=1}^{K} \pi_i N(x \mid \mu_i, \sigma_i), \quad \text{where } \sum_{i=1}^{K} \pi_i = 1
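The generative procedure (pick a component with probability π_i, then sample from its Gaussian) can be sketched directly; the mixture parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_gmm(n):
    # 1. Choose a component for each sample with probability pi[k].
    k = rng.choice(len(pi), size=n, p=pi)
    # 2. Sample from that component's Gaussian.
    return rng.normal(mu[k], sigma[k])

x = sample_gmm(1000)   # draws from the three-component mixture
```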
Learning a GMM
Simple Case:
Maximum Likelihood for Single Univariate Gaussian
• Assume training set S has N values generated by a univariate Gaussian distribution:

S = \{x_1, ..., x_N\}, \quad \text{where} \quad N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
• Likelihood function: probability of the data given the model (or the parameters of the model):

p(S \mid \mu, \sigma) = \prod_{i=1}^{N} N(x_i \mid \mu, \sigma)

• How to estimate the parameters μ and σ from S?
• Maximize the likelihood function with respect to μ and σ.
• We want the μ and σ that maximize the probability of the data.
Maximize:

p(S \mid \mu, \sigma) = \prod_{i=1}^{N} N(x_i \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

• Problem: Individual values of N(x_i | μ, σ) are typically very small. (They can underflow the numerical precision of the computer.)
• Solution: Work with the log likelihood instead of the likelihood. Find a simplified expression for it:

\ln p(S \mid \mu, \sigma) = \ln \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}
= \sum_{i=1}^{N} \ln \left( \frac{e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} \right)
= \sum_{i=1}^{N} \left( \ln e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} - \ln \sqrt{2\pi\sigma^2} \right)
= -\sum_{i=1}^{N} \left( \frac{(x_i - \mu)^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi) + \frac{1}{2}\ln \sigma^2 \right)
= -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi)
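A small illustration (a sketch, with made-up data) of why the log is needed: the product of many small densities underflows to zero, while the sum of log densities stays finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
S = rng.normal(loc=5.0, scale=2.0, size=2000)   # simulated training set

dens = norm.pdf(S, loc=5.0, scale=2.0)
print(np.prod(dens))          # underflows to 0.0 for large N
print(np.sum(np.log(dens)))   # log likelihood: a finite, usable number
```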
Now, find the maximum likelihood parameters, μ and σ².

First, maximize

\ln p(S \mid \mu, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi)

with respect to μ. Find the μ that maximizes this. How to do this? Take the derivative with respect to μ and set it to zero:

\frac{d}{d\mu}\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi) \right]
= \frac{d}{d\mu}\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 \right]
= \frac{1}{2\sigma^2} \sum_{i=1}^{N} 2(x_i - \mu)
= \frac{1}{\sigma^2}\left[ \left( \sum_{i=1}^{N} x_i \right) - N\mu \right] = 0

Result:

\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n

(ML = “Maximum Likelihood”)
Now, maximize

\ln p(S \mid \mu, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi)

with respect to σ². Find the σ² that maximizes this:

\frac{d}{d\sigma^2}\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2}\ln \sigma^2 - \frac{N}{2}\ln(2\pi) \right]
= \frac{d}{d\sigma^2}\left[ -\frac{1}{2}(\sigma^2)^{-1} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2}\ln \sigma^2 \right]
= \frac{1}{2}(\sigma^2)^{-2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2\sigma^2}
= \frac{\sum_{i=1}^{N} (x_i - \mu)^2 - N\sigma^2}{2(\sigma^2)^2} = 0
\;\Rightarrow\; \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
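The two closed-form results can be checked numerically (a sketch; the true parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(loc=5.0, scale=2.0, size=10_000)   # true mu = 5, sigma^2 = 4

mu_ml = S.mean()                       # (1/N) * sum of x_n
sigma2_ml = np.mean((S - mu_ml) ** 2)  # (1/N) * sum of (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                # approximately 5 and 4
```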
• The resulting distribution is called a “generative model” because it can generate new data values.

N(x \mid \mu_{ML}, \sigma_{ML}) = \frac{1}{\sqrt{2\pi\sigma_{ML}^2}} \, e^{-\frac{(x - \mu_{ML})^2}{2\sigma_{ML}^2}}

• We say that θ = {μ_ML, σ_ML} parameterizes the model.
• In general, θ is used to denote the (learnable) parameters of a probabilistic model.
Learning a GMM
More general case: Multivariate Gaussian Distribution
Multivariate (D-dimensional) Gaussian:

N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} \, |\boldsymbol{\Sigma}|^{1/2}} \, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}

where μ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| is the determinant of Σ.
Covariance:

\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

Variance: \text{var}(x) = \text{cov}(x, x)

Covariance matrix Σ: \Sigma_{i,j} = \text{cov}(x_i, x_j)
• Let S be a set of multivariate data points (vectors):
S = {x1, ..., xm}.
• General expression for a finite Gaussian mixture model:

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k N(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

• That is, x has probability of “membership” in multiple clusters/classes.
Maximum Likelihood for Multivariate Gaussian Mixture Model
• Goal: Given S = {x1, ..., xN} and K, find the Gaussian mixture model (with K multivariate Gaussians) for which S has maximum log-likelihood.
• Log likelihood function:

\ln p(S \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)

• Given S, we can maximize this function to find {π, μ, Σ}_ML.
• But there is no closed-form solution (unlike the simple case on the previous slides).
• In this multivariate case, we can efficiently maximize this function using the “Expectation / Maximization” (EM) algorithm.
Expectation-Maximization (EM) algorithm
•
General idea:
– Choose random initial values for means, covariances and mixing coefficients.
(Analogous to choosing random initial cluster centers in K-means.)
– Alternate between E (expectation) and M (maximization) step:
• E step: use current values for parameters to evaluate posterior
probabilities, or “responsibilities”, for each data point. (Analogous to
determining which cluster a point belongs to, in K-means.)
• M step: Use these probabilities to re-estimate means, covariances, and
mixing coefficients. (Analogous to moving the cluster centers to the
means of their members, in K-means.)
Repeat until the log-likelihood or the parameters θ do not change significantly.
More detailed version of EM algorithm
1. Let X be the set of training data. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k, and evaluate the initial value of the log likelihood:

\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)

2. E step. Evaluate the “responsibilities” using the current parameter values:

r_{n,k} = \frac{\pi_k N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j N(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

where r_{n,k} denotes the “responsibility” of the kth cluster for the nth data point.
3. M step. Re-estimate the parameters θ using the current responsibilities.
4. Evaluate the log likelihood with the new parameters (same formula as in step 1) and check for convergence of either the parameters or the log likelihood. If not converged, return to step 2.
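Here is a minimal NumPy/SciPy sketch of steps 1-4 for a multivariate GMM, using the standard M-step updates (responsibility-weighted means and covariances, and responsibility fractions for the mixing coefficients); the function and variable names are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, rng=np.random.default_rng(0)):
    N, D = X.shape
    # 1. Initialize means (random data points), covariances, mixing coefficients.
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E step: responsibilities r[n, k] of cluster k for point n.
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate parameters from the responsibilities.
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # 4. Log likelihood under the parameters used in this E step;
        #    stop when it no longer changes significantly.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, r
```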
• EM is much more computationally expensive than K-means.
• Common practice: Use K-means to set the initial parameters, then improve with EM.
– Initial means: means of the clusters found by K-means.
– Initial covariances: sample covariances of the clusters found by the K-means algorithm.
– Initial mixture coefficients: fractions of data points assigned to the respective clusters.
• One can prove that EM finds local maxima of the log-likelihood function.
• EM is a very general technique for finding maximum-likelihood solutions for probabilistic models.
Using GMM for Classification
Assume each cluster corresponds to one of the classes.
A new test example x is classified according to the class whose component gives it the highest posterior probability:

\text{class}(\mathbf{x}) = \arg\max_{c} \; \pi_c N(\mathbf{x} \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)
Case Study:
Text classification from labeled and unlabeled
documents using EM
K. Nigam et al., Machine Learning, 2000
• Big problem with text classification: need labeled data.
• What we have: lots of unlabeled data.
• Question of this paper: Can unlabeled data be used to increase
classification accuracy?
• I.e.: Any information implicit in unlabeled data? Any way to
take advantage of this implicit information?
General idea: A version of the EM algorithm
• Train a classifier with the small set of available labeled documents.
• Use this classifier to assign probabilistically-weighted class labels to the unlabeled documents by calculating the expectation of the missing class labels.
• Then train a new classifier using all the documents, both originally labeled and formerly unlabeled.
• Iterate.
Probabilistic framework
• Assumes data are generated with Gaussian mixture model
• Assumes one-to-one correspondence between mixture
components and classes.
• “These assumptions rarely hold in real-world text data”
Probabilistic framework
Let C = {c_1, ..., c_K} be the classes / mixture components.
Let θ denote the mixture parameters: the mixture weights {π_1, ..., π_K} together with the parameters of each component distribution.
Assumptions: A document d_i is created by first selecting a mixture component according to the mixture weights π_j, then having this selected mixture component generate a document according to its own parameters, with distribution p(d_i | c_j; θ).
• Likelihood of document d_i:

p(d_i \mid \theta) = \sum_{j=1}^{K} \pi_j \, p(d_i \mid c_j; \theta)
• Now, we will apply EM to a Naive Bayes Classifier
Recall the Naive Bayes classifier: assume each feature is conditionally independent, given c_j:

p(c_j \mid x) \propto p(c_j) \prod_{i} p(f_i \mid c_j), \quad i = 1, ..., N; \; j = 1, ..., K

To “train” Naive Bayes from labeled data, estimate

p(c_j) \text{ and } p(f_i \mid c_j), \quad j = 1, ..., K; \; i = 1, ..., N

These values are estimates of the parameters in θ. Call these values \hat{\theta}.
Note that Naive Bayes can be thought of as a generative mixture model.
Document d_i is represented as a vector of word frequencies (w_1, ..., w_{|V|}), where V is the vocabulary (all known words).
The probability distribution over words associated with each class is parameterized by θ.
We need to estimate \hat{\theta} to determine which class's word distribution a document d_i = (w_1, ..., w_{|V|}) is most likely to have come from.
Applying EM to Naive Bayes
• We have a small number of labeled documents S_labeled, and a large number of unlabeled documents, S_unlabeled.
• The initial parameters \hat{\theta} are estimated from the labeled documents S_labeled.
• Expectation step: The resulting classifier is used to assign probabilistically-weighted class labels p(c_j | x) to each unlabeled document x ∈ S_unlabeled.
• Maximization step: Re-estimate \hat{\theta} using the p(c_j | x) values for x ∈ S_labeled ∪ S_unlabeled.
• Repeat until p(c_j | x) or \hat{\theta} has converged.
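A minimal sketch of this EM loop using scikit-learn's MultinomialNB as the base classifier (in the spirit of the paper's approach, not its actual implementation). All names are illustrative; it assumes class labels 0..K-1 that all appear in the labeled set, and dense feature matrices (use scipy.sparse.vstack for sparse ones).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def nb_em(X_labeled, y_labeled, X_unlabeled, n_classes, n_iters=10):
    clf = MultinomialNB()
    clf.fit(X_labeled, y_labeled)               # initial parameters from labeled docs
    for _ in range(n_iters):
        # E step: probabilistically weighted class labels for unlabeled docs.
        probs = clf.predict_proba(X_unlabeled)  # shape (n_unlabeled, n_classes)
        # M step: retrain on labeled docs plus each unlabeled doc replicated
        # once per class, weighted by its class probability.
        X_rep = np.vstack([X_unlabeled] * n_classes)
        y_rep = np.repeat(np.arange(n_classes), X_unlabeled.shape[0])
        w_rep = probs.T.ravel()                  # weights aligned with y_rep
        X_all = np.vstack([X_labeled, X_rep])
        y_all = np.concatenate([y_labeled, y_rep])
        w_all = np.concatenate([np.ones(len(y_labeled)), w_rep])
        clf = MultinomialNB()
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf
```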
Augmenting EM
What if basic assumptions (each document generated by one
component; one-to-one mapping between components and
classes) do not hold?
They tried two things to deal with this:
(1) Weighting unlabeled data less than labeled data
(2) Allow multiple mixture components per class:
A document may be composed of several different sub-topics, each best captured by a different word distribution.
Data
• 20 UseNet newsgroups
• Web pages (WebKB)
• Newswire articles (Reuters)