A Probabilistic Framework for Semi-Supervised Clustering
Sugato Basu
Mikhail Bilenko
Raymond J. Mooney
Department of Computer Sciences
University of Texas at Austin
Presented by Jingting Zeng
Outline

Introduction
Background
Algorithm
Experiments
Conclusion
What is Semi-Supervised Clustering?

Use human input to provide labels for some of the data
Improve existing naive clustering methods
Use the labeled data to guide the clustering of the unlabeled data
End result is a better clustering of the data
Motivation

Large amounts of unlabeled data exist, and more are being produced all the time
Generating labels for data is expensive and usually requires human intervention
Want to use limited amounts of labeled data to guide the clustering of the entire dataset, in order to produce a better clustering
Semi-Supervised Clustering

Constraint-based:
Modify the objective function to include constraint satisfaction, enforcing the constraints during the clustering or initialization process

Distance-based:
A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset
Method

Use both constraint-based and distance-based approaches in a unified method
Use a Hidden Markov Random Field (HMRF) to model the constraints
Use the constraints for initialization and for the assignment of points to clusters
Use an adaptive distance function to learn the distance measure
Cluster the data to minimize an objective distance function
Main Points of Method

Improved initialization:
Initial clusters are formed based on the constraints

Constraint-sensitive assignment of points to clusters:
Points are assigned to clusters so as to minimize a distortion function while also minimizing the number of constraints violated

Iterative distance learning:
The distortion measure is re-estimated after each iteration
Constraints

Pairwise constraints with must-link or cannot-link labels:
A set M of must-link constraints
A set C of cannot-link constraints
A list of associated costs for violating must-link or cannot-link requirements

Class labels do not have to be known; a user can still specify relationships between points.
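As a minimal sketch (the names and structure are illustrative, not from the paper), this supervision could be represented as two sets of point-index pairs plus a violation cost per pair:

```python
# Illustrative container for the supervision given to the algorithm.
# Names (Constraints, add_must_link, ...) are hypothetical, not the authors' code.
from dataclasses import dataclass, field

@dataclass
class Constraints:
    must_link: set[tuple[int, int]] = field(default_factory=set)    # set M of index pairs
    cannot_link: set[tuple[int, int]] = field(default_factory=set)  # set C of index pairs
    w: dict[tuple[int, int], float] = field(default_factory=dict)   # violation cost per pair

    def add_must_link(self, i: int, j: int, cost: float = 1.0) -> None:
        pair = (min(i, j), max(i, j))
        self.must_link.add(pair)
        self.w[pair] = cost

    def add_cannot_link(self, i: int, j: int, cost: float = 1.0) -> None:
        pair = (min(i, j), max(i, j))
        self.cannot_link.add(pair)
        self.w[pair] = cost
```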
HMRF

Posterior probability:
This problem is an "incomplete-data problem": the cluster representatives as well as the class labels are unknown
A popular method for solving this type of problem is Expectation Maximization (EM)
K-Means is equivalent to an EM algorithm with hard clustering assignments
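As a rough sketch of the model behind this slide (notation approximate): the hidden cluster labels L form a Markov random field whose prior penalizes constraint violations, the observed points are generated from their cluster representatives, and the algorithm approximately maximizes the resulting posterior.

```latex
% Sketch of the HMRF posterior over cluster labels L given data X
% (Z is a normalizing constant; V encodes the pairwise constraint penalties).
\Pr(L \mid X) \;\propto\; \Pr(L)\,\Pr(X \mid L)
  \;=\; \frac{1}{Z}\exp\Big(-\sum_{(x_i, x_j)} V(x_i, x_j)\Big)
        \prod_{x_i \in X} p\big(x_i \mid \mu_{l_i}\big)
```

Maximizing this posterior corresponds to minimizing the objective function shown two slides later.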
Must-Link Violation Cost Function

$f_M(x_i, x_j) = w_{ij} \, D(x_i, x_j)$

Ensures that the penalty for violating the must-link constraint between two points that are far apart is higher than between two points that are close
Punishes distance functions in which must-link points are far apart
Cannot-Link Violation Cost Function

$f_C(x_i, x_j) = w_{ij} \, (D_{\max} - D(x_i, x_j))$

Ensures that the penalty for violating a cannot-link constraint between points that are nearby according to the current distance function is higher than between distant points
Punishes distance functions that place two cannot-link points close together
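A minimal sketch of both penalty functions, assuming squared Euclidean distance as the distortion measure D and a scalar cost w_ij per constraint (function names are illustrative):

```python
import numpy as np

def squared_euclidean(x_i, x_j):
    # Simple distortion measure used as a stand-in for D in these sketches.
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ diff)

def must_link_penalty(x_i, x_j, w_ij, D=squared_euclidean):
    # Larger when must-linked points are far apart under the current distance.
    return w_ij * D(x_i, x_j)

def cannot_link_penalty(x_i, x_j, w_ij, D_max, D=squared_euclidean):
    # Larger when cannot-linked points are close together under the current distance.
    return w_ij * (D_max - D(x_i, x_j))
```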
Objective Function

$J_{\mathrm{obj}} = \sum_{x_i \in X} D(x_i, \mu_{l_i})
  + \sum_{(x_i, x_j) \in M,\; l_i \ne l_j} w_{ij} \, D(x_i, x_j)
  + \sum_{(x_i, x_j) \in C,\; l_i = l_j} w_{ij} \, (D_{\max} - D(x_i, x_j))
  + \log Z$
Goal: minimize the objective function, using:

Supervised data in initialization
Constraints in cluster assignments
Distance learning in the M-step
wij D ( xi , x j )
Algorithm

EM framework:

Initialization step:
Use the constraints to guide the formation of the initial clusters

E-step:
Minimizes the objective function over cluster assignments

M-step:
Minimizes the objective function over the cluster representatives
Minimizes the objective function over the parameters of the distortion measure
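A high-level sketch of this EM-style loop, reusing `objective`, `assign_points`, and `update_centroids` from the other sketches; the constraint-based initialization and the distance-parameter update are described on the following slides and omitted here (names are placeholders, not the authors' code):

```python
def hmrf_kmeans(X, K, constraints, init_centroids, max_iter=100, tol=1e-4):
    centroids = init_centroids                  # from the constraint-based initialization
    labels = [0] * len(X)
    prev_obj = float("inf")
    for _ in range(max_iter):
        labels = assign_points(X, centroids, constraints, labels)   # E-step
        centroids = update_centroids(X, labels, K)                  # M-step, part 1
        obj = objective(X, labels, centroids, constraints)
        if prev_obj - obj < tol:                # stop when the objective stops improving
            break
        prev_obj = obj
    return labels, centroids
```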
Initialization

Form the transitive closure of the must-link constraints:
This gives a set of connected components ("neighborhoods"), each consisting of points connected by must-link constraints
Suppose there are y connected components

If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
The remaining clusters are initialized by random perturbations of the global centroid of the data
What If There Are More Neighborhoods Than Clusters?

If y > K (the number of clusters), K initial clusters are selected using the distance measure

Farthest-first traversal is a good heuristic; a weighted variant is used:
The distance between two centroids is multiplied by their corresponding weights
The weight of each centroid is proportional to the size of its corresponding neighborhood
This biases selection toward centroids that are relatively far apart and come from reasonably large neighborhoods
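A minimal sketch of this weighted farthest-first selection, under the assumptions that distances are squared Euclidean, the weight of a centroid is its neighborhood size, and "weighted distance" means the pairwise distance scaled by both weights (the exact weighting in the paper may differ):

```python
import numpy as np

def weighted_farthest_first(centroids, weights, k):
    centroids = np.asarray(centroids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    chosen = [int(np.argmax(weights))]          # start from the largest neighborhood
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in range(len(centroids)):
            if i in chosen:
                continue
            # Weighted distance to the nearest already-chosen centroid.
            score = min(weights[i] * weights[j] * np.sum((centroids[i] - centroids[j]) ** 2)
                        for j in chosen)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)                     # pick the farthest remaining centroid
    return chosen
```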
Initialization Continued

Assuming consistency of the data:
Augment the set M with must-link constraints inferred from the transitive closure
For each pair of neighborhoods Np and Np' that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Np'

Learn as much about the data through the constraints as possible (illustrated in the two diagrams below, with a code sketch after them)
Augment set M (diagram): a is must-linked to b and b is must-linked to c, so a must-link between a and c is inferred.
Augment set C (diagram): a is must-linked to b and b is cannot-linked to c, so a cannot-link between a and c is inferred.
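A sketch of this constraint-augmentation step (union-find over must-links to form neighborhoods, then cross-neighborhood cannot-links), with inputs given as index pairs; this is an illustration, not the authors' implementation:

```python
from itertools import combinations

def augment_constraints(n_points, must_link, cannot_link):
    # Union-find over points connected by must-link constraints.
    parent = list(range(n_points))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in must_link:
        parent[find(i)] = find(j)

    # Group points into must-link neighborhoods (connected components).
    groups = {}
    for p in range(n_points):
        groups.setdefault(find(p), []).append(p)
    neighborhoods = list(groups.values())

    # Inferred must-links: every pair inside the same neighborhood.
    M = {tuple(sorted(pair)) for comp in neighborhoods for pair in combinations(comp, 2)}

    # Inferred cannot-links: every cross pair between two neighborhoods that
    # already have at least one cannot-link constraint between them.
    C = {tuple(sorted(p)) for p in cannot_link}
    for a, b in combinations(neighborhoods, 2):
        if any(tuple(sorted((i, j))) in C for i in a for j in b):
            C |= {tuple(sorted((i, j))) for i in a for j in b}
    return M, C, neighborhoods
```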
E-step

Assignment of data points to clusters:
Since the model incorporates interactions between points, computing the point assignments that minimize the objective function is computationally intractable
Approximate approaches include iterated conditional modes (ICM), belief propagation, and linear programming relaxation
ICM uses a greedy strategy: it sequentially updates the cluster assignment of each point while keeping the other points fixed
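A sketch of the ICM-style greedy update, reusing the Constraints container and the squared Euclidean distortion from the earlier sketches; each sweep re-assigns one point at a time to the label that minimizes its distortion plus its local constraint penalties:

```python
def assign_points(X, centroids, constraints, labels, D=squared_euclidean, n_sweeps=5):
    K = len(centroids)
    D_max = max(D(X[i], X[j]) for i in range(len(X)) for j in range(i + 1, len(X)))
    for _ in range(n_sweeps):
        for i in range(len(X)):
            best_k, best_cost = labels[i], float("inf")
            for k in range(K):
                cost = D(X[i], centroids[k])          # distortion to the candidate centroid
                for (a, b) in constraints.must_link:
                    if i in (a, b):
                        j = b if a == i else a
                        if labels[j] != k:            # must-link would be violated
                            cost += constraints.w[(a, b)] * D(X[i], X[j])
                for (a, b) in constraints.cannot_link:
                    if i in (a, b):
                        j = b if a == i else a
                        if labels[j] == k:            # cannot-link would be violated
                            cost += constraints.w[(a, b)] * (D_max - D(X[i], X[j]))
                if cost < best_cost:
                    best_k, best_cost = k, cost
            labels[i] = best_k                        # greedy update, other points held fixed
    return labels
```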
M-step

First, the cluster representatives are re-estimated to decrease the objective function
Constraints do not factor into this step, so it is equivalent to the K-Means centroid update

If a parameterized variant of the distance measure is used, it is updated here:
The parameters can be found through partial derivatives of the distance function
This learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart
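A sketch of the first part of the M-step under these assumptions (the distance-parameter update via partial derivatives is omitted); since the constraints are not involved, the representative update is just the K-Means centroid computation:

```python
import numpy as np

def update_centroids(X, labels, K):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.empty((K, X.shape[1]))
    for k in range(K):
        members = X[labels == k]
        # Fall back to the global centroid if a cluster lost all of its points.
        centroids[k] = members.mean(axis=0) if len(members) else X.mean(axis=0)
    return centroids
```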
Results

KMeans-I-C-D: the complete HMRF-KMeans algorithm, with supervised data used in initialization (I), cluster assignment (C), and distance learning (D)

KMeans-I-C: the HMRF-KMeans algorithm without distance learning

KMeans-I: the HMRF-KMeans algorithm without distance learning and without supervised cluster assignment
Conclusion

HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints
The goal of the algorithm was to provide a better clustering method using a limited number of constraints
HMRF-KMeans learns quickly from a limited number of constraints
It should be applicable to datasets where we want to limit the amount of human labeling and where supervision can be specified as pairwise constraints
Questions

Can all types of constraints be captured as pairwise associations?
Hierarchical structure?

Could other types of labels be included in this model?
Use class labels as well as pairwise constraints?

How does this model handle noise in the data or labels?
For example: point A has a must-link constraint to point B, point B has a must-link constraint to point C, but point A has a cannot-link constraint to point C
More Questions

How does this apply to other types of data?
The authors mention wanting to apply the method to other types of data in the future, such as gene representations

Who provides the weights for constraint violations, and how are those weights determined?

The method is only compared with the naïve K-Means method
How does it compare with other semi-supervised clustering methods?
Reference

S. Basu, M. Bilenko, and R. J. Mooney, "A Probabilistic Framework for Semi-Supervised Clustering," Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 2004.
Thank you!