Semi-Supervised Clustering
Jieping Ye
Department of Computer Science and
Engineering
Arizona State University
http://www.public.asu.edu/~jye02
Outline
• Overview of clustering and classification
• What is semi-supervised learning?
– Semi-supervised clustering
– Semi-supervised classification
• Semi-supervised clustering
– What is semi-supervised clustering?
– Why semi-supervised clustering?
– Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering
• Unsupervised clustering: group similar objects together to find clusters
– Minimize intra-cluster distance
– Maximize inter-cluster distance
• Supervised classification: the class label for each training sample is given
– Build a model from the training data
– Predict class labels for unseen future data points
What is clustering?
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
[Figure: two groups of points; intra-cluster distances are minimized, inter-cluster distances are maximized]
What is Classification?

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small     70K     No
  4    Yes      Medium   120K     No
  5    No       Large     95K     Yes
  6    No       Medium    60K     No
  7    Yes      Large    220K     No
  8    No       Small     85K     Yes
  9    No       Medium    75K     No
  10   No       Small     90K     Yes

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small     55K     ?
  12   Yes      Medium    80K     ?
  13   Yes      Large    110K     ?
  14   No       Small     95K     ?
  15   No       Large     67K     ?

A learning algorithm induces a model from the training set (induction); the learned model is then applied to the test set to deduce the unknown class labels (deduction).
Clustering algorithms
• K-Means
• Hierarchical clustering
• Graph-based clustering (spectral clustering)
• Bi-clustering
Classification algorithms
• K-Nearest-Neighbor classifiers
• Naïve Bayes classifier
• Linear Discriminant Analysis (LDA)
• Support Vector Machines (SVM)
• Logistic Regression
• Neural Networks
Supervised Classification Example
[Figures: scatter plots of points in a two-dimensional feature space illustrating supervised classification]
Unsupervised Clustering Example
[Figures: scatter plots of unlabeled points illustrating unsupervised clustering]
Semi-Supervised Learning
• Combines labeled and unlabeled data during training to
improve performance:
– Semi-supervised classification: Training on labeled data
exploits additional unlabeled data, frequently resulting in a
more accurate classifier.
– Semi-supervised clustering: Uses a small amount of labeled
data to aid and bias the clustering of unlabeled data.
[Diagram: a spectrum ranging from unsupervised clustering, through semi-supervised learning, to supervised classification]
Semi-Supervised Classification Example
[Figures: scatter plots with a few labeled points among many unlabeled points, illustrating semi-supervised classification]
Semi-Supervised Classification
• Algorithms:
– Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00].
– Co-training [Blum:COLT98].
– Transductive SVMs [Vapnik:98, Joachims:ICML99].
– Graph-based algorithms
• Assumptions:
– Known, fixed set of categories given in the labeled data.
– Goal is to improve classification of examples into these known categories.
Semi-Supervised Clustering Example
[Figures: scatter plots illustrating clustering guided by a small amount of supervision]
Second Semi-Supervised Clustering Example
[Figures: scatter plots illustrating a second semi-supervised clustering example]
Semi-supervised clustering: problem definition
• Input:
– A set of unlabeled objects, each described by a set of
attributes (numeric and/or categorical)
– A small amount of domain knowledge
• Output:
– A partitioning of the objects into k clusters (possibly
with some discarded as outliers)
• Objective:
– Maximum intra-cluster similarity
– Minimum inter-cluster similarity
– High consistency between the partitioning and the
domain knowledge
Why semi-supervised clustering?
• Why not clustering?
– The clusters produced may not be the ones required.
– Sometimes there are multiple possible groupings.
• Why not classification?
– Sometimes there are insufficient labeled data.
• Potential applications
– Bioinformatics (gene and protein clustering)
– Document hierarchy construction
– News/email categorization
– Image categorization
Semi-Supervised Clustering
• Domain knowledge
– Partial label information is given
– Apply some constraints (must-links and cannot-links)
• Approaches
– Search-based Semi-Supervised Clustering
• Alter the clustering algorithm using the constraints
– Similarity-based Semi-Supervised Clustering
• Alter the similarity measure based on the constraints
– Combination of both
Search-Based Semi-Supervised
Clustering
• Alter the clustering algorithm that searches for a
good partitioning by:
– Modifying the objective function to give a
reward for obeying labels on the supervised
data [Demiriz:ANNIE99].
– Enforcing constraints (must-link, cannot-link) on
the labeled data during clustering
[Wagstaff:ICML00, Wagstaff:ICML01].
– Using the labeled data to initialize clusters in an
iterative refinement algorithm (e.g., K-Means)
[Basu:ICML02].
Overview of K-Means Clustering
• K-Means is a partitional clustering algorithm based
on iterative relocation that partitions a dataset into
K clusters.
Algorithm:
Initialize the K cluster centers $\{\mu_l\}_{l=1}^{K}$ randomly. Repeat until convergence:
– Cluster Assignment Step: Assign each data point x to the cluster $X_l$ whose center $\mu_l$ is closest to x in L2 distance.
– Center Re-estimation Step: Re-estimate each cluster center $\mu_l$ as the mean of the points currently in that cluster.
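The alternation of these two steps is easy to state directly in code. Below is a minimal NumPy sketch of plain K-Means; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: alternate cluster assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize the K centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: each point goes to its nearest center (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                                else centers[l] for l in range(k)])
        if np.allclose(new_centers, centers):   # no center moved: converged
            break
        centers = new_centers
    return labels, centers
```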
K-Means Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:
$$\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$$
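For reference, this objective can be computed directly from the assignments and centers returned by the K-Means sketch above; the helper name is illustrative.

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared L2 distances between each point and its assigned center."""
    return float(np.sum((X - centers[labels]) ** 2))
```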
• Initialization of K cluster centers:
– Totally random
– Random perturbation from global mean
– Heuristic to ensure well-separated centers
K-Means Example
[Figures: a step-by-step run of K-Means on a 2-D point cloud: randomly initialize means; assign points to clusters; re-estimate means; re-assign points to clusters; re-estimate means; re-assign points to clusters; re-estimate means and converge]
Semi-Supervised K-Means
• Partial label information is given
– Seeded K-Means
– Constrained K-Means
• Constraints (Must-link, Cannot-link)
– COP K-Means
Semi-Supervised K-Means for
partially labeled data
• Seeded K-Means:
– Labeled data provided by user are used for initialization:
initial center for cluster i is the mean of the seed points
having label i.
– Seed points are only used for initialization, and not in
subsequent steps.
• Constrained K-Means:
– Labeled data provided by the user are used to initialize the K-Means algorithm.
– Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
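The two variants differ only in whether the seed labels are clamped after initialization. A sketch of both, assuming a NumPy array `seed_labels` that holds a cluster index for each seed point and -1 for unlabeled points (all names are illustrative):

```python
import numpy as np

def seeded_kmeans(X, k, seed_labels, keep_seed_labels=False, n_iter=100):
    """Seeded K-Means (keep_seed_labels=False) or Constrained K-Means (True)."""
    seeded = seed_labels >= 0
    # Initialization: the center of cluster l is the mean of the seed points labeled l
    # (assumes every cluster label 0..k-1 has at least one seed point).
    centers = np.array([X[seed_labels == l].mean(axis=0) for l in range(k)])
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if keep_seed_labels:
            # Constrained K-Means: seed points always keep their given labels.
            labels[seeded] = seed_labels[seeded]
        # Re-estimate centers (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                                else centers[l] for l in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```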
Seeded K-Means
Use labeled data to find
the initial centroids and
then run K-Means.
The labels for seeded
points may change.
Seeded K-Means Example
[Figures: a step-by-step run of Seeded K-Means: initialize the means using the labeled data; assign points to clusters; re-estimate the means; assign points to clusters and converge. Note that the label of a seed point is changed during the run.]
Exercise
Data points on a line: 0, 1, 3, 6, 9, 15, 20, 21, 22.
Compute the clustering using Seeded K-Means.
Constrained K-Means
Use labeled data to find
the initial centroids and
then run K-Means.
The labels for seeded
points will not change.
Constrained K-Means Example
[Figures: a step-by-step run of Constrained K-Means: initialize the means using the labeled data; assign points to clusters; re-estimate the means and converge]
Exercise
Data points on a line: 0, 1, 3, 6, 9, 15, 20, 21, 22.
Compute the clustering using Constrained K-Means.
COP K-Means
• COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
• Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that those points cannot later be chosen as the center of another cluster).
• Algorithm: During the cluster assignment step of COP K-Means, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.
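A sketch of that constrained assignment step: `must` and `cannot` are lists of index pairs, clusters are tried in order of increasing distance, and `None` is returned when some point has no feasible cluster (the abort case). All names are illustrative.

```python
import numpy as np

def violates(i, cluster, labels, must, cannot):
    """Would putting point i into `cluster` break a constraint, given the
    assignments made so far? (labels[j] == -1 means j is still unassigned.)"""
    for a, b in must:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] != -1 and labels[j] != cluster:
            return True          # must-link partner sits in a different cluster
    for a, b in cannot:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == cluster:
            return True          # cannot-link partner sits in this cluster
    return False

def cop_kmeans_assign(X, centers, must, cannot):
    """One COP K-Means assignment pass: nearest non-violating cluster, or abort."""
    labels = np.full(len(X), -1)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    for i in range(len(X)):
        for cluster in np.argsort(dists[i]):        # try the nearest cluster first
            if not violates(i, cluster, labels, must, cannot):
                labels[i] = cluster
                break
        else:                                        # no feasible cluster: abort
            return None
    return labels
```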
COP K-Means Algorithm
[Figure: the COP K-Means algorithm]
Illustration
[Figure: a point whose label is to be determined has a must-link to a point in the red cluster; it is assigned to the red cluster]
Illustration
[Figure: a point whose label is to be determined has a cannot-link constraint; it is assigned to the red cluster]
Illustration
[Figure: a point whose label is to be determined has both a must-link and a cannot-link that cannot be satisfied simultaneously; the clustering algorithm fails]
Similarity-based semi-supervised clustering
• Alter the similarity measure based on the constraints.
• Paper: "From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering," D. Klein et al.
– Two types of constraints: must-links and cannot-links
– Clustering algorithm: hierarchical clustering
Overview of Hierarchical Clustering Algorithm
• Agglomerative versus divisive
• Basic algorithm of agglomerative clustering:
1. Compute the distance matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains
• The key operation is the update of the distance between two clusters
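A compact sketch of this agglomerative procedure, using one common choice, complete link (MAX), to measure the distance between clusters and stopping at `k` clusters instead of one. The names are illustrative.

```python
import numpy as np

def complete_link_clustering(D, k):
    """Agglomerative clustering on a precomputed distance matrix D (n x n),
    merging the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(D))]     # start with every point as its own cluster

    def dist(c1, c2):
        # Complete link (MAX): the largest pairwise distance between the two clusters.
        return max(D[i][j] for i in c1 for j in c2)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest inter-cluster distance.
        a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters
```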
How to Define Inter-Cluster Distance
[Figure: a pairwise distance matrix over points p1-p5; which entries define the distance between two clusters?]
• MIN
• MAX
• Group average
• Distance between centroids
Must-link constraints
• Set the distance between each must-link pair to zero.
• Derive a new metric by running an all-pairs-shortest-distances algorithm.
– It is still a metric.
– Faithful to the original metric.
– Computational complexity: O(N²·C)
• C: number of points involved in must-link constraints
• N: total number of points
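A sketch of this imposition step: must-link distances are set to zero and the change is propagated by shortest paths, with only the constrained points acting as intermediate nodes, which is what keeps the cost at O(N²·C). Function names are illustrative.

```python
import numpy as np

def impose_must_links(D, must_links):
    """Return a new distance matrix in which every must-link pair has distance 0,
    with the change propagated by shortest paths through the constrained points."""
    D = D.astype(float).copy()
    for a, b in must_links:
        D[a, b] = D[b, a] = 0.0
    constrained = sorted({i for pair in must_links for i in pair})
    # Floyd-Warshall style relaxation, restricted to constrained intermediate points.
    for m in constrained:
        D = np.minimum(D, D[:, [m]] + D[[m], :])
    return D
```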
New distance matrix based on must-link constraints
[Figure: the new distance matrix over points p1-p5 after imposing the must-link constraints]
Hierarchical clustering can be carried out based on the new distance matrix.
What is missing?
Cannot-link constraint
• Run hierarchical clustering with complete link (MAX).
– The distance between two clusters is determined by the largest pairwise distance.
• Set the distance between each cannot-link pair to a very large value (larger than any other distance in the matrix).
• The new distance matrix does not define a metric.
– Works very well in practice.
Constrained complete-link clustering algorithm
Derive a new distance matrix based on both types of constraints, then run complete-link hierarchical clustering on it.
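Putting the pieces together, a sketch of this constrained complete-link pipeline, reusing the `impose_must_links` and `complete_link_clustering` functions sketched above; the choice of `D.max() + 1` as the cannot-link distance is one simple way to make it larger than every real distance.

```python
def constrained_complete_link(D, must_links, cannot_links, k):
    """Constrained complete-link clustering on a NumPy distance matrix D."""
    D2 = impose_must_links(D, must_links)       # must-link pairs -> 0, then propagate
    big = D2.max() + 1.0                        # larger than any real distance
    for a, b in cannot_links:
        D2[a, b] = D2[b, a] = big               # push cannot-link pairs far apart
    # With complete link, any merge that would join a cannot-link pair has
    # inter-cluster distance >= big, so it is postponed until last (and avoided
    # entirely when we stop at k clusters).
    return complete_link_clustering(D2, k)
```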
Illustration
Initial distance matrix over points 1-5 (upper triangle shown):

       1     2     3     4     5
  1    0    0.2   0.5   0.1   0.8
  2          0    0.4   0.2   0.6
  3                0    0.3   0.2
  4                      0    0.5
  5                            0

Must-links: 1-2, 3-4
Cannot-link: 2-3

New distance matrix after imposing the must-link constraints:

       1     2     3     4     5
  1    0     0    0.1   0.1   0.8
  2          0    0.2   0.2   0.6
  3                0     0    0.2
  4                      0    0.2
  5                            0
Hierarchical clustering
Points 1 and 2 form a cluster, and points 3 and 4 form another cluster. The cannot-link pair 2-3 is assigned the distance 0.9, larger than any other distance, so under complete link the reduced distance matrix over the clusters {1,2}, {3,4}, and {5} is:

         {1,2}  {3,4}   {5}
  {1,2}    0     0.9    0.8
  {3,4}           0     0.2
  {5}                    0

[Figure: the resulting dendrogram over points 1-5]
Summary
• Seeded and Constrained K-Means: partially labeled data
• COP K-Means: constraints (must-link and cannot-link)
• Constrained K-Means and COP K-Means require all the constraints to be satisfied.
– They may not be effective if the seeds contain noise.
• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
– Less sensitive to noise in the seeds.
• Semi-supervised hierarchical clustering