Semi-Supervised Clustering I


Semi-Supervised Clustering
CS 685: Special Topics in Data Mining
Spring 2008
Jinze Liu
The University of Kentucky
Outline of lecture
• Overview of clustering and classification
• What is semi-supervised learning?
– Semi-supervised clustering
– Semi-supervised classification
• Semi-supervised clustering
– What is semi-supervised clustering?
– Why semi-supervised clustering?
– Semi-supervised clustering algorithms
CS685 : Special Topics in Data Mining, UKY
Supervised classification versus unsupervised clustering
• Unsupervised clustering
– Group similar objects together to find clusters
• Minimize intra-class distance
• Maximize inter-class distance
• Supervised classification
– Class label for each training sample is given
– Build a model from the training data
– Predict class label on unseen future data points
What is clustering?
• Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or
unrelated to) the objects in other groups
[Figure: two groups of points, with intra-cluster distances minimized and inter-cluster distances maximized]
What is Classification?
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training Set → Learning algorithm → (Induction) → Model

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Model → Apply Model → (Deduction) → Class labels for the Test Set
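The induction/deduction loop on this slide can be sketched with a toy classifier. The slide does not prescribe a particular learner, so the 1-nearest-neighbor model and the distance function below are illustrative assumptions, applied to a few of the slide's records:

```python
# Toy induction/deduction loop (illustrative 1-NN; the learner and
# distance function are assumptions, not part of the slides).
def induce(train):
    # A 1-NN "model" is simply the stored training set.
    return list(train)

def dist(a, b):
    # Mismatch count on the categorical fields plus a scaled income gap.
    d = sum(a[f] != b[f] for f in ("Attrib1", "Attrib2"))
    return d + abs(a["Attrib3"] - b["Attrib3"]) / 100.0

def deduce(model, record):
    # Deduction: predict the class of the nearest training record.
    return min(model, key=lambda r: dist(r, record))["Class"]

train = [
    {"Attrib1": "Yes", "Attrib2": "Large", "Attrib3": 125, "Class": "No"},
    {"Attrib1": "No",  "Attrib2": "Large", "Attrib3": 95,  "Class": "Yes"},
    {"Attrib1": "No",  "Attrib2": "Small", "Attrib3": 85,  "Class": "Yes"},
]
model = induce(train)
unseen = {"Attrib1": "No", "Attrib2": "Small", "Attrib3": 90}
prediction = deduce(model, unseen)
```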
Clustering algorithms
• K-Means
• Hierarchical clustering
• Graph-based clustering (spectral clustering)
Classification algorithms
• Decision Trees
• Naïve Bayes classifier
• Support Vector Machines (SVM)
• K-Nearest-Neighbor classifiers
• Logistic Regression
• Neural Networks
• Linear Discriminant Analysis (LDA)
Supervised Classification Example
[Figure: scatter plot of data points]
Supervised Classification Example
[Figure: scatter plot of data points]
Supervised Classification Example
[Figure: scatter plot of data points]
Unsupervised Clustering Example
[Figure: scatter plot of data points]
Unsupervised Clustering Example
[Figure: scatter plot of data points]
Semi-Supervised Learning
• Combines labeled and unlabeled data during training to
improve performance:
– Semi-supervised classification: Training on labeled data
exploits additional unlabeled data, frequently resulting
in a more accurate classifier.
– Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

[Spectrum: unsupervised clustering → semi-supervised learning → supervised classification]
Semi-Supervised Classification Example
[Figure: scatter plot of data points]
Semi-Supervised Classification Example
[Figure: scatter plot of data points]
Semi-Supervised Classification
• Algorithms:
– Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
– Co-training [Blum:COLT98]
– Transductive SVMs [Vapnik:98, Joachims:ICML99]
– Graph-based algorithms
• Assumptions:
– Known, fixed set of categories given in the labeled data.
– Goal is to improve classification of examples into these known categories.
• More discussion next week
Semi-Supervised Clustering Example
[Figure: scatter plot of data points]
Semi-Supervised Clustering Example
[Figure: scatter plot of data points]
Second Semi-Supervised Clustering Example
[Figure: scatter plot of data points]
Second Semi-Supervised Clustering Example
[Figure: scatter plot of data points]
Semi-supervised clustering: problem definition
• Input:
– A set of unlabeled objects, each described by a set of attributes
(numeric and/or categorical)
– A small amount of domain knowledge
• Output:
– A partitioning of the objects into k clusters (possibly with some
discarded as outliers)
• Objective:
– Maximum intra-cluster similarity
– Minimum inter-cluster similarity
– High consistency between the partitioning and the domain
knowledge
Why semi-supervised clustering?
• Why not clustering?
– The clusters produced may not be the ones required.
– Sometimes there are multiple possible groupings.
• Why not classification?
– Sometimes there is insufficient labeled data.
• Potential applications
– Bioinformatics (gene and protein clustering)
– Document hierarchy construction
– News/email categorization
– Image categorization
Semi-Supervised Clustering
• Domain knowledge
– Partial label information is given
– Apply some constraints (must-links and cannot-links)
• Approaches
– Search-based Semi-Supervised Clustering (this class)
• Alter the clustering algorithm using the constraints
– Similarity-based Semi-Supervised Clustering (next class)
• Alter the similarity measure based on the constraints
– Combination of both
Search-Based Semi-Supervised Clustering
• Alter the clustering algorithm that searches for a good partitioning by:
– Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].
– Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
– Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means) [Basu:ICML02].
Overview of K-Means Clustering
• K-Means is a partitional clustering algorithm based on
iterative relocation that partitions a dataset into K clusters.
Algorithm:
– Initialize the K cluster centers {μ_l}, l = 1, …, K, randomly.
– Repeat until convergence:
• Cluster Assignment Step: Assign each data point x to the cluster X_l such that the L2 distance of x from μ_l (the center of X_l) is minimum.
• Center Re-estimation Step: Re-estimate each cluster center μ_l as the mean of the points in that cluster.
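The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code; the initialization scheme (random data points), the convergence test, and the parameter names are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Minimal K-Means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Initialize the k cluster centers as randomly chosen data points
    # (one of several possible initializations; see the next slide).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster.
        # (A production version would also handle empty clusters.)
        new_centers = np.array([X[labels == l].mean(axis=0) for l in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, labels
```

On well-separated toy data this recovers the obvious grouping regardless of which points seed the centers.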
K-Means Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

$$\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$$

• Initialization of K cluster centers:
– Totally random
– Random perturbation from the global mean
– Heuristic to ensure well-separated centers
K-Means Example
1. Randomly initialize means.
2. Assign points to clusters.
3. Re-estimate means.
4. Re-assign points to clusters.
5. Re-estimate means.
6. Re-assign points to clusters.
7. Re-estimate means and converge.
[Figures: scatter plots showing the data and the two cluster centers (marked ×) at each step]
Semi-Supervised K-Means
• Partial label information is given
– Seeded K-Means
– Constrained K-Means
• Constraints (Must-link, Cannot-link)
– COP K-Means
Semi-Supervised K-Means for partially labeled data
• Seeded K-Means:
– Labeled data provided by user are used for initialization: initial center
for cluster i is the mean of the seed points having label i.
– Seed points are only used for initialization, and not in subsequent
steps.
• Constrained K-Means:
– Labeled data provided by user are used to initialize K-Means
algorithm.
– Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
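Both variants share the seed-based initialization and differ only in the assignment step, which can be sketched as follows. The helper names `seeded_init` and `assign` are hypothetical, and this is a sketch of the two steps rather than a full implementation of either algorithm:

```python
import numpy as np

def seeded_init(X, seeds):
    """Initial center for cluster i = mean of the seed points labeled i.
    `seeds` maps point index -> cluster label (the small labeled subset)."""
    k = max(seeds.values()) + 1
    return np.array([X[[i for i, l in seeds.items() if l == c]].mean(axis=0)
                     for c in range(k)])

def assign(X, centers, seeds=None, constrained=False):
    """One assignment step. Seeded K-Means ignores the seeds here;
    Constrained K-Means clamps seed points to their given labels."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if constrained and seeds:
        for i, l in seeds.items():
            labels[i] = l  # seed labels are kept unchanged
    return labels
```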
Seeded K-Means
Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points may change.
Seeded K-Means Example
1. Initialize means using labeled data.
2. Assign points to clusters.
3. Re-estimate means.
4. Assign points to clusters and converge (note: the label of a seed point may change).
[Figures: scatter plots showing the centers (marked ×) at each step]
Constrained K-Means
Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points will not change.
Constrained K-Means Example
1. Initialize means using labeled data.
2. Assign points to clusters.
3. Re-estimate means and converge.
[Figures: scatter plots showing the centers (marked ×) at each step]
COP K-Means
• COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.
• Initialization: Cluster centers are chosen randomly, but as
each one is chosen any must-link constraints that it
participates in are enforced (so that they cannot later be
chosen as the center of another cluster).
• Algorithm: During cluster assignment step in COP-K-Means, a
point is assigned to its nearest cluster without violating any of
its constraints. If no such assignment exists, abort.
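The constrained assignment step can be sketched as below. The function name and the greedy nearest-first candidate ordering are my additions; the essential behavior follows the slide: assign each point to its nearest non-violating cluster, and abort if none exists.

```python
import numpy as np

def cop_assign(X, centers, must, cannot):
    """COP-KMeans assignment step (sketch): each point goes to the nearest
    cluster that violates none of its constraints; returns None (abort)
    if some point has no feasible cluster. `must`/`cannot` are index pairs."""
    labels = {}
    for i, x in enumerate(X):
        # Try candidate clusters nearest-first.
        for cand in np.argsort([np.linalg.norm(x - c) for c in centers]):
            # Every already-assigned must-link partner must be in `cand`...
            ok = all(labels.get(j, cand) == cand
                     for a, b in must if i in (a, b)
                     for j in (a, b) if j != i)
            # ...and no already-assigned cannot-link partner may be in `cand`.
            ok = ok and all(labels.get(j) != cand
                            for a, b in cannot if i in (a, b)
                            for j in (a, b) if j != i)
            if ok:
                labels[i] = int(cand)
                break
        else:
            return None  # no consistent assignment exists: abort
    return [labels[i] for i in range(len(X))]
```

With a must-link and a cannot-link pulling a point toward the same cluster this succeeds; with contradictory constraints on one pair it aborts, mirroring the failure case in the illustrations below.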
COP K-Means Algorithm
[Figure: COP-KMeans pseudocode]

Illustration
Determine the label of the marked point: a must-link constraint ties it to the red class, so assign it to the red class.
[Figure: scatter plot with cluster centers (marked ×)]
Illustration
Determine the label of the marked point: a cannot-link constraint rules out the other class, so assign it to the red class.
[Figure: scatter plot with cluster centers (marked ×)]
Illustration
Determine the label of the marked point: its must-link and cannot-link constraints cannot be satisfied simultaneously, so the clustering algorithm fails.
[Figure: scatter plot with cluster centers (marked ×)]
Summary
• Seeded and Constrained K-Means: partially labeled data
• COP K-Means: constraints (Must-link and Cannot-link)
• Constrained K-Means and COP K-Means require all the constraints to be satisfied.
– May not be effective if the seeds contain noise.
• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
– Less sensitive to noise in the seeds.
• Experiments show that semi-supervised K-Means outperforms traditional K-Means.
Reference
• Semi-supervised Clustering by Seeding
– http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
• Constrained K-means Clustering with Background Knowledge
– http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
Next class
• Topics
– Similarity-based semi-supervised clustering
• Readings
– Distance metric learning, with application to
clustering with side-information
• http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
– From Instance-level Constraints to Space-Level
Constraints: Making the Most of Prior Knowledge
in Data Clustering
• http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf