The Semi-supervised Clustering Problem


DB Seminar Series: The Semi-supervised Clustering Problem
By: Kevin Yip
2 Jan 2004
(Happy New Year!)
First of all, an example…

Where are the 2 clusters?
2
First of all, an example…

Where are the 2 clusters?
What if I tell you that these two are in the same cluster…
…and these two are in the same cluster?
3
First of all, an example…

Where are the 2 clusters?
4
Observations
The clustering accuracy was greatly improved (from ~50% to 100%).
All objects were unlabeled => classification not possible.
Only a small amount of knowledge was added (the relationships between 2 pairs of objects, out of the total 105 pairs).
=> Semi-supervised clustering
5
Observations
Is it more natural to form 4 clusters instead?
– True, but then how many clusters are there?
– It is most natural to set k to the number of known classes.
6
Outline
The problem (no formal definitions…)
Why not clustering? Classification?
Potential applications of semi-supervised clustering
Supervision – what to input?
When to input?
How to alter the clustering process?
Some examples
Future Work
7
The problem
Input:
– A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
– A small amount of domain knowledge
Output:
– A partitioning of the objects into k clusters (possibly with some discarded as outliers)
Objective:
– Maximum intra-cluster similarity
– Minimum inter-cluster similarity
– High consistency between the partitioning and the domain knowledge
8
Why not clustering?
The clusters produced may not be the ones required (Guha et al., 1998).
Algorithms that fit the cluster model may not be available.
Sometimes there are multiple possible groupings.
There is no way to utilize the domain knowledge that is accessible (active learning vs. passive validation).
9
Why not classification?
Sometimes there are insufficient labeled data:
– Objects are not labeled.
– The amount of labeled objects is statistically insignificant.
– The labeled objects do not cover all classes.
– The labeled objects of a class do not cover all cases (e.g. they are all found at one side of the class).
The underlying class model may not fit the data (e.g. pattern-based similarity).
10
Potential applications
Bioinformatics (gene and protein clustering)
Document hierarchy construction
News/email categorization
Image categorization
Road lane detection
XML document clustering?
=> Cluster properties not well known, but some domain knowledge is available
11
What to input?
A small amount of labeled data
Instance-level constraints (illustrated in the sketch after this slide):
– Must-link: two objects must be put into the same cluster.
– Cannot-link: two objects must not be put into the same cluster.
– Objects that are similar/dissimilar to each other.
First-order logic language
Good cluster, bad cluster
General comments (e.g. clustering too fine)
Others
12
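To make the instance-level inputs above concrete, here is a tiny illustration (not from the talk) of how they are commonly represented; the variable names are arbitrary.

```python
# Labeled seeds: object index -> known class (a small amount of labeled data).
labeled_seeds = {0: "class_a", 7: "class_a", 12: "class_b"}

# Instance-level constraints as pairs of object indices.
must_link = [(3, 5), (5, 9)]   # each pair must end up in the same cluster
cannot_link = [(3, 12)]        # each pair must end up in different clusters

# Must-links are transitive: (3, 5) and (5, 9) together imply (3, 9), so many
# algorithms first take the connected components of the must-link graph.
```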
When to input?
At the beginning of clustering
During cluster validation, to guide the next round of clustering
When the algorithm requests it
– The user may give a “don’t know” response.
13
How to alter the clustering process?
Guide the formation of seed clusters.
Force/recommend some objects to be put in the same cluster/different clusters.
Modify the objective function.
Modify the similarity function.
Modify the distance matrix.
14
Examples

Who and where? | What? | When? | How?
Wagstaff et al., ICML 2001 | Must-links, cannot-links | Beginning | Force object relationships
Basu et al., ICML 2002 | Labeled objects | Beginning | Create seed clusters
Klein et al., ICML 2002 | Must-links, cannot-links | Beginning / on request | Modify the distance matrix
Xing et al., NIPS 15, 2003 | Examples of similar objects | Beginning | Modify the distance function
Basu et al., not yet published, 2003 | Must-links, cannot-links | Beginning / on request | Modify the objective function, create seed clusters
15
Wagstaff et al. 2001
Input knowledge: must-link and cannot-link constraints
Algorithm: constrained K-means (COP-KMeans); a sketch follows this slide:
1. Generate cluster seeds.
2. Assign each object in turn to the closest seed such that no constraints are violated (otherwise, report an error and halt).
3. Use the centroid of each cluster as the new seed.
4. Repeat 2 and 3 until convergence.
16
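A minimal sketch of COP-KMeans as described above, assuming numeric data in a NumPy array; the random seed generation and the helper names are illustrative, not taken from the paper.

```python
import numpy as np

def violates(point, cluster, assignments, must_link, cannot_link):
    """True if putting `point` into `cluster` breaks any pairwise constraint."""
    for a, b in must_link:
        other = b if a == point else a if b == point else None
        if other is not None and assignments[other] != -1 and assignments[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == point else a if b == point else None
        if other is not None and assignments[other] == cluster:
            return True
    return False

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]        # step 1: generate seeds
    for _ in range(max_iter):
        assignments = np.full(len(X), -1)
        for i in range(len(X)):                                 # step 2: constrained assignment
            for c in np.argsort(np.linalg.norm(X[i] - centroids, axis=1)):
                if not violates(i, c, assignments, must_link, cannot_link):
                    assignments[i] = c
                    break
            else:
                raise RuntimeError("No feasible assignment: constraints cannot be satisfied")
        new_centroids = np.array([X[assignments == c].mean(axis=0) if np.any(assignments == c)
                                  else centroids[c] for c in range(k)])   # step 3: recompute seeds
        if np.allclose(new_centroids, centroids):               # step 4: stop at convergence
            break
        centroids = new_centroids
    return assignments, centroids
```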
Wagstaff et al. 2001
Some results:
– Soybean (47 x 35, 4 classes)
  • At most 47 x 46 / 2 = 1081 constraints
  • 100% coverage can be achieved by 43 must-links.
17
Wagstaff et al. 2001

Some results:
– Mushroom (50 x 21, 2 classes)
18
Wagstaff et al. 2001

Some results:
– Tic-tac-toe (100 x 9, 2 classes)
19
Basu et al. 2002
Input knowledge: a small set of labeled objects for each class
Algorithm: Seeded-KMeans:
1. Use the centroid of each set of labeled objects as an initial seed.
2. Assign each object to the closest seed.
3. Use the centroid of each cluster as the new seed.
4. Repeat 2 and 3 until convergence.
20
Basu et al. 2002
Algorithm: Constrained-KMeans (the seeding step is sketched after this slide):
1. Use the centroid of each set of labeled objects as an initial seed.
2. Assign each object to the closest seed in turn, such that all labeled objects of each class are kept in the same cluster.
3. Use the centroid of each cluster as the new seed.
4. Repeat 2 and 3 until convergence.
21
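A minimal sketch of the seeding step shared by Seeded-KMeans and Constrained-KMeans: the initial seeds are the centroids of the labeled objects of each class rather than random points. The dict-based label format is an assumption for illustration.

```python
import numpy as np

def seeds_from_labels(X, labeled):
    """labeled maps class name -> list of object indices; returns one centroid per class."""
    return np.array([X[idx].mean(axis=0) for idx in labeled.values()])

# Example: two labeled objects per class give the two initial seeds.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
labeled = {"class_a": [0, 1], "class_b": [2, 3]}
print(seeds_from_labels(X, labeled))
# [[0.1  0.05]
#  [5.1  4.95]]
```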
Basu et al. 2002

Some results:
– Same-3 Newsgroups (300 x 21631, 3 classes)
22
Basu et al. 2002

Some results:
– Different-3 Newsgroups (300 x 21631, 3 classes)
23
Klein et al. 2002
In the previous two approaches, unconstrained objects are not much affected if the amount of input is small.
Is it possible to propagate the influence to the neighboring objects?
24
Klein et al. 2002

An illustration:
25
Klein et al. 2002
Idea: modify the distance matrix such that must-link objects have zero distances and cannot-link objects have large distances.
Two problems:
– The effects are still restricted to constrained objects.
– The resulting distance matrix may not satisfy the triangle inequality.
Solutions:
– Propagate the must-link influence by running an all-pairs shortest paths algorithm.
– Propagate the cannot-link influence implicitly during clustering.
26
Klein et al. 2002
Must-link propagation:
[Scatter plot of objects A, B, C, D omitted]
Manhattan distance matrix:
     A   B   C   D
A    0   4   6   6
B    4   0   2   4
C    6   2   0   4
D    6   4   4   0
27
Klein et al. 2002
Must-link propagation (the must-linked pair A, B is given distance 0):
[Scatter plot of objects A, B, C, D omitted]
     A   B   C   D
A    0   0   6   6
B    0   0   2   4
C    6   2   0   4
D    6   4   4   0
28
Klein et al. 2002
Must-link propagation (after the all-pairs shortest paths pass; reproduced by the sketch after this slide):
[Scatter plot of objects A, B, C, D omitted]
     A   B   C   D
A    0   0   2   4
B    0   0   2   4
C    2   2   0   4
D    4   4   4   0
29
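A small sketch reproducing the must-link propagation worked through on the last three slides: set the must-linked pair's distance to zero, then run Floyd-Warshall (an all-pairs shortest-paths algorithm) so the shrunken distance propagates to the other objects.

```python
import numpy as np

# Manhattan distance matrix for objects A, B, C, D (from the slide).
D = np.array([
    [0, 4, 6, 6],   # A
    [4, 0, 2, 4],   # B
    [6, 2, 0, 4],   # C
    [6, 4, 4, 0],   # D
], dtype=float)

# Impose the must-link constraint on (A, B).
D[0, 1] = D[1, 0] = 0

# Floyd-Warshall: restore the triangle inequality by shortest-path propagation.
n = len(D)
for k in range(n):
    for i in range(n):
        for j in range(n):
            D[i, j] = min(D[i, j], D[i, k] + D[k, j])

print(D)
# [[0. 0. 2. 4.]
#  [0. 0. 2. 4.]
#  [2. 2. 0. 4.]
#  [4. 4. 4. 0.]]
```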
Klein et al. 2002
Cannot-link propagation:
– Similarly, if there is a cannot-link constraint on a pair of objects, the distance between the objects is set to a certain large value.
– But it is harder to fix the triangle inequality.
• If A and B must link, they will be given the same coordinates.
• But if A and B cannot link, what coordinates should be given to them?
• In general, it is desirable to keep the new distance matrix as similar as possible to the original one, but this poses a difficult optimization problem.
30
Klein et al. 2002
Cannot-link propagation:
– Solution: do not propagate explicitly, but choose an algorithm that implicitly produces the same effect during the clustering process.
– E.g.: the complete-link hierarchical algorithm (see the sketch after this slide)
• Distance between two clusters = the longest distance between two objects, one taken from each cluster
• If A is cannot-linked with B, then when A merges with C, C is effectively also cannot-linked with B.
31
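A tiny sketch of the complete-link cluster distance used in that argument; because the cluster distance is the longest pairwise distance, a large cannot-link entry between A and B keeps any cluster containing A far from any cluster containing B. The example matrix is illustrative.

```python
import numpy as np

def complete_link_distance(cluster1, cluster2, D):
    """Complete-link distance between two clusters (lists of object indices),
    given the full pairwise distance matrix D: the longest pairwise distance."""
    return max(D[i, j] for i in cluster1 for j in cluster2)

# With a large cannot-link distance between A (index 0) and B (index 1), any cluster
# containing A stays far from any cluster containing B, so they are merged last.
D = np.array([[0.0, 99.0, 2.0],
              [99.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])
print(complete_link_distance([0, 2], [1], D))   # 99.0: the cannot-link entry dominates
```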
Klein et al. 2002
Active learning:
– Up to now, we have assumed all constraints are input at the beginning.
– Is it possible for the algorithm to request useful constraints?
– Advantages:
• Fewer constraints are needed.
• The constraints that are input are the most useful ones.
– Disadvantages:
• Queries must be answered on the fly.
• Not all queries can be answered.
32
Klein et al. 2002
Active learning:
– Allow the algorithm to ask at most m questions.
– Observations:
• In hierarchical clustering, errors usually occur at later stages.
• In complete-link clustering, the merged-cluster distance is monotonically non-decreasing.
– Idea: estimate a distance threshold such that all merges involving clusters at or beyond that distance are guided by the m questions.
33
Klein et al. 2002
Experiments:
– Algorithms:
• CCL (distance matrix learning)
• CCL-Active (CCL + active learning)
• COP-KMeans (forced relationships)
34
Klein et al. 2002

Results on synthetic data (held-out acc.):
35
Klein et al. 2002

Results on synthetic data (held-out acc.):
36
Klein et al. 2002

Results on real data (held-out acc.):
a) Iris (150 x 4, 3 classes)
b) Crabs (200 x 5, 2 classes)
c) Soybean (562 x 35, 15 classes)
37
Klein et al. 2002

Effect of constraints:
38
Xing et al. 2003
Is it possible to propagate the effects of constraints by transforming the distance function instead of the distance matrix?
Mahalanobis distance (sketched after this slide):
  d(x, y) = d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}
– When A = I (the identity matrix), d(x, y) is the Euclidean distance between x and y.
– When A is a diagonal matrix, d(x, y) is a weighted Euclidean distance between x and y.
39
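A minimal sketch of the parameterized distance d_A above; the function name is illustrative. With A equal to the identity it reduces to the Euclidean distance, and with a diagonal A it is a weighted Euclidean distance.

```python
import numpy as np

def d_A(x, y, A):
    """Mahalanobis-style distance sqrt((x - y)^T A (x - y))."""
    diff = np.asarray(x) - np.asarray(y)
    return np.sqrt(diff @ A @ diff)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(d_A(x, y, np.eye(2)))              # 5.0 (plain Euclidean)
print(d_A(x, y, np.diag([1.0, 0.25])))   # about 3.61 = sqrt(1*9 + 0.25*16) (weighted Euclidean)
```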
Xing et al. 2003
Input knowledge: a set S of object pairs that are similar.
Objective: find a matrix A that minimizes
  \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2
subject to the constraints that
– \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1, where D is a set of dissimilar pairs (to avoid the trivial solution A = 0)
– A is positive semi-definite (to ensure the triangle inequality and d(x, y) \ge 0 for all x, y)
– (Is it also needed to minimize the deviation of the resulting distance matrix from the original one?)
40
Xing et al. 2003
Case 1: A is a diagonal matrix
– It can be shown that the problem is equivalent to minimizing the following function (a sketch using a generic optimizer follows this slide):
  g(A) = g(A_{11}, \ldots, A_{nn}) = \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 - \log\left( \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \right)
  which can be solved by the Newton-Raphson method.
41
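A hedged sketch of the diagonal case, assuming sets S (similar pairs) and D (dissimilar pairs) of index pairs are given. For brevity it hands g(A) to a generic bound-constrained optimizer from SciPy rather than implementing the Newton-Raphson iteration mentioned on the slide; function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def learn_diagonal_metric(X, S, D, eps=1e-6):
    """Learn a diagonal matrix A = diag(w) by minimizing g over the weights w."""
    def g(w):
        # Sum of squared A-distances over similar pairs ...
        sim = sum(np.dot(w, (X[i] - X[j]) ** 2) for i, j in S)
        # ... minus log of the summed A-distances over dissimilar pairs.
        dis = sum(np.sqrt(np.dot(w, (X[i] - X[j]) ** 2)) for i, j in D)
        return sim - np.log(dis + eps)

    d = X.shape[1]
    res = minimize(g, x0=np.ones(d), bounds=[(eps, None)] * d)  # keep weights non-negative
    return np.diag(res.x)
```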
Xing et al. 2003
Case 2: A is not a diagonal matrix
– Newton's method becomes very expensive (O(n^6)).
– Use a gradient descent method to optimize the functions of an equivalent problem.
In both cases, metric learning is performed before the clustering process.
42
Xing et al. 2003
Visualization of the power of metric learning (with 100% input):
43
Xing et al. 2003
Experiments:
– Algorithms:
• KMeans
• KMeans + learn diagonal A
• KMeans + learn full A
• COP-KMeans
• COP-KMeans + learn diagonal A
• COP-KMeans + learn full A
– Input size:
• Little domain knowledge: number of resulting connected components (Kc) = 90% of the dataset size
• Much domain knowledge: Kc = 70% of the dataset size
44
Xing et al. 2003

Some results on real data:
45
Basu et al. 2003
Main contributions:
– Combine both metric learning and constraint enforcement.
– Soft constraints: constraint violations incur penalties instead of halting the algorithm.
– Introduce a new active learning method.
46
Basu et al. 2003
The generalized K-Means model:
  d(x, y) = d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}
– Maximizing the complete-data log-likelihood is equivalent to minimizing the following objective function:
  \sum_{x_i \in D} \|x_i - \mu_{l_i}\|_A^2 - \log(\det(A))
  where D is the dataset, l_i is the cluster of x_i, and \mu_{l_i} is its centroid.
– Can be optimized (locally) by an EM algorithm.
47
Basu et al. 2003
– Adding constraint violation penalties:
  \sum_{x_i \in D} \|x_i - \mu_{l_i}\|_A^2 - \log(\det(A)) + \sum_{(x_i, x_j) \in M} w_{ij} f_M(x_i, x_j) 1[l_i \ne l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} f_C(x_i, x_j) 1[l_i = l_j]
• M / C: sets of must-link / cannot-link constraints
• w_{ij}, \bar{w}_{ij}: weights of the constraints based on external knowledge
• 1[true] = 1, 1[false] = 0
• f_M, f_C: weights of the constraints based on the distance between the objects (should f_M(x, y) be larger or smaller when d(x, y) is larger?)
48
Basu et al. 2003
A simplified objective function (an assignment-step sketch follows this slide):
  \frac{1}{2} \sum_{h=1}^{k} \sum_{x_i \in X_h} \|x_i - \mu_h\|^2 + \sum_{(x_i, x_j) \in M} w_{ij} 1[l_i \ne l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} 1[l_i = l_j]
– Algorithm (PCKMeans):
• From M and C, identify the pre-specified connected components.
• Use the centroids of the largest connected components as the initial seeds. If there are not enough seeds, use objects that are cannot-linked to all connected components. If there are still not enough, use random seeds.
• Perform KMeans with the modified objective function.
49
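A hedged sketch of a PCKMeans-style assignment step under the simplified objective above: each object is assigned greedily to the cluster that minimizes its squared distance plus the must-link/cannot-link violation penalties. The uniform weights w and w_bar and the fixed assignment order are simplifications, not the published algorithm.

```python
import numpy as np

def pck_assign(X, centroids, must_link, cannot_link, w=1.0, w_bar=1.0):
    """Assign each object to the cluster minimizing distance plus constraint penalties."""
    labels = np.full(len(X), -1)
    for i in range(len(X)):
        costs = np.sum((centroids - X[i]) ** 2, axis=1)       # squared-distance term
        for a, b in must_link:                                 # penalty if a must-link pair is split
            j = b if a == i else a if b == i else None
            if j is not None and labels[j] != -1:
                costs += w * (np.arange(len(centroids)) != labels[j])
        for a, b in cannot_link:                               # penalty if a cannot-link pair is joined
            j = b if a == i else a if b == i else None
            if j is not None and labels[j] != -1:
                costs += w_bar * (np.arange(len(centroids)) == labels[j])
        labels[i] = int(np.argmin(costs))
    return labels
```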
Basu et al. 2003
A new active learning algorithm: ask questions as early as possible
– Explore step: form k initial neighborhoods (a farthest-first sketch follows this slide).
  Observation: k-means centers are usually remote from each other.
  Procedure:
• Draw the point that is farthest from all points drawn so far.
• Ask at most k questions to determine which neighborhood the point should be assigned to.
• Repeat until k neighborhoods are formed (or no more questions are allowed).
50
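A hedged sketch of the Explore step's farthest-first traversal. The pairwise queries are abstracted into a `same_cluster(i, j)` oracle (an assumption standing in for the user's must-link / cannot-link answers), and the question budget is ignored for brevity.

```python
import numpy as np

def explore(X, k, same_cluster):
    """Build up to k disjoint neighborhoods (lists of indices) by farthest-first traversal."""
    chosen = {0}                       # start from an arbitrary point
    neighborhoods = [[0]]
    while len(neighborhoods) < k and len(chosen) < len(X):
        # Farthest-first: pick the point whose distance to the closest chosen point is largest.
        dists = [min(np.linalg.norm(X[i] - X[j]) for j in chosen) if i not in chosen else -1.0
                 for i in range(len(X))]
        x = int(np.argmax(dists))
        chosen.add(x)
        for hood in neighborhoods:     # ask about one representative per existing neighborhood
            if same_cluster(x, hood[0]):
                hood.append(x)         # must-link answer: join this neighborhood
                break
        else:
            neighborhoods.append([x])  # cannot-linked to all of them: start a new neighborhood
    return neighborhoods
```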
Basu et al. 2003
– Consolidate step: form k clusters.
  Observation: each object is close to its corresponding seed.
  Procedure:
• Use the centroids of the k neighborhoods as the initial seeds (all objects in a neighborhood are immediately assigned to the corresponding cluster).
• Pick a non-assigned object x.
• Starting from the closest seed, ask for the relationship between x and an object in that cluster until the response is a must-link.
• Repeat to assign all objects (if questions run out, revert to normal KMeans).
51
Basu et al. 2003

Some experiment results:
– News-all20 (2000 x 16089, 20 classes): left
– News-diff3 (300 x 3251, 3 classes): right
52
Discussions and Future Work
Special clustering problems:
– How can input knowledge help dimension selection in projected clustering / pattern-based clustering?
– Is it possible to refine produced clusters when new knowledge becomes available?
Other uses of input knowledge:
– Can input knowledge help determine algorithm parameter values / stopping conditions?
53
Discussions and Future Work
Clustering algorithm
– What kind of algorithms could potentially benefit most from knowledge inputs?
– Should we work on constrained objects first, or mix them with other objects at the beginning? (Induction vs. transduction)
Performance concerns
– How is the speed/space performance affected by the introduction of external knowledge?
– How will the paradigm be affected if there is a limited amount of memory?
54
Discussions and Future Work
Active learning
– What to ask? Questions that are difficult for algorithms but easy for users.
– When to ask? When there is a major consequence.
Noise
– How to handle contradicting constraints / constraints that may not be 100% correct?
– How much should be charged for a constraint violation?
– What is the significance of knowledge inputs when the data is noisy?
55
Discussions and Future Work
Debate questions (from the ICML 2003 Workshop homepage)
– Unlabeled data is only useful when there are a large number of redundant features.
– Why doesn't the No Free Lunch Theorem apply when working with unlabeled data?
– Unlabeled data has to come from the same underlying distribution as the labeled data.
– Can unlabeled data be used in temporal domains?
– Feature engineering is more important than algorithm design for semi-supervised learning.
– All the interesting problems in semi-supervised learning have been identified.
– Active learning is an interesting "academic" problem.
56
Discussions and Future Work
Debate questions (from the ICML 2003 Workshop homepage)
– Active learning research without user interface design is only solving half the problem.
– Using unlabeled data in data mining is no different than using it in machine learning.
– Massive data sets pose problems when using current semi-supervised algorithms.
– Off-the-shelf data mining software incorporating labeled and unlabeled data is a fantasy.
– Unlabeled data is only useful when the classes are well separated.
57
References
Ayhan Demiriz, K. P. Bennett and M. J. Embrechts, Semi-supervised Clustering using Genetic Algorithms, ANNIE 1999.
Luis Talavera and Javier Bejar, Integrating Declarative Knowledge in Hierarchical Clustering Tasks, ISIDA 1999.
David Cohn, Rich Caruana and Andrew McCallum, Semi-supervised Clustering with User Feedback, unpublished work, 2000.
Kiri Wagstaff and Claire Cardie, Clustering with Instance-level Constraints, ICML 2000.
Kiri Wagstaff, Claire Cardie, Seth Rogers and Stefan Schroedl, Constrained K-means Clustering with Background Knowledge, ICML 2001.
58
References
Sugato Basu, Arindam Banerjee and Raymond Mooney, Semi-supervised Clustering by Seeding, ICML 2002.
Dan Klein, Sepandar D. Kamvar and Christopher D. Manning, From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering, ICML 2002.
Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, Distance Metric Learning, with Application to Clustering with Side-information, Advances in Neural Information Processing Systems 15, 2003.
59
References
Sugato Basu, Mikhail Bilenko and Raymond J. Mooney, Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering, ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.
Sugato Basu, Arindam Banerjee and Raymond J. Mooney, Active Semi-Supervision for Pairwise Constrained Clustering, not yet published, 2003.
60