Transcript Slide 1

Clustering I
The Task
• Input: Collection of instances
– No special class label attribute!
• Output: Clusters (Groups) of instances where
members of a cluster are more ‘similar’ to
each other than they are to members of
another cluster
– Similarity between instances needs to be defined (a minimal example follows this list)
• Because there are no predefined classes to
predict, the task is to find useful groups of
instances
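For concreteness, here is a minimal sketch of one common way to define 'similarity' between numeric instances (via Euclidean distance, so a smaller distance means more similar). The Python function and the example values are illustrative assumptions, not part of the slides:

```python
import math

def euclidean_distance(x, y):
    """Distance between two numeric instances; a smaller value means 'more similar'."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two Iris-style instances described by (petal length, petal width)
print(euclidean_distance([1.4, 0.2], [4.7, 1.4]))   # ~3.51
```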
Example – Iris Data
Example Clusters?
[Scatter plot: Petal Width vs. Petal Length]
Different Types of Clusters
• Clustering is not an end in itself
– Only a means to an end defined by the goals of the parent
application
– A garment manufacturer might be looking to group people
into different sizes and shapes (segmentation)
– A biologist might be looking for gene groups with similar
functions (natural groupings)
• Application dependent notions of clusters possible
• This means you need different clustering techniques
to find different types of clusters
• You also need to learn to choose a clustering technique
that matches your objective
– Sometimes an ideal match is not possible; in that case you
adapt an existing technique
– Sometimes a wrongly selected technique might reveal
unexpected clusters – which is what data mining is all about
Cluster Representations
• Several representations are possible
[Figure: partitioning of 2D space showing instances a–k]

Table of Cluster Probabilities
              Cluster 1   Cluster 2   Cluster 3
    a            0.4         0.1         0.5
    b            0.1         0.8         0.1
    c            0.3         0.3         0.4
    d            0.1         0.1         0.8

[Figure: Venn diagram of overlapping clusters]

[Figure: dendrogram]
Several Techniques
• Partitioning Methods
– K-means and k-medoids
• Hierarchical Methods
– Agglomerative
• Density-based Methods
– DBSCAN
K-means Clustering
• Input: the number of clusters K and the collection of
n instances
• Output: a set of k clusters that minimizes the
squared error criterion
• Method:
– Arbitrarily choose k instances as the initial cluster centers
– Repeat
• (Re)assign each instance to the cluster to which the instance is
the most similar, based on the mean value of the instances in
the cluster
• Update cluster means (compute mean value of the instances for
each cluster)
– Until no change in the assignment
• Squared Error Criterion
– E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2
– where m_i is the mean of cluster C_i and p ranges over the points in C_i
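A compact sketch of this loop and the squared-error criterion, assuming numeric instances held in a NumPy array; the function name, the fixed random seed, and the toy data are illustrative assumptions rather than the exact formulation on the slide:

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Basic k-means on an (n, d) array X; returns (assignments, means, squared error)."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    # Arbitrarily choose k instances as the initial cluster centres
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = np.full(len(X), -1)
    while True:
        # (Re)assign each instance to the cluster whose mean is nearest
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):       # until no change in the assignment
            break
        assign = new_assign
        # Update cluster means (mean value of the instances in each cluster)
        for i in range(k):
            if np.any(assign == i):                  # guard against an empty cluster
                means[i] = X[assign == i].mean(axis=0)
    # Squared error criterion: E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2
    error = sum(((X[assign == i] - means[i]) ** 2).sum() for i in range(k))
    return assign, means, error

# Tiny example with two obvious groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
labels, centres, sse = kmeans(X, k=2)
```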
Interactive Demo
• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Choosing the number of Clusters
• The K-means method expects the number of clusters k
to be specified as part of the input
• How to choose appropriate k?
• Options
– Start with k = 1 and work up to a small number of clusters
• Select the result with the lowest squared error
• Warning: the trivial result with each instance in its own
cluster would have the lowest squared error
– Create two initial clusters and split each cluster
independently
• When splitting a cluster create two new ‘seeds’
• One seed at a point one standard deviation away from the
cluster center and the second one standard deviation in
the opposite direction
• Apply k-means to the points in the cluster with the two
new seeds
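A rough sketch of the splitting step just described: run 2-means on one cluster's points, seeded with two points one standard deviation either side of the cluster centre. NumPy is assumed and the function name is an illustrative choice:

```python
import numpy as np

def split_cluster(points):
    """Split one cluster into two using k-means (k = 2), seeded one standard
    deviation either side of the cluster centre."""
    points = np.asarray(points, dtype=float)
    centre = points.mean(axis=0)
    sd = points.std(axis=0)
    seeds = np.vstack([centre + sd, centre - sd])    # the two new 'seeds'
    assign = np.full(len(points), -1)
    while True:
        # Assign each point to the nearer of the two seeds
        d = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for i in (0, 1):                             # recompute the two centres
            if np.any(assign == i):
                seeds[i] = points[assign == i].mean(axis=0)
    return points[assign == 0], points[assign == 1]
```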
Choosing Initial Centroids
• Randomly chosen centroids may produce poor
quality clusters
• How to choose initial centroids?
• Options
– Take a sample of data, apply hierarchical clustering
and extract k cluster centroids
• Works well – small sample sizes are desired because
hierarchical clustering is expensive
– Take k well-separated points
• Take a sample of the data and select the first centroid as the
centroid of that sample
• Each subsequent centroid is chosen as the point farthest from
the already selected centroids
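A sketch of this second option (well-separated points chosen on a sample), assuming NumPy; the default sample size of 100 and the function name are arbitrary assumptions:

```python
import numpy as np

def farthest_point_centroids(X, k, sample_size=100, rng=None):
    """Pick k well-separated initial centroids from a sample of the data."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    sample = X[rng.choice(len(X), size=min(len(X), sample_size), replace=False)]
    centroids = [sample.mean(axis=0)]                # first centroid: centroid of the sample
    while len(centroids) < k:
        # Distance from each sample point to its nearest already-chosen centroid
        d = np.min([np.linalg.norm(sample - c, axis=1) for c in centroids], axis=0)
        centroids.append(sample[d.argmax()])         # take the point farthest away
    return np.vstack(centroids)
```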
Additional Issues
• K-means might produce empty clusters
• Need to choose an alternative centroid (a sketch follows this list)
• Options
– Choose the point farthest from the existing centroids
– Choose the point from the cluster with the highest squared error
• Squared error may need to be reduced further
– Because K-means may converge to local minimum
– Possible to reduce squared error without increasing K
• Solution
– Apply alternating phases of cluster splitting and merging
– Splitting – split the cluster with the largest squared error
– Merging – redistribute the elements of a cluster to its neighbours
• Outliers in the data cause problems to k-means
• Options
– Remove outliers
– Use K-Medoids instead
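A small sketch of the first option for repairing an empty cluster (re-seed it with the instance farthest from the existing centroids); NumPy arrays and the helper name are assumptions made for illustration:

```python
import numpy as np

def reseed_empty_cluster(X, centroids, assign, empty_idx):
    """Replace an empty cluster's centroid with the instance farthest from
    all current centroids, and move that instance into the empty cluster."""
    # Distance from every instance to its nearest current centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
    farthest = d.argmax()
    centroids[empty_idx] = X[farthest]
    assign[farthest] = empty_idx
    return centroids, assign
```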
K-medoids
• Input: the number of clusters K and the collection of
n instances
• Output: A set of k clusters that minimizes the sum of
the dissimilarities of all the instances to their
nearest medoids
• Method:
– Arbitrarily choose k instances as the initial medoids
– Repeat
• (Re)assign each remaining instance to the cluster with the
nearest medoid
• Randomly select a non-medoid instance, O_r
• Compute the total cost, S, of swapping a current medoid O_j with O_r
• If S < 0 then swap O_j with O_r to form the new set of k medoids
– Until no change
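A PAM-style sketch of this loop, assuming a precomputed (n, n) dissimilarity matrix D; the cost function follows "sum of the dissimilarities of all the instances to their nearest medoids", while the iteration cap and function names are illustrative assumptions:

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of dissimilarities of all instances to their nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def k_medoids(D, k, rng=None, max_iter=100):
    rng = rng or np.random.default_rng(0)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    for _ in range(max_iter):
        changed = False
        for j in range(k):
            # Randomly select a non-medoid instance O_r as a candidate to replace medoid O_j
            o_r = rng.choice([i for i in range(n) if i not in medoids])
            candidate = medoids.copy()
            candidate[j] = o_r
            # S = change in total cost if O_j is swapped with O_r; accept the swap if S < 0
            S = total_cost(D, candidate) - total_cost(D, medoids)
            if S < 0:
                medoids = candidate
                changed = True
        if not changed:                                     # until no change
            break
    assign = D[:, medoids].argmin(axis=1)                   # nearest-medoid assignment
    return medoids, assign
```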
K-means vs. K-medoids
• Both require K to be specified in the input
• K-medoids is less influenced by outliers in the data
• K-medoids is computationally more expensive
• Both methods assign each instance to exactly one cluster
– Fuzzy partitioning techniques relax this
• Partitioning methods generally are good at finding
spherical shaped clusters
• They are suitable for small and medium-sized data sets
– Extensions are required for working on large data sets