Basic Clustering Concepts & Algorithms
Clustering
Basic Concepts and Algorithms
Bamshad Mobasher
DePaul University
What is Clustering in Data Mining?
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
It helps users understand the natural grouping or structure in a data set.
Cluster: a collection of data objects that are "similar" to one another and thus can be treated collectively as one group, but that, as a collection, are sufficiently different from other groups.
Clustering is unsupervised classification: there are no predefined classes.
Applications of Cluster Analysis
Data reduction
Summarization: Preprocessing for regression, PCA,
classification, and association analysis
Compression: Image processing: vector quantization
Hypothesis generation and testing
Prediction based on groups
Cluster & find characteristics/patterns for each group
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection: Outliers are often viewed as those “far
away” from any cluster
Basic Steps to Develop a Clustering Task
Feature selection / Preprocessing
Select information relevant to the task of interest
Aim for minimal information redundancy
May need to do normalization/standardization (see the sketch after this list)
Distance/Similarity measure
Similarity of two feature vectors
Clustering criterion
Expressed via a cost function or some rules
Clustering algorithms
Choice of algorithms
Validation of the results
Interpretation of the results with applications
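The normalization/standardization step mentioned above can be done in several ways; a minimal sketch of z-score standardization and min-max normalization (illustrative choices only, since the slide just names the step) might look like this:

```python
# Minimal sketch: two common ways to rescale features before clustering.
# (Illustrative only; the slides do not prescribe a specific method.)
import numpy as np

def standardize(X):
    # z-score: each feature gets zero mean and unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_normalize(X):
    # rescale each feature to the [0, 1] range
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)
```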
Distance or Similarity Measures
Common Distance Measures:
Manhattan distance:  $dist(X, Y) = \sum_i |x_i - y_i|$

Euclidean distance:  $dist(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}$

Cosine similarity:  $sim(X, Y) = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$, with $dist(X, Y) = 1 - sim(X, Y)$
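As a quick illustration, the three measures above can be written directly in Python (a minimal sketch, not tied to any particular library):

```python
# Minimal sketch of the Manhattan, Euclidean, and cosine measures above.
from math import sqrt

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_sim(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = sqrt(sum(xi * xi for xi in x)) * sqrt(sum(yi * yi for yi in y))
    return dot / norms

def cosine_dist(x, y):
    return 1 - cosine_sim(x, y)   # dist(X, Y) = 1 - sim(X, Y), as above
```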
More Similarity Measures
In the vector-space model, many similarity measures can be used in clustering
Simple Matching
Cosine Coefficient
Dice’s Coefficient
Jaccard’s Coefficient
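For example, Dice's and Jaccard's coefficients from the list above can be computed on set (binary) representations of two items; a minimal sketch:

```python
# Minimal sketch of Dice's and Jaccard's coefficients for two items
# represented as sets of terms/features.
def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Hypothetical example: two documents as sets of terms.
d1 = {"data", "mining", "cluster"}
d2 = {"data", "cluster", "graph"}
print(dice(d1, d2))     # 0.666...
print(jaccard(d1, d2))  # 0.5
```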
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used,
its implementation, and
its ability to discover some or all of the hidden patterns
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Model-based:
A model is hypothesized for each of the clusters, and the goal is to find the best
fit of the data to the hypothesized models
Typical methods: EM, SOM, COBWEB
Partitioning Approaches
The notion of comparing item similarities can be extended to
clusters themselves, by focusing on a representative vector
for each cluster
cluster representatives can be actual items in the cluster or other “virtual”
representatives such as the centroid
this methodology reduces the number of similarity computations in
clustering
clusters are revised successively until a stopping condition is satisfied, or
until no more changes to clusters can be made
Reallocation-Based Partitioning Methods
Start with an initial assignment of items to clusters and then move items
from cluster to cluster to obtain an improved partitioning
Most common algorithm: k-means
The K-Means Clustering Method
Given the number of desired clusters k, the k-means
algorithm follows four steps:
1. Randomly assign objects to create k nonempty initial
partitions (clusters)
2. Compute the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point,
of the cluster)
3. Assign each object to the cluster with the nearest
centroid (reallocation step)
4. Go back to Step 2, stop when the assignment does not
change
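A minimal sketch of these four steps (assuming Euclidean distance and NumPy, neither of which is prescribed by the slide) could look like this:

```python
# Minimal k-means sketch following the four steps above.
# (Empty clusters and other edge cases are not handled.)
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial assignment with k nonempty clusters
    labels = rng.permutation(np.arange(len(X)) % k)
    centroids = None
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each current cluster
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 3: reallocate each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```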
K-Means Example: Document Clustering
Initial (arbitrary) assignment: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}

Document-term matrix and cluster centroids:

       T1    T2    T3    T4    T5
 D1     0     3     3     0     2
 D2     4     1     0     1     2
 D3     0     4     0     0     2
 D4     0     3     0     3     3
 D5     0     1     3     0     1
 D6     2     2     0     0     4
 D7     1     0     3     2     0
 D8     3     1     0     0     2
 C1   4/2   4/2   3/2   1/2   4/2
 C2   0/2   7/2   0/2   3/2   5/2
 C3   2/2   3/2   3/2   0/2   5/2
Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).
        C1     C2     C3
 D1   29/2   31/2   28/2
 D2   29/2   20/2   21/2
 D3   24/2   38/2   22/2
 D4   27/2   45/2   24/2
 D5   17/2   12/2   17/2
 D6   32/2   34/2   30/2
 D7   15/2    6/2   11/2
 D8   24/2   17/2   19/2
Example (Continued)

        C1     C2     C3
 D1   29/2   31/2*  28/2
 D2   29/2*  20/2   21/2
 D3   24/2   38/2*  22/2
 D4   27/2   45/2*  24/2
 D5   17/2   12/2   17/2*
 D6   32/2   34/2*  30/2
 D7   15/2*   6/2   11/2
 D8   24/2*  17/2   19/2

For each document, reallocate the document to the cluster to which it has the highest similarity (marked with * in the table above; D5 ties between C1 and C3 and is placed in C3). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.
Example (Continued)

Now compute new cluster centroids using the original document-term matrix and the new clusters:

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

       T1    T2    T3    T4    T5
 D1     0     3     3     0     2
 D2     4     1     0     1     2
 D3     0     4     0     0     2
 D4     0     3     0     3     3
 D5     0     1     3     0     1
 D6     2     2     0     0     4
 D7     1     0     3     2     0
 D8     3     1     0     0     2
 C1   8/3   2/3   3/3   3/3   4/3
 C2   2/4  12/4   3/4   3/4  11/4
 C3   0/1   1/1   3/1   0/1   1/1

This leads to a new cluster-document similarity matrix similar to the previous slide. Again, the items are reallocated to the clusters with the highest similarity (marked with *), giving the new assignment:

        C1      C2      C3
 D1    7.67   16.75*  14.00
 D2   15.01*  11.25    3.00
 D3    5.34   17.50*   6.00
 D4    9.00   19.50*   6.00
 D5    5.00    8.00   11.00*
 D6   12.00*   6.68    9.34
 D7    7.67    4.25    9.00*
 D8   11.34*  10.00    3.00

New assignment: C1 = {D2,D6,D8}, C2 = {D1,D3,D4}, C3 = {D5,D7}

Note: This process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
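For readers who want to verify the numbers, the first reallocation step of this example can be reproduced with a short NumPy sketch (the matrix values are taken from the slides; the code itself is only illustrative):

```python
# Reproduces the first iteration of the document-clustering example above.
import numpy as np

# Document-term matrix, rows D1..D8, columns T1..T5 (values from the slides)
D = np.array([
    [0, 3, 3, 0, 2],   # D1
    [4, 1, 0, 1, 2],   # D2
    [0, 4, 0, 0, 2],   # D3
    [0, 3, 0, 3, 3],   # D4
    [0, 1, 3, 0, 1],   # D5
    [2, 2, 0, 0, 4],   # D6
    [1, 0, 3, 2, 0],   # D7
    [3, 1, 0, 0, 2],   # D8
])

# Initial (arbitrary) assignment: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}
clusters = [[0, 1], [2, 3], [4, 5]]
centroids = np.array([D[idx].mean(axis=0) for idx in clusters])

# Cluster-document similarity matrix using the dot product
sims = D @ centroids.T
print(sims * 2)   # printed as numerators over 2, matching the slides

# Reallocation: each document goes to the cluster with the highest similarity.
# Note: D5 ties between C1 and C3; argmax breaks the tie toward C1,
# while the slides place D5 in C3.
print(sims.argmax(axis=1))   # 0 -> C1, 1 -> C2, 2 -> C3
```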
K-Means Algorithm
Strength of the k-means:
Relatively efficient: O(tkn), where n is # of objects, k is # of
clusters, and t is # of iterations. Normally, k, t << n
Often terminates at a local optimum
Weakness of the k-means:
Applicable only when mean is defined; what about categorical
data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Variations of K-Means usually differ in:
Selection of the initial k means
Distance or similarity measures used
Strategies to calculate cluster means
A Disk Version of k-means
k-means can be implemented with data on disk
In each iteration, it scans the database once
The centroids are computed incrementally
It can be used to cluster large datasets that do not fit in
main memory
We need to control the number of iterations
In practice, a limit is set (e.g., fewer than 50 iterations)
There are better algorithms that scale up for large data
sets, e.g., BIRCH
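The incremental centroid computation mentioned above amounts to a running mean that is updated one object at a time, so only the centroids (not the whole data set) need to stay in memory; a minimal sketch:

```python
# Minimal sketch of an incrementally computed centroid (running mean).
import numpy as np

def incremental_centroid(points):
    centroid, n = None, 0
    for x in points:                         # points can come from a disk scan
        x = np.asarray(x, dtype=float)
        n += 1
        if centroid is None:
            centroid = x.copy()
        else:
            centroid += (x - centroid) / n   # running-mean update
    return centroid
```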
BIRCH
Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance
Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes
Hierarchical Clustering Algorithms
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k
clusters) left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k
clusters)
• Traditional hierarchical algorithms use a similarity or
distance matrix
– Merge or split one cluster at a time
Hierarchical Clustering Algorithms
Use the distance/similarity matrix as the clustering criterion
Does not require the number of clusters as input, but needs a termination condition
[Diagram: agglomerative clustering merges singletons a, b, c, d, e step by step (a,b -> ab; c,d -> cd; cd,e -> cde; ab,cde -> abcde), while divisive clustering performs the same steps in reverse, from the single cluster abcde back down to singletons.]
Hierarchical Agglomerative Clustering
Basic procedure
1. Place each of N items into a cluster of its own.
2. Compute all pairwise item-item similarity coefficients
Total of N(N-1)/2 coefficients
3. Form a new cluster by combining the most similar pair of
current clusters i and j
(methods for determining which clusters to merge: single-link,
complete link, group average, etc.);
update similarity matrix by deleting the rows and columns
corresponding to i and j;
calculate the entries in the row corresponding to the new
cluster i+j.
4. Repeat step 3 if the number of clusters left is greater than 1.
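In practice, the procedure above is rarely coded by hand; a minimal sketch using SciPy (an assumed library choice, not mentioned on the slides) looks like this:

```python
# Minimal hierarchical agglomerative clustering sketch using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: rows are items, columns are features.
X = np.array([[0.0, 0.1], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.5, 0.5]])

# 'single' = single link; 'complete' and 'average' are the other variants
# discussed on the following slides.
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram into, e.g., 2 clusters (the termination condition).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```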
Hierarchical Agglomerative Clustering :: Example
[Figure: six points grouped into nested clusters, alongside the corresponding dendrogram showing the order and distances of the merges.]
Distance Between Two Clusters
The basic procedure varies based on the method used to
determine inter-cluster distances or similarities
Different methods result in different variants of the
algorithm
Single link
Complete link
Average link
Ward’s method
Etc.
Single Link Method
The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
It can find arbitrarily shaped clusters, but it may cause the undesirable "chain effect" due to noisy points.
[Figure: two natural clusters are split into two]
Distance between two clusters
Single-link distance between clusters Ci and Cj is the minimum
distance between any object in Ci and any object in Cj
The distance is defined by the two most similar objects
$D_{sl}(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d(x, y)$

Example item-item similarity matrix:

        I1     I2     I3     I4     I5
 I1   1.00   0.90   0.10   0.65   0.20
 I2   0.90   1.00   0.70   0.60   0.50
 I3   0.10   0.70   1.00   0.40   0.30
 I4   0.65   0.60   0.40   1.00   0.80
 I5   0.20   0.50   0.30   0.80   1.00
Complete Link Method
The distance between two clusters is the distance between
the two furthest data points in the two clusters
It is sensitive to outliers because they are far away
Distance between two clusters
Complete-link distance between clusters Ci and Cj is the
maximum distance between any object in Ci and any object in Cj
The distance is defined by the two least similar objects
$D_{cl}(C_i, C_j) = \max_{x \in C_i,\; y \in C_j} d(x, y)$

(using the same item-item similarity matrix as on the single-link slide)
Average link and centroid methods
Average link: A compromise between
the sensitivity of complete-link clustering to outliers and
the tendency of single-link clustering to form long chains that do not
correspond to the intuitive notion of clusters as compact, spherical
objects
In this method, the distance between two clusters is the average
distance of all pair-wise distances between the data points in two
clusters.
Centroid method: In this method, the distance between two
clusters is the distance between their centroids
Distance between two clusters
Group average distance between clusters Ci and Cj is the average
distance between objects in Ci and objects in Cj
The distance is defined by the average similarities
$D_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\; y \in C_j} d(x, y)$

(again using the same item-item similarity matrix as on the single-link slide)
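To tie the three definitions together, the single-, complete-, and average-link values between two clusters can be computed directly from the item-item similarity matrix shown earlier (the cluster split below is a hypothetical choice for illustration):

```python
# Minimal sketch: single-, complete-, and average-link values between two
# clusters, computed from the item-item similarity matrix on the slides.
import numpy as np

S = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],   # I1
    [0.90, 1.00, 0.70, 0.60, 0.50],   # I2
    [0.10, 0.70, 1.00, 0.40, 0.30],   # I3
    [0.65, 0.60, 0.40, 1.00, 0.80],   # I4
    [0.20, 0.50, 0.30, 0.80, 1.00],   # I5
])

ci = [0, 1]      # Ci = {I1, I2}   (hypothetical grouping)
cj = [2, 3, 4]   # Cj = {I3, I4, I5}

block = S[np.ix_(ci, cj)]        # pairwise similarities across the two clusters
single_link   = block.max()      # two most similar objects (single link)
complete_link = block.min()      # two least similar objects (complete link)
average_link  = block.mean()     # average over all |Ci|*|Cj| pairs
print(single_link, complete_link, average_link)
```

Note that because these values are similarities rather than distances, single link corresponds to the maximum (most similar pair) and complete link to the minimum, as stated on the slides.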