Steven F. Ashby Center for Applied Scientific Computing


What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

(Figure: inter-cluster distances are maximized; intra-cluster distances are minimized)
Applications of Cluster Analysis

Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered clusters and the industry groups they correspond to:
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Summarization
– Reduce the size of large data sets

(Figure: clustering precipitation in Australia)
Notion of a Cluster can be Ambiguous
How many clusters? (Figure: the same set of points grouped into two, four, or six clusters)
Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and
partitional sets of clusters

Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering
(Figure: original points and a partitional clustering of them)
Hierarchical Clustering
(Figures: a traditional hierarchical clustering of points p1, p2, p3, p4 with its dendrogram, and a non-traditional hierarchical clustering of the same points with its dendrogram)
Clustering Algorithms

K-means and its variants

Density-based clustering

Hierarchical clustering
K-means Clustering

Partitional clustering approach

Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest
centroid

Number of clusters, K, must be specified

The basic algorithm is very simple
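A minimal NumPy sketch of this basic algorithm (the function name, initialization, and stopping rule here are illustrative choices, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids
```

For example, `labels, centroids = kmeans(X, k=3)` on an (n, 2) array X reproduces the kind of two-dimensional runs shown on the following slides.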
K-means: Example
(Figure: the final clustering after iteration 6 of K-means, plotted in the x-y plane)
K-means: Example
(Figures: cluster assignments and centroid positions at iterations 1 through 6 of K-means, plotted in the x-y plane)
Importance of Choosing Initial Centroids …
(Figures: the same original points clustered from two different initializations; one run reaches the optimal clustering, the other a sub-optimal clustering)
Importance of Choosing Initial Centroids …
(Figure: the clustering after iteration 5 of K-means for a different choice of initial centroids)
Importance of Choosing Initial Centroids …
(Figures: cluster assignments and centroid positions at iterations 1 through 5 of K-means for this initialization)
Problems with Selecting Initial Points

If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.
– The chance is relatively small when K is large
– If the clusters are the same size, n, then
  P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = (K! n^K) / (Kn)^K = K!/K^K
– For example, if K = 10, then the probability is 10!/10^10 ≈ 0.00036 (checked numerically in the snippet below)
– Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
– Consider an example of five pairs of clusters
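A quick numerical check of the K = 10 figure above (this just evaluates K!/K^K):

```python
import math

K = 10
print(math.factorial(K) / K**K)   # K!/K^K = 0.00036288
```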
Solutions to Initial Centroids Problem

Multiple runs
– Helps, but the probability is not on your side (see the sketch below)

Sample the data and use hierarchical clustering to determine initial centroids

Select more than k initial centroids and then select among these initial centroids
– Select the most widely separated

Postprocessing
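A minimal sketch of the multiple-runs idea, assuming scikit-learn is available; it keeps the run with the lowest sum of squared errors (SSE), which scikit-learn calls inertia:

```python
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

def best_of_n_runs(X, k, n_runs=10):
    """Run K-means from several random starts and keep the lowest-SSE result."""
    best = None
    for seed in range(n_runs):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:  # inertia_ is the SSE
            best = km
    return best.labels_, best.cluster_centers_
```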
Limitations of K-means

K-means has problems when clusters are of differing:
– Sizes
– Densities
– Non-globular shapes

K-means has problems when the data contains
outliers.
Limitations of K-means: Differing Sizes
(Figure: original points with clusters of different sizes, and the K-means result with 3 clusters)
Limitations of K-means: Differing Density
(Figure: original points with clusters of different densities, and the K-means result with 3 clusters)
Limitations of K-means: Non-globular Shapes
(Figure: original points with non-globular clusters, and the K-means result with 2 clusters)
Overcoming K-means Limitations
(Figure: original points and the K-means clusters)

One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.
Overcoming K-means Limitations
(Figure: original points and the many-cluster K-means result)
Overcoming K-means Limitations
(Figure: original points and the many-cluster K-means result)
Clustering Algorithms

K-means and its variants

Density-based clustering

Hierarchical clustering
Density-based Clustering

(Figure-only slides illustrating clusters found by density-based clustering)
Density-based clustering: DBSCAN

DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point.
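A minimal NumPy sketch of these definitions (Eps, MinPts, and the "at least MinPts" counting convention are illustrative; the exact convention varies between descriptions):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' for the given Eps and MinPts."""
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighbor counts within Eps (the point itself is counted here)
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_pts
    labels = np.full(len(X), "noise", dtype=object)
    labels[is_core] = "core"
    for i in np.where(~is_core)[0]:
        # A non-core point within Eps of some core point is a border point
        if np.any(is_core & (dists[i] <= eps)):
            labels[i] = "border"
    return labels
```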
DBSCAN: Core, Border, and Noise Points
(Figure: core, border, and noise points for a given Eps and MinPts)
DBSCAN Algorithm
Eliminate noise points

Perform clustering on the remaining points:
– Put an edge between all pairs of core points that are within Eps of each other
– Make each group of connected core points into a separate cluster
– Assign each border point to one of the clusters of its associated core points
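In practice a library implementation can be used directly; a small sketch with scikit-learn's DBSCAN (assuming scikit-learn is available; the data and parameter values are placeholders):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2) * 100                       # placeholder data; any (n, d) array works
labels = DBSCAN(eps=10, min_samples=4).fit_predict(X)  # eps ~ Eps, min_samples ~ MinPts
print(np.unique(labels))                               # cluster ids; -1 marks noise points
```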
Example: DBSCAN
The first point is selected; all points that are density-reachable from it form a new cluster.
The second point is selected; a second cluster is formed.
The third point is selected and a third cluster is formed.
DBSCAN: Core, Border and Noise Points
(Figure: original points, and the same points labeled by type as core, border, or noise; Eps = 10, MinPts = 4)
When DBSCAN Works Well
(Figure: original points and the clusters found by DBSCAN)
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figures: original points and DBSCAN results with MinPts = 4, Eps = 9.92 and with MinPts = 4, Eps = 9.75)
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts



The idea is that, for points in a cluster, their kth nearest neighbors are at roughly the same distance.
Noise points have their kth nearest neighbor at a farther distance.
So, plot the sorted distance of every point to its kth nearest neighbor.
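A minimal sketch of that k-distance plot (k = 4 and the helper name are illustrative); a "knee" in the sorted curve suggests a value for Eps:

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot the sorted distance of every point to its kth nearest neighbor."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists.sort(axis=1)              # per row: 0 (the point itself), then nearest neighbors
    kth = np.sort(dists[:, k])      # distance to the kth nearest neighbor, sorted ascending
    plt.plot(kth)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to {k}th nearest neighbor")
    plt.show()
```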
Clustering Algorithms

K-means and its variants

Density-based clustering

Hierarchical clustering
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree

Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits
(Figure: a nested clustering of points 1 through 6 and the corresponding dendrogram, with merge heights on the vertical axis)
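A minimal sketch of producing such a dendrogram with SciPy (assuming SciPy and Matplotlib are available; the six random points stand in for the ones in the figure):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(6, 2)                 # six 2-D points, as in the figure
Z = linkage(X, method="single")          # agglomerative merges, closest pair first
dendrogram(Z, labels=[str(i) for i in range(1, 7)])
plt.ylabel("Merge distance")
plt.show()
```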
Strengths of Hierarchical Clustering

Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
Hierarchical Clustering

Two main types of hierarchical clustering
– Agglomerative:
  Start with the points as individual clusters
  At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
  Start with one, all-inclusive cluster
  At each step, split a cluster until each cluster contains a single point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm

More popular hierarchical clustering technique

Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms
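A minimal, deliberately inefficient sketch that follows these six steps with single link (MIN) as the cluster proximity; the function name is illustrative:

```python
import numpy as np

def agglomerative_single_link(X):
    """Repeatedly merge the two closest clusters; return the sequence of merges."""
    # Step 1: compute the proximity (distance) matrix
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: let each data point be a cluster
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Steps 3-6: repeat until only a single cluster remains
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster proximity is the closest pair of points
                d = prox[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        # Steps 4-5: merge the two closest clusters; cluster-level proximities
        # are recomputed from the point-level matrix on the next pass
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```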
Starting Situation

Start with clusters of individual points and a proximity matrix

(Figure: points p1 through p12, each its own cluster, and the initial point-to-point proximity matrix)
Intermediate Situation

After some merging steps, we have some clusters

(Figure: points grouped into clusters C1 through C5, and the cluster-level proximity matrix)
Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

(Figure: clusters C1 through C5 with C2 and C5 about to be merged, and the current proximity matrix)
After Merging

The question is “How do we update the proximity matrix?”

(Figure: after merging, the row and column for the new cluster C2 U C5 in the proximity matrix are marked “?”)
How to Define Inter-Cluster Similarity
(Figure: two clusters of points and the proximity matrix, with the question “Similarity?” between them)

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function
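Minimal sketches of the first four definitions, written as distances between two clusters of points A and B (the helper names are illustrative):

```python
import numpy as np

def pairwise(A, B):
    """All point-to-point distances between cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def d_min(A, B):             # MIN (single link): closest pair of points
    return pairwise(A, B).min()

def d_max(A, B):             # MAX (complete link): farthest pair of points
    return pairwise(A, B).max()

def d_group_average(A, B):   # Group Average: mean of all pairwise distances
    return pairwise(A, B).mean()

def d_centroid(A, B):        # Distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```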
Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters,
it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or
more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex
shapes
– Breaking large clusters
Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall

For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Using Similarity Matrix for Cluster Validation

Order the similarity matrix with respect to cluster
labels and inspect visually.
(Figure: 100 points in the unit square forming well-separated clusters, and their pairwise similarity matrix sorted by cluster label; blocks of high similarity appear along the diagonal)
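A minimal sketch of this check (the similarity here is an illustrative transform of distance; any monotone mapping would do):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_similarity(X, labels):
    """Sort points by cluster label and show the resulting similarity matrix."""
    order = np.argsort(labels)
    Xs = X[order]
    dists = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    sim = 1.0 - dists / dists.max()   # map distances to similarities in [0, 1]
    plt.imshow(sim)                   # good clusterings show blocks on the diagonal
    plt.colorbar(label="Similarity")
    plt.show()
```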
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp
(Figure: random points clustered with DBSCAN, and the similarity matrix sorted by cluster label; the block structure is much less crisp)
Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp
(Figure: random points clustered with K-means, and the similarity matrix sorted by cluster label; again the block structure is weak)
Using Similarity Matrix for Cluster Validation
(Figure: similarity matrix for about 3000 points, sorted by DBSCAN cluster label, showing blocks for clusters 1 through 7)
Final Comment on Cluster Validity
“The validation of clustering structures is the most
difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster
analysis will remain a black art accessible only to
those true believers who have experience and
great courage.”
Algorithms for Clustering Data, Jain and Dubes