Transcript Lecture 14

Clustering methods

Partitional clustering in which clusters are represented by their
centroids (proc FASTCLUS)

Agglomerative hierarchical clustering in which the closest
clusters are repeatedly merged (proc CLUSTER)

Density-based clustering in which core points and associated
border points are clustered (proc MODECLUS)
Data mining and statistical learning lecture 14
Proc FASTCLUS

Select k initial centroids

Repeat the following until the clusters remain unchanged:

Form k clusters by assigning each point to its nearest
centroid

Update the centroid of each cluster
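
The steps above can be sketched in a few lines of Python with NumPy. This is only an illustration of the basic k-means iteration, not the implementation used by proc FASTCLUS. The function name k_means, the random choice of initial centroids and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means iteration; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: the k initial centroids are distinct data points picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Form k clusters by assigning each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when the clusters remain unchanged.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update the centroid of each cluster (an empty cluster keeps its old centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Synthetic, well-separated example data (not the water-sample data).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
data = np.vstack([group_a, group_b])
centroids, labels = k_means(data, k=2)
print(centroids)
```

With less clearly separated data the result can depend on the initial centroids, so in practice the procedure is often run from several starting points.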
Identification of water samples with incorrect total nitrogen levels
[Figure: scatter plot of total nitrogen (Kjeldahl), mg/l, against total nitrogen (persulfate), mg/l]
Identification of water samples with incorrect total nitrogen levels
- 2-means clustering
[Figure: scatter plot of total nitrogen (Kjeldahl) against total nitrogen (persulfate digestion), with points marked as Cluster 1 or Cluster 2]
Initialization problems?
Limitations of K-means clustering
1. Difficult to detect clusters with non-spherical
shapes
2. Difficult to detect clusters of widely different sizes
3. Difficult to detect clusters of different densities
Proc MODECLUS

Use a smoother to estimate the (local) density of the given
dataset

A cluster is loosely defined as a region surrounding a local
maximum of the probability density function
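
The idea can be illustrated with a simplified mode-seeking sketch in Python. It is not the algorithm implemented in proc MODECLUS: the density is estimated by counting the points within a radius R (a uniform kernel), and each point is then linked to the neighbour with the highest estimated density until a local maximum is reached. The function name mode_seeking_clusters, the tie-breaking rule and the synthetic data are assumptions made for the example.

```python
import numpy as np

def mode_seeking_clusters(points, radius):
    """Assign each point to the local density maximum reached by moving uphill."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = dist <= radius            # every point is its own neighbour
    density = neighbours.sum(axis=1)       # unnormalised uniform-kernel estimate
    # Strict ranking by density; ties are broken in favour of the lower index.
    order = np.lexsort((-np.arange(n), density))
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)
    # Link each point to its highest-ranked neighbour (possibly itself).
    parent = np.empty(n, dtype=int)
    for i in range(n):
        candidates = np.flatnonzero(neighbours[i])
        parent[i] = candidates[rank[candidates].argmax()]
    # Follow the links until every point has reached a local maximum.
    root = parent.copy()
    while not np.array_equal(parent[root], root):
        root = parent[root]
    _, labels = np.unique(root, return_inverse=True)
    return labels, density

# Synthetic data with two dense regions (not the water-sample data).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([10.0, 10.0], 0.3, size=(50, 2)),
])
n_clusters = {}
for radius in (0.1, 3.0, 100.0):
    labels, _ = mode_seeking_clusters(data, radius)
    n_clusters[radius] = len(np.unique(labels))
print(n_clusters)
```

In this sketch a very small radius leaves many isolated local maxima, while a radius larger than the whole dataset merges everything into one cluster, so the smoothing parameter has to be tuned by repeated runs.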
Identification of water samples with incorrect total nitrogen levels
- proc MODECLUS, R = 1000
[Figure: scatter plot of Tot_N (ps), mg/l, against Tot_N (Kj), mg/l, with smoothing parameter R = 1000; points marked as Cluster 1 to Cluster 5 or other clusters]
What will happen if R is increased?
Identification of water samples with incorrect total nitrogen levels
- proc MODECLUS, R = 4000
[Figure: scatter plot of Tot_N (ps), mg/l, against Tot_N (Kj), mg/l, with smoothing parameter R = 4000; points marked as Cluster 1 or Cluster 2]
Identification of water samples with incorrect total nitrogen levels
- proc MODECLUS, method 6
[Figure: scatter plot of total nitrogen (Kjeldahl) against total nitrogen (persulfate digestion); points marked as Cluster 1 to Cluster 5, Clusters 6 - 18, or no cluster assigned]
Why did the clustering fail?
Limitations of density-based clustering
1. Difficult to control (requires repeated runs)
2. Collapses in high dimensions
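
One way to read the second limitation is that a fixed-radius density estimate runs out of neighbours as the dimension grows. The sketch below, which uses synthetic uniform data in the unit cube and a hypothetical helper mean_neighbour_count, counts how many other points fall within a fixed radius of each point for several dimensions. The sample size and radius are arbitrary choices for the illustration.

```python
import numpy as np

def mean_neighbour_count(dim, n_points=200, radius=0.5, seed=0):
    """Average number of other points within `radius` of a point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))   # uniform in the unit cube
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    counts = (dist <= radius).sum(axis=1) - 1    # do not count the point itself
    return counts.mean()

mean_counts = {dim: mean_neighbour_count(dim) for dim in (1, 2, 10, 50)}
for dim, value in mean_counts.items():
    print(dim, value)
```

With the same sample size and radius, most points have many neighbours in one or two dimensions but almost none in ten or more, so the local density estimate carries little information unless the sample is very large.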
Strength of density-based clustering
Given a sufficiently large sample, nonparametric
density-based clustering methods are capable of
detecting clusters of unequal size and dispersion
and with highly irregular shapes
Identification of water samples with incorrect total nitrogen levels
- transformed data
[Figure: scatter plot of Total N (ps) - Total N (Kj) against Total N (Kj)]
Identification of water samples with incorrect total nitrogen levels
- proc MODECLUS, R = 2000, transformed data
[Figure: scatter plot of Total N (ps) - Total N (Kj) against Total N (Kj); points marked as Cluster 1, Cluster 2 or Clusters 3-6]
Preprocessing
1. Standardization
2. Linear transformation
3. Dimension reduction
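
The three preprocessing steps can be sketched with NumPy as follows. The data are synthetic paired measurements, not the water samples from the lecture, and the variable names are assumptions. The linear transformation maps (Kj, ps) to (Kj, ps - Kj), the same kind of transformation as in the transformed-data slides, and the dimension reduction is done here by principal components, which is only one of several possible choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical paired measurements: ps is roughly equal to kj plus noise.
kj = rng.uniform(0.0, 20000.0, size=200)
ps = kj + rng.normal(0.0, 500.0, size=200)
X = np.column_stack([kj, ps])

# 1. Standardization: zero mean and unit standard deviation per variable.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Linear transformation: (kj, ps) -> (kj, ps - kj).
A = np.array([[1.0, -1.0],
              [0.0,  1.0]])
T = X @ A

# 3. Dimension reduction: principal components via the SVD of the centred data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:1].T                  # keep the first component only
explained = s**2 / (s**2).sum()         # share of variance per component
print(explained)
```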
Postprocessing
1. Split a cluster
• Usually, the cluster with the largest SSE is split
2. Introduce a new cluster centroid
• Often the point that is farthest from any cluster center is chosen
3. Disperse a cluster
• Remove one centroid and reassign the points to other clusters
4. Merge two clusters
• Typically, the clusters with the closest centroids are chosen
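
The selection rules in this list can be sketched as small helper functions. The names cluster_sse, closest_centroid_pair and farthest_point are hypothetical, the toy data are made up, and the sketch only picks the candidates; it does not carry out the split, merge or reassignment.

```python
import numpy as np

def cluster_sse(points, labels, centroids):
    """Sum of squared distances to the centroid, one value per cluster."""
    return np.array([((points[labels == j] - c) ** 2).sum()
                     for j, c in enumerate(centroids)])

def closest_centroid_pair(centroids):
    """Indices (i, j), i < j, of the two centroids that are closest together."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # ignore the distance of a centroid to itself
    i, j = np.unravel_index(d.argmin(), d.shape)
    return int(i), int(j)

def farthest_point(points, centroids):
    """Index of the point whose nearest centroid is farthest away."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return int(d.min(axis=1).argmax())

# Toy example with three clusters; the third one is much more spread out.
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [5.0, 5.0], [5.0, 6.0],
                   [20.0, 20.0], [30.0, 30.0]])
labels = np.array([0, 0, 1, 1, 2, 2])
centroids = np.array([points[labels == j].mean(axis=0) for j in range(3)])

sse = cluster_sse(points, labels, centroids)
split_candidate = int(sse.argmax())               # cluster with the largest SSE
merge_candidates = closest_centroid_pair(centroids)
new_centroid_candidate = farthest_point(points, centroids)
print(sse, split_candidate, merge_candidates, new_centroid_candidate)
```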
Profiling website visitors
1. A total of 296 pages at a Microsoft website are grouped into 13 homogeneous categories
• Initial
• Support
• Entertainment
• Office
• Windows
• Othersoft
• Download
• …..
2. For each of 32711 visitors we have recorded how many times they have visited the different categories of pages
3. We would like to make a behavioural segmentation of the users (a cluster analysis) that can be used in future marketing decisions
Profiling website visitors
- the dataset
[Table: number of visits per page category for 21 visitors (client_code 10001 to 10021). Columns: client_code, initial, help, entertainment, office, windows, othersft, download, otherint, development, hardware, business, information, area]
Why is it necessary to group the pages into categories?
Profiling website visitors
- 10-means clustering
Profiling website visitors
- cluster proximities
Profiling website visitors
- profiles
Profiling website visitors
- Kohonen Map of cluster frequencies
Profiling website visitors
- Kohonen Maps of means by variable and grid cell
Characteristics of Kohonen maps

The centroids vary smoothly over the map
• The clusters having unusually large (or small) values of a given variable tend to form connected spatial patterns

Clusters with similar centroids need not be close to
each other in a Kohonen map

The sizes of the clusters in Kohonen maps tend to
be less variable than those obtained by K-means
clustering
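
A Kohonen map (self-organizing map) can be sketched as follows. This is a minimal online version with a Gaussian neighbourhood on a rectangular grid, written for illustration only; it is not the software used to produce the maps in the lecture. The function name train_som, the decay schedules for the learning rate and the neighbourhood width, and the synthetic data are assumptions.

```python
import numpy as np

def train_som(data, grid_shape=(4, 4), n_iter=2000, seed=0):
    """Train a small Kohonen map; returns centroids of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid position of every cell, used to measure closeness on the map.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Assumption: the centroids start at randomly chosen data points.
    weights = data[rng.integers(len(data), size=rows * cols)].astype(float)
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                      # learning rate decays to zero
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # neighbourhood width shrinks
        x = data[rng.integers(len(data))]
        # Best matching unit: the cell whose centroid is nearest to x.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Cells close to the winner on the grid are pulled towards x as well.
        grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights.reshape(rows, cols, -1)

# Synthetic data in the unit square (not the website data).
rng = np.random.default_rng(1)
data = rng.uniform(size=(500, 2))
som = train_som(data)

# Cluster frequencies: how many observations fall in each grid cell.
flat = som.reshape(-1, som.shape[-1])
d2 = ((data[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
frequencies = np.bincount(d2.argmin(axis=1), minlength=len(flat)).reshape(som.shape[:2])
print(frequencies)
```

Because every update also moves the centroids of neighbouring grid cells, adjacent cells tend to end up with similar centroids, which is what makes the centroids vary smoothly over the map.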