Cluster Analysis - Federal University of Rio de Janeiro
Clustering: Introduction
Adriano Joaquim de O Cruz ©2002
NCE/UFRJ
[email protected]
Introduction
What is cluster analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects.
The class labels are not known in advance.
Classification, by contrast, separates objects into classes whose labels are known.
What is cluster analysis? cont.
Clustering is a form of learning by observation, whereas neural networks learn by example.
It is a form of unsupervised learning.
Applications
In business it helps to discover distinct groups of customers.
In data mining it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.
Pre-processing step for classification.
Pattern recognition.
Requirements
Scalability: work with large databases.
Ability to deal with different types of attributes (not only interval-based data).
Clusters of arbitrary shape, not only spherical.
Minimal requirements for domain knowledge.
Ability to deal with noisy data.
Requirements cont.
Insensitivity to the order of input records.
Ability to work with samples of high dimensionality.
Constraint-based clustering.
Interpretability and usability: results should be easily interpretable.
Sensitivity to Input Order
Some algorithms are sensitive to the order of the input data.
The leader algorithm is an example (a sketch follows below).
Ellipse: 2 1 3 5 4 6; Triangle: 1 2 6 4 5 3 (two input orders from the original figure, which yield different clusterings).
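The slides give no code, so here is a minimal Python sketch of the usual single-pass leader algorithm; the threshold and the sample points are illustrative only:

```python
import math

def leader_clustering(points, threshold):
    """Single-pass leader algorithm: the first point becomes a leader;
    each later point joins the nearest leader within `threshold`,
    otherwise it founds a new cluster and becomes its leader."""
    leaders, clusters = [], []
    for p in points:
        best, best_d = None, float("inf")
        for i, leader in enumerate(leaders):
            d = math.dist(p, leader)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= threshold:
            clusters[best].append(p)
        else:
            leaders.append(p)
            clusters.append([p])
    return clusters

pts = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]
print(leader_clustering(pts, threshold=2.0))
# [[(0.0, 0.0), (1.5, 0.0)], [(3.0, 0.0)]]
print(leader_clustering([pts[1], pts[0], pts[2]], threshold=2.0))
# [[(1.5, 0.0), (0.0, 0.0), (3.0, 0.0)]]  same data, different clustering
```

Feeding the same points in a different order changes which points become leaders, which is exactly the order sensitivity the slide illustrates.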
Clustering Techniques
Heuristic Clustering Techniques
Incomplete or heuristic clustering: geometrical methods or projection techniques.
Dimension reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
Heuristic methods based on visualisation are then used to determine the clusters.
Deterministic Crisp Clustering
Each datum is assigned to exactly one cluster.
Each cluster partition defines an ordinary partition of the data set.
Overlapping Crisp Clustering
Each datum is assigned to at least one cluster.
Elements may belong to more than one cluster, to various degrees.
Probabilistic Clustering
For each element, a probability distribution over the clusters is determined.
The distribution specifies the probability with which a datum is assigned to a cluster.
If the probabilities are interpreted as degrees of membership, then these are fuzzy clustering techniques.
Possibilistic Clustering
Degrees of membership or possibility
indicate to what extent a datum belongs
to the clusters.
Possibilistic cluster analysis drops the
constraint that the sum of memberships
of each datum to all clusters is equal to
one.
Hierarchical Clustering
Descending techniques: they divide the
data into more fine-grained classes.
Ascending techniques: they combine
small classes into more coarse-grained
ones.
Objective Function Clustering
An objective function assigns to each cluster partition a value that has to be optimised.
Clustering then becomes strictly an optimisation problem.
Data Types
Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.
Binary variables have only two states. Ex. smoker, fever, client, owner.
Nominal variables are a generalisation of binary variables to m states. Ex. map colour, marital status.
Data Types cont.
Ordinal variables are ordered nominal variables. Ex. Olympic medals, professional ranks.
Ratio-scaled variables have a non-linear scale. Ex. the growth of a bacterial population.
Interval-scaled variables
Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.
Interval-scaled variables depend on the units used.
The measurement unit can affect the analysis, so standardisation should be used.
Problems
Person   Age (yr)   Height (cm)
A        35         190
B        40         190
C        35         160
D        40         160

With height in centimetres the height differences dominate the distances, grouping {A, B} and {C, D}; with height in metres the age differences dominate instead, grouping {A, C} and {B, D}.
Standardisation
Standardisation converts original measurements to unitless values.
It attempts to give all variables equal weight.
It is useful when there is no prior knowledge of the data.
Standardisation algorithm
Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
The transformed scores have a mean of zero and a standard deviation of one.
They are useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.
Standardisation algorithm
Consider n values of a variable x.

Calculate the mean value:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Calculate the standard deviation:

$$\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Calculate the z-score of each value:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$
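A minimal Python sketch of this procedure; note that the slide uses the population standard deviation (division by n), which reproduces the z-heights column of the next slide up to rounding:

```python
import math

def z_scores(values):
    """Standardise a list of numbers to zero mean and unit standard
    deviation, using the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

heights = [137.16, 195.58, 170.18, 172.73, 116.84,
           162.56, 157.48, 142.24, 96.52]
print([round(z, 2) for z in z_scores(heights)])
# compare with the z-heights column in the next slide
```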
Z-scores example
Sample    Heights   Ages    z-heights   z-ages
1         137.16    10      -0.45       -0.61
2         195.58    25       1.58        0.39
3         170.18    55       0.70        2.39
4         172.73    32       0.79        0.86
5         116.84     8      -1.16       -0.74
6         162.56    11       0.43       -0.54
7         157.48     9       0.26       -0.67
8         142.24    15      -0.28       -0.27
9          96.52     7      -1.87       -0.81
Means     150.14    19.11    0           0
Std Dev    28.67    15.01    1           1
Real heights and ages charts
[Bar chart: raw heights and ages for each sample]
Z-scores for heights and ages
[Bar chart: z-scores of heights and ages for each sample]
Data chart
[Scatter plot of the raw data: ages vs. heights]
Data chart
[Scatter plot of the z-score data: ages vs. heights]
Similarities
Data Matrices
Data matrix: represents n objects with p
characteristics.
Ex. person = {age, sex, income, ...}
Dissimilarity matrix: represents a
collection of dissimilarities between all
pairs of objects.
Dissimilarities
Dissimilarity measures some form of
distance between objects.
Clustering algorithms use dissimilarities
to cluster data.
How can dissimilarities be measured?
How to calculate dissimilarities?
The most popular methods are based on the
distance between pairs of objects.
Minkowski distance:

$$d(x_i, x_k) = \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^q \right)^{1/q}$$

p is the number of characteristics.
q is the distance type: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
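A direct Python sketch of the formula (the sample vectors are illustrative):

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two feature vectors.
    q=2 gives the Euclidean distance, q=1 the Manhattan distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

a, b = (1.0, 2.0), (4.0, 6.0)
print(minkowski(a, b, q=2))  # 5.0 (Euclidean)
print(minkowski(a, b, q=1))  # 7.0 (Manhattan)
```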
Similarities
It is also possible to work with similarities s(xi, xj), where
0 <= s(xi, xj) <= 1
s(xi, xi) = 1
s(xi, xj) = s(xj, xi)
It is then possible to define d(xi, xj) = 1 - s(xi, xj).
Distances
Euclidean (q = 2) and Manhattan (q = 1) distances from each sample to the mean, computed on the raw data and on the z-scores:

Sample    Heights   Ages   Z-heights  Z-ages   Euclidean  Manhattan  Euclidean (z)  Manhattan (z)
1         137.16    10     -0.45      -0.61    15.8613    22.0944    0.7574         1.0599
2         195.58    25      1.58       0.39    45.8167    51.3256    1.6325         1.9770
3         170.18    55      0.70       2.39    41.1033    55.9256    2.4915         3.0903
4         172.73    32      0.79       0.86    26.0054    35.4756    1.1654         1.6466
5         116.84     8     -1.16      -0.74    35.1080    44.4144    1.3774         1.9018
6         162.56    11      0.43      -0.54    14.8312    20.5278    0.6926         0.9735
7         157.48     9      0.26      -0.67    12.4924    17.4478    0.7207         0.9296
8         142.24    15     -0.28      -0.27     8.9086    12.0144    0.3886         0.5496
9          96.52     7     -1.87      -0.81    54.9740    65.7344    2.0368         2.6771
Means     150.14    19.11   0.0000     0.0000
Std Dev    28.67    15.01   1.0000     1.0000
Dissimilarities
There are other ways to obtain dissimilarities, so we no longer speak of distances.
Basically, dissimilarities are nonnegative numbers d(i, j) that are small (close to 0) when i and j are similar.
Pearson
Pearson product-moment correlation
between variables f and g
Coefficients lie between –1 and +1
$$R(f,g) = \frac{\sum_{i=1}^{n} (x_{if} - m_f)(x_{ig} - m_g)}{\sqrt{\sum_{i=1}^{n} (x_{if} - m_f)^2} \, \sqrt{\sum_{i=1}^{n} (x_{ig} - m_g)^2}}$$

where m_f and m_g are the means of variables f and g.
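A minimal Python sketch of the coefficient; applied to the Weight and Height columns of the numerical example below, it reproduces the 0.957 entry of the correlation table:

```python
import math

def pearson(f, g):
    """Pearson product-moment correlation between two variables."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    num = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    den = math.sqrt(sum((a - mf) ** 2 for a in f)) * \
          math.sqrt(sum((b - mg) ** 2 for b in g))
    return num / den

weight = [15, 49, 13, 45, 85, 66, 12, 10]
height = [95, 156, 95, 160, 178, 176, 90, 78]
print(round(pearson(weight, height), 3))  # 0.957
```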
Pearson - cont
A correlation of +1 means that there is a
perfect positive linear relationship
between variables.
A correlation of -1 means that there is a
perfect negative linear relationship
between variables.
A correlation of 0 means there is no
linear relationship between the two
variables.
Pearson - ex
ryz = 0.9861; ryw = -0.9551; ryr = 0.2770 (correlations between the example series y, z, w and r shown in the original figure).
Correlation and dissimilarities 1
d(f,g) = (1 - R(f,g)) / 2    (1)
Variables with a high positive correlation
(+1) receive a dissimilarity close to 0
Variables with strongly negative
correlation will be considered very
dissimilar
Correlation and dissimilarities 2
d(f,g) = 1 - |R(f,g)|    (2)
Variables with a strong positive (+1) or strong negative (-1) correlation receive a dissimilarity close to 0.
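Both conversions as small Python functions; the printed values can be checked against the Diss (1) and Diss (2) tables below, up to rounding:

```python
def diss_signed(r):
    """Formula (1): strongly negative correlation -> maximal dissimilarity."""
    return (1 - r) / 2

def diss_absolute(r):
    """Formula (2): strong correlation of either sign -> dissimilarity near 0."""
    return 1 - abs(r)

for r in (0.957, -0.036, -0.953):
    print(round(diss_signed(r), 3), round(diss_absolute(r), 3))
```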
Numerical Example
Name    Weight   Height   Month   Year
Ilan    15       95       1       82
Jack    49       156      5       55
Kim     13       95       11      81
Lieve   45       160      7       56
Leon    85       178      6       48
Peter   66       176      6       56
Talia   12       90       12      83
Tina    10       78       1       84
Numerical Example 1
Correlation matrix (lower triangle):

Corr      Weight   Height   Month   Year
Weight    1
Height    0.957    1
Month     -0.036   0.021    1
Year      -0.953   -0.985   0.013   1

Dissimilarities using formula (1):

Diss (1)  Weight   Height   Month   Year
Weight    0
Height    0.021    0
Month     0.518    0.489    0
Year      0.977    0.992    0.493   0

Dissimilarities using formula (2):

Diss (2)  Weight   Height   Month   Year
Weight    0
Height    0.043    0
Month     0.964    0.979    0
Year      0.047    0.015    0.987   0
Binary Variables
Binary variables have only two states.
States can be symmetric or asymmetric.
Binary variables are symmetric if both
states are equally valuable. Ex. gender
When the states are not equally
important the variable is asymmetric.
Ex. disease tests (1-positive; 0-negative)
Contingency tables
Consider objects i and j, each described by p binary variables:
q variables are equal to 1 on both i and j
r variables are 1 on object i and 0 on object j
s variables are 0 on object i and 1 on object j
t variables are equal to 0 on both i and j

                   Object j
                   1       0       Sum
Object i    1      q       r       q+r
            0      s       t       s+t
            Sum    q+s     r+t     p
Symmetric Variables
Dissimilarity based on symmetric variables is invariant: the result does not change if the coding of the two states is swapped.
Simple matching dissimilarity coefficient:

$$d(x_i, x_j) = \frac{r+s}{q+r+s+t}$$
Symmetric Variables
Dissimilarity:

$$d(x_i, x_j) = \frac{r+s}{q+r+s+t}$$

Similarity:

$$s(x_i, x_j) = \frac{q+t}{q+r+s+t}$$
Asymmetric Variables
Similarity based on asymmetric variables is not invariant.
Two 1s (a positive match) are more important than two 0s, so the t term is dropped.
Jaccard coefficient (dissimilarity):

$$d(x_i, x_j) = \frac{r+s}{q+r+s}$$

Jaccard similarity:

$$s(x_i, x_j) = \frac{q}{q+r+s}$$
Computing dissimilarities
Name   fever   cough   Test1   Test2   Test3   Test4
Jack   Y       N       P       N       N       N
Mary   Y       N       P       N       P       N
Jim    Y       Y       N       N       N       N
Computing Dissimilarities
Comparing Jack and Mary (Y/P coded as 1, N as 0):

Variable   Jack   Mary   q (1,1)   r (1,0)   s (0,1)   t (0,0)
Fever      Y      Y      1         0         0         0
Cough      N      N      0         0         0         1
Test1      P      P      1         0         0         0
Test2      N      N      0         0         0         1
Test3      N      P      0         0         1         0
Test4      N      N      0         0         0         1
Totals                   2         0         1         3
Computing dissimilarities
$$d(\text{jack}, \text{mary}) = \frac{r+s}{q+r+s} = \frac{0+1}{2+0+1} = 0.33$$

$$d(\text{jack}, \text{jim}) = \frac{1+1}{1+1+1} = 0.67$$

$$d(\text{jim}, \text{mary}) = \frac{1+2}{1+1+2} = 0.75$$

Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.
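A minimal Python sketch of both coefficients over 0/1 vectors (Y and P coded as 1, N as 0); the Jaccard values reproduce the three dissimilarities above:

```python
def binary_counts(x, y):
    """Count q, r, s, t from two equal-length 0/1 vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return q, r, s, t

def simple_matching_diss(x, y):
    """Symmetric case: (r + s) / (q + r + s + t)."""
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def jaccard_diss(x, y):
    """Asymmetric case: the 0-0 matches (t) are ignored."""
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s)

#       fever cough T1 T2 T3 T4
jack = [1,    0,    1, 0, 0, 0]
mary = [1,    0,    1, 0, 1, 0]
jim  = [1,    1,    0, 0, 0, 0]
print(round(jaccard_diss(jack, mary), 2))  # 0.33
print(round(jaccard_diss(jack, jim), 2))   # 0.67
print(round(jaccard_diss(jim, mary), 2))   # 0.75
```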
Nominal Variables
A nominal variable is a generalisation of
the binary variable.
A nominal variable can take more than
two states
Ex. Marital status: married, single,
divorced
Each state can be represented by a
number or letter
There is no specific ordering
Computing dissimilarities
Consider two objects i and j, described by nominal variables, each with p characteristics:

$$d(i,j) = \frac{p - m}{p}$$

where m is the number of matches (variables on which i and j have the same state).
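A small illustration in Python (the records below are hypothetical):

```python
def nominal_diss(x, y):
    """Nominal dissimilarity: (p - m) / p, m = number of matching states."""
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

# hypothetical records: (marital status, map colour, blood type)
i = ("married", "red", "A")
j = ("single", "red", "A")
print(round(nominal_diss(i, j), 2))  # 0.33 (one mismatch out of three)
```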
Binarising nominal variables
A nominal variable can be encoded by creating a new binary variable for each state.
Example:
Marital status = {married, single, divorced}
Married: 1 = yes, 0 = no
Single: 1 = yes, 0 = no
Divorced: 1 = yes, 0 = no
Ex. Marital status = married gives
married = 1, single = 0, divorced = 0
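A tiny Python sketch of this one-binary-variable-per-state encoding:

```python
def binarise(value, states):
    """One binary variable per state (one-hot encoding)."""
    return {s: int(value == s) for s in states}

states = ("married", "single", "divorced")
print(binarise("married", states))
# {'married': 1, 'single': 0, 'divorced': 0}
```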
Ordinal variables
A discrete ordinal variable is similar to a
nominal variable, except that the states
are ordered in a meaningful sequence
Ex. Bronze, silver and gold medals
Ex. Assistant, associate, full member
Computing dissimilarities
Consider n objects defined by a set of ordinal variables.
f is one of these ordinal variables and has M_f states.
These states define the ranking r_f ∈ {1, …, M_f}.
Steps to calculate dissimilarities
Assume that the value of f for the ith object is x_if.
Replace each x_if by its corresponding rank r_if ∈ {1, …, M_f}.
Since the number of states differs from variable to variable, it is often necessary to map the ranks onto [0.0, 1.0] using

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

Dissimilarity can then be computed using the distance measures for interval-scaled variables.
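The mapping in Python, using the medal example with assumed ranks bronze = 1, silver = 2, gold = 3 (so M_f = 3):

```python
def ordinal_to_unit(rank, m_states):
    """Map a rank in {1, ..., M_f} onto the interval [0.0, 1.0]."""
    return (rank - 1) / (m_states - 1)

# bronze=1, silver=2, gold=3
print([ordinal_to_unit(r, 3) for r in (1, 2, 3)])  # [0.0, 0.5, 1.0]
```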
Ratio-scaled variables
Variables on a non-linear scale, such as
exponential
To compute dissimilarities there are three methods:
Treat them as interval-scaled. Not always a good choice.
Apply a transformation such as y = log(x) and treat the result as interval-scaled.
Treat them as ordinal data and use the ranks as interval-scaled values.
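For instance, a log transform turns an exponentially growing variable into one that is roughly interval-scaled (the counts below are hypothetical):

```python
import math

bacteria = [100, 1000, 10000, 100000]          # exponential growth
log_scaled = [math.log10(b) for b in bacteria]
print(log_scaled)  # [2.0, 3.0, 4.0, 5.0], now on a linear scale
```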
Variables of mixed types
One technique is to bring all variables onto a common scale, the interval [0.0, 1.0].
Suppose that the data set contains p variables of mixed type. The dissimilarity between i and j is

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
Variables of mixed types
The dissimilarity between i and j is

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where

$$\delta_{ij}^{(f)} = \begin{cases} 0 & \text{if } x_{if} \text{ or } x_{jf} \text{ is missing} \\ 0 & \text{if } x_{if} = x_{jf} = 0 \text{ and } f \text{ is asymmetric} \\ 1 & \text{otherwise} \end{cases}$$
Variables of mixed types cont
The contribution of each variable depends on its type.
f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; otherwise 1.
f is interval-based:

$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max(x_f) - \min(x_f)}$$

f is ordinal or ratio-scaled: compute ranks and treat them as interval-based.
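A minimal Python sketch of the weighted formula, restricted to nominal and interval variables; the missing-value rule is included, the asymmetric-binary rule is omitted for brevity, and the names, kinds and ranges are hypothetical:

```python
def mixed_diss(x, y, kinds, ranges):
    """Mixed-type dissimilarity: weighted average of per-variable terms.
    kinds[f] is 'nominal' or 'interval'; ranges[f] is max - min of
    variable f over the data set. Missing values are None."""
    num = den = 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:       # delta = 0: skip missing values
            continue
        if kinds[f] == "nominal":
            d = 0.0 if a == b else 1.0
        else:                            # interval-scaled
            d = abs(a - b) / ranges[f]
        num += d
        den += 1.0
    return num / den

kinds = ("nominal", "interval")
ranges = (None, 100.0)                   # hypothetical height range
print(mixed_diss(("married", 170.0), ("single", 160.0), kinds, ranges))
# (1.0 + 0.1) / 2 = 0.55
```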
Clustering Methods
Classification types
Clustering is an unsupervised method
Clustering Methods
Partitioning
Hierarchical
Density-based
Grid-based
Model-based
Partitioning Methods
Given n objects, k partitions are created.
Each partition must contain at least one element.
An iterative relocation technique is used to improve the partitioning.
Distance is the usual criterion.
Partitioning Methods cont.
They work well for finding spherical-shaped clusters.
They are not efficient on very large databases.
K-means: each cluster is represented by the mean value of the objects in the cluster (a sketch follows below).
K-medoids: each cluster is represented by an object near the centre of the cluster.
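A minimal sketch of plain k-means in Python, with random initial centres and a fixed number of iterations rather than a convergence test; the data points are illustrative:

```python
import math, random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate assignment to the nearest centre
    and recomputation of each centre as the cluster mean."""
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        # keep the old centre if a cluster ends up empty
        centres = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl
                   else centres[i]
                   for i, cl in enumerate(clusters)]
    return centres, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, clusters = kmeans(pts, k=2)
print(centres)  # roughly (1.33, 1.33) and (8.33, 8.33)
```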
Hierarchical Methods
Creates a hierarchical decomposition of the data set.
Agglomerative approaches start with each object forming a separate group, then merge objects or groups until all objects belong to one group or a termination condition occurs (a naive sketch follows below).
Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.
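A naive single-link agglomerative sketch in Python; the quadratic search for the closest pair is fine for tiny examples only:

```python
import math

def agglomerative(points, k):
    """Agglomerative clustering (single link): start with one cluster
    per object, repeatedly merge the two closest clusters until
    k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = (None, None, float("inf"))
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (6, 5)]
print(agglomerative(pts, k=2))
# [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
```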
Hierarchical Clustering cont
Definition of cluster proximity:
Min (single link): distance between the most similar pair (sensitive to noise).
Max (complete link): distance between the most dissimilar pair (tends to break large clusters).
Density-based methods
The method keeps growing clusters as long as the density in the neighbourhood exceeds some threshold.
Able to find clusters of arbitrary shape.
Grid-based methods
Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
Cells that contain more than a certain number of elements are treated as dense.
Dense cells are connected to form clusters.
Fast processing time, independent of the number of objects.
STING and CLIQUE are examples.
Model-based methods
Model-based methods hypothesise a
model for each cluster and find the best
fit of the data to the given model.
Statistical models
SOM networks
Partition methods
Given a database of n objects, a partition method organises them into k clusters (k ≤ n).
The methods try to minimise an objective function, such as distance.
Similar objects are "close" to each other.