Tutorial On Fuzzy Clustering
Jan Jantzen
Technical University of Denmark
[email protected]
Abstract
Problem: To extract rules from data
Method: Fuzzy c-means
Results: e.g., finding cancer cells
Cluster (www.m-w.com)
A number of similar individuals that occur together as
a: two or more consecutive consonants or vowels in a segment of speech
b: a group of houses (...)
c: an aggregation of stars or galaxies that appear close together in the sky
and are gravitationally associated.
Cluster analysis (www.m-w.com)
A statistical classification technique for discovering whether the individuals
of a population fall into different groups by making quantitative comparisons
of multiple characteristics.
Vehicle Example
Vehicle   Top speed [km/h]   Colour   Air resistance   Weight [kg]
--------+------------------+--------+----------------+------------
V1        220                red      0.30             1300
V2        230                black    0.32             1400
V3        260                red      0.29             1500
V4        140                gray     0.35             800
V5        155                blue     0.33             950
V6        130                white    0.40             600
V7        100                black    0.50             3000
V8        105                red      0.60             2500
V9        110                gray     0.55             3500
Vehicle Clusters
[Figure: scatter plot of Weight [kg] (500 to 3500) against Top speed [km/h]
(100 to 300), with three clusters labelled Lorries, Sports cars, and Medium
market cars.]
Terminology
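[Figure: the vehicle plot (Weight [kg] against Top speed [km/h]) annotated
with the terms: object or data point (a single point), feature (each axis),
feature space (the plane spanned by the features), cluster (a group of
points), and label (a cluster name such as Lorries, Sports cars, or Medium
market cars).]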
Example: Classify cracked tiles
475 Hz   557 Hz   Ok?
-------+--------+----
0.958    0.003    Yes
1.043    0.001    Yes
1.907    0.003    Yes
0.780    0.002    Yes
0.579    0.001    Yes
0.003    0.105    No
0.001    1.748    No
0.014    1.839    No
0.007    1.021    No
0.004    0.214    No
Table 1: frequency intensities for ten tiles.
Tiles are made from clay moulded into the right shape, brushed, glazed, and
baked. Unfortunately, the baking may produce invisible cracks. Operators can
detect the cracks by hitting the tiles with a hammer, and in an automated system
the response is recorded with a microphone, filtered, Fourier transformed, and
normalised. A small set of data is given in TABLE 1 (adapted from MIT, 1997).
Algorithm: hard c-means (HCM)
(also known as k-means)
[Figure: tiles data, log(intensity) 557 Hz against log(intensity) 475 Hz;
o = whole tiles, * = cracked tiles, x = centres.]
Plot of tiles by frequencies (logarithms). The whole tiles (o) seem well
separated from the cracked tiles (*). The objective is to find the two
clusters.
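The plot uses logarithms of the intensities in Table 1. A minimal sketch of that transformation in Python could look as follows. The natural logarithm is an assumption (the slides do not state the base, although the axis range of -8 to 2 is consistent with it), and the names `TILES` and `log_features` are illustrative only.

```python
import math

# Table 1: (intensity at 475 Hz, intensity at 557 Hz, whole tile?)
TILES = [
    (0.958, 0.003, True), (1.043, 0.001, True), (1.907, 0.003, True),
    (0.780, 0.002, True), (0.579, 0.001, True), (0.003, 0.105, False),
    (0.001, 1.748, False), (0.014, 1.839, False), (0.007, 1.021, False),
    (0.004, 0.214, False),
]


def log_features(rows):
    """Return (log 475 Hz, log 557 Hz) pairs; natural log is an assumption."""
    return [(math.log(a), math.log(b)) for a, b, _ in rows]


# Print each tile with the plot symbol used in the slides.
for (x, y), (_, _, whole) in zip(log_features(TILES), TILES):
    print(f"{x:7.3f} {y:7.3f} {'o' if whole else '*'}")
```

After the transformation the whole tiles sit at high 475 Hz and low 557 Hz values, and the cracked tiles the other way round, which is why the two groups look well separated.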
[Figure: tiles data, log(intensity) 557 Hz against log(intensity) 475 Hz;
o = whole tiles, * = cracked tiles, x = centres.]
1. Place two cluster centres (x) at random.
2. Assign each data point (* and o) to the nearest cluster centre (x).
[Figure: tiles data, log(intensity) 557 Hz against log(intensity) 475 Hz;
o = whole tiles, * = cracked tiles, x = centres.]
1. Compute the new centre of each class.
2. Move the crosses (x).
[Figure: tiles data and centres (x), iteration 2.]
[Figure: tiles data and centres (x), iteration 3.]
[Figure: tiles data and centres (x), iteration 4.]
Iteration 4 (then stop, because no visible change)
Each data point belongs to the cluster defined by the nearest centre
M =
0.0000   1.0000
0.0000   1.0000
0.0000   1.0000
0.0000   1.0000
0.0000   1.0000
1.0000   0.0000
1.0000   0.0000
1.0000   0.0000
1.0000   0.0000
1.0000   0.0000
The membership matrix M:
1. The last five data points (rows) belong to the first cluster (column)
2. The first five data points (rows) belong to the second cluster (column)
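The hard c-means steps (place centres, assign each point to the nearest centre, move each centre to the mean of its points, repeat until nothing changes) can be sketched in Python as below. This is an illustrative sketch, not the author's implementation: the natural logarithm, the fixed starting centres (the slides place them at random), and all function names are assumptions.

```python
import math

# Table 1 intensities (475 Hz, 557 Hz); natural log is an assumption.
RAW = [(0.958, 0.003), (1.043, 0.001), (1.907, 0.003), (0.780, 0.002),
       (0.579, 0.001), (0.003, 0.105), (0.001, 1.748), (0.014, 1.839),
       (0.007, 1.021), (0.004, 0.214)]
POINTS = [(math.log(a), math.log(b)) for a, b in RAW]


def sq_dist(u, c):
    """Squared Euclidean distance between point u and centre c."""
    return sum((a - b) ** 2 for a, b in zip(u, c))


def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda i: sq_dist(u, centres[i]))
            for u in points]


def hard_c_means(points, centres, max_iter=100):
    """Return (centres, M) with M[k][i] = 1 if point k is in cluster i."""
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        labels = assign(points, centres)
        new_centres = []
        for i, old in enumerate(centres):
            members = [u for u, lab in zip(points, labels) if lab == i]
            if members:
                # New centre is the mean of the cluster's points.
                new_centres.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centres.append(old)  # leave an empty cluster's centre
        if new_centres == centres:  # no change: stop
            break
        centres = new_centres
    labels = assign(points, centres)
    M = [[1.0 if lab == i else 0.0 for i in range(len(centres))]
         for lab in labels]
    return centres, M


# Hypothetical starting centres, fixed so the run is repeatable.
final_centres, M = hard_c_means(POINTS, [(-4.0, -2.0), (-2.0, -4.0)])
for row in M:
    print(row)
```

With these starting centres the cracked tiles end up in the first column and the whole tiles in the second; a different start may swap the columns.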
Membership matrix M
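$$m_{ik} = \begin{cases} 1 & \text{if } \|u_k - c_i\|^2 \le \|u_k - c_j\|^2 \text{ for all } j \ne i \\ 0 & \text{otherwise} \end{cases}$$
Here $u_k$ is data point k, $c_i$ is cluster centre i, $c_j$ is cluster
centre j, and $\|u_k - c_i\|^2$ is the squared distance from the data point
to the cluster centre.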
c-partition
All clusters C together fill the whole universe U:
$$\bigcup_{i=1}^{c} C_i = U$$
Clusters do not overlap:
$$C_i \cap C_j = \emptyset \text{ for all } i \ne j$$
A cluster C is never empty and it is smaller than the whole universe U:
$$\emptyset \subset C_i \subset U \text{ for all } i$$
There must be at least 2 clusters in a c-partition and at most as many as the
number of data points K:
$$2 \le c \le K$$
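The conditions for a c-partition can be checked mechanically. The sketch below represents each cluster as a Python set of data point indices; the function name `is_c_partition` and the example clusters are illustrative assumptions, not part of the slides.

```python
from itertools import combinations


def is_c_partition(clusters, universe):
    """Check the c-partition conditions for a list of clusters."""
    clusters = [set(C) for C in clusters]
    universe = set(universe)
    c, K = len(clusters), len(universe)
    covers = set().union(*clusters) == universe          # union of C_i is U
    disjoint = all(a.isdisjoint(b)                       # C_i and C_j share nothing
                   for a, b in combinations(clusters, 2))
    proper = all(set() < C < universe for C in clusters)  # non-empty, smaller than U
    return covers and disjoint and proper and 2 <= c <= K


U = range(10)  # ten data points, numbered 0..9
print(is_c_partition([{5, 6, 7, 8, 9}, {0, 1, 2, 3, 4}], U))  # valid
print(is_c_partition([{4, 5, 6, 7, 8, 9}, {0, 1, 2, 3, 4}], U))  # overlap at 4
```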
Objective function
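Minimise the total sum of all squared distances from the data points to
their cluster centres
$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \left( \sum_{k,\, u_k \in C_i} \|u_k - c_i\|^2 \right)$$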
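As a worked example of the objective function, the sketch below evaluates J for the vehicle data, using top speed and weight as the two features. The grouping of V1 to V9 into the three clusters is an assumption read off the cluster plot, each centre is taken as the mean of its cluster, and the names used are illustrative.

```python
# (top speed [km/h], weight [kg]) taken from the vehicle table
VEHICLES = {
    "V1": (220, 1300), "V2": (230, 1400), "V3": (260, 1500),
    "V4": (140, 800), "V5": (155, 950), "V6": (130, 600),
    "V7": (100, 3000), "V8": (105, 2500), "V9": (110, 3500),
}
# Assumed grouping, read off the cluster plot
CLUSTERS = {
    "Sports cars": ["V1", "V2", "V3"],
    "Medium market cars": ["V4", "V5", "V6"],
    "Lorries": ["V7", "V8", "V9"],
}


def mean_point(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(x) / len(points) for x in zip(*points))


def sq_dist(u, c):
    """Squared Euclidean distance between point u and centre c."""
    return sum((a - b) ** 2 for a, b in zip(u, c))


def objective(clusters, centres):
    """J = sum over clusters of squared distances from members to centre."""
    return sum(sq_dist(VEHICLES[v], centres[name])
               for name, members in clusters.items() for v in members)


CENTRES = {name: mean_point([VEHICLES[v] for v in members])
           for name, members in CLUSTERS.items()}
J = objective(CLUSTERS, CENTRES)
print(J)
```

Because the features are not scaled, the weight differences (hundreds of kg) contribute far more to J than the speed differences (tens of km/h).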
Algorithm: fuzzy c-means (FCM)
[Figure: tiles data, log(intensity) 557 Hz against log(intensity) 475 Hz;
o = whole tiles, * = cracked tiles, x = centres.]
Each data point belongs to two clusters to different degrees
[Figure: tiles data, log(intensity) 557 Hz against log(intensity) 475 Hz;
o = whole tiles, * = cracked tiles, x = centres.]
1. Place two cluster centres.
2. Assign a fuzzy membership to each data point depending on distance.
[Figure: tiles data, log(intensity) 557 Hz against log(intensity) 475 Hz;
o = whole tiles, * = cracked tiles, x = centres.]
1. Compute the new centre of each class.
2. Move the crosses (x).
[Figure: tiles data and centres (x), iteration 2.]
[Figure: tiles data and centres (x), iteration 5.]
[Figure: tiles data and centres (x), iteration 10.]
[Figure: tiles data and centres (x), iteration 13.]
Iteration 13 (then stop, because no visible change)
Each data point belongs to the two clusters to a degree
M =
0.0025   0.9975
0.0091   0.9909
0.0129   0.9871
0.0001   0.9999
0.0107   0.9893
0.9393   0.0607
0.9638   0.0362
0.9574   0.0426
0.9906   0.0094
0.9807   0.0193
The membership matrix M:
1. The last five data points (rows) belong mostly to the first cluster (column)
2. The first five data points (rows) belong mostly to the second cluster (column)
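The fuzzy c-means loop can be sketched in Python as follows. The membership update uses the distance-ratio formula with fuzziness exponent q. The centre update (a mean weighted by the memberships raised to q) is the standard fuzzy c-means rule and is assumed here, since the slides only say to compute the new centre. The natural logarithm, q = 2, the fixed starting centres, and all names are also assumptions, so the numbers need not match the matrix above exactly.

```python
import math

# Table 1 intensities (475 Hz, 557 Hz); natural log is an assumption.
RAW = [(0.958, 0.003), (1.043, 0.001), (1.907, 0.003), (0.780, 0.002),
       (0.579, 0.001), (0.003, 0.105), (0.001, 1.748), (0.014, 1.839),
       (0.007, 1.021), (0.004, 0.214)]
POINTS = [(math.log(a), math.log(b)) for a, b in RAW]


def dist(u, c):
    """Euclidean distance between point u and centre c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, c)))


def memberships(points, centres, q):
    """M[k][i] = 1 / sum_j (d_ik / d_jk) ** (2 / (q - 1)), with q > 1."""
    p = 2.0 / (q - 1.0)
    M = []
    for u in points:
        d = [dist(u, c) for c in centres]
        zeros = [i for i, di in enumerate(d) if di == 0.0]
        if zeros:  # point sits on a centre: full membership there
            M.append([1.0 / len(zeros) if i in zeros else 0.0
                      for i in range(len(d))])
        else:
            M.append([1.0 / sum((di / dj) ** p for dj in d) for di in d])
    return M


def update_centres(points, M, q):
    """Means weighted by m_ik ** q (standard FCM update, assumed here)."""
    centres = []
    for i in range(len(M[0])):
        w = [row[i] ** q for row in M]
        total = sum(w)
        centres.append(tuple(
            sum(wk * u[dim] for wk, u in zip(w, points)) / total
            for dim in range(len(points[0]))))
    return centres


def fuzzy_c_means(points, centres, q=2.0, tol=1e-6, max_iter=100):
    """Alternate membership and centre updates until the centres settle."""
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        M = memberships(points, centres, q)
        new_centres = update_centres(points, M, q)
        shift = max(dist(a, b) for a, b in zip(new_centres, centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships(points, centres, q)


# Hypothetical starting centres, fixed so the run is repeatable.
final_centres, M = fuzzy_c_means(POINTS, [(-4.0, -2.0), (-2.0, -4.0)])
for row in M:
    print([round(m, 4) for m in row])
```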
Fuzzy membership matrix M
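$$m_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{2/(q-1)}}, \qquad d_{ik} = \|u_k - c_i\|$$
Here $m_{ik}$ is point k's membership of cluster i, $q$ is the fuzziness
exponent, $d_{ik}$ is the distance from point k to the current cluster centre
i, and $d_{jk}$ is the distance from point k to the other cluster centres j.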
Fuzzy membership matrix M
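$$m_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{2/(q-1)}} = \frac{1}{\left( \frac{d_{ik}}{d_{1k}} \right)^{2/(q-1)} + \left( \frac{d_{ik}}{d_{2k}} \right)^{2/(q-1)} + \cdots + \left( \frac{d_{ik}}{d_{ck}} \right)^{2/(q-1)}}$$
$$m_{ik} = \frac{\frac{1}{d_{ik}^{2/(q-1)}}}{\frac{1}{d_{1k}^{2/(q-1)}} + \frac{1}{d_{2k}^{2/(q-1)}} + \cdots + \frac{1}{d_{ck}^{2/(q-1)}}}$$
Gravitation to cluster i relative to total gravitation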
Electrical Analogy
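[Figure: circuit with voltage U and total current I; parallel resistors R_1
and R_2 carry the branch currents i_1 and i_2.]
$$U = RI$$
$$R = \frac{1}{\frac{1}{R_1} + \frac{1}{R_2} + \cdots + \frac{1}{R_c}}$$
$$\frac{\frac{1}{R_i}}{\frac{1}{R_1} + \frac{1}{R_2} + \cdots + \frac{1}{R_c}} = \frac{1}{R_i} R = \frac{1}{R_i} \frac{U}{I} = \frac{1}{I} i_i = \frac{i_i}{I}$$
Same form as $m_{ik}$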
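The analogy can be checked numerically: if each parallel resistor is given the resistance d ** (2 / (q - 1)), where d is the distance to a cluster centre, the share of the total current in each branch equals the fuzzy membership. The sketch below assumes two centres at distances 1 and 3 and q = 2; these values and the function names are hypothetical.

```python
def branch_current_fractions(resistances, voltage=1.0):
    """Fraction i_i / I of the total current through each parallel resistor."""
    currents = [voltage / r for r in resistances]  # i_i = U / R_i
    total = sum(currents)                          # I is the sum of branches
    return [i / total for i in currents]


def fuzzy_memberships(distances, q=2.0):
    """m_i = 1 / sum_j (d_i / d_j) ** (2 / (q - 1)); distances must be > 0."""
    p = 2.0 / (q - 1.0)
    return [1.0 / sum((di / dj) ** p for dj in distances) for di in distances]


distances = [1.0, 3.0]  # hypothetical distances to two cluster centres
q = 2.0
resistances = [d ** (2.0 / (q - 1.0)) for d in distances]
print(branch_current_fractions(resistances))
print(fuzzy_memberships(distances, q))
```

Both lines print the same pair of values: the nearer centre, like the smaller resistor, takes the larger share.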
Fuzzy Membership
[Figure: membership of a test point (vertical axis, 0 to 1) plotted against
position (horizontal axis, 1 to 5), with the data point and the cluster
centres marked; o is with q = 1.1, * is with q = 2.]
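The effect of the fuzziness exponent can be reproduced with a one-dimensional sweep. In the sketch below a test point moves from 1 to 5 past two cluster centres, and its membership of the first cluster is computed for q = 1.1 and q = 2. The centre positions 2 and 4 are hypothetical (the figure's actual positions are not recoverable from the transcript), as is the function name.

```python
def membership_of_first(x, centres, q):
    """Membership of 1-D test point x in the first cluster, for q > 1."""
    p = 2.0 / (q - 1.0)
    d = [abs(x - c) for c in centres]
    if d[0] == 0.0:                   # test point sits on the first centre
        return 1.0
    if any(dj == 0.0 for dj in d):    # test point sits on another centre
        return 0.0
    return 1.0 / sum((d[0] / dj) ** p for dj in d)


CENTRES = [2.0, 4.0]  # hypothetical centre positions
for q in (1.1, 2.0):
    curve = [round(membership_of_first(1.0 + 0.5 * n, CENTRES, q), 3)
             for n in range(9)]
    print(f"q = {q}: {curve}")
```

With q close to 1 the membership switches almost abruptly halfway between the centres, much like the hard classifier; with q = 2 the transition is gradual.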
Fuzzy c-partition
All clusters C together fill the whole universe U:
$$\bigcup_{i=1}^{c} C_i = U$$
Remark: The sum of memberships for a data point is 1, and the total for all
points is K.
Not valid, because clusters do overlap:
$$C_i \cap C_j = \emptyset \text{ for all } i \ne j$$
A cluster C is never empty and it is smaller than the whole universe U:
$$\emptyset \subset C_i \subset U \text{ for all } i$$
There must be at least 2 clusters in a c-partition and at most as many as the
number of data points K:
$$2 \le c \le K$$
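The remark about the membership sums can be verified on the fuzzy membership matrix M for the tiles (values copied from the fuzzy c-means result in these slides). The helper `harden` is an illustrative addition: it turns the fuzzy matrix into a hard one by keeping only the largest membership in each row.

```python
# Fuzzy membership matrix M for the ten tiles (rows) and two clusters (columns)
M = [
    [0.0025, 0.9975], [0.0091, 0.9909], [0.0129, 0.9871], [0.0001, 0.9999],
    [0.0107, 0.9893], [0.9393, 0.0607], [0.9638, 0.0362], [0.9574, 0.0426],
    [0.9906, 0.0094], [0.9807, 0.0193],
]


def harden(M):
    """Set the largest membership in each row to 1 and the rest to 0.

    Rows with tied maxima would get several ones; there are none here.
    """
    return [[1.0 if m == max(row) else 0.0 for m in row] for row in M]


row_sums = [sum(row) for row in M]   # each should be 1
total = sum(row_sums)                # should equal K, the number of points
overlap = all(m > 0.0 for row in M for m in row)  # every point is in both

print(row_sums)
print(total, len(M))
print(overlap)
print(harden(M))
```

Every data point has a non-zero membership of both clusters, so the clusters overlap, yet each row still sums to 1.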
Example: Classify cancer cells
Normal smear
Using a small brush, cotton stick, or wooden stick, a specimen is taken from
the uterine cervix and smeared onto a thin, rectangular glass plate, a slide.
The purpose of the smear screening is to diagnose pre-malignant cell changes
before they progress to cancer. The smear is stained using the Papanicolaou
method, hence the name Pap smear. Different characteristics have different
colours, which are easy to distinguish under a microscope. A cyto-technician
performs the screening under a microscope. It is time consuming and prone to
error, as each slide may contain up to 300,000 cells.
Severely dysplastic smear
Dysplastic cells have undergone precancerous changes. They generally have
longer and darker nuclei, and they have a tendency to cling together in large
clusters. Mildly dysplastic cells have enlarged and bright nuclei. Moderately
dysplastic cells have larger and darker nuclei. Severely dysplastic cells have
large, dark, and often oddly shaped nuclei. The cytoplasm is dark, and it is
relatively small.
Possible Features
Nucleus and cytoplasm area
Nucleus and cyto brightness
Nucleus shortest and longest diameter
Cyto shortest and longest diameter
Nucleus and cyto perimeter
Nucleus and cyto no of maxima
(...)
Classes are nonseparable
Hard Classifier (HCM)
[Figure: cell classes labelled Ok, light, moderate, and severe.]
A cell is either one or the other class defined by a colour.
Fuzzy Classifier (FCM)
[Figure: cell classes labelled Ok, light, moderate, and severe.]
A cell can belong to several classes to a degree, i.e., one column may have
several colours.
Function approximation
[Figure: Output1 (-1.5 to 1.5) plotted against Input (0 to 1).]
Curve fitting in a multi-dimensional space is also called function
approximation. Learning is equivalent to finding a function that best
fits the training data.
Approximation by fuzzy sets
[Figure: two panels over the range 0 to 1; the upper panel has a vertical
range of -2 to 2, the lower panel a vertical range of 0 to 1.]
Procedure to find a model
1. Acquire data
2. Select structure
3. Find clusters, generate model
4. Validate model
Conclusions
Compared to neural networks, fuzzy models can be interpreted by human beings
Applications: system identification, adaptive systems
Links
J. Jantzen: Neurofuzzy Modelling. Technical University of Denmark:
Oersted-DTU, Tech report no 98-H-874 (nfmod), 1998. URL
http://fuzzy.iau.dtu.dk/download/nfmod.pdf
PapSmear tutorial. URL http://fuzzy.iau.dtu.dk/smear/
U. Kaymak: Data Driven Fuzzy Modelling. PowerPoint, URL
http://fuzzy.iau.dtu.dk/tutor/ddfm.htm
Exercise: fuzzy clustering (Matlab)
Download and follow the instructions in this text file:
http://fuzzy.iau.dtu.dk/tutor/fcm/exerF5.txt
The exercise requires Matlab (no special toolboxes
are required)