Clustering Algorithms by Michael Smaili


1
Clustering Algorithms
Presented by
Michael Smaili
CS 157B
Spring 2009
3
Terminology

Cluster: a small group or bunch of something
Clustering: the unsupervised classification of patterns (observations, data items, or feature vectors) into clusters; also referred to as data clustering, cluster analysis, or typological analysis
Centroid: the center of a cluster
Distance Measure: determines how the similarity between two elements is calculated
Dendrogram: a tree diagram frequently used to illustrate the arrangement of the clusters
4
Applications

Data Mining

Pattern Recognition

Machine Learning

Image Analysis

Bioinformatics

and many more…
5
Simple Example
6
Idea Behind Unsupervised Learning

You walk into a bar. A stranger approaches and tells you:
"I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(ωi)'s."
So far, looks straightforward.
"I need a maximum likelihood estimate of the μi's."
No problem.
"There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)"
7
Classifications

Exclusive Clustering

Overlapping Clustering

Hierarchical Clustering

Probabilistic Clustering
8
Exclusive Clustering

Data are grouped in an exclusive way, so that if a particular data point belongs to one cluster, it cannot be included in any other cluster. On a two-dimensional plane, the separation of points can be achieved by a straight line.
9
Overlapping Clustering

Uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. A single data point can therefore belong to multiple clusters.
10
Hierarchical Clustering

Builds (agglomerative), or breaks up (divisive), a
hierarchy of clusters. Agglomerative algorithms
begin at the leaves of the tree, whereas divisive
algorithms begin at the root.
Agglomerative Clustering
11
Probabilistic Clustering

A model-based approach that attempts to optimize the fit between the data and the model using probabilistic methods. Each cluster can be represented by a parametric distribution, such as a Gaussian (continuous) or a Poisson (discrete), and the entire data set is therefore modeled by a mixture of these distributions.
Mixture of Multivariate Gaussians
12
Distance Measure

Common Measures
 Euclidean distance
 Manhattan distance
 Maximum norm
 Mahalanobis distance
 Hamming distance
 Minkowski distance (higher dimensional data)
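The slides list these measures without formulas; as a rough illustration only, a minimal Python/NumPy sketch of a few of them might look like the following (function names are ad hoc, and Mahalanobis additionally needs the inverse covariance matrix of the data):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def maximum_norm(a, b):
    return np.max(np.abs(a - b))

def minkowski(a, b, p=3):
    # p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def mahalanobis(a, b, cov_inv):
    d = a - b
    return np.sqrt(d @ cov_inv @ d)

def hamming(a, b):
    # number of positions at which two equal-length sequences differ
    return sum(x != y for x, y in zip(a, b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), maximum_norm(a, b), minkowski(a, b))
print(hamming("karolin", "kathrin"))
```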
13
4 Most Commonly Used Algorithms

K-means

Fuzzy C-means

Hierarchical

Mixture of Gaussians
14
K-means

An Exclusive Clustering algorithm whose steps are
as follows:
1) Let k be the number of clusters
2) Randomly generate k clusters and determine the
cluster centers, or generate k random points as
temporary cluster centers
3) Assign each point to the nearest cluster center
4) Recompute the new cluster centers
5) Repeat steps 3 and 4 until the centroids no longer
move
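As a minimal sketch (not part of the original slides), the steps above could be implemented with NumPy roughly as follows:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following steps 1-5 above. X is an n x d array."""
    rng = np.random.default_rng(seed)
    # Step 2: generate k random points as temporary cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the new cluster centers
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties out
                new_centers[j] = members.mean(axis=0)
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```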
15
K-means

Suppose: n vectors x1, x2, ..., xn which fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i.
 Make initial guesses for the means m1, m2, ..., mk
 Until there are no changes in any mean:
  Use the estimated means to classify the samples into clusters
  For every cluster i, replace mi with the mean of all of the samples for cluster i
Sample points m1 and m2 moving towards the centers of the two clusters
16
Fuzzy C-means

An Overlapping Clustering algorithm whose steps are as follows:
1) Initialize the membership matrix U = [uij], U(0)
2) At step k: calculate the center vectors C(k) = [cj] using U(k)
3) Update U(k) to U(k+1)
4) Repeat steps 2 and 3 until ||U(k+1) – U(k)|| < ε, where ε is the given sensitivity threshold
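The slide does not spell out the update formulas; a minimal sketch using the standard fuzzy c-means updates (centers as membership-weighted means, memberships from inverse distance ratios, with the fuzziness coefficient m introduced on the next slide) might look like this:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=0.01, max_iter=100, seed=0):
    """Minimal fuzzy c-means sketch; X is an n x d array, c the number of clusters."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: initialize the membership matrix U(0), rows summing to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: compute center vectors cj from the current memberships
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to the centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # Step 4: stop when ||U(k+1) - U(k)|| < eps
        if np.linalg.norm(U_new - U) < eps:
            return centers, U_new
        U = U_new
    return centers, U
```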
17
Fuzzy C-means

Suppose: 20 data points and 3 clusters are used to initialize the algorithm and to compute the U matrix. The color of each data point in the graph below is that of the nearest cluster. Assume a fuzziness coefficient m = 2 and ε = 0.3.
Initial graph
Final condition reached after 8 steps
18
Fuzzy C-means

Can we do better? Yes, but at the cost of more computation. Assume the same conditions as before, except ε = 0.01.
Result? The final condition is reached after 37 steps!
19
Hierarchical

A Hierarchical Clustering algorithm whose steps are as
follows (agglomerative):
1) Assign each item to a cluster
2) Find the closest (most similar) pair of clusters and merge
them into a single cluster
3) Compute distances (similarities) between the new cluster
and the old clusters using:
 single-linkage
 complete-linkage
 average-linkage
4) Repeat steps 2 and 3 until there is just a single cluster
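Not from the slides, but as a sketch of how these three linkage criteria behave in practice, SciPy's hierarchical-clustering routines (assuming SciPy is available) can be used like this:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five toy points: two tight pairs plus an outlier (made up for illustration)
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.0]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # steps 1-4 above
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```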
19
Hierarchical

There is also divisive hierarchical clustering, which does the reverse by starting with all objects in one cluster and subdividing it; however, divisive methods are generally not available and have rarely been applied.
20
Single-Linkage

Definitions: proximity matrix D=[d(i,j)] and L(k) is the level
of the kth clustering. d[(r),(s)] = proximity between clusters
(r) and (s). Follow these steps:
1) Begin with level L(0) = 0 and sequence number m = 0
2) Find a pair (r), (s), such that d[(r),(s)] = min d[(i),(j)] where the
minimum is over all pairs of clusters in the current clustering.
3) Increment the sequence number: m = m + 1 and merge clusters
(r) and (s) into a single cluster, L(m) = d[(r),(s)]
4) Update D by deleting the rows and columns for clusters (r) and
(s) and adding a row and column for the newly formed cluster.
The proximity between the new cluster (r,s) and an old cluster (k) is:
d[(k), (r,s)] = min{ d[(k), (r)], d[(k), (s)] }
5) Repeat steps 2 through 4 until all objects are in one cluster
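A from-scratch sketch of this procedure (a rough illustration, not code from the presentation) could look like the following, returning each merge together with its level L(m):

```python
import numpy as np

def single_linkage(D):
    """Naive single-linkage on a symmetric n x n proximity matrix D."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters (r), (s)
        r, s = np.unravel_index(np.argmin(D), D.shape)
        if r > s:
            r, s = s, r
        # Step 3: merge them; the level of this clustering is d[(r),(s)]
        merges.append((clusters[r], clusters[s], D[r, s]))
        # Step 4: proximity of the new cluster to an old cluster (k) is
        # min{ d[(k),(r)], d[(k),(s)] }
        D[r, :] = np.minimum(D[r, :], D[s, :])
        D[:, r] = D[r, :]
        D[r, r] = np.inf
        D = np.delete(np.delete(D, s, axis=0), s, axis=1)
        clusters[r] = clusters[r] + clusters[s]
        del clusters[s]
    return merges
```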
21
Single-Linkage

Suppose we want a hierarchical clustering of distances
between some Italian cities using single-linkage.
Nearest pair is MI and TO at distance 138 → merge into cluster "MI/TO" with its level L(MI/TO) = 138. Sequence number m = 1
22
Single-Linkage

Next we compute the distance from this new compound
object to all other objects. In single-linkage clustering
the rule is that the distance from the compound object
to another object is equal to the shortest distance from
any member of the cluster to the outside object.
Distance from "MI/TO" to RM is
chosen to be 564, which is the
distance from MI to RM
23
Single-Linkage
After some computation we have:
Finally, we merge the last 2 clusters
Hierarchical Tree
24
Mixture of Gaussians

A Probabilistic Clustering algorithm whose steps are as
follows:
1) Assume there are k components, where ωi represents the i-th component and has a mean vector μi with covariance matrix σ²I
25
Mixture of Gaussians
2) Select a random component i with probability P(ωi)
3) A datapoint can then be generated from N(μi, σ²I)
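As a rough sketch (not from the slides) of this generative process, with made-up means and mixing probabilities purely for illustration:

```python
import numpy as np

def sample_mixture(n, means, weights, sigma, seed=0):
    """Step 2: pick component i with probability P(wi); step 3: draw from N(mu_i, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    comps = rng.choice(len(means), size=n, p=weights)                         # step 2
    points = means[comps] + sigma * rng.standard_normal((n, means.shape[1]))  # step 3
    return points, comps

# Hypothetical parameters, chosen only for the sake of the example
X, z = sample_mixture(300, means=[[0, 0], [4, 4], [0, 5]], weights=[0.5, 0.3, 0.2], sigma=1.0)
```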
26
Mixture of Gaussians
4) In the general case, each component generates data from a Gaussian with mean μi and covariance matrix Σi, so a datapoint is drawn from N(μi, Σi)
5) Next, we are interested in the probability that class ωi would generate the datapoint x, given the means μ1, ..., μk:
P(x | ωi, μ1, ..., μk)
27
Mixture of Gaussians
6) Goal: maximize the probability of the data given the centers of the Gaussians

P(x | μ1, ..., μk) = Σi=1..k P(ωi) · P(x | ωi, μ1, ..., μk)

P(data | μ1, ..., μk) = ∏j=1..N Σi=1..k P(ωi) · P(xj | ωi, μ1, ..., μk)

The most popular and simplest algorithm used for this is the Expectation-Maximization (EM) algorithm
28
Expectation-Maximization (EM)

An iterative algorithm for finding maximum likelihood
estimates of parameters in probabilistic models whose
steps are as follows:
1) Initialize the distribution parameters
2) Estimate the Expected value of the unknown variables
3) Re-estimate the distribution parameters to Maximize the
likelihood of the data
4) Repeat steps 2 and 3 until convergence
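A minimal sketch of these steps for the spherical Gaussian mixture described earlier (shared variance σ²I); this is an illustrative implementation, not the presenter's code:

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM sketch for a mixture of spherical Gaussians: estimates means, P(wi), sigma^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1) initialize the distribution parameters
    mu = X[rng.choice(n, size=k, replace=False)]
    pi = np.full(k, 1.0 / k)
    sigma2 = X.var()
    for _ in range(n_iter):
        # 2) E-step: expected (soft) assignment of each point to each component
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(pi) - 0.5 * sq / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # 3) M-step: re-estimate the parameters to maximize the expected likelihood
        nk = r.sum(axis=0)
        mu = (r.T @ X) / nk[:, None]
        pi = nk / n
        sigma2 = (r * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)).sum() / (n * d)
    return mu, pi, sigma2
```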
29
Expectation-Maximization (EM)


Given probabilities of grades in a class:
P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ, where 0 ≤ μ ≤ 1/6
What is the maximum likelihood estimate of μ?
We begin with a guess for μ, iterating between E and M to improve our estimates of μ, a, and b (a = number of A's, b = number of B's)
Define μ(t) as the estimate of μ on the t-th iteration and b(t) as the estimate of b on the t-th iteration
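The transcript does not preserve the observed grade counts or the exact update equations. Assuming, as in the standard version of this example, that only the combined number of A's and B's (call it h) is observed along with c C's and d D's, the E and M steps work out as in the sketch below; the counts themselves are made up for illustration:

```python
# Hypothetical observed counts: h = A's and B's seen together, c = C's, d = D's
h, c, d = 14, 6, 10

mu = 0.05                       # initial guess, with 0 <= mu <= 1/6
for t in range(10):
    # E-step: expected number of B's, since P(B | A or B) = mu / (1/2 + mu)
    b = h * mu / (0.5 + mu)
    # M-step: maximizing (1/2)^a * mu^b * (2*mu)^c * (1/2 - 3*mu)^d over mu
    # gives mu = (b + c) / (6 * (b + c + d))
    mu = (b + c) / (6.0 * (b + c + d))
    print(t, round(b, 3), round(mu, 4))
```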
30
EM Convergence

Prob(data | μ) must increase or remain the same from one iteration to the next, but it can never exceed 1; therefore it must converge.
31
Mixture of Gaussians

Now let us see the effects of the probabilistic approach over several iterations of the EM algorithm.
32
Mixture of Gaussians

After the first iteration
33
Mixture of Gaussians

After the second iteration
34
Mixture of Gaussians

After the third iteration
35
Mixture of Gaussians

After the fourth iteration
36
Mixture of Gaussians

After the fifth iteration
37
Mixture of Gaussians

After the sixth iteration
38
Mixture of Gaussians

After the 20th iteration
39
Questions?
40
References

"A Tutorial on Clustering Algorithms", http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/
"Cluster Analysis", http://en.wikipedia.org/wiki/Data_clustering
"Clustering", http://www.cs.cmu.edu/afs/andrew/course/15/381f08/www/lectures/clustering.pdf
"Clustering with Gaussian Mixtures", http://autonlab.org/tutorials/gmm14.pdf
"Data Clustering, A Review", http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
"Finding Communities by Clustering a Graph into Overlapping Subgraphs", www.cs.rpi.edu/~magdon/talks/clusteringIADIS05.ppt