Clustering (Chap 7)
ALGORITHMS AND APPLICATIONS
Introduction
Clustering is an important data mining task.
Clustering makes it possible to “almost automatically” summarize large amounts of high-dimensional data.
Clustering (aka segmentation) is applied in many domains: retail, web, image analysis, bioinformatics, psychology, brain science, medical diagnosis, etc.
Key Idea
Given: (i) a data set D and (ii) a similarity function s.
Goal: break up D into groups such that items which are similar to each other end up in the same group AND items which are dissimilar to each other end up in different groups.
Example: customers with similar buying patterns end up in the same cluster.
People with similar “biological signals” end up in the same cluster.
Regions with similar “weather” patterns end up in the same cluster.
Clustering Example: Iris Data
The Iris data set is a “very small” data set consisting of attributes of 3 flower types.
Flower types: Iris setosa; Iris versicolor; Iris virginica.
4 attributes: sepal-length; sepal-width; petal-length; petal-width.
Thus each flower is a “data vector” in a four-dimensional space.
The “fifth” dimension is a label.
Example: [5.1, 3.5, 1.4, 0.2, iris-setosa]
It is hard to visualize in four dimensions!
Example Iris Data Set
Which is the “best” pair of attributes to distinguish the three flower types?
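One way to explore this question is to plot the flowers against pairs of attributes, coloured by type. A minimal Matlab sketch, assuming the Fisher iris data that ships with the Statistics Toolbox (variables meas and species):

load fisheriris                              % meas: 150x4 attributes, species: flower labels
subplot(1,2,1)
gscatter(meas(:,1), meas(:,2), species)      % sepal-length vs sepal-width
xlabel('sepal-length'), ylabel('sepal-width')
subplot(1,2,2)
gscatter(meas(:,3), meas(:,4), species)      % petal-length vs petal-width
xlabel('petal-length'), ylabel('petal-width')

The petal pair typically separates the three types far more cleanly than the sepal pair.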
Attributes and Features
In almost all cases the labels are not given, but we can acquire attributes and construct new features.
Notice the terminology: attributes, features, variables, dimensions – they are often used interchangeably.
I prefer to distinguish between attributes and features.
We acquire attributes and construct features. Thus there are finitely many attributes but infinitely many possible features.
For example, the Iris data set has four attributes. Let's construct a new feature: sepal-length/sepal-width.
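As a sketch, assuming the four attributes are stored column-wise in a matrix irisdata in the order sepal-length, sepal-width, petal-length, petal-width, the new feature is one line of Matlab:

ratio = irisdata(:,1) ./ irisdata(:,2);      % new feature: sepal-length / sepal-width
irisdata2 = [irisdata, ratio];               % append it as a fifth column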
Similarity and Distance Functions
We have already seen the cosine similarity function.
sim(x,y) = (x · y) / (||x|| ||y||)
We can also define a “distance” function which, in some sense, is the “opposite” of similarity.
Entities which are highly similar to each other have a small distance between them, while items which are less similar have a large distance between them.
Example: dist(x,y) = 1 – sim(x,y)
“Identical” objects: sim(x,y) = 1, so dist(x,y) = 0.
Completely dissimilar objects: sim(x,y) = 0, so dist(x,y) = 1.
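As a small Matlab sketch (assuming x and y are non-zero numeric vectors of the same length):

x = [1 2 1];  y = [2 1 1];
s = dot(x, y) / (norm(x) * norm(y));   % cosine similarity: (x · y) / (||x|| ||y||)
d = 1 - s;                             % derived distance: 1 - sim(x,y)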
Euclidean Distance
Take two data vectors D1 and D2:
D1 = (x1, x2, …, xn)
D2 = (y1, y2, …, yn)
Then the Euclidean distance between D1 and D2 is
dist(D1, D2) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + … + (xn - yn)^2 )
Euclidean Distance: Examples
D1 = (1,1,1); D2 = (1,1,1)
dist(D1, D2) = 0
D1 = (1,2,1); D2 = (2,1,1)
dist(D1, D2) = sqrt( (1-2)^2 + (2-1)^2 + (1-1)^2 ) = sqrt(2) ≈ 1.41
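The same computation in Matlab, as a quick check:

D1 = [1 2 1];  D2 = [2 1 1];
dist = sqrt(sum((D1 - D2).^2))   % = sqrt(2), about 1.41; equivalently norm(D1 - D2)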
Simple Clustering:
One-dimensional/one-centroid
Three data points: 1,2 and 6
Want to summarize with one point
The average of (1,2,6) = 9/3 = 3
3 is called the centroid
High-Dimensional/one-centroid
Let (2,4), (4,6) and (8,10) be three data points in two
dimensions.
The average (centroid) is:
( (2+4+8)/3 , (4+6+10)/3 ) = ( 14/3 , 20/3 ) ≈ (4.67, 6.67)
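In Matlab the centroid is just the column-wise mean of the data matrix:

X = [2 4; 4 6; 8 10];       % one data point per row
centroid = mean(X, 1)       % returns [4.6667 6.6667]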
Different interpretation of centroid
Let x1, x2, x3 be data points. Define a function f as:
f(y) = (x1 - y)^2 + (x2 - y)^2 + (x3 - y)^2
Now, it turns out that the minimum of the function f
occurs at the centroid of the three points.
This interpretation can easily be generalized to higher dimensions and to many centroids.
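To see why (in one dimension), set the derivative of f to zero:

f'(y) = -2[ (x1 - y) + (x2 - y) + (x3 - y) ] = 0
=> 3y = x1 + x2 + x3
=> y = (x1 + x2 + x3)/3, i.e. the centroid.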
High-dimension/many centroids
Let X be a set of d-dimensional data points (data
vectors).
Find k points (not necessarily in X; let's call them cluster centers) such that the sum of the squared Euclidean distances from each point in X to its nearest cluster center is minimized.
This is called the K-means problem.
Assigning each data point to its nearest cluster center then yields the k clusters.
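Written as a formula, the K-means problem is:

minimize over cluster centers c1, …, ck:   sum over all x in X of   min_{j=1,…,k} ||x – cj||^2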
K-means algorithm
Let C = initial k cluster centroids (often selected randomly)
Mark C as unstable
While <C is unstable>
    Assign all data points to their nearest centroid in C.
    Compute the centroid of the points assigned to each element of C.
    Update C to be the set of new centroids.
    Mark C as stable or unstable by comparing the new centroids with the previous ones.
End While
Complexity: O(nkdI)
n: number of points; k: number of clusters; d: dimension; I: number of iterations
Takeaway: the complexity is linear in n.
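The loop above translates almost line for line into Matlab. A minimal sketch (not the built-in kmeans; assumes X is an n-by-d data matrix and a reasonably recent Matlab so that X - C(j,:) expands row-wise), saved as simple_kmeans.m:

function [idx, C] = simple_kmeans(X, k)
% Minimal K-means sketch: X is an n-by-d data matrix, k the number of clusters.
    n = size(X, 1);
    C = X(randperm(n, k), :);                     % initial centroids: k random data points
    prevC = inf(size(C));                         % force at least one iteration
    while any(abs(C(:) - prevC(:)) > 1e-9)        % "C is unstable": centroids still moving
        prevC = C;
        d2 = zeros(n, k);
        for j = 1:k
            d2(:, j) = sum((X - C(j, :)).^2, 2);  % squared distance of every point to centroid j
        end
        [~, idx] = min(d2, [], 2);                % assign each point to its nearest centroid
        for j = 1:k
            if any(idx == j)
                C(j, :) = mean(X(idx == j, :), 1);  % recompute centroid j
            end
        end
    end
end

Each pass through the while-loop costs O(nkd), which gives the O(nkdI) total above.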
An important question
Does the K-means algorithm “solve” the K-means problem?
Answer: only partially.
In fact, nobody has yet been able to design an efficient algorithm that exactly “solves” the K-means problem: the problem is NP-hard, so the question is tied to the “P vs NP” problem.
A million-dollar prize awaits anyone who produces such a “solution” or shows that the “solution” cannot exist.
Example: 2 Clusters
[Figure: four points A(-1,2), B(1,2), C(-1,-2) and D(1,-2) plotted around the origin (0,0).]
K-means Problem: the optimal centers are (0,2) and (0,-2), and the clusters are {A,B} and {C,D}.
K-means Algorithm: if the initial centroids are (-1,0) and (1,0), then {A,C} and {B,D} end up as the two clusters.
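This behaviour is easy to reproduce with the built-in kmeans, which accepts explicit starting centroids through its 'Start' option (Statistics Toolbox); a sketch:

X = [-1 2; 1 2; -1 -2; 1 -2];                      % points A, B, C, D
[idx1, C1] = kmeans(X, 2, 'Start', [0 2; 0 -2]);   % good start: recovers {A,B} and {C,D}
[idx2, C2] = kmeans(X, 2, 'Start', [-1 0; 1 0]);   % poor start: typically stuck with {A,C} and {B,D}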
K-means Algorithm on Iris Data
Note: The K-means algorithm does not know the labels!
K-means in Matlab
In the tutorial you will be working with Matlab.
Matlab is a very easy-to-use data processing language.
For example, the K-means algorithm can be run using one line:
>> [IDX, C] = kmeans(irisdata, 3);
The cluster labels (1, 2, 3) are stored in the variable IDX.
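To try this on the iris data that ships with the Statistics Toolbox, one possible session (crosstab tabulates the discovered clusters against the hidden labels):

load fisheriris                 % meas: 150x4 attribute matrix, species: the true labels
[IDX, C] = kmeans(meas, 3);     % cluster into 3 groups; the labels are never used
crosstab(IDX, species)          % rows: clusters 1..3, columns: flower types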
Summary
Clustering is a way of compressing data into information from which we can derive “knowledge”.
The first step is to map data into “data vectors” and then choose a “distance” function.
We formalized clustering as the K-means problem and then used the K-means algorithm to obtain an (approximate) solution.
The applications of clustering are almost limitless.
Almost every data mining exercise begins with clustering – to get an initial understanding of the data.