Clustering (Chap 7)
ALGORITHMS AND APPLICATIONS
Introduction
 Clustering is an important data mining task.
 Clustering makes it possible to “almost automatically”
summarize large amounts of high-dimensional data.
 Clustering (aka segmentation) is applied in many
domains: retail, web, image analysis, bioinformatics,
psychology, brain science, medical diagnosis etc.
Key Idea
 Given: (i) a data set D and (ii) a similarity function s
 Goal: break up D into groups such that items which are
similar to each other end up in the same group AND items
which are dissimilar to each other end up in different
groups.



Example: customers with similar buying patterns end up in the
same cluster.
People with similar “biological signals” end up in the same
cluster.
Regions with similar “weather” patterns.
Clustering Example: Iris Data
 Iris data set is a “very small” data set consisting of
attributes of 3 flower types.

Flower types: Iris_setosa; Iris_versicolor; Iris_virginica
4 attributes: sepal-length; sepal-width; petal-length; petal-width
Thus each flower is a “data vector” in a four-dimensional space.
The “fifth” dimension is a label.
Example: [5.1,3.5,1.4,0.2,iris-setosa]

It is hard to visualize in four dimensions!



Example Iris Data Set
Which is the “best” pair of attributes to distinguish the three flower types?
Attributes and Features
 In almost all cases the labels are not given but we can
acquire attributes and construct new features.
 Notice the terminology: attributes, features, variables,
dimensions – they are often used interchangeably.
 I prefer to distinguish between attributes and features.


We acquire attributes and construct features. Thus there are finitely
many attributes but infinitely many possible features.
For example, the Iris data set has four attributes. Let’s construct a new
feature: sepal-length/sepal-width, as in the sketch below.
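A minimal Matlab sketch of constructing this feature, assuming the four Iris attributes are the columns of a matrix called irisdata (the variable name is illustrative):

% Sketch: build the new feature sepal-length / sepal-width.
% Assumes irisdata is an n-by-4 matrix with columns
% sepal-length, sepal-width, petal-length, petal-width.
ratio = irisdata(:,1) ./ irisdata(:,2);   % one new feature value per flower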
Similarity and Distance Functions
 We have already seen the cosine similarity function.
 sim(x,y) = (x · y) / (||x|| ||y||)
 We can also define a “distance” function which in
some sense is like an “opposite” of similarity.

Entities which are highly similar to each other have a small
distance between them; while items which are less similar have
a large distance between them.
 Example: dist(x,y) = 1 – sim(x,y)
 “identical” objects: sim(x,y) = 1 ⇒ dist(x,y) = 0
 sim(x,y) = 0 ⇒ dist(x,y) = 1
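As a sketch, both quantities can be computed in Matlab for two example vectors (the vectors here are illustrative):

% Sketch: cosine similarity and the derived distance for two vectors.
x = [5.1 3.5 1.4 0.2];                    % e.g. the first Iris flower
y = [4.9 3.0 1.4 0.2];                    % another (made-up) flower
sim  = dot(x,y) / (norm(x) * norm(y));    % cosine similarity
dist = 1 - sim;                           % distance = 1 - similarity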
Euclidean Distance
 Take two data vectors d1 and d2
 D1 = (x1,x2,…,xn)
 D2 = (y1,y2,…yn)
 Then the Euclidean distance between D1 and D2 is
dist(D1, D2) = √( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
Euclidean Distance: Examples
 D1=(1,1,1); D2=(1,1,1)
 dist(D1,D2) = 0
 D1=(1,2,1); D2= (2,1,1)
 dist(D1, D2) = √( (1 − 2)² + (2 − 1)² + (1 − 1)² ) = √2 ≈ 1.41
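The second example can be checked with a short Matlab sketch:

% Sketch: Euclidean distance between D1 = (1,2,1) and D2 = (2,1,1).
D1 = [1 2 1];  D2 = [2 1 1];
d = sqrt(sum((D1 - D2).^2))               % same as norm(D1 - D2); gives sqrt(2) ≈ 1.41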
Simple Clustering:
One-dimensional/one-centroid
 Three data points: 1,2 and 6
 Want to summarize with one point
 The average of (1,2,6) = 9/3 = 3
 3 is called the centroid
High-Dimensional/one-centroid
 Let (2,4), (4,6) and (8,10) be three data points in two
dimensions.
 The average (centroid) is:
( (2 + 4 + 8)/3 , (4 + 6 + 10)/3 ) = ( 14/3 , 20/3 ) ≈ ( 4.67, 6.67 )
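In Matlab the centroid is just the column-wise mean; a sketch for the three points above:

% Sketch: centroid of the three 2-dimensional points.
X = [2 4; 4 6; 8 10];                     % one data point per row
c = mean(X, 1)                            % = [14/3 20/3] ≈ [4.67 6.67]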
Different interpretation of centroid
 Let x1, x2, x3 be data points. Define a function f as:
f(y) = (x1 − y)² + (x2 − y)² + (x3 − y)²
 Now, it turns out that the minimum of the function f
occurs at the centroid of the three points.
 This interpretation can easily be generalized across
dimensions and centroids.
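This can be checked numerically; a small sketch using the one-dimensional points 1, 2, 6 from the earlier slide:

% Sketch: the mean minimizes the sum of squared distances.
x = [1 2 6];
f = @(y) sum((x - y).^2);                 % sum of squared distances to y
f(mean(x))                                % at the centroid y = 3 the value is 14
f(mean(x) + 0.5)                          % any other y gives a larger value (14.75)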
High-dimension/many centroids
 Let X be a set of d-dimensional data points (data
vectors).
 Find k points (not necessarily in X; let’s call them
cluster centers), such that the sum of the squared
Euclidean distances of each point in X to its nearest
cluster center is minimized.
 This is called the K-means problem.
 The data points assigned to each cluster center end up
forming the k clusters.
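A sketch of the quantity being minimized, for a data matrix X (one point per row) and candidate centers C (one center per row); pdist2 is from the Statistics and Machine Learning Toolbox:

% Sketch: the K-means objective (sum of squared distances to nearest center).
D    = pdist2(X, C);                      % n-by-k matrix of point-to-center distances
cost = sum(min(D, [], 2).^2);             % squared distance to nearest center, summed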
K-means algorithm
Let C = initial k cluster centroids (often selected randomly)
Mark C as unstable
While <C is unstable>
  Assign all data points to their nearest centroid in C.
  Compute the centroid of the points assigned to each element of C.
  Update C to be the set of new centroids.
  Mark C as stable or unstable by comparing with the previous set of centroids.
End While
Complexity: O(nkdI)
n:num of points; k: num of clusters; d: dimension; I: num of iterations
Take away: complexity is linear in n.
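A minimal Matlab sketch of this loop (illustrative only, not the toolbox implementation; it assumes no cluster ever becomes empty and stops when the assignments no longer change):

% Sketch: a bare-bones K-means, to be saved as simple_kmeans.m.
function [idx, C] = simple_kmeans(X, k)
  C   = X(randperm(size(X,1), k), :);     % k random data points as initial centroids
  idx = zeros(size(X,1), 1);
  while true
    [~, newidx] = min(pdist2(X, C), [], 2);  % assign each point to its nearest centroid
    if isequal(newidx, idx), break; end      % stable: assignments did not change
    idx = newidx;
    for j = 1:k
      C(j,:) = mean(X(idx == j, :), 1);      % recompute each centroid
    end
  end
end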
An important question?
 Does the K-means algorithm “solve” the K-means
problem?
 Answer: only partially.
 In fact, nobody has yet been able to design an efficient
algorithm that “solves” the K-means problem exactly.

Part of the “P ≠ NP” problem.
 A million dollar prize if you have a “solution” or show
that the “solution” cannot exist.
Example: 2 Clusters
[Figure: four points A(−1,2), B(1,2), C(−1,−2), D(1,−2), symmetric about the origin (0,0); the rectangle they form is 2 wide and 4 tall.]
K-means Problem: Solution is (0,2) and (0,-2) and the clusters are {A,B} and
{C,D}
K-means Algorithm: Suppose the initial centroids are (−1,0) and (1,0); then
{A,C} and {B,D} end up as the two clusters (see the sketch below).
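The effect can be reproduced with the toolbox kmeans function by forcing the bad starting centroids through the 'Start' option (a sketch; requires the Statistics and Machine Learning Toolbox):

% Sketch: the four points from the slide and a deliberately bad initialization.
X   = [-1 2; 1 2; -1 -2; 1 -2];           % A, B, C, D
bad = [-1 0; 1 0];                        % initial centroids on the x-axis
idx = kmeans(X, 2, 'Start', bad)          % typically returns the {A,C} / {B,D} grouping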
K-means algorithm on Iris Data
Note: The K-means algorithm does not know the labels!
K-means in Matlab
 In the tutorial you will be working with Matlab.
 Matlab is a very easy-to-use data processing language.
 For example, the K-means algorithm can be run
using one line:
>> [IDX,C] = kmeans(irisdata,3);
The cluster labels (1,2,3) are stored in the variable
IDX, and the three cluster centroids are stored in C.
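If you do not have the irisdata matrix at hand, Matlab ships a copy of the Iris data with the Statistics and Machine Learning Toolbox; a sketch:

% Sketch: K-means on the built-in Fisher Iris data.
load fisheriris                           % loads meas (150-by-4) and species
[IDX, C] = kmeans(meas, 3);               % IDX: labels 1..3, C: the three centroids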
Summary
 Clustering is a way of compressing data into information
from which we can derive “knowledge”
 The first step is to map data into “data vectors” and then
use a “distance” function.
 We formalized clustering as a K-means problem and then
used the K-means algorithm to obtain a solution.
 The applications of clustering are almost limitless.

Almost every data mining exercise begins with clustering – to get an
initial understanding of the data.