Clustering
Instructor: Max Welling
ICS 178 Machine Learning & Data Mining
Unsupervised Learning
• In supervised learning we were given attributes & targets (e.g. class labels).
In unsupervised learning we are only given attributes.
• Our task is to discover structure in the data.
• Example: the data may be structured in clusters:
Is this a good clustering?
Why Discover Structure ?
• Often, the result of an unsupervised learning algorithm is a new representation
for the same data. This new representation should be more meaningful
and could be used for further processing (e.g. classification).
• Clustering: The new representation is now given by the label of a
cluster to which the data-point belongs.
This tells us which data-cases are similar to each other.
• The new representation is smaller and hence more convenient computationally.
• Clustering: Each data-case is now encoded by its cluster label. This is a lot
cheaper than its attribute values.
• Collaborative filtering (CF): we can group the users into user communities and/or the movies into
movie genres. If we need to predict a rating, we simply pick the average
rating in the group.
Clustering: K-means
• We iterate two operations:
1. Update the assignment of data-cases to clusters
2. Update the locations of the clusters.
• Denote by $z_i \in \{1, 2, 3, \ldots, K\}$ the assignment of data-case “i” to a cluster “c”.
• Denote by $\mu_c \in \mathbb{R}^d$ the position of cluster “c” in a d-dimensional space.
• Denote by $x_i \in \mathbb{R}^d$ the location of data-case “i”.
• Then iterate until convergence:
1. For each data-case, compute distances to each cluster and pick the closest one:
$$z_i = \arg\min_c \| x_i - \mu_c \|$$
2. For each cluster location, compute the mean location of all data-cases
assigned to it:
$$\mu_c = \frac{1}{N_c} \sum_{i \in S_c} x_i$$
where $N_c$ is the number of data-cases in cluster “c” and $S_c$ is the set of data-cases assigned to cluster “c”.
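As a concrete illustration of these two alternating steps, here is a minimal NumPy sketch of the K-means loop (the function name, the fixed number of iterations, and the details of the random initialization are illustrative assumptions, not taken from the slides):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means on an (N, d) data array X with K clusters."""
    rng = np.random.default_rng(seed)
    # Initialize cluster locations on K randomly chosen data-cases.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each data-case to the closest cluster location.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (N, K)
        z = dists.argmin(axis=1)
        # Step 2: move each cluster to the mean of the data-cases assigned to it.
        for c in range(K):
            if np.any(z == c):
                mu[c] = X[z == c].mean(axis=0)
    return z, mu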
K-means
• Cost function:
$$C = \sum_{i=1}^{N} \| x_i - \mu_{z_i} \|^2$$
• Each step in k-means decreases this cost function.
• Often initialization is very important since there are very many local minima in C.
Relatively good initialization: place cluster locations on K randomly chosen data-cases.
• How to choose K?
Add a complexity term:
$$C' = C + \frac{1}{2}\,[\#\,\text{parameters}] \cdot \log(N)$$
and minimize also over K.
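A hedged sketch of this model-selection recipe, reusing the kmeans() function from the earlier sketch; counting the number of parameters as K·d (just the cluster locations) is an assumption about how one might fill in the bracketed term:

import numpy as np

def penalized_cost(X, K):
    """K-means cost C plus the (1/2) * [# parameters] * log(N) complexity term."""
    z, mu = kmeans(X, K)                                  # from the sketch above
    C = np.sum(np.linalg.norm(X - mu[z], axis=1) ** 2)    # the cost function C
    N, d = X.shape
    return C + 0.5 * (K * d) * np.log(N)                  # assumes # parameters = K*d

# Choose K by minimizing the penalized cost over candidate values, e.g.:
#   best_K = min(range(1, 11), key=lambda K: penalized_cost(X, K))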
Vector Quantization
• K-means divides the space up into a Voronoi tessellation.
• Every point on a tile is summarized by the code-book vector “+”.
This clearly allows for data compression !
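A small sketch of the compression idea: store each data-case as the index of its nearest code-book vector rather than as d real values (the function names are illustrative):

import numpy as np

def vq_encode(X, mu):
    """Replace each data-case by the index of its nearest code-book vector."""
    return np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)

def vq_decode(codes, mu):
    """Reconstruct every point by its code-book vector (lossy decompression)."""
    return mu[codes]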
Mixtures of Gaussians
• K-means assigns each data-case to exactly 1 cluster. But what if
clusters are overlapping?
Maybe we are uncertain as to which cluster it really belongs to.
• The mixtures of Gaussians algorithm assigns data-cases to clusters with
a certain probability.
MoG Clustering
$$N[x; \mu, \Sigma] = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma)}} \exp\!\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right]$$
The covariance $\Sigma$ determines the shape of these contours.
• Idea: fit these Gaussian densities to the data, one per cluster.
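For concreteness, a direct NumPy transcription of the density above (a sketch; it assumes Σ is invertible and does nothing clever about numerical stability):

import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density N[x; mu, Sigma], written as in the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))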
EM Algorithm: E-step
• $r_{ic}$ is the probability that data-case “i” belongs to cluster “c”.
• $\pi_c$ is the a priori probability of being assigned to cluster “c”.
• Note that if the Gaussian has high probability on data-case “i”
(i.e. the bell-shape is on top of the data-case) then it claims high
responsibility for this data-case.
• The denominator is just to normalize all responsibilities to 1:
$$r_{ic} = \frac{\pi_c\, N[x_i; \mu_c, \Sigma_c]}{\sum_{c'=1}^{K} \pi_{c'}\, N[x_i; \mu_{c'}, \Sigma_{c'}]}, \qquad \sum_{c=1}^{K} r_{ic} = 1 \quad \forall\, i$$
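A minimal sketch of this E-step, reusing gaussian_density() from the sketch above (the explicit loops are written for clarity, not speed):

import numpy as np

def e_step(X, pi, mu, Sigma):
    """Responsibilities r[i, c] = pi_c * N[x_i; mu_c, Sigma_c], normalized over c."""
    N, K = len(X), len(pi)
    r = np.zeros((N, K))
    for i in range(N):
        for c in range(K):
            r[i, c] = pi[c] * gaussian_density(X[i], mu[c], Sigma[c])
    return r / r.sum(axis=1, keepdims=True)   # each row now sums to 1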
EM Algorithm: M-Step
$$N_c = \sum_i r_{ic}$$
total responsibility claimed by cluster “c”
$$\pi_c = \frac{N_c}{N}$$
expected fraction of data-cases assigned to this cluster
$$\mu_c = \frac{1}{N_c} \sum_i r_{ic}\, x_i$$
weighted sample mean, where every data-case is weighted according to the probability that it belongs to that cluster
$$\Sigma_c = \frac{1}{N_c} \sum_i r_{ic}\, (x_i - \mu_c)(x_i - \mu_c)^T$$
weighted sample covariance
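And a matching sketch of the M-step, computing the three weighted re-estimates from the responsibility matrix r of shape (N, K):

import numpy as np

def m_step(X, r):
    """Re-estimate pi, mu, Sigma from the responsibilities r."""
    N, d = X.shape
    K = r.shape[1]
    Nc = r.sum(axis=0)                      # total responsibility per cluster
    pi = Nc / N                             # expected fraction per cluster
    mu = (r.T @ X) / Nc[:, None]            # weighted sample means
    Sigma = np.zeros((K, d, d))
    for c in range(K):
        diff = X - mu[c]
        Sigma[c] = (r[:, c, None] * diff).T @ diff / Nc[c]   # weighted covariances
    return pi, mu, Sigma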
EM-MoG
• EM comes from “expectation maximization”. We won’t go through the derivation.
• If we are forced to decide, we should assign a data-case to the cluster which
claims highest responsibility.
• For a new data-case, we should compute responsibilities as in the E-step
and pick the cluster with the largest responsibility.
• E and M steps should be iterated until convergence (which is guaranteed).
• Every step increases the following objective function (which is the total
log-probability of the data under the model we are learning):
$$L = \sum_{i=1}^{N} \log\!\left( \sum_{c=1}^{K} \pi_c\, N[x_i; \mu_c, \Sigma_c] \right)$$
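A sketch of this objective, again reusing gaussian_density(); in a full run one would alternate e_step() and m_step() and check that L never decreases:

import numpy as np

def log_likelihood(X, pi, mu, Sigma):
    """L = sum_i log( sum_c pi_c N[x_i; mu_c, Sigma_c] )."""
    L = 0.0
    for x in X:
        L += np.log(sum(pi[c] * gaussian_density(x, mu[c], Sigma[c])
                        for c in range(len(pi))))
    return L

# One EM iteration, monitored by L:
#   r = e_step(X, pi, mu, Sigma)
#   pi, mu, Sigma = m_step(X, r)
#   print(log_likelihood(X, pi, mu, Sigma))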
Agglomerative Hierarchical Clustering
• Define a “distance” between clusters (later).
• Initially, every data-case is its own cluster.
• At each iteration, compute the distances
between all existing clusters (you can store
distances and avoid their re-computation).
• Merge the closest clusters into 1 single cluster.
• Update your “dendrogram”.
(Figure: the cluster merges and the growing dendrogram after iterations 1, 2, and 3.)
• This way you build a hierarchy.
• Complexity: order $N^2$ (why?)
(Figure: the resulting dendrogram.)
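In practice this merge loop is rarely hand-written; here is a sketch using SciPy (assuming SciPy and matplotlib are available, and using a toy random dataset purely for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.randn(50, 2)           # toy data, purely illustrative
# 'single' / 'complete' / 'average' correspond to the D_min / D_max / D_avg
# cluster distances defined on the next slide.
Z = linkage(X, method='single')      # the full merge history
dendrogram(Z)                        # draws the hierarchy as a dendrogram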
Distances
$$D_{\min}(C_i, C_j) = \min_{x \in C_i,\; x' \in C_j} \| x - x' \|$$
produces the minimal spanning tree.
$$D_{\max}(C_i, C_j) = \max_{x \in C_i,\; x' \in C_j} \| x - x' \|$$
avoids elongated clusters.
$$D_{\mathrm{avg}}(C_i, C_j) = \frac{1}{N_i N_j} \sum_{x \in C_i,\; x' \in C_j} \| x - x' \|$$
$$D_{\mathrm{mean}}(C_i, C_j) = \| \mu_i - \mu_j \|$$
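Written out directly as a (deliberately naive, quadratic-time) sketch of the four definitions above, with Ci and Cj given as arrays of points:

import numpy as np

def d_min(Ci, Cj):
    return min(np.linalg.norm(x - xp) for x in Ci for xp in Cj)

def d_max(Ci, Cj):
    return max(np.linalg.norm(x - xp) for x in Ci for xp in Cj)

def d_avg(Ci, Cj):
    return sum(np.linalg.norm(x - xp) for x in Ci for xp in Cj) / (len(Ci) * len(Cj))

def d_mean(Ci, Cj):
    return np.linalg.norm(np.mean(Ci, axis=0) - np.mean(Cj, axis=0))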
Gene Expression Data
Micro-array Data
• The expression level of genes is
tested under different experimental
conditions.
• We would like to find the genes that
co-express under a subset of conditions.
• Both genes and conditions are
clustered and shown as dendrograms.
Exercise I
Imagine I have run a clustering algorithm on some data describing 3
attributes of cars: height, weight, length.
I have found two clusters. An expert comes by and tells you that class 1 is
really Ferraris while class 2 is Hummers.
• A new data-case (car) is presented, i.e. you get to see the height, weight, length.
Describe how you can use the output of your clustering, including the information
obtained from the expert, to classify the new car as a Ferrari or a Hummer.
Be very precise: use an equation or pseudo-code to describe what to do.
• You add the new car to the dataset and run K-means starting from the converged
assignments and cluster means obtained before. Is it possible that the
assignments of the old data change due to the addition of the new data-case?
Exercise II
• We classify data according to the 3-nearest neighbors (3-NN) rule.
Explain in detail how this works.
• Which decision surface do you think is smoother: the one for 1-NN or for
100-NN? Explain.
• Is k-NN a parametric or a non-parametric method?
Give an important property of non-parametric classification methods.
• We will do linear regression on data of the form $(X_n, Y_n)$ where $X_n$ and $Y_n$ are
real values: $Y_n = A X_n + b + \epsilon_n$,
where A, b are parameters and $\epsilon_n$ is the noise variable.
• Provide the equation for the total error over the data-items.
• We want to minimize this error. With respect to what?
• You are given a new attribute $X_{\mathrm{new}}$. What would you predict for $Y_{\mathrm{new}}$?