Clustering
Petter Mostad
Clustering vs. class prediction
Class prediction:
A learning set of objects with known classes
Goal: put new objects into existing classes
Also called: Supervised learning, or classification
Clustering:
No learning set, no given classes
Goal: discover the ”best” classes or groupings
Also called: Unsupervised learning, or class discovery
Overview
General clustering theory
Clustering microarray data
Steps, methods, algorithms, issues...
Recommendations for this kind of data
Programs for clustering
Some other visualization techniques
Issues in clustering
Used to explore and visualize data, with
few preconceptions
Many subjective choices must be made, so
a clustering output tends to be subjective
It is difficult to get truly statistically
”significant” conclusions
Algorithms will always produce clusters,
whether any exist in the data or not
Steps in clustering
1. Feature selection and extraction
2. Defining and computing similarities
3. Clustering or grouping objects
4. Assessing, presenting, and using the result
1. Feature selection and extraction
Deciding which measurements matter for
similarity
Data reduction
Filtering away objects
Normalization of measurements
The data matrix
Every row contains the measurements for one object.
Similarities are computed between all pairs of rows.
If the measurements are of the same type, one can instead cluster the measurements (the columns)!
(Figure: the data matrix, with objects as rows and measurements as columns)
2. Defining and computing
similarities
Similarity measures for continuous data vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$:
Euclidean distance: $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Minkowski distance (including the Manhattan metric): $\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Mahalanobis distance: $\sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is a covariance matrix
Centered and non-centered (absolute) Pearson correlation:
centered: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$, where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
non-centered: $\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$
Spearman rank correlation
Compute the rank of each number within its vector
Find the correlation between the two rank vectors
....
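As a small illustration, here is a minimal R sketch (the vectors x and y are made up) computing several of the measures above with base R functions:

x <- c(2.1, 0.5, 3.3, 1.8, 2.9)   # hypothetical measurement vectors
y <- c(1.9, 0.7, 2.8, 2.2, 3.1)

sqrt(sum((x - y)^2))                        # Euclidean distance
p <- 3
sum(abs(x - y)^p)^(1 / p)                   # Minkowski distance (p = 1 gives the Manhattan metric)
cor(x, y)                                   # centered Pearson correlation
sum(x * y) / sqrt(sum(x^2) * sum(y^2))      # non-centered Pearson correlation
cor(x, y, method = "spearman")              # Spearman rank correlation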
Geometrical view of clustering
If the measurements are coordinates, the objects become points in some space.
If the similarity measure is Euclidean distance, the goal is to group nearby points.
Note: When we have only 2 or 3 measurements per object, visual inspection can do better than most algorithms.
(Figure: example scatter plot of points in two dimensions)
Similarity measures for discrete
data
Comparing two binary vectors, count the
numbers a,b,c,d of 1-1’s, 1-0’s, 0-1’s, and 0-0’s,
respectively
Construct different similarity measurements
based on these numbers:
$\frac{a+d}{a+b+c+d}$ (simple matching), $\frac{a}{a+b+c+d}$, $\frac{a}{a+b+c}$ (Jaccard), $\frac{2(a+d)}{2(a+d)+b+c}$
Similarities between, for example, trees or other structured objects can also be defined in reasonable ways
...
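A minimal R sketch (with made-up binary vectors) of the counts a, b, c, d and two of the coefficients above:

x <- c(1, 1, 0, 1, 0, 0, 1, 0)   # hypothetical binary vectors
y <- c(1, 0, 0, 1, 1, 0, 1, 0)

a  <- sum(x == 1 & y == 1)   # number of 1-1 pairs
b  <- sum(x == 1 & y == 0)   # number of 1-0 pairs
cc <- sum(x == 0 & y == 1)   # number of 0-1 pairs (avoiding the name c, which is a base function)
d  <- sum(x == 0 & y == 0)   # number of 0-0 pairs

(a + d) / (a + b + cc + d)   # simple matching coefficient
a / (a + b + cc)             # Jaccard coefficient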
Similarities using contexts
Mutual Neighbour Distance: $MND(x, y) = NN(x, y) + NN(y, x)$, where $NN(x, y)$ is the neighbour number of x with respect to y
This is not a metric, but similarities do not
need to be based on metrics.
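A toy R sketch of this idea, under the assumption that NN(x, y) is the rank of x among the points ordered by distance from y (the slide does not spell out this definition):

pts <- matrix(rnorm(20), ncol = 2)   # hypothetical points, one per row
D <- as.matrix(dist(pts))            # all pairwise Euclidean distances

# Neighbour number of point i with respect to point j
# (note: point j itself gets rank 1 here; a refinement would exclude it)
nn <- function(i, j) rank(D[j, ])[i]

mnd <- function(i, j) nn(i, j) + nn(j, i)   # Mutual Neighbour Distance
mnd(1, 2)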
3. Clustering or grouping
Hierarchical clusterings
Divisive: Starts with one big cluster and subdivides one cluster in each step
Agglomerative: Starts with each object in
separate cluster. In each step, joins the two
closest clusters
Partitional clusterings
Probabilistic or fuzzy clusterings
Hierarchical clustering
Agglomerative clustering depends on the type of linkage, i.e., how the distance between the merged cluster (UV) and an old cluster W is computed:
d(UV, W) = min(d(U, W), d(V,W)) (single linkage)
d(UV, W) = max(d(U,W), d(V,W)) (complete linkage)
d(UV, W) = average over all distances between objects in (UV)
and objects in W (average linkage, or UPGMA: Unweighted Pair
Group Method with Arithmetic mean)
The output is a dendrogram
A simplification of average linkage is often implemented
(“average group linkage”): It may lead to inverted
dendrograms!
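A minimal R sketch of agglomerative hierarchical clustering with the three linkage types (the data matrix is made up):

X <- matrix(rnorm(100), nrow = 20)   # hypothetical data: 20 objects, 5 measurements each
d <- dist(X)                         # Euclidean distances between all pairs of rows

hc_single   <- hclust(d, method = "single")     # single linkage
hc_complete <- hclust(d, method = "complete")   # complete linkage
hc_average  <- hclust(d, method = "average")    # average linkage (UPGMA)

plot(hc_average)                     # draw the dendrogram
cutree(hc_average, k = 3)            # cut the dendrogram into 3 clusters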
Dendrograms, visualizations
The data matrix is often visualized using three
colors, representing positive, negative, and zero
values.
Hierarchical clustering results are often represented with a dendrogram. The similarity at which clusters merge should correspond to the height of the corresponding horizontal line in the dendrogram!
To display the dendrogram, the objects (rows or columns) need to be ordered; each time two clusters are merged, the order of the two can be chosen in two ways.
Ward’s hierarchical clustering
Agglomerative.
Goal: minimize ”Error Sum of Squares” (ESS) at
every step.
ESS = The sum over all clusters, of the sum of the
squares of the distances from the objects to the
cluster centroid.
When joining two clusters, find the pair that
results in the smallest increase in ESS.
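In R, Ward's method is available through the same hclust interface; a minimal sketch (d as in the sketch above; ward.D2 expects ordinary, unsquared Euclidean distances):

hc_ward <- hclust(d, method = "ward.D2")   # Ward's minimum-variance (ESS) criterion
plot(hc_ward)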
Partitional clusterings
The number of desired clusters is fixed at the
start
K-means clustering:
Partition into k initial clusters
Iteratively, reassign points to groups with the closest
centroid. Recompute centroids.
Repeat until stability
The result may depend on initial clusters
May include a procedure joining or splitting clusters
according to size
The choice of number of clusters may not be
obvious
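A minimal R sketch of k-means; nstart reruns the algorithm from several random initial clusterings, since the result may depend on the initial clusters:

X <- matrix(rnorm(100), nrow = 20)          # hypothetical data matrix
km <- kmeans(X, centers = 3, nstart = 25)   # k = 3 clusters, 25 random starts
km$cluster                                  # cluster assignment for each object
km$centers                                  # cluster centroids
km$tot.withinss                             # total within-cluster sum of squares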
Probabilistic or fuzzy clustering
The output is, for each object and each cluster, a
probability or weight that the object belongs to
the cluster
Example: The observations are modelled as
produced by drawing from a number of
probability densities (often multivariate normal).
Parameters are then estimated with Maximum
Likelihood (for example using EM algorithm).
Example: A ”fuzzy” version of k-means, where
weights for objects are changed iteratively
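One concrete option in R is fanny() from the cluster package (mentioned under programs below), a fuzzy relative of k-means; a minimal sketch with made-up data:

library(cluster)
X <- matrix(rnorm(100), nrow = 20)
fz <- fanny(X, k = 3)       # fuzzy clustering into 3 clusters
fz$membership               # weight of each object in each cluster
fz$clustering               # hard assignment: the cluster with the largest weight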
Neural networks for clustering
Neural networks are mathematical models inspired by biological neural networks
They consist of layers of nodes that send out ”signals”, probabilistically, based on their input signals
Their best-known uses are classification, i.e., with learning sets
Self-Organising Maps (SOM)
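A small sketch of a self-organising map using the R package kohonen (the package choice and grid size are assumptions, not prescribed by the slides):

library(kohonen)
X <- scale(matrix(rnorm(500), nrow = 100))                            # hypothetical data, standardized
sm <- som(X, grid = somgrid(xdim = 4, ydim = 4, topo = "hexagonal"))
plot(sm)                                                              # default map plot
sm$unit.classif                                                       # map node assigned to each object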
Clustering as optimization
Given a similarity measure and a definition of what an ”optimal” clustering is, it can often be a huge algorithmic challenge to find the optimum.
Example: Subdivide many thousands of objects into 50 clusters, minimizing e.g. the sum of the squared distances to the centroids.
Algorithms for optimization are then central.
Genetic algorithms
Try to use ”evolution” to obtain good solutions to a problem
A number of solutions are kept at every step:
They may then mate or mutate, to produce new
solutions. The ”fittest” solutions are kept.
Can be seen as an optimization algorithm
A great challenge to design ways of mating and
mutating that produce an efficient algorithm
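A toy R sketch of the idea, applied to partitioning objects into k clusters with fitness defined as negative ESS (population size, mutation rate, and all other settings are illustrative assumptions):

X <- matrix(rnorm(100), nrow = 20)   # hypothetical data
k <- 3

# ESS of a clustering given as a label vector
ess <- function(labels) {
  sum(sapply(unique(labels), function(g) {
    sub <- X[labels == g, , drop = FALSE]
    sum(sweep(sub, 2, colMeans(sub))^2)   # squared distances to the cluster centroid
  }))
}

pop <- replicate(30, sample(1:k, nrow(X), replace = TRUE), simplify = FALSE)
for (gen in 1:100) {
  fit <- -sapply(pop, ess)                           # fitness: smaller ESS is better
  keep <- pop[order(fit, decreasing = TRUE)[1:10]]   # keep the ”fittest” solutions
  children <- replicate(20, {
    parents <- sample(keep, 2)                       # ”mating”: uniform crossover
    child <- ifelse(runif(nrow(X)) < 0.5, parents[[1]], parents[[2]])
    mut <- runif(nrow(X)) < 0.05                     # ”mutation”: random relabelling
    child[mut] <- sample(1:k, sum(mut), replace = TRUE)
    child
  }, simplify = FALSE)
  pop <- c(keep, children)
}
best <- pop[[which.min(sapply(pop, ess))]]           # best clustering found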
Simulated annealing
A general optimization technique
Iterative: At every step, nearby solutions are
chosen with probabilities depending on their
optimality (so even less optimal solutions may
be chosen)
As the algorithm proceeds and the ”temperature” sinks, the probability of choosing less optimal solutions also sinks.
It is a good general way to avoid getting stuck in local optima.
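A toy R sketch of simulated annealing for the same partitioning problem (reusing X, k, and ess() from the genetic algorithm sketch above; the move proposal and cooling schedule are illustrative assumptions):

labels <- sample(1:k, nrow(X), replace = TRUE)   # random initial clustering
current <- ess(labels)
temp <- 1
for (iter in 1:5000) {
  cand <- labels
  i <- sample(nrow(X), 1)                        # propose moving one object
  cand[i] <- sample(1:k, 1)                      # to a (possibly different) cluster
  delta <- ess(cand) - current
  # accept improvements always, worse solutions with a temperature-dependent probability
  if (delta < 0 || runif(1) < exp(-delta / temp)) {
    labels <- cand
    current <- current + delta
  }
  temp <- temp * 0.999                           # the ”temperature” sinks
}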
4. Assessing and using the result
Visualization and summarization of the
clusters
Note: You should always investigate the
dependence of your results on the choices
you have made for the clustering!
Examples of applications of
clustering
Image analysis
Speech recognition
Data mining
....
Clustering microarray data
Samples are columns,
genes are rows, in data
matrix
What values to cluster?
What is a biologically
relevant measure of
similarity?
One can cluster genes
and/or samples
Clustering microarray data
Use logged data, usually
Data should be on the same scale (but it usually is if you use data that is already normalized)
You may have to filter away genes that show too little variation over
samples.
Use an appropriate distance measure for the question you want to
focus on (Pearson correlation often works OK).
Use appropriate clustering algorithm (Hierarchical average linkage
usually works OK).
If you draw some conclusion from the clustering results, try to vary
your clustering choices to see how stable these results are.
Clustering works best as a tool to generate hypotheses and ideas,
which may then be tested in other ways.
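A minimal R sketch of these recommendations, with made-up data and an arbitrary filtering threshold: filter low-variation genes, use a Pearson-correlation-based distance, and cluster with average linkage:

expr <- matrix(rnorm(2000), nrow = 200)   # hypothetical logged expression matrix: genes x samples

keep <- apply(expr, 1, sd) > 0.5          # filter away genes with too little variation
expr <- expr[keep, ]

d <- as.dist(1 - cor(t(expr)))            # distance between genes: 1 - Pearson correlation

hc <- hclust(d, method = "average")       # average-linkage hierarchical clustering
plot(hc, labels = FALSE)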
Clustering tumor samples
Clustering to confirm or reject
hypotheses?
A clustering may appear to validate, or be validated by, a
grouping derived by using other data
Caution: The many different ways to do a clustering may
make it possible to tweak it to produce the clusters you
want
There is a huge and complex multiple testing problem
Note that small changes in data can change result
dramatically
If you insist on trying to get ”significance”:
Using permutations of data
Using resampling of data (bootstrapping)
How to do clustering: Programs
A good program for clustering and visualization: HCE
Great visualization options
Adapted to microarray data
http://www.cs.umd.edu/hcil/hce/
Can import similarity matrices
Classic for microarray data: Cluster & TreeView (Eisen)
R/BioConductor: package cluster, hclust function,
heatmap function, ...
Many other programs/packages
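For the R route, a minimal sketch combining the hclust and heatmap functions mentioned above (made-up data; the correlation-based distance is just one reasonable choice):

expr <- matrix(rnorm(2000), nrow = 200)   # hypothetical expression matrix
heatmap(expr,
        distfun = function(m) as.dist(1 - cor(t(m))),           # correlation-based distance
        hclustfun = function(d) hclust(d, method = "average"))  # average linkage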
Other visualization techniques:
Principal Components
The principal components can be viewed as the axes of
a “better” coordinate system for the data.
“Better” in the sense that the data is maximally spread
out along the first principal components.
The principal components correspond to eigenvectors of
the covariance matrix of the data.
The eigenvalues represent the part of the total variance
explained by each of the principal components.
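A minimal R sketch of a principal component view of samples (made-up expression matrix; samples are plotted in the plane of the first two principal components):

expr <- matrix(rnorm(2000), nrow = 200)   # hypothetical matrix: genes as rows, samples as columns
pc <- prcomp(t(expr))                     # PCA with the samples as observations
summary(pc)                               # proportion of variance explained by each component
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")   # samples in the first two components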
Principal component analysis of expression data
(Figures: example principal component plots of expression data)
Other visualization techniques:
Multidimensional scaling
Start with some points in a very high-dimensional space.
Goal: Display these points in a lower dimension, so that the distances between them are similar to the distances in the original space.
May also try to preserve only the ranking of the
pairwise distances.
Makes it possible to use powerful visual
inspection, in 2 or 3 dimensions.
Can sometimes give very convincing pictures
separating samples in a predicted way.
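A minimal R sketch of classical multidimensional scaling of samples down to 2 dimensions (made-up data):

expr <- matrix(rnorm(2000), nrow = 200)   # hypothetical matrix: genes as rows, samples as columns
d <- dist(t(expr))                        # pairwise distances between samples in the original space
mds <- cmdscale(d, k = 2)                 # classical MDS into 2 dimensions
plot(mds[, 1], mds[, 2], xlab = "Dimension 1", ylab = "Dimension 2")
# isoMDS() in the MASS package instead tries to preserve only the ranking of the distances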