Transcript Document

Microarray Data
Analysis II:
Clustering
Copyright notice
• Many of the images in this PowerPoint
presentation come from other people. The
copyright belongs to the original
authors. Thanks!
Aim of clustering: Group objects
according to their similarity
Cluster:
a set of objects
that are similar
to each other
and separated
from the other
objects.
Example: green/
red data points
were generated
from two different
normal distributions
What is clustering?
• Clustering: A way of grouping together data samples that
are similar in some way - according to some criteria that
you pick
• A form of unsupervised learning – you generally don’t
have examples demonstrating how the data should be
grouped together
• It’s a method of data exploration – a way of looking for
patterns or structure in the data that are of interest
• The results vary drastically depending on
– clustering method
– similarity or dissimilarity metric
– additional parameters specific to each clustering method (e.g.
number of centres for k-means, agglomeration rule for
hierarchical clustering, ...)
Why cluster genes in microarray data?
• Identify groups of possibly co-regulated
genes (e.g. in conjunction with sequence
data).
• Identify typical temporal or spatial gene
expression patterns (e.g. cell cycle data).
• Arrange a set of genes in a linear order
that is at least not totally meaningless.
Distance and similarity metrics
Distance/similarity measures
• A crucial parameter for clustering is the choice of
an appropriate metric to measure the similarity
or dissimilarity between objects
– The similarity measure is often more
important than the clustering algorithm used –
don’t overlook this choice!
• Jagota defines a dissimilarity measure as a
function f(x,y) such that f(x,y) > f(w,z) if and only
if x is less similar to y than w is to z
• This is always a pair-wise measure
Choice of a distance/similarity metric
• There are plenty of distance/similarity metrics:
– Euclidean distance
– Manhattan distance
– Pearson's coefficient of correlation
– Mahalanobis distance
– 2 distance
– Mutual information.
• The choice of metric depends on the data type
Euclidean distance
• You are probably familiar with the calculation of Euclidean
distance in a 2-dimensional space, between two points a and b:
$d_E = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$
• The concept naturally extends to spaces with higher
dimension (p-dimensional space):
$d_E = \sqrt{\sum_{i=1}^{p} (x_{ai} - x_{bi})^2}$
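As a small illustration (my own sketch, not from the original slides; the profile values are invented), the p-dimensional Euclidean distance between two expression profiles can be computed directly with NumPy:

```python
import numpy as np

# Two hypothetical expression profiles measured on p = 5 chips.
profile_a = np.array([0.2, 1.5, -0.3, 2.1, 0.8])
profile_b = np.array([0.1, 1.7, -0.5, 1.9, 1.0])

# Euclidean distance: square root of the sum of squared coordinate differences.
d_euclidean = np.sqrt(np.sum((profile_a - profile_b) ** 2))
print(d_euclidean)  # same result as np.linalg.norm(profile_a - profile_b)
```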
Figure: three pairs of expression profiles, with deuc = 0.5846, deuc = 2.6115 and deuc = 1.1345.
These examples of
Euclidean distance
match our intuition of
dissimilarity pretty
well…
Figure: two further pairs of expression profiles, with deuc = 1.41 and deuc = 1.22.
…But what about these?
What might be going on with the expression profiles
on the left? On the right?
Correlation
• We might care more about the overall shape of
expression profiles rather than the actual magnitudes
• That is, we might want to consider genes similar
when they are “up” and “down” together
• When might we want this kind of measure? What
experimental issues might make this appropriate?
Pearson's coefficient of correlation
• Pearson's coefficient of correlation is a classical
metric of similarity between two objects.
$c_P = \frac{1}{p}\sum_{i=1}^{p}\left(\frac{x_{ai} - m_a}{\sigma_a}\right)\left(\frac{x_{bi} - m_b}{\sigma_b}\right) = \frac{1}{p}\sum_{i=1}^{p} z_{ai}\, z_{bi}$
Where
a is the index of an object (e.g. a gene)
b is the index of another object (e.g. a gene)
i is an index of dimension (e.g. a chip)
m_a and σ_a are the mean and standard deviation of object a over the p dimensions
Note the correspondence with z-scores.
The correlation lies between -1 and 1.
It can be converted to a distance metric by a simple operation:
$d_P = \frac{1 - c_P}{2}$
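A hedged sketch (my own code, not from the slides) of this correlation-based distance; the profiles are invented, and profile_b is a scaled copy of profile_a, so the distance comes out essentially 0:

```python
import numpy as np

def pearson_distance(x_a, x_b):
    """Pearson correlation between two profiles, converted to a distance in [0, 1]."""
    z_a = (x_a - x_a.mean()) / x_a.std()   # z-scores of profile a
    z_b = (x_b - x_b.mean()) / x_b.std()   # z-scores of profile b
    c_p = np.mean(z_a * z_b)               # Pearson correlation coefficient
    return (1.0 - c_p) / 2.0               # 0 when perfectly correlated, 1 when anti-correlated

profile_a = np.array([0.2, 1.5, -0.3, 2.1, 0.8])
profile_b = np.array([0.4, 3.0, -0.6, 4.2, 1.6])   # same shape, different magnitude
print(pearson_distance(profile_a, profile_b))       # close to 0: the profiles go up and down together
```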
Pearson's coefficient of correlation
• Pearson correlation only measures the degree of a
linear relationship between two expression profiles!
• If you want to measure a general (possibly non-linear)
relationship, use mutual information
Example: c_P = 0.0249, so d_P = 0.4876
The green curve is the
square of the blue curve –
this relationship is not
captured by Pearson
correlation
Impact of the distance metric
Figure: two panels (A and B) of example profiles, a, b, c and d, e, f.
For profiles a, b, c: by Euclidean distance, a is close to b; by correlation coefficient, a is close to c.
For profiles d, e, f: by Euclidean distance, d is close to f; by correlation coefficient, d is close to e.
Dot product
• The dot product combines advantages of
the Euclidean distance and of the
coefficient of correlation:
$d_{dot} = \sum_{i=1}^{p} x_{ai} \cdot x_{bi}$
– It takes positive values to represent co-regulation, and
negative values to represent anti-regulation (as the
coefficient of correlation does)
– It reflects the strength of the regulation of both genes,
since it uses the real values (as the Euclidean
distance) rather than the standardized ones (as the
coefficient of correlation).
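A one-line sketch of this (my own example, with invented profiles):

```python
import numpy as np

# Hypothetical profiles: b is a scaled copy of a, so the two genes are co-regulated.
profile_a = np.array([0.2, 1.5, -0.3, 2.1, 0.8])
profile_b = np.array([0.4, 3.0, -0.6, 4.2, 1.6])

# Positive for co-regulation, negative for anti-regulation; the magnitude reflects regulation strength.
print(np.dot(profile_a, profile_b))
```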
Mutual Information
Entropy-based distances: Mutual Information (semi-metric distance)
• Mutual Information (MI) is a
statistical representation of the
correlation of two signals A and B.
• MI is a measure of the additional
information known about one
expression pattern when given
another.
• MI is not based on linear models and
can therefore also see non-linear
dependencies (see picture).
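As a hedged illustration (not from the slides), one common way to estimate MI between two expression profiles is to discretize them into bins and use a histogram-based estimator, e.g. scikit-learn's mutual_info_score; the number of bins below is an arbitrary assumption:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(x, y, bins=8):
    """Histogram-based MI estimate between two continuous expression profiles."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    return mutual_info_score(x_binned, y_binned)   # in nats

x = np.random.randn(200)
y = x ** 2                       # non-linear dependence with near-zero Pearson correlation
print(mutual_information(x, y))  # clearly above 0, unlike the correlation
```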
Clustering Algorithms
Generic Clustering Tasks
• Estimating number of clusters
• Assigning each object to a cluster
• Assessing strength/confidence of cluster
assignments for individual objects
• Assessing cluster homogeneity
How hard is clustering?
• One idea is to consider all possible
clusterings, and pick the one that has the best
inter- and intra-cluster distance properties
• Suppose we are given n points, and would
like to cluster them into k clusters
– How many possible clusterings are there? Roughly $k^n / k!$
• Too hard to do by brute force or to solve optimally
• Solution: heuristic methods
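To get a feel for the size of this number (my own arithmetic, not on the slide): with n = 100 points and k = 5 clusters, $k^n / k! = 5^{100} / 120 \approx 6.6 \times 10^{67}$ possible clusterings, far too many to enumerate.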
Classical clustering methods
• Partitioning methods
• Hierarchical methods
Partitioning Methods
• Partition the objects into a prespecified number of
groups K
• Iteratively reallocate objects to clusters until some
criterion is met (e.g. minimize within cluster sums
of squares)
• Examples: k-means, self-organizing maps (SOM),
partitioning around medoids (PAM), model-based
clustering
Hierarchical Clustering
• Produce a dendrogram
• Avoid prespecification of
the number of clusters K
• The tree can be built in
two distinct ways:
– Bottom-up:
agglomerative
clustering
– Top-down: divisive
clustering
Agglomerative Methods
• Start with n mRNA sample (or G gene)
clusters
• At each step, merge the two closest
clusters using a measure of between-cluster dissimilarity which reflects the
shape of the clusters
Divisive Methods
• Start with only one cluster
• At each step, split clusters into two parts
• Advantage: Obtain the main structure of the data
(i.e. focus on upper levels of dendrogram)
• Disadvantage: Computational difficulties when
considering all possible divisions into two groups
Partitioning vs. Hierarchical
• Partitioning
– Advantage: Provides clusters that satisfy some
optimality criterion (approximately)
– Disadvantages: Need initial K, long computation
time
• Hierarchical
– Advantage: Fast computation (agglomerative)
– Disadvantages: Rigid, cannot correct later for
erroneous decisions made earlier
K-means clustering
K Means Clustering Algorithm
• The number of centres (k) has to be specified
a priori
• Algorithm
– (1) Arbitrarily select k initial centres
– (2) Assign each element to the closest centre
– (3) Re-calculate centres (mean position of the
assigned elements)
– (4) Repeat (2) and (3) until one of the stopping
conditions is reached
• the clusters are the same as in the previous iteration
• the difference between two iterations is smaller than a
specified threshold
• the max number of iterations has been reached
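To make the four steps concrete, here is a minimal from-scratch sketch in Python (my own illustration, not code from the slides); in practice one would typically use a library implementation such as sklearn.cluster.KMeans:

```python
import numpy as np

def kmeans(data, k, max_iter=100, tol=1e-6, rng=None):
    """Minimal k-means: data is an (n_points, n_dims) array, k the number of centres."""
    rng = np.random.default_rng(rng)
    # (1) Arbitrarily select k initial centres (here: k distinct data points).
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # (2) Assign each element to the closest centre (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Re-calculate centres as the mean position of the assigned elements
        #     (this sketch does not handle the rare case of an empty cluster).
        new_centres = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # (4) Stop when the centres barely move, or after max_iter iterations.
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    return labels, centres

# Example in the spirit of the demo that follows: 200 points centred on (0,0), 50 on (1,1).
gen = np.random.default_rng(0)
data = np.vstack([gen.normal(0.0, 0.3, (200, 2)), gen.normal(1.0, 0.3, (50, 2))])
labels, centres = kmeans(data, k=2, rng=0)
print(centres)
```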
K means example - initial conditions
• Two sets of
random points
are randomly
generated
– 200 points
centred on (0,0)
– 50 points
centred on (1,1)
• Two points are
randomly
chosen as seeds
(blue dots)
K means example - first iteration
• Step 1
– Each dot is
assigned to the
cluster with the
closest centre
– Centres are recalculated (blue
star) on the
basis of the new
clusters
K means example - second iteration
• At each step,
– points are reassigned to
clusters
– centres are recalculated
• Cluster
boundaries and
centre positions
evolve at each
iteration
K means example - after 3 iterations
• At each step,
– points are reassigned to
clusters
– centres are recalculated
• Cluster
boundaries and
centre positions
evolve at each
iteration
K means example - after 4 iterations
• At each step,
– points are reassigned to
clusters
– centres are recalculated
• Cluster
boundaries and
centre positions
evolve at each
iteration
K means example - after 5 iterations
• At each step,
– points are reassigned to
clusters
– centres are recalculated
• Cluster
boundaries and
centre positions
evolve at each
iteration
K means example - after 6 iterations
• At each step,
– points are reassigned to
clusters
– centres are recalculated
• Cluster
boundaries and
centre positions
evolve at each
iteration
K means example - after 6 iterations
• After some
iterations (6 in
this case), the
clusters and
centres do not
change anymore
K-means clustering - summary
• Strengths
– Simple to use
– Fast
– Can be used with very large data sets
• Weaknesses
– The choice of the number of groups is arbitrary
– The results vary depending on the initial positions of centres
• Solutions
– Try different values for k and compare the result
– For each value of k, run repeatedly to sample different initial conditions
• Weakness of the solution
– Instead of one clustering, you obtain hundreds of different clusterings;
how do you decide among them?
Cell cycle data: K-means
clustering
Determining the “correct”
number of clusters
• We’d like to have a measure of cluster quality Q
and then try different values of k until we get an
optimal value for Q
• But, since clustering is an unsupervised learning
method, we can’t really expect to find a “correct”
measure Q…
• So, once again there are different choices of Q
and our decision will depend on what
dissimilarity measure we’re using and what
types of clusters we want
Cluster Quality Measures
• Jagota suggests a measure that emphasizes
cluster tightness or homogeneity:
$Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)$
• |Ci| is the number of data points in cluster i, and μi is the centre of cluster i
• Q will be small if (on average) the data points in
each cluster are close to their cluster centre
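A small sketch of this homogeneity measure (my own code, assuming Euclidean distance and cluster means as the centres μi), which can then be evaluated for several values of k:

```python
import numpy as np

def cluster_quality_q(data, labels):
    """Jagota-style homogeneity: sum over clusters of the average distance to the cluster centre."""
    q = 0.0
    for cluster_id in np.unique(labels):
        members = data[labels == cluster_id]
        centre = members.mean(axis=0)
        q += np.mean(np.linalg.norm(members - centre, axis=1))
    return q

# Hypothetical usage with the k-means sketch shown earlier:
# for k in range(2, 7):
#     labels, _ = kmeans(data, k)
#     print(k, cluster_quality_q(data, labels))
```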
Cluster Quality (cont.)
Figure: a plot of the Q measure (y-axis: Q, x-axis: k) as given
in Jagota for k-means clustering on the data shown earlier.
How many clusters
do you think there
actually are?
Cluster Quality
• The Q measure given in Jagota takes into account
homogeneity within clusters, but not separation between
clusters
• Other measures try to combine these two characteristics
(e.g., the Davies-Bouldin measure)
• An alternate approach is to look at cluster stability:
– Add random noise to the data many times and count
how many pairs of data points no longer cluster
together
– How much noise to add? Should reflect estimated
variance in the data
Self-Organizing Maps
Self-Organizing Maps
• Based on work of Kohonen on learning/memory in the
human brain
• As with k-means, we specify the number of clusters
• However, we also specify a topology – a 2D grid that
gives the geometric relationships between the clusters
(i.e., which clusters should be near or distant from each
other)
• The algorithm learns a mapping from the high
dimensional space of the data points onto the points of
the 2D grid (there is one grid point for each cluster)
Self-Organizing Maps
Figure: an 11x11 grid of nodes (11x11 = 121 clusters in this example).
Grid points map to cluster means in the high-dimensional space (the space
of the data points); each grid point corresponds to a cluster.
Self-Organizing Maps
• Suppose we have an r x s grid, with each grid
point associated with a cluster mean μ1,1, …, μr,s
• SOM algorithm moves the cluster means around
in the high dimensional space, maintaining the
topology specified by the 2D grid (think of a
rubber sheet)
• A data point is put into the cluster with the
closest mean
• The effect is that nearby data points tend to map
to nearby clusters (grid points)
SOM Algorithm
Initialize map
For t from 0 to 1             // t tracks the progress of learning
    Randomly select a sample
    Get best matching unit    // the weight vector closest to the sample
    Scale neighbors           // pull the unit and its neighbours towards the sample
    Increase t a small amount // the learning factor decreases as t grows
End for
From: http://davis.wpi.edu/~matt/courses/soms/
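A compact, hedged sketch of this loop in Python (my own illustration of the idea, not the WPI code); the grid size, learning-rate schedule and Gaussian neighbourhood width are arbitrary choices:

```python
import numpy as np

def train_som(data, grid_rows=10, grid_cols=10, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a simple rectangular SOM on data of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid_rows, grid_cols, data.shape[1]))          # initialize map
    grid_r, grid_c = np.meshgrid(np.arange(grid_rows), np.arange(grid_cols), indexing="ij")
    for step in range(n_iter):
        t = step / n_iter                                                # learning progress in [0, 1)
        lr = lr0 * (1.0 - t)                                             # learning rate decays
        sigma = sigma0 * (1.0 - t) + 0.5                                 # neighbourhood shrinks
        x = data[rng.integers(len(data))]                                # randomly select a sample
        dists = np.linalg.norm(weights - x, axis=2)                      # distance of every unit to x
        bmu_r, bmu_c = np.unravel_index(dists.argmin(), dists.shape)     # best matching unit
        grid_dist2 = (grid_r - bmu_r) ** 2 + (grid_c - bmu_c) ** 2
        neighbourhood = np.exp(-grid_dist2 / (2 * sigma ** 2))           # Gaussian neighbourhood
        weights += lr * neighbourhood[:, :, None] * (x - weights)        # scale neighbours towards x
    return weights

# Usage sketch: map 3-dimensional colour vectors onto a 2D grid, as in the colour example that follows.
colours = np.random.default_rng(1).random((500, 3))
som_weights = train_som(colours)
```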
An Example Using Color
Three-dimensional data (red, blue, green)
will be converted into a 2D image map, with the dark blues
and greys clustering together and yellow close to
both the red and the green
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
Each color in
the map is
associated with
a weight
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color
1. Initialize the weights, e.g. with
random values, with colors in the corners, or equidistant
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Get best matching unit
After randomly selecting a sample, go through all
weight vectors and calculate the best match (in this
case using Euclidean distance)
Think of colors as 3D points, with each component (red,
green, blue) on an axis
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. Getting the best matching unit, continued…
For example, let's say we chose green as the
sample. Then it can be shown that light
green is closer to green than red is:
Green: (0,6,0)   Light Green: (3,6,3)   Red: (6,0,0)
$d(\mathrm{Green}, \mathrm{Light\ Green}) = \sqrt{3^2 + 0^2 + 3^2} \approx 4.24$
$d(\mathrm{Green}, \mathrm{Red}) = \sqrt{6^2 + (-6)^2 + 0^2} \approx 8.49$
This step is repeated for the entire map, and the weight with
the shortest distance is chosen as the best match
An Example Using Color Continued
3. Scale neighbors
This involves two steps:
1. Determine which weights are considered neighbors
2. How much each weight can become more like the sample vector
1. Determine which weights are considered neighbors
In the example, a Gaussian function is used where every
point above 0 is considered a neighbor:
$f(x, y) = e^{-6.66666667\,(x^2 + y^2)}$
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
2. How much each weight can become more like the sample
When the weight with the smallest distance is chosen
and the neighbors are determined, it and its neighbors
‘learn’ by changing to become more like the
sample…The farther away a neighbor is, the less it
learns
From http://davis.wpi.edu/~matt/courses/soms/
An Example Using Color Continued
NewColorValue = CurrentColor*(1-t)+sampleVector*t
Here t acts as the learning rate: it is 1 for the first iteration, and the
value of t used in this formula decreases in later iterations (even though
the loop variable increases), so the weights change less and less as
training proceeds
From http://davis.wpi.edu/~matt/courses/soms/
Conclusion of Example
Samples continue to be chosen
at random until t becomes 1
(learning stops)
At the conclusion of the
algorithm, we have a nicely
clustered data set. Also note
that we have achieved our goal:
Similar colors are grouped
closely together
From http://davis.wpi.edu/~matt/courses/soms/
A SOFM Example With Yeast
“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic
differentiation” by Tamayo et al.
Benefits of SOM
• SOM contains the set of features extracted
from the input patterns (reduces
dimensions)
• SOM yields a set of clusters
• A gene will always be more similar to a
gene in its immediate neighborhood than to a
gene further away
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
SOM Issues
• The algorithm is complicated and there are a lot
of parameters (such as the “learning rate”) these settings will affect the results
• The idea of a topology in high dimensional gene
expression spaces is not exactly obvious
– How do we know what topologies are appropriate?
– In practice people often choose nearly square grids
for no particularly good reason
• As with k-means, we still have to worry about
how many clusters to specify…
Hierarchical clustering
Hierarchical Agglomerative
Clustering (HAC) [Hartigan 1975]
• Agglomerative (bottom-up)
• Algorithm (illustrated by a dendrogram):
– Initialize: each item a cluster
– Iterate:
• select two most similar clusters
• merge them
– Halt: when required number
of clusters is reached
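A hedged sketch of agglomerative clustering using SciPy (my own example, not from the slides); the random data and the choice of single linkage are placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical expression matrix: 30 genes measured on 6 chips.
genes = np.random.default_rng(0).normal(size=(30, 6))

# Condensed pairwise distance matrix, then bottom-up merging with single linkage.
distances = pdist(genes, metric="euclidean")
tree = linkage(distances, method="single")

# Cut the tree into a required number of clusters (here 4), or plot it with dendrogram(tree).
labels = fcluster(tree, t=4, criterion="maxclust")
print(labels)
```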
HAC (cont.)
• This produces a
binary tree or
dendrogram
• The final cluster is
the root and each
data item is a leaf
• The height of the
bars indicates how
close the items
are
Hierarchical Clustering Demo
Linkage in Hierarchical
Clustering
• We already know about distance measures
between data items, but what about between a
data item and a cluster or between two clusters?
• We just treat a data point as a cluster with a
single item, so our only problem is to define a
linkage method between clusters
• As usual, there are lots of choices…
Hierarchical: Single Link
• cluster similarity = similarity of two most
similar members
- Potentially
long and skinny
clusters
+ Fast
Example: single link
Initial distance matrix:
        1    2    3    4    5
   1    0
   2    2    0
   3    6    3    0
   4   10    9    7    0
   5    9    8    5    4    0
The closest pair is {1, 2} (distance 2); merge them and compute the
single-link distances from the new cluster (1,2) to the remaining items:
$d_{(1,2),3} = \min\{d_{1,3}, d_{2,3}\} = \min\{6, 3\} = 3$
$d_{(1,2),4} = \min\{d_{1,4}, d_{2,4}\} = \min\{10, 9\} = 9$
$d_{(1,2),5} = \min\{d_{1,5}, d_{2,5}\} = \min\{9, 8\} = 8$
Updated distance matrix:
         (1,2)   3    4    5
  (1,2)    0
    3      3     0
    4      9     7    0
    5      8     5    4    0
(The dendrogram over items 1 to 5 is built up alongside as the merges proceed.)
Example: single link (continued)
The closest pair is now {(1,2), 3} (distance 3); merge them and update:
$d_{(1,2,3),4} = \min\{d_{(1,2),4}, d_{3,4}\} = \min\{9, 7\} = 7$
$d_{(1,2,3),5} = \min\{d_{(1,2),5}, d_{3,5}\} = \min\{8, 5\} = 5$
Updated distance matrix:
           (1,2,3)   4    5
  (1,2,3)     0
     4        7      0
     5        5      4    0
Example: single link (continued)
The closest pair is now {4, 5} (distance 4); merge them, and finally join
the two remaining clusters:
$d_{(1,2,3),(4,5)} = \min\{d_{(1,2,3),4}, d_{(1,2,3),5}\} = \min\{7, 5\} = 5$
The resulting dendrogram merges 1 and 2 at height 2, adds 3 at height 3,
merges 4 and 5 at height 4, and joins the two groups at height 5.
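For reference, this worked example can be reproduced with SciPy (a sketch I added, not part of the slides); linkage expects the condensed upper-triangular distances in the order d12, d13, d14, d15, d23, ...:

```python
from scipy.cluster.hierarchy import linkage

# Condensed form of the 5x5 distance matrix used in the example.
condensed = [2, 6, 10, 9,   # d(1,2), d(1,3), d(1,4), d(1,5)
             3, 9, 8,       # d(2,3), d(2,4), d(2,5)
             7, 5,          # d(3,4), d(3,5)
             4]             # d(4,5)

# Each row of the result gives the two clusters merged and the merge distance.
print(linkage(condensed, method="single"))    # merges at heights 2, 3, 4, 5
print(linkage(condensed, method="complete"))  # merges at heights 2, 4, 6, 10 (see the complete-link example below)
```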
Hierarchical: Complete Link
• cluster similarity = similarity of two least
similar members
+ tight clusters
- slow
Example: complete link
Initial distance matrix (as before):
        1    2    3    4    5
   1    0
   2    2    0
   3    6    3    0
   4   10    9    7    0
   5    9    8    5    4    0
The closest pair is {1, 2} (distance 2); merge them and compute the
complete-link distances from the new cluster (1,2) to the remaining items:
$d_{(1,2),3} = \max\{d_{1,3}, d_{2,3}\} = \max\{6, 3\} = 6$
$d_{(1,2),4} = \max\{d_{1,4}, d_{2,4}\} = \max\{10, 9\} = 10$
$d_{(1,2),5} = \max\{d_{1,5}, d_{2,5}\} = \max\{9, 8\} = 9$
Updated distance matrix:
         (1,2)   3    4    5
  (1,2)    0
    3      6     0
    4     10     7    0
    5      9     5    4    0
Example: complete link (continued)
The closest pair is now {4, 5} (distance 4); merge them and update:
$d_{(1,2),(4,5)} = \max\{d_{(1,2),4}, d_{(1,2),5}\} = \max\{10, 9\} = 10$
$d_{3,(4,5)} = \max\{d_{3,4}, d_{3,5}\} = \max\{7, 5\} = 7$
Updated distance matrix:
         (1,2)   3   (4,5)
  (1,2)    0
    3      6     0
  (4,5)   10     7     0
Example: complete link (continued)
The closest pair is now {(1,2), 3} (distance 6); merge them, and finally join
the two remaining clusters:
$d_{(1,2,3),(4,5)} = \max\{d_{(1,2),(4,5)}, d_{3,(4,5)}\} = \max\{10, 7\} = 10$
The resulting dendrogram merges 1 and 2 at height 2, merges 4 and 5 at height 4,
adds 3 to (1,2) at height 6, and joins the two groups at height 10 (compare with
the single-link tree).
Hierarchical: Average Link
• cluster similarity = average similarity of all
pairs
+ tight clusters
- slow
Software: TreeView
[Eisen et al. 1998]
• Fig 1 in Eisen’s PNAS 99
paper
• Time course of serum
stimulation of primary
human fibroblasts
• cDNA arrays with approx
8600 spots
• Similar to average-link
• Free download at:
http://rana.lbl.gov/EisenSoftware.htm
Hierarchical Clustering Issues
• Distinct clusters are not produced –
sometimes this can be good, if the data
has a hierarchical structure w/o clear
boundaries
• There are methods for producing distinct
clusters, but these usually involve
specifying somewhat arbitrary cutoff
values
Motivation Of New Hierarchical
Clustering Algorithm
• Traditional hierarchical clustering
algorithms have two drawbacks
– Topological structure of the hierarchy is fixed
– Once data is improperly clustered in the early
stage it cannot be re-evaluated later.
Dynamically Growing Self-Organizing Tree (DGSOT)
• DGSOT is a divisive hierarchical clustering algorithm
• DGSOT grows in two directions: vertical and horizontal.
– In each vertical growth, DGSOT adds two children to any leaf
whose heterogeneity is greater than a threshold, turning it into a
node.
– In each horizontal growth, the DGSOT dynamically finds the
proper number of children (sub-clusters) of the lowest level
nodes.
• Proper number of clusters is determined by cluster validation
criteria.
• Each vertical growth step is followed by a horizontal growth step.
• After each vertical and horizontal growth, a learning process is
applied.
Clustering
Results
A) The whole tree
constructed by SOTA
B) The whole tree
constructed by DGSOT
C) An enlarged picture
represents part of the
DGSOT tree that is
indicated by the red bar
Clustering
Results
A) Six highest-level clusters (1-6).
B-G) show the sub-clusters of the
six clusters in A.
Comparison of DGSOT with SOTA

             Cluster number          Number of ORFs
             DGSOT      SOTA         DGSOT      SOTA
               11         39           250        184
               12         26           182        142

MIPS Functional             ORFs within               P-value (-log10)
Category                    Functional Category
                            DGSOT      SOTA           DGSOT      SOTA
Ribosomal proteins            122        50             124        32
Organization of cytoplasm     141        67              87        25
Translation                    10         6               4         2
Ribosomal proteins             29        49              12        37
Organization of cytoplasm      43        64               9        30

SOTA is available at http://gepas.bioinfo.cnio.es/
Comparison DGSOT with SOTA
• The hierarchical
structure of
SOTA (A) and
DGSOT (B)
when each has
25 leaves.
A) Result of SOTA B) Result of DGSOT