Transcript 6892-3-12

Examples of Clustering Biological
Data
6.892 / 7.93 Spring Term 2002
March 12, 2002
David Gifford
Student Projects
• Ideally groups of four
– Two biologists
– Two computer scientists or quantitative experts
• Pick a research project related to the class
• You will have access to data, Matlab, etc.
• Work product
– Preliminary outline of plans (5 minutes in class)
– Final presentation (30 minutes)
– Talk and slides are graded
Ordering effect
Initial structure
Permuted
Hierarchical clustering
The problem
“There are 2n-1 linear orderings consistent with the structure of the
tree. …
An optimal linear ordering, one that maximizes the similarity of
adjacent elements in the ordering, is impractical to compute.”
[Eisen et al, PNAS 98]
Problem definition
Denote by  the space of the possible linear orderings consistent with
the tree.
Denote by v1 …vn the tree leaves.
Our goal is to find an ordering that maximizes the similarity of
adjacent elements:
n 1
max  S (v , v

i 1
where S is the similarity matrix

i

)
i 1
Computing the optimal similarity
Recursively compute the
optimal similarity OT(u,w) for
any pair of leaves (u,w) which
could be on different corners
(leftmost and rightmost) of T.
T
For a leaf uT, CT(u) is the set
of all possible corner leaves of
T when u is on one corner of T.
u
OT(u,w) = maxmCT
(u),kCT (w)
1
2
T1
T2
m
k
OT1(u,m) + OT2(k,w) + S(m,k)
w
Improvement
worst time is still O(n4) but …
num of leaves
clustering time
ordering time
with improvement
100
0
0
0
300
1
22
2
500
3
200
10
1000
32
4000 (66 min)
102
1500
115
24800 (7 hours)
420
Running time – biological datasets
type of dataset
num of
experiments
num of genes
clustering time
optimal
ordering
Different sources (Eisen)
79
979
26
50
Cell cycle (Spellman)
59
800
15
27
Cell cycle – cdc15
24
800
15
180
Environment response*
(Yang)
45
3684
910
(15 min)
257
(4 min)
*Results obtained on 700 Mhz Pentium pc with 512M memory running Linux
Does it help ?
Recall the statement we started with -
“An optimal linear ordering, one that maximizes the similarity of
adjacent elements in the ordering,
?
is impractical to compute.”
Results – hand generated data
Hierarchical clustering
Input
Optimal ordering
Hierarchical clustering
Input
Optimal ordering
Biological results
• Spellman identified 800 genes as cell cycle regulated in Saccharomyces
cerevisiae.
• Genes were assigned to five groups termed G1,S,S/G2,G2/M and M/G1
which approximate the commonly used cell cycle groups in the literature.
• This assignment was performed using a ‘phasing’ method which is a
supervised classification algorithm.
• In addition to the phasing method, the authors clustered these genes using
hierarchical clustering, noting:
“There is no simple relationship between these two [phasing and
clustering] methods, although there are common features in the results.”
Cell Cycle – 24
experiments of cdc15
temperature sensitive
mutant
Hierarchical clustering
Optimal ordering
24 experiments of cdc15
temperature sensitive
mutant
59 experiments,
combining cdc15,
cdc28 and  factor
arrest
Clustering of the 79
experiments in Eisen’s
paper. The numbers to
the right of each gene
represents the
complex to which it
belongs according to
the MIPS database.
Hierarchical clustering
Optimal ordering
Using optimal
ordering to
identify the
different
clusters. 24
experiments
of cdc15
mutant from
Spellman’s
paper.
0
1
0
1
Clustering Demos
• K-means
– Demo 1: 3 clusters in data, k = 3
– Demo 2: 1 cluster in data, k = 2
• Mixture Models
– Demo 3: 1 cluster in data, k = 2 (same data as Demo 2)
– Demo 4: 3 clusters in data, k = 2
– Demo 5: 3 clusters in data, k = 3
• I’ll put the code on the web site
Clustering Summary
• Clustering allows you to organize data and see
patterns
• Can reduce the dimensionality of data (e.g. PCA)
• Ultimately, we would like to use clusters to explain
biological phenomenon
• The first step is classification