
The use of Minimum Spanning Trees
in microarray expression data
Gkirtzou Ekaterini
University of Crete
CS483
1
Introduction

- Classic clustering algorithms, such as K-means and self-organizing maps, have certain drawbacks:
  - No guarantee of globally optimal results
  - Dependence on the geometric shape of cluster boundaries (K-means)
Introduction

- MST clustering algorithms:
  - Expression data clustering analysis (Xu et al., 2001)
  - Iterative clustering algorithm (Varma et al., 2004)
  - Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Definitions

- A minimum spanning tree (MST) of a weighted, undirected graph G = (V, E) with edge weights w(e) is an acyclic subset T ⊆ E that contains all of the vertices and whose total weight

    w(T) = Σ_{e ∈ T} w(e)

  is minimum.
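The definition above can be turned into code directly. Below is a minimal sketch of Prim's algorithm on a complete graph given as a symmetric weight matrix; the function name and representation are illustrative, not from the original slides:

```python
import math

def prim_mst(weights):
    """Prim's algorithm on a complete graph given as a symmetric
    weight matrix. Returns the MST as a list of (u, v, w) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n    # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0            # start growing the tree from vertex 0
    edges = []
    for _ in range(n):
        # pick the cheapest vertex not yet in the tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, weights[parent[u]][u]))
        # relax the connection costs of the remaining vertices
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges
```

For a complete graph on n vertices this runs in O(n²), which matches the dense distance matrices used in the algorithms that follow.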
Definitions

- DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. Its uses include:
  - comparing the activity of genes in diseased and healthy cells
  - categorizing a disease into subgroups
  - drug discovery and toxicology studies.
Definitions

- Clustering is a common technique for data analysis. Clustering partitions the data set into subsets (clusters), so that the data in each subset share some common trait.
MST clustering algorithms

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Expression data clustering analysis

- Let D = {d_i} be a set of expression data, with each d_i = (e_i1, ..., e_it) representing the expression levels at times 1 through t of gene i. We define a weighted, undirected graph G = (V, E) as follows: the vertex set is V = {d_i | d_i ∈ D} and the edge set is E = {(d_i, d_j) | d_i, d_j ∈ D and i ≠ j}.
Expression data clustering analysis

- G is a complete graph.
- The weight of each edge is the distance between its two vertices, e.g. Euclidean distance, correlation coefficient, etc.
- Each cluster corresponds to one subtree of the MST.
- No essential information is lost for clustering.
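The edge weights of the complete graph can be obtained by computing all pairwise distances between expression profiles. A minimal sketch using Euclidean distance (the function name is illustrative):

```python
import math

def euclidean_distance_matrix(profiles):
    """Pairwise Euclidean distances between expression profiles.
    Each profile is a sequence of expression levels (e_i1, ..., e_it)."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(profiles[i], profiles[j])))
            dist[i][j] = dist[j][i] = d   # the graph is undirected
    return dist
```

The resulting matrix can be fed directly to an MST routine such as Prim's algorithm; a correlation-based distance could be substituted without changing the rest of the pipeline.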
Clustering through removing long MST-edges

- Based on the intuitive notion of a cluster
- Works very well when inter-cluster edges are longer than intra-cluster ones
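The edge-removal idea can be sketched as follows: delete the K−1 longest MST edges and read the K clusters off as the connected components of what remains. This is a hypothetical helper, assuming the MST is given as (u, v, weight) triples:

```python
def mst_clusters(n, mst_edges, k):
    """Split an MST over n vertices into k clusters by deleting the
    k-1 longest edges, then collecting connected components."""
    # keep all but the k-1 heaviest edges
    kept = sorted(mst_edges, key=lambda e: e[2])[: len(mst_edges) - (k - 1)]
    adj = {v: [] for v in range(n)}
    for u, v, _ in kept:
        adj[u].append(v)
        adj[v].append(u)
    # depth-first search to collect the connected components
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters
```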
An iterative clustering

- Minimize the total distance between the center of each cluster and its data:

    Σ_{i=1}^{K} Σ_{d ∈ T_i} dist(d, center(T_i))   (1)

- Starts with K arbitrary clusters (subtrees) of the MST
- For each pair of adjacent clusters, finds the edge to cut that optimizes (1)
A globally optimal clustering

- Tries to partition the tree into K subtrees
- Selects K representatives d_i to optimize:

    Σ_{i=1}^{K} Σ_{d ∈ T_i} dist(d, d_i)   (2)
MST clustering algorithms

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Iterative clustering algorithm

- The clustering measure used here is the Fukuyama–Sugeno measure:

    FS(S) = Σ_{k=1}^{2} Σ_{j=1}^{N_k} ( ||x_j^k − μ_k||² − ||μ_k − μ||² )

  where S_1, S_2 are the two partitions of the set S, each S_k contains N_k samples, μ_k denotes the mean of the samples in S_k, μ the global mean of all samples, and x_j^k the j-th sample in cluster S_k.
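The measure can be computed directly from its definition. Below is a minimal sketch for one-dimensional samples and a binary partition (the function name is illustrative):

```python
def fukuyama_sugeno(s1, s2):
    """Fukuyama-Sugeno measure for a binary partition (S1, S2) of
    one-dimensional samples: for each cluster k, sum over its samples
    of (x - mu_k)^2 - (mu_k - mu)^2, with mu the global mean."""
    all_samples = s1 + s2
    mu = sum(all_samples) / len(all_samples)   # global mean
    fs = 0.0
    for part in (s1, s2):
        mu_k = sum(part) / len(part)           # cluster mean
        for x in part:
            fs += (x - mu_k) ** 2 - (mu_k - mu) ** 2
    return fs
```

The first term rewards compact clusters, the second rewards cluster means far from the global mean, so lower (more negative) values indicate a better binary partition.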
Iterative clustering algorithm

- Feature selection counts each gene's support for a partition
- The feature selection used here is the t-statistic with pooled variance; the t-statistic is a heuristic measure
- Genes with an absolute t-statistic greater than a threshold are selected
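A sketch of the pooled-variance t-statistic and the thresholding step. All names here are illustrative; `expr` is assumed to map a gene name to its list of expression values, and `labels_a`/`labels_b` to index the samples of the two clusters:

```python
import math

def pooled_t_statistic(x, y):
    """Two-sample t-statistic with pooled variance for one gene's
    expression values in the two clusters x and y."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)   # unbiased variances
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def select_genes(expr, labels_a, labels_b, threshold):
    """Keep the genes whose |t| exceeds the threshold."""
    return [g for g, vals in expr.items()
            if abs(pooled_t_statistic([vals[i] for i in labels_a],
                                      [vals[i] for i in labels_b])) > threshold]
```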
Iterative clustering algorithm

- Create an MST from all genes
- Delete edges from the MST to obtain binary partitions; select the one with minimum F-S clustering measure
- Feature selection is then used to select a subset of genes that discriminate between the clusters
Iterative clustering algorithm

- In the next iteration the clustering is done on this selected set of genes
- Iterate until the selected gene subset converges
- Then remove these genes from the pool and continue.
MST clustering algorithms

- Expression data clustering analysis (Xu et al., 2001)
- Iterative clustering algorithm (Varma et al., 2004)
- Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Dynamically growing self-organizing tree (DGSOT)

- In the previous algorithms the MST is constructed on the original data set and used to test the intra-cluster property, while here the MST is used as a criterion to test the inter-cluster property.
DGSOT algorithm

- A tree-structured self-organizing neural network
- Grows vertically and horizontally
- Starts with a single root-leaf node
- In every vertical-growing phase, for every leaf node whose heterogeneity exceeds a threshold TR, two descendants are created and the learning process takes place
DGSOT algorithm

- Heterogeneity of a leaf can be measured by:
  - Variability (the maximum distance between the input data and the node)
  - The average distortion d_i of a leaf:

      d_i = Σ_{j=1}^{D} d(x_j, n_i) / D

    where D is the total number of input data of leaf i, d(x_j, n_i) is the distance between datum j and leaf i, and n_i is the reference vector of leaf i
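The average distortion follows directly from the formula. A minimal sketch, assuming Euclidean distance and an illustrative function name:

```python
import math

def average_distortion(data, reference):
    """Average distortion of a leaf: the mean distance between each
    input datum assigned to the leaf and its reference vector n_i."""
    def dist(x, n):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, n)))
    return sum(dist(x, reference) for x in data) / len(data)
```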
DGSOT algorithm

- In every horizontal-growing phase, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place
- The learning process distributes the data among the leaves in the best way; the best-matching node is the one with the minimum distance to the input data
The validation criterion of DGSOT

- Calculated without human intervention
- Based on geometric characteristics of the clusters
- Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set D into n regions V(p):

    V(p) = {x ∈ D | dist(x, p) ≤ dist(x, q) for all q}
The validation criterion of DGSOT

- Define a weighted, undirected graph G = (V, E): the vertex set is the set of centroids p_i of the Voronoi cells, and the edge set is

    E = {(p_i, p_j) | p_i, p_j ∈ V and i ≠ j}

- Create the MST of the graph G = (V, E)
Voronoi diagram of a 2D dataset

- In A, the dataset is partitioned into three Voronoi cells; the MST of the centroids is 'even'.
- In B, the dataset is partitioned into four Voronoi cells; the MST of the centroids is not 'even'.
The validation criterion of DGSOT

- Cluster separation:

    CS = E_min / E_max

  where E_min is the minimum-length edge and E_max the maximum-length edge of the MST
- A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
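The criterion above can be sketched end to end: build the MST of the Voronoi-cell centroids with Prim's algorithm and take the ratio of the shortest to the longest MST edge. A minimal sketch with an illustrative function name, assuming Euclidean distances between centroids:

```python
import math

def cluster_separation(centroids):
    """CS = E_min / E_max over the edges of the MST built on the
    Voronoi-cell centroids (Prim's algorithm, Euclidean weights)."""
    n = len(centroids)
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    in_tree = [True] + [False] * (n - 1)          # grow the tree from centroid 0
    best = [dist(centroids[0], c) for c in centroids]
    edge_lengths = []
    for _ in range(n - 1):
        # attach the centroid with the cheapest connection to the tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        edge_lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(centroids[u], centroids[v]))
    return min(edge_lengths) / max(edge_lengths)
```

Evenly spaced centroids give CS near 1 (an 'even' MST, a valid partition), while one outlying edge drives CS toward 0.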
Example of DGSOT
Conclusions

- The tree algorithms presented in this report have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.
Questions