An unsupervised conditional random fields approach for clustering


An unsupervised conditional random fields approach for clustering gene expression time series
Chang-Tsun Li, Yinyin Yuan and Roland Wilson
Bioinformatics, 2008
Outline
• Introduction
• Methods
• Results and Discussion
• Conclusions
Introduction
• Gene expression
• Gene expression time-series data
• Conditional random fields model
• Rand index
Gene expression
• Gene expression: the process by which inheritable information from a gene (the DNA sequence) is made into a functional gene product (protein or RNA).
• It's a temporal process, so we need to record time-course data of gene expression.
Gene expression data
• For example, the Cho mitotic cell-cycle dataset: 17 time points of expression data for synchronized yeast cells.
[Figure: expression levels at time t = 10 minutes after release from cell-cycle arrest.]
Gene expression data
• Co-expression indicates co-regulation or a relation in functional pathways.
• We therefore need an efficient clustering method to identify co-expressed genes.
Conditional Random Fields
• A type of discriminative probabilistic model most often used for labeling or parsing sequential data.
• Let X be a random variable over the observations and Y be a random variable over the corresponding labels.
• A CRF models the conditional distribution of Y given X:
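The formula itself is not preserved in this transcript; a standard way to write the CRF conditional distribution in Gibbs form (a hedged reconstruction consistent with the cost functions introduced later, not necessarily the paper's exact notation) is:

P(Y \mid X) = \frac{1}{Z(X)} \exp\bigl(-U(Y \mid X)\bigr),
\qquad
Z(X) = \sum_{Y'} \exp\bigl(-U(Y' \mid X)\bigr),

where U is a cost (energy) function over labelings and Z(X) is the normalizing constant.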
Rand index
• The Rand index is a measure of the similarity between two clusterings of the same data.
• It takes a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that they are exactly the same.
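As a concrete illustration (not from the slides), a minimal Python sketch of the plain Rand index, counting the pairs of points on which two clusterings agree; note that the experiments later report the adjusted Rand index, available e.g. as sklearn.metrics.adjusted_rand_score:

from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which the two clusterings agree,
    # i.e. both place the pair together or both place it apart.
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical partitions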
Methods
• Two steps:
– Data transformation
– Model formulation
• The proposed algorithm: the CRFs clustering algorithm.
Data Transformation
• How are the gene expression time series transformed into a multi-dimensional feature space?
• Two factors are taken into account:
– temporal correlation
– time intervals
Data Transformation
• Let n denote the number of time series, i.e. the number of genes.
• Each time series Ti of length τ is mapped to a point xi that captures both the temporal correlation between consecutive measurements and the length of each time interval.
• Each τ-time-point time series is thus transformed into a (τ−1)-dimensional feature vector.
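A minimal sketch of such a mapping, assuming each feature component is the expression change between consecutive time points scaled by the interval length (the paper's exact formula may differ):

import numpy as np

def transform(series, times):
    # Map a tau-point time series to a (tau - 1)-dimensional feature
    # vector: each component is the change between consecutive
    # expression values divided by the length of the time interval,
    # reflecting temporal correlation and unequal sampling intervals.
    return np.diff(np.asarray(series, float)) / np.diff(np.asarray(times, float))

x = transform([1.0, 2.0, 1.5, 3.0], [0, 10, 20, 40])
print(x)  # [ 0.1   -0.05   0.075] -- 4 time points -> 3 features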
Model formulation
• A set of n observed time series, X = {x1, x2, …, xn}.
• A set of corresponding labels, Y = {y1, y2, …, yn}.
• How do we assign an optimal class label yi to time series i, conditioned on the observed data xi?
Model formulation
• As an unsupervised clustering method, it must meet two requirements:
– not relying on cluster centroids;
– allowing each time series to be a singleton cluster.
• Two components address these requirements:
– the voting pool
– the cost function
Voting pool
• When a time series i is visited, its voting pool Ni of k time series is formed from:
– the s most similar (MS) time series;
– the single most different (MD) time series;
– k − s − 1 time series selected at random.
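A hedged sketch of pool construction under these rules; the distance matrix, tie-breaking and sampling details are assumptions, not the paper's exact procedure:

import numpy as np

def build_voting_pool(i, distances, k, s, rng=None):
    # Pool of size k for series i: its s most similar (MS) series,
    # its single most different (MD) series, and k - s - 1 series
    # picked at random. `distances` is an n x n distance matrix.
    rng = rng or np.random.default_rng()
    d = distances[i].astype(float).copy()
    d[i] = np.inf                      # never pool a series with itself
    order = np.argsort(d)              # ascending distance; i sorts last
    most_similar = list(order[:s])
    most_different = order[-2]         # order[-1] is i itself (at inf)
    taken = set(most_similar) | {most_different, i}
    rest = [j for j in range(len(d)) if j not in taken]
    randoms = rng.choice(rest, size=k - s - 1, replace=False)
    return most_similar + [most_different] + list(randoms)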
Voting pool
• Only the class labels currently assigned to the members of its voting pool are taken as candidates.
• The class label of time series i is then decided by its voting pool.
Voting pool
• What purposes do MS, MD and the randomly selected members serve?
• MD: indicates which label to avoid.
• MS: indicates which labels to choose.
• The randomly selected members: introduce new information into the pool.
Cost function
• Encourage ‘good labeling’ with low cost and
penalize ‘poor labeling’ with high cost.
• As in a CRF model, the conditional probability distribution and the prior are both written in Gibbs form.
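A plausible reconstruction of the missing formulas (assumed notation): by Bayes' rule the conditional distribution factors into a likelihood and a prior, each written in Gibbs form,

P(Y \mid X) \propto P(X \mid Y)\, P(Y),
\qquad
P(X \mid Y) \propto \exp\bigl(-U_1(X, Y)\bigr),
\qquad
P(Y) \propto \exp\bigl(-U_2(Y)\bigr),

so low total cost U_1 + U_2 corresponds to high probability, i.e. 'good labeling'.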
Cost function
• The two cost functions depend on the same set of arguments!
• They can therefore be combined, and we obtain a new model.
Cost function
• How should the new cost function be set?
• We don't need to specify the number of clusters or any information about centroids.
• So we define an estimated threshold between intra- and inter-class distances, computed from m randomly picked time series as:
– the average of all their minimum distances to the other series;
– the average of all their maximum distances to the other series.
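A hedged Python sketch of this estimate; taking the midpoint of the two averages as the threshold is an assumption here, and the paper may combine them differently:

import numpy as np

def estimate_threshold(distances, m, rng=None):
    # Pick m series at random; average their minimum distances to any
    # other series (proxy for intra-class distance) and their maximum
    # distances (proxy for inter-class distance), then take the midpoint.
    rng = rng or np.random.default_rng()
    picked = rng.choice(distances.shape[0], size=m, replace=False)
    d = distances[picked].astype(float).copy()
    d[np.arange(m), picked] = np.inf    # ignore self-distances for the minima
    avg_min = d.min(axis=1).mean()
    d[np.arange(m), picked] = -np.inf   # ignore self-distances for the maxima
    avg_max = d.max(axis=1).mean()
    return (avg_min + avg_max) / 2.0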
Cost function
• The cost function is defined in terms of these distances and the threshold.
• The assignment of a label is then determined by minimizing this cost.
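The assignment rule itself was shown as an image; plausibly it picks, among the candidate labels present in the voting pool, the one of minimum cost (assumed notation):

y_i = \arg\min_{y \in \{ y_j \,:\, j \in N_i \}} U\bigl(y \mid x_i, y_{N_i}\bigr)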
CRFs clustering algorithm
• The CRFs clustering algorithm proceeds in three stages: data transformation → voting pool construction → cost-function minimization.
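Putting the pieces together, a hedged end-to-end sketch that reuses the helper sketches above (transform, build_voting_pool, estimate_threshold); the pairwise cost below is purely illustrative, not the paper's actual function:

import numpy as np

def pairwise_cost(i, y, labels, pool, distances, threshold):
    # Illustrative cost (an assumption): same-label pool members should
    # lie nearer than the threshold, different-label members farther away.
    return sum(distances[i, j] - threshold if labels[j] == y
               else threshold - distances[i, j]
               for j in pool)

def crf_cluster(series_list, times, k, s, m, max_sweeps=50):
    X = np.array([transform(t, times) for t in series_list])   # 1. data transformation
    n = len(X)
    distances = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    threshold = estimate_threshold(distances, m)
    labels = np.arange(n)              # every series starts as a singleton cluster
    rng = np.random.default_rng()
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            pool = build_voting_pool(i, distances, k, s, rng)   # 2. voting pool
            candidates = {labels[j] for j in pool}              # pool labels only
            best = min(candidates, key=lambda y: pairwise_cost(
                i, y, labels, pool, distances, threshold))      # 3. cost function
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:                # stop when a full sweep changes nothing
            break
    return labels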
Results and Discussion
• Evaluation on toy cases
• Experiments on time-series datasets
– Simulated datasets
– Yeast galactose dataset
– Yeast cell-cycle dataset
• Evaluation of class prediction
Evaluation on toy cases
• How does the voting pool affect the
performance?
• Experimental setting:
– Three-dimensional feature patterns.
– Five clusters, each having 30 patterns.
– Voting pool sizes: 4, 6, 10, 18.
– One MS, one MD, and (k−2) random samples.
Evaluation on toy cases
• The five clusters are randomly created with
the five centroids fixed at:
– μ1: (40, 45, 100)
– μ2: (85, 60, 100)
– μ3: (65, 100, 80)
– μ4: (55, 120, 120)
– μ5: (120, 50, 120)
• All five clusters share the same variance of 64.
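A minimal sketch reproducing this toy setup (the random seed and the use of an isotropic Gaussian sampler are assumptions):

import numpy as np

rng = np.random.default_rng(0)
centroids = np.array([
    (40, 45, 100),
    (85, 60, 100),
    (65, 100, 80),
    (55, 120, 120),
    (120, 50, 120),
], dtype=float)
# Variance 64 means standard deviation 8; 30 patterns per cluster.
patterns = np.vstack([rng.normal(mu, 8.0, size=(30, 3)) for mu in centroids])
true_labels = np.repeat(np.arange(5), 30)
print(patterns.shape)  # (150, 3)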
Evaluation on toy cases
• Three cases:
– MS and MD included
– MD excluded
– MS excluded
Evaluation on toy cases
• The results:
– Without MD, the algorithm cannot tell which labels are unsuitable.
– Without MS, the algorithm can only guess at the correct labels.
Simulated datasets
• We synthesized five classes of 60, 60, 60, 80 and 60 gene expression profiles.
• Gaussian noise is added to all data.
Simulated datasets
• The results:
– The best adjusted Rand index, 0.9903, is obtained at pool size k = 12.
– Running time tends to decline steadily as the pool size grows.
• The clustering accuracy and efficiency of the proposed method depend on the choice of the pool size.
Simulated datasets
• Another experimental setting:
– Three datasets of 320 gene profiles, consisting of 4, 5 and 6 clusters.
– Three datasets of 400 gene profiles, consisting of 4, 5 and 6 clusters.
Simulated datasets
• The results:
– For the 320-profile datasets, good results with voting pool sizes of 6-12.
– For the 400-profile datasets, good results with voting pool sizes of 8-16.
• A pool size in the range of 2~4% of the dataset size is therefore reasonable.
Yeast galactose dataset
• The Yeast galactose dataset consists of gene
expression measurements in galactose
utilization in Saccharomyces cerevisiae.
• We compared the adjusted Rand index of our algorithm with that of CORE (Tjaden, 2006).
• The CORE algorithm is a supervised k-means clustering method.
Yeast galactose dataset
• The results:
– Good performance with voting pool sizes of 5-11; the optimal adjusted Rand index is 0.9478.
– The optimal value achieved by CORE, by contrast, is 0.7.
Yeast cell-cycle dataset
• In the Yeast cell-cycle (Y5) dataset, more than 6000 yeast genes are measured at 17 time points over two cell cycles.
• The genes are identified based on their peak times in the five phases of the cell cycle: early G1 (G1E), late G1 (G1L), S, G2 and M.
Yeast cell-cycle dataset
• The results:
– In this range of pool sizes, the adjusted Rand index is better than 0.467, the best result among methods that use no labeled data (e.g. HMM, k-means, splines).
– The optimal value is 0.4808.
Evaluation of class prediction
• We need to assign genes to known classes, i.e. class prediction.
• The datasets (Y5, Yeast galactose, Simulated) are divided into training and testing sets.
• The high prediction error rate is due to the high rate of misclassification in the training set.
Evaluation of class prediction
• The results:
– On the Y5 dataset, the prediction error rate is high.
Conclusion
• An efficient unsupervised CRFs model for both gene class discovery and class prediction, requiring no a priori knowledge, is presented.