Clustering Time-Series Gene Expression Data Using Smoothing


Clustering Time-Series Gene
Expression Data Using Smoothing
Spline Derivatives
S. DÉJEAN, P. G. P. MARTIN, A. BACCINI,
AND P. BESSE
EURASIP JOURNAL ON BIOINFORMATICS
AND SYSTEMS BIOLOGY 2007
Outline
 Introduction
 Biological Experiment
 Statistical Methodology
 Results
 Discussion
Introduction
 For time-series gene expression data, the task is to cluster
temporal profiles.
 This paper focuses on the shapes of the curves
rather than on the absolute level of expression.
 The shapes of the curves may provide meaningful
information on coordinate gene regulation.
Introduction
 How to describe the shapes of the curves?
 We use splines for continuous representations, as in
Bar-Joseph et al., and their derivatives to describe the shapes.
Outline
 Introduction
 Biological Experiment
 Statistical Methodology
 Results
 Discussion
Biological Experiment
 Experimental design:
 44 mice were subjected to 11 different fasting periods ranging
from 0 to 72 hours.
 At each time point (0, 3, 6, 9, 12, 18, 24, 36, 48, 60, and 72 hours), 4 mice
were euthanized and their livers were used for RNA extraction.
 The sampling rate decreases over time because it was
assumed that most of the gene expression changes
would occur at the beginning of fasting.
Biological Experiment
 Data preprocessing:
 All data were log-transformed.
 Because of missing expression values, only 130 of the
200 genes were retained for analysis.
 So, the data set includes:
 130 genes
 11 time points
 4 samples at each time point
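The preprocessing can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the nested-list layout, the function names `preprocess` and `time_means`, the use of the natural logarithm, and the rule "drop a gene if any of its values is missing" are all hypothetical choices, since the slides do not specify them.

```python
import math


def preprocess(raw):
    """Log-transform the data and drop genes with missing values.

    raw maps a gene name to a list with one entry per time point; each
    entry is the list of replicate measurements (None = missing).
    Assumption: a gene is dropped as soon as one value is missing.
    """
    kept = {}
    for gene, series in raw.items():
        if any(v is None for reps in series for v in reps):
            continue  # assumed filtering rule, not stated in the source
        # Assumption: natural logarithm; the base is not given in the source.
        kept[gene] = [[math.log(v) for v in reps] for reps in series]
    return kept


def time_means(log_series):
    """Mean of the replicates at each time point (input to the smoothing step)."""
    return [sum(reps) / len(reps) for reps in log_series]


# Toy input: 2 genes, 2 time points, 2 replicates each.
raw = {
    "geneA": [[1.0, math.e], [1.0, 1.0]],
    "geneB": [[2.0, None], [3.0, 4.0]],  # has a missing value
}
clean = preprocess(raw)
means = {gene: time_means(series) for gene, series in clean.items()}
```

In the actual data set each gene would carry 11 time points with 4 replicates each.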
Outline
 Introduction
 Biological Experiment
 Statistical Methodology
 Results
 Discussion
Statistical Methodology
 The methodology was composed of two steps:
 Signal Extraction
 Clustering the derivatives of the smoothed curves.
 The data set:
Signal Extraction
 Consider the observed gene expression values.
 Two assumptions:
 The values are noisy observations of the “true” value.
 The underlying biological phenomenon is a regular, and
hence differentiable, function of time.
Signal Extraction
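 According to the assumptions, consider the following model
for each gene:
$$y_{ij} = f(t_i) + \varepsilon_{ij}$$
where $y_{ij}$ denotes the observation for the $j$-th mouse
at time $t_i$, and $\varepsilon_{ij}$ is the noise term.
 How to estimate $f$, a continuous and differentiable
function?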
Signal Extraction
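 Using cubic spline smoothing, the estimate of the gene
expression curve is the solution of the optimization
problem:
$$\hat{f} = \arg\min_{f} \sum_{i} \left(\bar{y}_i - f(t_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 \, dt$$
where $\bar{y}_i$ is the mean of the observations at time $t_i$.
 $\lambda$: the smoothing parameter.
 First term: forces the solution to be close to the mean values.
 Second term: controls the regularity of the function.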
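A cubic smoothing spline of this kind can be computed by solving a small linear system. The sketch below is a pure-Python illustration, assuming an unweighted sum of squares to the per-time-point means and a second-derivative penalty, and using the classical value/second-derivative formulation of the natural cubic spline (Reinsch; Green and Silverman). The names `solve` and `smoothing_spline` and the argument `lam` are hypothetical; in practice a library routine would normally be used instead.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def smoothing_spline(t, y, lam):
    """Natural cubic smoothing spline through knots t for values y.

    Minimizes sum_i (y[i] - f(t[i]))**2 + lam * integral of f''(t)**2.
    Returns (g, gamma): fitted values and second derivatives at the knots.
    """
    n = len(t)
    m = n - 2  # number of interior knots
    h = [t[i + 1] - t[i] for i in range(n - 1)]
    Q = [[0.0] * m for _ in range(n)]  # n x m band matrix
    R = [[0.0] * m for _ in range(m)]  # m x m symmetric tridiagonal matrix
    for j in range(m):
        Q[j][j] = 1.0 / h[j]
        Q[j + 1][j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2][j] = 1.0 / h[j + 1]
        R[j][j] = (h[j] + h[j + 1]) / 3.0
        if j + 1 < m:
            R[j][j + 1] = R[j + 1][j] = h[j + 1] / 6.0
    # Solve (R + lam * Q^T Q) gamma = Q^T y for the interior second derivatives.
    A = [[R[a][b] + lam * sum(Q[i][a] * Q[i][b] for i in range(n))
          for b in range(m)] for a in range(m)]
    rhs = [sum(Q[i][a] * y[i] for i in range(n)) for a in range(m)]
    inner = solve(A, rhs)
    # Fitted values: g = y - lam * Q gamma.
    g = [y[i] - lam * sum(Q[i][a] * inner[a] for a in range(m)) for i in range(n)]
    gamma = [0.0] + inner + [0.0]  # natural spline: zero curvature at both ends
    return g, gamma


# Toy data (not from the study): a zigzag of mean values.
t = [0, 1, 2, 3, 4]
ybar = [0.0, 1.0, 0.0, 1.0, 0.0]
g_interp, _ = smoothing_spline(t, ybar, 0.0)  # lam = 0: interpolates the means
g_flat, _ = smoothing_spline(t, ybar, 1e6)    # large lam: near the least-squares line
```

With `lam = 0` the fit reproduces the means exactly, while a very large `lam` pushes it toward the least-squares straight line, which illustrates the two roles of the terms in the criterion.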
Signal Extraction
 The solution shape and its smoothness depend
directly on the smoothing parameter $\lambda$.
 How to tune the smoothing parameter to extract
the informative part of the signal?
Tuning the smoothing parameter
 Considering the influence of the smoothing parameter $\lambda$:
Tuning the smoothing parameter
 A single value of the smoothing parameter $\lambda$ for all genes.
 A heuristic approach combining two levels of
reflection:
 Eigenelements of the PCA performed a posteriori.
 Biological interpretation of the results.
Eigenvalues and eigenvectors smoothness
 For each value of the smoothing parameter $\lambda$:
 Each gene expression profile is smoothed with the
same $\lambda$ value.
 First derivatives are computed and discretized.
Eigenvalues and eigenvectors smoothness
 Then, a PCA is computed, leading to a scree graph.
When the smoothing parameter $\lambda$ is large, each derivative is nearly constant, so the PCA gives only
one large eigenvalue. As $\lambda$ decreases, further eigenvalues emerge.
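The single large eigenvalue at heavy smoothing can be checked with a short derivation. The notation is introduced here for illustration only ($n$ genes, $p$ discretization points, $b_g$ the slope for gene $g$, $\mu_k$ the PCA eigenvalues), and the PCA is assumed to center each column.

```latex
% As the smoothing parameter grows, the smoothing spline tends to the
% least-squares line, so gene g has a constant derivative b_g at every point.
\begin{align*}
D   &= b\,\mathbf{1}_p^{\top}, \qquad b = (b_1,\dots,b_n)^{\top}
      && \text{(matrix of discretized derivatives)} \\
D_c &= (b - \bar{b}\,\mathbf{1}_n)\,\mathbf{1}_p^{\top}
      && \text{(column-centered, still rank one)} \\
S   &= \tfrac{1}{n}\,D_c^{\top}D_c = s_b^{2}\,\mathbf{1}_p\mathbf{1}_p^{\top},
      \qquad s_b^{2} = \tfrac{1}{n}\sum_{g=1}^{n}(b_g-\bar{b})^{2} \\
\mu_1 &= p\,s_b^{2}, \qquad \mu_2 = \dots = \mu_p = 0
      && \text{(first eigenvector } \mathbf{1}_p/\sqrt{p}\text{)}
\end{align*}
```

With less smoothing the derivatives are no longer constant in time, the matrix is no longer rank one, and additional non-zero eigenvalues appear in the scree graph.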
Eigenvalues and eigenvectors smoothness
 The first two eigenvectors:
As the smoothing parameter $\lambda$ decreases, the first two eigenvectors become much more irregular and
more difficult to interpret.
Biological interpretation
 Consistency with biological knowledge should also be
considered.
 For higher values of the smoothing parameter $\lambda$, two or three time points are enough to
describe the phenomena.
 As $\lambda$ decreases, the additional oscillations in the eigenvectors may be
biologically irrelevant.
 Taking this into account is important to avoid
misinterpretation.
Synthesis
 Combining the two levels of reflection yields the
chosen value of the smoothing parameter $\lambda$.
 At this value, two eigenvalues are clearly separated from the rest, and the
corresponding eigenvectors are smooth enough to
interpret the gene expression profiles.
Clustering
 To represent each gene expression profile by the
derivative of its smoothed curve, the derivative is evaluated at 20 points
equally spaced between 0 and 72 hours.
 The data set to be clustered thus has 130 genes and 20
derivative values for each gene.
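The discretization step can be sketched as follows. The code assumes the smoothed curve is a natural cubic spline stored as its values `g` and second derivatives `gamma` at the knots `t`, which is one common representation. The function names are hypothetical and the inputs in the usage lines are toy values, not results from the study.

```python
def spline_derivative(t, g, gamma, x):
    """First derivative at x of the cubic spline with values g and
    second derivatives gamma at the knots t (t[0] <= x <= t[-1])."""
    i = 0
    while i < len(t) - 2 and x > t[i + 1]:
        i += 1  # locate the interval [t[i], t[i+1]] containing x
    h = t[i + 1] - t[i]
    a, b = t[i + 1] - x, x - t[i]
    return ((g[i + 1] - g[i]) / h
            - gamma[i] * (a * a / (2 * h) - h / 6)
            + gamma[i + 1] * (b * b / (2 * h) - h / 6))


def derivative_grid(t, g, gamma, n_points=20):
    """Evaluate the derivative at n_points equally spaced points over [t[0], t[-1]]."""
    lo, hi = t[0], t[-1]
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    return grid, [spline_derivative(t, g, gamma, x) for x in grid]


# Toy profile: a straight line of slope 0.05 over 0 to 72 hours.
knots = [0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72]
values = [0.05 * x for x in knots]
curvatures = [0.0] * len(knots)
grid, deriv = derivative_grid(knots, values, curvatures)
```

Stacking the 20 derivative values of every gene row by row gives the 130 x 20 matrix used in the clustering step.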
Clustering
 Two kinds of clustering are applied:
 Hierarchical clustering
 K-means clustering
 Hierarchical clustering:
 Ward criterion: at each step, the two clusters that are fused are those that minimize the
increase in the total within-cluster sum of squares.
 K-means clustering:
 To correct possible improper fusions made during hierarchical clustering, we run
k-means clustering initialized with the k centroids obtained from the
hierarchical clustering.
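The two-stage procedure can be sketched in plain Python. This is a naive illustration on toy points, assuming Euclidean distance and Lloyd's algorithm for k-means; the function names are hypothetical, and a real analysis would more likely rely on a statistics library.

```python
def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ward_cost(A, B):
    """Increase in total within-cluster sum of squares if A and B are fused."""
    nA, nB = len(A), len(B)
    return nA * nB / (nA + nB) * sq_dist(centroid(A), centroid(B))


def ward_clusters(points, k):
    """Naive agglomerative clustering with the Ward criterion, cut at k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def kmeans(points, init_centers, max_iter=100):
    """Lloyd's algorithm started from the given centers; returns (labels, centers)."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


# Toy data: three well-separated pairs of points.
points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [10.0, 0.0], [10.1, 0.1]]
hc = ward_clusters(points, 3)
seeds = [centroid(c) for c in hc]      # centers of the hierarchical classes
labels, centers = kmeans(points, seeds)
```

Seeding k-means with the hierarchical centroids keeps the structure found by the dendrogram while letting individual points move to a closer cluster.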
Outline
 Introduction
 Biological Experiment
 Statistical Methodology
 Results
 Discussion
Results
 Hierarchical clustering: the dendrogram with four
clusters.
Results
 The four clusters correspond to four temporal
expression profiles:
 Weakly increasing (hc1)
 Stationary (hc2)
 Decreasing (hc3)
 Strongly increasing (hc4)
Results
 To make the clustering more robust, k-means
clustering is performed, with the initial centers set to
the centers of the classes obtained when cutting the
dendrogram.
 Clustering changes:
The main change.
Results
 The four clusters of K-means clustering:
Graphical display
 Representation of variables (time points) and
individuals (genes), with the four k-means clusters labeled km1, km2, km3, and km4.
Outline
 Introduction
 Biological Experiment
 Statistical Methodology
 Results
 Discussion
Discussion
 Before the clustering step, spline smoothing is
applied as a de-noising method.
 This work suggests that the sampling time points can be chosen
to suit the scientific aims.