Clustering Time-Series Gene
Expression Data Using Smoothing
Spline Derivatives
S. DÉJEAN, P. G. P. MARTIN, A. BACCINI,
AND P. BESSE
EURASIP JOURNAL ON BIOINFORMATICS
AND SYSTEMS BIOLOGY 2007
Outline
Introduction
Biological Experiment
Statistical Methodology
Results
Discussion
Introduction
For time-series gene expression data, the task is to
cluster temporal profiles.
This paper focuses on the shapes of the curves
rather than on the absolute level of expression.
The shapes of the curves may provide meaningful
information on coordinate gene regulation.
Introduction
How to describe the shapes of the curves?
Splines give a continuous representation of each profile,
as in Bar-Joseph et al., and the derivative describes its shape.
Biological Experiment
Experimental design:
44 mice were subjected to 11 different fasting periods ranging
from 0 to 72 hours.
At each time point (0, 3, 6, 9, 12, 18, 24, 36, 48, 60 and 72 hours),
4 mice were euthanized and their livers were used for RNA extraction.
The sampling rate decreases over time because most of
the gene expression changes were assumed to occur at
the beginning of fasting.
Biological Experiment
Data preprocessing:
All data were log-transformed.
Because of missing expression values, only 130 of the total
200 genes were retained for analysis.
So, the data set includes:
130 genes
11 time points
4 samples at each time point
Statistical Methodology
The methodology was composed of two steps:
Signal Extraction
Clustering the derivatives of the smoothed curves.
The data set:
Signal Extraction
Consider the observed gene expression values.
Two assumptions:
The values are noisy observations of the “true” value.
The underlying biological phenomenon is a regular, and
therefore differentiable, function of time.
Signal Extraction
According to the assumptions, consider the model
for each gene expression:
$y_{ij} = f(t_i) + \varepsilon_{ij}$
where $y_{ij}$ denotes the observation for the $j$-th
mouse at time $t_i$.
How to estimate $f$, a continuous and differentiable
function?
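The model can be illustrated with simulated data. This is only a sketch: the curve `f_true`, the noise level `sigma` and the random seed are assumptions made for illustration, not values from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fasting durations in hours (11 time points), 4 mice per time point.
t = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72], dtype=float)
n_mice = 4

def f_true(time):
    # Hypothetical smooth profile: fast early change, then a plateau.
    return 1.5 * (1.0 - np.exp(-time / 10.0))

sigma = 0.2  # assumed noise standard deviation

# y[i, j] = f(t_i) + eps_ij: noisy observation for mouse j at time t_i.
y = f_true(t)[:, None] + sigma * rng.normal(size=(t.size, n_mice))

# Mean over the 4 mice at each time point.
y_bar = y.mean(axis=1)
print(y.shape, y_bar.shape)
```

Each row of `y` holds the four replicate observations at one time point; the unknown function is what the smoothing step tries to recover from them.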
Signal Extraction
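Using cubic spline smoothing, the estimate of the gene
expression curve is the solution of the optimization
problem:
$\hat{f} = \arg\min_{f} \sum_{i} \left(\bar{y}_i - f(t_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 \, dt$
where $\bar{y}_i$ is the mean of the observations at time $t_i$.
$\lambda$: smoothing parameter.
First term: forces the solution to be close to the mean values.
Second term: controls the regularity of the function.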
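A minimal numerical sketch of this trade-off, assuming the standard cubic smoothing spline criterion (squared distance to the per-time means plus the smoothing parameter times the integrated squared second derivative). The helper names `penalty_matrices` and `smooth`, the variable `lam` and the synthetic means are illustrative assumptions. For a natural cubic spline with knots at the time points, the roughness integral is a quadratic form in the fitted values, so the minimizer is the solution of a linear system.

```python
import numpy as np

def penalty_matrices(t):
    """Matrices Q and R of the natural cubic spline, such that the
    integral of f''(t)^2 equals g @ K @ g with K = Q R^{-1} Q^T."""
    n, h = len(t), np.diff(t)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    return Q, R

def smooth(t, y_bar, lam):
    """Fitted values g minimizing sum((y_bar - g)**2) + lam * g @ K @ g,
    returned together with the roughness g @ K @ g."""
    Q, R = penalty_matrices(t)
    K = Q @ np.linalg.solve(R, Q.T)
    g = np.linalg.solve(np.eye(len(t)) + lam * K, y_bar)
    return g, float(g @ K @ g)

t = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72], dtype=float)
rng = np.random.default_rng(1)
# Synthetic per-time means (assumption): smooth trend plus noise.
y_bar = 1.5 * (1.0 - np.exp(-t / 10.0)) + 0.1 * rng.normal(size=t.size)

for lam in (0.0, 1.0, 100.0):
    g, rough = smooth(t, y_bar, lam)
    rss = np.sum((y_bar - g) ** 2)
    print(f"lam={lam:g}  residual SS={rss:.4f}  roughness={rough:.6f}")
```

With `lam = 0` the fit passes through the means; as `lam` grows the residual sum of squares increases while the roughness shrinks, and straight lines are left unchanged because their roughness is zero.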
Signal Extraction
The solution shape and its smoothness depend
directly on $\lambda$.
How to tune the smoothing parameter to extract
the informative part of the signal?
Tuning the smoothing parameter
Considering the influence of $\lambda$:
Tuning the smoothing parameter
A single value of $\lambda$ is used for all genes.
A heuristic approach combining two levels of
reflection:
Eigenelements of the PCA performed a posteriori.
Biological interpretations of results.
Eigenvalues and eigenvectors smoothness
For each candidate value of $\lambda$:
Each gene expression profile is smoothed with the
same value of $\lambda$.
First derivatives are computed and discretized.
Eigenvalues and eigenvectors smoothness
Then, a PCA is computed, leading to a scree graph.
When $\lambda$ is large, each derivative is constant, so the PCA gives only
one large eigenvalue. As $\lambda$ decreases, other eigenvalues emerge.
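The procedure (smooth every gene with the same smoothing parameter, discretize the first derivatives, run a PCA) can be sketched as follows. The profiles are synthetic, and the helper names `fit_spline`, `derivative` and `eigen_shares`, the 20-point grid and the tested values of `lam` are assumptions for illustration; the spline is the natural cubic smoothing spline with knots at the time points.

```python
import numpy as np

def fit_spline(t, y, lam):
    """Knot values g and second derivatives m of the cubic smoothing spline."""
    n, h = len(t), np.diff(t)
    Q, R = np.zeros((n, n - 2)), np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j], Q[j + 2, j] = 1 / h[j], 1 / h[j + 1]
        Q[j + 1, j] = -1 / h[j] - 1 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6
    K = Q @ np.linalg.solve(R, Q.T)
    g = np.linalg.solve(np.eye(n) + lam * K, y)
    m = np.zeros(n)  # natural spline: second derivative is 0 at both ends
    m[1:-1] = np.linalg.solve(R, Q.T @ g)
    return g, m

def derivative(t, g, m, x):
    """First derivative of the fitted spline evaluated at the points x."""
    i = np.clip(np.searchsorted(t, x, side="right") - 1, 0, len(t) - 2)
    h = t[i + 1] - t[i]
    return ((g[i + 1] - g[i]) / h - (m[i + 1] - m[i]) * h / 6
            + (m[i + 1] * (x - t[i]) ** 2 - m[i] * (t[i + 1] - x) ** 2) / (2 * h))

def eigen_shares(t, profiles, lam, grid):
    """Share of variance of each principal component of the derivatives."""
    D = np.array([derivative(t, *fit_spline(t, y, lam), grid) for y in profiles])
    D -= D.mean(axis=0)  # center each discretized time point
    s = np.linalg.svd(D, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)

t = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72], dtype=float)
grid = np.linspace(0.0, 72.0, 20)

# Synthetic per-time means for 130 genes (assumption, not the real data).
rng = np.random.default_rng(2)
amp = rng.normal(size=130)
rate = rng.uniform(5.0, 30.0, size=130)
profiles = amp[:, None] * (1 - np.exp(-t[None, :] / rate[:, None]))
profiles += 0.1 * rng.normal(size=profiles.shape)

for lam in (1e6, 10.0, 0.1):
    shares = eigen_shares(t, profiles, lam, grid)
    print(f"lam={lam:g}  first four variance shares: {np.round(shares[:4], 3)}")
```

Plotting the shares for each value gives one scree graph per candidate; the singular vectors of the centered derivative matrix play the role of the eigenvectors examined next.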
Eigenvalues and eigenvectors smoothness
The first two eigenvectors:
As $\lambda$ decreases, the first two eigenvectors become much more irregular
and more difficult to interpret.
Biological interpretation
The consistency with biological relevance should be
considered.
For higher values of $\lambda$, two or three time points would be
enough to describe the phenomenon.
As $\lambda$ decreases, the additional oscillations in the eigenvectors
may be biologically irrelevant.
This check is important for avoiding
misinterpretation.
Synthesis
Combining the two levels of reflection yields the
retained value of the smoothing parameter $\lambda$.
There are clearly two separate eigenvalues, and the
corresponding eigenvectors are smooth enough to
interpret the gene expression profile.
Clustering
Each gene expression profile is represented by the
derivative of its smoothed curve, evaluated at 20 points
equally spaced between 0 and 72 hours.
The data can thus be presented as 130 genes with 20
derivative values for each gene.
Clustering
Two kinds of clustering are applied:
Hierarchical clustering
K-means clustering
Hierarchical clustering:
Ward criterion: when fusing two clusters, it minimizes the
increase in the total within-cluster sum of squares.
K-means clustering:
To correct any improper fusion made by the hierarchical clustering,
we run k-means clustering initialized with the k centroids obtained
from the hierarchical clustering.
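The two-step clustering can be sketched with SciPy. The input matrix is a synthetic stand-in for the 130 x 20 matrix of discretized derivatives, and the four hypothetical trends, the noise level and the choice `k = 4` are assumptions for illustration; `linkage`, `fcluster` and `kmeans2` are the SciPy routines used here, not necessarily the software used by the authors.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Synthetic stand-in for the genes x time points derivative matrix.
grid = np.linspace(0.0, 72.0, 20)
trends = np.array([0.02, 0.0, -0.03, 0.08])  # four hypothetical shapes
group = rng.integers(0, 4, size=130)
D = trends[group, None] * np.exp(-grid / 30.0)
D = D + 0.005 * rng.normal(size=D.shape)

# Step 1: Ward hierarchical clustering, dendrogram cut into k clusters.
k = 4
Z = linkage(D, method="ward")
hc = fcluster(Z, t=k, criterion="maxclust")  # labels 1..k

# Step 2: k-means started from the centroids of the hierarchical clusters.
init = np.array([D[hc == c].mean(axis=0) for c in range(1, k + 1)])
centroids, km = kmeans2(D, init, minit="matrix")  # labels 0..k-1

# Cross-tabulation: rows are hierarchical clusters, columns k-means clusters.
table = np.zeros((k, k), dtype=int)
np.add.at(table, (hc - 1, km), 1)
print(table)
```

Off-diagonal counts in `table` are the genes that k-means moved to a different cluster than the one assigned by cutting the dendrogram.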
Results
Hierarchical clustering: the dendrogram with four
clusters.
Results
The four clusters correspond to four temporal
expression profiles:
Weakly increasing (hc1)
Stationary (hc2)
Decreasing (hc3)
Strongly increasing (hc4)
Results
To make the clustering more robust, k-means
clustering is performed, with the initial centers set to
the centers of the classes obtained when cutting the
dendrogram.
Clustering changes:
The main change.
Results
The four clusters of K-means clustering:
Graphical display
Representation of variables (time points) and
individuals (genes), with the k-means clusters labeled
km1, km2, km3 and km4.
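The coordinates behind such a display can be obtained from a PCA of the derivative matrix. The sketch below uses a synthetic matrix in place of the real data, and the names `scores` and `loadings` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 130 genes x 20 time points derivative matrix.
grid = np.linspace(0.0, 72.0, 20)
amp = rng.normal(size=130)
D = amp[:, None] * np.exp(-grid / 30.0)[None, :]
D = D + 0.05 * rng.normal(size=D.shape)

# PCA through the singular value decomposition of the centered matrix.
Dc = D - D.mean(axis=0)
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)

scores = U[:, :2] * s[:2]  # individuals (genes) on the first two axes
loadings = Vt[:2].T        # variables (time points) on the same axes
print(scores.shape, loadings.shape)
```

Plotting `scores` colored by cluster label and `loadings` as arrows gives the two representations side by side.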
Discussion
Before the clustering step, spline smoothing is
applied as a de-noising method.
This work suggests that the sampling time points can be
chosen to suit the scientific aims.