PowerPoint - Department of Statistics

Periodicity Analysis
in a microarray time-course study
Xin Zhao¹, J.S. Marron², Martin T. Wells¹
¹Department of Statistics, Cornell University
²Department of Statistics, University of North Carolina
Abstract:
Microarray time-course genome-wide data are typically HDLSS (High
Dimension Low Sample Size). Gene expression profiles over time could be seen
as functional data. The functional approach could provide powerful new insights
for this type of data. Successful data reduction from a functional viewpoint is
used in an analysis of periodicities for a microarray gene expression data set. For
the purpose of analyzing periodicity, an appropriate Fourier Transformation
followed by PCA (Principal Component Analysis) reduces the dimension of data
from 18 to 2. The 2-dimensional Fourier subspace spanned by the sine and
cosine functions with 2 periods captures the main feature of periodicity in the
data. The distance to the origin in the subspace could be used to measure the
degree of periodicity for genes.
Introduction:
Identifying cell cycle-related genes is helpful in understanding the mechanisms
that maintain order during cell division and in studying cancer. Cell cycle-related
genes show periodic variation during the cell cycle. An α factor-based
synchronization experiment was conducted by Spellman et al. (1998) to study
yeast genome-wide gene expression during two cell cycles. Gene expression
levels were measured for 6,178 genes over 18 equally spaced time points (covering
2 cell cycles). After pre-processing the data by removing poor-quality
observations, 4,489 genes have no missing values.
Objective:
Identify cell cycle-related genes in the yeast genome, i.e., genes that express
periodically over the cell cycle.
Methods and Results:
1. Missing data imputation: KNN method
For the 1,689 genes with missing values, missing data points were estimated using the KNN
method with k = 12. To impute the missing value at a time point for a gene, we selected the 12
genes with expression profiles most similar to that gene. The weighted average of these 12 genes'
expression values at the time point is used as the estimate for the missing value.
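The imputation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_impute`, the Euclidean distance over observed time points, the inverse-distance weights, and the restriction of neighbours to genes without missing values are all assumptions, since the poster only states that a weighted average of 12 similar genes is used.

```python
import numpy as np

def knn_impute(data, k=12):
    """Fill NaNs in a (genes x time points) array from the k most similar genes.

    Assumptions (not stated on the poster): similarity is Euclidean distance
    over the time points observed for the target gene, neighbours are drawn
    from genes with no missing values, and weights are inverse distances.
    """
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    has_missing = np.isnan(data).any(axis=1)
    complete = data[~has_missing]              # candidate neighbour genes
    for g in np.where(has_missing)[0]:
        row = data[g]
        obs = ~np.isnan(row)                   # time points observed for gene g
        dist = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]         # indices of the k closest genes
        weights = 1.0 / (dist[nearest] + 1e-12)
        weights /= weights.sum()
        # Weighted average of the neighbours at each missing time point
        filled[g, ~obs] = weights @ complete[nearest][:, ~obs]
    return filled
```

Calling `knn_impute(expr, k=12)` on a 6,178 × 18 array with `NaN` for missing entries would return an array of the same shape with the gaps filled; `expr` is a hypothetical variable name.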
2. Analysis of periodicity
I. PCA (Principal Component Analysis) on raw data
- Figure 1 doesn’t reveal periodic structure in the raw data.
- Figure 2 shows the percentage of variation explained by each PC.
- Figures 3 and 4 show that PCA on the raw data doesn't reveal the frequency-2 periodic
structure expected from the two-cell-cycle experimental design.
Fig 1: raw data, 6,178 gene expression time series. x-axis is time point, y-axis is
log2(gene expression ratio)
Fig 2: power plot of PCA on raw data
Fig 3: projections of the data on the 1st PC direction
Fig 4: projections of the data on the 2nd PC direction
II. Project the data onto an appropriate Fourier subspace to reveal periodicity structure
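Fourier basis $B = \{\sin(i\omega t), \cos(i\omega t) : i = 2, 4, 6, 8\}$,
where $\omega = 2\pi / T$, $T = 18$, $t = 1, 2, \ldots, 18$.
Projected data $= B (B^{T} B)^{-1} B^{T} X$, where $B (B^{T} B)^{-1} B^{T}$ is the projection
matrix and $X$ is the $18 \times 6{,}178$ data matrix; the projected data are also $18 \times 6{,}178$.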
III. PCA on the projected data
- Figure 5 reveals a two-cell-cycle structure, but still no clear periodic structure.
- Figure 6 shows that the first two PCs explain about 60% of the total variation.
- Figures 7 and 8 show that the 1st (2nd) PC direction is similar to a sine (cosine) wave
over two periods.
Fig 5: projected data
Fig 6: power plot of PCA on the projected data
Fig 7: projections of the projected data on the 1st PC direction
Fig 8: projections of the projected data on the 2nd PC direction
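The PCA step, with genes treated as observations in 18-dimensional time space, can be sketched via a singular value decomposition. This is an illustrative sketch under assumptions: the function name `pca_over_time` is hypothetical, the data are centered by subtracting the mean curve across genes, and the synthetic input below stands in for the projected yeast data.

```python
import numpy as np

def pca_over_time(X):
    """PCA of a (time points x genes) matrix, treating each gene as one observation.

    Returns the PC directions (columns of U, each a curve over time), the
    percentage of variation explained by each PC, and the gene scores.
    """
    Xc = X - X.mean(axis=1, keepdims=True)       # subtract the mean curve
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = 100 * s**2 / np.sum(s**2)        # values for a "power plot"
    scores = U.T @ Xc                            # projections onto each PC direction
    return U, explained, scores

# Synthetic example: genes that are mixtures of a sine and a cosine over two periods
rng = np.random.default_rng(0)
t = np.arange(1, 19)
w = 2 * np.pi / 18
a = 3 * rng.normal(size=200)
b = rng.normal(size=200)
X_syn = np.outer(np.sin(2 * w * t), a) + np.outer(np.cos(2 * w * t), b)
U, explained, scores = pca_over_time(X_syn)
```

For data built this way, the first two PCs carry essentially all of the variation, so plotting `U[:, 0]` and `U[:, 1]` against `t` should show two-period waves.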
IV. Periodicity of genes in the 2-dim Fourier subspace spanned by $\{\sin(2\omega t), \cos(2\omega t)\}$
Project data onto the 2-dim Fourier subspace with x (y)-axis representing cosine (sine)
direction.
The distance to the origin in the subspace is a metric for the periodicity of a gene.
Figure 9: scatter plot of genes in the subspace.
x-axis is proj_cos, y-axis is proj_sin
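The distance-to-origin metric can be sketched as below. The output names `proj_cos` and `proj_sin` follow the axis labels of Figure 9; the function name `periodicity_scores` and the scaling of the two directions to unit length are assumptions, since the poster does not state a normalization.

```python
import numpy as np

def periodicity_scores(X, freq=2):
    """Project each gene (a column of X, time points x genes) onto the cosine and
    sine directions of the given frequency and return the distance to the origin.

    Assumption: both directions are scaled to unit length before projecting.
    """
    T = X.shape[0]
    t = np.arange(1, T + 1)
    omega = 2 * np.pi / T
    cos_dir = np.cos(freq * omega * t)
    sin_dir = np.sin(freq * omega * t)
    cos_dir /= np.linalg.norm(cos_dir)
    sin_dir /= np.linalg.norm(sin_dir)
    proj_cos = cos_dir @ X                     # x-axis of the scatter plot
    proj_sin = sin_dir @ X                     # y-axis of the scatter plot
    distance = np.hypot(proj_cos, proj_sin)    # larger value = more periodic
    return proj_cos, proj_sin, distance
```

Genes could then be ranked by `distance`, with the largest values as candidate cell cycle-related genes; any cutoff for calling a gene periodic is not specified in the source and would have to be chosen separately.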
Conclusions:
• The functional approach proved powerful for dimension reduction
when searching for patterns of interest in HDLSS data.
• The main feature of periodicity in the data can be preserved by the
2-dimensional Fourier subspace spanned by periodic functions of
frequency 2.
• The distance to the origin in the subspace is an appropriate metric
for the periodicity of genes.