Transcript Slide
1
Eigensolvers for analysis of
microarray gene expression data
Andrew Knyazev (speaker) and Donald McCuan
Image from http://www.biosci.utexas.edu/mgm/people/faculty/profiles/VIresearch.jpg
Supported by NSF DMS 0728941. In collaboration with CU MCD Biology.
2
Eigensolvers for DNA microarrays:
Crash course on gene expression
Microarrays---a massively parallel experiment
Clustering: why?
Clustering: how?
Spectral clustering
Connection to image segmentation
Eigensolvers for spectral clustering
3
Crash course on gene expression 1/3
Genes in DNA code for proteins
Protein formation in a cell involves:
Transcription of DNA to mRNA (messenger RNA)
Translation of mRNA to a protein
When proteins are being formed for a gene this is called
gene expression
DNA sense strand
antisense strand
.... ATA CGT ...
.... TAT GCA ...
transcription
mRNA
.... AUA CGU ...
translation
Protein
.... Ile Arg ...
4
Crash course on gene expression 2/3
Image Courtesy: cnx.org
5
Crash course on gene expression 3/3
Gene expression in a cell depends on many factors, e.g.,
developmental stage, nutrition, environment, and diseases,
so the level of gene expression may vary
Knowing how genes are expressed helps to understand
cellular processes and diagnose diseases
Measurement of the concentration of proteins in a cell is
complicated, so the concentration of mRNA is used
instead, assuming that most mRNA created is actually
translated to a protein
DNA Microarrays (e.g., Affymetrix GeneChip arrays)
measure the level of mRNA in a sample
Affymetrix GeneChip DNA Microarrays
Affymetrix GeneChip DNA Microarrays
Image Courtesy: Affymetrix
6
7
Microarrays-massively parallel experiment 2/5
GeneChip: oligonucleotide sequences are photo-lithographed on a
quartz wafer in a pattern of ~10 micrometers dots.
Oligonucleotide sequences (oligos) probes: 25 nucleotide chains for
selected parts of a gene complementary to mRNA.
For every gene there are 1120(depending on chip design) of
different oligo probes called perfect
matches (PM). In addition, there
are mismatch oligos (MM)
corresponding to each of the PMs
that differ in the middle base pair.
GeneChips are manufactured to include all currently known and
predicted genes of a particular organism, e.g., H. sapience. The
information about physical locations of oligo probes for each gene
on the chip is contained in the *.cdf file.
A sample of mRNA extracted from cells of an organism after preprocessing is hybridized with GeneChip giving PM and MM values
which characterize genes expressions in the cells.
8
Microarrays-massively parallel experiment 3/5
Labelled cRNA targets
derived from the mRNA of
an experimental sample
are hybridized to oligo
probes.
During hybridization,
complementary
nucleotides line up and
bind together via
hydrogen bonds in the
same way as two strands
of DNA bound together.
The chip is then scanned
with a laser giving the
amount of each mRNA
species represented.
Image Courtesy: cnx.org
9
Microarrays-massively parallel experiment 4/5
A pool of mRNA is extracted from the cells of an organism and
converted to a Biotin labelled strand (cRNA) that binds to the oligo
probes on the GeneChip during hybridization.
The higher the concentration of a particular mRNA in the testing pool--the greater the hybridization level of the PM probes and thus the
amount of the hybridized material on the processed GeneChip.
Then a fluorescent stain is applied that binds to the Biotin and the
GeneChip is processed through a scanner that illuminates each dot of
the GeneChip with a laser, causing dots to fluoresce.
The image data of the scanned probe array is stored in a *.dat file. The
Affymetrix GCOS software processes the *.dat file and generates a
*.cel file, containing all numerical data of the GeneChip experiment,
e.g., probe locations and PM and MM intensities. The processing
involves computing a square grid locating the dots for probes, intensity
normalization, using internal controls, and detecting the outliers.
More sophisticated *.dat-->*.cel algorithms, e.g., taking into account
the cRNA saturation, are being developed elsewhere.
10
Microarrays-massively parallel experiment 5/5
The PM and MM values are not normally used directly for highlevel statistical analysis, instead they are first converted into the
gene expression values, which involves:
Detecting unreliable data by comparing PM and MM
Adjustment for background and noise
Calculating the single array gene expression intensities, basically
by averaging adjusted PM values for each probe set
Alternatively, the Comparison Analysis (Experiment versus
Baseline arrays) detects and quantifies changes in gene
expressions between two arrays, applying normalization of data
and using the Signal Log Ratio algorithms.
Either way, the absolute or comparison gene expression values are
stored in a *.chp file, which serves as the input for high-level
statistical analysis. Typically, multiple GeneChip tests are
performed giving multiple *.chp files with gene expression values.
11
Clustering: why?
When conducting microarray experiments there are
multiple microarrays involved typically:
Studying a process over time, e.g., to measure the
response to a drug or food.
Looking for differences between states , e.g., normal
cells versus cancer cells.
A typical goal is Finding Gene Networks, i.e., groups of
genes that change expression inter-dependently across
samples. Having a significantly large number of
microarrays, we want to reverse engineer the regulatory
network that controls gene expressions. We need
computer clustering on the microarray data to select a
small (ideally) number of co-expressed genes of a gene
network. Separate experiments using gene knockout on
the selected genes can then be performed to confirm the
discovered regulatory network biologically.
12
Clustering: how?
There is no good widely accepted definition of clustering.
The traditional graph-theoretical definition is combinatorial
in nature and computationally infeasible. Heuristics rule!
Many clustering techniques and methods are known, e.g.,:
Hierarchical clustering/partitioning
K-means (centroids)
Self-organizing maps (partitioning vectors)
Force-directed placement
Principal Components Analysis (PCA)
Spectral clustering/partitioning using Fiedler vectors
Some good and popular free open source software, e.g.,
METIS and CLUTO (Karypis Lab).
We focus on PCA and spectral clustering.
Spectral clustering
A 4-degree-of-freedom system has 4
modes of vibration and 4 natural
frequencies: partition into 2 clusters
using the second eigenvector
13
Images Courtesy: Russell, Ketteriung U.
• A = adjacency matrix
Example Courtesy: Blelloch
CMU
1
2
• D = degree matrix
• Laplacian matrix L = D – A
5
3
www.cs.cas.cz/fiedler80/
Lx2 .83x2
4
3
0
L1
1
1
0
1
0
0
1
1
0
2
1
0
1
0
1
3
1
1
. 26
1
. 81
0 x 2 . 44
. 26
1
3
. 13
Rows sum to zero
•Fiedler eigenvectors Lx=λx
•N-cut eigenvectors Lx=λDx
(smallest) are the largest for
•PCA Markov walks Ax=µDx
with µ=1-λ. D-1A is rawstochastic and describes
the walk probabilities.
14
Connection to image segmentation
Image pixels serve as graph vertices. Weighted graph edges
are computed by comparing pixel colours.
Here is an example displaying 4 Fiedler vectors of an image:
We generate a sparse Laplacian, by comparing neighbouring
pixels here when computing the weights for the edges. Genes
correspond to vertices in microarrays, but we have to compare
all genes, possibly getting a Laplacian with a large fill-in.
15
Eigensolvers for spectral clustering
Our BLOPEX-LOBPCG software has proved to be
efficient for large-scale eigenproblems for Laplacians
from PDE's and for image segmentation using
multiscale preconditioning of hypre
The LOBPCG for massively parallel computers is
available in our Block Locally Optimal Preconditioned
Eigenvalue Xolvers (BLOPEX) package
BLOPEX is built-in in http://www.llnl.gov/CASC/hypre/
and is included as an external package in PETSc, see
http://www-unix.mcs.anl.gov/petsc/
On BlueGene/L 1024 CPU we can compute the
Fiedler vector of a 24 megapixel image in seconds
(including the hypre algebraic multigrid setup).
16
Work in progress/future work
Our current test cases are to analyze:
Affymetrix GeneChip data from Marina Kniazeva
et al. PLoS Biology, 2004.
Microarray data from Liang Zhang et al.,
Molecular Cell, 2007.
Our future work will involve developing prototype
spectral clustering software for Microarrays in the
Bioinformatics toolbox in MATLAB, writing a
Microarray analysis driver for our BLOPEX library,
and testing on large-scale publicly available
Microarray data.