Poster. - Stanford University

ICA-based Clustering of Genes from Microarray Expression Data
Su-In Lee¹, Serafim Batzoglou²
[email protected], [email protected]
¹Department of Electrical Engineering, ²Department of Computer Science, Stanford University
1. ABSTRACT
We propose an unsupervised methodology for clustering genes from DNA
microarray data using independent component analysis (ICA). Based on an
ICA mixture model of genomic expression patterns, linear and nonlinear
ICA find components that are specific to certain biological processes.
Genes that exhibit significant up-regulation or down-regulation within
each component are grouped into clusters. We test the statistical
significance of enrichment of gene annotations within each cluster.
ICA-based clustering outperformed other leading methods in constructing
functionally coherent clusters on various datasets. This result supports
our model of genomic expression data as a composite effect of
independent biological processes. A comparison of clustering performance
among various ICA algorithms, including a kernel-based nonlinear ICA
algorithm, shows that nonlinear ICA performed best for small datasets
and that natural-gradient maximum-likelihood estimation worked well for
all the datasets.
2. GENE EXPRESSION MODEL
The expression pattern of genes in a given condition is a composite
effect of the independent biological processes that are active in that
condition. For example, suppose that there are 9 genes and 3 biological
processes taking place inside a cell.
[Figure: a genome of nine genes (Gene 1 to Gene 9) and the messenger RNA
produced under three biological processes: Ribosome Biosynthesis, Cell
Cycle Regulation, and Oxidative Phosphorylation. A final panel shows all
three processes active together in an experimental condition.]
Each biological process becomes active by turning on the genes
associated with that process.
The observed genomic expression pattern can be seen as a combined effect
of the genomic expression programs of the biological processes that are
active in that condition.
3. Microarray Data
Microarray data display the expression levels of a set of genes measured
in various experimental conditions.
[Figure: data matrix with columns G1, G2, ..., GN-1, GN (genes) and rows
Exp 1, Exp 2, Exp 3, ..., Exp i, ..., Exp M (experiments). Each row is
the expression pattern of the genes under an experimental condition
Exp i; each column is the expression levels of a gene Gi across
experimental conditions.]
Examples of experiments:
Conditions: heat shock, G phase in cell cycle, etc.
Samples: liver cancer patient, normal person, etc.
4. Mathematical Modeling
The expression measurements of K genes observed in three conditions,
denoted by x1, x2 and x3, can be expressed as linear combinations of the
genomic expression programs of three biological processes, denoted by
s1, s2 and s3.
[Figure: an unknown mixing system maps the genomic expression programs of
biological processes (Ribosome Biogenesis, Oxidative Phosphorylation,
Cell Cycle Regulation) to the genomic expression patterns in certain
experimental conditions (Heat Shock, Starvation, Hyper-Osmotic Shock).]
$$x = As, \qquad
\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} =
\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} s_1 \\ \vdots \\ s_n \end{pmatrix}$$
Given a microarray dataset, can we recover the genomic expression
programs of biological processes? In other words, can we decompose a
matrix X into A and S so that each row of S represents a genomic
expression program of a biological process?
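The mixing model can be made concrete with a small synthetic example. The sketch below builds a hypothetical matrix S of three expression programs over nine genes, mixes them with an arbitrary matrix A, and forms X = AS. The sizes, the random distributions, and all variable names are illustrative assumptions, not data from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

n_processes, n_genes, n_conditions = 3, 9, 3

# S: one row per biological process, one column per gene (hypothetical values).
S = rng.laplace(size=(n_processes, n_genes))

# A: the unknown mixing matrix; a_ij is the activity of process j in condition i.
A = rng.uniform(0.0, 1.0, size=(n_conditions, n_processes))

# X: observed expression, one row per condition, one column per gene.
X = A @ S

# Row i of X is the linear combination sum_j a_ij * s_j of the programs.
x1_by_hand = sum(A[0, j] * S[j] for j in range(n_processes))
```

In practice only X is observed; recovering A and S from X alone is the decomposition problem posed above.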
6. ICA-based Clustering
Step 1: Apply ICA to the microarray data X to obtain Y.
Step 2: Cluster genes based on the independent components, the rows of Y.
Based on our gene expression model, the independent components y1, ...,
yn are assumed to be expression programs of biological processes. For
each yi, genes are ordered by their activity levels on yi, and the C%
(C = 7.5) of genes showing significantly high or low levels are grouped
into clusters.
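Step 2 can be sketched in a few lines. The version below assumes one reading of the rule: each component yields two clusters, the C% of genes with the highest loads and the C% with the lowest loads. The function name, the rounding of the cluster size, and the gene identifiers are hypothetical.

```python
def clusters_from_component(loads, gene_ids, c_percent=7.5):
    """Return (high, low): the C% of genes with the highest and lowest loads."""
    # Gene indices sorted by increasing load on this component.
    order = sorted(range(len(loads)), key=lambda g: loads[g])
    # Cluster size: C% of all genes, at least one gene (rounding is an assumption).
    size = max(1, round(len(loads) * c_percent / 100.0))
    low = [gene_ids[g] for g in order[:size]]
    high = [gene_ids[g] for g in order[-size:]]
    return high, low


# Toy component over 40 hypothetical genes whose loads increase with the index.
gene_ids = [f"G{i}" for i in range(40)]
loads = [float(i) for i in range(40)]
high, low = clusters_from_component(loads, gene_ids)
```

Applying the function to every row of Y gives two candidate clusters per independent component.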
[Figure panel: we can measure the expression levels of genes (Gene 1 to
Gene 9) using a microarray.]
5. ICA Algorithm
Using the log-likelihood maximization approach, we can find the W that
maximizes the log-likelihood L(y, W). The yi's are assumed to be
statistically independent.
$$y = Wx$$
$$p(x) = |\det(W)|\, p(y), \qquad p(y) = \prod_{i=1}^{n} p_i(y_i)$$
$$L(y, W) = \log p(x) = \log|\det(W)| + \sum_{i=1}^{n} \log p_i(y_i)$$
Prior information on y: super-Gaussian or sub-Gaussian?
$$W \leftarrow W + \Delta W$$
$$\Delta W \propto \frac{\partial L(y, W)}{\partial W} = (W^T)^{-1} - \varphi(y)\, x^T$$
$$\varphi(y) = -\left[ \frac{\partial p_1(y_1)/\partial y_1}{p_1(y_1)}, \ldots, \frac{\partial p_n(y_n)/\partial y_n}{p_n(y_n)} \right]^T$$
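The update rule of this section can be tried on synthetic data. The sketch below is a minimal batch gradient-ascent version, assuming a super-Gaussian prior p_i(y_i) proportional to 1/cosh(y_i), for which the score function is tanh(y). The prior, the learning rate, the iteration count, the identity initialization, and the two-source test signal are all illustrative assumptions rather than the settings used in this work.

```python
import numpy as np


def avg_log_likelihood(W, X):
    """Average of log|det W| + sum_i log p_i(y_i) over samples, up to a constant."""
    Y = W @ X
    return np.log(abs(np.linalg.det(W))) - np.mean(np.sum(np.log(np.cosh(Y)), axis=0))


def ica_gradient_ascent(X, n_iter=3000, lr=0.05):
    """Maximize the log-likelihood over W by plain gradient ascent.

    X holds one observation x per column. The score tanh(y) corresponds to
    the assumed prior p_i(y_i) proportional to 1 / cosh(y_i).
    """
    n, n_samples = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)
        # Sample average of (W^T)^{-1} - phi(y) x^T.
        grad = np.linalg.inv(W.T) - (phi @ X.T) / n_samples
        W = W + lr * grad
    return W


# Two hypothetical super-Gaussian (Laplace) sources mixed by a known matrix.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = A @ S

W = ica_gradient_ascent(X)
# If separation succeeded, W @ A is close to a scaled permutation matrix.
P = W @ A
```

Each row of P should be dominated by a single entry, reflecting that ICA recovers sources only up to scaling and ordering.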
7. Measuring significance of ICA-based clusters
The statistical significance of the biological coherence of clusters was
measured using gene annotation databases such as Gene Ontology (GO).
[Figure: clusters from ICA (Cluster 1, Cluster 2, Cluster 3, ...,
Cluster i, ..., Cluster n) are compared against GO categories (GO 1,
GO 2, ..., GO i, ..., GO m).]
9. Results
For each method, the minimum p-values ($< 10^{-7}$) corresponding to each
GO functional class were collected and compared.
For every combination of one of our clusters and a GO category, we
calculated the p-value, the chance probability that the two gene sets
share at least the observed number of genes, based on the hypergeometric
distribution.
[Figure: Cluster i and GO j overlap in k genes.]
$$p_{i,j} = 1 - \sum_{m=0}^{k-1} \frac{\binom{f}{m}\binom{g-f}{n-m}}{\binom{g}{n}}$$
g: number of genes in all clusters and GO categories
f: number of genes in GO j
n: number of genes in Cluster i
k: number of genes shared by GO j and Cluster i
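The hypergeometric p-value defined by g, f, n and k can be computed directly from binomial coefficients. The sketch below uses only the Python standard library; the function name and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(g, f, n, k):
    """Probability of sharing at least k genes by chance.

    g: genes in all clusters and GO categories, f: genes in the GO category,
    n: genes in the cluster, k: genes shared by the two.
    """
    # Probability of sharing fewer than k genes: terms m = 0 .. k-1.
    below_k = sum(comb(f, m) * comb(g - f, n - m) for m in range(k))
    return 1.0 - below_k / comb(g, n)


# Toy case: 10 genes in total, a GO category of 4, a cluster of 3, 2 shared.
p = enrichment_p_value(g=10, f=4, n=3, k=2)
```

Because the result is formed as one minus a sum, floating-point cancellation limits its accuracy for extremely small p-values; summing the terms m = k, ..., min(f, n) directly would avoid that.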
8. Microarray Datasets
For testing, five microarray datasets were used and for each dataset, the
clustering performance of our approach was compared with another
approach applied to the same dataset.
ID   Description                        Genes   Exps   Compared with
D1   Yeast during cell cycle            5679    22     PCA
D2   Yeast during cell cycle            6616    17     k-means clustering
D3   Yeast under stressful conditions   6152    173    Bayesian approach, Plaid model
D4   C. elegans in various conditions   17817   553    Topomap approach
D5   19 kinds of normal human tissue    7070    59     PCA