Transcript ge04_bic2

Analysis of Gene Expression and
Gene Networks
Biclustering 2
In this lecture
• Two current biclustering methodologies
• Iterative Signature Algorithm (ISA)
– Simple
– Randomized
• SAMBA
– Combinatorial Roots
– Fast
• And maybe a little more
What makes a biclustering
algorithm?
• Define/score what a bicluster is
• Algorithm for finding one bicluster in
the data
• Algorithm for finding all (many)
biclusters in the data
• Important themes:
– Normalization
– Redundancies
Previously in GE:
• What is a bicluster:
– Cheng & Church
– CTWC
• How to search for a bicluster
– Cheng & Church
– CTWC
• Normalization
• Redundancies
The Iterative
Signature Algorithm
• Developed at Naama Barkai’s lab at WIS (J. Ihmels, S. Bergmann)
• Motivation:
– A bicluster is a “stable” set of
genes and conditions
– It is possible to refine an approximate set of genes by “stabilizing” it
Normalization: ISA
• Can we normalize for both gene and
condition dependent trends?
• In the ISA we do not try to do both at once.
• Given a gene expression matrix E over conditions U and genes V, form:
– EC : normalize each column to 0 mean, 1 std
– EG : normalize each gene to 0 mean, 1 std
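A minimal numpy sketch of the two normalizations (function name is mine; assumes rows are genes, columns are conditions, and nonzero variance throughout):

```python
import numpy as np

def isa_normalize(E):
    """Form the two standardized copies of a genes x conditions
    expression matrix used by ISA (a sketch, not the published code)."""
    # E_C: each column (condition) scaled to mean 0, std 1
    EC = (E - E.mean(axis=0)) / E.std(axis=0)
    # E_G: each row (gene) scaled to mean 0, std 1
    EG = (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)
    return EC, EG
```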
What is a bicluster: ISA
• Observe: assume all columns (conditions) are independent. What is the distribution of $\sum_{j \in U'} e^G_{ij}$ for a random condition set U' and a gene i?
• Mean = 0, std = sqrt(|U'|)
• The same holds for $\sum_{i \in V'} e^C_{iu}$ over a random gene set V' and a condition u.
• In a bicluster, we would like independence NOT to hold.
What is a bicluster: ISA
• Given a set of conditions U', define:
– ISA(U') = { v ∈ V s.t. $\sum_{j \in U'} e^G_{vj} > T_G \sigma_{U'}$ }
• Given a set of genes V', define:
– ISA(V') = { u ∈ U s.t. $\sum_{i \in V'} e^C_{iu} > T_C \sigma_{V'}$ }
• T_G, T_C – threshold parameters; σ_{U'}, σ_{V'} – the null standard deviations (sqrt(|U'|), sqrt(|V'|))
• A (perfect) bicluster is a pair (U', V') s.t.
ISA(V') = U'
ISA(U') = V'
Searching for biclusters: ISA
• ISA defines a directed graph on the set of condition subsets and gene subsets.
• A perfect bicluster is a cycle of two nodes (U', V').
• An approximate bicluster is a longer cycle, but not too long.
• The algorithm: start from a random or known gene set and iterate ISA until it converges to an approximate bicluster (a sketch follows):
– V_i = ISA(U_{i-1}), U_i = ISA(V_i)
– Converge at step i when for all j > i−m: |U_i ∩ U_j| / |U_i ∪ U_j| > 1−ε
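The iteration can be sketched as follows, assuming the normalized matrices from before. Thresholds T_G, T_C, the window m = 1, and the Jaccard-style convergence test are read off the slides, so treat this as illustrative rather than the published ISA:

```python
import numpy as np

def isa_genes(EG, conds, TG):
    """ISA(U'): genes whose summed gene-normalized expression over the
    condition set exceeds TG null standard deviations (sqrt(|U'|))."""
    scores = EG[:, conds].sum(axis=1)
    return np.where(scores > TG * np.sqrt(len(conds)))[0]

def isa_conds(EC, genes, TC):
    """ISA(V'): conditions scored analogously on the condition-normalized matrix."""
    scores = EC[genes, :].sum(axis=0)
    return np.where(scores > TC * np.sqrt(len(genes)))[0]

def isa(EC, EG, seed_genes, TG=2.0, TC=2.0, max_iter=100, eps=0.05):
    """Iterate V_i = ISA(U_{i-1}), U_i = ISA(V_i) from a seed gene set
    until the gene set stabilizes (compares consecutive sets only, m=1)."""
    genes = np.asarray(seed_genes)
    conds = np.array([], dtype=int)
    prev = set()
    for _ in range(max_iter):
        conds = isa_conds(EC, genes, TC)
        genes = isa_genes(EG, conds, TG)
        if len(genes) == 0 or len(conds) == 0:
            break  # iteration died out; try another seed or threshold
        cur = set(genes.tolist())
        if prev and len(prev & cur) / len(prev | cur) > 1 - eps:
            break  # converged: consecutive gene sets nearly identical
        prev = cur
    return genes, conds
```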
Redundancies: ISA
• Starting from different seeds yields different fixed points (biclusters)
• Using different thresholds changes the graph structure and gives more fixed points
• Need to filter similar solutions and report a
short list of significant biclusters
ISA - applications
• Start from genes with a known functional annotation and refine them to a bicluster
• Start from genes with known transcription-factor binding sites
• Start from a set of sequence orthologs
• See: Ihmels et al., Nat. Genet. 2002; Bergmann et al., Phys. Rev. E 2003; Bergmann et al., PLoS Biol. 2004.
ISA – Pros/Cons
• Pros
– Simple, quite fast
– Elegant solution to the normalization problem
– Good empirical results in several cases
• Cons
– Threshold setting
– Finding good seeds
– Redundancies
– Non-normal behaviors
• Assignment 3 will give you more insights
SAMBA
• Developed here
• Motivation:
– Harvest efficient combinatorial techniques
for biclustering large datasets.
– Couple a statistical model to the biclusters
– Allow integration of heterogeneous data
The SAMBA model
[Figure: the expression matrix as a bipartite graph G = (U, V, E): conditions on one side, genes on the other; an edge marks a responding gene/condition pair, no edge a non-responding one.]
• Goal: find high-similarity submatrices = dense subgraphs of G.
The SAMBA approach
• Normalization: translate GE matrix to a
weighted bipartite graph using a statistical
model for the data
• Bicluster model: Heavy subgraphs
• How to find biclusters: Combined hashing and
local optimization
• Redundancies: Find many biclusters at once,
filter them in post process
From a statistical model to edge
weights – a simple example
• Background model: Independent edges, each
present with prob. p.
• H – subgraph of n genes, m conds, k edges
• P-value = tail of the binomial distribution:
$$p(H) = \sum_{k' \ge k} \binom{nm}{k'} p^{k'} (1-p)^{nm-k'} \le 2^{nm} p^k (1-p)^{nm-k}$$
• Weight the graph (logs base 2):
– edges: (−1 − log p)
– non-edges: (−1 − log(1−p))
then subgraph weight ≈ −log p-value.
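A small sketch of this weighting (logs base 2, matching the −1 terms above; function names are mine):

```python
import math

def uniform_model_weights(p):
    """Edge/non-edge weights under the uniform background model, chosen
    so that a subgraph's total weight is -log2 of its p-value bound."""
    w_edge = -1 - math.log2(p)          # positive when p < 1/2: edges are rewarded
    w_nonedge = -1 - math.log2(1 - p)   # negative: missing edges are penalized
    return w_edge, w_nonedge

def subgraph_weight(n, m, k, p):
    """Weight of a subgraph with n genes, m conditions, and k edges."""
    w_e, w_ne = uniform_model_weights(p)
    return k * w_e + (n * m - k) * w_ne
```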
Limitations of the uniform
probability model
• Not all dense subgraphs are statistically
significant.
• Different genes/conds have typical noise
characteristics.
• Noisy genes/conds have high probability of
forming dense subgraphs.
• An extended likelihood-ratio model:
$$L(B) = \frac{\Pr(\text{data} \mid \text{bicluster random subgraph model})}{\Pr(\text{data} \mid \text{background random graph model})}$$
• The log-likelihood ratio translates to a sum of weights over edges and non-edges.
A Degree Based Random Graph
Model
[Figure: bipartite graph with low-, medium-, and high-probability edges.]
• An edge (u,v) occurs independently with probability p(u,v).
• p(u,v) depends on the degrees of both u and v:
p(u,v) = Pr((u,v) ∈ E' over all graphs G' = (U, V, E') with deg(w, E') = deg(w, E) for all w ∈ U ∪ V)
• Approximated using a hypergeometric calculation (a rough stand-in is sketched below).
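The slides do not spell out the hypergeometric computation; as a loudly approximate stand-in, the expected-degree heuristic below captures the key behavior (hub pairs get a high background probability):

```python
def edge_prob_approx(deg_u, deg_v, num_edges):
    """Stand-in for SAMBA's degree-based p(u,v). The real model conditions
    on the full degree sequence (a hypergeometric calculation); here we use
    the simpler expected-degree approximation p(u,v) ~ deg(u)*deg(v)/|E|,
    capped at 1, purely for illustration."""
    return min(1.0, deg_u * deg_v / num_edges)
```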
Model Likelihood Ratio
• Model assumption: bicluster edges occur independently with probability p_c
• Likelihood-ratio score:
$$L(B) = \prod_{(u,v) \in E'} \frac{p_c}{p(u,v)} \cdot \prod_{(u,v) \notin E'} \frac{1 - p_c}{1 - p(u,v)}$$
$$\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p(u,v)} + \sum_{(u,v) \notin E'} \log \frac{1 - p_c}{1 - p(u,v)}$$
• Subgraph weight = log-likelihood ratio
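As a sketch, the score of a candidate bicluster B then sums per-pair log-ratios. The names `edges`/`non_edges`, the mapping `p`, and the default p_c are my notation (the slides leave p_c as a parameter):

```python
import math

def log_likelihood_ratio(edges, non_edges, p, pc=0.9):
    """log L(B): sum of log(pc / p(u,v)) over pairs with an edge inside B,
    plus log((1-pc) / (1-p(u,v))) over pairs without one. `p` maps a
    (u, v) pair to its background probability p(u,v)."""
    score = sum(math.log(pc / p[uv]) for uv in edges)
    score += sum(math.log((1 - pc) / (1 - p[uv])) for uv in non_edges)
    return score
```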
Heaviest bipartite subgraph
• NP-complete (Dawande et al. ’97, Hochbaum ’98)
• (Recall: maximum node biclique is polynomial!)
• Assumption: the degree on the V side is bounded by d.
• Start by finding heavy bicliques.
• Algorithm: use hashing to discover heavy subsets of conditions. Takes O(n·2^d) time and space.
Finding Heaviest Biclique
[Figure: worked example of hashing condition subsets, with accumulated weights, to read off the heaviest biclique.]
Using bicliques to find the
heaviest biclusters
Assume edge weight = +1, non-edge weight = −1.
Note that:
$$w((U', V')) = \sum_{u \in U'} w((\{u\}, V'))$$
Lemma: If B = (U', V') is maximal and X ⊆ U', then there exists v ∈ V' s.t. |N(v) ∩ X| ≥ |X| / 2.
Pf: by maximality, removing X cannot increase the weight, so
$$0 \le w((X, V')) = \sum_{v \in V'} \left( |N(v) \cap X| - |X \setminus N(v)| \right) = \sum_{v \in V'} \left( 2|N(v) \cap X| - |X| \right),$$
hence some summand is nonnegative.
Corollary: If B = (U', V') is maximal, then |U'| ≤ 2d.
Using bicliques to find the
heaviest biclusters
The set of conditions of a maximal bicluster is the union of up to log(2d) gene neighborhoods: apply the lemma repeatedly; each chosen N(v) covers at least half of the still-uncovered part of U'.
[Figure: U' covered by the neighborhoods of genes u'', u''', …]
• Exhaustive O((n·2^d)^{log(2d)})-time algorithm:
– hash bicliques;
– enumerate all combinations of log(2d) neighborhood sets N(v).
• Can be generalized to handle arbitrary edge/non-edge weights.
SAMBA’s implementation
• Phase I: find heavy bicliques: for each gene of degree < d, hash all subsets of its neighbors of size 4-6 (a sketch follows).
• Phase II: greedily expand the heaviest bicliques containing each gene/condition.
• Phase III: filter overlapping biclusters.
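A sketch of the phase I hashing under these slides' parameters (the function name, the plain-dict hash table, and the default d are mine; weights and phases II-III are omitted):

```python
from itertools import combinations
from collections import defaultdict

def hash_bicliques(neighbors, d=20, kmin=4, kmax=6):
    """Phase I sketch: for every gene of degree < d, hash each size-4..6
    subset of its condition neighborhood. All genes sharing a hashed
    subset S form a biclique (S, genes); the heaviest entries seed the
    greedy expansion of phase II."""
    table = defaultdict(list)  # condition subset -> genes containing it
    for gene, nbrs in neighbors.items():
        if len(nbrs) >= d:
            continue  # skip high-degree (noisy) genes, as in the slides
        for k in range(kmin, min(kmax, len(nbrs)) + 1):
            for subset in combinations(sorted(nbrs), k):
                table[subset].append(gene)
    return table
```

Hash-table keys shared by many genes correspond to heavy bicliques.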
Heterogeneous information
sources
• Transcription level: mRNA profiling; ChIP-chip
• Protein level: two-hybrid; protein-complex identification using mass spectrometry
• Phenotype level: barcoded deletion libraries; synthetic lethality (“1 + 1 = 0”)
• and so many more…
From experiments to properties
[Figure: discretizing heterogeneous experiments into gene properties, e.g. strong/medium complex binding to protein P; strong/medium induction and repression; strong/medium binding to TF T; high/medium sensitivity; high/medium-confidence interaction.]
A Heterogeneous Collection of
Yeast Genomic Information
• Gene expression: ~1000 conditions, 27
publications
• TF binding profiles: 110 profiles from growth
on YPD (Lee et al.)
• Phenotype profiles: 6 (30) profiles
(Giaever et al.)
• Two hybrid interactions: ~1000
(Uetz et al.)
• Protein Complex interaction: ~4000
(Ho et al.)
• MIPS interactions: ~1000
A SAMBA module
[Figure: a SAMBA module as a genes × properties submatrix (e.g. CPA1, CPA2), annotated with GO terms and a log-likelihood score.]
Statistical Model Provides
High Specificity
[Plot: module log p-values on lymphoma data (Alizadeh et al.) vs. shuffled data.]
Global View of modular organization in yeast
Inferring functional
annotations
• Use SAMBA results to annotate uncharacterized yeast genes
• Perform “guilt by association” (a sketch follows the figure below)
• The same procedure applies to properties (which reflect poorly characterized conditions)
[Figure: a module enriched for mating genes; when over X% of its annotated genes are mating genes, uncharacterized members are labeled putative mating genes.]
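A minimal sketch of the guilt-by-association step (the min_frac parameter stands in for the unspecified “over X%” threshold; names are mine):

```python
from collections import Counter

def annotate_by_association(module_genes, annotations, min_frac=0.5):
    """If more than min_frac of a module's annotated genes share a label,
    propose that label for the module's uncharacterized members."""
    labels = [annotations[g] for g in module_genes if g in annotations]
    if not labels:
        return {}
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) <= min_frac:
        return {}
    return {g: label for g in module_genes if g not in annotations}
```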
Predictions are highly specific
5 mating predictions were tested experimentally
4 mutants failed to mate
SAMBA as a universal language for
functional genomics databases
[Diagram: a user poses a SAMBA query against gene expression, TF location, proteomics, phenotype, … databases and receives the updated relevant modules.]
SAMBA – Pros/Cons
• Pros
– Fast
– Allows simultaneous normalization of genes and conditions
– Allows integration of heterogeneous data
– Well suited for query-based usage
• Cons
– Discretization
Two words on: Probabilistic Models for Biclustering
• Bicluster model: each subcolumn has a typical normal distribution, different from the background
• Model the entire matrix: tile the matrix with biclusters
• Model score: likelihood-based
• Avoid overfitting by standard techniques
Two words on: Probabilistic Models for Biclustering
• How to find the biclusters: start from a clustering and refine it with an EM algorithm (a sketch follows):
– Given a clustering, calculate the model parameters (distributions per bicluster)
– Given the distributions, reassign the biclusters
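A toy hard-assignment version of that EM loop, assuming one normal distribution per cluster over all entries of its member rows; this illustrates the alternation, not any specific published model:

```python
import numpy as np

def em_refine(E, assign, n_iter=20):
    """Alternate between fitting a normal distribution per cluster (M-step)
    and reassigning each row to the best-fitting cluster (E-step).
    `assign` holds one initial cluster label per row of E."""
    for _ in range(n_iter):
        labels = np.unique(assign)  # labels come from assign, so none is empty
        # M-step: per-cluster mean/std over all entries of member rows
        mu = {c: E[assign == c].mean() for c in labels}
        sd = {c: E[assign == c].std() + 1e-9 for c in labels}

        def loglik(row, c):
            # Gaussian log-likelihood of a row under cluster c (up to a constant)
            return -0.5 * (((row - mu[c]) / sd[c]) ** 2).sum() - len(row) * np.log(sd[c])

        # E-step: hard reassignment of every row
        assign = np.array([max(labels, key=lambda c: loglik(E[i], c))
                           for i in range(E.shape[0])])
    return assign
```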
Biclustering - Summary
• A general data mining problem
• The key point: defining what is a
bicluster
• Algorithms vary depending on the
nature of bicluster model
• The future problem: searching for biclusters in really huge matrices