Transcript slides

Iclust: information based clustering
Noam Slonim
The Lewis-Sigler Institute for Integrative Genomics
Princeton University
Joint work with
Gurinder Atwal
Gasper Tkacik
Bill Bialek
Running example
Gene expression data
[Figure: a K genes × N conditions table of expression values (12, -1, 6, …), with some entries missing (??). Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition.]
Relations between genes?
Relations between experimental conditions?
Information as a correlation/similarity measure
Some nice features of the information measure:
Model independent
Responsive to any type of dependency
Captures more than just pairwise relations
Suitable for both continuous and discrete data
Independent of the measurement scale
Axiomatic
Mutual information - definition
We have some “uncertainty” about the state of gene-A; but now someone tells us the state of gene-B…
The resulting reduction in the uncertainty about the state of gene-A
is called the mutual information between these two variables:
I[p(a,b)] = H[p(a)] - H[p(a|b)] = \sum_{a \in A,\, b \in B} p(a,b) \log \frac{p(a,b)}{p(a)\, p(b)}
How much can we learn from the state of gene-B
about the state of gene-A (and vice versa).
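As a minimal sketch (not part of the original slides), the definition above can be computed directly from a joint distribution table; the function name and the toy distribution are illustrative:

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information I(A;B) in bits from a joint distribution table.

    p_ab[a, b] = p(a, b); entries must be non-negative and sum to 1.
    """
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)    # marginal p(a)
    p_b = p_ab.sum(axis=0, keepdims=True)    # marginal p(b)
    mask = p_ab > 0                          # 0 * log 0 = 0 by convention
    return float((p_ab[mask] * np.log2(p_ab[mask] / (p_a * p_b)[mask])).sum())

# Two perfectly correlated binary "genes": knowing B removes all
# uncertainty about A, so I(A;B) = H(A) = 1 bit.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # -> 1.0
```

For independent variables the joint factorizes, every log term vanishes, and the same function returns 0 bits.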
Model independence & responsiveness to “complicated” relations
[Figure: four scatter plots of gene-A vs. gene-B expression levels, with (MI ~ 1 bit, Corr. ~ 0.9), (MI ~ 2 bits, Corr. ~ 0.6), (MI ~ 1.3 bits, Corr. ~ 0), and (MI ~ 0 bits, Corr. ~ 0).]
Capturing more than just pairwise relations
[Figure: gene-A/gene-B and gene-A/gene-B/gene-C expression plotted against experiment index; the pairwise MI ~ 0 bits and Corr. ~ 0, yet the triplet-information ~ 1.0 bits.]
Using a model-dependent correlation measure might result in
missing significant dependencies in our data.
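This point is easy to reproduce numerically. A hedged sketch (the quadratic toy relation and binning choices are my own, not from the slides) comparing Pearson correlation with a simple binned MI estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)
y = x ** 2                       # deterministic but non-monotonic relation

# Pearson correlation misses the dependency entirely (by symmetry it is ~0)...
corr = np.corrcoef(x, y)[0, 1]

# ...while a simple binned mutual-information estimate does not.
counts, _, _ = np.histogram2d(x, y, bins=16)
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi = (p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum()

print(f"corr ~ {corr:.3f}, MI ~ {mi:.2f} bits")   # corr near 0, MI well above 0
```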
Mutual-information vs. Pearson-Correlation results
in bacterial gene-expression data
Mycobacterium tuberculosis
[Figure: mutual-information relations vs. Pearson-correlation relations across 81 experiments.]
Information relations between gene expression profiles
Given the expression of gene-A, how much information do we have about
the expression of gene-B ? (when averaging over all conditions)
(sample size: the number of conditions; 173 in the Gasch data)
Once we find these information relations,
we often want to apply cluster analysis.
Numerous clustering methods are available –
but typically they assume a particular model.
For example, K-means corresponds to the
modeling assumption that each cluster can be
described by a spherical Gaussian.
Back to square one…?
Iclust – information based clustering
What is a “good” cluster?
A simple proposal – given a cluster, we pick two items at random,
and we want them to be as “similar” to each other as possible.
Formally, we wish to maximize

⟨S⟩ = \sum_c q(c) \sum_{i_1, i_2} q(i_1|c)\, q(i_2|c)\, s(i_1, i_2)

Or …

⟨S⟩ = \sum_c q(c) \sum_{i_1, \dots, i_r} q(i_1|c)\, q(i_2|c) \cdots q(i_r|c)\, s(i_1, i_2, \dots, i_r)

Or …

⟨S⟩ = ⟨I⟩ = \sum_c q(c) \sum_{i_1, \dots, i_r} q(i_1|c)\, q(i_2|c) \cdots q(i_r|c)\, I(i_1, i_2, \dots, i_r)
Namely, we wish to maximize the average information relations in our clusters, or
to find clusters s.t. in each cluster all items are highly informative about each other.
Iclust – information based clustering (cont.)
A penalty term that we wish to minimize, as in rate-distortion theory :
I(C; i) = \sum_{c, i} p(i)\, q(c|i) \log \frac{q(c|i)}{q(c)}
With no compression, S(c) is maximized, but the penalty term is maximized as well.
With maximal compression, the penalty term is minimized, but S(c) is minimized as well.
The interesting cases are intermediate: a small penalty term together with high S(c).
Iclust – information based clustering (cont.)
The intuitive clustering problem can be turned into a
general mathematical optimization problem:

F[q(c|i)] = ⟨I⟩ - T · I(C; i)

where q(c|i) are the clustering parameters, ⟨I⟩ is the expected information relations among the data items, I(C; i) is the information between data items and clusters, and T is the tradeoff parameter.
Clustering is formulated as trading bits of similarity against
bits of descriptive power, without any further assumptions.
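Under the pairwise reading of the objective, the functional can be evaluated directly for any soft assignment. A sketch, not the authors' code; the function name, the uniform prior p(i) = 1/N, and the requirement q(c) > 0 are my assumptions:

```python
import numpy as np

def iclust_objective(q_ci, s, T):
    """Evaluate F[q(c|i)] = <s> - T * I(C;i) for pairwise similarities.

    q_ci[c, i] : soft assignments q(c|i); each column sums to 1
    s[i, j]    : pairwise similarity matrix (e.g. estimated MI between items)
    T          : tradeoff parameter
    Assumes a uniform prior p(i) = 1/N and nonzero cluster mass q(c) > 0.
    """
    n_c, n_i = q_ci.shape
    p_i = np.full(n_i, 1.0 / n_i)              # uniform p(i)
    q_c = q_ci @ p_i                           # q(c) = sum_i p(i) q(c|i)
    q_ic = (q_ci * p_i) / q_c[:, None]         # q(i|c) by Bayes' rule

    # <s> = sum_c q(c) sum_{i1,i2} q(i1|c) q(i2|c) s(i1, i2)
    s_avg = sum(q_c[c] * (q_ic[c] @ s @ q_ic[c]) for c in range(n_c))

    # I(C;i) = sum_{c,i} p(i) q(c|i) log2[q(c|i)/q(c)], with 0 log 0 = 0
    mask = q_ci > 0
    logterm = np.zeros_like(q_ci, dtype=float)
    logterm[mask] = np.log2((q_ci / q_c[:, None])[mask])
    info = (q_ci * p_i * logterm).sum()

    return s_avg - T * info
```

For four items forming two perfect similarity blocks and hard assignments, ⟨s⟩ = 1 and I(C; i) = 1 bit, so F = 1 - T.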
Relations with classical rate distortion
Iclust:

F[q(c|i)] = I(C; i) + (1/T) ⟨D⟩, with
⟨D⟩ = \sum_c q(c) \sum_{i_1, \dots, i_r} q(i_1|c) \cdots q(i_r|c)\, d(\Phi(i_1), \dots, \Phi(i_r))

Classical rate distortion:

F[q(c|i)] = I(C; i) + (1/T) ⟨\tilde{D}⟩, with
⟨\tilde{D}⟩ = \sum_c q(c) \sum_{i_1} q(i_1|c)\, d(\Phi(i_1), \tilde{\Phi}(c))

For the special case of pairwise relations:

⟨D⟩ = \sum_c q(c) \sum_{i_1} q(i_1|c) \sum_{i_2} q(i_2|c)\, d(\Phi(i_1), \Phi(i_2))
⟨\tilde{D}⟩ = \sum_c q(c) \sum_{i_1} q(i_1|c)\, d(\Phi(i_1), \sum_{i_2} q(i_2|c)\, \Phi(i_2))
The difference is whether the sum over i2 is before/after d is computed
If the distortion/similarity matrix is a kernel matrix
the formulations are equivalent
And yet – some important differences
Iclust is applicable when the raw data are given directly as pairwise relations
Iclust does not require a definition of a “prototype” (or “centroid”)
Iclust can handle more than just pairwise correlations
The two formulations induce different decoding schemes
A sender observes a pattern Φi,
but is allowed to send only the cluster index, c
In classical rate distortion, the receiver is assumed to decode by

\tilde{\Phi}(c) = \sum_i q(i|c)\, \Phi(i)

(deterministic decoding, with vocabulary size N_c).

In Iclust, the receiver is assumed to decode by

\tilde{\Phi} = \Phi(i), with i drawn from q(i|c)

(stochastic decoding, with vocabulary size N).
Iclust vs. classical rate-distortion decoding
[Figure: an original image with 220 gray levels, decoded with 2 clusters by Iclust (stochastic decoding) and with 2 clusters by classical RD (deterministic decoding).]
Iclust algorithm - freely available Web implementation
q(c|i) ∝ q(c) \exp\{ (1/T) [\, r · S(c; i) - (r - 1) · S(c) \,] \}

where S(c; i) is the average “similarity” of i to the members of c, and S(c) is the average “similarity” among the members of c.
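One way to turn this self-consistent update into an iteration, for the pairwise case r = 2. A sketch under assumed conventions (the function name, uniform p(i), and random soft initialization are my own), not the reference implementation:

```python
import numpy as np

def iclust_cluster(s, n_clusters, T, n_iter=100, seed=0):
    """Iterate the self-consistent update for the pairwise case (r = 2):

        q(c|i) ∝ q(c) exp{ (1/T) [ 2 S(c;i) - S(c) ] }

    s[i, j] is the pairwise similarity matrix; returns soft assignments
    q_ci[c, i] whose columns sum to 1.
    """
    rng = np.random.default_rng(seed)
    n = s.shape[0]
    q_ci = rng.random((n_clusters, n))
    q_ci /= q_ci.sum(axis=0)                   # normalize over clusters

    for _ in range(n_iter):
        q_c = q_ci.mean(axis=1)                # q(c) under uniform p(i)
        q_ic = q_ci / (n * q_c[:, None])       # q(i|c) by Bayes' rule
        s_ci = q_ic @ s                        # S(c;i): avg similarity of i to c
        s_c = np.einsum('ci,ij,cj->c', q_ic, s, q_ic)  # S(c): within-cluster avg
        log_q = np.log(q_c)[:, None] + (2.0 * s_ci - s_c[:, None]) / T
        log_q -= log_q.max(axis=0)             # numerical stability
        q_ci = np.exp(log_q)
        q_ci /= q_ci.sum(axis=0)
    return q_ci

# Two perfect similarity blocks are recovered as two clusters.
s = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
labels = iclust_cluster(s, n_clusters=2, T=0.1).argmax(axis=0)
```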
Responsive to any type of dependency among the data
Invariant to changes in the data representation
See www.princeton.edu/~nslonim
Allows clustering based on more than pairwise relations
For more details :
Slonim, Atwal, Tkacik, and Bialek (2005)
Information based clustering, PNAS, in press.
Iclust – clusters examples
Clusters of genes
C18: RPS10A, RPS10B, RPS11A, RPS11B, RPS12, … (proteins of the small ribosomal subunit)
C15: FRS1, KRS1, SES1, TYS1, VAS1, … (enzymes that attach amino acids to tRNA)
C4: PGM2, UGP1, TSL1, TPS1, TPS2, … (enzymes involved in the trehalose anabolism pathway)
Data: dynamics of stock prices. Clusters of stocks:
C17: Wal-Mart, Target, Home Depot, Best Buy, Staples, …
C12: Microsoft, Apple Comp., Dell, HP, Motorola, …
C2: NY Times, Tribune Co., Meredith Corp., Dow Jones & Co., Knight-Ridder Inc., …

Data: rating by viewers. Clusters of movies:
C12: Snow White, Cinderella, Dumbo, Pinocchio, Aladdin, …
C1: Psycho, Apocalypse Now, The Godfather, Taxi Driver, Pulp Fiction, …
C7: Star Wars, Return of the Jedi, The Terminator, Alien, Apollo 13, …
Given the price of stock-A, how much
information do we have about the price
of stock-B ? (when averaging over many days)
Given the rating of movie-A, how much
information do we have about the rating
of movie-B ? (when averaging over many viewers)
Coherence results: comparison to alternative algorithms
[Figure: coherence (0 to 100) of the clusters on the ESR gene-expression data, the S&P 500 stock data, and the EachMovie data, compared against K-means, K-medians, and hierarchical clustering.]
Quick Summary
Information as the core measure of data analysis with many appealing features
Iclust - a novel information-theoretic formulation of clustering, with some
intriguing relations to classical rate-distortion clustering.
Validations: finding coherent gene clusters based on information relations in
gene-expression data
… and finding coherent stock clusters, coherent movie clusters …
… and genotype-phenotype association in bacteria, based on phylogenetic data (Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press).
… and more?