Transcript Slides

Predicting Genetic Regulatory
Response Using Classification
Us v. Them
(“Them” being Manuel Middendorf, Anshul Kundaje,
Chris Wiggins, Yoav Freund, and Christina Leslie, in
“Predicting Genetic Regulatory Response Using
Classification” (2004)
The Problem
• Current studies of gene transcription tend
to be descriptive
• Need for a predictive system – the ability
to predict gene regulation for new
experiments
• Rather than determining patterns in sets of
genes and conditions, look at underyling
causes of those patterns
The Important Parts of Genes &
Experiments
• Regulation is determined by binding sites
(motifs) and regulators (parents)
• The significance of experiments, then, is
how they affect regulators
• The significance of genes is what motifs
they contain
Binding Sites and Regulation
• Discretize gene response into only up-regulated
(1), down-regulated(-1), or unchanged (0)
• A motif is either present (1) or absent (0)
• A parent is either up-regulated (1), downregulated (-1), or unchanged (0)
• Assume (and we need to check with someone
who actually knows something about biology on
this) that things only happen if motif is present
and parent is either up- or down-regulated
What our matrix really looks like
•
•
•
•
•
•
g = # of genes
e = # of experiments
p = # of parents
m = # of motifs
Then we have g*e response values
For each response, we have p*m parent/motif
combinations
• For each parent/motif combination, there are three
possibilities – present and up-regulated, present and
down-regulated, or all those other possibilities where
nothing happens
• Represent these possibilities as a pair of binary
variables, one for up and one for down
M1, P1
+
M1, P1
-
M1, P2
+
M1, P2
-
…
M1, Pp
+
…
Mm, Pp
-
G1, E1
M1, P1
+
M1, P1
-
M1, P2
+
M1, P2
-
…
M1, Pp
+
…
Mm, Pp
-
G1, E2
M1, P1
+
M1, P1
-
M1, P2
+
M1, P2
-
…
M1, Pp
+
…
Mm, Pp
-
…
M1, P1
+
M1, P1
-
M1, P2
+
M1, P2
-
…
M1, Pp
+
…
Mm, Pp
-
G1, Ee
M1, P1
+
M1, P1
-
M1, P2
+
M1, P2
-
…
M1, Pp
+
…
Mm, Pp
-
…
M1, P1
+
M1, P1
-
M1, P2
+
M1, P2
-
…
M1, Pp
+
…
Mm, Pp
-
Gg, Ee
Some Numbers
• In the paper, the initial dataset had 6110
genes and 173 experiments
• 354 motifs are considered
• 475 regulators are considered
• Set of genes to consider is reduced to only
1411 genes of interest
Some More Numbers
• Only train on genes that are up- or downregulated
• Approx. 8% of gene/experiment pairs from
the overall sample appear to be, so,
assuming this holds true in the reduced
sample, we have 19,632 gene /
experiment pairs to train on
• For each of these values we have
2*354*475 = 336,300 predictor variables
Some Problems
• 19,632 by 336,300 is an awfully large
matrix to want to do any calculations with
• We have far more variables than
observations
Possible Solutions
• Random Projection:
– Pro: we can reduce our dimensionality
– Con: it seems like a somewhat silly approach
– Con: there’ll still be a lot of calculations just to
make the projection
Possible Solutions
• Variable Selection or PCA
– Pro: we could reduce our dimensionality in a
more informed way
– Con: computationally painful
Possible Solutions
• Random Forest
• From Breiman and Cutler:
• Grow a number of classification trees, and take the vote
of the classifications of all trees
• For each tree, if we have n cases, sample, with
replacement, n cases, using some number of randomly
chosen variables much smaller than the full number of
variables
– Pro: allows for flexibility in reduction of dimensions being
considered
– Pro: dimension is reduced without computation problems
– Con: ?
Their Solution
• Alternating decision
tree:
• A tree with
alternating levels of
prediction nodes
and splitter nodes
• Combines a set of
weak prediction
rules (the splitter
nodes) to make one
strong rule
Alternating Decision Trees
v.
Random Forests
• Both the same rough concept – take a vote from
many weak rules to get a strong rule
• In ADTs, rules are based on single variables,
and may be conditioned on values of other
variables
• In random forests, rules can be based on
multiple variables, and are only marginal over all
values of other variables
• ADTs are fairly straightforward to read and
interpret
Alternating Decision Trees
• Pro: computationally kind
• Pro: works better than Random Forests
(Creamer & Freund, 2004)
• Con: we didn’t come up with it
Other Ideas?