Identifying co-regulation
using Probabilistic Relational
Models
by Christoforos Anagnostopoulos
BA Mathematics, Cambridge University
MSc Informatics, Edinburgh University
supervised by Dirk Husmeier
General Problematic
Bringing together disparate data sources:
Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...
Gene expression data
gene 1: overexpressed
gene 2: overexpressed
...
mRNA
Protein interaction data
protein 1   protein 2   ORF 1     ORF 2
---------------------------------------------
AAC1        TIM10       YMR056C   YHR005CA
AAD6        YNL201C     YFL056C   YNL201C
Proteins
Our data
Promoter sequence data
...ACGTTAAGCCAT...
...GGCATGAATCCC...
Gene expression data
gene 1: overexpressed
gene 2: overexpressed
...
mRNA
Bayesian Modelling
Framework
Conditional Independence Assumptions
Factorisation of the Joint Probability Distribution
Bayesian Networks
UNIFIED TRAINING
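A minimal sketch of what the factorisation means in practice: in a Bayesian network the joint distribution is the product of each variable's distribution given its parents. The chain A -> B -> C and all the numbers below are hypothetical and are not taken from the slides.

```python
from itertools import product

# Hypothetical conditional probability tables for a chain A -> B -> C
# of binary variables (illustrative numbers only).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    """P(A, B, C) = P(A) * P(B | A) * P(C | B)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The factorised joint is a proper distribution: it sums to one.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
```

Three small tables (2 + 4 + 4 entries) stand in for the full 8-entry joint table; the conditional independence assumptions are what make this compression possible.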
Bayesian Modelling
Framework
Probabilistic Relational Models
Bayesian Networks
Aims for this presentation:
1. Briefly present the Segal model
and the main criticisms offered in
the thesis
2. Briefly introduce PRMs
3. Outline directions for future work
The Segal Model
Cluster genes into transcriptional
modules...
Module 1
Module 2
?
gene
The Segal Model
Module 1
Module 2
P(M = 1)
P(M = 2)
gene
The Segal Model
How to determine P(M = 1)?
Predicted
Expression
Levels
Array 1: overexpressed
Array 2: overexpressed
Array 3: underexpressed
...
Module 1
Motif Profile
motif 3: active
motif 4: very active
motif 16: very active
motif 29: slightly active
P(M = 1)
gene
The Segal model
PROMOTER SEQUENCE
MOTIF MODEL
MOTIF PRESENCE
The Segal model
MOTIF PRESENCE
REGULATION MODEL
MODULE ASSIGNMENT
The Segal model
MODULE ASSIGNMENT
EXPRESSION MODEL
EXPRESSION DATA
Learning via hard EM
HIDDEN
Learning via hard EM
Initialise hidden variables
Set parameters to Maximum Likelihood
Set hidden values to their most probable value given the parameters (hard EM)
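The loop above can be sketched on a toy problem. This is not the Segal model: it is hard EM for a 1-D mixture of unit-variance Gaussians, where the hidden variable is a cluster label and the only parameters are the cluster means. The function name `hard_em`, the data, and the initial labels are all illustrative assumptions.

```python
def hard_em(points, init_labels, k, max_iter=100):
    """Hard EM for a 1-D mixture of unit-variance Gaussians (toy stand-in)."""
    labels = list(init_labels)          # initialise hidden variables
    means = [0.0] * k
    for _ in range(max_iter):
        # Set parameters to maximum likelihood given the current hard labels.
        for c in range(k):
            members = [x for x, z in zip(points, labels) if z == c]
            if members:                 # an empty cluster keeps its old mean
                means[c] = sum(members) / len(members)
        # Set each hidden value to its most probable value given the parameters.
        new_labels = [min(range(k), key=lambda c: (x - means[c]) ** 2)
                      for x in points]
        if new_labels == labels:        # converged: labels stopped changing
            break
        labels = new_labels
    return means, labels

points = [0.0, 0.4, 0.2, 5.0, 5.4, 5.2]
means, labels = hard_em(points, init_labels=[0, 1, 0, 1, 0, 1], k=2)
```

Unlike soft EM, each hidden variable is committed to a single value at every iteration, so the procedure can settle in a poor local optimum when the initialisation is unlucky.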
Motif Model
OBJECTIVE:
Learn motif so as to discriminate between genes for which the Regulation variable is “on” (r=1) and genes for which it is “off” (r=0).
Motif Model – scoring scheme
high score: ...CATTCC...
low score: ...TGACAA...
promoter sequence scoring: ...AGTCCATTCCGCCTCAAG...
high scoring subsequences
low scoring (background) subsequences
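A minimal sketch of such a scoring scheme, assuming a position-specific scoring matrix (PSSM) of log-odds against a uniform background. The helper names `make_pssm` and `window_scores` and the match probability 0.85 are illustrative assumptions; only the sequences come from the slide.

```python
import math

BASES = "ACGT"

def make_pssm(motif, match=0.85):
    """Toy PSSM: log-odds of a near-deterministic motif vs a uniform background."""
    mismatch = (1.0 - match) / 3.0
    return [{b: math.log((match if b == m else mismatch) / 0.25) for b in BASES}
            for m in motif]

def window_scores(seq, pssm):
    """Log-odds score of every window of seq that has the motif's length."""
    w = len(pssm)
    return [sum(pssm[j][seq[i + j]] for j in range(w))
            for i in range(len(seq) - w + 1)]

promoter = "AGTCCATTCCGCCTCAAG"
pssm = make_pssm("CATTCC")
scores = window_scores(promoter, pssm)
best = max(range(len(scores)), key=scores.__getitem__)  # start of the top window
```

The window starting at `best` is the high scoring subsequence; every other window plays the role of background.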
Motif Model
SCORING SCHEME
P ( g.r = true | g.S, w )
w: parameter set, can be taken to represent motifs
Maximum Likelihood setting of w: most discriminatory motif
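One plausible way to turn window scores into P ( g.r = true | g.S, w ) is a logistic link applied to a soft maximum over positions. The exact functional form is not recoverable from the slide, so the form below, the bias `w0`, and the example scores are all assumptions.

```python
import math

def regulation_probability(window_scores, w0):
    """sigmoid(w0 + log sum_i exp(score_i)): an assumed link, not the source's."""
    m = max(window_scores)
    # Numerically stable log-sum-exp over all window positions.
    log_sum = m + math.log(sum(math.exp(s - m) for s in window_scores))
    return 1.0 / (1.0 + math.exp(-(w0 + log_sum)))

# Hypothetical window scores: one promoter with a strong site, one without.
p_with_site = regulation_probability([-9.7, 7.3, -6.4], w0=-3.0)
p_background = regulation_probability([-9.7, -8.1, -6.4], w0=-3.0)
```

Under this form a single high scoring window is enough to push the probability close to one, which is what makes the maximum-likelihood motif the most discriminatory one.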
Motif Model – overfitting
TRUE
PSSM
typical motif:
...TTT.CATTCC...
high score
INFERRED
PSSM
Can triple the score!
Regulation Model
For each module m and each motif i, we estimate the association u_mi
P ( g.M = m | g.R ) is proportional to
Regulation Model:
Geometrical Interpretation
The (u_mi)_i define separating hyperplanes
Classification criterion is the inner product:
Each datapoint is given the label of the hyperplane it is
the furthest away from, on its positive side.
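A sketch of this classification rule, assuming the module posterior is a softmax over the inner products between each module's weight vector and the motif indicator vector. The softmax form is an assumption consistent with "proportional to" and with the inner-product criterion; the weights and the gene below are hypothetical.

```python
import math

def module_posterior(u, r):
    """P(M = m | R = r) under an assumed softmax over inner products u[m] . r."""
    activations = [sum(u_mi * r_i for u_mi, r_i in zip(u_m, r)) for u_m in u]
    top = max(activations)                       # subtract max for stability
    exps = [math.exp(a - top) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical weights: 2 modules x 3 motifs.
u = [[2.0, 0.0, -1.0],
     [-1.0, 1.5, 0.0]]
r = [1, 0, 1]                                    # motifs 1 and 3 are present
probs = module_posterior(u, r)
label = max(range(len(probs)), key=probs.__getitem__)
```

The gene receives the label of the module with the largest inner product, and the softmax turns the gaps between inner products into probabilities.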
Regulation Model:
Divergence and Overfitting
pairwise linear separability
overconfident classification
Method A:
dampen the parameters (e.g. Gaussian prior)
Method B:
make the dataset linearly inseparable by augmentation
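The divergence and Method A can be illustrated on a 1-D logistic regression, used here as a simplified stand-in for the regulation model. On separable data the maximum-likelihood weight keeps growing with more training, while a Gaussian prior pins it to a finite value. The data, step size, and prior precision are illustrative assumptions.

```python
import math

def fit_weight(xs, ys, steps, prior_precision=0.0, lr=0.2):
    """Gradient ascent on the (penalised) log-likelihood of 1-D logistic regression."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (label - prediction) * input.
        grad = sum((y - 1.0 / (1.0 + math.exp(-w * x))) * x
                   for x, y in zip(xs, ys))
        grad -= prior_precision * w      # gradient of the Gaussian log-prior
        w += lr * grad
    return w

xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]    # linearly separable
w_ml_short = fit_weight(xs, ys, steps=100)
w_ml_long = fit_weight(xs, ys, steps=10000)       # still growing
w_map_short = fit_weight(xs, ys, steps=100, prior_precision=1.0)
w_map_long = fit_weight(xs, ys, steps=10000, prior_precision=1.0)  # unchanged
```

Without the prior the weight has no finite optimum, so the classifier becomes arbitrarily confident; with the prior, longer training no longer changes the answer.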
Erroneous interpretation of
the parameters
Segal et al. claim that:
When u_mi = 0, motif i is inactive in module m
When u_mi > 0 for all i, m, then only the presence of motifs is significant, not their absence
These claims contradict the normalisation conditions!
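One possible reading of this objection, assuming the module posterior is a softmax over inner products: adding the same constant to every module's weight for a given motif leaves all posteriors unchanged, so neither "u_mi = 0" nor "all u_mi > 0" has an absolute meaning. The weights and shifts below are hypothetical.

```python
import math

def softmax_posterior(u, r):
    """P(M = m | R = r) under an assumed softmax over inner products."""
    acts = [sum(a * b for a, b in zip(u_m, r)) for u_m in u]
    top = max(acts)
    exps = [math.exp(a - top) for a in acts]
    z = sum(exps)
    return [e / z for e in exps]

u = [[2.0, 0.0, -1.0],
     [-1.0, 1.5, 0.0]]
shift = [1.5, 0.5, 2.0]      # one constant per motif, shared by all modules
u_shifted = [[w + c for w, c in zip(u_m, shift)] for u_m in u]  # all weights > 0

r = [1, 0, 1]
before = softmax_posterior(u, r)
after = softmax_posterior(u_shifted, r)   # identical to `before`
```

Because normalisation cancels any per-motif offset, only differences between modules' weights for the same motif are identifiable.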
Sparsity
INFERRED PROCESS
TRUE PROCESS
Sparsity
Reconceptualise the problem:
Sparsity can be understood as pruning
Pruning can improve generalisation performance (deals with
overfitting both by damping and by decreasing the degrees of freedom)
Pruning ought not to be seen as a combinatorial problem,
but can be dealt with using appropriate prior distributions
Sparsity: the Laplacian
How to prune using a prior:
choose a prior with a simple discontinuity at the origin,
so that the penalty term does not vanish near the origin
every time a parameter crosses the origin, establish whether
it will escape the origin or is trapped in Brownian motion around it
if trapped, force both its gradient and value to 0 and freeze it
Can actively look for nearby zeros to accelerate pruning rate
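A minimal sketch of pruning with a Laplacian prior, using a separable quadratic loss as a stand-in for the model's likelihood. The "escape or trapped" test here is the standard subgradient condition at zero (the data gradient is weaker than the penalty), which is an assumption about how the test is carried out; all names and numbers are illustrative.

```python
def sign(x):
    return (x > 0) - (x < 0)

def prune_with_laplacian(targets, lam, lr=0.1, steps=500, w0=1.0):
    """Minimise 0.5 * sum (w_i - t_i)^2 + lam * sum |w_i|, freezing trapped weights."""
    w = [w0] * len(targets)
    frozen = [False] * len(targets)
    for _ in range(steps):
        for i, t in enumerate(targets):
            if frozen[i]:
                continue
            data_grad = w[i] - t
            new = w[i] - lr * (data_grad + lam * sign(w[i]))
            if w[i] != 0.0 and new * w[i] <= 0.0:
                # The weight crossed (or hit) the origin. At w_i = 0 the data
                # gradient is -t; the penalty wins, and the weight is trapped,
                # exactly when |t| <= lam. Otherwise it escapes to the far side.
                if abs(t) <= lam:
                    new, frozen[i] = 0.0, True
            w[i] = new
    return w, frozen

weights, frozen = prune_with_laplacian([2.0, 0.3, -1.5, -0.2], lam=0.5)
```

Weights whose data term is weaker than the penalty end up at exactly zero and are frozen; the others shrink towards zero by `lam` but survive, which is the damping effect mentioned earlier.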
Results: generalisation performance
Synthetic Dataset with 49 motifs, 20 modules and 1800 datapoints
Results: interpretability
DEFAULT MODEL: LEARNT WEIGHTS
TRUE MODULE STRUCTURE
LAPLACIAN PRIOR MODEL: LEARNT WEIGHTS
Regrets: BIOLOGICAL DATA
Aims for this presentation:
1. Briefly present the Segal model
and the main criticisms offered in
the thesis
2. Briefly introduce PRMs
3. Outline directions for future work
Probabilistic Relational
Models
How to model context-specific regulation?
Need to cluster the experiments...
Probabilistic Relational
Models
Variable A can vary with genes
but not with experiments
Probabilistic Relational
Models
We now have variability with experiments
but also with genes!
Probabilistic Relational
Models
Variability with experiments as required
but too many dependencies
Probabilistic Relational
Models
Variability with experiments as required
provided we constrain the parameters of
the probability distributions P(E|A) to be equal
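Constraining the distributions P(E|A) to be equal amounts to parameter tying: every gene-experiment pair shares one conditional probability table, and maximum-likelihood estimation pools the counts from all pairs. In the sketch below A is treated as a generic discrete parent of E; the function name and the observations are hypothetical.

```python
from collections import Counter

def fit_tied_cpd(observations):
    """ML estimate of one shared table P(E | A) from (gene, experiment, a, e) tuples."""
    pair_counts = Counter((a, e) for _, _, a, e in observations)
    parent_counts = Counter(a for _, _, a, _ in observations)
    return {(a, e): n / parent_counts[a] for (a, e), n in pair_counts.items()}

# Hypothetical data: a is the value of A, e the discretised expression level.
observations = [
    ("g1", "e1", 0, "over"), ("g2", "e1", 0, "over"),
    ("g1", "e2", 0, "under"), ("g2", "e2", 0, "over"),
    ("g1", "e3", 1, "under"), ("g2", "e3", 1, "under"),
]
cpd = fit_tied_cpd(observations)
```

The gene and experiment identifiers are ignored when counting, which is exactly what tying means: the number of parameters no longer grows with the number of genes or experiments.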
Probabilistic Relational
Models
Resulting BN is essentially UNIQUE.
But derivation: VAGUE, COMPLICATED, UNSYSTEMATIC
Probabilistic Relational
Models
GENES
g.S1, g.S2, ...
g.R1, g.R2, ...
g.M
g.E1, g.E2, ...
this variable cannot be considered an attribute of a gene,
because it has attributes of its own that are gene-independent
Probabilistic Relational
Models
GENES
g.S1, g.S2, ...
g.R1, g.R2, ...
g.M
g.E1, g.E2, ...
EXPERIMENTS
e.Cycle_Phase
e.Dye_Type
An expression measurement is an attribute
of both a gene and an experiment.
MEASUREMENTS
m(e,g).Level
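The three classes of the relational schema can be sketched as plain data types. The field names mirror the attributes on the slide; the class layout, the example values ("G1", "Cy3", and so on), and the zero expression levels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Gene:
    name: str
    promoter: str                         # g.S1, g.S2, ... (promoter sequence)
    motif_presence: Tuple[bool, ...] = () # g.R1, g.R2, ...
    module: Optional[int] = None          # g.M, hidden until inferred

@dataclass(frozen=True)
class Experiment:
    name: str
    cycle_phase: str                      # e.Cycle_Phase
    dye_type: str                         # e.Dye_Type

@dataclass(frozen=True)
class Measurement:
    gene: Gene                            # a measurement belongs to a gene...
    experiment: Experiment                # ...and to an experiment
    level: float                          # m(e,g).Level

genes = [Gene("g1", "ACGTTAAGCCAT"), Gene("g2", "GGCATGAATCCC")]
experiments = [Experiment("e1", "G1", "Cy3"), Experiment("e2", "S", "Cy5")]
measurements = [Measurement(g, e, 0.0) for g in genes for e in experiments]
```

Making the measurement its own class, keyed by a gene and an experiment, is what lets expression depend on attributes of both without listing one expression attribute per experiment on every gene.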
Examples of PRMs - 1
Segal et al, “From Promoter Sequence to Gene Expression”
Examples of PRMs - 2
Segal et al, “Decomposing gene expression into cellular processes”
Probabilistic Relational Models
PRM = { BN1, BN2, BN3, ... }
given Dataset1, PRM = BN1
given Dataset2, PRM = BN2
Relational schema: higher level description of data
PRM: higher level description of BNs
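The idea that a PRM becomes a particular Bayesian network once a dataset is given can be sketched as "unrolling" a template over the objects in the data. The dependency structure used below (motif presence to module, module and cycle phase to expression level) is a hypothetical template chosen for illustration, not the one in the cited papers.

```python
def unroll(genes, experiments):
    """Ground a toy PRM template into BN edges for one concrete dataset."""
    edges = []
    for g in genes:
        edges.append((f"{g}.R", f"{g}.M"))                 # motif presence -> module
        for e in experiments:
            level = f"m({e},{g}).Level"
            edges.append((f"{g}.M", level))                # module -> expression
            edges.append((f"{e}.Cycle_Phase", level))      # experiment -> expression
    return edges

bn1 = unroll(["g1", "g2"], ["e1"])                 # given Dataset1
bn2 = unroll(["g1", "g2", "g3"], ["e1", "e2"])     # given Dataset2
```

The same template yields networks of different sizes for different datasets, while the parameters attached to each template edge are shared across all of its ground copies.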
Probabilistic Relational Models
Relational vs flat data structures:
• Natural generalisation – knowledge carries over
• Expandability
• Richer semantics – better interpretability
• No loss in coherence
Personal opinion (not tested yet):
• Not entirely natural as a generalisation
• Some loss in interpretability
• Some loss in coherence
Aims for this presentation:
1. Briefly present the Segal model
and the main criticisms offered in
the thesis
2. Briefly introduce PRMs
3. Outline directions for future
work
Future research
1. Improve the learning algorithm
‘soften’ it by exploiting sparsity
systematise dynamic addition / deletion
Future research
2. Model Selection Techniques
improve interpretability
learn the optimal number of modules in our model
Are such methods consistent?
Do they carry over just as well in PRMs?
Future research
3. Fine tune the Laplacian regulariser to
fit the skewing of the model
Future research
4. The choice of encoding the question
into a BN/PRM is only partly determined
by the domain
Are there any general ‘rules’ about how
to restrict the choice so as to promote
interpretability?
Future research
5. Explore methods to express
structural, nonquantifiable prior
beliefs about the biological domain
using Bayesian tools.
Summary:
1. Briefly presented the Segal model
and the main observations offered
in the thesis
2. Briefly introduced PRMs
3. Hinted towards directions for
future work
