Gene Expression Analysis
Using Bayesian Networks
Éric Paquet
LBIT
Université de Montréal
1
Biological basis
[Diagram: the flow of genetic information]
DNA: storage of genetic information
RNA polymerase: copies DNA into RNA
mRNA: storage and transport of genetic information
Ribosome: translates the genetic information into proteins
Proteins: expression of genetic information*
*-PDB file 1L3A, Transcriptional Regulator Pbf-2
2
Biological basis
How do proteins get regulated?
The E. coli lactose operon (lac operon) example:
Normally, E. coli uses glucose to get energy, but how does it react if there is no more glucose, only lactose?
3
Biological basis
[Diagram: E. coli environment with glucose and lactose; lactose decomposer (β-galactosidase); lactose importer (permease); RNA polymerase; gene LacI and its associated protein]
Polymerase action is blocked because the LacI-associated protein acts as a DNA lock.
4
Biological basis
[Diagram: E. coli environment with glucose and lactose; lactose decomposer (β-galactosidase); lactose importer (permease); RNA polymerase; lactose bound to the LacI-associated protein]
Lactose recruits the gene LacI-associated protein, unlocking the DNA, which is then accessible to the polymerase.
5
Biological basis
[Diagram: regulatory view of the lactose decomposer (β-galactosidase) and the lactose importer (permease), with inhibition links. Legend: = inhibit]
6
Biological basis
[Diagram: E. coli environment with lactose but no glucose; cAMP and CAP; lactose decomposer (β-galactosidase); lactose importer (permease); RNA polymerase]
In the absence of glucose, a polymerase magnet (the CAP-cAMP complex) binds to the DNA to accelerate production of the proteins that help decompose lactose.
7
Biological basis
[Diagram: the regulatory links controlling the lactose decomposer (β-galactosidase) and the lactose importer (permease). Legend: = activate, = inhibit]
Research goal: infer these links
8
Why?
Gain insights into cellular processes
Help understand diseases
Find drug targets
9
How?
Using gene expression data and tools for
learning Bayesian networks
[Diagram: gene expression data ([mRNA] levels across experiments)* + tools for learning Bayesian networks → inferred links, e.g. between the lactose decomposer (β-galactosidase) and the lactose importer (permease)]
*-Spellman et al. (1998) Mol Biol Cell 9:3273-97
10
What is gene expression data?
Data showing the concentration of a specific mRNA at a given time in the cell's life.
[Diagram: expression matrix, rows = [mRNA], columns = experiments*]
Every column is the result of one image. Each real value comes from one spot and tells whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value.
*-Spellman et al.(1998) Mol Biol Cell 9:3273-97
11
What are Bayesian networks?
A graphical representation of a joint distribution over a set of random variables.
[Example DAG: nodes A, B, C, D, E with edges A→C, A→D, B→D, D→E]
P(A,B,C,D,E) = P(A) · P(B) · P(C|A) · P(D|A,B) · P(E|D)
Nodes represent gene expression levels, while edges encode the interactions (e.g. inhibition, activation).
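To make the factorization concrete, here is a minimal Python sketch (not from the slides; the conditional probability tables are made up for illustration) that evaluates the joint for binary variables from the local tables:

from itertools import product

# Hypothetical CPTs for binary variables A, B, C, D, E.
P_A = {0: 0.7, 1: 0.3}
P_B = {0: 0.6, 1: 0.4}
P_C_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_D_given_AB = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}
P_E_given_D = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

def joint(a, b, c, d, e):
    # P(A,B,C,D,E) decomposed into the local terms given by the DAG.
    return (P_A[a] * P_B[b] * P_C_given_A[a][c]
            * P_D_given_AB[(a, b)][d] * P_E_given_D[d][e])

# Sanity check: summing over all 2^5 assignments gives 1.
print(round(sum(joint(*x) for x in product([0, 1], repeat=5)), 10))  # 1.0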
12
A small problem with Bayesian networks
A Bayesian network must be a DAG (Directed Acyclic Graph), but there are many examples of regulatory networks with directed cycles.
[Examples*: transcription factor dimer, hysteretic oscillator, switch]
*-Husmeier D.,Bioinformatics,Vol. 19 no. 17 2003, pages 2271–2282
13
How can we deal with that?
Using DBN (Dynamic Bayesian Networks*) and sequential gene
expression data
[Diagram: two genes A and B, unfolded over time slices t and t+1 into nodes A1, B1 (time t) and A2, B2 (time t+1)]
We unfold the network in time.
A DBN is a BN with constraints on parent and child nodes (edges go from one time slice to the next).
*-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks
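As a small illustration (a sketch, not the authors' code), unfolding turns every regulatory edge X → Y, even one that is part of a cycle, into an edge from X at time t to Y at time t+1, so the unfolded graph is always acyclic:

# A possibly cyclic regulatory graph, e.g. A -> B and B -> A.
regulatory_edges = [("A", "B"), ("B", "A")]

def unfold(edges):
    # Each edge X -> Y becomes X_t -> Y_{t+1}; the result contains no directed cycle.
    return [(src + "_t", dst + "_t+1") for src, dst in edges]

print(unfold(regulatory_edges))
# [('A_t', 'B_t+1'), ('B_t', 'A_t+1')]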
14
What are we searching for?
A Bayesian network that is most probable given the data D (gene
expression)
We find this BN as follows:
BN* = argmax_BN { P(BN | D) }
where:
P(BN | D) = P(D | BN) · P(BN) / P(D)
with P(D | BN) the marginal likelihood, P(BN) the prior on the network structure, and P(D) the data probability.
Naïve approach to the problem: try all possible DAGs and keep the best one!
15
It is impossible to try all possible DAGs because the number of DAGs increases super-exponentially with the number of nodes:
n = 3 → 25 DAGs
n = 4 → 543 DAGs
n = 5 → 29,281 DAGs
n = 6 → 3,781,503 DAGs
n = 7 → 1,138,779,265 DAGs
n = 8 → 783,702,329,343 DAGs
…
We are interested in problems with around 60 nodes…
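These counts can be checked with Robinson's recurrence for the number of labelled DAGs; a short Python sketch (not part of the slides) reproduces them:

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    # Robinson's recurrence for the number of labelled DAGs on n nodes.
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(3, 9):
    print(n, num_dags(n))  # 25, 543, 29281, 3781503, 1138779265, 783702329343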
16
Learning Bayesian networks from data
Choosing a search-space method and a conditional-distribution representation
•Network-space search methods
•Greedy hill-climbing
•Beam search
•Stochastic hill-climbing
•Simulated annealing
•MCMC simulation
•Conditional-distribution representations
•Linear Gaussian
•Multinomial, binomial
[Example: a three-node network A, B, C with conditional distributions P(A) = ?, P(B) = ?, P(C|A,B) = ?]
The search methods basically add, remove and reverse edges (see the sketch below).
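A minimal sketch of the three local moves (an assumption about the mechanics, not the authors' code): generate every structure reachable by adding, removing or reversing one edge, and keep only the acyclic ones:

import itertools

def has_cycle(nodes, edges):
    # Kahn-style check: repeatedly strip nodes that have no incoming edge.
    edges, remaining = set(edges), set(nodes)
    while remaining:
        no_parent = {n for n in remaining if all(child != n for _, child in edges)}
        if not no_parent:
            return True  # every remaining node has a parent, so there is a cycle
        remaining -= no_parent
        edges = {(u, v) for u, v in edges if u in remaining and v in remaining}
    return False

def neighbors(dag_edges, nodes):
    # All structures reachable from dag_edges by one add / remove / reverse move.
    edges, candidates = set(dag_edges), []
    for u, v in itertools.permutations(nodes, 2):
        if (u, v) in edges:
            candidates.append(edges - {(u, v)})               # remove u -> v
            candidates.append((edges - {(u, v)}) | {(v, u)})  # reverse u -> v
        else:
            candidates.append(edges | {(u, v)})               # add u -> v
    return [c for c in candidates if not has_cycle(nodes, c)]

print(len(neighbors({("A", "C"), ("B", "C")}, ["A", "B", "C"])))  # 6 legal moves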
17
We use three discrete gene expression levels
Sort the values: -1.06, -0.12, 0.18, 0.21, 1.16, 1.19
Split the data into 3 equal buckets: {-1.06, -0.12} → 0, {0.18, 0.21} → 1, {1.16, 1.19} → 2
Discretized data: 0, 0, 1, 1, 2, 2
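A minimal sketch of this discretization (an assumption, not the authors' code): rank the values and split them into three equal-frequency buckets labelled 0, 1, 2:

import numpy as np

def discretize_terciles(values):
    # Rank-based split into 3 equal-size buckets: low = 0, medium = 1, high = 2.
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    levels = np.empty(len(values), dtype=int)
    levels[order] = np.arange(len(values)) * 3 // len(values)
    return levels

print(discretize_terciles([-1.06, -0.12, 0.18, 0.21, 1.16, 1.19]))  # [0 0 1 1 2 2]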
19
Back to:
P(BN | D) = P(D | BN) · P(BN) / P(D)
where P(D | BN) is the marginal likelihood, P(BN) the prior on the network structure, and P(D) the data probability.
20
Insight into each term
P(BN) → prior on the network
In our research, we always use a prior equal to 1.
We could incorporate knowledge through it, e.g. if we know an edge is present: P(BN) = 1 if the edge is in the BN, else P(BN) = 0.
Efforts are also made to reduce the search space by using knowledge, e.g. limiting the number of parents or children.
21
Insight into each term
P(D|BN) → marginal likelihood
Easy to calculate using a multinomial distribution with a Dirichlet prior*:
P(D | BN) = Π_{i=1..n} Π_{j=1..q_i} [ Γ(N_ij) / Γ(N_ij + M_ij) ] · Π_{k=1..r_i} [ Γ(a_ijk + s_ijk) / Γ(a_ijk) ]
where n is the number of nodes, q_i the number of parent configurations of node i, r_i the number of states of node i, a_ijk the Dirichlet pseudo-counts with N_ij = Σ_k a_ijk, and s_ijk the observed counts with M_ij = Σ_k s_ijk.
*-Heckerman, A Tutorial on Learning With Bayesian Networks; Neapolitan, Learning Bayesian Networks
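A minimal sketch of the per-node term of this formula (an assumption, not the authors' code; the counts below are hypothetical), working in log space with scipy's gammaln for numerical stability. The full log P(D | BN) is the sum of this term over all nodes i:

import numpy as np
from scipy.special import gammaln

def family_log_marginal(counts, pseudo):
    # counts[j][k] = s_ijk (observations of state k under parent configuration j)
    # pseudo[j][k] = a_ijk (Dirichlet hyperparameters)
    counts, pseudo = np.asarray(counts, float), np.asarray(pseudo, float)
    N_ij = pseudo.sum(axis=1)  # sum_k a_ijk
    M_ij = counts.sum(axis=1)  # sum_k s_ijk
    return float(np.sum(gammaln(N_ij) - gammaln(N_ij + M_ij))
                 + np.sum(gammaln(pseudo + counts) - gammaln(pseudo)))

# Example: 3 expression levels, 3 parent configurations, uniform pseudo-count 1.
print(family_log_marginal([[5, 2, 0], [1, 6, 1], [0, 2, 7]], np.ones((3, 3))))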
22
MCMC (Markov Chain Monte Carlo) simulation
Markov chain part:
Zoom in on one node of the chain
[Diagram: the current three-node network (A, B, C) and its neighbouring structures; each neighbour is proposed with probability P(BNnew) = 1/5, and one with probability 0]
23
MCMC (Markov Chain Monte Carlo) simulation
Monte Carlo part:
Choose next BN with probability P(BNnew)
Accept the new BN with the following Metropolis-Hastings acceptance criterion:
MH = min( 1, [ P(BNnew | D) / P(BNold | D) ] · P(BNnew) )
   = min( 1, [ P(D | BNnew) P(BNnew) / P(D) ] / [ P(D | BNold) P(BNold) / P(D) ] · P(BNnew) )
   = min( 1, [ P(D | BNnew) P(BNnew) ] / [ P(D | BNold) P(BNold) ] · P(BNnew) )   (P(D) is gone!)
24
Monte Carlo part example:
1. Choose a random path, each path having a P(BNnew) of 1/5.
2. Choose another random number. If it is smaller than the Metropolis-Hastings criterion, accept BNnew, else return to BNold.
[Diagram: as on the previous slide, the current network and its neighbours, each with P(BNnew) = 1/5 and one with 0]
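A minimal sketch of one such step (an assumption, not the authors' code): neighbors(bn) and log_score(bn) are hypothetical helpers, where log_score stands for log(P(D | BN)·P(BN)), and the proposal is treated as symmetric so only the posterior ratio appears:

import math
import random

def mh_step(bn_old, neighbors, log_score):
    # Propose a random neighbouring structure, e.g. each with P(BNnew) = 1/5.
    bn_new = random.choice(neighbors(bn_old))
    # Log of the ratio P(D|new)P(new) / P(D|old)P(old); P(D) has cancelled out.
    log_ratio = log_score(bn_new) - log_score(bn_old)
    # Accept with probability min(1, ratio).
    if random.random() < math.exp(min(0.0, log_ratio)):
        return bn_new
    return bn_old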
25
MCMC (Markov Chain Monte Carlo) simulation recap:
Choose a starting BN at random
Burn-in phase (generate 5*N BNs from the MCMC without storing them)
Storing phase (keep 100*N BN structures from the MCMC)
[Plot: log(P(D | BN)·P(BN)) vs. iteration, showing the burn-in phase followed by the storing phase]
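Putting the pieces together, a sketch of the whole simulation (an assumption, not the authors' code), reusing the mh_step sketch above with the 5*N burn-in and 100*N stored samples described on this slide:

def run_mcmc(bn_start, neighbors, log_score, n_nodes):
    bn = bn_start
    for _ in range(5 * n_nodes):    # burn-in phase: samples are discarded
        bn = mh_step(bn, neighbors, log_score)
    samples = []
    for _ in range(100 * n_nodes):  # storing phase: keep every visited structure
        bn = mh_step(bn, neighbors, log_score)
        samples.append(bn)
    return samples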
26
Why 100*N BNs and not only 1?
Because we don't have enough data and there are many high-scoring networks.
Instead, we associate a confidence with each edge, e.g.: how many times in the sample do we find an edge going from A to B?
We can then fix a confidence threshold and build a global network from the edges whose confidence is above the threshold.
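A minimal sketch of this edge-confidence computation (an assumption, not the authors' code), where each sampled network is represented as a set of (parent, child) edges:

from collections import Counter

def edge_confidence(samples):
    # Fraction of sampled structures that contain each edge.
    counts = Counter(edge for bn in samples for edge in bn)
    return {edge: n / len(samples) for edge, n in counts.items()}

def consensus_network(samples, threshold=0.5):
    # Keep the edges whose confidence is above the threshold.
    return {e for e, c in edge_confidence(samples).items() if c >= threshold}

samples = [{("A", "B"), ("B", "C")}, {("A", "B")}, {("A", "B"), ("A", "C")}]
print(edge_confidence(samples))         # ('A','B') has confidence 1.0; the others 1/3
print(consensus_network(samples, 0.5))  # {('A', 'B')}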
27
What we are working on:
Mixing sequential and non-sequential data to retrieve interesting relations between genes
How?
Using a DBN with MCMC for the sequential data and a BN with MCMC for the non-sequential data
[Flowchart: 100*N networks from the DBN + 100*N networks from the BN → information tuner → learned network]
28
How to test the approach:
Problem: there is no way to test it on real data because there is no completely known network.
Solution: work on realistic simulations where we know the network structure.
Example:
[Diagram: a known 14-node network (nodes 0-13)* from which data are simulated]
*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"
29
How to test the approach:
[Diagram: the known 14-node network* → simulate → sequential data into the DBN MCMC and non-sequential data into the BN MCMC → info tuner → learned network, compared with the true network using ROC curves]
*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"
30
Test description:
Generate 60 sequential data points
Generate 120 non-sequential data points (roughly the real-world proportion)
Run the DBN MCMC on the sequential data and keep 100*N sampled networks
Run the BN MCMC on the non-sequential data and keep 100*N sampled networks
Test the performance using weights on the samples:
0 BN, 1 DBN
0.05 BN, 0.95 DBN
…
0.95 BN, 0.05 DBN
1 BN, 0 DBN
The metric used is the area under the ROC curve: a perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
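One simple way to realize this weighting (a sketch and an assumption; the actual information tuner may differ) is a weighted edge confidence, reusing edge_confidence from the sketch above:

def mixed_confidence(bn_samples, dbn_samples, w_bn, w_dbn):
    # w_bn + w_dbn = 1, e.g. (0, 1), (0.05, 0.95), ..., (1, 0) as in the grid above.
    conf_bn = edge_confidence(bn_samples)
    conf_dbn = edge_confidence(dbn_samples)
    edges = set(conf_bn) | set(conf_dbn)
    return {e: w_bn * conf_bn.get(e, 0.0) + w_dbn * conf_dbn.get(e, 0.0)
            for e in edges}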
31
Results:
[Plot: area under the ROC curve (y-axis from 0 to 1) as a function of the BN/DBN weight mix, starting at (0 BN, 1 DBN)]
32
Perspective:
Working on more sophisticated ways to mix sequential and non-sequential data
Working on real cases:
Yeast cell cycle
Arabidopsis thaliana circadian rhythm
Real data also means missing values
Evaluate missing-value solutions (EM, KNNImpute)
33
Acknowledgements:
François Major
34
Why is there missing data?
Experimental problems
Low correlation
35
ROC Curve
Receiver Operating Characteristic curve
[Figure: an example ROC curve*]
*-http://gim.unmc.edu/dxtests/roc2.htm
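For edge predictions, the area under the ROC curve can be computed directly from the edge confidences and the true edge set; a minimal sketch (an assumption, not the authors' code) using the rank-statistic formulation:

def roc_auc(confidences, true_edges, all_possible_edges):
    # AUC = probability that a random true edge is ranked above a random false
    # edge (ties count 0.5); 1.0 = perfect, 0.5 = random, 0.0 = worst.
    pos = [confidences.get(e, 0.0) for e in all_possible_edges if e in true_edges]
    neg = [confidences.get(e, 0.0) for e in all_possible_edges if e not in true_edges]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))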
36
MCMC simulation and number of sampled networks
[Plot: ROC curve area as a function of the number of networks sampled from the MCMC simulation, for N = 12; x-axis: # of samples from MCMC (500 to 5000); y-axis: ROC area (roughly 0.86 to 0.895)]
37