Bayesian Networks for Genome Expression: A Bayesian

A Bayesian Statistical Approach to
Modeling Gene Regulatory Pathways in
Human Placental Data
Elinor Velasquez
Dept. of Biology
San Francisco State University
Outline of talk
• Introduction
• The experimental approach: Obtaining
placenta data
• The experimental approach: Modeling gene
regulatory networks
• Results from experiments
• Conclusions and future work
• Acknowledgements
Introduction
Overall goal
To use a bioinformatics model to
better understand the human placenta
http://www.biotechnologycenter.org/hio/assets/hisimages/placenta/placenta44.jpg
The human placenta
http://www.uchsc.edu/winnlab/index.html
The basal plate in the placenta
Site of known anatomical
abnormalities in preeclampsia
http://www.uchsc.edu/winnlab/projects.html
EGFR pathway
• EGFR, cell surface receptor for epidermal growth
factors
• Potentially important gene for the placenta
British Journal of Cancer (2006) 94, 184 – 188
EGFR regulates gene expression
EGFR
ANGPT2
CSPG2
DCN
Causal relationships
EGFR
ANGPT2
CSPG2
DCN
Example of a gene regulatory
network
[Figure: example network with nodes Gene 1 through Gene 6]
Definition of a Bayesian network
• There exist nodes (disks)
• There are edges (arrows) between some of the nodes
• Causality is implied by the edges
• The graph is acyclic: following the arrows never leads back to the starting node
[Figure: Bayesian network with nodes Gene 1 through Gene 6]
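As a minimal sketch of the definition above, a Bayesian network can be stored as a mapping from each node to its children, and the acyclicity requirement can be checked with a topological sort. The arrows in `edges` are hypothetical, since the arrows of the slide's figure are not recoverable from the transcript; `is_acyclic` and `parents` are illustrative helper names.

```python
from collections import deque

# Hypothetical arrows (parent -> children); not taken from the slide.
edges = {
    "Gene1": ["Gene2", "Gene3"],
    "Gene2": ["Gene4"],
    "Gene3": ["Gene4", "Gene5"],
    "Gene4": ["Gene6"],
    "Gene5": ["Gene6"],
    "Gene6": [],
}

def is_acyclic(graph):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {node: 0 for node in graph}
    for children in graph.values():
        for child in children:
            indegree[child] += 1
    queue = deque(node for node, deg in indegree.items() if deg == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in graph[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Every node is removed exactly once only when there is no cycle.
    return visited == len(graph)

def parents(graph, node):
    """List the parents of a node, i.e. Pa(node)."""
    return [p for p, children in graph.items() if node in children]
```

Here `parents(edges, "Gene4")` lists the nodes with an arrow into Gene4, which is the set the conditional probabilities are conditioned on.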
The experimental approach:
Obtaining placenta data
Data collected from microarrays
• Data come from 36 experiments conducted by Virginia Winn et al. at the SJ Fisher lab, UCSF
• Gene expression profiling experiments
[Figure: cRNA hybridization to the array]
45,000 dots (25-mer oligo probe sets) representing the human genome
Traditional
“spotted”
arrays
What is a probe set?
• Several oligonucleotides designed to hybridize
to various parts of the mRNA generated from
a single gene
[Figure: a probe set aligned to the mRNA of a single gene]
Affymetrix
GeneChips
Microarray data
Virginia Winn et al. centered the normalized log2 intensity values
on the median value of each probe set.
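The centering step can be sketched as follows. This is only an illustration of median centering for one probe set; the intensities are hypothetical, and the normalization that precedes it is not shown.

```python
from statistics import median

def center_to_median(log2_values):
    """Subtract the probe set's median from each of its log2 intensities."""
    m = median(log2_values)
    return [v - m for v in log2_values]

# Hypothetical log2 intensities for one probe set (not real data).
probe_set = [8.0, 9.5, 8.5, 10.0, 9.0]
centered = center_to_median(probe_set)
```

After centering, values above zero are above the probe set's median and values below zero are below it.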
Each probe set has 36 data points across 5 time segments:
Segment 1: x1 ... x6
Segment 2: y1 ... y9
Segment 3: z1 ... z6
Segment 4: w1 ... w6
Segment 5: s1 ... s9
Microarray data
• Red denotes up-regulated expression and green denotes
down-regulated expression relative to the median value
• Genes differentially expressed in the basal plate of
placentas: each row contains data from a single basal plate cRNA
sample and each column corresponds to a single probe set.
http://www.uchsc.edu/winnlab/index.html
Summary of data used in
bioinformatics experiments
• 36 placentas
• 45,000 probe sets
• Time-series data from 14-16 weeks to term
[Figure: average gene expression value for gene egfr, plotted by week (14-16, 18-19, 21, 23-24, 37-40); vertical axis from 7.8 to 9.8]
The experimental approach:
Modeling gene regulatory
networks
Outline of bioinformatics experimental
design
[Figure: four probe-set nodes PS1, PS2, PS3, PS4]
Step 1. Create a naïve Bayesian network using the probe set data
Step 2. Score the naïve Bayesian network
Step 3. Randomly add/delete an edge and rescore the Bayesian
network
Step 4. Continue until best score reached
Step 5. Combine probe sets to create the gene regulatory network
Four probe sets (Three genes)
Define naïve Bayesian network
• Choose a root node
• All other nodes branch
off of the root node
• PS1 is the parent node
[Figure: naïve Bayesian network with PS1 as the root and PS2, PS3, PS4 as its children]
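The naïve structure described on this slide can be written down directly as a list of (parent, child) edges. The function name `naive_network` is illustrative, not part of any package named in this talk.

```python
def naive_network(root, others):
    """Return (parent, child) edges with every non-root node a child of root."""
    return [(root, node) for node in others]

# PS1 is the root; all other probe sets branch off it.
naive_edges = naive_network("PS1", ["PS2", "PS3", "PS4"])
```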
Step 1: Create a naïve Bayesian
network using probe set data
[Figure: naïve Bayesian network over PS1, PS2, PS3, PS4]
• Use data from one time segment
• Choose Weeks 23-24 data (6 placentas)
• Choose 4 probe sets
Placenta data for Weeks 23-24
PS1: probe set 201984 (EGFR)
PS2: probe set 236034 (ANGPT2)
PS3: probe set 211148 (ANGPT2)
PS4: probe set 204620 (CSPG2)
Step 2: Score the naïve Bayesian
network
• We want to score this network:
[Figure: naïve Bayesian network over PS1, PS2, PS3, PS4]
The network score is a function of
conditional probabilities
• The conditional probability, Prob(N | Pa(N)), is the probability of child node N given its parent Pa(N)
• Example: Given that the parent node PS1 has an expression value of 10, what is the probability that its child node, PS4, has an expression value of 8?
[Figure: edge from parent PS1 to child PS4]
Conditional probability
• EGFR (PS1) is the parent node and has value 10.
• CSPG2 (PS4) is the child node and has value 8 in two of the six samples.
• Conditional probability = 2/6
[Figure: edge from parent PS1 to child PS4]
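The counting behind this example can be sketched as an empirical conditional probability over paired samples. The expression values below are hypothetical, chosen so that the parent equals 10 in all six samples and the child equals 8 twice; they are not the placenta data.

```python
from fractions import Fraction

def conditional_probability(parent_values, child_values, parent_value, child_value):
    """Estimate P(child = child_value | parent = parent_value) from paired samples."""
    pairs = zip(parent_values, child_values)
    children_given_parent = [c for p, c in pairs if p == parent_value]
    if not children_given_parent:
        raise ValueError("parent value never observed")
    return Fraction(children_given_parent.count(child_value), len(children_given_parent))

# Hypothetical rounded expression values for six placentas (not the real data).
ps1 = [10, 10, 10, 10, 10, 10]   # EGFR, parent
ps4 = [8, 9, 8, 7, 9, 10]        # CSPG2, child
p = conditional_probability(ps1, ps4, parent_value=10, child_value=8)
```

`Fraction` reduces 2/6 to 1/3, so `p` compares equal to both.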
Score for a Bayesian network
The score of the naive network equals
the product of all the nonzero
conditional probabilities associated
with the network:
$$P(N_1, N_2, N_3, N_4) = \prod_{i=1}^{4} P(N_i \mid \mathrm{pa}(N_i))$$
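The product rule above can be sketched in a few lines. The conditional probabilities in `cond` are hypothetical placeholders, not values from the placenta data, and `network_score` is an illustrative name.

```python
from fractions import Fraction
from math import prod

def network_score(conditional_probs):
    """Multiply the nonzero conditional probabilities P(N_i | pa(N_i))."""
    nonzero = (p for p in conditional_probs.values() if p != 0)
    return prod(nonzero, start=Fraction(1))

# Hypothetical values for a naive network rooted at PS1.
cond = {
    "PS1": Fraction(1, 6),        # root: no parent, so this is P(PS1)
    "PS2 | PS1": Fraction(1, 3),
    "PS3 | PS1": Fraction(1, 2),
    "PS4 | PS1": Fraction(2, 6),
}
score = network_score(cond)
```

Exact fractions are used so that small scores can be compared without floating-point rounding.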
Score for the naïve Bayesian
network
$$P(N_1, N_2, N_3, N_4) = \frac{1}{39366} \approx 2.54 \times 10^{-5}$$
[Figure: the naïve network over PS1, PS2, PS3, PS4]
Step 3: Randomly add/delete an edge
and rescore the Bayesian network
[Figure: network over PS1, PS2, PS3, PS4 after one edge change]
The score becomes $1/78732 \approx 1.27 \times 10^{-5}$.
Step 4. Continue until best score
reached
• The score is a probability, so a higher score indicates a better network.
• The naïve network scores higher than the modified network, so we pick it as our final network.
[Figure: the naïve network over PS1, PS2, PS3, PS4]
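A quick arithmetic check of this comparison, assuming the naïve score is 1/39366 (the fraction that matches the stated decimal 2.54 × 10^-5) and the modified score is 1/78732:

```python
from fractions import Fraction

naive_score = Fraction(1, 39366)      # assumed exact value of the naive score
modified_score = Fraction(1, 78732)   # score after the edge change

# Pick the network with the higher score.
best_name, best_score = max(
    [("naive", naive_score), ("modified", modified_score)],
    key=lambda item: item[1],
)
```

Under these values the naïve score is exactly twice the modified score.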
Step 5. Combine probe sets to create
the gene regulatory network
EGFR
ANGPT2
CSPG2
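One plausible way to combine probe sets into genes is to relabel each edge with the gene names and drop duplicates. The transcript does not state how the combination was actually done, so `collapse_to_genes` is an assumption made for illustration; the mapping itself is the one listed for the Weeks 23-24 example.

```python
# Probe set to gene mapping from the four-probe-set example.
probe_to_gene = {"PS1": "EGFR", "PS2": "ANGPT2", "PS3": "ANGPT2", "PS4": "CSPG2"}

def collapse_to_genes(probe_edges, mapping):
    """Map probe-set edges to gene edges, dropping duplicates and self-loops."""
    gene_edges = []
    for parent, child in probe_edges:
        edge = (mapping[parent], mapping[child])
        if edge[0] != edge[1] and edge not in gene_edges:
            gene_edges.append(edge)
    return gene_edges

# Edges of the naive network chosen as the final network.
final_probe_edges = [("PS1", "PS2"), ("PS1", "PS3"), ("PS1", "PS4")]
gene_network = collapse_to_genes(final_probe_edges, probe_to_gene)
```

Because PS2 and PS3 both map to ANGPT2, their two edges collapse into a single EGFR to ANGPT2 edge.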
40 probe sets (26 genes)
Gene regulatory pathway
for 26 genes
Step 1. Create a naïve Bayesian network using 40 probe sets for
each time segment
Step 2. Score the naïve Bayesian network
Step 3. Randomly add/delete an edge and rescore the Bayesian
network
Step 4. Continue until best score reached
Step 5. Combine probe sets to create the gene regulatory network
for the placenta
Step 1. Create a naïve Bayesian
network using 40 probe sets for
each time segment
Create a naïve Bayesian network
[Figure: naïve Bayesian network; the nodes shown are PS1 through PS9]
Step 2. Score the naïve Bayesian
network
Score for a Bayesian network
The score of the naïve network equals the
product of all the nonzero conditional
probabilities associated with the network:
$$P(N_1, \ldots, N_{40}) = \prod_{i=1}^{40} P(N_i \mid \mathrm{pa}(N_i))$$
Step 3. Randomly add/delete an edge
and rescore the Bayesian network
Step 4. Continue until best score
reached
With four probe sets, at least two Bayesian
networks were constructed:
[Figure: two networks, each over PS1, PS2, PS3, PS4]
Exhaustive search
• To be certain that we have the best scoring
network, we need to construct all possible
networks from our naïve networks
• With four probe sets, we constructed only one
network other than the naïve network
• How to construct all possible networks?
How do we construct all possible
networks?
• 1 probe set: 1 Bayesian network
• 2 probe sets: 2 possible Bayesian networks
• 3 probe sets: 12 possible Bayesian networks
• 4 probe sets: 144 possible Bayesian networks
• 5 probe sets: > 4800 possible Bayesian networks!
• 6 probe sets: … ??
• And so on…
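To illustrate how quickly the search space grows, the sketch below counts all labeled directed acyclic graphs on n nodes using Robinson's recurrence. This is a standard count that includes the empty graph and every edge orientation, so its values differ from the counts on the slide, which appear to follow a different, unstated counting convention.

```python
from math import comb

def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    counts = [1]  # a(0) = 1
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]

# Counts for 1 to 5 nodes.
dag_counts = [count_dags(n) for n in range(1, 6)]
```

Under this convention there are 1, 3, 25, 543 and 29281 networks for 1 to 5 nodes, which is why exhaustive search becomes impractical well before 40 probe sets.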
Welcome to “Modern Heuristics”
• Step 1. Representation of a model
• Step 2. The scoring function
• Step 3. Defining the search problem
• Step 4. Consider local optima
[Figure: score versus change, with a local optimum labeled]
Step 1: Representation of the model
• The model is a gene regulatory pathway.
• We are going to assume a Bayesian model for our probe sets:
[Figure: nodes PS1, PS2, PS3, PS4]
• The number of possible pathways is too large to allow an exhaustive search for the best Bayesian network.
Step 2: The scoring function
• The fair coin, p(X = heads) = ½
• What happens if the coin is unfairly weighted?
• We need to re-think probability:
$$p(X) = \int p(x)\, r(x)\, dx$$
• r(x) is a weight function.
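One way to read the weighted integral for the coin example is to treat x as the coin's unknown bias toward heads and r(x) as a density over that bias. The sketch below assumes a Beta density for r, which is a choice made here for illustration and is not stated on the slide, and approximates the integral with the midpoint rule.

```python
from math import gamma

def beta_density(theta, a, b):
    """Beta(a, b) density, used here as the weight function r."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)

def prob_heads(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of theta * r(theta) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += theta * beta_density(theta, a, b)
    return total * h

fair = prob_heads(5, 5)     # weight centred on 1/2
biased = prob_heads(8, 2)   # weight favouring heads
```

For a Beta(a, b) weight the exact value is a / (a + b), so the two results should be close to 0.5 and 0.8.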
Step 2. The scoring function
• The scoring function is a
probability
• Assume the network has a Dirichlet distribution, which serves as the
weight function applied to the conditional probabilities.
www.wikipedia.com
Step 2. The scoring function
The probability of a fixed network equals the product
of the conditional probabilities times the Dirichlet
distribution:
$$P(N) = \prod_{i=1}^{40} P(N_i \mid \mathrm{pa}(N_i))\, D(N_i)$$
such that
$$D(N_i) = \prod \theta_i^{\alpha_i - 1}(N_i)$$
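For reference, a minimal sketch of the Dirichlet density in log space. Unlike the expression for D above, this version includes the normalizing constant; the parameter names `theta` and `alpha` follow the usual convention and are assumptions about how the slide's symbols map to code.

```python
from math import exp, lgamma, log

def log_dirichlet_density(theta, alpha):
    """Log of the Dirichlet density at theta for concentration parameters alpha."""
    if abs(sum(theta) - 1.0) > 1e-9 or any(t <= 0 for t in theta):
        raise ValueError("theta must be positive and sum to 1")
    # Normalizing constant: Gamma(sum(alpha)) / prod(Gamma(alpha_i)).
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    # Kernel: prod(theta_i ** (alpha_i - 1)), as in the expression for D.
    log_kernel = sum((a - 1) * log(t) for t, a in zip(theta, alpha))
    return log_norm + log_kernel

# With all alpha_i = 1 the density is uniform over the simplex.
uniform = exp(log_dirichlet_density([0.2, 0.3, 0.5], [1, 1, 1]))
```

Working in log space avoids underflow when many small factors are multiplied, which matters for a product over 40 probe sets.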
Step 3: Defining the search
problem
What it means to search:
a. Construct a first network (Use a naïve
Bayesian network)
b. Score the first network using the scoring
function
c. Perform the Hill-climbing algorithm.
Step 3. Defining the search problem
The Hill-climbing Algorithm:
• Randomly choose a node
• “Search” in the neighborhood of that node for
the best scoring network
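The search can be sketched as a greedy loop that toggles one randomly chosen edge at a time and keeps the change only when the network stays acyclic and the score improves. This is a simplified variant, not the procedure of Weka or any other package named in this talk; `toy_score` and `target` are hypothetical stand-ins for the real scoring function.

```python
import random

def is_acyclic(nodes, edges):
    """Depth-first check that the directed edge set has no cycle."""
    children = {n: [c for p, c in edges if p == n] for n in nodes}
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(n):
        if state[n] == 1:
            return False          # reached a node already on the current path
        if state[n] == 2:
            return True
        state[n] = 1
        ok = all(visit(c) for c in children[n])
        state[n] = 2
        return ok

    return all(visit(n) for n in nodes)

def hill_climb(nodes, start_edges, score, iterations=200, seed=0):
    """Greedy search: toggle one random edge, keep it only if the score rises."""
    rng = random.Random(seed)
    best_edges = set(start_edges)
    best_score = score(best_edges)
    for _ in range(iterations):
        parent, child = rng.sample(nodes, 2)
        # Add the edge if absent, delete it if present.
        candidate = best_edges ^ {(parent, child)}
        if not is_acyclic(nodes, candidate):
            continue
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_edges, best_score = candidate, candidate_score
    return best_edges, best_score

# Hypothetical toy score: reward edges in a made-up target set, penalise the rest.
target = {("PS1", "PS2"), ("PS2", "PS3"), ("PS1", "PS4")}

def toy_score(edges):
    return len(edges & target) - len(edges - target)

nodes = ["PS1", "PS2", "PS3", "PS4"]
naive = {("PS1", "PS2"), ("PS1", "PS3"), ("PS1", "PS4")}
found_edges, found_score = hill_climb(nodes, naive, toy_score)
```

The toy score has a single peak, so this run cannot get stuck; with a real score the loop can stall at a local maximum, which is the concern that Step 4 addresses.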
Step 4. Consider local optima
• Hill-climbing is a traditional search technique
• It can get caught on local maxima
• Step 4 is to keep choosing random nodes.
[Figure: score versus change, with a local maximum labeled; the randomly chosen node is the origin]
From http://content.answers.com/
Software
• Weka software package written by members of the University
of Waikato, New Zealand,
http://www.cs.waikato.ac.nz/~ml/people.html
• DEAL, R package, written by Susanne G. Bøttcher, Claus
Dethlefsen, http://www.math.auc.dk/novo/deal
• BayesNet Toolbox, Matlab package, written by Kevin Murphy,
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
• ExpressionNet, written by Jingchun Zhu,
http://expressionnet.sourceforge.net/
Results from experiments
26 genes
COL5A1
COL3A1
COL5A2
DCN
COL1A2
CSPG2
SPP1
INHBA
ANGPT2
BAMBI
SFRP1
P4HA1
IGFBP1
RAP2B
PLAU
CCNG2
MRC2
GLB1
ATP5E
ADAM9
EGFR
ERG
USP6NL
PECAM1
IL2RB
CECAM1
CYP19A1
Ingenuity network
Results for 26 genes
• 40 probe sets (26 genes)
• Data comes from five different time intervals:
1. 14 – 16 gestational weeks
2. 18 – 19 gestational weeks
3. 21 gestational weeks
4. 23 – 24 gestational weeks
5. 37 – 40 gestational weeks
[Figure: network over the genes listed under "26 genes" above. Time segment: 14 – 16 weeks]
[Figure: network over the genes listed under "26 genes" above. Time segment: 18 – 19 weeks]
[Figure: network over the genes listed under "26 genes" above. Time segment: 21 weeks]
[Figure: network over the genes listed under "26 genes" above. Time segment: 23 – 24 weeks]
[Figure: network over the genes listed under "26 genes" above. Time segment: 37 – 40 weeks]
How to display data
• One of the most pressing questions in
bioinformatics research is how to display the
data effectively
• We have two solutions
1. An interaction map
2. Geometrical considerations
An interaction map for 26 genes
Geometrical considerations
• Will illustrate with the gene egfr
• egfr encodes the epidermal growth factor receptor
Functions on the cell surface
Activated by binding of its specific ligands
Responsible for many pathways in animal models
Gene egfr regulated by:
Genes on a dodecahedron: Gene
regulatory network for egfr
CSPG2
CCNG2
COL1A2
PLAU
INHBA
On backside:
PECAM1
ANGPT2
IGFBP1
MRC2
SPP1
USP6NL
DCN
Adapted from http://www.math.cornell.edu/~mec/2003-2004/geometry/platonic/dodecahedron.jpg
Conclusions
• We can predict gene regulatory networks
using Bayesian networks as an intermediate
step
• When we keep the arrows in the network, we can
show causal relationships between the genes
• Interaction maps and use of geometry are
novel ways to display gene behavior
Future Directions
• A three-dimensional viewer with numerical
values will be implemented for use with the
Weka software
• Use molecular genetics techniques to validate
a portion of the results
• Design a genetic programming algorithm
(evolutionary algorithm) to create a Bayesian
network
Acknowledgements
San Francisco State University:
Leticia Márquez-Magaña, Chris Smith, Frank Bayliss, Juan Castellon, Ernesto
Flores, Rebecca Garcia, Alba Gutierrez, Jainee Lewis, Rebecca Mendez, Cylyn
Cruz, Jasmin Reyes, Jackie Robinson, Peter Thorsen, My family
UC San Francisco:
Susan Fisher, Matthew Gormley
M.B.R.S.-R.I.S.E. Grant 5 - R25-GM59298