Bayesian Networks
Using Bayesian Networks to
Analyze Expression Data
Shelly Bar-Nahor
Today
1. Introduction to Bayesian Networks.
2. Describe a method for recovering gene
interactions from microarray data, using
tools for learning Bayesian Networks.
3. Apply the method to the S. cerevisiae
cell-cycle measurements.
Bayesian Networks – Compact representation
of probability distributions via conditional
independence.
The representation consists of two components:
1. G – a directed acyclic graph (DAG)
Nodes – random variables (X1, …, Xn).
Edges – direct influence.
2. Θ – a set of conditional probability
distributions.
Together, these 2 components specify
a unique distribution on (X1, …,Xn ).
[Figure: example network over nodes A, B, C, with the CPT for C:]

P(C|A)   c0     c1
a0       0.95   0.05
a1       0.1    0.9
The graph G encodes the Markov assumption:
each variable Xi is independent of its non-descendants given its parents in G.
Applying the chain rule of probabilities:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa_G(Xi))
P(Xi | Pa_G(Xi)) is the conditional distribution
for each variable Xi. We denote the
parameters that specify these distributions
by Θ.
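As a concrete illustration, the factorized joint can be evaluated directly from the CPTs. A minimal sketch in Python, reusing the P(C|A) table above; the network shape (A → C, with B unconnected) and the priors for A and B are assumptions for illustration:

```python
# Joint probability of a full assignment via the chain-rule factorization:
# P(x1, ..., xn) = prod_i P(xi | Pa_G(xi)).

# CPT entries: parent assignment (tuple) -> distribution over the child's values.
cpts = {
    "A": {(): {"a0": 0.6, "a1": 0.4}},        # assumed prior for A
    "B": {(): {"b0": 0.7, "b1": 0.3}},        # assumed prior for B
    "C": {("a0",): {"c0": 0.95, "c1": 0.05},  # P(C|A) table from above
          ("a1",): {"c0": 0.1,  "c1": 0.9}},
}
parents = {"A": [], "B": [], "C": ["A"]}

def joint_probability(assignment):
    """P(assignment) = product over variables of P(xi | parents(xi))."""
    p = 1.0
    for var, table in cpts.items():
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p *= table[pa_vals][assignment[var]]
    return p

print(joint_probability({"A": "a1", "B": "b0", "C": "c1"}))  # 0.4 * 0.7 * 0.9 = 0.252
```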
The Learning Problem:
Let m be the number of samples and n the
number of variables. Given a training set
D = (X1, …, Xm), where Xi = (Xi1, …, Xin), find a
network B = <G, Θ> that best matches D.
Let’s assume we have the graph’s structure.
We now want to estimate the parameters Θ
given the data D.
To do so, we will use the ‘Likelihood Function’:
L(Θ : D) = P(D | Θ) = P(X1 = x[1], …, Xm = x[m] | Θ) = ∏_{i=1}^{m} P(x[i] | Θ)
Learning Parameters - Likelihood Function
[Figure: the dataset D as an m × 4 matrix with columns E, B, A, C (rows E[1] B[1] A[1] C[1] through E[m] B[m] A[m] C[m]), alongside the example network E → A ← B, A → C]

Assuming i.i.d. samples, the likelihood function is:
L(Θ : D) = ∏_{m} P(E[m], B[m], A[m], C[m] | Θ)
Learning Parameters - Likelihood Function
L(Θ : D) = ∏_{m} P(E[m], B[m], A[m], C[m] : Θ)
= ∏_{m} P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)
= ∏_{m} P(E[m] : Θ) · ∏_{m} P(B[m] : Θ) · ∏_{m} P(A[m] | B[m], E[m] : Θ) · ∏_{m} P(C[m] | A[m] : Θ)
Learning Parameters - Likelihood Function
General Bayesian Networks
L(Θ : D) = ∏_{m} P(x[m] | Θ)
= ∏_{i} ∏_{m} P(Xi[m] | Pa_i[m] : Θ_i)
= ∏_{i} L_i(Θ_i : D)
Decomposition – Independent estimation
problems.
Learning Parameters
MLE – maximum likelihood estimator: choose the parameters that maximize L(Θ : D).
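Thanks to decomposition, the MLE reduces to independent counting problems, one per family. A minimal sketch for a multinomial CPD (the helper and example data are hypothetical):

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """MLE of P(child | parents): N(child, parents) / N(parents)."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    return {(pa, x): n / marg[pa] for (pa, x), n in joint.items()}

data = [{"A": 0, "C": 0}, {"A": 0, "C": 0}, {"A": 1, "C": 1}, {"A": 0, "C": 1}]
print(mle_cpd(data, "C", ["A"]))
# {((0,), 0): 0.667, ((0,), 1): 0.333, ((1,), 1): 1.0}
```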
Bayesian Inference – Learning using Bayes’ rule.
Represent uncertainty about the sampling using a Bayesian network:

[Figure: parameter node Θ with children X[1], X[2], …, X[m]]

The values of X are independent given Θ, and (for a binary X) P(X[m] = 1 | Θ) = Θ.
Bayesian prediction is inference in this network.
[Figure: Θ with observed children X[1], X[2], …, X[m] (the observed data, e.g. 1, 0, 0, …) and query node X[m+1]]

Bayesian prediction is inference in this network:
P(x[m+1] | x[1], …, x[m]) = ∫ P(x[m+1] | Θ, x[1], …, x[m]) P(Θ | x[1], …, x[m]) dΘ
= ∫ P(x[m+1] | Θ) P(Θ | x[1], …, x[m]) dΘ
By Bayes’ rule:
P(Θ | x[1], …, x[m]) = P(x[1], …, x[m] | Θ) P(Θ) / P(x[1], …, x[m])
(posterior = likelihood × prior / probability of the data)
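For a single binary variable with a Beta(α, β) prior on Θ, this prediction integral has the standard closed form (the posterior mean). A minimal sketch; the Beta(1, 1) prior is an assumption, not from the slides:

```python
# P(x[m+1] = 1 | x[1..m]) = ∫ Θ · P(Θ | data) dΘ = (α + #ones) / (α + β + m)
def predict_next_is_one(observations, alpha=1.0, beta=1.0):
    ones = sum(observations)
    return (alpha + ones) / (alpha + beta + len(observations))

print(predict_next_is_one([1, 0, 0]))  # (1 + 1) / (2 + 3) = 0.4
```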
Equivalence Classes of Bayesian Networks
Problem: the joint probability represented by one graph can equally be represented by another.

[Figure: three equivalent graphs over A, B, C]

A → C → B: P(x) = P(A) P(C|A) P(B|C)
A ← C → B: P(x) = P(C) P(A|C) P(B|C)

These are equal because P(C|A) P(A) = P(A|C) P(C); in the same way, the third graph (A ← C ← B) represents the same joint.
Ind(G) – the set of independence statements that hold in
all distributions satisfying the Markov assumption on G.
G and G’ are equivalent if Ind(G) = Ind(G’).
Equivalence Classes of Bayesian Networks
Two DAGs are equivalent if and only if
they have the same underlying undirected
graph and the same v-structures.
We represent an equivalence class of
network structures by a partially directed
acyclic graph (PDAG), where a directed edge
denotes that all members of the class contain
that directed arc, and an undirected edge
otherwise.
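This characterization translates directly into a small test. A minimal sketch, representing a DAG as a plain set of directed edges (a representation chosen here for illustration):

```python
from itertools import combinations

def skeleton(edges):
    """The underlying undirected graph."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Pairs a -> c <- b where a and b are not adjacent."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    skel = skeleton(edges)
    return {(frozenset({a, b}), c)
            for c, ps in parents.items()
            for a, b in combinations(ps, 2)
            if frozenset({a, b}) not in skel}

def equivalent(g1, g2):
    """Same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Chain A -> C -> B vs. fork A <- C -> B: equivalent.
print(equivalent({("A", "C"), ("C", "B")}, {("C", "A"), ("C", "B")}))  # True
# Chain vs. collider A -> C <- B: not equivalent (extra v-structure).
print(equivalent({("A", "C"), ("C", "B")}, {("A", "C"), ("B", "C")}))  # False
```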
Learning Causal Patterns
• We want to model the mechanisms that
generate the dependencies (e.g., gene
transcription).
• A causal network is a model of such
causal processes.
• Representation: a DAG where each node
represents a random variable with a local
probability model. The parents of a variable
are its immediate causes.
Learning Causal Patterns
• Observation – a passive measurement of
our domain (i.e., a sample from X).
• Intervention – setting the values of
some variables using forces outside the
causal model.
A causal network models not only the
distribution of the observations, but also
the effects of interventions.
Learning Causal Patterns
• If X causes Y, then manipulating the
value of X affects the value of Y.
• If Y causes X, then manipulating the
value of X will not affect Y.
• Yet X → Y and Y → X are equivalent Bayesian
networks.
Causal Markov Assumption: given the
values of a variable’s immediate causes,
it is independent of its earlier causes.
Learning Causal Patterns
• When making the causal Markov
assumption, a causal network can be
interpreted as a Bayesian network.
• From observations alone we cannot
distinguish between causal networks that
belong to the same equivalence class.
• From a directed edge in the PDAG we can
infer a causal direction.
So Far…
• The likelihood Function
• Parameter estimation and the
decomposition principle
• Equivalence Classes of Bayesian Networks.
• Learning Causal Patterns
Analyzing Expression Data
1. Present modeling assumptions
2. Find high scoring networks.
3. Characterize features.
4. Estimate Statistical Confidence in
features.
5. Present local probability models.
Modeling assumptions
We consider probability distributions over
all possible states of the system.
A state is described using random variables.
Random variables –
• expression level of an individual gene
• experimental conditions
• temporal indicators (time/stage the sample was
taken)
• background variables (e.g., which clinical procedure
was used to take the sample)
Analyzing Expression Data
1. Present modeling assumptions
2. Find high scoring networks.
3. Characterize features.
4. Estimate Statistical Confidence in
features.
5. Present local probability models.
The Learning Problem:
Given a training set D=(X1, …,XN) of
independent instances X, find an equivalence
class of networks B=<G, Θ> that best
matches D.
We will use a scoring system:
S(G : D) = log P(G|D) = … = log P(D|G) + log P(G) + C
where log P(D|G) is the marginal likelihood and log P(G) is the prior over structures.

P(D|G) = ∫ P(D|G, Θ) P(Θ|G) dΘ
(the likelihood averaged over the prior on the parameters)
The learning problem - Scoring, cont.
Properties of the selected priors:
• structure equivalent
• decomposable
S(G : D) = ∑_{i} ScoreContribution(Xi, Pa(Xi) : D)
Now, learning amounts to finding the structure
G that maximizes the score.
This problem is NP-hard,
so we resort to heuristic search.
Local Search Strategy
Using decomposition, we can change one arc
and evaluate the gains made by this change
[Figure: an initial structure G over A, B, C and neighboring structures G’, each obtained by adding, deleting, or reversing a single arc]

If an arc to Xi is added or deleted, only
score(Xi, Pa(Xi)) needs to be re-evaluated. If
an arc is reversed, only score(Xi, Pa(Xi))
and score(Xj, Pa(Xj)) need to be re-evaluated.
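A greedy hill-climbing version of this local search might look as follows. A minimal sketch: only the add-arc move is shown, and `score_family` is a hypothetical stand-in for the decomposable score contribution of one family:

```python
def creates_cycle(edges, u, v):
    """Would adding u -> v create a directed cycle? (DFS from v back to u)."""
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for x, w in edges if x == node)
    return False

def hill_climb(variables, score_family, data):
    edges, parents = set(), {v: set() for v in variables}
    current = {v: score_family(v, parents[v], data) for v in variables}
    improved = True
    while improved:
        improved = False
        for u in variables:
            for v in variables:
                if u == v or (u, v) in edges or creates_cycle(edges, u, v):
                    continue
                # Only score(v, Pa(v)) must be re-evaluated for this move.
                gain = score_family(v, parents[v] | {u}, data) - current[v]
                if gain > 0:
                    edges.add((u, v)); parents[v].add(u)
                    current[v] += gain
                    improved = True
    return edges
```

Delete and reverse moves are handled the same way, re-evaluating only the one or two families the move touches.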
Find High-Scoring Networks
Problem:
Small data sets are not sufficiently
informative to determine that a single
model is the “right” model (or equivalence
class of models).
Solution:
Analyze a set of high-scoring networks,
attempt to characterize features that are
common to most of these networks, and
focus on learning those features.
Analyzing Expression Data
1. Present modeling assumptions
2. Find high scoring networks.
3. Characterize features.
4. Estimate Statistical Confidence in
features.
5. Present local probability models.
Features
We will use two classes of features
involving pairs of variables.
1. Markov Relations – Is Y in the Markov
Blanket of X?
Y is in X’s Markov Blanket if and only if
there is either an edge between them,
or both are parents of another variable.
A Markov relation indicates that the two
genes are related in some joint biological
interaction or process.
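The membership test is easy to express over a DAG given as a set of directed edges (a sketch):

```python
def in_markov_blanket(edges, x, y):
    """True iff x and y are adjacent or share a common child."""
    if (x, y) in edges or (y, x) in edges:
        return True
    cx = {w for u, w in edges if u == x}
    cy = {w for u, w in edges if u == y}
    return bool(cx & cy)

g = {("A", "C"), ("B", "C")}  # A and B are both parents of C
print(in_markov_blanket(g, "A", "B"))  # True
```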
Features
2. Order Relations – Is X an ancestor
of Y in all the networks of a given
equivalence class?
That is, does the PDAG contain a path
from X to Y in which all the edges are
directed?
This is an indication that X might be
a cause of Y.
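Likewise, the order relation reduces to a reachability check that follows only the directed edges of the PDAG (a sketch; undirected PDAG edges are simply excluded from the traversal):

```python
def directed_path_exists(directed_edges, x, y):
    """Is there a path from x to y using directed edges only?"""
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for u, w in directed_edges if u == node)
    return False

print(directed_path_exists({("X", "Z"), ("Z", "Y")}, "X", "Y"))  # True
```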
Analyzing Expression Data
1. Present modeling assumptions
2. Find high scoring networks.
3. Characterize features.
4. Estimate Statistical Confidence
in features.
5. Present local probability models.
Estimate Statistical Confidence in Features
• We want to estimate to what extent the
data support a given feature.
• We use the bootstrap method:
we generate “perturbed” versions of
the original data, and learn from them.
• We should be more confident in features
that would still be induced from the
“perturbed” data.
Estimate Statistical Confidence in Features
We use the bootstrap as follows:
For i = 1 … m (in our experiments m = 200):
o Re-sample with replacement N instances from D;
denote the resulting dataset Di.
o Apply the learning procedure on Di to induce a
network structure Gi.
For each feature f of interest calculate:
conf(f) = (1/m) ∑_{i=1}^{m} f(Gi)
where f(G) is 1 if f is a feature in G, and 0 otherwise.
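A direct transcription of this procedure (a sketch; `learn_structure` and `feature` are stand-ins for the search procedure and the feature test described earlier):

```python
import random

def bootstrap_confidence(data, learn_structure, feature, m=200):
    """conf(f) = (1/m) * sum_i f(G_i) over networks learned from resamples."""
    hits = 0
    for _ in range(m):
        resampled = random.choices(data, k=len(data))  # N instances, with replacement
        g_i = learn_structure(resampled)
        hits += 1 if feature(g_i) else 0
    return hits / m
```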
Estimate Statistical Confidence in Features
• Features induced with high confidence are
rarely false positives.
• The bootstrap procedure is especially
robust for the Markov and order features.
• The conclusions established on
high-confidence features are reliable even
when the data set is small relative to the
model being induced.
Analyzing Expression Data
1. Present modeling assumptions
2. Find high scoring networks.
3. Characterize features.
4. Estimate Statistical Confidence in
features.
5. Present local probability models.
Local Probability Models
• In order to specify a Bayesian network
model we need to choose the type of local
probability model to use.
• The choice of representation depends on
the type of variables we use:
Discrete variables – can be
represented with a table.
Continuous variables – there is no
representation that can capture all possible densities.
Local Probability Models
We consider two approaches:
1. Multinomial model – treat each variable as
discrete and learn a multinomial distribution
that describes the probability of each possible
state of a child variable given the states of its
parents.
We discretize by thresholding the ratio between
measured expression and control: values lower
than 2^(-0.5) are under-expressed (-1), values
higher than 2^(0.5) are over-expressed (1),
and values in between are normal (0).
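A minimal sketch of this discretization rule:

```python
def discretize(ratio):
    """ratio = measured expression / control."""
    if ratio < 2 ** -0.5:
        return -1  # under-expressed
    if ratio > 2 ** 0.5:
        return 1   # over-expressed
    return 0       # normal

print([discretize(r) for r in (0.5, 1.0, 2.0)])  # [-1, 0, 1]
```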
Local Probability Models
2. Linear Gaussian model – learn a linear
regression model for a child variable given its
parents.
If U1, …, Uk are the parents of variable X, then
P(X | u1, …, uk) ~ N(a0 + ∑_{i} ai·ui, σ²).
That is, X is normally distributed around a
mean that depends linearly on the values
of its parents.
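Fitting such a model reduces to ordinary least squares for each family. A minimal sketch with NumPy (the synthetic data is for illustration only):

```python
import numpy as np

def fit_linear_gaussian(U, x):
    """Fit P(X | u) = N(a0 + a·u, σ²); U is an m×k matrix of parent values."""
    design = np.column_stack([np.ones(len(x)), U])  # prepend intercept a0
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    sigma2 = np.mean((x - design @ coef) ** 2)
    return coef, sigma2  # coef = [a0, a1, ..., ak]

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 2))
x = 1.0 + 2.0 * U[:, 0] - 0.5 * U[:, 1] + rng.normal(scale=0.1, size=100)
coef, sigma2 = fit_linear_gaussian(U, x)
print(coef.round(2), round(sigma2, 3))  # ≈ [1.0, 2.0, -0.5], σ² ≈ 0.01
```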
Local Probability Models
• In the multinomial model, by
discretizing the measured expression
levels we lose information.
• The linear Gaussian model can only
detect dependencies that are close to
linear. In particular, it is not likely to
discover combinatorial effects (e.g., a
gene is over-expressed only if several
other genes are jointly over-expressed).
Application to Cell Cycle Expression Patterns
• Data from Spellman et al., ‘Comprehensive
identification of cell cycle-regulated genes of the yeast
Saccharomyces cerevisiae by microarray hybridization’,
Molecular Biology of the Cell.
• Contains 76 gene expression
measurements of the mRNA
levels of yeast.
• Spellman et al. identified 800
genes whose expression varied
over the different cell-cycle
stages.
Application to Cell Cycle Expression Patterns
• Treat each measurement as an independent
sample, and do not take into account the
temporal aspect of the measurements.
• Compensate by adding a variable denoting
the cell-cycle phase as the root of all
learned networks.
• Performed two experiments: one with the
multinomial distribution, and the other with
the linear Gaussian distribution.
http://www.cs.huji.ac.il/~nirf/GeneExpression/top800/
Robustness Analysis – Credibility of
confidence assessment
[Figure: linear Gaussian model, order features. X-axis: confidence threshold; Y-axis: number of features with confidence equal to or higher than the threshold]
Robustness Analysis – Credibility of
confidence assessment
[Figure: multinomial model, order features. X-axis: confidence threshold; Y-axis: number of features with confidence equal to or higher than the threshold]
Robustness Analysis – Adding more genes
Multinomial model
[Figure: X-axis: confidence with 250 genes; Y-axis: confidence with 800 genes]
Robustness Analysis – discretization
• The discretization method penalizes genes
whose natural range of variation is small, since
a fixed threshold is used.
• We can avoid the problem by normalizing the
expression of genes in the data.
• The top 20 Markov relations highlighted by
this method were somewhat different, while the
order relations were more robust, possibly
because order relations depend on the global
network structure rather than on local features.
Robustness Analysis – comparing the linear Gaussian and multinomial experiments
[Figure: X-axis: confidence in the multinomial experiment; Y-axis: confidence in the linear Gaussian experiment]
Biological Analysis – Order Relations
• Found the existence of dominant genes: out of
all 800 genes, only a few seem to dominate the
order.
• Among them are genes that are:
directly involved in initiation of the cell cycle
and its control;
components of pre-replication complexes;
involved in DNA repair, which is associated
with transcription initiation.
Biological Analysis - Markov Relations
• Among the top-scoring relations, all those
involving two known genes make sense biologically.
• Several of the unknown pairs are
physically adjacent on the chromosome,
and presumably regulated by the same
mechanism.
• Some relations are beyond the limitations of
clustering.
Example: CLN2, RNR3, SVS1, SRO4, and RAD51 all
appear in the same cluster in Spellman et
al. In our network CLN2 is a parent of
the other four, while no links were found
between them. This suits biological
knowledge: CLN2 is a central and early
cell-cycle control, while there is no
clear biological relationship between the
others.
Discussion
The approach:
is capable of handling noise and
estimating the confidence in the different
features of the network;
managed to extract many biologically
plausible conclusions;
is capable of learning rich structures
from the data, such as discovering causal
relationships, and interactions between
genes other than positive correlation.
Discussion
Ideas for future work:
learn models over “clustered” genes;
recover all relationships in one analysis;
improve the confidence estimation;
incorporate biological knowledge as prior
knowledge in the analysis;
learn causal patterns by adding
interventional data.
The End