
Using Bayesian Networks to
Analyze Expression Data
By Nir Friedman, Michal Linial,
Iftach Nachman, Dana Pe'er (2000)
Presented by
Nikolaos Aravanis
Lysimachos Zografos
Alexis Ioannou
Outline
• Introduction
• Bayesian Networks
• Application to expression data
• Application to cell cycle expression patterns
• Discussion and future work
The Road to Microarray Data Analysis
• Development of microarrays
– Measure all the genes of an organism
• Enormous amount of data
• Challenge: Analyze datasets and infer biological
interactions
Most Common Analysis Tool
• Clustering Algorithms
• Identify groups of genes with similar expression
patterns across a set of experiments
• Discover genes that are co-regulated
Problems
• Data give only a partial picture
– Key events are not reflected (translation and
protein (in)activation)
• The small number of samples gives little information
for constructing fully detailed models
• With current technologies, even these few samples
have a high noise-to-signal ratio
Possible Solution
• Analyze gene expression patterns that uncover
properties of the transcriptional program
• Examine dependence and conditional
independence in the data
• Bayesian Networks
Bayesian Networks
• Represent dependence structure between multiple
interacting quantities
• Capable of handling noise and estimating the confidence
in the different features of the network
• Focus on interactions whose signal is strong
• Useful for describing processes composed of locally
interacting components
• Statistical foundations for learning Bayesian networks
and the statistics to do so are well understood and have
been successfully applied
• Provide models of causal influence
Informal Introduction to Bayesian Networks
• Let P(X,Y) be a joint distribution over variables X and Y
• X and Y are independent if P(X,Y) = P(X)P(Y) for all values of X and Y
• Suppose gene A is a transcription factor of gene B
• We then expect their expression levels to be dependent
• A is a parent of B
• If B is a transcription factor of C, the expression levels of each pair are dependent
• If A does not directly affect C, then once we fix the expression
level of B, we observe that A and C are independent
• P(A|B,C) = P(A|B) (A and C are conditionally independent given B):
Ind(A; C | B)
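To make the chain A → B → C concrete, here is a minimal numerical sketch (hypothetical probabilities, not from the paper) of a joint distribution in which A influences C only through B, verifying that P(C|A,B) = P(C|B):

```python
import numpy as np

# Hypothetical chain A -> B -> C over binary variables.
# A influences C only through B, so Ind(A; C | B) should hold.
p_a = np.array([0.6, 0.4])                      # P(A)
p_b_given_a = np.array([[0.9, 0.1],             # P(B | A=0)
                        [0.2, 0.8]])            # P(B | A=1)
p_c_given_b = np.array([[0.7, 0.3],             # P(C | B=0)
                        [0.1, 0.9]])            # P(C | B=1)

# Joint P(A,B,C) for this structure.
joint = (p_a[:, None, None]
         * p_b_given_a[:, :, None]
         * p_c_given_b[None, :, :])

# Conditional independence check: P(C | A, B) equals P(C | B).
p_ab = joint.sum(axis=2)                        # P(A,B)
p_c_given_ab = joint / p_ab[:, :, None]         # P(C | A, B)
for a in range(2):
    assert np.allclose(p_c_given_ab[a], p_c_given_b)
print("Ind(A; C | B) holds in this chain")
```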
Informal Introduction to Bayesian Networks (cont'd)
• A key property of Bayesian networks is that each variable
is a stochastic function of its parents
• Stochastic models are natural in the gene expression domain:
– The biological processes we want to model are stochastic
– Measurements are noisy
Representing Distributions with Bayesian Networks
• Representation of joint probability distribution
consisting of 2 components
• Directed acyclic graph (G)
• Conditional distribution for each variable given its
parents in G
• G encodes Markov Assumption
i , i ; NonDescendentsi  | Pai 
• By applying chain rule this decomposes in product
form
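A minimal sketch of this decomposition, assuming a toy discrete network (the variable names and probabilities are illustrative, not from the paper):

```python
import numpy as np

# A discrete Bayesian network: a DAG (parents) plus one conditional
# probability table (CPT) per variable, indexed by (parent values..., value).
parents = {"A": [], "B": ["A"], "C": ["B"]}
cpt = {
    "A": np.array([0.6, 0.4]),
    "B": np.array([[0.9, 0.1], [0.2, 0.8]]),
    "C": np.array([[0.7, 0.3], [0.1, 0.9]]),
}

def joint_prob(assignment):
    """P(x1, ..., xn) = prod_i P(xi | Pa_i): the Markov decomposition."""
    p = 1.0
    for v, pa in parents.items():
        idx = tuple(assignment[u] for u in pa) + (assignment[v],)
        p *= cpt[v][idx]
    return p

print(joint_prob({"A": 0, "B": 1, "C": 1}))  # 0.6 * 0.1 * 0.9 = 0.054
```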
Equivalence Classes of BNs
A BN implies further independence statements
=> Ind(G)
More than one graph can imply the same statements
=> Equivalent networks if Ind(G) = Ind(G')
Example: Ind(G) = Ind(G') = Ø
Equivalence Classes of BNs
For equivalent networks:
 DAGs have the same underlying undirected
graph.
 PDAGs are used to represent them.
[Figure: example PDAG; the undirected edge is the disagreeing edge]
Learning BNs
Question:
Given a dataset D, what BN B = <G, Θ> best
matches D?
Answer:
A statistically motivated scoring function to
evaluate each BN, e.g. the Bayesian score:
S(G:D) = log P(G|D) = log P(D|G) + log P(G) + C,
where C is a constant independent of G
and P(D|G) = ∫ P(D|G,Θ) P(Θ|G) dΘ
is the marginal likelihood, averaging over all
parameters for G.
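As a concrete illustration of the marginal likelihood, here is a sketch that scores one variable given its parents under a uniform Dirichlet (BDeu-style) prior, using the standard closed form for Dirichlet-multinomial models; the paper's actual BDe prior construction is more elaborate, and the full score sums such terms over all families:

```python
import numpy as np
from scipy.special import gammaln

def family_log_marglik(child, parents, data, arity, alpha=1.0):
    """log of the integral P(D|G,Θ) P(Θ|G) dΘ for one (child, parents)
    family, in closed form for a Dirichlet prior over multinomial params."""
    r = arity[child]                                   # child states
    q = int(np.prod([arity[p] for p in parents])) if parents else 1
    counts = np.zeros((q, r))
    for row in data:                                   # rows: dict var -> state
        j = 0
        for p in parents:                              # flatten parent config
            j = j * arity[p] + row[p]
        counts[j, row[child]] += 1
    a = alpha / (q * r)                                # uniform prior mass
    ll = 0.0
    for j in range(q):
        ll += gammaln(r * a) - gammaln(r * a + counts[j].sum())
        ll += np.sum(gammaln(a + counts[j]) - gammaln(a))
    return ll

# Toy data over two binary genes (hypothetical names and values).
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}]
print(family_log_marglik("B", ["A"], data, arity={"A": 2, "B": 2}))
```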
Learning BNs (cont'd)
Steps:
– Decide priors (P(Θ|G), P(G))
=> Use of BDe priors
(structure equivalent, decomposable)
(Heckerman et al. 1995)
– Find G to maximize S(G:D)
NP-hard problem
=> local search using local perturbations
(e.g. edge addition, deletion, reversal) of the
candidate G
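A minimal hill-climbing sketch of this local search, assuming some decomposable scoring function `score(nodes, edges, data)` such as a sum of the family scores above (reversal moves are omitted for brevity):

```python
import itertools

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be
    removed in topological order."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    frontier = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while frontier:
        u = frontier.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    frontier.append(b)
    return seen == len(nodes)

def hill_climb(nodes, data, score):
    """Greedy local search over DAGs via single-edge additions/deletions."""
    edges = set()                        # start from the empty graph
    best = score(nodes, edges, data)
    improved = True
    while improved:
        improved = False
        for u, v in itertools.permutations(nodes, 2):
            move = edges - {(u, v)} if (u, v) in edges else edges | {(u, v)}
            if not is_acyclic(nodes, move):
                continue
            s = score(nodes, move, data)
            if s > best:                 # take any improving neighbour
                best, edges, improved = s, move, True
    return edges, best
```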
Learning Causal Patterns
– A Bayesian network is a model of dependencies
– We are interested in modelling the process that
generated them
=> model the flow of causality in the system of
interest and create a Causal Network (CN).
A Causal Network models the probability
distribution as well as the effects of causality.
Learning Causal Patterns
CNs vs BNs:
– CNs interpret parents as immediate causes
(cf. BNs)
– CNs and BNs coincide under the
Causal Markov Assumption:
"given the values of a variable's immediate
causes, it is independent of its earlier causes";
if this holds, then BN == CN
[Figure: X → Y and Y → X are equivalent BNs but not equivalent CNs]
Applying BNs to Expression Data
 Expression level of each gene is modelled as a
random variable
 Other attributes (e.g. temperature, experimental
conditions) that affect the system can also be
modelled as random variables
 The Bayesian network / dependency structure can
answer queries
 CON: problems in computational complexity
and in the statistical significance of the resulting
networks
 PRO: genetic regulation networks are sparse
Representing Partial Models
– Gene networks involve many variables
=> more than one plausible model (not enough data);
we can only learn up to an equivalence class.
Focus on feature learning in order to obtain a
causal network:
Representing Partial Models
Features:
- Markov relations (e.g. Y is in the Markov blanket of X)
- Order relations (e.g. X is an ancestor of Y in all
networks)
Feature learning leads to a Causal Network
Statistical Confidence of Features
– Likelihood that a given feature is actually true.
– Can't calculate the posterior P(G|D)
=> Bootstrap method:
for i = 1 ... n:
    resample D with replacement -> D_i
    learn G_i from D_i
Statistical Confidence of Features
Individual feature confidence (IFC):
IFC = (1/n) ∑i f(G_i)
where f(G_i) = 1 if the feature exists in G_i, 0 otherwise
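A sketch of the bootstrap confidence estimate, assuming a learner `learn_network(D)` (hypothetical; e.g. the hill-climbing search sketched earlier, returning a set of edges) and a feature test such as "edge (u, v) is present":

```python
import random

def bootstrap_confidence(data, n, learn_network, has_feature):
    """conf(f) = (1/n) * sum_i f(G_i), with each G_i learned from a
    with-replacement resample of the data of the same size."""
    hits = 0
    for _ in range(n):
        resample = [random.choice(data) for _ in data]   # |D_i| == |D|
        hits += has_feature(learn_network(resample))
    return hits / n

# Example feature test (hypothetical gene names): edge CLN2 -> SVS1 present?
has_edge = lambda g: int(("CLN2", "SVS1") in g)
```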
Efficient Learning Algorithms
– Vast search space
=> need efficient algorithms
– Focus attention on the relevant regions of the search
space
=> Sparse Candidate Algorithm
Efficient Learning Algorithms
Sparse Candidate Algorithm:
– Identify a small number of candidate parents
for each gene based on simple local statistics
(e.g. correlation)
– Restrict the search to networks in which each gene's
parents come from its candidate set
– Potential pitfall: a poor early choice of candidates
=> Solution: an adaptive (iterative) algorithm
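A sketch of the candidate-selection step, assuming the expression data is a genes × experiments matrix and using absolute Pearson correlation as the simple local statistic (other measures, e.g. mutual information, would work the same way):

```python
import numpy as np

def candidate_parents(expr, k):
    """For each gene (row of expr), pick the k other genes with the
    highest absolute pairwise correlation as its candidate parents."""
    corr = np.abs(np.corrcoef(expr))          # genes x genes
    np.fill_diagonal(corr, -np.inf)           # a gene cannot be its own parent
    return {g: list(np.argsort(corr[g])[::-1][:k]) for g in range(len(expr))}

expr = np.random.rand(10, 76)                 # toy: 10 genes, 76 experiments
print(candidate_parents(expr, k=3))
```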
Discretization
The practical side:
Need to define the local probability model for
each variable
=> discretize the expression data into {-1, 0, 1}
(expression level lower than, similar to, or higher
than control)
 Set the control level by averaging
 Set a threshold ratio for significantly
higher/lower expression
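A minimal sketch of this discretization, assuming log-ratio expression values and a hypothetical threshold t (the per-gene average serves as the control here):

```python
import numpy as np

def discretize(expr, t=0.5):
    """Map each measurement to -1, 0, or 1: lower than, similar to,
    or higher than control. expr is genes x experiments in log scale;
    the control is each gene's average; t is the significance threshold."""
    control = expr.mean(axis=1, keepdims=True)
    ratio = expr - control                    # log-scale ratio to control
    return np.where(ratio > t, 1, np.where(ratio < -t, -1, 0))

levels = discretize(np.random.randn(10, 76))  # entries in {-1, 0, 1}
```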
Application to Cell Cycle Expression Patterns
• 76 gene expression measurements of the mRNA levels
of 6177 Saccharomyces cerevisiae ORFs; six time
series under different cell cycle synchronization
methods (Spellman et al. 1998)
• 800 genes were differentially expressed; 250 of them fall
into 8 distinct clusters. The network variables represent the
expression levels of the 800 genes
• An additional variable denoting the cell cycle phase was
introduced to handle the temporal nature of the cell
cycle process, and was forced to be a root in the network
• Applied the Sparse Candidate Algorithm to a 200-fold
bootstrap of the original data
• Used no prior biological knowledge in the learning
algorithm
[Figure: network with all edges]
[Figure: network with only the edges representing relations with confidence above 0.3]
[Figure: YNL058C local map, showing edges, Markov relations, ancestors and descendants, with links to the SGD and YPD entries]
Robustness analysis
• Use the 250-gene data set for robustness analysis
• Create a random data set by permuting the
order of experiments independently for
each gene
• No "real" features are expected to be
found in it
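A sketch of this permutation test, assuming the same genes × experiments matrix as above: shuffling each gene's row independently destroys any dependence between genes while preserving each gene's marginal distribution:

```python
import numpy as np

def permute_experiments(expr, seed=0):
    """Randomized data set: independently shuffle the order of the
    experiments within each gene's row, breaking real dependencies."""
    rng = np.random.default_rng(seed)
    out = expr.copy()
    for row in out:
        rng.shuffle(row)                      # in-place shuffle of one gene
    return out
```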
Robustness analysis (cont'd)
• Lower confidence for order and
Markov relations in the random
data set
• Longer and heavier tail in the
high confidence region in the
original data set
• Sparser networks learned from
real data
• Features learned in original
data with high confidence level are
not an artifact of the bootstrap
estimation
Robustness analysis (cont'd)
• Compared confidence level of learned features between 250
and 800 gene data set
• Strong linear correlation
• Compared confidence level of learned features between
different discretization thresholds
• Definite linear tendency
Biological Analysis
Order relations
• Dominant genes indicate potential causal
sources of the cell cycle process
• Dominance score of X:
d(X) = ∑_{Y: Co(X,Y) > t} Co(X,Y)^k
where Co(X,Y) is the confidence in X
being an ancestor of Y, k is used to reward
high-confidence features, and t is a
threshold to discard low-confidence ones
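A sketch of this score, assuming ancestor-relation confidences stored as a nested dict and hypothetical parameter values k = 2, t = 0.3:

```python
def dominance_score(co, x, k=2, t=0.3):
    """d(X) = sum of Co(X,Y)**k over all Y with Co(X,Y) > t."""
    return sum(c ** k for y, c in co[x].items() if y != x and c > t)

# Toy confidences (hypothetical numbers).
co = {"CLN2": {"RNR3": 0.9, "SVS1": 0.8, "FAR1": 0.1}}
print(dominance_score(co, "CLN2"))            # 0.9**2 + 0.8**2 = 1.45
```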
Biological Analysis (cont'd)
• Dominant genes are key genes in basic
cell functions
Biological Analysis (cont'd)
Markov relations
• Top Markov relations reveal functional relations between genes:
1. Both genes known: the relations make sense biologically
2. One unknown gene: firm homologies to proteins functionally
related to the other gene
3. Two unknown genes: physically adjacent on the chromosome,
presumably regulated by the same mechanism
• FAR1 - ASH1: low correlation, in different clusters, yet known to participate in mating type
switching
• CLN2 is likely to be a parent of RNR3, SVS1, SRO4 and RAD51. They appeared in the same cluster, with no
links among the 4 genes. CLN2 is known to be a central cell cycle control, and there is no clear
biological relationship among the others
Discussion and Future Work
• Applied the Sparse Candidate Algorithm and bootstrap resampling to extract a
Bayesian network for the 800-gene data set of Spellman et al.
• Used no prior biological knowledge
• Derived biologically plausible conclusions
• Capable of discovering causal relationships, interactions between genes,
and rich structure between clusters
• Developing hybrid algorithms with clustering algorithms to learn models
over clustered genes
• Extensions:
– Learn local probability models dealing with continuous data
– Improve theory and algorithms
– Include biological knowledge as prior knowledge
– Improve search heuristics
– Apply Dynamic Bayesian Networks to temporal data
– Discover causal patterns (using interventional data)