Microarray chips
This presentation is motivated by the need to analyze
genomic expression data with models
that can capture latent variables,
describe complex relationships, and
be scored rigorously against
observational data.
Introduction – motivation, current models,
new techniques.
Bayesian networks, modeling regulatory
networks with them, and scoring models with the
Bayesian scoring metric.
Example - using the galactose system.
Representing models with annotated edges.
Scoring annotated models of the galactose
system.
Conclusions.
The opportunity for research in fields like
medicine, biology and pharmacology using
the vast quantity of data generated by gene
arrays.
Understanding basic cellular processes,
diagnosis and treatment of disease and
designing targeted therapeutics.
Data from expression arrays is inherently
noisy.
Our knowledge regarding genetic regulatory
networks is extremely limited so all
hypotheses about their structure or function
may be incomplete.
Gene expression is regulated in a complex
and combinatorial manner; however, most
analysis of expression array data utilizes
only pairwise measures.
Typically performed by clustering the expression
profiles of a collection of genes using pairwise
measures such as:
Correlation – given two data vectors, normalize them and
take the dot product.
Euclidean distance – the square root of the sum of the squared
differences in each dimension.
Mutual information – an information-theoretic measure of how
much information A contains about B (and vice versa);
uses a discretized model, partitioning the expression levels
into bins, and finds pairs of genes whose bins map
one-to-one under a permutation of the bin numbers.
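As a rough illustration (my own toy vectors and helper functions, not from the presentation), the three pairwise measures might be computed like this:

```python
import numpy as np

def correlation(a, b):
    """Normalize both vectors, then take the dot product (Pearson r)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return np.dot(a, b) / len(a)

def euclidean(a, b):
    """Square root of the sum of squared per-dimension differences."""
    return np.sqrt(np.sum((a - b) ** 2))

def mutual_information(a, b, bins=3):
    """Discretize expression levels into bins, then estimate I(A; B)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0  # avoid log(0) on empty bins
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz]))

# two hypothetical expression profiles, perfectly linearly related
g1 = np.array([0.1, 0.5, 0.9, 1.3, 1.7])
g2 = np.array([0.2, 0.6, 1.0, 1.4, 1.8])
print(correlation(g1, g2))  # 1.0 for a perfect linear relationship
```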
Identifying clusters with common sequence motifs.
Previous presentations.
Data – a microarray of 8,600 human
genes grouped by expression level, as
can be seen, into 5 clusters.
Noise in expression array data is typically
not analyzed in detail, so the significance of
alternative conclusions from these studies
cannot be quantitatively compared.
Such analysis does not permit models with latent
variables – variables that describe
unobserved values (such as protein levels)
– which could make predictions that can be verified
later as data becomes available.
Employing Boolean models that are restricted to
logical relationships between variables:
A graph G(V, E) annotated with a set of states
X = {xi | i = 1, 2, ..., n}, together with a set of Boolean
functions B = {bi}, one per node, bi : {0,1}^k -> {0,1}
where k is the number of inputs to node vi.
Gate: each node vi has associated with it a function
whose inputs are the states of the nodes connected to vi.
The state of node vi at time t is denoted xi(t).
The state of that node at time t+1 is then given by:
xi(t+1) = bi(xi1(t), xi2(t), ..., xik(t)), where the xij are the
states of the nodes connected to vi.
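The update rule can be sketched for a small hypothetical network. The gene names and Boolean rules below are illustrative assumptions, loosely following the three-gene example discussed next:

```python
# Hypothetical three-gene Boolean network. Each gene's next state is a
# Boolean function of the current states of its input genes.

def step(state):
    a, b, c = state["a"], state["b"], state["c"]
    return {
        "a": not (b and c),  # assumed rule: a is inhibited when b and c are both high
        "b": a,              # assumed rule: b directly depends on a
        "c": b,              # assumed rule: c directly depends on b
    }

# run the synchronous update for a few time steps
state = {"a": True, "b": True, "c": True}
for t in range(4):
    state = step(state)
    print(t + 1, state)
```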
Described in detail in Andrey’s presentation.
Network of three nodes - a, b and c. As one can
see, expression of c directly depends on
expression of b, which directly depends on a.
However, influence of b and c on a is more
complex. For example, high level of expression
of both b and c leads to inhibition of a.
Bayesian networks are used to describe
relationships between variables in a genetic
regulatory network.
Describes arbitrary combinatorial control of gene
expression not limited to pair wise interaction
between genes.
Useful in describing processes composed of
locally interacting components (the value of each
component directly depends on the values of a
relatively small number of components).
Provide models of causal influence – modeling
the mechanism that generated the dependencies
– which helps predict the effect of an intervention in the
domain: setting the value of a variable in a way
that the manipulation itself doesn’t affect the other
variables.
Due to their probabilistic nature, they are robust in the
face of imperfect data or an imperfect model (small
variations and noise don’t change the outcome of the
network much).
Permit latent variables capturing unobserved
values.
The variables can be either discrete or
continuous, and can represent mRNA concentration,
protein concentration, or genotypic information.
A variable describing an observed value = an
information variable.
A variable describing an unobserved value = a
latent variable.
Describes the relationships between variables at
a qualitative and quantitative level.
At a qualitative level the relationships between variables are
dependence and conditional independence – encoded in a
structure of a directed acyclic graph G:
The vertices correspond to variables.
Directed edges represent dependencies between variables.
At a quantitative level the relationships between variables are
described by a family of joint probability distributions that are
consistent with the independence assertions embedded in the
graph.
Under the Markov assumption: “Each variable X is independent of
its non-descendants, given its parents in G.”
General formula for the joint probability distribution:
P(X1, ..., Xn) = ∏i P(Xi | Pa(Xi)), where Pa(Xi) are the parents of Xi in G.
For discrete variables from a finite set, P(X | U1, U2, ..., Uk) can
be represented as a table that specifies the probability of each
value of X – the number of free parameters is
exponential in the number of parents.
Continuous variables – there is no representation that
can capture all possible densities. P(X | U1, U2, ..., Uk)
can be represented using a Gaussian conditional density:
P(X | U1, U2, ..., Uk) ~ N(a0 + Σi ai·Ui, σ²) – the mean
depends linearly on the values of the parents, and the
variance is independent of the parents’ values.
A hybrid network is a mixture of discrete and continuous
variables.
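As a small illustration of the factorization of the joint distribution, here is a toy two-node binary network X -> Y with made-up CPT numbers (my own, not from the presentation):

```python
# CPTs for a binary network X -> Y, stored as plain dicts.
p_x = {0: 0.7, 1: 0.3}                       # P(X)
p_y_given_x = {0: {0: 0.9, 1: 0.1},          # P(Y | X=0)
               1: {0: 0.2, 1: 0.8}}          # P(Y | X=1)

def joint(x, y):
    """P(X=x, Y=y) = P(X=x) * P(Y=y | X=x), per the factorization."""
    return p_x[x] * p_y_given_x[x][y]

# sanity check: a valid joint distribution sums to 1
total = sum(joint(x, y) for x in (0, 1) for y in (0, 1))
print(total)  # 1.0
```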
More than one graph can imply the same set of
independencies.
Example: the graphs X → Y and Y → X both have ind(G) = ∅.
Two graphs G, G’ are equivalent if
ind(G) = ind(G’).
Two DAGs are equivalent if and only if they have
the same underlying undirected graph and the
same v-structures (directed edges converging into
the same node).
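This equivalence test can be sketched directly (the helper functions and edge lists below are my own illustrations):

```python
def skeleton(edges):
    """The underlying undirected graph, as a set of unordered pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Pairs of parents converging on a child, with no edge between them."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    vs = set()
    for child, ps in parents.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, b, child))
    return vs

def equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

# X -> Y and Y -> X are equivalent; X -> Z <- Y contains a v-structure,
# so it is not equivalent to X <- Z <- Y despite the same skeleton.
print(equivalent([("X", "Y")], [("Y", "X")]))                          # True
print(equivalent([("X", "Z"), ("Y", "Z")], [("Z", "X"), ("Y", "Z")]))  # False
```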
The approach to analyzing gene expression data
using Bayesian network learning techniques is
as follows:
Our modeling assumptions are presented.
Probability distributions over all possible states
of the system are considered.
A state of the system is described using random
variables.
Each random variable describes:
The expression level of individual genes.
Experimental conditions.
Temporal indicators (the time/stage that the sample
was taken from).
Background variables (e.g. which clinical procedure was
used to obtain a biopsy sample).
Given all the states and the samples, using a
scoring technique, the best model that matches
the data is found (similar to the method in Inbar’s
presentation).
When such a model is found, queries about the
system can be answered.
Given a database D = {X1, ..., Xn} of n
samples, where each sample Xh = (Xh1, ..., Xhm) assigns
values to m variables, we seek the network B = <G, Θ> that
best matches D.
A likelihood function is calculated:
P(D | G) = ∫ P(D | G, Θ) · P(Θ | G) dΘ
P(D | G, Θ) = ∏h=1..n P(Ch | G, Θ)
where the Ch are the cases in the data set (under the
assumption that cases occur independently).
P(Θ | G) – the likelihood of the probability
assignments given the graph structure.
Calculating the score of the model:
BayesianScore(G) = log P(D|G) + log P(G) + C.
P(G) – the prior: the probability that a network
has graph structure G.
Given 3 variables, the assumption is that all possible
belief networks are equally likely. There are 25 possible
belief networks (DAGs), so the prior probability is uniform.
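The count of 25 DAGs over 3 variables can be verified by brute force (a quick sketch of my own): enumerate every orientation of the 3 possible edges and keep the acyclic ones.

```python
from itertools import product

nodes = [0, 1, 2]
pairs = [(0, 1), (0, 2), (1, 2)]

def is_acyclic(edges):
    """Kahn-style check: repeatedly remove nodes with no incoming edges."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        free = [n for n in remaining if not any(v == n for _, v in edges)]
        if not free:
            return False  # every remaining node has an incoming edge: cycle
        remaining -= set(free)
        edges = {(u, v) for u, v in edges if u in remaining and v in remaining}
    return True

count = 0
# each pair of nodes is unconnected, oriented one way, or the other
for choice in product([None, "fwd", "rev"], repeat=3):
    edges = []
    for (u, v), c in zip(pairs, choice):
        if c == "fwd":
            edges.append((u, v))
        elif c == "rev":
            edges.append((v, u))
    if is_acyclic(edges):
        count += 1
print(count)  # 25: of the 27 orientations, only the two 3-cycles are excluded
```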
P(D | G, Θ) = ∏h=1..n P(Ch | G, Θ)
Finding the maximal score/ the structure G that
maximizes the score is NP-hard, so heuristics
are needed.
One possibility is a local search that changes
one edge at each move – a greedy hill-climbing
algorithm that at each step performs the local
change resulting in the maximal gain, until it
reaches a local maximum.
Performs well in practice.
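A minimal sketch of greedy hill climbing over network structures. The scoring function here is a toy stand-in of my own (distance to a hypothetical "true" edge set), not the actual Bayesian scoring metric, and the move set is just add/remove one edge:

```python
NODES = ["Gal4", "Gal80", "Gal2"]
TARGET = {("Gal4", "Gal2"), ("Gal80", "Gal2")}   # hypothetical best structure

def acyclic(edges):
    """DFS cycle check over the directed edge set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    state = {}
    def dfs(n):
        if state.get(n) == "visiting":
            return False                          # back edge -> cycle
        if state.get(n) == "done":
            return True
        state[n] = "visiting"
        ok = all(dfs(m) for m in adj.get(n, []))
        state[n] = "done"
        return ok
    return all(dfs(n) for n in list(adj))

def score(edges):
    return -len(edges ^ TARGET)                   # toy score: 0 is best

def hill_climb(score, nodes):
    current = frozenset()
    while True:
        neighbors = []
        for u in nodes:
            for v in nodes:
                if u == v:
                    continue
                if (u, v) in current:
                    neighbors.append(current - {(u, v)})       # remove edge
                elif (v, u) not in current and acyclic(current | {(u, v)}):
                    neighbors.append(current | {(u, v)})       # add edge
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            return current                        # local maximum reached
        current = best

print(sorted(hill_climb(score, NODES)))
```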
Includes an inherent penalty for model complexity
(balancing a model’s ability to explain observed data with
its ability to do so economically, and consequently guarding
against over-fitting models to data).
The model is permitted to be incomplete containing
additional degrees of freedom (while being penalized by
the scoring metric). Scores improve as a model
converges to one without degrees of freedom.
Allows us to represent uncertainty about the precise
dependencies between variables, as Θ is a distribution
and not a single value.
Instantiate the latent variables by sampling from
the distribution of possible values for each such
variable – MCMC (Markov Chain Monte
Carlo) methods. This becomes computationally prohibitive as
networks become very large:
When X is a latent variable, E(f(X)) ≈ (1/n) · Σt=1..n f(Xt),
where f is a function of interest regarding X and n is the
number of samples.
The law of large numbers ensures that enlarging
n gives a better approximation.
This estimator assumes that all {Xt}, t = 1..n, are
independent – an incorrect assumption.
As a result, a Markov chain is formed: a
series of random variables X1, ..., Xn, where each
sample is drawn from the distribution
P(Xt+1 | Xt) – given Xt, Xt+1 isn’t dependent upon
{X1, ..., Xt-1}.
P(Xt+1 | Xt) – the transition kernel of the chain.
An algorithm for generating X1, ..., Xn is the
Metropolis-Hastings algorithm.
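A minimal Metropolis-Hastings sketch of the estimator above. The standard-normal target and f(x) = x² are my own illustrative choices, not from the presentation:

```python
import math
import random

random.seed(0)

def target_logpdf(x):
    """Unnormalized log-density of the target, here N(0, 1)."""
    return -0.5 * x * x

def metropolis_hastings(n, step=1.0):
    """Draw a correlated chain X1..Xn using a symmetric random-walk proposal."""
    x, samples = 0.0, []
    for _ in range(n):
        proposal = x + random.uniform(-step, step)
        # accept with probability min(1, p(proposal) / p(x))
        if math.log(random.random()) < target_logpdf(proposal) - target_logpdf(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings(50_000)
f = lambda x: x * x                 # function of interest; E[X^2] = 1 for N(0, 1)
estimate = sum(f(x) for x in samples) / len(samples)
print(estimate)
```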
Variational approximation methods can be used, either on
their own or in conjunction with sampling – for example
``search-based'' methods, which consider node
instantiations across the entire graph. The general hope
in these methods is that a relatively small fraction of the
(exponentially many) node instantiations contains a
majority of the probability mass, and that by exploring the
high probability instantiations (and bounding the
unexplored probability mass) one can obtain reasonable
bounds on probabilities.
Variational methods also yield upper and lower bounds
on the score, often enabling the highest-scoring graph to
be identified without resorting to sampling.
Example: a patient with a certain disease took certain
pills at some point, which contributed to his eventual
death. We would like to learn the probability that he would
have stayed alive had he not taken the pills.
Predicting the effects of an intervention in the domain,
not only the probability of observations.
If X causes Y, then manipulating the value of X affects the
value of Y.
If Y causes X, then manipulating X will not affect Y.
The two networks X → Y and Y → X are equivalent Bayesian
networks, but they are not equivalent as causal networks.
A causal network can be interpreted as a
Bayesian network when assuming the causal
Markov assumption:
Given the values of a variable’s immediate
causes, it is independent of its earlier causes.
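The difference between the two causal readings can be made concrete with toy numbers of my own: both DAGs can encode the same observational joint, yet they predict different things under the intervention do(X = 1).

```python
# Shared observational joint P(X, Y) over binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}
p_y_given_x = {x: {y: joint[(x, y)] / p_x[x] for y in (0, 1)} for x in (0, 1)}

# Model "X causes Y": after do(X=1), Y follows P(Y | X=1).
py1_do_x1_causal = p_y_given_x[1][1]

# Model "Y causes X": intervening on X cuts the Y -> X edge,
# so Y keeps its marginal distribution P(Y).
py1_do_x1_anticausal = p_y[1]

print(py1_do_x1_causal, py1_do_x1_anticausal)  # 0.6 vs 0.4
```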
Genetic regulatory network responsible for the control of
genes necessary for the galactose metabolism.
This is a fairly well understood system in yeast and so
allows us the opportunity to evaluate our methodology in
a setting where we can rely on accepted fact.
Example of genetic regulatory networks represented as Bayesian networks.
Boxed variables – mRNA levels that can be determined from expression
array data.
Unboxed variables – protein levels. In this model they are treated as latent
variables whose values cannot be measured directly.
The two networks represent two competing models of a portion of the galactose
system in yeast – they differ in the dependence relationships they posit
between the variables Gal80p, Gal4m and Gal4p.
Conclusions based on previous research:
It was originally proposed that Gal80 protein is a
repressor of Gal4 transcription – shown in M1.
It is now clear that Gal4 is expressed constitutively and
that its activity is inhibited by Gal80 protein – shown in
M2.
Expression data for this analysis consisted of 52 genomes
of Affymetrix S. cerevisiae GeneChip data.
In order to get those two competing networks, the
Bayesian scoring metric was used.
Binary quantization was performed independently for
each gene using a maximum likelihood separation
technique.
The simplified versions of M1 and M2.
Scoring results:
The model M1, in which Gal80p represses transcription of Gal4m,
received a score of –44.0, while the model M2, in which Gal80p
actively inhibits Gal4p, received a score of –34.5.
The score difference translates to the data being over 13,000 times
more likely to be observed under M2.
The score of the more complex model combining M1 and M2 was –35.4, lower
than that of the currently accepted model.
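The "over 13,000 times more likely" figure follows directly from the score gap, assuming the scores are natural-log likelihoods with equal structure priors:

```python
import math

score_m1, score_m2 = -44.0, -34.5
# with equal priors, the Bayes factor is e raised to the score difference
bayes_factor = math.exp(score_m2 - score_m1)   # e^9.5
print(round(bayes_factor))  # ~13360, i.e. over 13,000
```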
Scores for equivalence classes of the three-variable galactose system.
The models fall into two primary
groupings based on their score:
Those that include an edge between Gal80 and Gal2, which score
between –34.1 and –35.4.
Those that don’t, which score between –42.2 and –44.
This supports the claim that Gal80 and Gal2 are unlikely to be
conditionally independent given Gal4.
Extending the Bayesian network model by adding the ability to
annotate edges, in order to represent additional information about the
type of dependence relationships between variables.
4 types of annotations in the context of binary variables:
An unannotated edge from X to Y – a dependence that can be
arbitrary.
A positive edge from X to Y – higher values of X will bias the
distribution of Y higher. For all instantiations J of the variables
P(Y=1|X=1, J) > P(Y=1|X=0, J).
A negative edge from X to Y – higher values of X will bias the
distribution of Y lower. For all instantiations J of the variables
P(Y=1|X=0, J) > P(Y=1|X=1, J).
A negative/positive edge from X to Y – Y’s dependence on X is
either positive or negative, but the true relationship is not known.
As edge annotations describe the relationship between a
variable and a single parent, while Bayesian networks describe
the relationship between a variable and all its parents, an added
requirement is that the implied constraints hold for all possible
values of the other parents.
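A sketch (with a toy CPT of my own) of checking the '+' constraint for an edge X → Y against every instantiation of Y's other parents:

```python
from itertools import product

def edge_sign(p_y1, n_other_parents):
    """Classify edge X -> Y as '+', '-', or None (no consistent sign).

    p_y1 maps (x, j1, .., jk) -> P(Y=1 | X=x, J=j), for binary variables.
    """
    diffs = []
    for j in product((0, 1), repeat=n_other_parents):
        diffs.append(p_y1[(1,) + j] - p_y1[(0,) + j])
    if all(d > 0 for d in diffs):
        return "+"   # higher X biases Y higher for every instantiation J
    if all(d < 0 for d in diffs):
        return "-"   # higher X biases Y lower for every instantiation J
    return None

# P(Y=1 | X, J) for one other binary parent J: higher X always raises Y.
cpt = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}
print(edge_sign(cpt, 1))  # '+'
```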
Advantages to this extended model:
Allows us to represent finer degrees of refinement regarding
the types of relationships between variables but doesn’t
force us to.
Permits a model to evolve as more knowledge is gained
about the types of influences that are present in the
biological system under study – all edges can be initially
unannotated with ‘+’ and ‘–‘ annotations added
incrementally as activators and repressors are identified.
In the models M1 and M2 we allow the edges in each model
to take on all possible combinations of annotations
('+', '-', '+/-'):
Results:
In model M1 adding different kinds of annotations fails to
change the score significantly, as the structure of the
graph is limited in explaining the observed expression
data.
The same effect is observed when the edge between
Gal4 and Gal2 is considered in model M2 – this is
consistent with the result of figure 3, indicating that the
coupling between Gal4 and Gal2 is weak.
In contrast, adding a ‘+’ annotation to the edge between
Gal80 and Gal2 results in a score comparable with
previously achieved results, but adding a ‘-‘ annotation to
the same edge worsens the score.
Conclusions:
This example illustrates that when the constraints implied
by edge annotations cannot be satisfied by the data,
scores result that are as poor as when the given structure
is incorrect.
For this reason annotations serve as a useful discriminator of
the kinds of relationships present in the data.
Precautions when interpreting results:
Although Gal80 is known to act as a repressor in the cell,
this effect is countered by the level of a factor that is
currently unknown and remains unmodeled here.
A complete model would include the effect of this latent
variable, and in such a model, given sufficient data, the
edge between Gal80 and Gal2 would be labeled ‘-‘.
Nevertheless, in the limited model here, a ‘+’ annotation
for this edge is correct, as the level of Gal80 rises concomitantly
with the level of Gal2 in our data.
While Bayesian networks are well suited to dealing robustly
with noisy data, as noise increases, the score difference
between correct and incorrect models goes down.
In the case of uninformative data, correct models will score as
poorly as incorrect ones.
The ability of particular data to enhance the score difference
between models suggests the possibility of performing
“experimental suggestion” in the future – meaning models
could be used to generate suggestions for new experiments,
yielding data that would optimally elucidate a given network.
Disadvantages:
A danger of wrong conclusions given bad or scarce
data or an incomplete model.
Given a small amount of data, it is difficult to find a model
that expresses all features.
Assumptions such as the prior probability used in computing
the score.
Assumptions such as all experiments performed being
independent.
“Using graphical models and genomic expression data to
statistically validate models of genetic regulatory networks” –
Alexander J. Hartemink, David K. Gifford, Tommi S. Jaakkola,
Richard A. Young.
“Using Bayesian networks to analyze expression data” – Nir
Friedman, Iftach Nachman, Michal Linial, Dana Pe’er.
“A Bayesian method for the induction of probabilistic networks
from data” – Gregory F. Cooper, Edward Herskovits.
The presentation deals with using a model with a probabilistic nature to analyze
genomic expression data for genetic regulatory networks that can be
represented in a computational form.
First, an introduction to such a model is presented: the motivation for it, the
obstacles previous models face in analyzing genomic expression data, existing
techniques, and an example of previous models.
Then the new model – the Bayesian network – is introduced: it is first defined,
then its characteristics are described, and then, given the expression data, the
best matching network is found using a scoring method, whose properties and
advantages (such as dealing with latent variables) are described.
Next, an example of using this network with genomic expression data from genes
in the galactose metabolism of S. cerevisiae is described.
Afterwards, the network semantics are extended to include annotated edges and
the previous example of the galactose system is revisited with the modified
network.
The presentation closes with conclusions regarding the usage of the new
model.