
Modeling biological data and structure with probabilistic networks I
Yuan Gao, Ph.D.
11/05/2002
Slides prepared from text material by Simon Kasif and Arthur Delcher of UIC and Loyola College
Computational Methods in Molecular Biology
Probabilistic networks
• Also known as Bayes’ networks.
• Bayes’ networks are graphical models of probability distributions that provide a convenient medium for scientists to:
– Express their knowledge of the domain as a collection of probabilistic or causal rules that describe biological sequences. These rules capture our intuition about the information content of biological sequences.
– Learn the exact structure of the network and the associated probabilities that govern the probabilistic processes generating the sequences, by directly observing data.
Probabilistic networks
• Bayes’ Law
– P(X|Y) = P(X, Y)/P(Y), or equivalently
– P(X, Y) = P(X|Y)P(Y)
where P(X, Y) is the joint probability distribution of the random variables X and Y.
This generalizes to more random variables (the chain rule):
P(X1, X2, …, XN) = P(X1|X2, …, XN) P(X2|X3, …, XN) … P(XN-1|XN) P(XN)
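As a sanity check, here is a minimal Python sketch (the joint table is invented, not from the slides) that verifies the chain-rule factorization numerically:

```python
from itertools import product

# Hypothetical joint distribution P(X1, X2, X3) over three binary variables.
vals = [0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.25, 0.10]  # sums to 1
joint = dict(zip(product([0, 1], repeat=3), vals))

def marginal(keep, dist):
    """Sum the joint over the variables not listed in `keep` (0-based indices)."""
    out = {}
    for assignment, p in dist.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_x3 = marginal([2], joint)
p_x23 = marginal([1, 2], joint)

for x1, x2, x3 in product([0, 1], repeat=3):
    # Chain rule: P(x1, x2, x3) = P(x1 | x2, x3) * P(x2 | x3) * P(x3)
    p_x1_given_23 = joint[(x1, x2, x3)] / p_x23[(x2, x3)]
    p_x2_given_3 = p_x23[(x2, x3)] / p_x3[(x3,)]
    reconstructed = p_x1_given_23 * p_x2_given_3 * p_x3[(x3,)]
    assert abs(reconstructed - joint[(x1, x2, x3)]) < 1e-12
print("chain-rule factorization reproduces the joint distribution")
```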
Probabilistic networks
• X and Y are independent iff
– P(X|Y) = P(X), or equivalently
– P(X, Y) = P(X)P(Y)
X and Y are conditionally independent given Z iff
P(X|Y, Z) = P(X|Z), or equivalently
P(X, Y|Z) = P(X|Z)P(Y|Z)
The notion of independence can be generalized to sets of random variables.
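A small sketch (with invented conditional probability tables) illustrating the definition: here X and Y are conditionally independent given Z by construction, yet not marginally independent:

```python
from itertools import product

# Hypothetical CPTs for Z -> X and Z -> Y.
p_z = {0: 0.3, 1: 0.7}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}   # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

# Conditional independence: P(X, Y | Z) = P(X | Z) P(Y | Z) for every assignment.
for x, y, z in product([0, 1], repeat=3):
    p_xy_given_z = joint[(x, y, z)] / p_z[z]
    assert abs(p_xy_given_z - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12

# Marginally, P(X, Y) generally differs from P(X) P(Y).
p_xy = {(x, y): sum(joint[(x, y, z)] for z in (0, 1)) for x, y in product([0, 1], repeat=2)}
p_x = {x: sum(p_xy[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(p_xy[(x, y)] for x in (0, 1)) for y in (0, 1)}
print(p_xy[(0, 0)], "vs", p_x[0] * p_y[0])   # unequal: not marginally independent
```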
Probabilistic networks
• Let’s consider a sequence of random variables X1, X2, …, XN.
• Assume that each Xi is conditionally independent of X1, X2, …, Xi-2 given Xi-1,
• Or equivalently,
P(Xi|X1, …, Xi-1) = P(Xi|Xi-1)
– This is often referred to as the Markov assumption.
In other words, the future depends only on the present; it has no memory of the entire past!
Probabilistic networks
• Under the Markov assumption,
– P(X1, X2, …, XN) = P(X1|X2) P(X2|X3) … P(XN-1|XN) P(XN)
– Graphically, these dependencies can be represented as a chain graph:
X1 → X2 → X3 → X4 → X5, with edge labels P(X2|X1), P(X3|X2), P(X4|X3), P(X5|X4).
Figure 1. Chain Model. An example.
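A minimal sketch of evaluating such a chain model, using the equivalent forward factorization P(X1) P(X2|X1) … P(XN|XN-1) and made-up transition numbers:

```python
# Chain-model joint probability over binary states (all numbers invented).
p_x1 = {0: 0.5, 1: 0.5}
p_next = {0: {0: 0.8, 1: 0.2},   # p_next[prev][cur] = P(Xi = cur | Xi-1 = prev)
          1: {0: 0.3, 1: 0.7}}

def chain_prob(xs):
    """Probability of a state sequence under the first-order chain model."""
    p = p_x1[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= p_next[prev][cur]
    return p

print(chain_prob([0, 0, 1, 1, 1]))  # 0.5 * 0.8 * 0.2 * 0.7 * 0.7
```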
Chain Models in Biology
• A simple problem:
– Given a set of homologous protein sequences D, a so-called training set, and a new protein sequence, what is the likelihood that the new sequence is a member of the set (family) D?
• Let’s start with a few assumptions, which can be removed easily:
– The sequences are of length L
• A standard solution: devise a simple probabilistic network model of the sequences in D
A standard solution with an independence assumption
• First associate a random variable Xi with each position 1 <= i <= L, where the values of Xi range over the 20 amino acids.
– Assume the model has the topology shown in Figure 2: the five nodes X1, X2, X3, X4, X5 with no edges between them.
Figure 2. Simplest possible network, assuming independence of the random variables.
A standard solution with an independence assumption
• The distribution of Xi is intended to model the relative frequency of each amino acid in position i, independently of the other positions.
• The joint probability of X1, …, XL is defined by
– P(X1, …, XL) = P(X1)P(X2)…P(XL)
• This is a standard model in biology and is typically expressed as a “Consensus Matrix”
A standard solution with an independence assumption
• Generative interpretation of the model
– Generating a database of protein sequences using the model (a minimal code sketch follows below):
(1) For i = 1 to L, generate a random value for the ith position in the sequence using the probability distribution associated with the ith variable.
(2) Repeat step 1, generating many more new sequences.
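A minimal Python sketch of this generative procedure, with a hypothetical length-3 model and invented per-position distributions:

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def sample_sequence(position_dists):
    """position_dists[i] maps amino acid -> probability for position i."""
    return "".join(
        random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
        for dist in position_dists
    )

# Toy model of length L = 3 (probabilities invented for illustration).
uniform = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
model = [{"M": 0.9, "L": 0.1}, uniform, {"K": 0.5, "R": 0.5}]

database = [sample_sequence(model) for _ in range(5)]   # step (2): repeat
print(database)
```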
• Now we ask: what is the best network in this simple class of networks that models the sequences in our database as closely as possible? This is referred to as the Learning Problem for Bayes’ networks.
A standard solution with an independence assumption
• More formally, we want to find a probabilistic model M such that P(M|D) is maximized. Mathematically, using Bayes’ Law, we want to maximize
P(D|M)P(M)/P(D)
– Assuming each model is equally likely, this is equivalent to finding a model that maximizes
P(D|M).
– Again assuming each sequence is generated independently, we can maximize the product of the P(Di|M), where Di is the ith sequence.
• Since the model is of the form
P(X1, …, XL) = P(X1)P(X2)…P(XL)
• by simple differentiation, the model that maximizes the probability of the database D is given by recording the individual frequencies of each amino acid at each position. In other words, learning the BEST model that generates the data is done simply by recording the relative frequency of each amino acid in each position!
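A minimal sketch of this learning step, using a tiny invented training set: the maximum-likelihood model is just the table of per-position relative frequencies (the columns of the weight matrix):

```python
from collections import Counter

# Invented training set of aligned sequences, all of length L.
training = ["MKVL", "MKIL", "MRVL", "LKVL"]
L = len(training[0])

model = []
for i in range(L):
    counts = Counter(seq[i] for seq in training)     # amino-acid counts at position i
    total = sum(counts.values())
    model.append({aa: c / total for aa, c in counts.items()})

print(model[0])   # e.g. {'M': 0.75, 'L': 0.25} -- one column of the weight matrix
```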
A standard solution with an independence assumption
• Now return to our problem: given a new protein sequence x = x1, x2, …, xL, we evaluate its probability of being generated by the model as
– P(x|M) = P(x1)P(x2)…P(xL)
• Typically, a score is computed as
– P(x|M)/Pr(x) or log[P(x|M)/Pr(x)]
– where Pr(x) is the probability that x has been generated “at random”
• The score is then compared to some “carefully” chosen threshold.
• This is known as the “Consensus Matrix” approach or “Weight Matrix” approach.
• Is this a Markov model? It is a zero-th order Markov model, because of the independence assumption.
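A minimal sketch of the scoring step, assuming a uniform background Pr(x) over the 20 amino acids and a small pseudocount (an assumption, not stated in the slides) to avoid taking the log of zero:

```python
import math

BACKGROUND = 1.0 / 20   # uniform background probability per amino acid

def log_odds_score(x, model, pseudocount=1e-3):
    """log [ P(x | M) / Pr(x) ] for a per-position frequency model."""
    score = 0.0
    for i, aa in enumerate(x):
        p = model[i].get(aa, 0.0) + pseudocount
        score += math.log(p / BACKGROUND)
    return score

# Invented per-position frequencies, as learned above.
model = [{"M": 0.75, "L": 0.25}, {"K": 0.75, "R": 0.25},
         {"V": 0.75, "I": 0.25}, {"L": 1.0}]
print(log_odds_score("MKVL", model))   # compare against a chosen threshold
```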
A standard solution with an independence assumption
• A few notes
– A simple model requires few parameters
– A complex model requires more parameters
– However, the simple learning methodology is valid for arbitrary fixed (predetermined) structure Bayes’ networks without hidden variables.
A slight generalization to a first-order Markov model
• Removing the independence assumption from the previous zero-th order Markov model and adopting the Markov assumption, we need to record the frequencies of the amino-acid pairs for each position in the protein sequence.
• With this model, given a new protein sequence x = x1, x2, …, xL, we evaluate the probability that x has been generated by the model as the product P(x1|x2) P(x2|x3) … P(xL-1|xL) P(xL)
• This generalizes the consensus matrix approach for biological modeling. This approach was followed by Salzberg and produced a statistically significant improvement in the false positive rate of detecting donor/acceptor sites in DNA sequences.
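A minimal sketch of this first-order positional model, using the equivalent forward factorization P(x1) P(x2|x1) … P(xL|xL-1), pair counts from a tiny invented training set, and a pseudocount for unseen pairs:

```python
from collections import Counter
import math

training = ["MKVL", "MKIL", "MRVL", "LKVL"]   # invented training set, length L
L = len(training[0])
PSEUDO = 1e-3

first = Counter(seq[0] for seq in training)                                  # counts for x1
pairs = [Counter((seq[i - 1], seq[i]) for seq in training) for i in range(1, L)]   # positional pair counts
prev_counts = [Counter(seq[i - 1] for seq in training) for i in range(1, L)]

def log_prob(x):
    """log P(x) under the first-order positional Markov model."""
    lp = math.log(first[x[0]] / len(training) + PSEUDO)
    for i in range(1, L):
        cond = pairs[i - 1][(x[i - 1], x[i])] / max(prev_counts[i - 1][x[i - 1]], 1)
        lp += math.log(cond + PSEUDO)
    return lp

print(log_prob("MKVL"), log_prob("AAAA"))   # family member vs. unrelated sequence
```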
Modeling hidden state systems
with probabilistic networks
• Now consider a very simple hidden Markov model (HMM) for protein sequence modeling.
– Assume each amino acid in the protein can be in one of three states: alpha-helix, beta-sheet, or coil.
– With each state Si, we associate a set of transition probabilities. Let Ai,j denote the probability of moving from state Si to state Sj.
– With each state, we also associate a matrix of “output” probabilities. That is, in each state, an amino acid is “outputted” or “emitted” according to some probability distribution. Let Bi,k denote the probability of generating the kth amino acid in state Si.
Modeling hidden state systems
with probabilistic networks
Figure 3. Simple HMM for generating protein sequences, with the three states Helix, Sheet, and Coil.
A simple HMM for protein sequences
• A generative model of proteins
– To generate protein sequences, we simply follow the algorithm below. Let L be the length of the protein sequence.
(1) Set the current state S = coil.
(2) Output an amino acid according to the output distribution of S, BS,k.
(3) Transition to a new state S according to the probability distribution associated with the current state S, AS,j.
(4) Repeat steps 2-3 L-1 more times.
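A minimal sketch of this generative algorithm; the transition table A, emission table B, and toy amino-acid alphabet below are invented for illustration:

```python
import random

STATES = ["helix", "sheet", "coil"]
A = {  # A[s][t] = probability of moving from state s to state t
    "helix": {"helix": 0.8, "sheet": 0.05, "coil": 0.15},
    "sheet": {"helix": 0.05, "sheet": 0.8, "coil": 0.15},
    "coil":  {"helix": 0.15, "sheet": 0.15, "coil": 0.7},
}
B = {  # B[s] = amino-acid emission distribution in state s (toy alphabet)
    "helix": {"A": 0.5, "L": 0.4, "G": 0.1},
    "sheet": {"V": 0.5, "I": 0.4, "G": 0.1},
    "coil":  {"G": 0.5, "P": 0.3, "A": 0.2},
}

def generate(L):
    state = "coil"                                                            # step (1)
    seq = []
    for _ in range(L):
        emissions = B[state]
        seq.append(random.choices(list(emissions),
                                  weights=list(emissions.values()))[0])       # step (2)
        state = random.choices(STATES,
                               weights=[A[state][t] for t in STATES])[0]      # step (3)
    return "".join(seq)

print(generate(20))
```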
A probabilistic analog of the HMM
• First, fix the length of the sequences to L.
• Associate a random variable PSi with the state in position i.
• Associate a random variable Ei with the amino acid “generated” in position i.
• In fact, every HMM has an equivalent probabilistic network that is a two-layer tree network.
• Note:
– during learning, all variables are fully observable.
– during inference, the values of the state variables are hidden.
[Two-layer network: a chain of state variables … → PSi-1 → PSi → PSi+1 → …, with each state variable PSi pointing to its emission variable Ei (…, Ei-1, Ei, Ei+1, …).]
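A minimal sketch of the factorization implied by this two-layer network, P(PS, E) = P(PS1) · prod P(PSi|PSi-1) · prod P(Ei|PSi), with invented transition and emission tables:

```python
import math

# Invented tables for a two-state, two-letter toy example.
A = {"helix": {"helix": 0.9, "coil": 0.1}, "coil": {"helix": 0.2, "coil": 0.8}}
B = {"helix": {"A": 0.7, "G": 0.3}, "coil": {"A": 0.2, "G": 0.8}}
START = {"helix": 0.5, "coil": 0.5}

def log_joint(states, emissions):
    """log P(PS1..PSL, E1..EL) under the two-layer factorization."""
    lp = math.log(START[states[0]]) + math.log(B[states[0]][emissions[0]])
    for i in range(1, len(states)):
        lp += math.log(A[states[i - 1]][states[i]])   # top layer: state chain
        lp += math.log(B[states[i]][emissions[i]])    # bottom layer: emissions
    return lp

# During learning both layers are observed; during inference the states would be
# hidden and summed or maximized over instead of being supplied directly.
print(log_joint(["coil", "helix", "helix"], ["G", "A", "A"]))
```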