Finding Patterns in Protein Sequence and Structure


Master’s course
Bioinformatics Data Analysis and
Tools
Lecture 12: (Hidden) Markov models
Centre for Integrative Bioinformatics
Problem in biology
• Data and patterns are often not clear cut
• When we want to make a method to recognise a
pattern (e.g. a sequence motif), we have to learn
from the data (e.g. maybe there are other
differences between sequences that have the
pattern and those that do not)
• This leads to Data mining and Machine learning
A widely used machine learning
approach: Markov models
Contents:
•Markov chain models (1st order, higher order and
inhomogeneous models; parameter estimation;
classification)
• Interpolated Markov models (and back-off models)
• Hidden Markov models (forward, backward and Baum-Welch algorithms;
model topologies; applications to gene finding and protein family modeling)
Markov Chain Models
• a Markov chain model is defined by:
– a set of states
• some states emit symbols
• other states (e.g. the begin state) are silent
– a set of transitions with associated probabilities
• the transitions emanating from a given state define a
distribution over the possible next states
Preamble: Hidden Markov Chain
Models
1. S - observations
   x1,…,xn - sequence of observations
2. Q - states
   p1,…,pn - hidden sequence of states
   f = (f1,…,fN)T - initial probabilities of the states
3. A = (aij) - transition matrix
4. E = (ei(x)) - emission probabilities
A Simple HMM
Markov Chain Models
• given some sequence x of length L, we can ask how
probable the sequence is given our model
• for any probabilistic model of sequences, we can
write this probability as
Pr(x) = Pr(xL | x1…xL−1) · Pr(xL−1 | x1…xL−2) ··· Pr(x1)
• key property of a (1st order) Markov chain: the
probability of each xi depends only on xi−1, so that
Pr(x) = Pr(xL | xL−1) · Pr(xL−1 | xL−2) ··· Pr(x2 | x1) · Pr(x1)
Markov Chain Models
Pr(cggt) = Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
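As an illustration, here is a minimal Python sketch of this calculation; the probability values in the tables are made up for the example, not estimated from real data.

    # Toy 1st-order Markov chain over DNA; the numbers are illustrative only.
    initial = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}      # Pr(x1)
    transition = {                                              # Pr(xi | xi-1)
        "a": {"a": 0.3, "c": 0.2, "g": 0.3, "t": 0.2},
        "c": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
        "g": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
        "t": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
    }

    def sequence_probability(x):
        """Pr(x) = Pr(x1) * product over i of Pr(xi | xi-1)."""
        p = initial[x[0]]
        for prev, cur in zip(x, x[1:]):
            p *= transition[prev][cur]
        return p

    print(sequence_probability("cggt"))   # Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)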
Markov Chain Models
Can also have an end state, allowing the model to
represent:
• Sequences of different lengths
• Preferences for sequences ending with particular symbols
Markov Chain Models
The transition parameters can be denoted by a_{xi−1, xi}, where
a_{xi−1, xi} = Pr(xi | xi−1)
Similarly, we can denote the probability of a sequence x as
Pr(x) = a_{B, x1} · ∏ i=2…L a_{xi−1, xi}
where a_{B, x1} represents the transition from the begin state
Example Application
• CpG islands
– CG dinucleotides are rarer in eukaryotic genomes than expected given the independent
probabilities of C, G
– In the human genome, CpG (CG) is the least frequent dinucleotide, because the C in
CpG is easily methylated and then tends to mutate into T.
– Methylation is suppressed around genes: CpG appears more frequently within
these regions, which are called CpG islands.
– In particular, the regions upstream of genes are richer in CG dinucleotides than
elsewhere – CpG islands
– Identifying the CpG islands in a genome is therefore important
– useful evidence for finding genes
• Could predict CpG islands with Markov chains
– one to represent CpG islands
– one to represent the rest of the genome
The example combines maximum likelihood and Bayesian statistics for parameter
estimation with an HMM; a small sketch of the two-chain classifier follows below.
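A minimal sketch of how such a two-model classifier might look in Python. The two transition tables would in practice be estimated from known CpG-island and background sequence; the function name and the windowing shown in the comment are hypothetical choices for this sketch.

    from math import log

    def log_odds(x, island_model, background_model):
        """Log-odds score of sequence x: positive values favour the CpG-island model.
        Each model is a dict of dicts giving Pr(xi | xi-1)."""
        score = 0.0
        for prev, cur in zip(x, x[1:]):
            score += log(island_model[prev][cur] / background_model[prev][cur])
        return score

    # usage sketch: scan a genome in fixed-size windows and flag high-scoring ones
    # windows = (genome[i:i+200] for i in range(0, len(genome) - 200, 50))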
Estimating the Model Parameters
• Given some data (e.g. a set of sequences from
CpG islands), how can we determine the
probability parameters of our model?
• One approach: maximum likelihood estimation
– given a set of data D
– set the parameters θ to maximize
Pr(D | θ)
– i.e. make the data D look likely under the model
Maximum Likelihood Estimation
• Suppose we want to estimate the parameters Pr(a),
Pr(c), Pr(g), Pr(t)
• And we’re given the sequences:
accgcgctta
gcttagtgac
tagccgttac
• Then the maximum likelihood estimates are:
Pr(a) = 6/30 = 0.2
Pr(c) = 9/30 = 0.3
Pr(g) = 7/30 = 0.233
Pr(t) = 8/30 = 0.267
In practice, such nucleotide frequency data are derived from genome
sequences
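The counts above can be reproduced with a few lines of Python, showing the maximum likelihood estimate as plain counting and normalising:

    from collections import Counter

    training = ["accgcgctta", "gcttagtgac", "tagccgttac"]

    counts = Counter("".join(training))            # how often each nucleotide occurs
    total = sum(counts.values())                   # 30 characters in total
    mle = {nt: counts[nt] / total for nt in "acgt"}
    print(mle)   # {'a': 0.2, 'c': 0.3, 'g': 0.233..., 't': 0.266...}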
Higher Order Markov Chains
• An nth order Markov chain over some alphabet is
equivalent to a first order Markov chain over the alphabet
of n-tuples
• Example: a 2nd order Markov model for DNA can be
treated as a 1st order Markov model over alphabet:
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA,
TC, TG, and TT (i.e. all possible dinucleotides)
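A small sketch of the equivalence: a 2nd-order transition Pr(c | a, b) becomes a 1st-order transition from pair-state (a, b) to pair-state (b, c). The probability values below are made up for illustration.

    # toy 2nd-order distribution for a single context, values illustrative
    second_order = {("A", "C"): {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

    # re-express as a 1st-order chain over dinucleotide states
    first_order_over_pairs = {
        (a, b): {(b, c): p for c, p in nxt.items()}
        for (a, b), nxt in second_order.items()
    }
    print(first_order_over_pairs[("A", "C")][("C", "G")])   # 0.4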
A Fifth Order Markov Chain
Inhomogeneous Markov Chains
• In the Markov chain models we have considered
so far, the probabilities do not depend on where
we are in a given sequence
• In an inhomogeneous Markov model, we can have
different distributions at different positions in the
sequence
• Consider modeling codons in protein coding
regions
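A minimal sketch of an inhomogeneous chain for codons: one transition table per codon position, cycled through as the sequence is read. The tables here are uniform placeholders; in practice each would be estimated separately from positions 1, 2 and 3 of known codons.

    alphabet = "acgt"
    # toy: one Pr(xi | xi-1) table for each of the three codon positions
    position_models = [
        {prev: {nt: 0.25 for nt in alphabet} for prev in alphabet}
        for _ in range(3)
    ]

    def coding_probability(x, start_prob=0.25):
        """Probability of x under the position-dependent (inhomogeneous) chain."""
        p = start_prob
        for i in range(1, len(x)):
            model = position_models[i % 3]   # transition table for this codon position
            p *= model[x[i - 1]][x[i]]
        return p

    print(coding_probability("atggctaaa"))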
Inhomogeneous Markov Chains
A Fifth Order Inhomogeneous
Markov Chain
Selecting the Order of a
Markov Chain Model
• Higher order models remember more “history”
• Additional history can have predictive value
• Example:
– predict the next word in this sentence fragment
“…finish __” (up, it, first, last, …?)
– now predict it given more history
• “Fast guys finish __”
Selecting the Order of a
Markov Chain Model
• However, the number of parameters we need to
estimate grows exponentially with the order
– for modeling DNA we need on the order of 4^(n+1) parameters for an nth-order
model, with n ≤ 5 normally
• The higher the order, the less reliable we can
expect our parameter estimates to be
– estimating the parameters of a 2nd order homogeneous
Markov chain from the complete genome of E. coli, we
would see each word > 72,000 times on average
– estimating the parameters of an 8th order chain, we
would see each word ~ 5 times on average
Interpolated Markov Models
• The IMM idea: manage this trade-off by
interpolating among models of various orders
• Simple linear interpolation:
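For example, a 2nd-order model might be interpolated with its lower-order counterparts roughly as follows (a sketch; the λ weights sum to 1):
Pr(xi | xi−2, xi−1) = λ2 · P2(xi | xi−2, xi−1) + λ1 · P1(xi | xi−1) + λ0 · P0(xi)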
Interpolated Markov Models
• We can make the weights depend on the history
– for a given order, we may have significantly more
data to estimate some words as compared to others
• General linear interpolation
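A Python sketch of the interpolation idea with history-dependent weights. The function and parameter names are hypothetical, and the weighting scheme is left abstract; it is not GLIMMER's exact rule.

    def interpolated_prob(x, i, estimators, weights):
        """Interpolated estimate of Pr(x[i] | preceding symbols).
        estimators[k](x, i) returns the k-th order Markov estimate;
        weights(x, i) returns one lambda per order (summing to 1) and may be
        made to depend on how much training data supports the current context."""
        lambdas = weights(x, i)
        return sum(lam * est(x, i) for lam, est in zip(lambdas, estimators))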
Gene Finding: Search by Content
•Encoding a protein affects the statistical properties
of a DNA sequence
– some amino acids are used more frequently than
others (Leu more popular than Trp)
– different numbers of codons for different amino
acids (Leu has 6, Trp has 1)
– for a given amino acid, usually one codon is used
more frequently than others
•This is termed codon preference
•Codon preferences vary by species
Codon Preference in E. Coli
AA    codon    /100
--------------------
Gly   GGG      1.89
Gly   GGA      0.44
Gly   GGU     52.99
Gly   GGC     34.55
Glu   GAG     15.68
Glu   GAA     57.20
Asp   GAU     21.63
Asp   GAC     43.26
Search by Content
• Common way to search by content
– build Markov models of coding & noncoding
regions
– apply models to ORFs (Open Reading Frames) or fixed-sized windows of sequence
• GeneMark [Borodovsky et al.]
– popular system for identifying genes in bacterial
genomes
– uses 5th order inhomogeneous Markov chain
models
The GLIMMER System
• Salzberg et al., 1998
• System for identifying genes in bacterial
genomes
• Uses 8th order, inhomogeneous, interpolated
Markov chain models
IMMs in GLIMMER
• How does GLIMMER determine the λ values?
• First, let us express the IMM probability
calculation recursively:
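Roughly, following the GLIMMER paper (notation adapted; treat this as a sketch):
IMMn(xi | xi−n…xi−1) = λn(xi−n…xi−1) · Pn(xi | xi−n…xi−1) + [1 − λn(xi−n…xi−1)] · IMMn−1(xi | xi−n+1…xi−1)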
IMMs in GLIMMER
• If we haven’t seen the context xi−n…xi−1 more than 400 times,
then compare the counts of the next symbol under the nth-order
context with those predicted by the shorter, (n−1)th-order context
(the sequence is read from left to right)
• Use a statistical test (χ²) to get a value d indicating
our confidence that the distributions represented by
the two sets of counts are different [χ² = Σ (O − E)² / E]
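A sketch of such a χ² computation in Python. The 400-count threshold and the exact way GLIMMER turns the resulting confidence into a λ weight are described in the paper; this only shows the statistic itself, and the function name is hypothetical.

    def chi_square(observed_counts, expected_probs):
        """Chi-square statistic: sum over symbols of (O - E)^2 / E,
        where E is the count expected under the lower-order distribution."""
        total = sum(observed_counts.values())
        chi2 = 0.0
        for symbol, observed in observed_counts.items():
            expected = expected_probs[symbol] * total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
        return chi2

    # usage sketch: compare next-base counts after the full 8-mer context with the
    # distribution predicted by the shorter context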
IMMs in GLIMMER
χ² score when comparing the nth-order with the
(n−1)th-order Markov model (preceding slide)
The GLIMMER method
• 8th order IMM vs. 5th order Markov model
• Trained on 1168 genes (ORFs really)
• Tested on 1717 annotated (more or less
known) genes
[Figure: sensitivity plotted against 1 − specificity, with PPV (positive predictive value) reported for each model]
Hidden Markov models (HMMs)
Given say a T in our input sequence, which state emitted it?
A Simple HMM
Hidden Markov Models
1. S - observations
   x1,…,xn - sequence of observations
2. Q - states
   p1,…,pn - hidden sequence of states
   f = (f1,…,fN)T - initial probabilities of the states
3. A = (aij) - transition matrix
4. E = (ei(x)) - emission probabilities
Hidden Markov models (HMMs)
Hidden State
• We will distinguish between the observed parts of a
problem and the hidden parts
• In the Markov models we have considered previously,
it is clear which state accounts for each part of the
observed sequence
• In the model above (preceding slide), there are
multiple states that could account for each part of the
observed sequence
– this is the hidden part of the problem
– states are decoupled from sequence symbols
HMM-based homology searching
HMM for ungapped alignment…
Transition probabilities and Emission probabilities
Gapped HMMs also have insertion and deletion
states (next slide)
Profile HMM: m = match state, i = insert state, d = delete state; go from
left to right. The i and m states output amino acids; the d states are ‘silent’.
[Figure: profile HMM for alignment with insertions and deletions, with match
states m0…m5, insert states i0…i4 and delete states d1…d4, from Start to End]
HMM-based homology searching
• Most widely used HMM-based profile searching
tools currently are SAM-T2K (Karplus et al.,
1998) and HMMER2 (Eddy, 1998)
• formal probabilistic basis and consistent theory
behind gap and insertion scores
• HMMs good for profile searches, but still not
good for alignment (due to parametrization of
the models)
• HMMs are slow
Homology-derived Secondary Structure of Proteins
(HSSP)
Sander & Schneider, 1991
It’s all about trying to push the “don’t know region”
down…
The Parameters of an HMM
HMM for Eukaryotic Gene Finding
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
A Simple HMM
Three Important Questions
• How likely is a given sequence?
the Forward algorithm
• What is the most probable “path” for generating a
given sequence?
the Viterbi algorithm
• How can we learn the HMM parameters given a
set of sequences?
the Forward-Backward (Baum-Welch) algorithm
Three basic problems of HMMs
Once we have an HMM, there are three problems of interest.
(1) The Evaluation Problem
Given an HMM and a sequence of observations, what is the
probability that the observations are generated by the
model?
(2) The Decoding Problem
Given a model and a sequence of observations, what is the
most likely state sequence in the model that produced the
observations?
(3) The Learning Problem
Given a model and a sequence of observations, how should we
adjust the model parameters in order to maximize
Pr(observations | model)?
The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to
continuous recognition as well as to segmentation. The learning problem must be solved if we
want to train an HMM for subsequent use in recognition tasks.
Three main problems
i) Given A+E, what is the probability of (x1, ..., xn)?
forward algorithm
ii) Given A+E+(x1, ..., xn), what is (π1, ..., πn)? Viterbi
algorithm
iii) Given (x1, ..., xn), what are A+E? Baum-Welch algorithm
iv) Given A+E+(x1, ..., xn), what is the most probable state at
step i? forward-backward algorithm
Three Important Questions
• How likely is a given sequence?
Forward algorithm
• What is the most probable “path” for
generating a given sequence?
• How can we learn the HMM parameters
given a set of sequences?
How Likely is a Given Sequence?
• The probability that the path is taken and the
sequence is generated:
• (assuming begin/end are the only silent states on
path)
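As a sketch in the notation used in these slides (state 0 = begin, state N = end, πL+1 = N):
Pr(x, π) = a0,π1 · ∏ i=1…L [ eπi(xi) · aπi,πi+1 ]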
How Likely is a Given Sequence?
The probability over all paths is Pr(x) = Σπ Pr(x, π),
but the number of paths can be exponential in the
length of the sequence...
• the Forward algorithm enables us to compute
this efficiently
How Likely is a Given Sequence:
The Forward Algorithm
• Define fk(i) to be the probability of being in
state k having observed the first i characters of x
• We want to compute fN(L), the probability of being
in the end state having observed all of x
• We can define this recursively
How Likely is a Given Sequence:
The forward algorithm
• Initialisation:
f0(0) = 1 (start state: the probability that we are in the start state
and have observed 0 characters from the sequence),
fk(0) = 0 (other silent states k)
• Recursion:
fl(i) = el(xi) Σk fk(i−1) akl (emitting states),
fl(i) = Σk fk(i) akl (silent states)
• Termination:
Pr(x) = Pr(x1…xL) = fN(L) = Σk fk(L) akN
(the probability that we are in the end state and have observed the entire sequence)
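Below is a minimal Python sketch of this recursion for a model with no silent states other than the implicit begin and end; the dict layout of fi (initial probabilities), A (transitions) and E (emissions) is an assumption of the sketch, matching the Viterbi code later in these notes.

    def forward(x_seq, Q, fi, A, E):
        """Pr(x_seq) summed over all state paths (forward algorithm)."""
        # initialisation: f_k(1) = fi[k] * e_k(x1)
        f = {k: fi[k] * E[k][x_seq[0]] for k in Q}
        # recursion: f_l(i) = e_l(xi) * sum_k f_k(i-1) * a_kl
        for x in x_seq[1:]:
            f = {l: E[l][x] * sum(f[k] * A[k][l] for k in Q) for l in Q}
        # termination: sum over all states (no explicit end state in this sketch)
        return sum(f.values())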
Forward algorithm example
[Figure: at each position, fl(i) is computed by summing over all possible ways of
reaching the considered state]
Three Important Questions
• How likely is a given sequence?
• What is the most probable “path” for
generating a given sequence?
Viterbi algorithm
• How can we learn the HMM parameters
given a set of sequences?
Finding the Most Probable Path:
The Viterbi Algorithm
• Define vk(i) to be the probability of the most
probable path accounting for the first i
characters of x and ending in state k
• We want to compute vN(L), the probability
of the most probable path accounting for all
of the sequence and ending in the end state
• Can be defined recursively
• Can use DP to find vN(L) efficiently
Finding the Most Probable Path:
The Viterbi Algorithm
• Andrew Viterbi used a Manhattan grid model to solve this
decoding problem.
• Every choice of π = π1 … πn
corresponds to a path in the graph.
• The only valid direction in the graph is eastward.
• This graph has N²(n−1) edges, where N is the number of
states.
Finding the Most Probable Path:
The Viterbi Algorithm
Manhattan grid model
Finding the Most Probable Path:
The Viterbi Algorithm
Viterbi – recursive step
What is the probability of the path which ends with qA->qB
and emission EB?
Finding the Most Probable Path:
The Viterbi Algorithm
Viterbi – recursive step
What is the most probable path to the state B in step i?
V(i, q) = eq(xi) · maxs [ V(i−1, s) · as,q ]
Finding the Most Probable Path:
The Viterbi Algorithm
Viterbi – recursive step
def viterbi(x_seq, Q, fi, A, E):
    # T[q] = (probability of the best path ending in state q, that path)
    T = {}
    for q in Q:                                   # initialisation with the first symbol
        T[q] = (fi[q] * E[q][x_seq[0]], [q])
    for x in x_seq[1:]:                           # recursion over the remaining symbols
        U = {}
        for q_next in Q:
            max_path, max_p = None, 0.0
            for q in Q:                           # best predecessor state for q_next
                prob, path = T[q]
                p = prob * A[q][q_next] * E[q_next][x]
                if p > max_p:
                    max_p, max_path = p, path + [q_next]
            U[q_next] = (max_p, max_path)
        T = U
    # termination: most probable path over all final states
    return max(T.values(), key=lambda t: t[0])
Finding the Most Probable Path:
The Viterbi Algorithm
Initialisation:
v0(0) = 1 (start), vk(0) = 0 for all other states k
Recursion for emitting states (i =1…L):
Recursion for silent states:
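A sketch of the recursions in the same notation as the forward algorithm (Durbin et al.):
Emitting states (i = 1…L): vl(i) = el(xi) · maxk [ vk(i−1) · akl ]
Silent states: vl(i) = maxk [ vk(i) · akl ]
Termination: Pr(x, π*) = maxk [ vk(L) · akN ]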
Finding the Most Probable Path:
The Viterbi Algorithm
Three Important Questions
• How likely is a given sequence? (clustering)
• What is the most probable “path” for
generating a given sequence? (alignment)
• How can we learn the HMM parameters given
a set of sequences?
The Baum-Welch Algorithm
The Learning Problem
Generally, the learning problem is how to adjust the HMM
parameters, so that the given set of observations (called the
training set) is represented by the model in the best way for the
intended application.
Thus the "quantity" we wish to optimize during the learning process can differ
from application to application. In other words, there may be several optimization
criteria for learning, out of which a suitable one is selected depending on the
application.
There are two main optimization criteria found in the literature;
Maximum Likelihood (ML) and Maximum Mutual Information
(MMI).
The Learning Task
• Given:
– a model
– a set of sequences (the training set)
• Do:
– find the most likely parameters to explain
the training sequences
• The goal is to find a model that generalizes well
to sequences we haven't seen before
Learning Parameters
• If we know the state path for each training
sequence, learning the model parameters is simple
– no hidden state during training
– count how often each parameter is used
– normalize/smooth to get probabilities
– process just like it was for Markov chain models
• If we don’t know the path for each training
sequence, how can we determine the counts?
– key insight: estimate the counts by considering every
path weighted by its probability
Learning Parameters:
The Baum-Welch Algorithm
• An EM (expectation maximization) approach, a
forward-backward algorithm
• Algorithm sketch:
– initialize parameters of model
– iterate until convergence
• Calculate the expected number of times each
transition or emission is used
• Adjust the parameters to maximize the likelihood
of these expected values
• An important feature of Baum-Welch is that it
always converges (though possibly to a local maximum)
The Expectation step
• First, we need to know the probability of the i-th
symbol being produced by state k, given sequence x:
Pr(pi = k | x)
• Given this, we can compute our expected counts for
state transitions and character emissions
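A Python sketch of how these posteriors can be obtained by combining forward and backward values. As before, the dict layout of fi, A and E and the absence of silent states are assumptions of the sketch; it reuses the forward recursion shown earlier.

    def backward(x_seq, Q, A, E):
        """Backward values b_k(i) = Pr(x_{i+1} ... x_L | state k at position i)."""
        L = len(x_seq)
        bs = [dict() for _ in range(L)]
        bs[L - 1] = {k: 1.0 for k in Q}                 # initialisation at the last position
        for i in range(L - 2, -1, -1):                  # recursion, right to left
            bs[i] = {k: sum(A[k][l] * E[l][x_seq[i + 1]] * bs[i + 1][l] for l in Q)
                     for k in Q}
        return bs

    def posteriors(x_seq, Q, fi, A, E):
        """Pr(p_i = k | x) = f_k(i) * b_k(i) / Pr(x), combining forward and backward."""
        fs = [{k: fi[k] * E[k][x_seq[0]] for k in Q}]   # forward initialisation
        for x in x_seq[1:]:                             # forward recursion
            prev = fs[-1]
            fs.append({l: E[l][x] * sum(prev[k] * A[k][l] for k in Q) for l in Q})
        px = sum(fs[-1].values())                       # Pr(x) (no explicit end state)
        bs = backward(x_seq, Q, A, E)
        return [{k: fs[i][k] * bs[i][k] / px for k in Q} for i in range(len(x_seq))]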
The Expectation step
The Backward Algorithm
The Expectation step
The Maximization step
..make it a probability
through dividing by the
total number of
emissions out of state k
The Maximization step
..make it a probability
through dividing by the
total number of
transitions out of state k
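A small sketch of the maximisation step as plain normalisation of the expected counts; the count dictionaries would come from the expectation step, and the function name is hypothetical.

    def m_step(expected_transitions, expected_emissions):
        """Turn expected counts into probabilities by dividing by the per-state totals."""
        A = {k: {l: c / sum(row.values()) for l, c in row.items()}
             for k, row in expected_transitions.items()}    # total transitions out of k
        E = {k: {s: c / sum(row.values()) for s, c in row.items()}
             for k, row in expected_emissions.items()}       # total emissions out of k
        return A, E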
The Baum-Welch Algorithm
• Initialize parameters of model
• Iterate until convergence
– calculate the expected number of times each transition or
emission is used
– adjust the parameters to maximize the likelihood of these
expected values
• This algorithm will converge to a local maximum
(in the likelihood of the data given the model)
• Usually in a fairly small number of iterations
References
• bioalgorithms.info
• Wikipedia
• Durbin, R., Eddy, S.R., Krogh, A. & Mitchison, G. (1998). Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge
University Press. ISBN 0-521-62971-3 (pbk).