Hidden Markov Models:
an Introduction
by Rachel Karchin
Outline
– Stochastic modeling
– Discrete time series
– Simple Markov models
– Hidden Markov models
– Summary of key HMM algorithms
– Modeling protein families with linear profile HMMs
– Overfitting and regularization
– To come
BME100 9/28/01
References
Lectures from David Haussler’s
CMPS243 class (Winter 1998)
Stochastic Modeling
Stochastic modeling is for phenomena that exhibit random behavior.
Random doesn't mean arbitrary: random events can be modeled by some probability distribution.
They are represented with random variables that take on values (numeric or symbolic) according to event outcomes.
General Discrete Time Series
Chain of random variables: X1, X2, X3, . . . , Xn
Sequence of observed values: x = x1, x2, x3, . . . , xn
When we observe x, we say that:
X1 = x1, X2 = x2, X3 = x3, . . . , Xn = xn
Simple Markov model of order k
The probability distribution for Xt depends only on the values of the previous k random variables:
Xt-1, Xt-2, . . . , Xt-k
Simple Markov model of order k
Example with k = 1 and alphabet {a, b}
Observed sequence: x = abaaababbaa
Model:

Prev  Next  Prob
a     a     0.7
a     b     0.3
b     a     0.5
b     b     0.5

Start probs: P(a) = 0.5, P(b) = 0.5

P(x) = 0.5 * 0.3 * 0.5 * 0.7 * 0.7 * 0.3 * 0.5 * 0.3 * 0.5 * 0.5 * 0.7
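This computation can be sketched in a few lines of Python. The probabilities come from the table above; the variable names (`start`, `trans`, `markov_prob`) are mine:

```python
# Score a sequence under the order-1 Markov model above:
# start probability of the first letter, then one transition per step.
start = {"a": 0.5, "b": 0.5}
trans = {("a", "a"): 0.7, ("a", "b"): 0.3,
         ("b", "a"): 0.5, ("b", "b"): 0.5}

def markov_prob(x):
    """P(x) = P(x1) * P(x2|x1) * ... * P(xn|xn-1)."""
    p = start[x[0]]
    for prev, nxt in zip(x, x[1:]):
        p *= trans[(prev, nxt)]
    return p

print(markov_prob("abaaababbaa"))  # ~2.9e-4
```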
What is a hidden Markov model?
Finite set of hidden states.
At each time t, the system is in a hidden state, chosen at random depending on the state chosen at time t-1.
At each time t, an observed letter is generated at random, depending only on the current hidden state.
HMM for random toss of fair and
biased coins
[HMM diagram: Start moves to Fair or Biased with probability 0.5 each. Fair emits P(H) = 0.5, P(T) = 0.5; Biased emits P(H) = 0.1, P(T) = 0.9. Each state returns to itself with probability 0.8 and switches to the other state with probability 0.2.]
Observed sequence: x = HTTHTTTTHHTH
Sequence of states: q = FFFFBBBFFFFF
HMM for random toss of fair and
biased coins
The sequence of states is a first-order Markov model, but it is usually hidden from us.
We observe the effects, which are statistically correlated with the states.
We use these correlations to decode the state sequence.
HMM for fair and biased coins
With complete information, can compute:
P(x,q) = 0.5 * 0.5 * 0.8 * 0.5 * 0.8 * 0.5 * 0.8 * 0.5 * 0.2 * 0.9 . . .
Observed sequence: x = HTTHTTTTHHTH
Sequence of states: q = FFFFBBBFFFFF
Otherwise:
P(x) = Σq P(x, q), summing over all possible state sequences q.
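Both quantities can be sketched for the fair/biased coin HMM. The probabilities are the model's from the diagram; the names (`joint_prob`, `total_prob`, etc.) are mine, and the sum over q is done by brute force, which is only feasible for short sequences:

```python
from itertools import product

start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.8, ("F", "B"): 0.2,
         ("B", "B"): 0.8, ("B", "F"): 0.2}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.1, "T": 0.9}}

def joint_prob(x, q):
    """P(x, q): start term, then alternating transition and emission terms."""
    p = start[q[0]] * emit[q[0]][x[0]]
    for t in range(1, len(x)):
        p *= trans[(q[t - 1], q[t])] * emit[q[t]][x[t]]
    return p

def total_prob(x):
    """P(x) = sum of P(x, q) over every possible state sequence q."""
    return sum(joint_prob(x, q) for q in product("FB", repeat=len(x)))

x, q = "HTTHTTTTHHTH", "FFFFBBBFFFFF"
print(joint_prob(x, q), total_prob(x))
```

The forward algorithm (next slide) computes the same P(x) without enumerating all 2^n state sequences.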
Three key HMM algorithms
Forward algorithm. Given observed
sequence x and an HMM M, calculate
P(x|M).
Viterbi algorithm. Given x and M,
calculate the most likely state
sequence q.
Forward-backward algorithm. Given
many observed sequences, estimate
the parameters of the HMM.
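The first of these can be sketched on the coin HMM from earlier. This is a minimal illustration, not a library implementation; the dictionary layout is my own choice:

```python
# Forward algorithm: f[s] holds P(x1..xt, state at time t = s).
# Summing f over states at the end gives P(x|M).
start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.8, ("F", "B"): 0.2,
         ("B", "B"): 0.8, ("B", "F"): 0.2}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.1, "T": 0.9}}

def forward_prob(x):
    # Initialize with the start probability and first emission.
    f = {s: start[s] * emit[s][x[0]] for s in "FB"}
    for letter in x[1:]:
        # Sum over predecessor states, then emit the next letter.
        f = {s: sum(f[r] * trans[(r, s)] for r in "FB") * emit[s][letter]
             for s in "FB"}
    return sum(f.values())

print(forward_prob("HTTHTTTTHHTH"))
```

Unlike the brute-force sum over state sequences, this runs in time linear in the sequence length.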
Some HMM Topologies
Modeling protein families with
linear profile HMMs
Observed sequence is the amino
acid sequence of a protein.
Typically want to model a group of
related proteins.
Model states and transitions will be
based on a multiple alignment of the
group.
No transitions from right to left.
From multiple alignment to
profile HMM
NF.....A-
DF.....SY
NYrqsanS-
NFapistAY
DFvlamrSF
Good model of these proteins must reflect:
– highly conserved positions in the alignment
– variable regions in the alignment
– varying lengths of protein sequences
From multiple alignment to
profile HMM
NF.....A-
DF.....SY
NYrqsanS-
NFapistAY
DFvlamrSF

Three kinds of states: match, insert, and silent (delete).
[Profile HMM diagram: four match states. Match 1 emits P(N) = 0.6, P(D) = 0.4; match 2 emits P(F) = 0.8, P(Y) = 0.2; match 3 emits P(S) = 0.6, P(A) = 0.4; match 4 emits P(Y) = 0.67, P(F) = 0.33. The insert state emits the inserted letters, e.g. P(A) = 0.2, P(R) = 0.13, P(Q) = 0.07. Transition probabilities (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) label the arrows.]
Finding probability of a sequence
with an HMM
Once we have an HMM for a group of
proteins, we are often interested in
how well a new sequence fits the
model.
We want to compute a probability for
our sequence with respect to the
model.
One sequence, many paths
DYAF
A protein sequence can be represented by many paths through the HMM.
[Profile HMM diagram as before, with one possible path for DYAF highlighted.]
One sequence, many paths
DYAF
[Profile HMM diagram as before, with another possible path for DYAF highlighted.]
Finding the probability of a
sequence with an HMM
Not knowing the state sequence q, we have to use either the forward or the Viterbi algorithm.
Basic recurrence relation for Viterbi:
  P(vt) = probability of the most probable path that ends in state qt and emits observation xt
  P(v0) = 1
  P(vt) = max over qt-1 of [ P(vt-1) * P(qt | qt-1) * P(xt | qt) ]
Compute with dynamic programming.
Most probable path: Viterbi algorithm
for t = 1 to n:
  P(vt) = max over qt-1 of [ P(vt-1) * P(qt | qt-1) * P(xt | qt) ]
Observed sequence: DYAF
[Slide fills in the dynamic-programming table of P(v1) through P(v4) for the letters D, Y, A, F over the profile HMM's match states M1–M4 and insert states I1–I4. For example, at t = 1 the letter D scores P(v1) = 1.0 * 1.0 * 0.4 = 0.4 in state M1 and 0 in every other state; the later columns are filled in the same way, and the most probable path is read off from the final column.]
Overfitting problems
Our toy example illustrates a problem with estimating probability distributions from small samples.
P(aa other than D or N) = 0 at position 1.
Family members that don't begin with D or N can't be recognized by the model.
[Figure: probability distribution in match state 1.]
Model regularization
Use pseudocounts.
If an amino acid does not appear in a column of the alignment, give it a fake count.

NF.....A-
DF.....SY
NYrqsanS-
NFapistAY
DFvlamrSF

P(A in col 1) = (observed counts of A in column 1 + pseudocounts of A in column 1) / (observed counts over all amino acids in column 1 + pseudocounts over all amino acids in column 1)
             = (0 + 1) / (5 + 18) = 1/23

P(N in col 1) = (observed counts of N in column 1 + pseudocounts of N in column 1) / (observed counts over all amino acids in column 1 + pseudocounts over all amino acids in column 1)
             = (3 + 0) / (5 + 18) = 3/23
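One reading consistent with the slide's numbers (A gets 0 + 1, N gets 3 + 0, and the pseudocounts total 18) is that each of the 18 amino acids absent from column 1 gets a pseudocount of 1, while observed amino acids get none. A sketch under that assumption, with my own names:

```python
# Pseudocount smoothing for one alignment column: absent amino acids
# each get a fake count of 1 (18 of them here), observed ones get 0.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
observed = {"N": 3, "D": 2}  # column 1 of the toy alignment

def smoothed_prob(aa):
    pseudo = {a: 0 if a in observed else 1 for a in AMINO_ACIDS}
    num = observed.get(aa, 0) + pseudo[aa]
    den = sum(observed.values()) + sum(pseudo.values())  # 5 + 18 = 23
    return num / den

print(smoothed_prob("A"))  # 1/23
print(smoothed_prob("N"))  # 3/23
```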
Model regularization
Pseudocounts smooth the column probability distributions.
In practice, pseudocounts are often added by fitting the column to a set of typical amino acid distributions found in the columns of protein multiple alignments.
[Figure: probability distribution in match state 1.]
To come:
HMMs can be used to automatically
produce a high-quality multiple
alignment.
Active areas of research:
– Building HMMs that can recognize very
distantly related proteins
– Multi-track HMMs