
Hidden Markov Models:
an Introduction
by Rachel Karchin
Outline
– Stochastic modeling
– Discrete time series
– Simple Markov models
– Hidden Markov models
– Summary of key HMM algorithms
– Modeling protein families with linear profile HMMs

Outline
– Overfitting and regularization
– To come
References
– Lectures from David Haussler's CMPS243 class (Winter 1998)
Stochastic Modeling
– Stochastic modeling is for phenomena that exhibit random behavior.
– Random doesn't mean arbitrary.
– Random events can be modeled by some probability distribution.
– Events are represented with random variables that take on values (numeric or symbolic) according to event outcomes.
General Discrete Time Series
– A chain of random variables: X1, X2, X3, ..., Xn
– A sequence of observed values: x = x1, x2, x3, ..., xn
– When we observe x, we say that X1 = x1, X2 = x2, X3 = x3, ..., Xn = xn
Simple Markov model of order k
– The probability distribution for Xt depends only on the values of the previous k random variables: Xt-1, Xt-2, ..., Xt-k
Simple Markov model of order k
– Example with k = 1 and Xt ∈ {a, b}
– Observed sequence: x = abaaababbaa

Model (transition probabilities):

  Prev  Next  Prob
  a     a     0.7
  a     b     0.3
  b     a     0.5
  b     b     0.5

Start probabilities: P(a) = 0.5, P(b) = 0.5

P(x) = 0.5 * 0.3 * 0.5 * 0.7 * 0.7 * 0.3 * 0.5 * 0.3 * 0.5 * 0.5 * 0.7
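The product is one start probability followed by one transition probability per step. A minimal Python sketch of that computation (the dictionary layout and function name are mine, not the lecture's):

```python
# Order-1 Markov model from the slide.
start = {"a": 0.5, "b": 0.5}
trans = {("a", "a"): 0.7, ("a", "b"): 0.3,
         ("b", "a"): 0.5, ("b", "b"): 0.5}

def markov_prob(x, start, trans):
    """P(x): start probability of x[0], then one transition factor per step."""
    p = start[x[0]]
    for prev, nxt in zip(x, x[1:]):
        p *= trans[(prev, nxt)]
    return p

print(markov_prob("abaaababbaa", start, trans))  # ~2.9e-4
```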
What is a hidden Markov model?
– A finite set of hidden states.
– At each time t, the system is in a hidden state, chosen at random depending on the state chosen at time t-1.
– At each time t, an observed letter is generated at random, depending only on the current hidden state.
HMM for random toss of fair and biased coins

[State diagram: Start enters Fair or Biased with probability 0.5 each. Fair emits P(H) = 0.5, P(T) = 0.5, stays Fair with probability 0.8, and switches to Biased with probability 0.2. Biased emits P(H) = 0.1, P(T) = 0.9, stays Biased with probability 0.8, and switches to Fair with probability 0.2.]

Observed sequence: x = HTTHTTTTHHTH
Sequence of states: q = FFFFBBBFFFFF
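Because the model is generative, we can sample from it directly. A small sketch (function names are mine; the tables transcribe the diagram above):

```python
import random

# Coin HMM from the slide.
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.8, "B": 0.2}, "B": {"B": 0.8, "F": 0.2}}
emit  = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.1, "T": 0.9}}

def pick(dist):
    """Draw one key from a {value: probability} dictionary."""
    return random.choices(list(dist), weights=dist.values())[0]

def sample(n):
    """Generate n observations plus the hidden state path behind them."""
    q, x = [pick(start)], []
    for _ in range(n):
        x.append(pick(emit[q[-1]]))   # emit from the current hidden state
        q.append(pick(trans[q[-1]]))  # then move to the next hidden state
    return "".join(x), "".join(q[:-1])

x, q = sample(12)
print(x, q)  # e.g. HTTHTTTTHHTH / FFFFBBBFFFFF
```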
HMM for random toss of fair and biased coins
– The sequence of states is a first-order Markov model, but it is usually hidden from us.
– We observe the effect, which is statistically correlated with the state.
– We use the correlations to decode the state sequence.
HMM for fair and biased coins
With complete information, we can compute:

P(x, q) = 0.5 * 0.5 * 0.8 * 0.5 * 0.8 * 0.5 * 0.8 * 0.5 * 0.2 * 0.9 * ...

Observed sequence: x = HTTHTTTTHHTH
Sequence of states: q = FFFFBBBFFFFF

Otherwise:

P(x) = Σq P(x, q), summing over all state sequences q
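A brute-force sketch of both quantities (my own code, not the lecture's; enumerating all 2^12 paths is only feasible because the example is tiny):

```python
from itertools import product

# Coin HMM from the slides.
start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.8, ("F", "B"): 0.2,
         ("B", "B"): 0.8, ("B", "F"): 0.2}
emit  = {("F", "H"): 0.5, ("F", "T"): 0.5,
         ("B", "H"): 0.1, ("B", "T"): 0.9}

def joint_prob(x, q):
    """P(x, q): probability of emitting x along the state path q."""
    p = start[q[0]] * emit[(q[0], x[0])]
    for i in range(1, len(x)):
        p *= trans[(q[i - 1], q[i])] * emit[(q[i], x[i])]
    return p

x = "HTTHTTTTHHTH"
print(joint_prob(x, "FFFFBBBFFFFF"))        # one known state path
print(sum(joint_prob(x, q)                  # marginal P(x) over
          for q in product("FB", repeat=len(x))))  # all 2^12 paths
```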
Three key HMM algorithms
– Forward algorithm. Given an observed sequence x and an HMM M, calculate P(x|M) (see the sketch below).
– Viterbi algorithm. Given x and M, calculate the most likely state sequence q.
– Forward-backward algorithm. Given many observed sequences, estimate the parameters of the HMM.
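For P(x|M), the forward algorithm replaces the exponential sum over paths with dynamic programming. A minimal sketch (my own naming; it reuses the coin-HMM tables from the brute-force sketch above):

```python
def forward(x, states, start, trans, emit):
    """Forward algorithm: P(x | M) in O(n * |states|^2) time."""
    # alpha[s] = P(x[0..t] and state at time t is s), updated column by column.
    alpha = {s: start[s] * emit[(s, x[0])] for s in states}
    for obs in x[1:]:
        alpha = {s: emit[(s, obs)] * sum(alpha[r] * trans[(r, s)] for r in states)
                 for s in states}
    return sum(alpha.values())

# Should print the same value as the explicit sum over all 2^12 paths above.
print(forward("HTTHTTTTHHTH", "FB", start, trans, emit))
```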
Some HMM Topologies

[Figure: examples of HMM topologies.]
Modeling protein families with linear profile HMMs
– The observed sequence is the amino acid sequence of a protein.
– Typically we want to model a group of related proteins.
– Model states and transitions will be based on a multiple alignment of the group.
– No transitions from right to left.
From multiple alignment to profile HMM

NF.....A-
DF.....SY
NYrqsanS-
NFapistAY
DFvlamrSF

A good model of these proteins must reflect:
– highly conserved positions in the alignment
– variable regions in the alignment
– varying lengths of protein sequences
From multiple alignment to profile HMM

NF.....A-
DF.....SY
NYrqsanS-
NFapistAY
DFvlamrSF

Three kinds of states: match, insert, and silent (delete).

[Figure: linear profile HMM built from the alignment. Match-state emissions: M1: P(N) = 0.6, P(D) = 0.4; M2: P(F) = 0.8, P(Y) = 0.2; M3: P(S) = 0.6, P(A) = 0.4; M4: P(Y) = 0.67, P(F) = 0.33. The insert state between M2 and M3 emits the lowercase residues (P(A) = 0.2, P(R) = 0.13, P(Q) = 0.07, ...). Transition probabilities shown include 1.0, 0.8, 0.6, 0.4, 0.2, and 0.0.]
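The match-state emission tables can be read straight off the alignment columns. A quick sketch (my own code; the match columns are chosen by hand for this toy alignment):

```python
alignment = ["NF.....A-", "DF.....SY", "NYrqsanS-", "NFapistAY", "DFvlamrSF"]
match_cols = [0, 1, 7, 8]  # uppercase/'-' columns; the rest feed the insert state

for m, col in enumerate(match_cols, start=1):
    residues = [row[col] for row in alignment if row[col] not in ".-"]
    probs = {aa: residues.count(aa) / len(residues)
             for aa in sorted(set(residues))}
    print(f"M{m}: {probs}")
# M1: {'D': 0.4, 'N': 0.6}
# M2: {'F': 0.8, 'Y': 0.2}
# M3: {'A': 0.4, 'S': 0.6}
# M4: {'F': 0.333..., 'Y': 0.666...}  (the slide rounds to 0.33 / 0.67)
```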
Finding the probability of a sequence with an HMM
– Once we have an HMM for a group of proteins, we are often interested in how well a new sequence fits the model.
– We want to compute a probability for our sequence with respect to the model.
One sequence, many paths

A protein sequence such as DYAF can be represented by many paths through the HMM.

[Figure: the profile HMM above, shown twice with two different paths for DYAF highlighted.]
Finding the probability of a sequence with an HMM
– Not knowing the state sequence q, we have to use either the forward or the Viterbi algorithm.
– Basic recurrence relation for Viterbi:

P(vt) is defined as the probability of the most probable path ending in state qt with observation xt:

P(v0) = 1
P(vt) = max P(vt-1) * P(qt | qt-1) * P(xt | qt)

Compute with dynamic programming.
Most probable path: Viterbi algorithm

for t = 1 to n:
    P(vt) = max P(vt-1) * P(qt | qt-1) * P(xt | qt)

[Figure: Viterbi dynamic-programming trellis for x = DYAF on the profile HMM above. Each cell holds the probability of the most probable path ending in that state after that observation, built from products such as 1.0 * 1.0 * 0.4 = 0.4 for D in M1 and 0.4 * 1.0 * 0.8 = 0.32 for F in M2; the resulting table P(v1)...P(v4) spans states M1-M4 and I1-I4.]
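The recurrence above turns into a short dynamic program. A generic sketch (my own code; run here on the fair/biased coin HMM for brevity rather than the profile HMM):

```python
def viterbi(x, states, start, trans, emit):
    """Most probable state path for x, by dynamic programming."""
    v = {s: start[s] * emit[(s, x[0])] for s in states}  # best score ending in s
    back = []                                            # argmax predecessors
    for obs in x[1:]:
        prev, nv = {}, {}
        for s in states:
            r = max(states, key=lambda r: v[r] * trans[(r, s)])
            prev[s], nv[s] = r, v[r] * trans[(r, s)] * emit[(s, obs)]
        back.append(prev)
        v = nv
    last = max(states, key=v.get)                        # trace back from the end
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return "".join(reversed(path)), v[last]

# Coin HMM from the earlier slides.
start = {"F": 0.5, "B": 0.5}
trans = {("F", "F"): 0.8, ("F", "B"): 0.2, ("B", "B"): 0.8, ("B", "F"): 0.2}
emit  = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.1, ("B", "T"): 0.9}
print(viterbi("HTTHTTTTHHTH", "FB", start, trans, emit))
```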
Overfitting problems
– Our toy example illustrates a problem with estimating probability distributions from small samples.
– P(aa other than D or N) = 0 at position 1.
– Family members that don't begin with D or N can't be recognized by the model.

[Figure: probability distribution in Match State 1.]
Model regularization
– Use pseudocounts.
– If an amino acid does not appear in a column of the alignment, give it a fake count.

NF.....A-
DF.....SY
NYrqsanS-
NFifistAY
DFvlpmrSF

P(A in col 1) = (observed counts of A in column 1 + pseudocounts of A in column 1)
              / (observed counts over all amino acids in column 1 + pseudocounts over all amino acids in column 1)
              = (0 + 1) / (5 + 18) = 1/23

P(N in col 1) = (3 + 0) / (5 + 18) = 3/23
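A sketch of this smoothing rule (my own code; it assumes one pseudocount for each amino acid absent from the column, as the slide describes):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probs(column):
    """Pseudocount-smoothed emission probabilities for one match column."""
    observed = {aa: column.count(aa) for aa in AMINO_ACIDS}
    pseudo = {aa: 0 if observed[aa] else 1 for aa in AMINO_ACIDS}  # fake counts
    total = sum(observed.values()) + sum(pseudo.values())          # 5 + 18 = 23
    return {aa: (observed[aa] + pseudo[aa]) / total for aa in AMINO_ACIDS}

p = column_probs("NDNND")  # column 1 of the toy alignment
print(p["N"], p["A"])      # 3/23 ≈ 0.130, 1/23 ≈ 0.043
```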
Model regularization
– Pseudocounts smooth the column probability distributions.
– In practice, pseudocounts are often added by fitting the column to a set of typical amino acid distributions found in the columns of protein multiple alignments.

[Figure: probability distribution in Match State 1.]
To come:
– HMMs can be used to automatically produce a high-quality multiple alignment.
– Active areas of research:
  – Building HMMs that can recognize very distantly related proteins
  – Multi-track HMMs