Transcript Slides

CS5263 Bioinformatics
Lecture 12-14: Markov Chain and
Hidden Markov Models
Outline
• Background on probability
• Hidden Markov models
– Algorithms
– Applications
Probability Basics
• Definition (informal)
– Experiment: e.g. toss a coin 10 times or sequence a genome
– Outcome: A possible result of an experiment. e.g.
HHHTTTHTTH or ACGTTAAACGTA
– The sample space S of a random experiment is the set of all
possible outcomes. e.g {H, T}10
– Event: any subset of the sample space. E.g.: > 4 heads
– Probabilities are numbers assigned to events that indicate “how
likely” it is that the event will occur when a random experiment
is performed
– A probability law for a random experiment is a rule that assigns
probabilities to the events in the experiment
Example
0  P(Ai)  1
P(S) = 1
Probabilistic Calculus
• P(A U B) = P(A) + P(B) – P(A ∩ B)
Either A or B
both A and B
• If A, B are mutually exclusive:
P(A ∩ B) = 0
P(A U B) = P(A) + P(B)
• A and not(A) are mutually exclusive
– Thus: P(not(A)) = P(Ac) = 1 – P(A)
s
A
B
Joint and conditional probability
• The joint probability of two events A and B
P(A∩B), or simply P(A, B) is the probability that
event A and B occur at the same time.
• The conditional probability of P(A|B) is the
probability that A occurs given B occurred.
P(A | B) = P(A ∩ B) / P(B)
P(A ∩ B) = P(A | B) * P(B)
Example
• Roll a die
– If I tell you the number is less than 4
– What is the prob for the number to be even?
P(d = even | d < 4)
= P(d = even ∩ d < 4) / P(d < 4)
= P(d = 2) / P(d = 1, 2, or 3)
= (1/6) / (3/6) = 1/3
Independence
• P(A | B) = P(A ∩ B) / P(B)
=> P(A ∩ B) = P(B) * P(A | B)
• A, B are independent iff
– P(A ∩ B) = P(A) * P(B)
– That is, P(A) = P(A | B)
• Also implies that P(B) = P(B | A)
– P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)
Examples
• Are P(d = even) and P(d < 4)
independent?
P(d = even and d < 4) = 1/6
 P(d = even) * P(d < 4) = 1/4
or
P(d = even) = ½
 P(d = even | d < 4) = 1/3
• If the die has 8 faces, will P(d = even) and
P(d < 5) be independent?
Theorem of total probability
• Let B1, B2, …, BN be mutually exclusive events whose union equals
the sample space S. We refer to these sets as a partition of S.
• An event A can be represented as:
•
•
Since B1, B2, …, BN are mutually exclusive, then
P(A) = P(A∩B1) + P(A∩B2) + … + P(A∩BN)
Marginalization
And therefore
P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN)
= i P(A | Bi) * P(Bi)
Exhaustive conditionalization
Example
• A loaded die:
– P(6) = 0.5
– P(1) = … = P(5) = 0.1
• Prob of even number?
P(even)
= P(even | d < 6) * P (d<6) +
P(even | d = 6) * P (d=6)
= 2/5 * 0.5 + 1 * 0.5
= 0.7
Another example
• A box of dice:
– 99% fair
– 1% loaded
• P(6) = 0.5.
• P(1) = … = P(5) = 0.1
– Randomly pick a die and roll, P(6)?
• P(6) = P(6 | F) * P(F) + P(6 | L) * P(L)
– 1/6 * 0.99 + 0.5 * 0.01 = 0.17
Chain rule
• P(x1, x2, x3) = P(x1, x2, x3) / P(x2, x3)
* P(x2, x3) / P(x3)
* P(x3)
= P(x1 | x2, x3) P(x2 | x3) P(x3)
x3
x2
x1
Bayes theorem
• P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)
Conditional probability
(likelihood)
=> P(B | A) =
Posterior probability
P ( A | B ) P (B )
Prior of B
P(A )
Prior of A (Normalizing constant)
This is known as Bayes Theorem or Bayes Rule, and is (one of) the
most useful relations in probability and statistics
Bayes Theorem is definitely the fundamental relation in Statistical Pattern
Recognition
Bayes theorem (cont’d)
• Given B1, B2, …, BN, a partition of the sample
space S. Suppose that event A occurs; what is
the probability of event Bj?
Posterior probability
Likelihood
Prior of Bj
• P(Bj | A) = P(A | Bj) * P(Bj) / P(A)
Normalizing constant
= P(A | Bj) * P(Bj) / jP(A | Bj)*P(Bj)
Bj: different models
(theorem of total probabilities)
In the observation of A, should you choose a model that maximizes
P(Bj | A) or P(A | Bj)? Depending on how much you know about Bj !
Example
• Prosecutor’s fallacy
– Some crime happened
– The criminal left not evidence except hair
– The police got his DNA from his hair
• Expert matched the DNA with someone’s DNA in
a database
– Expert said both false-positive and false negative
rates are 10-6
• Can this be used as an evidence of guilty
against the suspect?
Prosecutor’s fallacy
•
•
•
•
False Pos: P(match | innocent) = 10-6
False Neg: P(no match | guilty) = 10-6
P(match | guilty) = 1 - 10-6 ~ 1
P(no match | innocent) = 1 - 10-6 ~ 1
• P(guilty | match) = ?
Prosecutor’s fallacy
P (g | m) = P (m | g) * P(g) / P (m)
~ P(g) / P(m)
– P(g): the prior probability for someone to be
guilty with no DNA evidence
– P(m): the probability for a DNA match
• How to get these two numbers?
– Don’t really care P(m)
– Want to compare two models:
• P(g | m) and P(i | m)
Prosecutor’s fallacy
• P(i | m) = P(m | i) * P(i) / P(m)
• P(g | m) = P(m | g) * P(g) / P(m)
• Therefore
P(i | m) / P(g | m) = P(m | i) / P(m | g) * P(i) / P(g)
= 10-6 * P(i) / P(g)
• P(i) + p(g) = 1
• It is clear, therefore, that whether we can
conclude the suspect is guilty depends on the
prior probability P(g)
Prosecutor’s fallacy
• How do you get P(g)?
• Depending on what other information you have on the
suspect
• Say if the suspect has no other connection with the
crime, and the overall crime rate is 10-7
• That’s a reasonable prior for P(g)
• P(g) = 10-7, P(i) ~ 1
• P(i | m) / P(g | m) = 10-6 * P(i) / P(g) = 10-6/10-7 = 10
• Or: P(i | m) = 0.91 and P(g | m) = 0.09
• Suspect is more likely to be innocent than guilty, given
only the DNA samples
Another example
• A test for a rare disease claims that it will report
positive for 99.5% of people with disease, and
negative 99.9% of time for those without.
• The disease is present in the population at 1 in
100,000
• What is P(disease | positive test)?
– P(D|P) / P(H|P) ~ 0.01
• What is P(disease | negative test)?
– P(D|N) / P(H|N) ~ 5e-8
Yet another example
• We’ve talked about the boxes of casinos:
99% fair, 1% loaded (50% at six)
• We said if we randomly pick a die and roll,
we have 17% of chance to get a six
• If we get 3 six in a row, what’s the chance
that the die is loaded?
• How about 5 six in a row?
• P(loaded | 666)
= P(666 | loaded) * P(loaded) / P(666)
= 0.53 * 0.01 / (0.53 * 0.01 + (1/6)3 * 0.99)
= 0.21
• P(loaded | 66666)
= P(66666 | loaded) * P(loaded) / P(66666)
= 0.55 * 0.01 / (0.55 * 0.01 + (1/6)5 * 0.99)
= 0.71
Simple probabilistic models for
DNA sequences
•
Assume nature generates a type of DNA
sequence as follows:
1.
2.
3.
4.
•
Given a string say X=“GATTCCAA…” and two
dice
–
–
•
A box of dice, each with four faces: {A,C,G,T}
Select a die suitable for the type of DNA
Roll it, append the symbol to a string.
Repeat 3, until all symbols have been generated.
M1 has the distribution of pA=pC=pG=pT=0.25.
M2 has the distribution: pA=pT=0.20, pC=pG=0.30
What is the probability of the sequence being
generated by M1 or M2?
Model selection by maximum
likelihood criterion
• X = GATTCCAA
• P(X | M1) = P(x1,x2,…,xn | M1)
= i=1..n P(xi|M1)
= 0.258
• P(X | M2) = P(x1,x2,…,xn | M2)
= i=1..n P(xi|M2)
= 0.25 0.33
P(X|M1) / P(X|M2) =  P(xi|M1)/P(xi|M2) = (0.25/0.2)5 (0.25/0.3)3
LLR =  log(P(xi|M1)/P(xi|M2))
Log likelihood ratio (LLR)
= nASA + nCSC + nGSG + nTST
= 5 * log(1.25) + 3 * log(0.833) = 0.57
Si = log (P(i | M1) / P(i | M2)), i = A, C, G, T
Model selection by maximum
a posterior probability criterion
• Take the prior probabilities of M1 and M2
into consideration if known
Log (P(M1|X) / P(M2|X))
= LLR + log(P(M1)) – log(P(M2))
= nASA + nCSC + nGSG + nTST + log(P(M1)) – log(P(M2))
• If P(M1) ~ P(M2), results will be similar to
LLR test
Markov models for DNA sequences
We have assumed independence of
nucleotides in different positions unrealistic in biology
Example: CpG islands
• CpG - 2 adjacent nucleotides, same strand
(not base-pair; “p” stands for the
phosphodiester bond of the DNA
backbone)
• In mammal promoter regions, CpG is more
frequent than other regions of genome
– often mark gene-rich regions
CpG islands
• CpG Islands
– More CpG than elsewhere
– More C & G than elsewhere, too
– Typical length: few 100 to few 1000 bp
• Questions
– Is a short sequence (say, 200 bp) a CpG
island or not?
– Given a long sequence (say, 10-100kb), find
CpG islands?
Markov models
• A sequence of random variables is a k-th order
Markov chain if, for all i, ith value is independent
of all but the previous k values:
• First order (k=1):
• Second order:
• 0th order: (independence)
First order Markov model
A 1st order Markov model for CpG
islands
• Essentially a finite state automaton (FSA)
• Transitions are probabilistic (instead of deterministic)
• 4 states: A, C, G, T
• 16 transitions: ast = P(xi = t | xi-1 = s)
• Begin/End states
Probability of emitting sequence x
Probability of a sequence
• What’s the probability of ACGGCTA in this
model?
P(A) * P(C|A) * P(G|C) … P(A|T)
= aBA aAC aCG …aTA
• Equivalent: follow the path in the automaton, and
multiply the transition probabilities on the path
Training
• Estimate the parameters of the model
– CpG+ model: Count the transition frequencies from
known CpG islands
– CpG- model: Also count the transition frequencies
from sequences without CpG islands
– ast = #(s→t) / #(s → )
a+st
a-st
Discrimination / Classification
• Given a sequence, is it CpG island or not?
• Log likelihood ratio (LLR)
βCG = log2(a+CG/a -CG) = log2(0.274/0.078) = 1.812
βBA = log2(a+ A/a - A) = log2(0.591/1.047) = -0.825
Example
• X = ACGGCGACGTCG
• S(X) = βBA + βAC +βCG +βGG +βGC +βCG +βGA
+ βAC +βCG +βGT +βTC +βCG
= βBA + 2βAC +4βCG +βGG +βGC +βGA
+βGT +βTC
= -0.825 + 2*.419 + 4*1.812+.313
+.461 - .624 - .730 + .573
= 7.25
CpG island scores
Figure 3.2 (Durbin book) The histogram of length-normalized scores for all
the sequences. CpG islands are shown with dark grey and non-CpG with
light grey.
Questions
• Q1: given a short sequence, is it more
likely from CpG+ model or CpG- model?
• Q2: Given a long sequence, where are the
CpG islands (if any)?
– Approach 1: score (e.g.) 100 bp windows
• Pro: simple
• Con: arbitrary, fixed length, inflexible
– Approach 2: combine +/- models.
Combined model
• Given a long sequence, predict which state each position
is in. (states are hidden: Hidden Markov model)
Hidden Markov Model (HMM)
• Introduced in the 70’s for speech recognition
• Have been shown to be good models for biosequences
–
–
–
–
Alignment
Gene prediction
Protein domain analysis
…
• An observed sequence data that can be modeled by a
Markov chain
– State path unknown
– Model parameter known or unknown
• Observed data: emission sequences X = (x1x2…xn)
• Hidden data: state sequences Π = (π1π2…πn)
Hidden Markov model (HMM)
Definition: A hidden Markov model (HMM) is a five-tuple
• Alphabet
 = { b1, b2, …, bM }
• Set of states Q = { 1, ..., K }
• Transition probabilities between any two states
1
aij = transition prob from state i to state j
ai1 + … + aiK = 1, for all states i = 1…K
• Start probabilities a0i
a01 + … + a0K = 1
• Emission probabilities within each state
ek(b) = P( xi = b | i = k)
ek(b1) + … + ek(bM) = 1, for all states k = 1…K
K
2
…
HMM for the Dishonest Casino
A casino has two dice:
• Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
Casino player switches back and forth between fair
and loaded die once in a while
The dishonest casino model
aFF = 0.95
eF(1) = 1/6
eF(2) = 1/6
eF(3) = 1/6
eF(4) = 1/6
eF(5) = 1/6
eF(6) = 1/6
aFL = 0.05
Fair
aLL = 0.95
LOADED
aLF = 0.05
Transition probability
Emission probability
eL(1) = 1/10
eL(2) = 1/10
eL(3) = 1/10
eL(4) = 1/10
eL(5) = 1/10
eL(6) = 1/2
Simple scenario
• You don’t know the probabilities
• The casino player lets you observe which die
he/she uses every time
– The “state” of each roll is known
• Training (parameter estimation)
– How often the casino player switches dice?
– How “loaded” is the loaded die?
– Simply count the frequency that each face appeared
and the frequency of die switching
– May add pseudo-counts if number of observations is
small
More complex scenarios
• The “state” of each roll is unknown:
– You are given the results of a series of rolls
– You don’t know which number is generated by
which die
• You may or may not know the parameters
– How “loaded” is the loaded die
– How frequently the casino player switches
dice
The three main questions on
HMMs
1. Decoding
GIVEN
FIND
2. Evaluation
GIVEN
FIND
a HMM M,
and a sequence x,
the sequence  of states that maximizes P (x,  | M )
Sometimes written as
P (x, ) for simplicity.
a HMM M,
and a sequence x,
P ( x | M ) [or P(x) for simplicity]
3. Learning
GIVEN
FIND
a HMM M with unspecified transition/emission probs.,
and a sequence x,
parameters  = (ei(.), aij) that maximize P (x | )
Question # 1 – Decoding
GIVEN
A HMM with parameters. And a sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
What portion of the sequence was generated with the fair die, and what
portion with the loaded die?
This is the DECODING question in HMMs
A parse of a sequence
Given a sequence x = x1……xN, and a HMM with k states,
A parse of x is a sequence of states  = 1, ……, N
1
1
1
…
1
2
2
2
…
2
…
…
…
K
K
K
x1
x2
x3
…
…
K
xK
Probability of a parse
1
1
1
…
1
Given a sequence x = x1……xN
and a parse  = 1, ……, N
2
2
2
…
2
…
…
…
To find how likely is the parse:
(given our HMM)
K
K
K
x2
x3
x1
…
…
P(x, ) = P(x1, …, xN, 1, ……, N)
= P(xN, N | N-1) P(xN-1, N-1 | N-2)……P(x2, 2 | 1) P(x1, 1)
= P(xN | N) P(N | N-1) ……P(x2 | 2) P(2 | 1) P(x1 | 1) P(1)
= a01 a12……aN-1N e1(x1)……eN(xN)
K
xK
Example
0.05
0.95
0.95
Fair
P(1|F) = 1/6
P(2|F) = 1/6
P(3|F) = 1/6
P(4|F) = 1/6
P(5|F) = 1/6
P(6|F) = 1/6
LOADED
0.05
• What’s the probability of
P(1|L) = 1/10
P(2|L) = 1/10
P(3|L) = 1/10
P(4|L) = 1/10
P(5|L) = 1/10
P(6|L) = 1/2
 = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
Example
0.05
0.05
• What’s the probability of
 = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?
P = ½ * P(1 | F) P(Fi+1 | Fi) …P(5 | F) P(Li+1 | Fi) P(6|L) P(Li+1 | Li) …P(4 | F)
= ½ x 0.957 0.052 x (1/6)6 x (1/10)2 x (1/2)2 = 5 x 10-11
Decoding
• Parse (path) is unknown. What to do?
• Alternative algorithms:
– Most probable single path (Viterbi algorithm)
– Sequence of most probable states (Forwardbackward algorithm)
The Viterbi algorithm
• Goal: to find
• Is equivalent to find  *  arg max log( P( x,  ))

The Viterbi algorithm
L
L
L
L
L
L
L
L
L
L
F
F
F
F
F
F
F
F
F
F
B
P(s|L) = 1/10, for s in [1..5]
P(6|L) = 1/2
P(s|F) = 1/6, for s in [1..6]
• Find a path with the following objective:
– Maximize the product of transition and emission probabilities
 Maximize the sum of log probabilities
Edge weight
(symbol independent)
Node weight
(depend on symbols in seq)
The Viterbi algorithm
x1
L
x2
L
x3
L
…
…
xi
xi+1
L
L
…
…
xn-1
L
xn
L
wLF
B
F
F
F
…
F
wFF
E
F
wFF = log (aFF)
VF (i+1) = rF (xi+1) + max
Weight for the best parse of (x1…xi+1),
with xi+1 emitted by state F
VL (i+1) = rL (xi+1) + max
Weight for the best parse of (x1…xi+1),
with xi+1 emitted by state L
…
F
F
rF(xi+1) = log (eF(xi+1))
VF (i) + wFF
VL (i) + wLF
VF (i) + wFL
VL (i) + wLL
Recursion from FSA directly
wFF=-0.05
WFL=-3.00
wLL=-0.05
aLL=0.95
aFF=0.95
aFL=0.05
LOADED
Fair
WLF=-3.00
rF(s) = -1.8
s = 1...6
rL(6) = -0.7
rL(s) = -2.3
(s = 1…5)
LOADED
Fair
aLF=0.05
P(s|F) = 1/6
s = 1…6
P(6|L) = ½
P(s|L) = 1/10
(s = 1...5)
VF (i+1) = rF (xi+1) + max {VL (i) + WLF
VF (i) + WFF }
PF (i+1) = eF (xi+1)  max {PL (i)  aLF
PF (i)  aFF }
VL (i+1) = rL (xi+1) + max {VL (i) + WLL
VF (i) + WFL }
PL (i+1) = eL (xi+1)  max {PL (i)  aLL
PF (i)  aFL }
In general: more states / symbols
• Alphabet
 = { b1, b2, …, bM }
• Set of states Q = { 1, ..., K }
• States are completely connected.
– K2 transitions probabilities (some may be 0)
– Each state has M transition probabilities (some may be 0)
xi
xi+1
k
2
2
…
…
1
k
……
l
K
1
……
2
…
1
K
K
l
Vl (i  1)  rl ( xi )  max k 1.. K (Vk (i)  wkl )
The Viterbi Algorithm
x1 x2 x3 … … xi+1……… … … ……………………xN
State 1
2
l
Vl(i+1)
K
Vl (i  1)  rl ( xi )  max k 1.. K (Vk (i)  wkl )
Similar to “aligning” a set of states to a sequence
Time: O(K2N)
Space: O(KN)
The Viterbi Algorithm (in log space)
Input: x = x1……xN
Initialization:
V0(0) = 0
Vl(0) = -inf, for all l > 0
(zero in subscript is the start state.)
(0 in parenthesis is the imaginary first position)
Iteration:
for each i
for each l
Vl(i) = rl(xi) + maxk (wkl + Vk(i-1))
Ptrl(i)
= argmaxk (wkl + Vk(i-1))
end
end
Termination:
Prob(x, *) = exp{maxk Vk(N)}
Traceback:
N* = argmaxk Vk(N)
i-1* = Ptri (i)
// rj(xi) = log(ej(xi)), wkj = log(akj)
The Viterbi Algorithm (in prob space)
Input: x = x1……xN
Initialization:
P0(0) = 1
Pl(0) = 0, for all l > 0
(zero in subscript is the start state.)
(0 in parenthesis is the imaginary first position)
Iteration:
for each i
for each l
Pl(i) = el(xi)  maxk (akl  Pk(i-1))
Ptrl(i)
= argmaxk (akl  Pk(i-1))
end
end
Termination:
Prob(x, *) = maxk Pk(N)
Traceback:
N* = argmaxk Pk(N)
i-1* = Ptri (i)
CpG islands
• Data: 41 human sequences, including 48 CpG
islands of about 1kbp each
• Viterbi:
– Found 46 of 48
– plus 121 “false positives”
• Post-processing:
–
–
–
–
merge within 500bp
discard < 500
Found 46/48
67 false positive
Problems with Viterbi decoding
• Most probable path not necessarily the
only interesting one
– Single optimal vs multiple sub-optimal
• What if there are many sub-optimal paths with
slightly lower probabilities?
– Global optimal vs local optimal
• What’s best globally may not be the best for each
individual
Example
•
•
•
•
The dishonest casino
Say x = 12341623162616364616234161221341
Most probable path:  = FF……F
However: marked letters more likely to be L than
unmarked letters
• Another way to interpret the problem
– With Viterbi, every position is assigned a single label
– Confidence level for each assignment?
Posterior decoding
• Viterbi finds the path with the highest probability
• We want to know
k
=1
• In order to do posterior decoding, we need to know P(x)
and P(i = k, x), since
• Computing P(x) and P(x,i=k) is called the evaluation
problem
• The solution: Forward-backward algorithm
Probability of a sequence
• P(X | M): prob that X can be generated by M
• Sometimes simply written as P(X)
• May be written as P(X | M, θ) or P(X | θ) to emphasize
that we are looking for θ to optimize the likelihood
(discussed later in learning)
• Not equal to the probability of a path P(X, )
– Many possible paths can generate X. Each with a probability
– P(X) = P(X, ) = P(X | ) P()
– How to compute without summing over all possible paths
(exponential of them)?
• Dynamic programming
The forward algorithm
• Define fk(i) = P(x1…xi,
i=k)
– Implicitly: sum over all possible paths for x1…xi-1
xi
k
f k (n)  P ( x1...xn ,  n  k )  P ( x,  n  k )
P ( x )   P ( x,  n  k )   f k ( n )
k
k
f k (i )  P( x1...xi ,  i  k )
 P( x1...xi 1 ,  i  k ) P( xi |  i  k )
 P( xi |  i  k ) P( x1...xi 1 ,  i 1  j ,  i  k )
j
 ek ( xi ) P( x1...xi 1 ,  i 1  j ) P( i  k |  i 1  j )
j
 ek ( xi ) P( x1...xi 1 ,  i 1  j )a jk
j
 ek ( xi ) f j (i  1)a jk
j
The forward algorithm
k
xi
The forward algorithm
We can compute fk(i) for all k, i, using dynamic
programming!
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fk(i) = ek(xi) j fj(i-1) ajk
Termination:
Prob(x) = k fk(N)
Relation between Forward and Viterbi
VITERBI (in prob space)
FORWARD
Initialization:
P0(0) = 1
Pk(0) = 0, for all k > 0
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
Iteration:
Pk(i) = ek(xi) maxj Pj(i-1) ajk
Termination:
Prob(x, *) = maxk Pk(N)
fk(i) = ek(xi) j fj(i-1) ajk
Termination:
Prob(x) = k fk(N)
Posterior decoding
• Viterbi finds the path with the highest probability
• We want to know
k
=1
• In order to do posterior decoding, we need to know P(x)
and P(i = k, x), since
Need to know how
to compute this
Have just shown
how to compute this
xi
k
The backward algorithm
• Define bk(i) = P(xi+1…xn | i=k)
– Implicitly: sum over all possible paths for xi…xn
xi
k
k
xi
1
This does not include the emission probability of xi
The forward-backward algorithm
•
•
•
•
Compute fk(i) for each state k and position i
Compute bk(i), for each state k and position i
Compute P(x) = kfk(N)
Compute P(i=k | x) = fk(i) * bk(i) / P(x)
The prob of x, with the
constraint that xi was
generated by state k
Sequence
state
Forward probabilities
x
Backward probabilities
/ P(X)
Space: O(KN)
Time: O(K2N)
P(i=k | x)
Relation to another F-B algorithm
• We’ve learned a forward-backward algorithm in linearspace sequence alignment
– Hirscheberg’s algorithm
– Also known as forward-backward alignment algorithm
y
x
What’s P(i=k | x) good for?
• For each position, you can assign a
probability (in [0, 1]) to the states that the
system might be in at that point –
confidence level
• Assign each symbol to the most-likely
state according to this probability rather
than the state on the most-probable path –
posterior decoding
^i = argmaxk P(i = k | x)
Posterior decoding for the
dishonest casino
If P(fair) > 0.5, the roll is more likely to be
generated by a fair die than a loaded die
Posterior decoding for another
dishonest casino
In this example, Viterbi predicts that all rolls were from the fair die.
CpG islands again
• Data: 41 human sequences, including 48 CpG islands of
about 1kbp each
• Viterbi:
– Found 46 of 48
– plus 121 “false positives”
Post-process:
46/48
67 false pos
• Posterior Decoding:
– same 2 false negatives
– plus 236 false positives
46/48
83 false pos
Post-process: merge within 500; discard < 500
What if a new genome comes?
We just sequenced the porcupine genome
We know CpG islands play the same role in this
genome
However, we have not many known CpG islands
for porcupines
We suspect the frequency and characteristics of
CpG islands are quite different in porcupines
How do we adjust the parameters in our
model?
- LEARNING
Learning
• When the state path is known
– We’ve already done that
– Estimate parameters from labeled data
(known CpG and non-CpG)
– “Supervised” learning
• When the state path is unknown
– Estimate parameters without labeled data
– “unsupervised” learning
Basic idea
1. Estimate our “best guess” on the
model parameters θ
2. Use θ to predict the unknown
labels
3. Re-estimate a new set of θ
4. Repeat 2 & 3
Two ways
Viterbi training
1. Estimate our “best guess” on the model
parameters θ
2. Find the Viterbi path using current θ
3. Re-estimate a new set of θ based on the
Viterbi path
– Count transitions/emissions on those paths,
getting new θ
4. Repeat 2 & 3 until converge
Baum-Welch training
1.
2.
3.
Estimate our “best guess” on the model parameters
θ
Find P(i=k | x,θ) using forward-backward algorithm
Re-estimate a new set of θ based on all possible
paths

For example, according to Viterbi, pos i is in state k and pos
(i+1) is in state l
•
•
4.
This contributes 1 count towards the frequency that transition
k l is used
In Baum-Welch, pos i has some prob in state k and pos (i+1)
has some prob in state l. This transition is counted only
partially, according to the prob of this transition
Repeat 2 & 3 until converge
Probability that a transition is used
i
k
l
i+1
Estimated # of kl transition
Viterbi vs Baum-Welch training
• Viterbi training
–
–
–
–
Returns a single path
Each position labeled with a fixed state
Each transition counts one
Each emission also counts one
• Baum-Welch training
– Does not return a single path
– Considers the prob that each transition is used and
the prob that a symbol is generated by a certain state
– They only contribute partial counts
Viterbi vs Baum-Welch training
• Both guaranteed to converges
• Baum-Welch improves the likelihood of the
data in each iteration: P(X)
– True EM (expectation-maximization)
• Viterbi improves the probability of the most
probable path in each iteration: P(X, *)
– EM-like
expectation-maximization (EM)
• Baum-Welch algorithm is a special case of the
expectation-maximization (EM) algorithm, a
widely used technique in statistics for learning
parameters from unlabeled data
• E-step: compute the expectation (e.g. prob for
each pos to be in a certain state)
• M-step: maximum-likelihood parameter
estimation
• We’ll see EM and similar techniques again in
motif finding
• k-means clustering has some similar flavor
Compute sum of probabilities in log
space
• Two probabilities x and y, x < y
• lx = log(x), ly = log(y),
(lx < ly)
• z = x + y = y (1 + x/y)
lz = log(z) = log(x+y)
= log(y) + log(1 + x/y)
= ly + log(1 + exp(log(x)-log(y))
= ly + log(1 + exp(lx – ly))
Also see page 4 in this doc:
http://cs.utsa.edu/~jruan/teaching/cs5263_fall_2007/proj1.pdf
HMM summary
•
•
•
•
Viterbi – best single path
Forward – sum over all paths
Backward – similar
Baum-Welch – training via EM and
forward-backward
• Viterbi – another “EM”, but Viterbi-based
Silent states
• Silent states are states that do not emit
symbols (e.g., the state 0 in our previous
examples)
• Silent states can be introduced in HMMs
to reduce the number of transitions
Silent states
• Suppose we want to model a sequence in which
arbitrary deletions are allowed (later this lecture)
• In that case we need some completely forward
connected HMM (O(m2) edges)
Silent states
• If we use silent states, we use only O(m)
edges
Algorithms can be
modified easily to deal
with silent states, so long
as no silent-state loops
• Nothing comes free
Suppose we want to assign
high probability to 1→5 and
2→4, there is no way to have
also low probability on 1→4
and 2→5.
Modular design of HMM
• HMM can be designed modularly
• Each modular has own begin / end states
(silent, i.e. no emission)
• Each module communicates with other
modules only through begin/end states
C+
B+
E-
G+
A+
T+
A-
TC-
G-
E+
B-
HMM modules
and non-HMM
modules can be
mixed
HMM applications
•
•
•
•
Pair-wise sequence alignment
Multiple sequence alignment
Gene finding
Speech recognition: a good tutorial on
course website
• Machine translation
• Many others
FSA for global alignment
e


Xi aligned
to a gap
d
Xi and Yj
aligned
d

Yj aligned
to a gap
e
HMM for global alignment
1- 
1-2
Xi and Yj
aligned
P(xi,yj)
16 emission probabilities
1-
Xi aligned
to a gap
q(xi): 4 emission probabilities



q(yj): 4 emission probabilities
Yj aligned
to a gap

Pair-wise HMM: emit two sequences simultaneously
Algorithm is similar to regular HMM, but need an additional dimension.
e.g. in Viterbi, we need Vk(i, j) instead of Vk(i)
HMM and FSA for alignment
• After proper transformation between the
probabilities and substitution scores, the two are
identical
 (a, b)  log [p(a, b) / (q(a) q(b))]
 d  log 
 e  log 
• Details in Durbin book chap 4
• Finding an optimal FSA alignment is equivalent
to finding the most probable path with Viterbi
HMM for pair-wise alignment
• Theoretical advantages:
– Full probabilistic interpretation of alignment scores
– Probability of all alignments instead of the best
alignment (forward-backward alignment)
– Posterior probability that Ai is aligned to Bj
– Sampling sub-optimal alignments
• Not commonly used in practice
– Needleman-Wunsch and Smith-Waterman algorithms
work pretty well, and more intuitive to biologists
Other applications
• HMM for multiple alignment
– Widely used
• HMM for gene finding
– Foundation for most gene finders
– Include many knowledge-based fine-tunes
and GHMM extensions
– We’ll only discuss basic ideas
HMM for multiple seq alignment
• Proteins form families both across and within
species
– Ex: Globins, Zinc finger
– Descended from a common ancestor
– Typically have similar three-dimensional structures,
functions, and significant sequence similarity
• Identifying families is very useful: suggest
functions
• So: search and alignment are both useful
• Multiple alignment is hard
• One very useful approach: profile-HMM
Say we already have a database of reliable multiple alignment
of protein families
When a new protein comes, how do we align it to the existing
alignments and find the family that the protein may belong to?
Solution 1
• Use regular expression to represent the
consensus sequences
– Implemented in the Prosite database
– for example
C - x(2) - P - F - x - [FYWIV] - x(7) - C x(8,10) - W - C - x(4) - [DNSR] - [FYW] x(3,5) - [FYW] - x - [FYWI] - C
Multi-alignments represented by consensus
• Consensus sequences are very intuitive
• Gaps can be represented by Do-not-cares
• Aligning sequences with regular expressions can be
done easily with DP
• Possible to allow mismatches in searching
• Problems
– Limited power in expressing more divergent positions
• E.g. among 100 seqs, 60 have Alanine, 20 have Glycine, 20 others
– Specify boundaries of indel not so easy
• unaligned regions have more flexibilities to evolve
– May have to change the regular expression when a new member
of a protein family is identified
Solution 2
• For a non-gapped
alignment, we can
create a positionspecific weight matrix
(PWM)
• Also called a profile
• May use
pseudocounts
A
4
8
3
4
7
6
1
1
5
6
R
3
1
0
10
2
1
1
2
1
13
N
3
1
2
8
40
1
4
0
7
1
D
6
0
8
5
0
1
1
3
5
2
C
0
9
12
0
0
3
0
2
2
4
E
1
5
3
0
2
0
2
4
3
3
Q
3
0
1
2
3
3
2
0
2
11
G
3
6
5
3
5
4
2
1
0
6
H
2
4
4
2
4
4
32
7
6
7
I
7
2
25
13
2
2
1
50
6
4
L
4
4
6
8
0
1
1
3
10
8
K
33
5
1
2
4
1
1
9
31
2
M
7
1
2
10
4
2
1
4
1
2
F
6
7
8
3
2
4
2
1
7
10
P
1
27
2
7
9
1
3
3
1
1
S
2
4
1
9
2
2
1
0
1
4
T
5
0
8
8
2
60
37
1
2
4
W
2
7
1
3
7
2
3
1
3
6
Y
4
0
5
1
4
1
1
5
3
1
V
4
8
2
1
1
0
4
3
2
6
Scoring by PWMs
A
4
8
3
4
7
6
1
1
5
6
R
3
1
0
10
2
1
1
2
1
13
N
3
1
2
8
40
1
4
0
7
1
D
6
0
8
5
0
1
1
3
5
2
C
0
9
12
0
0
3
0
2
2
4
E
1
5
3
0
2
0
2
4
3
3
Q
3
0
1
2
3
3
2
0
2
11
G
3
6
5
3
5
4
2
1
0
6
H
2
4
4
2
4
4
32
7
6
7
I
7
2
25
13
2
2
1
50
6
4
L
4
4
6
8
0
1
1
3
10
8
K
33
5
1
2
4
1
1
9
31
2
M
7
1
2
10
4
2
1
4
1
2
F
6
7
8
3
2
4
2
1
7
10
P
1
27
2
7
9
1
3
3
1
1
S
2
4
1
9
2
2
1
0
1
4
T
5
0
8
8
2
60
37
1
2
4
W
2
7
1
3
7
2
3
1
3
6
Y
4
0
5
1
4
1
1
5
3
1
V
4
8
2
1
1
0
4
3
2
6
x = KCIDNTHIKR
P(x | M) = i ei(xi)
Random model: each amino
acid xi can be generated with
probability q(xi)
P(x | random) = i q(xi)
Log-odds ratio:
S = log P(X|M) / P(X|random)
= i log ei(xi) / q(xi)
PWMs
• Advantage:
– Can be used to identify both strong and weak
homologies
– Easy to implement and use
– Probabilistic interpretation
• PWMs are used in PSI-BLAST
–
–
–
–
Given a sequence, get k similar seqs by BLAST
Construct a PWM with these sequences
Search the database for seqs matching the PWM
Improved sensitivity
• Problem:
– Not intuitive to deal with gaps
PWMs are HMMs
B
M1
Mk
Transition probability = 1
20 emission probabilities for each state
• This can only be used to search for sequences without
insertion / deletions (indels)
• We can add additional states for indels!
E
Dealing with insertions
Ij
B
M1
Mj
Mk
E
• This would allow an arbitrary number of
insertions after the j-th position
– i.e. the sequence being compared can be
longer than the PWM
Dealing with insertions
B
I1
Ij
Ik
M1
Mj
Mk
E
• This allows insertions at any position
Dealing with deletions
B
Mi
Mj
E
• This would allow a subsequence [i, j] being
deleted
– i.e. the sequence being compared can be
shorter than the PWM
Dealing with deletions
B
E
• This would allow an arbitrary length of
deletion
– Completely forward connected
– Too many transitions
Deletion with silent states
B
Dj
Silent state
Mj
E
• Still allows any length of deletions
• Fewer parameters
• Less detailed control
Full model
• Called profile-HMM
Dj
D: deletion state
I: insertion state
M: matching state
Ij
B
Mj
E
Algorithm: basically the same as a regular HMM
Using profile HMM
• Alignment
– Align a sequence to a profile HMM (Viterbi)
• Searching
– Protein classification: Given a sequence, and HMMs for different
protein families, which family the sequence may belong to?
– New family member detection: Given a HMM for a protein family,
and many proteins, which protein may belong to the family?
– Viterbi or forward
• Training / Learning
– Given a multiple alignment, construct a HMM (supervised)
– Given an unaligned protein family, construct a HMM
(unsupervised: BM or Viterbi)
Pfam
• A database of protein families
– Developed by Sean Eddy and colleagues while
working in Durbin’s lab
•
•
•
•
Hand-curated “seed” multiple alignment
Train HMM from seed alignment
Hand-chosen score thresholds
Automatic classification / classification of all
other protein sequences
• 7973 families in Rfam 18.0, 8/2005
(covers ~75% of proteins)
Build models from aligned sequences
• Matching state for
columns with no gaps
• When gaps exist, how to
decide whether they are
insertions or matching?
– Heuristic rule: >50%
gaps, insertion,
otherwise, matching
• How to add
pseudocount?
– Simply add one
– According to
background distribution
– Use a mixture of priors
(Dirchlet mixtures)
• Sequence weighting
– Very similar sequences
should each get less
weight
Build models from unaligned sequences
• Choose a model length and initial parameters
– Commonly use average seq length as model length
• Baum-Welch or Viterbi training
– Usually necessary to use multiple starting points or
other heuristics to escape from local optima
• Align all sequences to the final model using
Viterbi
Searching
Protein
Database
+
• Scoring
– Log likelihood: Log P(X | M)
– Log odds: Log P(X | M) / P(X | random)
– Length-normalization
• Is the matching biologically
interesting?
– How does the score compare with those
for sequences already in the family?
– How does the score compare with those
for random sequences?
Score for each protein
Example: modeling and searching
for globins
• 300 randomly picked globin sequence
• Build a profile HMM from scratch (without
pre-alignment)
• Align 60,000 proteins to the HMM
• Even after length normalization, LL is still lengthdependent
• Log-odd score provides better separation
– Takes amino acid composition into account
– Real globins could have scores less than 0
• Estimate mean score and standard deviation for nonglobin sequences for each length
• Z-score = (raw score – mean) / (standard deviation)
– Z-score is length-invariant
– Real globins have positive scores
Gene prediction
Gene structure
exon1
intron1
exon2
intron2
exon3 Intergenic
5’
3’
transcription
splicing
translation
Exon: coding
Intron: non-coding
Intergenic: non-coding
Finding genes
GATCGGTCGAGCGTAAGCTAGCTAG
ATCGATGATCGATCGGCCATATATC
ACTAGAGCTAGAATCGATAATCGAT
CGATATAGCTATAGCTATAGCCTAT
Human
Fugu
worm
E.coli
As the coding/non-coding length ratio decreases, exon
prediction becomes more complex
Gene prediction in prokaryote
• Finding long ORFs (open reading frame)
• An ORF may not contain stop codons
– Average ORF length = 64/3
– Expect 300bp ORF per 36kbp
– Actual ORF length ~ 1000bp
• Codon biases
– Some triplets are used more frequently than
others
– Codon third position biases
HMM for eukaryote gene finding
• Basic idea is the same: the distributions of
nucleotides is different in exon and other regions
– Alone won’t work very well
• More signals are needed
exon1
intron1
exon2
intron2
exon3
Intergenic
5’
3’
Promoter
ATG
5’-UTR
Splicing acceptor: AG
Splicing donor: GT
STOP Poly-A
3’-UTR
• How to combine all the signal together?
Simplest model
Intergenic
4 emission
probability
exon
64 triplets
emission
probabilities
intron
4 emission
probability
Actually more accurate at the di-aminoacid level, i.e. 2 codons. Many methods
use 5th-order Markov model for all regions
• Exon length may not be exact multiple of 3
• Basically have to triple the number of states to remember the excess
number of bases in the previous state
More detailed model
Single exon
Init
exon
intron
Intergenic
Term
exon
Internal
Exon
Sub-models
CDS: coding sequence
Init exon
Term exon
5’-UTR
START
CDS
CDS
STOP
3’-UTR
• START, STOP are PWMs
• Including start and stop codons and surrounding bases
Sub-model for intron
Intron
Splice donor
Intron
body
Splice
acceptor
• Sequence logos: an informative display of PWMs
• Within each column, relative height represents probability
• Height of each column reflects “information content”
Duration modeling
• For any sub-path, the probability consists
of two components
– The product of emission probabilities
• Depend on symbols and state path
– The product of transition probabilities
• Depend on state path
Duration modeling
• Model a stretch of DNA for which the distribution
does not change for a certain length
• The simplest model implies that
P(length = L) = pL-1(1-p)
• i.e., length follows geometric distribution
– Not always appropriate
P
Duration: the number of
times that a state is used
consecutively without
visiting other states
p
s
1-p
L
Duration models
P
s
s
s
s
1-p
Min, then geometric
P
P
P
P
s
s
s
s
1-p
1-p
Negative binominal
1-p
1-p
Explicit duration modeling
Exon
Generalized HMM.
Often used in gene
finders
Intron
Intergenic
P(A | I) = 0.3
P(C | I) = 0.2
P(G | I) = 0.2
P(T | I) = 0.3
P
L
Empirical intron length distribution
Explicit duration modeling
x1 x2 x3 ………………………………………..xN
1
2
Pk(i)
K
• For each position j and each state i
– Need to consider the transition from all previous
positions
• Time: O(N2K2)
• N can be 108
Speedup GHMM
• Restrict maximum duration length to be L
– O(LNK2)
• However, intergenic and intron can be quite long
– L can be 105
• Compromise: explicit duration for exons only,
geometric for all other states
• Pre-compute all possible starting points of ORFs
– For init exon: ATG
– For internal/terminal exon: splice donor signal (GT)
GeneScan model
Approaches to gene finding
• Homology
– BLAST, BLAT, etc.
• Ab initio
– Genscan, Glimmer, Fgenesh, GeneMark, etc.
– Each one has been tuned towards certain organisms
• Hybrids
– Twinscan, SLAM
– Use pair-HMM, or pre-compute score for potential
coding regions based on alignment
• None are perfect, never used alone in practice
Current status
• More accurate on internal exons
• Determining boundaries of init and term
exons is hard
• Biased towards multiple-exon genes
• Alternative splicing is hard
• Non-coding RNA is hard
• State of the Art:
– predictions ~ 60% similar to real proteins
– ~80% if database similarity used
– lab verification still needed, still expensive
HMM wrap up
• We’ve talked about
–
–
–
–
–
Probability, mainly Bayes Theorem
Markov models
Hidden Markov models
HMM parameter estimation given state path
Decoding given HMM and parameters
• Viterbi
• F-B
– Learning
• Baum-Welch (Expectation-Maximization)
• Viterbi
HMM wrap up
• We’ve also talked about
– Extension to gHMMs
– HMM for multiple sequence alignment
– gHMM for gene finding
• We did not talk about
– Higher-order Markov models
– How to escape from local optima in learning