Transcript class05-ms

Markov Chains
Lecture #5
Background Readings: Durbin et al., Section 3.1; Polanski & Kimmel, Section 2.8.
Prepared by Shlomo Moran, based on Danny Geiger's and Nir Friedman's.
Dependencies along the genome
In previous classes we assumed that every letter in a sequence is sampled independently from some distribution q() over the alphabet {A,C,T,G}.
This is too restrictive for true genomes.
1. There are special subsequences in the genome, like TATA within the regulatory area, upstream of a gene.
2. The pattern CG is less common than expected under independent random sampling.
We model such dependencies by Markov chains and hidden Markov models, which we define next.
2
Finite Markov Chain
An integer-time stochastic process, consisting of a domain D of m > 1 states {s1,…,sm} and
1. An m-dimensional initial distribution vector (p(s1),…,p(sm)).
2. An m×m transition probability matrix M = (m_{s_i s_j}).
For example, D can be the letters {A, C, T, G}, p(A) the probability that A is the 1st letter in a sequence, and m_{AG} the probability that G follows A in a sequence.
3
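To make the definition concrete, here is a minimal sketch in Python (assuming NumPy; all numerical values are illustrative placeholders, not data from the lecture) of a chain over {A,C,G,T} given by an initial distribution vector and a transition matrix:

import numpy as np

# A chain over the DNA alphabet: initial distribution + transition matrix.
# All numerical values below are illustrative placeholders.
states = ["A", "C", "G", "T"]

# Initial distribution vector (p(s1),...,p(sm)); entries sum to 1.
p0 = np.array([0.25, 0.25, 0.25, 0.25])

# Transition probability matrix M = (m_st): row s gives the distribution
# of the next letter given the current letter s; each row sums to 1.
M = np.array([
    [0.30, 0.20, 0.30, 0.20],  # from A
    [0.25, 0.35, 0.10, 0.30],  # from C (small C->G entry, as in real genomes)
    [0.25, 0.25, 0.30, 0.20],  # from G
    [0.20, 0.25, 0.30, 0.25],  # from T
])

assert np.allclose(M.sum(axis=1), 1.0)  # M is a stochastic matrix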
Markov Chain (cont.)
[Chain diagram: X1 → X2 → … → Xn-1 → Xn]
• For each integer n, a Markov Chain assigns probability to sequences (x1…xn) in D^n (i.e., each xi ∈ D) as follows:

p(x_1, x_2, \ldots, x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} m_{x_{i-1} x_i}
4
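Using this definition directly, a short sketch (reusing the illustrative states, p0 and M from the previous example) computes the probability the chain assigns to a concrete sequence:

def sequence_probability(seq, p0, M, states):
    """p(x1,...,xn) = p(x1) * prod_{i>=2} m_{x_{i-1} x_i}."""
    idx = {s: i for i, s in enumerate(states)}
    prob = p0[idx[seq[0]]]                 # p(X1 = x1)
    for prev, cur in zip(seq, seq[1:]):
        prob *= M[idx[prev], idx[cur]]     # m_{x_{i-1} x_i}
    return prob

# Example: probability of the string "ACGT" under the illustrative chain above.
# print(sequence_probability("ACGT", p0, M, states))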
Markov Chain (cont.)
[Chain diagram: X1 → X2 → … → Xn-1 → Xn]
Similarly, each Xi is a probability distribution over D, which is determined by the initial distribution (p1,…,pm) and the transition matrix M.
There is a rich theory which studies the properties of such “Markov sequences” (X1,…, Xi ,…). A bit of this theory is presented next.
5
Matrix Representation
The transition probability matrix M = (m_st):

        A     B     C     D
  A   0.95   0    0.05   0
  B   0.2   0.5    0    0.3
  C    0    0.2    0    0.8
  D    0     0     1     0

M is a stochastic matrix: \sum_t m_{st} = 1 for every state s.
The initial distribution vector (u1,…,um) defines the distribution of X1: p(X1 = si) = ui.
After one move, the distribution is changed to X2 = X1 M.
6
Matrix Representation
[Same transition matrix M as above, over the states A, B, C, D]
Example: if X1 = (0, 1, 0, 0) then X2 = (0.2, 0.5, 0, 0.3).
And if X1 = (0, 0, 0.5, 0.5) then X2 = (0, 0.1, 0.5, 0.4).
The i-th distribution is Xi = X1 M^{i-1}.
Basic question: what can we tell about the distributions Xi as i grows?
7
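A short sketch (assuming NumPy, using the 4-state transition matrix above) of computing X_i = X_1 M^{i-1} and watching it approach a limiting distribution:

import numpy as np

# The transition matrix from the slide (states A, B, C, D).
M = np.array([
    [0.95, 0.0, 0.05, 0.0],
    [0.2,  0.5, 0.0,  0.3],
    [0.0,  0.2, 0.0,  0.8],
    [0.0,  0.0, 1.0,  0.0],
])

def distribution_at_step(x1, M, i):
    """Return X_i = X_1 * M^(i-1) as a row vector."""
    return x1 @ np.linalg.matrix_power(M, i - 1)

x1 = np.array([0.0, 1.0, 0.0, 0.0])       # start at state B
print(distribution_at_step(x1, M, 2))      # -> [0.2, 0.5, 0. , 0.3]
print(distribution_at_step(x1, M, 100))    # close to the limiting distribution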
Representation of a Markov Chain as a Digraph
[Digraph of the same chain: edges A→A (0.95), A→C (0.05), B→A (0.2), B→B (0.5), B→D (0.3), C→B (0.2), C→D (0.8), D→C (1)]
Each directed edge A→B is associated with the positive transition probability from A to B.
8
Properties of Markov Chain states
States of Markov chains are classified by the digraph representation (omitting the actual probability values).
A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph.
B is not recurrent – it is a transient state.
Alternative definition: a state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient.
9
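The alternative definition translates directly into a small sketch (not from the lecture; the example edge set is hypothetical, chosen to match the description above: A has a self-loop, B is transient, and {C, D} form a sink component):

def reachable(graph, start):
    """Set of states reachable from `start` (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        for v in graph[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def classify_states(graph):
    """s is recurrent iff s is reachable back from every state reachable from s."""
    reach = {s: reachable(graph, s) for s in graph}
    return {s: ("recurrent" if all(s in reach[t] for t in reach[s]) else "transient")
            for s in graph}

example = {"A": ["A"], "B": ["A", "D"], "C": ["D"], "D": ["C"]}
print(classify_states(example))
# -> {'A': 'recurrent', 'B': 'transient', 'C': 'recurrent', 'D': 'recurrent'}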
Another example of Recurrent and Transient States
A and B are transient states, C and D are recurrent states.
Once the process moves from B to D, it will never come back.
10
Irreducible Markov Chains
A Markov Chain is irreducible if the corresponding graph is strongly connected (and thus all its states are recurrent).
11
Periodic States
A state s has a period k if k is the GCD of the lengths of all the cycles that pass through s (in the graph shown, the period of A is 2).
Exercise: All the states in the same strongly connected component have the same period.
A Markov Chain is periodic if all the states in it have a period k > 1. It is aperiodic otherwise.
12
Ergodic Markov Chains
A Markov chain is ergodic if:
1. The corresponding graph is strongly connected.
2. It is not periodic.
Ergodic Markov Chains are important since they guarantee that the corresponding Markovian process converges to a unique distribution, in which each state has a strictly positive probability.
13
Stationary Distributions for Markov Chains
Let M be a Markov Chain of m states, and let V = (v1,…,vm) be a probability distribution over the m states.
V = (v1,…,vm) is a stationary distribution for M if VM = V
(i.e., if one step of the process does not change the distribution).
V is a stationary distribution ⟺ V is a non-negative left (row) eigenvector of M with eigenvalue 1.
14
Stationary Distributions for a Markov Chain M
Exercise: A stochastic matrix always has a real left eigenvector with eigenvalue 1 (hint: show that a stochastic matrix has a right eigenvector with eigenvalue 1, and note that the left eigenvalues of a matrix are the same as its right eigenvalues).
It can be shown that the above eigenvector V can be chosen to be non-negative. Hence each Markov Chain has a stationary distribution.
15
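As a sketch (assuming NumPy and reusing the 4-state matrix M from the digraph example), the stationary distribution can be computed numerically as a left eigenvector with eigenvalue 1, normalized to sum to 1:

import numpy as np

def stationary_distribution(M):
    """Left eigenvector of M with eigenvalue 1, normalized to a distribution."""
    eigvals, eigvecs = np.linalg.eig(M.T)   # left eigenvectors of M = right eigenvectors of M^T
    k = np.argmin(np.abs(eigvals - 1.0))    # pick the eigenvalue closest to 1
    v = np.real(eigvecs[:, k])
    v = np.abs(v)                           # fix the arbitrary sign (valid for an ergodic chain)
    return v / v.sum()

M = np.array([
    [0.95, 0.0, 0.05, 0.0],
    [0.2,  0.5, 0.0,  0.3],
    [0.0,  0.2, 0.0,  0.8],
    [0.0,  0.0, 1.0,  0.0],
])
V = stationary_distribution(M)
print(V, np.allclose(V @ M, V))             # VM = V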
“Good” Markov chains
A Markov Chain is good if the distributions Xi, as i → ∞:
(1) Converge to a unique distribution, independent of the initial distribution.
(2) In that unique distribution, each state has a positive probability.
The Fundamental Theorem of Finite Markov Chains:
A Markov Chain is good ⟺ the corresponding graph is ergodic.
We will prove the ⟹ direction, by showing that non-ergodic Markov Chains are not good.
16
Examples of “Bad” Markov Chains
A Markov chain is “bad” if either:
1. It does not converge to a unique distribution (independent of the initial distribution), or
2. It does converge to a unique distribution, but some states in this distribution have zero probability.
17
Bad case 1: Mutual Unreachability
Consider two initial distributions:
a) p(X1 = A) = 1 (p(X1 = x) = 0 if x ≠ A).
b) p(X1 = C) = 1.
In case a), the sequence will stay at A forever.
In case b), it will stay in {C,D} forever.
Fact 1: If G has two states which are unreachable from each other, then {Xi} cannot converge to a distribution which is independent of the initial distribution.
18
Bad case 2: Transient States
Once the process moves from B to D, it will never come back.
19
Bad case 2: Transient States
Fact 2: For each initial distribution, with probability 1 a transient state will be visited only a finite number of times.
Proof: Let A be a transient state, and let X be the set of states from which A is unreachable. It is enough to show that, starting from any state, with probability 1 a state in X is reached after a finite number of steps. (Exercise: complete the proof.)
20
Corollary: A good Markov Chain is irreducible.
21
Bad case 3: Periodic Markov Chains
Recall: A Markov Chain is periodic if all the states in it have a period k > 1. The chain shown has period 2.
In this chain, consider the initial distribution p(B) = 1.
Then states {B, C} are visited (with positive probability) only in odd steps, and states {A, D, E} are visited only in even steps.
22
Bad case 3: Periodic States
Fact 3: In a periodic Markov Chain (of period k > 1) there are initial distributions under which the states are visited in a periodic manner.
Under such initial distributions Xi does not converge as i → ∞.
Corollary: A good Markov Chain is not periodic.
23
The Fundamental Theorem of Finite Markov Chains:
We have proved that non-ergodic Markov Chains are not good. A proof of the opposite direction, stated below, is based on Perron-Frobenius theory of nonnegative matrices, and is beyond the scope of this course:
If a Markov Chain is ergodic, then
1. It has a unique stationary distribution vector V > 0, which is a left (row) eigenvector of the transition matrix.
2. For any initial distribution, the distributions Xi, as i → ∞, converge to V.
24
Use of Markov Chains in Genome Search: Modeling CpG Islands
In human genomes the pair CG often transforms to (methyl-C)G, which often transforms to TG.
Hence the pair CG appears less often than expected from the independent frequencies of C and G alone.
Due to biological reasons, this process is sometimes suppressed in short stretches of the genome, such as in the start regions of many genes.
These areas are called CpG islands (C-phosphate-G).
25
Example: CpG Island (Cont.)
We consider two questions:
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?
We “solve” the first question by modeling strings with and without CpG islands as Markov Chains over the same states {A,C,G,T} but with different transition probabilities:
26
Example: CpG Island (Cont.)
The “+” model: use transition matrix M+ = (m+_{st}), where:
m+_{st} = the probability that t follows s in a CpG island.
The “-” model: use transition matrix M- = (m-_{st}), where:
m-_{st} = the probability that t follows s in a non-CpG island.
27
Example: CpG Island (Cont.)
With this model, to solve Question 1 we need to decide whether a given short sequence of letters is more likely to come from the “+” model or from the “–” model. This is done using the definition of a Markov Chain, with parameters estimated from known data, together with the log odds-ratio test.
28
Question 1: Using two Markov chains
M+ (for CpG islands):
We need to specify p+(xi | xi-1), where “+” stands for CpG island. From Durbin et al we have:

  Xi-1 \ Xi     A        C         G         T
     A        0.18     0.27      0.43      0.12
     C        0.17   p+(C|C)    0.274    p+(T|C)
     G        0.16   p+(C|G)   p+(G|G)   p+(T|G)
     T        0.08   p+(C|T)   p+(G|T)   p+(T|T)

(Recall: rows must add up to one; columns need not.)
29
Question 1: Using two Markov chains
M- (for non-CpG islands):
…and for p-(xi | xi-1) (where “-” stands for non-CpG island) we have:

  Xi-1 \ Xi     A        C         G         T
     A        0.3      0.2       0.29      0.21
     C        0.32   p-(C|C)    0.078    p-(T|C)
     G        0.25   p-(C|G)   p-(G|G)   p-(T|G)
     T        0.18   p-(C|T)   p-(G|T)   p-(T|T)
30
Discriminating between the two models
[Chain diagram: X1 → X2 → … → XL-1 → XL]
Given a string x = (x1…xL), now compute the ratio

RATIO = \frac{p(x \mid + \text{ model})}{p(x \mid - \text{ model})} = \frac{\prod_{i=0}^{L-1} p^{+}(x_{i+1} \mid x_i)}{\prod_{i=0}^{L-1} p^{-}(x_{i+1} \mid x_i)}

If RATIO > 1, a CpG island is more likely.
Actually, the log of this ratio is computed:
Note: p+(x1|x0) is defined for convenience as p+(x1); p-(x1|x0) is defined for convenience as p-(x1).
31
Log Odds-Ratio test
Taking the logarithm yields

\log Q = \log \frac{p(x_1 \ldots x_L \mid +)}{p(x_1 \ldots x_L \mid -)} = \sum_i \log \frac{p^{+}(x_i \mid x_{i-1})}{p^{-}(x_i \mid x_{i-1})}

If log Q > 0, then + is more likely (CpG island).
If log Q < 0, then - is more likely (non-CpG island).
32
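A minimal sketch (not from the lecture) of the log odds-ratio test; p_plus and p_minus are assumed to be nested dicts of transition probabilities (as could be read off the tables above once all entries are filled in), and pi_plus / pi_minus give the convention p+(x1) and p-(x1):

import math

def log_odds_ratio(x, p_plus, pi_plus, p_minus, pi_minus):
    """log Q = sum_i log( p+(x_i | x_{i-1}) / p-(x_i | x_{i-1}) )."""
    log_q = math.log(pi_plus[x[0]]) - math.log(pi_minus[x[0]])     # the x1 term
    for prev, cur in zip(x, x[1:]):
        log_q += math.log(p_plus[prev][cur]) - math.log(p_minus[prev][cur])
    return log_q

# log_odds_ratio("CGCG", p_plus, pi_plus, p_minus, pi_minus) > 0
# would indicate that the "+" (CpG island) model is more likely.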
Where do the parameters (transition probabilities) come from?
Learning from complete data, namely, when the label is given and every xi is measured:
Source: A collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.
Input: Tuples of the form (x1, …, xL, h), where h is + or -.
Output: Maximum Likelihood parameters (MLE).
Count all pairs (Xi = a, Xi-1 = b) with label + and with label -; say the numbers are N_{ba,+} and N_{ba,-}.
33
Maximum Likelihood Estimate (MLE) of the parameters (using labeled data)
[Chain diagram: X1 → X2 → … → XL-1 → XL]
The needed parameters are:
p+(x1), p+(xi | xi-1), p-(x1), p-(xi | xi-1)
The ML estimates are given by:

p^{+}(X_1 = a) = \frac{N_{a,+}}{\sum_x N_{x,+}}

where N_{x,+} is the number of times letter x appears in CpG islands in the dataset.

p^{+}(X_i = a \mid X_{i-1} = b) = \frac{N_{ba,+}}{\sum_x N_{bx,+}}

where N_{bx,+} is the number of times letter x appears after letter b in CpG islands in the dataset.
Using MLE is justified when we have a large sample. The numbers appearing in our tables (taken from Durbin et al, p. 50) are based on 60,000 nucleotides.
34
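The counting step can be sketched as follows (not from the lecture; the routine is run once on the “+” collection and once on the “-” collection), using the formulas above:

from collections import Counter

def mle_parameters(sequences, alphabet="ACGT"):
    """MLE of p(X1 = a) and p(Xi = a | Xi-1 = b) from one labeled collection."""
    letter_counts = Counter(ch for seq in sequences for ch in seq)   # N_a
    pair_counts = Counter()                                          # N_ba
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    total = sum(letter_counts.values())
    p_init = {a: letter_counts[a] / total for a in alphabet}
    p_trans = {}
    for b in alphabet:
        row_total = sum(pair_counts[(b, x)] for x in alphabet)       # sum_x N_bx
        p_trans[b] = {a: pair_counts[(b, a)] / row_total for a in alphabet}
    return p_init, p_trans

# Hypothetical toy call (real estimates need a large sample, e.g. ~60,000 nucleotides):
# p_init_plus, p_trans_plus = mle_parameters(["ACGTACGT", "TGCATGCA"])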
Question 2: Finding CpG Islands
Given a long genomic string with possible CpG Islands, we define a Markov Chain over 8 states, all interconnected (hence it is ergodic):
A+  C+  G+  T+
A-  C-  G-  T-
The problem is that we don’t know the sequence of states which are traversed, but just the sequence of letters.
Therefore we use here Hidden Markov Models.
36
Hidden Markov Model
[HMM diagram: hidden states S1 → S2 → … → SL-1 → SL with transition probabilities M; each Si emits a symbol xi with emission probabilities T]
A Markov chain over the hidden states (s1,…,sL):

p(s_1, \ldots, s_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1})

and for each state s and symbol x we have p(Xi = x | Si = s).
Application in communication: the message sent is (s1,…,sm), but we receive (x1,…,xm). Compute the most likely message sent.
Application in speech recognition: the word said is (s1,…,sm), but we recorded (x1,…,xm). Compute the most likely word said.
37
Hidden Markov Model
[HMM diagram: hidden states S1 → S2 → … → SL-1 → SL with transition probabilities M; each Si emits a symbol xi with emission probabilities T]
Notations:
Markov Chain transition probabilities: p(Si+1 = t | Si = s) = m_st
Emission probabilities: p(Xi = b | Si = s) = e_s(b)
For Markov Chains we know:

p(s) = p(s_1, \ldots, s_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1})

What is p(s,x) = p(s1,…,sL; x1,…,xL)?
38
Hidden Markov Model
[HMM diagram: hidden states S1 → S2 → … → SL-1 → SL with transition probabilities M; each Si emits a symbol xi with emission probabilities T]
p(Xi = b | Si = s) = e_s(b) means that the probability of xi depends only on the value of si.
Formally, this is equivalent to the conditional independence assumption:
p(Xi = xi | x1,…,xi-1, xi+1,…,xL, s1,…,si,…,sL) = e_{si}(xi)
Thus

p(s, x) = p(s_1, \ldots, s_L ; x_1, \ldots, x_L) = \prod_{i=1}^{L} p(s_i \mid s_{i-1}) \, e_{s_i}(x_i)
39
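To close, a small sketch (not from the lecture; all parameter names and values are hypothetical) of evaluating p(s, x) for a given hidden path and observed sequence, computed in log space for numerical stability:

import math

def log_joint_probability(states_path, symbols, pi, m, e):
    """log p(s, x) given transition probs m[s][t], emission probs e[s][b],
    and an initial distribution pi[s] (playing the role of p(s1 | s0))."""
    assert len(states_path) == len(symbols)
    logp = math.log(pi[states_path[0]]) + math.log(e[states_path[0]][symbols[0]])
    for i in range(1, len(states_path)):
        s_prev, s_cur = states_path[i - 1], states_path[i]
        logp += math.log(m[s_prev][s_cur])          # p(s_i | s_{i-1})
        logp += math.log(e[s_cur][symbols[i]])      # e_{s_i}(x_i)
    return logp

# Usage (hypothetical 2-state example with "+" and "-" states):
# pi = {"+": 0.5, "-": 0.5}
# m = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.1, "-": 0.9}}
# e = {"+": {"C": 0.4, "G": 0.4, "A": 0.1, "T": 0.1},
#      "-": {"C": 0.2, "G": 0.2, "A": 0.3, "T": 0.3}}
# print(log_joint_probability("++--", "CGAT", pi, m, e))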