Markov Chain


Probabilistic Models
Consider our FMR triplet repeat finite state automaton...
[Figure: finite state automaton with start state S and numbered states 1–8, connected by transitions labeled with the nucleotides g, c, g, c, g, g, c, t]
While the finite state automaton above is
non-deterministic, it is NOT probabilistic.
Sequences under test are either elements of the set of
“accepted” sequences, or are not (“rejected”)
What makes a model of DNA or protein probabilistic?
Probabilistic Models
Consider our FMR triplet repeat finite state automaton...
The key property of a probabilistic model is that
the model must define a probability distribution
over the set of possible sequences
Um, OK. What the heck does that actually mean?
Let’s review some more basic concepts of probability
Probability Distributions
First a formal description…
A probability distribution Pr{} on a sample space S is a
mapping from “events” of S to the real numbers such that
the following axioms are satisfied:
• Pr{A} ≥ 0 for any event A (i.e. probability of event A)
• Pr{S} = 1 (S here is the “certain event”)
• Pr{A ∪ B} = Pr{A} + Pr{B} for mutually exclusive events
An “event” is just some subset of the sample space
Probability Distributions
Sample spaces and events can conveniently be
conceptualized as a Venn diagram
[Venn diagram: sample space S containing disjoint events A and B, with A ∩ B = ∅]
Mutually exclusive events A and B here clearly don’t overlap (their
“intersection is empty”), and the probability of their union is the sum of
their individual probabilities: Pr{A ∪ B} = Pr{A} + Pr{B}
Probability Distributions
Now an example
Consider the following sample space S:
S = {AA, AC, AG, AT, CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT}
• Each element s in S (s ∈ S) here is a dinucleotide.
• Sample space here is the set of all possible dinucleotides.
Pr{⋃ᵢ Aᵢ} = ∑ᵢ Pr{Aᵢ}
The above formula is just a generalization of our axiom
Pr{A ∪ B} = Pr{A} + Pr{B} for mutually exclusive events
Probability Distributions
[Venn diagram: the sample space S drawn as a large box tiled by the 16 dinucleotides]
AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT
S is represented by the entire large box. We make the
“normalizing assumption” that its area is 1 in order to satisfy axiom 2.
In this distribution, there are 16 possible mutually exclusive elemental
events, that together “paint” the entire sample space S without overlap.
This distribution has the additional property of being uniform
Probability Distributions
[Venn diagram: the same 16-dinucleotide box, split into event A, “Contains C” (AC, CA, CC, CG, CT, GC, TC), and event B, “No C” (the remaining nine dinucleotides)]
Events need not always be elemental. Here mutually exclusive events A
and B are shown and the axioms (as they must) are still satisfied
Discrete Probability Distributions
Many of the probability distributions we will be
considering have the additional property of being discrete*
Consider, for example, that for DNA strings of some
definite finite length n, there are only 4ⁿ possibilities to
consider
Pr{A} = ∑s∈A Pr{s}
A probability distribution is discrete if it is defined
over a finite or “countably infinite” sample space
* discrete, as in separate; distinct; individual. NOT discreet, as in “I won’t tell your wife what you did last Friday”
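As a concrete check of this definition, here is a short Python sketch (variable names are illustrative) that computes Pr{A} by summing elemental probabilities over the uniform dinucleotide sample space from the earlier slides:

```python
from itertools import product

# Sample space: all 16 dinucleotides, uniform probability 1/16 each
sample_space = ["".join(p) for p in product("ACGT", repeat=2)]
pr = {s: 1 / 16 for s in sample_space}

# Event A: dinucleotides containing at least one "C"
event_a = [s for s in sample_space if "C" in s]

# Pr{A} = sum over s in A of Pr{s}
pr_a = sum(pr[s] for s in event_a)
print(len(event_a), pr_a)  # 7 elemental events, Pr{A} = 7/16
```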
Unconditional Probability
Consider statements of the form:
“The chance of snow is 60%” or
“The test is 99% accurate”
These are unconditional statements, so we can say:
Pr{H} ≡ total probability of event H (over the set S of all possible events)

Pr{H} = |H| / |𝑆|

Using intuitive notions of probability, we can express this
as the counts of events where H occurred, over the total
number of possible events in S
Unconditional Probability
Statements of this kind can conveniently be
conceptualized as a Venn diagram
[Venn diagram: event H as a region inside sample space S]
Using intuitive notions of probability, we can express this
as the counts of events where H occurred, over the total
number of possible events in S
Conditional Probability
More often than not, we wish to express probabilities
conditionally. i.e. we specify the assumptions or
conditions under which it was measured
Conditional statements of probability take the form:
Pr{H|O} ≡ probability that event
H occurred in the subset of
cases where event O also
occurred
Conditional Probability
Again, this can be represented as a Venn diagram
[Venn diagram: overlapping events H and O inside sample space S]
The intersection H ∩ O represents events
where both H and O occurred
Conditional Probability
More often than not, we wish to express probabilities
conditionally. i.e. we specify the assumptions or
conditions under which it was measured
We can express the idea shown in the Venn diagram as:
Pr{H|O} = |H ∩ O| / |𝑂|
Note that in the “universe of possibilities”, O has effectively
replaced S. Our probability for event H is now conditional
on the assumption that event O has already taken place.
Joint Probability
Let’s discuss the idea of joint probability, which
expresses the idea that both events O and H occur
In joint probability, the intersection of events are
expressed relative to the entire sample space S:
Pr{H∩O} = |H ∩ O| / |𝑆|
A common alternative notation: P(H,O) = Pr{H∩O}
Joint Probability
Let’s discuss the idea of joint probability
i.e. both events O and H occur
There’s no reason we can’t rewrite this as:
Pr{H∩O} = |H ∩ O| / |𝑆| = ( |H ∩ O| / |𝑂| ) · ( |𝑂| / |𝑆| )
But wait… these new terms look awfully familiar!
Conditional & Joint Probability
Let’s discuss the idea of joint probability
i.e. both events O and H occur
Pr{H∩O} = |H ∩ O| / |𝑆| = Pr{H|O} Pr{O}
In other words, we can express joint probability in terms of
their separate conditional and unconditional probabilities!
This key result turns out to be exceedingly useful!
Conditional Probability
It gets even better! Dig it!
The intersection operator makes no assertion
regarding order
Pr{H∩O} =Pr{H|O}Pr{O}
=Pr{O|H}Pr{H}
We can therefore express everything only in terms of
reciprocal conditional and unconditional probabilities:
Pr{O|H}Pr{H} =Pr{H|O}Pr{O}
We’ll have a lot of fun* with this later!
*
So. Much. Fun.
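This identity is easy to verify by brute-force counting. Below is a small Python sketch over the dinucleotide sample space from earlier; the particular events H and O chosen here are arbitrary illustrations, not from the slides.

```python
from itertools import product

# Sample space: all 16 dinucleotides, each equally likely
space = ["".join(p) for p in product("ACGT", repeat=2)]

H = {s for s in space if s[0] == "C"}   # event H: first base is C
O = {s for s in space if "G" in s}      # event O: contains at least one G

pr = lambda e: len(e) / len(space)           # Pr{E} = |E| / |S|
pr_cond = lambda a, b: len(a & b) / len(b)   # Pr{A|B} = |A ∩ B| / |B|

# Joint probability both ways, per Pr{O|H}Pr{H} = Pr{H|O}Pr{O}
joint = pr(H & O)
assert abs(joint - pr_cond(H, O) * pr(O)) < 1e-12
assert abs(joint - pr_cond(O, H) * pr(H)) < 1e-12
```

Either factorization recovers the same joint probability, exactly as the derivation promises.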
Sequences considered probabilistically
What is the probability of some sequence of events x?
If it was generated from a probabilistic model it can
always be written:
The probability of observing sequence of states x...
P(x) = P(xL, xL-1, … ,x1)
...is equal to the probability that the XLth state was
whatever AND the XL-1th state was whatever else,
AND etc., etc. until we reach the first position
This is familiar as statement of joint probability
But how to proceed?
Sequences considered probabilistically
What is the probability of chain of events x?
We could try repeatedly applying our rules of
conditional probability…
P(x) = P(xL, xL-1, … ,x1)
= P(xL | xL-1, … ,x1) P(xL-1 | xL-2, … ,x1) ... P(x1)
The probability of events X AND Y happening is equal to
the probability of X happening given that Y has already
happened, times the probability of event Y
Remember our result P(X,Y) = P(X|Y) * P(Y)
Sequences considered probabilistically
What is the probability of chain of events x?
We could try repeatedly applying our rules of
conditional probability…
P(x) = P(xL, xL-1, … ,x1)
= P(xL | xL-1, … ,x1) P(xL-1 | xL-2, … ,x1) ... P(x1)
This is still incredibly yucky, but at least we
now have a separate probability term for
each position in our sequence
Is this the best we can do? Let’s digress for a moment…
Markov Chains
A traffic light considered as a sequence of states
Prg = 1
Pgy = 1
Pyr = 1
A trivial Markov chain – the transition probability between
the states is always 1
Markov Chains
A traffic light considered as a sequence of states
If we watch our traffic light, it will emit a string of states
In the case of a simple Markov model,
the state labels (e.g. green, red, yellow)
are the observable outputs of the process
Markov Chains
An occasionally malfunctioning traffic light!!
Pgy = 1   Prg = 0.85   Pry = 0.15   Pyr = 0.9   Pyg = 0.10
The Markov property is that the probability of observing next a given
future state depends only on the current state! For this reason we say
that a Markov chain is a memoryless process.
Markov Chains
The Markov Property
English Translation:
The transition probability ast from state s to state t…

ast = P(xi = t | xi-1 = s)

…is equal to the probability that the ith state was t,
given that the immediately preceding state (xi-1) was s
You should recognize this as a statement of conditional probability!
Markov Chains
The Markov Property
ast = P(xi = t | xi-1 = s)
Another way to look at this is to say
that the conditional probability
distribution for the system’s next step
depends only on the current state, not
on any prior state or states.
There is no xi-2 in this equation!
Markov Chain
An occasionally malfunctioning traffic light!!
If we know the transition probabilities, we may already feel intuitively that
some outcomes are more likely to have been produced by our model than
others…..
But can we calculate the probability of an observed sequence?
Markov Chains
Can they help simplify our statement of probability for a sequence?
P(x) = P(xL | xL-1, … ,x1) P(xL-1 | xL-2, … ,x1) ... P(x1)
Remember, the key property of a Markov Chain is
that probability of symbol xi depends ONLY on the
value of preceding symbol Xi-1!!
Therefore:
P(x) = P(xL | xL-1) P(xL-1 | xL-2) ... P(x2|x1) P(x1)
P(x) = P(x1) ∏i=2..L a(xi-1, xi)
Markov Chains
Scoring an arbitrary sequence
If we know the transition probabilities, our formula tells us
exactly how to calculate the probability of a sequence of
unknown origin:
P(x) = P(x1) ∏i=2..L a(xi-1, xi)
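A minimal Python sketch of this scoring formula. The transition table and the uniform initial probability P(x1) = 0.25 are illustrative assumptions, not estimates from real data:

```python
# Illustrative first-order transition probabilities (rows sum to 1)
transitions = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.4, "C": 0.2, "G": 0.25, "T": 0.15},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.4, "C": 0.15, "G": 0.25, "T": 0.2},
}

def sequence_probability(x, transitions, p_first=0.25):
    """P(x) = P(x1) * product over i=2..L of a(x[i-1], x[i])."""
    p = p_first
    for prev, cur in zip(x, x[1:]):
        p *= transitions[prev][cur]
    return p

print(sequence_probability("ACGT", transitions))  # 0.25*0.2*0.25*0.3 = 0.00375
```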
Markov Chains
How about nucleic acid sequences?
A
C
G
T
No reason why nucleic acid sequences found in an
organism cannot be modeled using Markov chains
Markov Model
What do we need to probabilistically model DNA sequences?
States
Transition
probabilities
A
C
G
T
The states are the same for all organisms, so the transition
probabilities are the model parameters (θ) we need to estimate
Markov Models
What do we need to probabilistically model DNA sequences?
A
C
e
S
G
T
As with transformational grammars, we can also
model special start and end states
Markov Models
Which model best explains a newly observed sequence?
A
C
A
C
G
T
G
T
Organism A
Organism B
Each organism will have different transition probability
parameters, so you can ask “was the sequence more likely
to be generated by model A or model B?”
Markov Models
Using Markov chains for discrimination
S(x) = log [ P(x|θA) / P(x|θB) ] = ∑i=1..L log [ aA(xi-1, xi) / aB(xi-1, xi) ]
A commonly used metric for discrimination using
Markov Chains is the Log-Odds ratio. Odds ratios
allow us to calibrate our thinking about a probability we
observe under one set of assumptions relative to a
probability given another set of assumptions
We must be careful in interpretation of log-odds ratios -- an
impressive seeming ratio will not necessarily be significant
Markov Models
Using Markov chains for discrimination
S(x) = log [ P(x|θA) / P(x|θB) ] = log P(x|θA) − log P(x|θB)

= ∑i=1..L log [ aA(xi-1, xi) / aB(xi-1, xi) ] = ∑i=1..L ( log aA(xi-1, xi) − log aB(xi-1, xi) )
Note that we can rewrite this in several ways, and often one
form will be more convenient or efficient in a given context
More on this when we discuss numerical stability
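To see that the ratio form and the difference form are numerically identical, consider this sketch (model A's transition table is invented for illustration; model B is uniform):

```python
import math

# Model A: illustrative biased transitions; Model B: uniform transitions
a_model = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.4, "C": 0.2, "G": 0.25, "T": 0.15},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.4, "C": 0.15, "G": 0.25, "T": 0.2},
}
b_model = {b: {c: 0.25 for c in "ACGT"} for b in "ACGT"}

def score_ratio_form(x):
    # sum of log(aA / aB) across adjacent pairs
    return sum(math.log(a_model[p][c] / b_model[p][c]) for p, c in zip(x, x[1:]))

def score_difference_form(x):
    # log P(x|A) - log P(x|B), with identical P(x1) terms cancelling
    la = sum(math.log(a_model[p][c]) for p, c in zip(x, x[1:]))
    lb = sum(math.log(b_model[p][c]) for p, c in zip(x, x[1:]))
    return la - lb

x = "GATTACA"
assert abs(score_ratio_form(x) - score_difference_form(x)) < 1e-12
```

The difference form can be preferable in practice because the per-term logs can be precomputed and the quantities stay well-scaled.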
Some unfinished business
Does a Markov Chain really define a probability distribution
over the entire space of sequences?
In other words, is the sum of the probability of all
possible sequences of any length equal to one?
This turns out to be just slightly slippery in the case
where we model the ends, but it’s not too bad in the case
of a definite length L. We’ll address this in two ways:
• Primarily by simulation in Python
• But also have a go at convincing yourself:
1 = ∑{x} 𝑃(𝑥) = ∑x1 ∑x2 … ∑xL 𝑃(𝑥1) ∏i=2..L a(xi-1, xi)
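One way to convince yourself by simulation: enumerate every sequence of a definite length L and sum their probabilities. A sketch, assuming an illustrative transition table whose rows each sum to 1:

```python
from itertools import product

# Illustrative transition probabilities (each row sums to 1)
transitions = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.4, "C": 0.2, "G": 0.25, "T": 0.15},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.4, "C": 0.15, "G": 0.25, "T": 0.2},
}
p_first = {b: 0.25 for b in "ACGT"}  # uniform initial distribution

def p_seq(x):
    p = p_first[x[0]]
    for a, b in zip(x, x[1:]):
        p *= transitions[a][b]
    return p

# Sum P(x) over all 4^L sequences of length L
L = 4
total = sum(p_seq("".join(x)) for x in product("ACGT", repeat=L))
print(total)  # 1.0 (up to floating-point rounding)
```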
The “Occasionally Dishonest Casino”
Consider a casino that uses two dice, one fair, one loaded.
Once in a while, they sneakily switch the dice…
You watch for a while and observe the
following sequence of rolls…..
1 2 2 6 4 3 2 1 5 6 6 6 5 1 6 2 6
Would you know which die the casino was using?
Would you know if they switched at some point?
Your first problem……
The fair and cheat die have the SAME observables!!
DNA and Amino Acids are like our dice
We see the same symbols regardless of any functional association!
QATNRNTDGS
RWWCNDGRTP
SALLSSDITA
DGNGMNAWVA
TDYGILQINS
GSRNLCNIPC
SVNCAKKIVS
WRNRCKGTDQ
Can we tell by casual examination which segment of
this amino acid sequence corresponds to a transmembrane domain and which does not?
The amino acids are not labelled
Hidden Markov Models
What does Hidden refer to?
A
C
G
T
In our simple Markov model, we cannot say anything about
functional associations based on observed symbols..
Transition probabilities don’t reflect the underlying state of the system
Hidden Markov Models
Can we alter our model to accommodate alternative states for each observable?
[Figure: eight states A+, C+, G+, T+ and A−, C−, G−, T−, with transitions between the “+” and “−” sets. Figure after Durbin, et al.]
The complete set of “within set” transitions are omitted here for clarity. It’s still yucky!
Hidden Markov Models
The clearest representation is to separate transitions from emissions of symbols
[Figure: the two-state model]
State “+”: emits A: 0.30, C: 0.25, G: 0.15, T: 0.30; stays in “+” with probability 0.9, moves to “-” with 0.1
State “-”: emits A: 0.20, C: 0.35, G: 0.25, T: 0.20; stays in “-” with probability 0.99, moves to “+” with 0.01
Each state now has its own table of emission probabilities
and transitions now occur strictly between states
Hidden Markov Models
We now must incorporate the concept of a
state path
1 2 2 6 4 3 2 1 5 6 6 6 5 1 6 2 6
F F F F F F F F L L L L L L L F F
A A T A C A C G G C T A G C T A A
- - - - - - - + + + + + + + + -
Often, the true state path (often denoted π) associated
with sequences of interest will be hidden from us
Three Classic Problems of HMMs
We will study algorithms that address each of these problems
1. The learning problem
•
Given a model and some observed sequences, how do
we rationally choose or adjust (i.e. estimate) the model
parameters?
2. The evaluation problem
•
Given some HMM, and some observed sequences, what
is the probability that our model could have produced
those sequences? i.e. what is Pr{observation X | θ }
3. The decoding problem
•
Given some HMM and some observed sequences,
what state path through the model was most likely to
have produced those observations?
The Evaluation Problem
With HMMs, if we forget about the symbols that are emitted, the state
path still forms a Markov chain and the Markov property applies
The transition probability akl from state k to state l…

akl = P(πi = l | πi-1 = k)

…is equal to the probability that the ith state was l,
given that the immediately preceding state (πi-1) was k
This statement of conditional probability is in exactly the same
form as was shown in our first Markov chain example
The Evaluation Problem
What about the emission probabilities ?
The probability of emitting symbol b when in state k…

ek(b) = P(xi = b | πi = k)

…is equal to the probability that the ith symbol is b,
given that the current state (πi) is k
The probability that a particular symbol will be emitted is now
dependent on what state we are currently in! In other words, emission
probabilities are conditional on state.
Hidden Markov Models
Imagine we run this model generatively
[Figure: the two-state model with a start state S]
S: moves to “-” with probability 0.9, to “+” with 0.1
State “+”: emits A: 0.30, C: 0.25, G: 0.15, T: 0.30; stays in “+” with 0.9, moves to “-” with 0.1
State “-”: emits A: 0.20, C: 0.35, G: 0.25, T: 0.20; stays in “-” with 0.99, moves to “+” with 0.01
What general sequence of events would occur?
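One possible sketch of running the model generatively in Python. The probabilities are read off the figure above; the dictionary layout and the `generate` helper are illustrative choices, not from our course code:

```python
import random

# Transition and emission tables read from the figure (S is the start state)
transitions = {
    "S": {"+": 0.1, "-": 0.9},
    "+": {"+": 0.9, "-": 0.1},
    "-": {"-": 0.99, "+": 0.01},
}
emissions = {
    "+": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    "-": {"A": 0.20, "C": 0.35, "G": 0.25, "T": 0.20},
}

def generate(length, rng=random.Random(0)):
    """Alternate transition steps and emissions: pick the next state,
    then emit one symbol from that state's emission distribution."""
    state, path, seq = "S", [], []
    for _ in range(length):
        states, probs = zip(*transitions[state].items())
        state = rng.choices(states, probs)[0]      # weighted state choice
        path.append(state)
        symbols, eprobs = zip(*emissions[state].items())
        seq.append(rng.choices(symbols, eprobs)[0])  # weighted emission
    return "".join(seq), "".join(path)

seq, path = generate(20)
print(seq)
print(path)
```

In general the model tends to start in “-”, linger there for long stretches (self-transition 0.99), and occasionally wander into “+”, emitting one symbol per step throughout.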
The Evaluation Problem
Putting together the state transitions and the symbol emissions
The joint probability of some observed sequence x and
some state sequence π…

P(x, π) = a(S→π1) ∏i=1..L eπi(xi) · a(πi→πi+1)
i=1
…is equal to the probability of transitioning from Start to the first state..
..times a running product across all positions in the sequence of…
… the state specific emission probability times the probability of
transitioning to the next state
What we are really interested in knowing is the joint probability of the
sequence and the state path. But what if the state path is hidden?
The Evaluation Problem
Scoring with known state sequence….
[Figure: the start state S and states “+” and “-”, with the transition and emission probabilities given previously]

Sequence under test:
_ACGCT
S---++

P(x, π) = 0.9 · (0.2 · 0.99) · (0.35 · 0.99) · (0.25 · 0.01) · (0.25 · 0.9) · 0.3
        = 0.9 · 0.198 · 0.3465 · 0.0025 · 0.225 · 0.3
        = 1.042×10⁻⁵  ← already a small number!
This is the procedure when we have a known state sequence – but
how could we approach this question if the state path was unknown??
We’ll revisit this last question another day…
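The known-state-path calculation can be reproduced in a few lines of Python. The dictionary layout mirrors our earlier data structures but is an assumption for this sketch:

```python
# Transition and emission tables from the two-state model with start state S
transitions = {
    "S": {"+": 0.1, "-": 0.9},
    "+": {"+": 0.9, "-": 0.1},
    "-": {"-": 0.99, "+": 0.01},
}
emissions = {
    "+": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    "-": {"A": 0.20, "C": 0.35, "G": 0.25, "T": 0.20},
}

def joint_probability(seq, path):
    """P(x, pi): transition into each state, then emit that state's symbol.
    No trailing transition is scored after the final emission."""
    p, prev = 1.0, "S"
    for state, symbol in zip(path, seq):
        p *= transitions[prev][state] * emissions[state][symbol]
        prev = state
    return p

print(joint_probability("ACGCT", "---++"))  # ≈ 1.042e-05, matching the slide
```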
Alternative python transition dicts
Which reflects a uniform distribution?
self.transitions = {
    "A": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

self.transitions = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.4, "C": 0.2, "G": 0.25, "T": 0.15},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.4, "C": 0.15, "G": 0.25, "T": 0.2},
}
I’ll send you some test sequences, and your program will tell me which
of these models was more likely to have produced the sequence!
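One way such a program might look. The comparison sequence and the `log_likelihood` helper are illustrative; per the log-odds formula, identical initial-state terms cancel in the comparison:

```python
import math

# The two candidate transition models from the slide above
uniform = {b: {c: 0.25 for c in "ACGT"} for b in "ACGT"}
biased = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.4, "C": 0.2, "G": 0.25, "T": 0.15},
    "G": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
    "T": {"A": 0.4, "C": 0.15, "G": 0.25, "T": 0.2},
}

def log_likelihood(x, transitions, log_p_first=math.log(0.25)):
    # log P(x1) plus the sum of log transition probabilities
    return log_p_first + sum(math.log(transitions[a][b])
                             for a, b in zip(x, x[1:]))

x = "ACACACAC"  # an illustrative test sequence
better = "biased" if log_likelihood(x, biased) > log_likelihood(x, uniform) else "uniform"
print(better)
```

For this particular sequence the second model assigns higher likelihood, since its A→C and C→A transitions are favored; a different test sequence could easily favor the uniform model.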
Our python HMM data structures
Here are distributions for a dishonest casino…
self.transitions = {
    "S": {"F": 0.5, "L": 0.5},
    "F": {"F": 0.95, "L": 0.05},
    "L": {"L": 0.90, "F": 0.10},
}

self.emissions = {
    "S": {"": 1},  # "S" is start: always emit the null string!
    # 'F' indicates a fair die
    "F": {"1": 1 / 6, "2": 1 / 6, "3": 1 / 6, "4": 1 / 6, "5": 1 / 6, "6": 1 / 6},
    # 'L' indicates a loaded die
    "L": {"1": 1 / 10, "2": 1 / 10, "3": 1 / 10, "4": 1 / 10, "5": 1 / 10, "6": 1 / 2},
}
Let’s adapt your simple Markov program to first work with our dice example…..
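As a starting point, here is one possible sketch of scoring an observed roll sequence against a proposed state path using these dictionaries. The helper name and the candidate paths are illustrative assumptions:

```python
# Dishonest-casino tables from the slide above
transitions = {
    "S": {"F": 0.5, "L": 0.5},
    "F": {"F": 0.95, "L": 0.05},
    "L": {"L": 0.90, "F": 0.10},
}
emissions = {
    "F": {str(n): 1 / 6 for n in range(1, 7)},                    # fair die
    "L": {**{str(n): 1 / 10 for n in range(1, 6)}, "6": 1 / 2},   # loaded die
}

def joint_probability(rolls, path):
    """P(x, pi) for a roll string and an equal-length F/L state path."""
    p, prev = 1.0, "S"
    for state, roll in zip(path, rolls):
        p *= transitions[prev][state] * emissions[state][roll]
        prev = state
    return p

rolls = "12264321566651626"           # the observed rolls from earlier
p_fair = joint_probability(rolls, "F" * len(rolls))   # casino never switched
p_mixed = joint_probability(rolls, "FFFFFFFFLLLLLLLFF")  # a guessed switch
print(p_fair, p_mixed)
```

Comparing such joint probabilities across candidate paths is exactly the intuition the decoding problem formalizes.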