HMM Viterbi JLR 1

Download Report

Transcript HMM Viterbi JLR 1

HIDDEN
MARKOV
MODELS
Presentation
by
Jeff Rosenberg, Toru Sakamoto, Freeman Chen
1
The Plan
•
•
•
•
•
•
•
Modeling Biological Sequences
Markov Chains
Hidden Markov Models
Issues
Examples
Techniques and Algorithms
Doing it with Mathematica
2
Biological Sequences
FA12_HUMAN
VVGGLVALRGAHPYIAALYWGHSFCAGSLIAPC
TRYP_PIG
IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWV
TRY1_BOVIN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWV
URT1_DESRO
TRY1_SALSA
STGGLFTDITSHPWQAAIFAQNRRSSGERFLCGG
IVGGYECKAYSQTHQVSLNSGYHFCGGSLVNENWV
TRY1_RAT
IVGGYTCPEHSVPYQVSLNSGYHFCGGSLINDQWV
NRPN_MOUSE
COGS_UCAPU
ILEGRECIPHSQPWQAALFQGERLICGGVLVGDRW
IVGGVEAVPNSWPHQAALFIDDMYFCGGSLISPEW
3
Sequences and Models
• Many biological sequences (DNA/RNA, proteins) have very
“subtle” rules for their structure; they clearly form “families”
and are “related,” yet simple measures or descriptions of these
relationships or rules rarely apply
• There is a need to create some kind of “model” that can be
used to identify relationships among sequences and distinguish
members of families from non-members
• Given the complexity and variability of these biological
structures, any practical model must have a probabilistic
component: that is, it will be a stochastic model, rather than a
mechanistic one. It will be evaluated by the (statistical)
accuracy and usefulness of its predictions, rather than the
correspondence of its internal features to any corresponding
internal mechanism in the structures being modeled.
4
Markov Chains
• A system with a set of m possible states, Si;
at each of a sequence of discrete points in
time t>=0, the system is in exactly one of
those states; the state at time t >= 0 is
designated by qt; the movement from qt to
qt+1 is probabilistic, and depends only on
the states of the system at or prior to t.
• An initial state distribution
π(i) = Prob(q0 = Si)
• Process terminates either at time T or when
reaching a designated final state Sf
5
Markov Chains of Order N
• Nth-order Markov chain (N >= 0):
transition probabilities out of state qt depend only on
the values of qt, qt-1,… qt-(N-1).
• Typically deal with 1st-order Markov chain, so only qt
itself affects the transition probabilities.
• In a 1st-order chain, for each state Sj, there is set of m
probabilities for selecting the next state to move to:
ai,j = Prob(qt+1 = Si | qt = Sj) [1 <= i <= m, t >= 0]
• If there is some ordering of states such that ai,j = 0
whenever i < j (i.e., no “non-trivial” loops), then this is
a “linear” (or “left-to-right”) Markov process
• “Homogeneous” Markov model: ai,j is independent of t
6
Simple Markov Models
• Might use a Markov chain to model a sequence
where the symbol in position n depends on the
symbol(s) in position(s) n-1,…n-N.
• For example, if a protein is more likely to have Lys
after a sequence Arg-Cys, this could be encoded as
(a small part of) a 2nd-order Markov model.
• If the probabilities of a given symbol are the same
for all positions in the sequence, and independent of
symbols in other positions, then can use the
“degenerate” 0th-order Markov chain, where the
probability of a given symbol is constant, regardless
of the preceding symbol (or of the position in the
sequence).
7
Hidden Markov Models (HMMs)
• In a Markov chain (or model), the states are themselves visible;
they can be considered the outputs of the system (or
deterministically associated with those outputs).
• However, if each state can emit (generate) any of several
possible outputs (symbols) vk, from an output alphabet O of M
symbols, on a probabilistic basis, then it is not possible (in
general) to determine the sequence of the states themselves;
they are “hidden.”
• Classic example: the “urn” game
– A set of N urns (states), each containing various colored balls (output
symbols – total of M colors available), behind a curtain
– Player 1 selects an urn at random (with Markov assumptions), then picks
a ball at random from that urn and announces its color to player 2
– Player 1 then repeats the above process, a total of T times
– Player 2 must determine sequence of urns selected based on the
sequence of colors announced
8
Additional Parameters for HMMs
• Now, in addition to the transition probabilities, each
state has a prescribed probability distribution to emit
or produce a symbol vk from O:
bi,k = Prob(vk | Si)
• If qt = Si, then the generated output at time t is vk with
probability bi,k.
• So, a HMM is “doubly stochastic” – both the
(hidden) state transition process and the (visible)
output symbol generation process are probabilistic.
9
Bayesian Aspects of HMM Usage
• Given an HMM M, we can relatively easily calculate the
probability of occurrence of an arbitrary output sequence,
s: P(s | M)
• However, we often want to determine the underlying set of
states, transition probabilities, etc. (the model) that is
“most likely” to have produced the output sequence s that
we have observed: P(M | s)
• Bayes’ Formula for sequence “recognition”:
P(M | s) = P(s | M) P(M) / P(s)
• Very hard to find this “absolute” probability: depends on
specific a priori probabilities we are unlikely to know)
• Instead, make it a “discrimination” problem: define a “null
model” N, find P(M | s) / P(N | s)
10
Issues in Using HMMs
• Model architecture/topology
• Training
– Selecting an appropriate training set
– Finding an “optimal” HMM that fits that set
– Must avoid overfitting
• Scoring (for HMM construction, sequence recognition)
– How likely that our sequence was generated by our HMM?
– Versus some “null model” – this converts a very difficult
recognition problem into a tractable discrimination problem
– Score is the “relative likelihood” for our HMM
• Efficiency of evaluation
– Pruning the search: Dynamic programming
– Using log-odds scores
11
A Simple HMM for Some DNA Sequences
Transition
Probabilities (ai,j)
State
(Si)
Emission
Probabilities (bi,k)
12
HMMs and Multiple Alignments
• Can convert a multiple alignment into an HMM:
– Create a node for each column in which most sequences have an
aligned residue
– Columns with many missing letters go to Insert states
– Emission probabilities are computed from the relative
frequencies in the alignment column (for Match states), usually
with aid of a regularizer (to avoid zero-probability cases)
– Emission probabilities for Insert states are taken from
background frequencies
• Can also create a multiple alignment from a linear HMM:
–
–
–
–
Find Viterbi (most likely) path in the HMM for each sequence
Each match state on that path creates a column in the alignment
Ignore or show in lower case letters from insert states
Setting transition probabilities is equivalent to setting gap
penalties in sequence alignment - more an art than a science.
13
Protein Sequences Alignment
ALYW-------GHSFCAGSL
AIFAKHRRSPGERFLCGGIL
AIYRRHRG-GSVTYVCGGSL
AIFAQNRRSSGERFLCGGIL
ALFQGE------RLICGGVL
ALFIDD------MYFCGGSL
AIYHYS------SFQCGGVL
SLNS-------GSHFCGGSL
Three states
1. Match states
2. Insertion states
3. Deletion states
14
Topology of Profile HMM
I0
D1
D2
D3
D4
I1
I2
I3
I4
M1
M2
M3
M4
Match states
Mi -> Mi+1
Mi -> Ii
Mi -> Di+1
Deletion states
Insertion states
Ii -> Mi+1
Ii -> Ii
Ii -> Di+1
Di -> Mi+1
Di -> Ii
Di -> Di+1
15
Regularizers
• For avoiding overfitting to training set
• Substitution matrices
– Identify more likely amino acid substitutions, reflecting
biochemical similarities/differences
– Fixed for all positions in a sequence; one value for a given pair
of amino acides
• Pseudocounts
– For protein sequences, typically based on observed (relative)
frequencies of various amino acids
– “Universal” frequencies or position/type dependent values
• Dirichlet mixtures
– Probabilistic combinations of Dirichlet densities
• Densities over probability distributions: i.e., the probability
density of various distributions of symbols (in a given
sequence position)
• Used to generate data-dependent pseudocounts
16
Algorithms for HMM Tasks (1)
• 3 Major Problems:
– Determine HMM parameters (given some HMM topology
and a training set)
– Calculate (relative) likelihood of a given output sequence
through a given HMM
– Find the optimal (most likely) path through a given HMM
for a specific output sequence, and its (relative) likelihood
• Forward/Backward Algorithm
– Used for determining the parameters of an HMM from
training set data
– Calculates probability of going forward to a given state
(from initial state), and of generating final model state
(member of training set) from that state
– Iteratively adjusts the model parameters
17
Algorithms for HMM Tasks (2)
• Baum-Welch (Expectation-maximization, EM)Algorithm
–
–
–
–
–
Often used to determine the HMM parameters
Can also determine most likely path for a (set of) output sequence(s)
Add up probabilities over all possible paths
Then re-update parameters and iterate
Cannot guarantee global optimum; very expensive
• Forward Algorithm
– Calculates probability of a particular output sequence given the HMM
– Straightforward summation of product of (partial path) probabilities
• Viterbi Algorithm
– Classical dynamic programming algorithm
– Choose “best” path (at each point), based on log-odds scores
– Save results of “subsubproblems” and re-use them as part of higher-level
evaluations
– More efficient than Baum-Welch
18
HMMs for Protein/Gene Sequence Analysis
• Using any of various means, identify a set of
related sequences with conserved regions
• Make 1st-order Markov assumptions: transitions
independent of sequence history and sequence
content (other than at the substitution site itself)
• Construct a HMM based on the set of sequences
• Use this HMM to search for additional members
of this family, possibly performing alignments
– Search by comparing fit to HMM against fit to some
null model
• For phylogenetic trees, also concerned with the
length of the paths involved and with shared
intermediate states (sequences)
19
Very Simple Viterbi
DNA Sequence Alignment – with value of +1 for match, -1
for mismatch, -2 for “gap”; remember “best” at each step
A
T
G
A
A
C
T
A
0 min min min min
min 1
-1 -3 -5
min
min
min
A
T
G
A
A C
T
A
0 min min min min
min 1 -1 -3 -5
min -1 0
0 -2
min -3 -2 -1 -1
min
min = -1099
A
T
G
A
A
C
T
A
0 min min min min
min 1
-1
-3
-5
min -1
0
0
-2
min
min
A
T
G
A
A C
T
A
min min min min
1 -1 -3 -5
-1 0
0 -2
-3 -2 -1 -1
-5 -4 -3 0
0
min
min
min
min
20
Very Simple Viterbi: Traceback
A
T
G
A
0
min
min
min
min
A
T
C
A
min min min min
-5
-3
-1
1
-2
0
0
-1
-1
-1
-2
-3
0
-3
-4
-5
21
Onward to Mathematica!
22