HMM-Lec9-092104

Download Report

Transcript HMM-Lec9-092104

Lecture 9
Hidden Markov Models
BioE 480
Sept 21, 2004
– Delete states (circle): silent or null state. Do not match any
residues, they are there so it is possible to jump over one or more
columns:
• For modeling when just a few of the sequences have a “-” at a
position.
• Example:
Pseudo-counts
• Dangerous to estimate a probability distribution from just a few
observed amino acids.
– If there are two sequences, with Leu at a position:
• P for Leu =1, but P = 0 for all other residues at this position
• But we know that often Val substitutes Leu.
• The probability of the whole sequence are easily become 0 if a
single Leu is substituted by a Val.
• Or , the log-odds is minus infinity.
• How to avoid “over-fitting” (strong conclusions drawn from very little
evidence)?
• Use pseudocounts:
– Pretend to have more counts than those from the data.
– A. Add 1 to all the counts:
• Leu: 3/23, other a.a.: 1/23
• Adding 1 to all counts is as assuming a priori all a.a. are equally likely.
• Another approach: use background composition as pseudocounts.
Searching a database with HMM
• Know how to calculate the probability of a sequence in the alignment:
– multiplying all the probabilities (or adding the log-odds scores) in the model
along the path followed by that sequence.
• For sequences not in the alignment, we do not know the path.
– Find a path through the model where the new sequence fits well:
– we can then score it as before.
• Need to “align” the sequence to the model:
– Assigning states to each residue in the sequence.
– A given sequence can have many alignments.
• Eg. A protein has a.a. as: A1, A2, A3, …
HMM states as: M1, M2, M3, … for match states,
I1, I2, I3, … for insertion states,
• An alignment:
– A1 matches M1, A2 and A3 match I1, A4 matches M2, A5 matches
M6 (after passing through three delete states).
• For each alignment, we can calculate the probability of the sequence,
or the log-odds score.
– Can find the best alignment.
• A dynamic programming method can be used:
– Viterbi algorithm.
– It also gives the probability of the sequence for that alignment, thus
we have a score.
• The log-odds score found can be used to search databases for members
of the same family.
Model estimation
• Profile HMMs can be viewed as a generalization of weight matrix to
incorporate insertions and deletions.
– We can also estimate the model: determining all the probability
parameters from unaligned sequences.
– A multiple alignment will also be produced.
• Iteratively done:
– Start with a model: with random probabilities, or reasonable
alignment of some of the sequences.
– When all the sequences are aligned, we can use the alignment to
improve the probabilities in the model, leading to a slightly
different alignment.
– Repeat this process and improve the probabilities again, until
convergence: things no longer improve.
– Final model is a multiple alignment.
• Problems:
– How do we choose the length of the model?
This affects the number of inserts in the final
alignment.
– Iterative procedure can converge to local
minimum: No guarantee that it will find the
optimal multiple alignment.
• An example of a HMM model: the self-loops on the diamond insertion states. The
self-loops allow deletions of any length to fit the model, regardless of the length
of other sequences in the family
A possible hidden Markov model for the protein ACCY. The protein is represented as a
sequence of probabilities. The numbers in the boxes show the probability that an amino
acid occurs in a particular state, and the numbers next to the directed arcs show
probabilities which connect the states. The probability of ACCY is shown as a
highlighted path through the model.
• Scoring: Any sequence can be represented by a path through the model. The
probability of any sequence, given the model, is computed by multiplying the
emission and transition probabilities along the path.
• For ACCY, only the probability of the emitted amino acid is given.
– The probability of A being emitted in position 1 is 0.3, and the probability of
C being emitted in position 2 is 0.6. The probability of ACCY along this path
is :
4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 *.01 * 1 = 1.76x10-6.
• Simplification by transforming probabilities to logs so that addition can replace
multiplication.
– The resulting number is the raw score of a sequence, given the HMM.
– the score of ACCY along the path shown is:
loge(.4) + loge(.3) + loge(.46) + loge(.6) + loge(.97) + loge(.5) + loge(.015) +
loge(.73) +loge(.01) + loge(1) = -13.25
• The calculation is easy if the exact state path is known.
• In a real model, many different state paths through a model can
generate the same sequence.
– Therefore, the correct probability of a sequence is the sum of
probabilities over all of the possible state paths.
– Unfortunately, a brute force calculation of this problem is
computationally unfeasible, except in the case of very short
sequences.
– Two good alternatives are:
• to calculate the sum over all paths inductively using the
forward algorithm, or
• to calculate the most probable path through the model using
the Viterbi algorithm.
• Another HMM model:
I1
I0
M1
I2
M2
I3
M3
• The Insert, Match, and Delete states can be labeled with their position number in the
model, M1, D1 etc.
– Because the number of insertion states is greater than the number of match or
delete states, there is an extra insertion state at the beginning of the model,
labeled I0.
– several state paths through the model are possible for this sequence.
Viterbi Algorithm
• The most likely path through the model is computed with the Viterbi algorithm.
– The algorithm employs a matrix.
• The rows of the matrix are indexed by the states in the model,
• The columns indexed by the sequence.
• Deletion states are not shown, since, by definition, they have a zero
probability of emitting an amino acid.
– The elements of the matrix are initialized to zero and then computed.
– 1. The probability that the amino acid A was generated by state I0 is
computed and entered as the first element of the matrix.
– 2. The probabilities that C is emitted in state M1 (multiplied by the
probability of the most likely transition to state M1 from state I0) and in
state I1 (multiplied by the most likely transition to state I1 from state I0)
are entered into the matrix element indexed by C and I1/M1.
– 3. The maximum probability, max(I1, M1), is calculated.
– 4. A pointer is set from the winner back to state I0.
– 5. Steps 2-4 are repeated until the matrix is filled.
Prob(A in state I0) = 0.4*0.3=0.12
Prob(C in state I1) = 0.05*0.06*0.5 = .015
Prob(C in state M1) = 0.46*0.01 = 0.005
Prob(C in state M2) = 0.46*0.5 = 0.23
Prob(Y in state I3) = 0.015*0.73*0.01 = .0001
Prob(Y in state M3) = 0.97*0.23 = 0.22
• The most likely path through the model can now be found by
following the back-pointers.
• Once the most probable path through the model is known, the
probability of a sequence given the model can be computed by
multiplying all probabilities along the path.
• In dishonest casino case, based on the sequences of rolls the Viterbi
algorithm can recover the state sequence (which die casino was using)
quite well.
• The spirit is similar to what we calculated last time: what’s the
possibility the casino used a loaded die what three “6” showed up.
Projects
• Scoring matrix for alignment
(insertion/deletion)
• HMM (contact prediction)
• Phylogenetic tree
• Genetic circuit (regulation based expression
prediction)
• Microarray (clustering analysis)