Hidden Markov Model

Download Report

Transcript Hidden Markov Model

Hidden Markov Model
Ed Anderson and Sasha Tkachev
Who Was Markov?
 Graduate of Saint Petersburg University (1878),
where he began a professor in 1886
 Applied the method of continued fractions,
pioneered by his teacher Pafnuty Chebyshev, to
probability theory
 He proved the central limit theorem under fairly
general assumptions
 Most remembered for his study of Markov chains,
sequences of random variables in which the
future variable is determined by the present
variable but is independent of the way in which
 Andrei A Markov
the present state arose from its predecessors. This
work launched the theory of stochastic processes  Born: 14 June 1856
in Ryazan, Russia
 In 1923 Norbert Weiner became the first to treat
 Died: 20 July 1922
rigorously a continuous Markov process. The
in Petrograd, Russia
foundation of a general theory was provided
during the 1930s by Andrei Kolmogorov.
Excerpted from: http://www-groups.dcs.st-and.ac.uk/~history/Mathematicians/Markov.html
What is the Hidden Markov Model?
Clipped from http://www.nist.gov/dads/HTML/hiddenMarkovModel.html
What Makes HMM Useful?
 Efficiency:
 The algorithms are simple enough to be performant for realtime speech recognition.
 Speed is advantageous when dealing with large biological data
sets
 Strong Theoretical Basis
 Probability distribution must sum to 1.
 Scores are not influenced by ad-hoc criteria.
 Scores may be compared across different experiments of
varying size and complexity
 Well suited for analyzing noisy, time-phased or
sequentially connected events.
What are HMM’s Limitations?
Model building is not so easy
“Since HMM training algorithms are local
optimizers, it pays to build HMMs on pre-aligned
data whenever possible… the parameter space may
be complex with may spurious local optima than can
trap a training algorithm.”1
Distance between related states must be
constant
A disadvantage when analyzing distant and
arbitrarily spaced items:
Amino acids in folded proteins
RNA base pairs
1Eddy,
S.R., Profile hidden Markov models, Bioinformatics Review, 1998, Vol. 14,
no. 9 1998, pg. 757
A Concrete Example
 Can you guess the weather based on a person’s activity?
 Use the Forward algorithm to calculate the probabilities.
(Π) Initial State Probabilities
(A) Transition Probabilities
Tomorrow
Rain
Sun
0.7
0.3
0.4
0.6
Today
Rain
Sun
Typical Weather
Rain
Sun
0.6
0.4
(B) Emission Probabilities
IF
Walk
0.1
0.6
Rain
Sun
Observation:
Hidden States
Sun-Sun-Sun
Sun-Sun-Rain
Sun-Rain-Sun
Sun-Rain-Rain
Rain-Sun-Sun
Rain-Sun-Rain
Rain-Rain-Sun
Rain-Rain-Rain
Then
Shop
0.4
0.3
0.4
0.4
0.4
0.4
0.6
0.6
0.6
0.6
Clean
3
Shop
2
Walk
1
p(w eather n |
w eather n-1)
Clean
0.5
0.1
p(w eather n |
w eather n-1)
P(activity |
w eather)
.24
.06
0.6
0.6
0.6
0.6
0.1
0.1
0.1
0.1
0.6
0.6
0.4
0.4
0.3
0.3
0.7
0.7
p(w eather n |
w eather n-1)
P(activity |
w eather)
.18
.16
0.3
0.3
0.4
0.4
0.3
0.3
0.4
0.4
0.6
0.4
0.3
0.7
0.6
0.4
0.3
0.7
P(activity |
w eather)
.20
.35
Probability
0.1
0.5
0.1
0.5
0.1
0.5
0.1
0.5
Example adapted from http://en.wikipedia.org/wiki/Viterbi_algorithm
0.002592
0.008640 False Maximum
0.001152
0.013440 True Maximum
0.000324
0.001080
0.000504
0.005880
How to Avoid False Optima?
 Is it necessary to calculate every possible path?
 The Viterbi algorithm can help.
Example from http://www.telecom.tuc.gr/~ntsourak/demo_viterbi.htm
HMM In Speech Recognition
 Handling a single word; evaluating each HMM according to the input,
using the Viterbi Search
 Every senone gets a HMM:
ONE
W
AH
TWO
T
UW
TH
R
THREE
5-state HMM
Adapted from Shir, O. M., Speech Recognition Seminar, 10/15/03
Leiden Institute of Advanced Computer Science
N
IY
HMM In Speech Recognition
State with best path-score
State with path-score < best
State without a valid path-score
Pj (t) = max [Pi (t-1) aij bj (t)]
i
State transition probability, i to j
Score for state j, given the input at time t
Total path-score ending up at state j at time t
time
Taken from Shir, O. M., Speech Recognition Seminar, 10/15/03
Leiden Institute of Advanced Computer Science
HMM in Bioinformatics
Sequence profiling
Gene finding
Protein secondary structure prediction
Radiation hybrid mapping
Genetic linkage mapping
Phylogenetic analysis
HMM in Sequence Profiling
 Review – Lecture 7 Highlights
 Emission probabilities and transition probabilities
HMM in Sequence Profiling
 Log Odds scores are comparable across different length
sequences
Taken from lecture 7 slides, apparently from Krogh, “Computational Methods in molecular biology,
pages 45-63, Elsevier, 1998.
Why HMM for Sequence
Analysis?
 Position-specific scoring methods make intuitive sense.
 BLAST and FASTA use pair-wise alignment as opposed
to profile scoring
 Profile methods have historically used ad hoc scoring
systems.
 HMM gap penalties a grounded in probability theory.
 HMMs provide a coherent, probabilistic model. 2
(2) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pps. 755-763
Profile HMM Software
 ‘Motif’ models have strings of match states separated by a small
number of insert states. ‘Profile’ models have insert and delete
states associated with each match state.. 3
4
(3) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pps. 755-763
(4) Ibid., Figure 3 on page 758.
HMMER Architecture
Both local and global profile alignment.
5
(5) Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
How Does it Work?
 Generative models work by recursive enumeration of
possible sequences from a finite set of rules.
 The Plan 7 architecture explicitly models the entire
target sequence, regardless of how much of that
sequence matches the main model.
 All alignments to a Plan 7 model are “global”
alignments!
(6) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
6
HMMR Programs 7









hmmalign - align sequences to an HMM profile
hmmbuild - build a profile HMM from an alignment
hmmcalibrate - calibrate HMM search statistics
hmmconvert - convert between profile HMM file formats
hmmemit - generate sequences from a profile HMM
hmmfetch - retrieve an HMM from an HMM database
hmmindex - create a binary SSI index for an HMM database
hmmpfam - search one or more sequences against an HMM database
hmmsearch - search a sequence database with a profile HMM
 HMMER’s native alignment format is called Stockholm format, the format
of the Pfam protein database that allows extensive markup and annotation.
 HMMER can read alignments in several common formats, including the output of
the CLUSTAL family of programs, Wisconsin/GCG MSF format, the input format
for the PHYLIP phylogenetic analysis programs, and “alighed FASTA” format
(where the sequences in a FASTA file contain gap symbols, so that they are all the
same length).
(7) Excerpted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Building a profile with hmmbuild 8
> hmmbuild globin.hmm globins50.msf
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Alignment file:
globins50.msf
File format:
MSF
Search algorithm configuration:
Multiple domain (hmmls)
Model construction strategy:
MAP (gapmax hint: 0.50)
Null model used:
(default)
Prior used:
(default)
Sequence weighting method:
G/S/C tree weights
New HMM file:
globin.hmm
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Alignment: #1
Number of sequences: 50
Number of columns: 308
Constructed a profile
Average score: 189.04
Minimum score: -17.62
Maximum score: 234.09
Std. deviation: 53.18
HMM (length 143)
bits
bits
bits
bits
(8) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Calibrating the profile
9
> hmmcalibrate globin.hmm
hmmcalibrate -- calibrate HMM search statistics
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM file: globin.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples: 5000
random seed: 1051632537
histogram(s) saved to: [not saved]
POSIX threads: 4
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM : globins50
mu : -39.897396
lambda : 0.226086
max : -9.567000
(9) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB
10
 Header Section

hmmsearch globin.hmm Artemia.fa
hmmsearch - search a sequence database with a profile HMM
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - HMM file:
globin.hmm [globins50]
Sequence database:
Artemia.fa
per-sequence score cutoff:
[none]
per-domain score cutoff:
[none]
per-sequence Eval cutoff:
<= 10
per-domain Eval cutoff:
[none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Query HMM: globins50
Accession: [none]
Description: [none]
[HMM has been calibrated; E-values are empirical estimates]
(10) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB (cont.)
11
 Sequence Top Hits Section
(11) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB (cont.)
12
 Alignment Output Section
(12) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Searching the sequence DB (cont.)
13
 Score Histogram Section
(13) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Local versus Global Alignment
14
 HMMER does not do local (Smith/Waterman) and global
(Needleman/Wunsch) style alignments in the same way that most
computational biology analysis programs do it.
 To HMMER, whether local or global alignments are allowed is part
of the model, rather than being accomplished by running a different
algorithm.
 You must choose what kind of alignments you want to allow when
you build the model
 By default, hmmbuild builds models which allow alignments that
are global with respect to the HMM, local with respect to the
sequence, and allows multiple domains to hit per sequence.
(13) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.
Experimental Observations
 My tests on the clipped SH3 Domain sequence in the Krogh paper.15
 The insert gap penalty was small but significant.
 The number of inserts had a linear, negative affect on the score.
 Relative to the overall score, the inserts and deletes had a small effect.
Avg Log Odds by Domain
8.63
-1.46
14.77
Insert Region Log Odds Correlated to Number
of Inserts
-1.34
-1.36 0
2
4
6
8
Log Odds
-1.38
-1.40
-1.42
-1.44
-1.46
-1.48
-1.50
-1.52
Total Inserts
(15) Krogh, “Computational Methods in molecular biology, pages 45-63, Elsevier, 1998.