Biology 162: Computational Genetics

Transcript Biology 162: Computational Genetics

Profile HMMs
Biology 162 Computational
Genetics
Todd Vision
16 Sep 2004
Outline
• Profile HMMs generate MSAs
• States and transitions for
– Matches, Insertions, Deletions, Silent and Flanking states
• Statistics
– Null model, E values
• Training
– Model construction, Weighting training sequences and
including pseudocounts (which have a Bayesian
interpretation)
• Existing tools
– Interpro, including Pfam and HMMER
Globins
Helix
HBA_HUMAN
HBB_HUMAN
MYG_PHYCA
LGB2_LUPLU
Consensus
1111111222222222222222222222
3333333333333
-DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKLGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKLKHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKHLK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG.l.t
... .kHg.kV. a. .....
.
..l. L. .H. K.
Hidden Markov models
• Observed sequence of symbols
• Hidden sequence of underlying states
• Transition probabilities govern
transitions among states
• Emission probabilities govern the
likelihood of observing a symbol in a
particular state
Profile HMMs
• Use scores rather than emission
probabilities directly
L
ei (x i )
S   log
qx i
i1
where ei (x i ) is the emission prob of symbol
x i a at posi
qx i is the prob ofx i under null model
L is length of profile
A PSSM as a simple HMM
begin
M1
…
Mi
Mi+1
…
ML
With emission probabilities unique to each match state
end
But what about gaps?
• Ignore them (BLOCKS database)
OR
• Model them
– Insert states have background emission
probabilities
Ii
begin
M1
…
Mi
…
ML
end
Gap scores
• For an insert of length k with
background emission probabilities, we
have affine gap scores
If eI i (a)  qa
then log
Sgap
eI i (a)
0
qa
 logaM i I i  (k 1)logaI i I i  logaM i I j
Length distribution of inserts
• Geometric distribution
P(k)  a a
 a (1 aII )
k1
II
IM

aII
k1
II
I
aMI
aIM
Which columns are match
states?
Helix
HBA_HUMAN
HBB_HUMAN
MYG_PHYCA
LGB2_LUPLU
Consensus
1111111222222222222222222222
3333333333333
-DLS-----HGSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKLGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL---D--NLKGTFATLSELHCDKLKHLKTEAEMKASEDLKKHGVTVLTALGAILKK----K-GHHEAELKPLAQSHATKHLK-GTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG.l.t
... .kHg.kV. a. .....
.
..l. L. .H. K.
• Options
– Assign columns to be match states by eye
– Heuristic i.e. no more than 50% gaps per column
– Maximum a posteriori (MAP) model construction
• O(L2) dynamic programming algorithm exists to find
model that optimizes score on training data
Two ways to handle deletions
• Transitions between match states
begin
M1
…
Mi
…
ML
end
• Silent deletion states (no emission)
begin
D1
D2
D3
M1
M2
M3
end
Profile HMM
Flanking states
• Many sites in a sequence may be
assigned to 'flanking' states (N, C, or J)
• Transitions should force one or more
match states to be traversed at least
once
Local or global alignment?
• Are transitions allowed
– From start to internal match?
– From internal match to end?
• Are there states that can emit sequences
before and after the profile?
• Do transitions allow the profile to be
repeated?
• In HMMs
– Global/local behavior governed by model not
algorithm
– Behavior may differ w.r.t. the profile and the
sequence
Null model
S
N
P(seq | HMM )
Sbit  log 2
P(seq | null)

T
Extreme value problems
• How to convert Sbit to an expect value?
• Since alignment is not truly local, theory used
for BLAST does not hold here
• Solutions (both available in HMMER)
– Conservative approximation valid for any profile
– Empirically fit extreme value distribution using
simulated sequences
– Must be done once for every profile HMM
Why not always use full
model?
• The sum of probabilities is constrained
to be one
• Spreading probability among many
paths decreases power to discriminate
among them
• You should always choose the most
restrictive model (fewest transitions)
consistent with your purpose
Which algorithm to use?
• Three choices
– Viterbi: maximum likelihood path
– Forward: sum of probabilities of all possible paths
– Forward-backward: prob of each state at each pos
• For database search
– Query sequence against a database of profile
HMMs
– Profile HMM against a database of sequences?
• For alignment
– Adding new sequence(s) to an existing alignment
Training a profile HMM
• Weighting training sequences
– We saw the same problem when scoring multiple
alignments
– Same approaches are used for profile HMMs
• Estimating transition probabilities
– Taken care of by MAP model construction
• Estimating emission probabilities
– We will assume the alignment is correct
– Only issue is how to add pseudocounts
Better pseudocounts
• Laplace's Rule
– Ignores background frequencies of
residues
• Background frequency pseudocounts
–
eM i (a) 
c ja  Aqa
c
ja 
 A
a 
where c ja are actual counts
qa are background frequencies
A is a weight (eg 20)
Pseudocounts as Bayesian
priors
• Bayes' rule
Likelihood
Prior
Posterior
P(D | M)P(M)
P(M | D) 
 P(D | M)P(M)
for all M
where
M  model
D  data
Dirichlet mixture pseudocounts
• Background probabilities are not uniform
throughout the protein
– eg exposed loops (hydrophilic residues abundant)
vs. buried core (small side chains abundant)
• Different sets of pseudocount priors (Dirichlet
distributions) for each environment
• Pseudocounts for Ii are determined by a
mixture of Dirichlet distributions fit to position i
Evolutionary pseudocounts
• Related to phylogenetic methods we will see
later
– Calculate probability of each residue having been
the common ancestor of the residues in a column
– Calculate probability of each residue as a
descendent
– Use these probabilities as priors with appropriate
weighting
• Requires use of a position-independent
scoring matrix (eg PAM)
Queries vs. subjects
• Two directions of search are possible
– Sequence query against database of profile HMMs
– Profile HMM against a database of sequences
• Bit scores will be the same regardless
• But E-values will differ
– Search space (ie number of subjects in database) can differ
considerably
– It is usually more sensitive to search a database of profile
HMMs
Interpro
• Regular Expressions
– PROSITE
• PSSMs, other motifs
– PROSITE, PRINTS, PRODOM
• Profile HMMs
–
–
–
–
–
Pfam
SMART
TIGRFAMs
PIR SuperFamily
SUPERFAMILY
Interpro v8
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Pfam
• A profile HMM database
– Based on Swissprot and TREMBL
• Current version (v15) has 7503 families.
– ~75% of all new protein sequences match an
existing Pfam profile
• Profiles constructed semi-automatically
–
–
–
–
New families identified
Seed alignment manually optimized
Profile HMM constructed
All matching sequences aligned to HMM
HMMER
• Used in construction of Pfam
– Can build a profile (with MAP algorithm)
– Can search a sequence against a profile
and vice versa (i.e. with forward algorithm)
– Can add new sequences to an alignment
(via Viterbi)
– Uses Plan 7 profiles
• User sets the local/global behavior
HMMER2.0 [2.3.1]
NAME fn3
ACC
PF00041.8
DESC Fibronectin type III domain
LENG 84
ALPH Amino
RF
no
CS
no
MAP
yes
COM
hmmbuild -F HMM_ls.ann SEED.ann
COM
hmmcalibrate --seed 0 HMM_ls.ann
NSEQ 108
DATE Mon Jul 26 14:10:07 2004
CKSUM 1153
GA
8.0 -1.0
TC
8.0 0.0
NC
7.9 7.9
XT
-8455
-4 -1000 -1000 -8455
-4 -8455
NULT
-4 -8455
NULE
595 -1558
85
338
-294
453 -1158
45
531
201
384 -1998
-644
EVD
-45.006527
0.260185
HMM
A
C
D
E
F
G
H
R
S
T
V
W
Y
m->m
m->i
m->d
i->m
i->i
d->m
d->d
-13
* -6758
1 -1698 -4236 -5400
-853 -4220 -2885 -1258
-4774 -1187 -1320
-120 -4666 -1510
1
-150
-501
232
46
-382
399
104
95
358
118
-368
-296
-251
-144 -3402 -12951
-19 -6284
-701 -1378
2
-613 -5389
1601
-868 -5707
553 -3558
-2014
1940
-582 -1300 -5575 -1474
3
-149
-500
233
43
-381
399
106
96
359
117
-369
-294
-249
-4
197
249
902
-1085
-142
-21
-313
I
K
L
M
N
P
Q
b->m
m->e
-930
-2438
408
-3428
-4769
3631
-1835
-628
211
-461
-722
274
395
44
-13
-5456
*
-3139
-894
-4479
-526
1872
-1543
-626
210
-466
-720
275
394
45
Summary
• Profile HMMs generate MSAs
• States and transitions for
– Matches, Insertions (which can model affine
gaps), Deletions, which allow local alignment,
Silent states and flanking states
• Statistics
– Scored relative to a null model and E values must
be determined empirically
• Training
– MAP model construction, Training sequence
weighting and pseudocounts (which have a
Bayesian interpretation)
• Existing tools
– Interpro, including Pfam and HMMER
Assignment
• Look over study guide
– posted on Blackboard
• Turn in lab/problem set on Tuesday
• Midterm on Thursday

Biology 162: Computational Genetics

Transcript Biology 162: Computational Genetics

Directory