52930 Protein informatics 09


Multiple sequence alignment (msa)
• A) define characters for phylogenetic analysis
• B) search for additional family members
Figure: example alignment of the sequences NYLS, NKYLS, NFS and NFLS (SeqA-SeqD). All four share the N and S at the ends; one column is a K insertion (+K) in NKYLS, one column is a Y/F substitution, and one column shows an L deletion (-L) in NFS.
Complexity of performing msa
• Sum of pairs (SP) score (see the sketch below)
• Extension of sequence pair alignment by dynamic programming to N sequences
  – O(L^N), where L is the sequence length and N is the number of sequences
• The MSA algorithm searches a subspace bounded by
  – a heuristic multiple alignment (suboptimal)
  – the pairwise optimal alignments
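As a concrete illustration of the SP score, here is a minimal Python sketch that scores a toy alignment consistent with the example above; the +1/-1/-2 match/mismatch/gap values and the treatment of gap-gap pairs are illustrative assumptions, not values from the lecture.

# Minimal sketch: sum-of-pairs (SP) score of a multiple alignment.
# Scoring values (match/mismatch/gap) are illustrative assumptions.
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(a, b):
    if a == '-' and b == '-':
        return 0                      # gap-gap pairs are ignored here
    if a == '-' or b == '-':
        return GAP
    return MATCH if a == b else MISMATCH

def sp_score(alignment):
    """Sum pair scores over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

msa = ["N-YLS", "NKYLS", "N-F-S", "N-FLS"]   # toy alignment of the slide's sequences
print(sp_score(msa))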
Figure: pairwise projection of the alignment search space onto sequences A and B, showing the pairwise optimal alignment and a greedy multiple alignment. The global msa optimum (for sequences A, B, C, …) is probably found in the area in between.
Motivation: existing alignment methods
Heuristic alignment algorithms
• Progressive pairwise alignment (software: ClustalW, MUSCLE, ProbCons)
  – strengths: considers evolutionary relatedness
  – drawbacks: assumes a homogeneous process across sites
  – other: requires a guide tree
• Profile hidden Markov models (software: SAM, HMMER)
  – strengths: site-specific emission and indel probabilities
  – drawbacks: assume a star-like tree, problems with long gaps
  – other: require a training alignment
Progressive alignment
• 1. derive a guide tree from pairwise
sequence distances
• 2. align the closest (groups of) sequences
• Pluses: works with many sequences
• Minuses: errors made early on are frozen
in the final alignment
• Programs: Clustal, Pileup, T-Coffee
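Progressive methods repeatedly apply pairwise dynamic programming along the guide tree. The sketch below shows only that pairwise building block, a basic Needleman-Wunsch global alignment; the +1/-1/-2 scoring is an illustrative assumption, and real programs such as ClustalW use substitution matrices and affine gap penalties.

# Minimal sketch of global pairwise alignment by dynamic programming,
# the building block that progressive methods apply along a guide tree.
# Scoring values are illustrative assumptions.
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    n, m = len(s), len(t)
    # F[i][j] = best score aligning s[:i] with t[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # match/mismatch (diagonal)
                          F[i - 1][j] + gap,       # gap in t (vertical)
                          F[i][j - 1] + gap)       # gap in s (horizontal)
    # Traceback to recover one optimal alignment
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return ''.join(reversed(a)), ''.join(reversed(b)), F[n][m]

print(needleman_wunsch("TACAGTA", "TACGTC"))   # sequences from the worked example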
Motivation: existing alignment methods
Progressive pairwise alignment
Figure: two panels. "Evolution" (time axis) shows an ancestral TACAGTA diverging through intermediates (TACGTA, TACCTA) into the present-day sequences TACGTC, TATA and CACCTA. "Alignment" (progress axis) shows the order in which progressive alignment joins the same present-day sequences.
Motivation: existing alignment methods
Progressive pairwise alignment
Figure: worked example in three steps.
1. TACAGTA and TACGTC are aligned by dynamic programming (a diagonal move is a match, vertical and horizontal moves insert gaps), giving TACAGTA over TAC-GTC.
2. TATA and CACCTA are aligned in the same way.
3. The two pairwise alignments are then aligned to each other, giving the final multiple alignment of TACAGTA, TACGTC, TATA and CACCTA.
Motivation: existing alignment methods
Profile HMMs
Figure: the sequence CAPNTAGNHPCNGFGTCVPGHGDGSAANNVF shown as a stack of aligned family sequences, from which a profile HMM is built.
Iterative methods
• Gotoh
– inner loop uses progressive alignment
– outer loop recomputes distances and guide
tree based on the msa obtained
• DIALIGN
– Identify high-scoring ungapped segment pairs
– Build up msa using consistent segment pairs
– Greedy algorithm DIALIGN2; exact DIALIGN3
Motifs
MMCOL10A1_1.483
Ca1x_Chick
S15435
CA18_MOUSE.597
Ca28_Human
MM37222_1.98
COLE_LEPMA.264
HP27_TAMAS.72
S19018
C1qb_Mouse
C1qb_Human
Cerb_Human
2.HS27109_1
- several proteins are grouped together by similarity
searches
- they share a conserved motif
- motif is stringent enough to retrieve the family members
from the complete protein database
SGSAIMELTENDQVWLQLPNA-ESNGLYSSEYVHSSFSGFLVAPM------
SGSAVIDLMENDQVWLQLPNS-ESNGLYSSEYVHSSFSGFLFAQI------
SGSAVLLLRPGDRVFLQMPSE-QAAGLYAGQYVHSSFSGYLLYPM------
SGSAVLLLRPGDQVFLQNPFE-QAAGLYAGQYVHSSFSGYLLYPM------
SGGAVLQLRPNDQVWVQIPSD-QANGLYSTEYIHSSFSGFLLCPT------
SGSVLLHLEVGDQVWLQVYGDGDHNGLYADNVNDSTFTGFLLYHDTN----
SNLALLHLTDGDQVWLETLR--DWNGXYSSSEDDSTFSGFLLYPDTKKPTAM
SGTAILQLGMEDRVWLENKL--SQTDLERG-TVQAVFSGFLIHEN------
AGGTVLQLRRGDEVWIEKDP--AKGRIYQGTEADSIFSGFLIFPS------
TGGVVLKLEQEEVVHLQATD---KNSLLGIEGANSIFTGFLLFPD------
TGGMVLKLEQGENVFLQATD---KNSLLGMEGANSIFSGFLLFPD------
SNGVLIQMEKGDRAYLKLER---GN-LMGG-WKYSTFSGFLVFPL------
TGDALLELNYGQEVWLRLAK----GTIPAKFPPVTTFSGYLLYRT------
(Clustal-style conservation marks, e.g. * *:*, indicate the most conserved columns.)
Local msa
• Profile analysis takes a highly conserved portion of the alignment and produces a type of scoring matrix called a profile. New sequences can be aligned to the profile using dynamic programming.
• Block analysis scans a global msa for
ungapped regions, called blocks, and these
blocks are then used in sequence alignments.
• Pattern-searching or statistical methods find
recurrent regions of sequence similarity in a set
of initially unaligned sequences.
TEIRESIAS
• Finds all patterns with minimum support in input
set of unaligned sequences
– Maximum spacer length
– For example L…G…………..A….L…L
• Enumerative algorithm
– Build up longer patterns by merging overlapping short
patterns
– For example A….L + L…L → A….L…L
• Fewer instances by chance when more positions are
specified
• Pluses: exact
• Minuses: biological significance of the patterns?
Expectation maximization (EM)
• Initial guess of motif location in each sequence
• E-step: Estimate frequencies of amino acids in
each motif column. Columns not in motif provide
background frequencies. For each possible site
location, calculate the probability that the site
starts there.
• M-step: use the site probabilities as weights to
provide a new table of expected values for base
counts for each of the site positions.
• Repeat E-step and M-step until convergence.
EM procedure
• Let the observed variables be known as y and the latent variables as z. Together, y and z form the complete data. Assume that p is a joint model of the complete data with parameters θ: p(y, z | θ). An EM algorithm iteratively improves an initial estimate θ0 and constructs new estimates θ1 through θN. An individual re-estimation step, deriving θn+1 from θn, takes the following form (shown for the discrete case; the continuous case is similar):

  θn+1 = argmaxθ Σz p(z | y, θn) log p(y, z | θ)   (Eqn 1)

• In other words, θn+1 is the value that maximizes (M) the expectation (E) of the complete-data log-likelihood with respect to the conditional distribution of the latent data under the previous parameter value. This expectation is usually denoted Q(θ):

  Q(θ) = E z|y,θn [ log p(y, z | θ) ] = Σz p(z | y, θn) log p(y, z | θ)   (Eqn 2)

• Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
• It can be shown that an EM iteration does not decrease the observed-data likelihood function, and that the only stationary points of the iteration are the stationary points of the observed-data likelihood function. In practice, this means that an EM algorithm will converge to a local maximum of the observed-data likelihood function.
• EM is particularly useful when maximum likelihood estimation of a complete data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.
• "Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models. Another example is the EM algorithm for fitting a mixture density model.
• An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step rather than maximum likelihood.
• There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient, or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
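The classic example mentioned above, a finite mixture of Gaussians, makes the E- and M-steps concrete. Below is a minimal sketch for a two-component 1-D mixture; the synthetic data, starting values and iteration count are illustrative assumptions, not material from the lecture.

# Minimal sketch: EM for a two-component 1-D Gaussian mixture.
# Data and initial parameter guesses are illustrative assumptions.
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]

# Initial guess theta_0: mixing weight, means, standard deviations
w, mu1, mu2, s1, s2 = 0.5, 1.0, 4.0, 1.0, 1.0

for _ in range(50):
    # E-step: posterior probability that each point came from component 1
    r = []
    for x in data:
        p1 = w * normal_pdf(x, mu1, s1)
        p2 = (1 - w) * normal_pdf(x, mu2, s2)
        r.append(p1 / (p1 + p2))
    # M-step: closed-form weighted maximum likelihood updates
    n1 = sum(r)
    n2 = len(data) - n1
    w = n1 / len(data)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
    s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1)
    s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2)

print(round(w, 2), round(mu1, 2), round(mu2, 2))   # converges near the true 0.5, 0, 5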
Exercise
Seq1   C CAG A
Seq2   G TTA A
Seq3   G TAC C
Seq4   T TAT T
Seq5   C AGA T
Seq6   T TTT G
Seq7   A TAC T
Seq8   C TAT G
Seq9   A GCT C
Seq10  G TAG A
• Analyze the following ten
DNA sequences by the
expectation maximization
algorithm. Assume that
the background base
frequencies are each
0.25 and that the middle
three positions are a
motif. The size of the
motif is a guess that is
based on a molecular
model. The alignment of
the sequences is also a
guess.
(a) Calculate the observed
frequency of each base at each of
the three middle positions
(The ten sequences are as listed above; the middle three positions are the assumed motif.)
Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4
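A small Python sketch of this tally, with the ten sequences written out as the 5-base strings reconstructed from the table above:

# Sketch: observed base frequencies in the motif columns (positions 2-4)
# of the ten 5-base sequences from the exercise.
seqs = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]

for col in range(1, 4):                      # middle three positions (0-based 1..3)
    counts = {b: 0 for b in "ACGT"}
    for s in seqs:
        counts[s[col]] += 1
    freqs = {b: counts[b] / len(seqs) for b in "ACGT"}
    print(col, freqs)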
(b) Calculate the odds likelihood of
finding the motif at each of the
possible locations in sequence 5
Seq5
CAGAT
Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

CAGAT, motif = CAG: 0.1*0.6*0.2 / (0.25*0.25*0.25) = 0.768
CAGAT, motif = AGA: 0.1*0.1*0.2 / (0.25*0.25*0.25) = 0.128
CAGAT, motif = GAT: 0.1*0.6*0.4 / (0.25*0.25*0.25) = 1.536
Non-motif sites: 0.25/0.25=1
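A sketch of the same odds calculation, sliding the 3-base motif over Seq5; the frequency table is copied from part (a):

# Sketch: odds of the 3-base motif at each possible start position in Seq5.
freqs = {"A": [0.1, 0.6, 0.2], "C": [0.1, 0.1, 0.2],
         "G": [0.1, 0.1, 0.2], "T": [0.7, 0.2, 0.4]}
bg = 0.25
seq5 = "CAGAT"

for start in range(len(seq5) - 2):           # windows CAG, AGA, GAT
    window = seq5[start:start + 3]
    odds = 1.0
    for col, base in enumerate(window):
        odds *= freqs[base][col] / bg        # motif vs background odds per column
    print(window, round(odds, 3))            # 0.768, 0.128, 1.536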
(c) Calculate the probability of
finding the motif at each position of
sequence 5
Seq5
CAGAT
Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

CAGAT, motif = CAG: 0.1*0.6*0.2*0.25*0.25 = 0.00075
CAGAT, motif = AGA: 0.25*0.1*0.1*0.2*0.25 = 0.000125
CAGAT, motif = GAT: 0.25*0.25*0.1*0.6*0.4 = 0.0015
Non-motif sites: p=0.25
(d) Calculate what change will be made to the
base count in each column of the motif table as a
result of matching the motif to sequence 5.
CAGAT, motif = CAG: 0.1*0.6*0.2*0.25*0.25 = 0.00075, rel. weight 0.32
CAGAT, motif = AGA: 0.25*0.1*0.1*0.2*0.25 = 0.000125, rel. weight 0.05
CAGAT, motif = GAT: 0.25*0.25*0.1*0.6*0.4 = 0.0015, rel. weight 0.63

Base   1st    2nd              3rd
A      0.05   0.32+0.63=0.95   0.05
C      0.32   0                0
G      0.63   0.05             0.32
T      0      0                0.63
Updated counts from sequence 5 (initial location guess shaded).
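A sketch covering parts (c) and (d) together: the probability of each placement, the normalised relative weights, and the weighted counts that Seq5 contributes to each motif column:

# Sketch: placement probabilities, relative weights, and weighted counts
# contributed by Seq5 (CAGAT) to the motif table.
freqs = {"A": [0.1, 0.6, 0.2], "C": [0.1, 0.1, 0.2],
         "G": [0.1, 0.1, 0.2], "T": [0.7, 0.2, 0.4]}
bg = 0.25
seq5 = "CAGAT"

probs = []
for start in range(3):
    p = 1.0
    for pos, base in enumerate(seq5):
        if start <= pos < start + 3:
            p *= freqs[base][pos - start]    # motif column probability
        else:
            p *= bg                          # background probability
    probs.append(p)
print([round(p, 6) for p in probs])          # [0.00075, 0.000125, 0.0015]

weights = [p / sum(probs) for p in probs]
print([round(w, 2) for w in weights])        # [0.32, 0.05, 0.63]

# Weighted counts contributed by Seq5 to each motif column
counts = {b: [0.0, 0.0, 0.0] for b in "ACGT"}
for start, w in enumerate(weights):
    for col, base in enumerate(seq5[start:start + 3]):
        counts[base][col] += w
for b in "ACGT":
    print(b, [round(c, 2) for c in counts[b]])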
(e) What other steps are taken to
update or maximize the table
values?
• The weighted sequence data from the
remaining sequences are also added to
the counts table
• The base frequencies in the new table are
used as an updated estimate of the site
residue composition.
• The expectation and maximization steps
are repeated until the base frequencies do
not change.
Gibbs sampler
• Start from random msa; realign one
sequence against profile derived from the
other sequences; iterate
• Finds the most probable pattern common
to all of the sequences by sliding them
back and forth until the ratio of the motif
probability to the background probability is
a maximum.
A. Estimate the amino acid frequencies in the motif columns of all but one sequence.
Also obtain background.
Random start
positions chosen
xxxMxxxxx
xxxxxxMxx
xxxxxMxxx
xMxxxxxxx
xxxxxxxxx
Mxxxxxxxx
xxxxMxxxx
xMxxxxxxx
xxxxxxxxM
Motif
↓
xxxMxxxxx
xxxxxxMxx
xxxxxMxxx
xMxxxxxxx
xxxxxxxxx
Mxxxxxxxx
xxxxMxxxx
xMxxxxxxx
xxxxxxxxM
B. Use the estimates from A to calculate the ratio of probability of motif to background
score at each position in the left-out sequence. This ratio for each possible location
is the weight of the position.
xxxxxxxxxx
xxxxxxxxxx
xxxxxxxxxx
xxxxxxxxxx
M
M
M
M
C. Choose a new location for the motif in the left-out sequence by a random selection
using the weights to bias the choice.
xxxxxxxMxx
Estimated location of the motif in left-out sequence.
D. Repeat steps A to C many (>100) times.
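A compact sketch of the whole sampling loop (steps A-C repeated over left-out sequences), run here on the ten 5-base sequences used in the exercise that follows; the pseudocounts of 1, the uniform 0.25 background, the fixed motif width, the iteration count and the random seed are all illustrative assumptions.

# Minimal sketch of a Gibbs motif sampler for short DNA sequences.
import random

def gibbs_sampler(seqs, width, iters=200, seed=0):
    rng = random.Random(seed)
    # Random start positions (one motif occurrence per sequence)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):
        for left_out in range(len(seqs)):
            # A. Profile from all sequences except the left-out one (pseudocount 1)
            counts = [{b: 1.0 for b in "ACGT"} for _ in range(width)]
            for i, s in enumerate(seqs):
                if i == left_out:
                    continue
                for col in range(width):
                    counts[col][s[starts[i] + col]] += 1
            total = len(seqs) - 1 + 4        # per-column total including pseudocounts
            profile = [{b: counts[col][b] / total for b in "ACGT"}
                       for col in range(width)]
            # B. Weight every possible motif position in the left-out sequence
            s = seqs[left_out]
            weights = []
            for start in range(len(s) - width + 1):
                w = 1.0
                for col in range(width):
                    w *= profile[col][s[start + col]] / 0.25   # motif vs background
                weights.append(w)
            # C. Sample a new start position in proportion to the weights
            starts[left_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]
print(gibbs_sampler(seqs, width=3))          # sampled motif start per sequence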
Exercise: Gibbs sampler
Seq1   C CAG A
Seq2   G TTA A
Seq3   G TAC C
Seq4   T TAT T
Seq5   C AGA T
Seq6   T TTT G
Seq7   A TAC T
Seq8   C TAT G
Seq9   A GCT C
Seq10  G TAG A
• Analyze the ten DNA sequences above by the Gibbs sampling algorithm.
(a) Assuming that the background base
frequencies are 0.25, calculate a log odds
matrix for the central three positions.
Frequencies (middle three positions, as in the EM exercise):

Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

Log odds, log2(frequency / 0.25):

Base   1st     2nd     3rd
A      -1.32    1.26   -0.32
C      -1.32   -1.32   -0.32
G      -1.32   -1.32   -0.32
T       1.49   -0.32    0.68
(b) Assuming that another sequence GTTTG is the
left-out sequence, slide the log-odds matrix along the
left-out sequence and find the log-odds score at each
of three possible positions.
Base   1st     2nd     3rd
A      -1.32    1.26   -0.32
C      -1.32   -1.32   -0.32
G      -1.32   -1.32   -0.32
T       1.49   -0.32    0.68

GTTTG, window GTT: -1.32 + -0.32 + 0.68 = -0.96
GTTTG, window TTT: 1.49 + -0.32 + 0.68 = 1.85
GTTTG, window TTG: 1.49 + -0.32 + -0.32 = 0.85
(c) Calculate the probability of a match at
each position in the left-out sequence
Base   1st   2nd   3rd
A      0.1   0.6   0.2
C      0.1   0.1   0.2
G      0.1   0.1   0.2
T      0.7   0.2   0.4

log-odds s = log2(p/bg), so p = 2^s · bg (here bg = 0.25^3 = 1/64 per window)

GTTTG, window GTT: 2^-0.96 / 64 = 0.008 (directly: 0.1*0.2*0.4 = 0.008)
GTTTG, window TTT: 2^1.85 / 64 = 0.056 (directly: 0.7*0.2*0.4 = 0.056)
GTTTG, window TTG: 2^0.85 / 64 = 0.028 (directly: 0.7*0.2*0.2 = 0.028)
(d) How do we choose a possible location
for the motif in the left-out sequence?
• 0.008+0.056+0.028=0.092
• Normalised weights:
– GTTTG, window GTT: 0.008/0.092 = 0.09
– GTTTG, window TTT: 0.056/0.092 = 0.61
– GTTTG, window TTG: 0.028/0.092 = 0.30
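A sketch of parts (b)-(d) for the left-out sequence GTTTG: score each window with the log-odds matrix, convert the scores back to probabilities, and normalise the weights that bias the choice of the new motif location:

# Sketch: window log-odds scores, probabilities, and normalised weights
# for the left-out sequence in the Gibbs exercise.
logodds = {"A": [-1.32, 1.26, -0.32], "C": [-1.32, -1.32, -0.32],
           "G": [-1.32, -1.32, -0.32], "T": [1.49, -0.32, 0.68]}
left_out = "GTTTG"
bg = 0.25 ** 3                                # background probability of any 3-mer

probs = []
for start in range(len(left_out) - 2):
    window = left_out[start:start + 3]
    score = sum(logodds[b][col] for col, b in enumerate(window))
    p = (2 ** score) * bg                     # log-odds back to probability
    probs.append(p)
    print(window, round(score, 2), round(p, 3))

weights = [p / sum(probs) for p in probs]
print([round(w, 2) for w in weights])         # roughly [0.09, 0.61, 0.30]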
Modelling protein families
(The family members and the motif alignment are the same as shown on the Motifs slide.)
PSSM
• The PSSM is constructed by a logarithmic
transformation of a matrix giving the frequency
of each amino acid in the motif.
• If a good sampling of sequences is available, the
number of sequences is sufficiently large, and
the motif structure is not too complex, it should,
in principle, be possible to produce a PSSM that
is highly representative of the same motif in
other sequences also.
– Pseudocounts
– Sequence logo, information content
Exercise: construct a PSSM
Base   Column 1   Column 2   Column 3   Column 4
A      0.6        0.1        0.2        0.1
C      0.1        0.7        0.1        0.1
G      0.1        0.1        0.6        0.1
T      0.2        0.1        0.1        0.7
Exercise cntd
• (a) Assuming the background frequency is
0.25 for each base, calculate a log odds
score for each table position, i.e. log to the
base 2 of the ratio of each observed value
to the expected frequency.
Observed/expected frequencies
Base   Column 1       Column 2       Column 3       Column 4
A      0.6/0.25=2.4   0.1/0.25=0.4   0.2/0.25=0.8   0.1/0.25=0.4
C      0.1/0.25=0.4   0.7/0.25=2.8   0.1/0.25=0.4   0.1/0.25=0.4
G      0.1/0.25=0.4   0.1/0.25=0.4   0.6/0.25=2.4   0.1/0.25=0.4
T      0.2/0.25=0.8   0.1/0.25=0.4   0.1/0.25=0.4   0.7/0.25=2.8
Log-odds scores
Base   Column 1           Column 2           Column 3           Column 4
A      log2 2.4 = 1.26    log2 0.4 = -1.32   log2 0.8 = -0.32   log2 0.4 = -1.32
C      log2 0.4 = -1.32   log2 2.8 = 1.49    log2 0.4 = -1.32   log2 0.4 = -1.32
G      log2 0.4 = -1.32   log2 0.4 = -1.32   log2 2.4 = 1.26    log2 0.4 = -1.32
T      log2 0.8 = -0.32   log2 0.4 = -1.32   log2 0.4 = -1.32   log2 2.8 = 1.49
Exercise cntd
• (b) Align the matrix with each position in
the sequence TCACGTAA starting at
position 1,2, etc., and calculate the log
odds score for the matrix to that position.
Base   Column 1   Column 2   Column 3   Column 4
A       1.26      -1.32      -0.32      -1.32
C      -1.32       1.49      -1.32      -1.32
G      -1.32      -1.32       1.26      -1.32
T      -0.32      -1.32      -1.32       1.49

TCACGTAA, window TCAC: -0.32 + 1.49 + -0.32 + -1.32 = -0.47
TCACGTAA, window CACG: -1.32 + -1.32 + -1.32 + -1.32 = -5.28
TCACGTAA, window ACGT: 1.26 + 1.49 + 1.26 + 1.49 = 5.50
TCACGTAA, window CGTA: -1.32 + -1.32 + -1.32 + -1.32 = -5.28
TCACGTAA, window GTAA: -1.32 + -1.32 + -0.32 + -1.32 = -4.28
(c) Calculate the probability of the
best matching position.
• p(best window ACGT): 0.6*0.7*0.6*0.7 = 0.1764
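A sketch of parts (b) and (c): slide the log-odds matrix along TCACGTAA and report the probability of the best-scoring window; both matrices are copied from the tables above:

# Sketch: PSSM scan of TCACGTAA and probability of the best window.
logodds = {"A": [1.26, -1.32, -0.32, -1.32], "C": [-1.32, 1.49, -1.32, -1.32],
           "G": [-1.32, -1.32, 1.26, -1.32], "T": [-0.32, -1.32, -1.32, 1.49]}
freqs   = {"A": [0.6, 0.1, 0.2, 0.1], "C": [0.1, 0.7, 0.1, 0.1],
           "G": [0.1, 0.1, 0.6, 0.1], "T": [0.2, 0.1, 0.1, 0.7]}
seq = "TCACGTAA"

best = None
for start in range(len(seq) - 3):
    window = seq[start:start + 4]
    score = sum(logodds[b][col] for col, b in enumerate(window))
    print(window, round(score, 2))
    if best is None or score > best[1]:
        best = (window, score)

window, _ = best
p = 1.0
for col, b in enumerate(window):
    p *= freqs[b][col]
print(window, round(p, 4))                    # ACGT, 0.1764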
Alignment of sequences with a structure: hidden Markov models
Hidden Markov Models
● HMMs are well suited to describing correlations among neighbouring sites
● the probabilistic framework allows realistic modelling
● well-developed mathematical methods provide
  1. the best solution (most probable path through the model)
  2. a confidence score (posterior probability of any single solution)
  3. inference of structure (posterior probability of using model states)
● we can align multiple sequences using complex models and simultaneously predict their internal structure
Coin game
• Fair coin: p(1)=p(0)=0.5
– 01110001101011100…
• Biased coin: p(1)=0.8, p(0)=0.2
– 111011111100100111…
• Observed series: 010111001010011
– P(fair coin) = 0.5^15
– P(biased coin) = 0.8^8 · 0.2^7
– P(biased coin)/P(fair coin) = 0.07
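A sketch of the likelihood calculation for the observed series (which contains 8 ones and 7 zeros):

# Sketch: likelihood of the observed series under the fair and biased coins.
series = "010111001010011"
ones = series.count("1")
zeros = series.count("0")

p_fair   = 0.5 ** len(series)
p_biased = (0.8 ** ones) * (0.2 ** zeros)
print(p_fair, p_biased, round(p_biased / p_fair, 2))   # ratio ~0.07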
Markov process
• State depends only on previous state
• Markov models for sequences
– States have emission probabilities
– State transition probabilities
Hidden Markov model for two-state variable
Figure: a two-state HMM with transition probabilities t(1,1), t(1,2), t(2,2), t(2,end) and emission probabilities p1(a), p1(b), p2(a), p2(b) (t = transition probability, p = emission probability). For the state path 1 → 1 → 2 → end emitting the observed sequence x = a, b, a:

  P(x, path | HMM) = p1(a) · t(1,1) · p1(b) · t(1,2) · p2(a) · t(2,end)
profile-HMM
Figure: a profile HMM with begin and end states, match states 1-4, and an insert and a delete state at each position.
• a match state emits one of the 20 amino acids
• an insert state emits one of the 20 amino acids
• delete, begin and end states are mute (they emit nothing)
Figure: a possible hidden Markov model for the protein ACCY.
Figure: an HMM with multiple paths through the model for ACCY; the highlighted path is only one of several possibilities.
Viterbi algorithm
• computes the sequence scores over the
most likely path rather than over the sum
of all paths.
Forward algorithm
• Similar to Viterbi except that a sum rather
than the maximum is computed
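A minimal sketch contrasting the two recursions on a toy, fully specified two-state model; the model, its parameters and the observed sequence are illustrative assumptions, not a profile HMM from the lecture.

# Sketch: Viterbi (max over paths) vs forward (sum over paths) on a toy HMM.
def viterbi_and_forward(obs, states, start, trans, emit):
    vit = {s: start[s] * emit[s][obs[0]] for s in states}
    fwd = dict(vit)
    for x in obs[1:]:
        vit = {s: max(vit[r] * trans[r][s] for r in states) * emit[s][x] for s in states}
        fwd = {s: sum(fwd[r] * trans[r][s] for r in states) * emit[s][x] for s in states}
    return max(vit.values()), sum(fwd.values())

states = ("match", "insert")
start  = {"match": 0.9, "insert": 0.1}
trans  = {"match": {"match": 0.8, "insert": 0.2},
          "insert": {"match": 0.6, "insert": 0.4}}
emit   = {"match":  {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1},
          "insert": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

best_path_prob, total_prob = viterbi_and_forward("ACCT", states, start, trans, emit)
print(best_path_prob, total_prob)   # probability of the best single path vs of all paths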
What the Score Means
Once the probability of a sequence has been
determined, its score can be computed.
Because the model is a generalization of how
amino acids are distributed in a related group (or
class) of sequences, a score measures the
probability that a sequence belongs to the class.
A high score implies that the sequence of
interest is probably a member of the class, and a
low score implies it is probably not a member.
Optimisation
• The Baum-Welch algorithm is a variation of the forward
algorithm described earlier. It begins with a reasonable
guess for an initial model and then calculates a score for
each sequence in the training set over all possible paths
through this model. During the next iteration, a new set
of expected emission and transition probabilities is
calculated. The updated parameters replace those in the
initial model, and the training sequences are scored
against the new model. The process is repeated until
model convergence, meaning there is very little change
in parameters between iterations.
• The Viterbi algorithm is less computationally expensive
than Baum-Welch.
Heuristics
• There is no guarantee that a model built
with either the Baum-Welch or Viterbi
algorithm has parameters which maximize
the probability of the training set. As in
many iterative methods, convergence
indicates only that a local maximum has
been found. Several heuristic methods
have been developed to deal with this
problem.
Heuristics: parallel trials
• start with several initial models and
proceed to build several models in parallel.
When the models converge at several
different local optima, the probability of
each model given the training set is
computed, and the model with the highest
probability wins.
Heuristics: add noise
• add noise, or random data, into the mix at each
iteration of the model building process. Typically,
an annealing schedule is used. The schedule
controls the amount of noise added during each
iteration. Less and less noise is added as
iterations proceed. The decrease is either linear
or exponential. The effect is to delay the
convergence of the model. When the model
finally does converge, it is more likely to have
found a good approximation to the global
maximum.
Sequence weighting
Overfitting and regularization
CGGSLLNAN--TVLTAAHC
CGGSLIDNK-GWILTAAHC
CGGSLIRQG--WVMTAAHC
CGGSLIREDSSFVLTAAHC
Dirichlet mixtures
• A sophisticated application of this method is known as Dirichlet
mixtures. The mixtures are created by statistical analysis of the
distribution of amino acids at particular positions in a large number
of proteins. The mixtures are built from smaller components known
as Dirichlet densities.
• A Dirichlet density is a probability density over all possible
combinations of amino acids appearing in a given position. It gives
high probability to certain distributions and low probability to others.
For example, a particular Dirichlet density may give high probability
to conserved distributions where a single amino acid predominates
over all others. Another possibility is a density where high probability
is given to amino acids with a common identifying feature, such as
the subgroup of hydrophobic amino acids.
• When an HMM is built using a Dirichlet mixture, a wealth of
information about protein structure is factored into the parameter
estimation process. The pseudocounts for each amino acid are
calculated from a weighted sum of Dirichlet densities and added to
the observed amino acid counts from the training set. The
parameters of the model are calculated as described above for
simple pseudocounts.
Log-odds ratio
• This number is the log of the ratio between two
probabilities — the probability that the sequence was
generated by the HMM and the probability that the
sequence was generated by a null model, whose
parameters reflect the general amino acid distribution in
the training sequences.
Limitations of profile-HMM
• The HMM is a linear model and is unable to capture higher order
correlations among amino acids in a protein molecule.
• In reality, amino acids which are far apart in the linear chain may be
physically close to each other when a protein folds. Chemical and
electrical interactions between them cannot be predicted with a
linear model.
• Another flaw of HMMs lies at the very heart of the mathematical
theory behind these models: the probability of a protein sequence
can be found by multiplying the probabilities of the amino acids in
the sequence. This claim is only valid if the probability of any amino
acid in the sequence is independent of the probabilities of its
neighbors.
• In biology, this is not the case. There are, in fact, strong
dependencies between these probabilities. For example,
hydrophobic amino acids are highly likely to appear in proximity to
each other. Because such residues avoid water, they cluster in the
interior of a protein rather than at the surface, where they would be
forced to contact water molecules.
A simplified hidden Markov model (HMM)
Figure: a simplified HMM with BEGIN and END states, three match states and associated insert and delete states. The match states emit bases with probabilities
  match 1: A 0.1, C 0.2, G 0.2, T 0.5
  match 2: A 0.4, C 0.3, G 0.1, T 0.2
  match 3: A 0.1, C 0.4, G 0.4, T 0.1
and each insert state emits every base with probability 0.25 (the P = 1 labels mark the silent states). The transition probabilities shown are 0.7 (to the next match state), 0.2 (to a delete state), 0.1 (to an insert state), 1.0 (out of insert and delete states) and 0.9 (from the last match state to END).
(a) Calculate the probability of the sequence TAG by
following a path through the model’s three match states.
0.7·0.5 (→match 1, emit T) · 0.7·0.4 (→match 2, emit A) · 0.7·0.4 (→match 3, emit G) · 0.9 (→End) = 0.0247
(b) Repeat a for a path that goes first to the insert state, then to a match
state, then to a delete state, then to a match state and End.
0.1·0.25 (→insert, emit T) · 1·0.1 (→match 1, emit A) · 0.2·1·1 (→delete →match 3) · 0.4 (emit G) · 0.9 (→End) = 0.00018
(c) Which of the two paths is the more probable
one, and what is the ratio of the probability of the
higher to the lower one?
• p(path1)/p(path2) = 0.0247 / 0.00018 =
137.2
• The highest-scoring path is the best
alignment of the sequence with the model.
The Viterbi algorithm is similar to dynamic
programming and finds the highest-scoring
path.
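The two path probabilities and their ratio can be written out directly as products of the transition and emission probabilities read off the model figure:

# Sketch: path probabilities for TAG from parts (a) and (b), and their ratio.
path_match = 0.7 * 0.5 * 0.7 * 0.4 * 0.7 * 0.4 * 0.9               # Begin->M1->M2->M3->End
path_insert_delete = 0.1 * 0.25 * 1.0 * 0.1 * 0.2 * 1.0 * 1.0 * 0.4 * 0.9
print(round(path_match, 4), round(path_insert_delete, 5))           # 0.0247, 0.00018
print(round(path_match / path_insert_delete, 1))                    # ~137.2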
Improving the model
• Adjust the scores for the states and transition
probabilities by aligning additional sequences
with the model using an HMM adaptation of the
expectation maximization algorithm.
• In the expectation step, calculate all of the
possible paths through the model, sum the
scores, and then calculate the probability of
each path. Each state and transition probability
is then updated by the maximization step of the
algorithm to make the model better predict the
new sequence.
Information theory primer
• Information, Uncertainty
• Suppose we have a device that can produce 3
symbols, A, B, or C → “uncertainty of 3 symbols”
– Uncertainty = log2(M) with M being the number of
symbols
• Logarithm base 2 gives units in bits.
– Example: In reading mRNA, if the ribosome
encounters any one of 4 equally likely bases, then the
uncertainty is 2 bits.
Surprisal u
• Let symbols have probabilities Pi
• Surprisal ui = -log2(Pi)
• For an infinite string of symbols, the average surprisal is
  H = Σi Pi ui = -Σi Pi log2(Pi) (bits per symbol)
  where the sum runs over all symbols i
• H is Shannon’s entropy
H function in the case of two symbols
Figure: H(p) = -p log2(p) - (1-p) log2(1-p) plotted for p from 0 to 1; the curve peaks at 1 bit when the two symbols are equally likely (p = 0.5) and falls to 0 when either symbol is certain.
If all symbols are equally likely?
• Hequiprobable = -Σ (1/M) log2(1/M) = log2(M)
Coding
• Shorter codes using few bits for common
symbols and many bits for rare symbols.
• M=4: A C G T with probabilities
• P(A)=1/2, P(C)=1/4, P(G)=1/8, P(T)=1/8
• Surprisals U(A)=1 bit, U(C)=2 bits,
U(G)=U(T)=3 bits
• Uncertainty is
H=1/2*1+1/4*2+1/8*3+1/8*3=1.75 (bits per
symbol)
Recode symbols so that the number of
binary digits equals the surprisal
• A = 1
• C = 01
• G = 000
• T = 001
• The string ACATGAAC is coded as
1.01.1.001.000.1.1.01
• 14 bits / 8 symbols = 1.75 bits per symbol
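A sketch that encodes ACATGAAC with this variable-length code and checks that the average code length matches the 1.75 bits/symbol entropy:

# Sketch: prefix coding of ACATGAAC and the corresponding entropy.
import math

code = {"A": "1", "C": "01", "G": "000", "T": "001"}
p    = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}

msg = "ACATGAAC"
encoded = "".join(code[b] for b in msg)
print(encoded, len(encoded) / len(msg))              # 14 bits / 8 symbols = 1.75

H = -sum(pi * math.log2(pi) for pi in p.values())
print(H)                                             # 1.75 bits per symbol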
Noisy communication channels
• Sender ----[noise]---- receiver
– Two equally likely symbols sent at a rate of 1
bit per second
– if x=0 is sent, probability of receiving y=0 is
0.99 and probability of receiving y=1 is 0.01,
and vice versa
– Then the uncertainty after receiving a symbol is Hy(x) = -0.99 log2(0.99) - 0.01 log2(0.01) = 0.081
– So the actual rate of transmission is R = 1 - 0.081 = 0.919 bits per second
Exercise: information content
Base   Column 1   Column 2   Column 3   Column 4
A      0.6        0.1        0.2        0.1
C      0.1        0.7        0.1        0.1
G      0.1        0.1        0.6        0.1
T      0.2        0.1        0.1        0.7
Information content of a scoring matrix by the relative
entropy method (ignores background frequencies)
• (a) calculate the entropy or uncertainty (Hc) for each column and for the entire matrix. Hc = -Σi pic log2(pic), where pic is the frequency of amino acid type i in column c.
• Column 1: Hc = -(0.6 log2(0.6) + 2*0.1 log2(0.1) + 0.2 log2(0.2)) = 1.571
• Column 2: Hc = -(0.7 log2(0.7) + 3*0.1 log2(0.1)) = 1.357
• Column 3: Hc = -(0.6 log2(0.6) + 2*0.1 log2(0.1) + 0.2 log2(0.2)) = 1.571
• Column 4: Hc = -(0.7 log2(0.7) + 3*0.1 log2(0.1)) = 1.357
• Note: log2(x) = log(x)/log(2)
• (b) calculate the decrease in uncertainty or
amount of information (Rc) for column 1
due to these data (for DNA, Rc=2-Hc and
for proteins, Rc=4.32-Hc).
• If it’s DNA: Rc = 2 - Hc = 2 - 1.57 = 0.43
• If it’s protein: Rc = 4.32 - Hc = 4.32 - 1.57 = 2.75
• (c) calculate the amount that the
uncertainty is reduced (or the amount of
information contributed) for each base in
column 1.
• f(A) = 0.6: HA,1 = -0.6 log2(0.6) = 0.442
• f(G) = f(C) = 0.1: HG,1 = HC,1 = -0.1 log2(0.1) = 0.332
• f(T) = 0.2: HT,1 = -0.2 log2(0.2) = 0.464
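A sketch computing Hc and Rc = 2 - Hc for all four columns of the DNA matrix above:

# Sketch: column entropies Hc and information content Rc = 2 - Hc
# for the DNA scoring matrix (relative-entropy method, uniform background).
import math

columns = [
    {"A": 0.6, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

for i, col in enumerate(columns, start=1):
    H = -sum(p * math.log2(p) for p in col.values())
    print(f"column {i}: Hc = {H:.3f}, Rc = {2 - H:.3f}")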
Database searching
• The first and most common operation in protein
informatics...and the only way to access the information
in large databases
• Primary tool for inference of homologous structure and
function
• Improved algorithms to handle large databases quickly
• Provides an estimate of statistical significance
• Generates alignments
• Definitions of similarity can be tuned using different
scoring matrices and algorithm-specific parameters
Types of alignment
• Sequence-sequence
– Target distribution = generic substitution matrix
• Sequence-profile
– Position-specific target distributions
• Profile-profile
– Observed frequencies from multiple alignment
– Position-specific target distributions
– Average both ways
• Pair HMM
– Probability that two HMMs generate same sequence
PSI-Blast
• Position-specific-iterated Blast
Steps in a PSI-Blast search
• 1. PSI-BLAST constructs a multiple alignment from a Gapped Blast search and generates a profile from any significant local alignments found
• 2. The profile is compared to the protein database, and PSI-BLAST estimates the statistical significance of the local alignments found, using "significant" hits to extend the profile for the next round
• 3. PSI-BLAST iterates step 2 an arbitrary number of times or until convergence
PHI-Blast
• Pattern-hit-initiated Blast