Alignment and Algorithms

Download Report

Transcript Alignment and Algorithms

Alignment and Algorithms
•Scoring Matrices: models of evolution
•Dynamic Program: speed
•Pairwise Sequence Alignment, variants
•Significance
•Pattern recognition, Hidden Markov
Models (HMM)
•Multiple Sequence Alignment (MSA)
•Protein families
•Structure features
Scoring Models and Evolution
• Several systems of scoring matrices.
• Statistically based, curated data: I.e.,
based on confident alignments of
multiple sequences.
Two best known:
PAM matrices: Dayhoff et al., 1967
BLOSUM matrices: Henikoff & Henikoff,
1992
Scoring matrices, cont.
Recall: using p(a,b) = prob. that a (amino
acid residue) mutated into b.
PAM (Point Accepted Mutation) uses
sequence data for proteins which can
be aligned with 99% identical sequence.
In such aligned sequences, one counts
the frequency of each residue, and of
each pair of residues in a “column”.
Scoring matrices, cont.
Basically, the score s(a,b) from this data
is a “LOD score”:
s(a,b) = log(p(a,b)/p(a)p(b)).
Comparison of two “models”: significant
association vs. independent
randomness.
Pruned: scaled to make matrices integervalued.
Scoring matrices, cont.
Interpretation: Dayhoff, et al. define a
short unit of evolutionary time, a PAM
unit. Only very conservative changes
will be seen in the composition of
evolving proteins in this scale.
To extend to more distant similarities,
used time independent Markov
assumption.
Scoring matrices, cont.
p(a,b) = p(a-->b|t=1) gives the transition
probability from a --> b (or b --> a: we
cannot keep track of directionality) in
one unit of PAM time.
The time independent Markov model
assumes p(a-->b|t=2) in two units is
simply the sum over all pairs of
intermediate one time unit transitions:
a --> c followed by c --> b
Scoring matrices, cont.
If P(1) is the original 20x20 matrix of
amino acid transition probabilities, then
the above says simply
P(n) = P(1) x P(1)x…x P(1) (n times)
(matrix multiplication)
Scoring matrices, cont.
So, PAM(n) matrix of LOD scores from
transitions in PAM time n.
Available: n=1 to n=500: smaller n for
more closely related sequences.
Most commonly used: 50 to 250.
Critique: accurate for small time, goes off
at larger times. Compare nucleic acid
changes to amino acid changes.
Scoring matrices, cont.
BLOSUM Matrices
From the BLOCKS database.
BLOSUM = BLOCKS Substitution Matrix
DB of aligned sequence. Conserved regions are
marked. The BLOSUM(L)
transition probabilities are figured from
sequences with L% identity of sequence.
Score is a LOD score again. So, large
BLOSUM L <--> small PAM n.
Warning: corrections for oversampling of closely
related sequence!
Scoring matrices, cont.
If the BLOSUM62 matrix is compared to
PAM160 (it's closest equivalent) then it is
found that the BLOSUM matrix is less tolerant
of substitutions to or from hydrophilic amino
acids, while more tolerant of hydrophobic
changes and of cysteine and tryptophan
mismatches.
PAM is less sensitive to the effect of amino acid
“return” substitutions, over long time periods.
Dynamic Programming
Problem: find (1) best score and (2) best
alignment
Naïve: evaluate each alignment and
compare
Too many alignments!
Need more speed: DP
Dynamic Programming, cont.
No. of alignments is exponentially large.
Use: problem has a linear (left to right)
structure to it, from the linearity of
proteins, nucleic acids.
Analogous to the problem of distance;
originally arose in CS, “edit distance”.
Dynamic Programming, cont.
The basic algorithms here are:
(1) Needleman-Wunsch (global alignment)
(2) Smith-Waterman (local alignment)
[BLAST is a heuristic version of SW.]
(3) Many variants.
Dynamic Programming, cont.
NW algorithm can be understood pictorially.
It has three basic steps:
•
Initialization
•
Iteration
•
Traceback
Basic insight: you only have to optimize the score one
step at a time, and can discard looking at most
alignments.
Dynamic Programming, cont.
Missing scoring feature: gap penalties.
Not as well understood as substitutions. Use “geometric
random variable models”.
s(a,-) is independent of a, say s(a,-) = -d = “gap penalty”
For longer gaps, two methods:
(a) linear: k gaps in a row gives penalty k x (-d).
(b) affine: k gaps in a row gives penalty
e + (k-1) x (-d),
Where e is the “gap initiation penalty”.
Dynamic Programming, cont.
Notice: affine gap does not just depend on the position, since you
have to know whether you are continuing a gap from the left.
First example of dependency in scoring.
Key to the iterative step is to realize there are only three ways to
extend an alignment:
R1
R1
r1
r1
We have to update the score by taking the max of three
possibilities: max(s0 + s(R1,r1), s0 + s(R1,-), s0+ s(-,r1)).
Then to get the scores of all possible subalignments takes just
about 5 MN computations, instead of an exponential number.
Dynamic Programming, cont.
To derive the actual alignment requires storage: one stores a
traceback arrow from each position in the comparison array.
This information can be translated into the correct alignment.
Variants:
Smith-Waterman: local alignment. Adjustment is to 0-out and start
over again any time a NW score goes below zero, and
procede. One chooses the largest scoreof a segment, and
traces back as earlier.
Linear storage: saves storage.
Repeats: allows for duplication of segments.
Dynamic Programming, cont.
Variants (cont.):
Non-overlapping: Repeats SW, avoiding previous best alignments.
Sampling of highest hits: the best alignment score maybe an
artifact of our scoring system, so many alignments may be
high scoring.
Affine gap penalties: actually have to have more states carried
along, a precursor to Hidden Markov Models.
Significance
E-value in BLAST discussed last time.
Equivalent to statistical concept of a pvalue.
S =score for best alignment is a random
variable.
The p-value for an alignment is the
probability that such a high score could
have been found randomly.
Significance, cont.
This can be estimated two ways:
Ungapped local alignment: Karlin-Altschul
statistics.
Generally, KA not available, rely instead on
simulations: take two sequences A and B,
and randomly “shuffle” one and recompare.
This eliminates composition bias in the teest
samples, but randomizes the rest of the
biological information in the sequence,
namely the order of the residues (or nt’s).
SW computes this for 100’s of shuffles.
HMM’s
These are a special case of a general CS
problem: pattern recognition, or machine
learning. Can one make the machine a
master at recognizing or classifying certain
things. There are a large family of such
graphical models, this is one of the simplest.
Basic problem in MB: can one recognize
different functional or structural regions in a
protein from it’s a.a. sequence? It’s fold? Or
less specifically, what are the common
features of a group of proteins.
HMM’s, cont.
For example, there exist very many G-protein coupled
receptors, which fall into seven classes. Can one
classify a new protein into (a) being a GPCR, and (b)
assigning its class to it from sequence data alone?
Simplest example problem: dishonest casino. You are
given a sequence of observed random variables or
states of the system (faces of a die), but you don’t
see hidden states (e.g., whether a casino is using a
fair or biased die). Notice that this kind of example
has a linear structure to it, like our a.a. sequences.
HMM’s, cont.
The problem is: given an observed path of (observable states)
determine the underlying path through the hidden states. I.e.,
given the record of the coin flips, say when the biased coin was
used and when the fair. Notice that in this model we know how
many visible states there are and we are assuming we know
how many hidden states there are.
Of course, we cannot know with cerrtainty when the fair/biased
coin is used, we must seek a probable answer, and the
algorithms show how to choose a most probable path through
the model.
The harder problem (biological problem) is when we do not know
what the hidden states are, except where we can define them,
e.g., structurally. For example, you could run through protein
a.a. sequence and ask a HMM to decide which residues are
situated in alpha helices of the protein, or when they llie in the
cell membrane.
HMM’s, cont.
In the first problem, if we know all the parameters, meaning mainly
the transition problems for the Hidden Markov states, then it is
easy to find a dynamic programming algorithm for finding the
best path. This is the Viterbi algorithm.
[Sometimes called parsing the hidden markov model, because the
method originated in speech recognition work.]
The harder problem is to figure out what the parameters are for the
hidden states. This is called training the model, and is done by
the Baum-Welch algorithm. This is a version of the so-called EM
method, or algorithm (Expectation Maximization algorithm).
BW, or EM, is a heuristic optimization algorithm. That is, it starts
from a given initial guess for the parameters involved and
performs a calculation which is guaranteed to improve (make no
worse, actually) the previous guess. One runs for a certain time
and turns things off by a suitable threshold.
HMM’s, cont.
Finally, a harder problem still is to “learn” the correct
architecture of the model. There are packages for this
in the case of protein families/ multiple sequence
alignments.
In the case of structure, for example, one sees a
classical problem in statistics: one can let the data
tell directly what the patterns should be
(“unsupervised learning”) or you could use expertise
to guide the data. This allows fainter structural
signals to be “heard”. Problem: is this fittingthe data
too tightly, making the model generated harder to be
generally applicable? Example of TMHMM and
unusual, but necessary faint signals.