Fasta, Blast, Probabilities
Download
Report
Transcript Fasta, Blast, Probabilities
Fasta, Blast, Probabilities
.
Reminder
Last
classes we discussed dynamic programming
algorithms for
global alignment
local alignment
Multiple alignment
All
of these assumed a scoring rule:
: ( {}) ( {})
that determines the quality of perfect matches,
substitutions, insertions, and deletions.
2
Alignment in Real Life
One
of the major uses of alignments is to find
sequences in a “database.”
The current protein database contains about 100
millions (i.e.,108) residues ! So searching a 1000
long target sequence requires to evaluate about
1011 matrix cells which will take about three hours in
the rate of 10 millions evaluations per second.
Quite annoying when, say, one thousand target
sequences need to be searched because it will take
about four months to run.
3
Heuristic Fast Search
Instead,
most searches rely on heuristic procedures
These are not guaranteed to find the best match
Sometimes, they will completely miss a high-scoring
match
We now describe the main ideas used by the best
known of these heuristic procedures.
4
Basic Intuition
Almost
all heuristic search procedures are based
on the observation that real-life matches often
contain long strings with gap-less matches.
These
heuristic try to find significant gap-less
matches and then extend them.
5
Banded DP
that we have two strings s[1..n] and
t[1..m] such that nm
If the optimal alignment of s and t has few gaps,
then path of the alignment will be close to diagonal
Suppose
s
t
6
Banded DP
To
find such a path, it suffices to search in a
diagonal region of the matrix.
If the diagonal band has width k, then the dynamic
programming step takes O(kn).
Much faster than O(n2) of standard DP.
s
V[i,i+k/2]
V[i, i+k/2+1]
Out of range
V[i+1, i+k/2 +1]
Note that for diagonals i-j = constant.
t
k
7
Banded DP for local alignment
Problem: Where is the banded diagonal ? It need
not be the main diagonal when looking for a good
local alignment.
s
How do we select which
subsequences to align
using banded DP?
k
t
We heuristically find potential
diagonals and evaluate them
using Banded DP.
This is the main idea of FASTA.
8
Finding Potential Diagonals
Suppose that we have a relatively long gap-less match
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
Can we find “clues” that will let us find it quickly?
Each such sequence defines a potential diagonal (which is
then evaluated using Banded DP.
9
Signature of a Match
Assumption: good matches contain several “patches” of
perfect matches
AGCGCCATGGATTGAGCTA
TGCGACATTGATCGACCTA
s
Since this is a gap-less alignment,
all perfect match regions
should be on one diagonal
t
10
FASTA-finding ungapped matches
Input: strings s and t, and a parameter ktup
Find all pairs (i,j) such that s[i..i+ktup]=t[j..j+ktup]
Locate sets of pairs that are on the same diagonal
By sorting according to the difference i-j
Compute the score for the diagonal that contains all these
pairs
s
t
11
FASTA-finding ungapped matches
Input: strings s and t, and a parameter ktup
Find all pairs (i,j) such that s[i..i+ktup]=t[j..j+ktup]
Step one: prepare an index of the database and the
query sequence such that given a sequence of
length ktup, one gets the list of positions. (Linear
time).
Step two: for each ktup from the query add 1 in the
diagonal (i-j) in which it appears. Then find
contiguous (possibly with mismatch) ktup in s
diagonals.
t
12
FASTA- using banded DP
Step 3:
Select the ten high scoring contiguous
segments
Try and score all combinations of
these ten segments in order to
constitute a pass into the matrix
Step 4:
Run banded DP on the region
containing the best scoring pass (say
with width 12).
s
2
Hence, the algorithm may combine
some diagonals into gapped
t
matches (in the example below
combine diagonals 2 and 3).
3
1
13
FASTA- practical choices
Most applications of FASTA use very small ktup (1-2 for
proteins, and 4-6 for DNA).
Higher values are faster, yielding less diagonal to search
around, but increase the chance to miss the optimal local
alignment.
s
t
Some implementation choices /tricks
have not been explicated herein.
14
FASTA-summary
Input: strings s and t, and a parameter ktup = 1,2,4,5, or 6
depending on the application.
Output: A highly scored local alignment
1.
2.
3.
Find pairs of matching substrings s[i..i+ktup]=t[j..j+ktup]
Extend to ungapped diagonals
Extend to gapped matches using banded DP
15
BLAST Overview
(Basic Local Alignment Search Tool)
Input: strings s and t, and a parameter T = threshold value
Output: A highly scored local alignment
Definition: Two strings s and t of length k are a high scoring
pair (HSP) if d(s,t) > T (usually consider un-gapped
alignments only).
1.
2.
3.
Find high scoring pairs of substrings such that d(s,t) > T
These words serve as seeds for finding longer matches
Extend to ungapped diagonals (as in FASTA)
Extend to gapped matches
16
BLAST Overview (cont.)
Step 1: Find high scoring pairs of substrings such that
d(s,t) > T (The seeds):
Find all strings of length k which score at
least T with substrings of s in a gapless
alignment (k = 4 for proteins, 11 for DNA)
(note: possibly, not all k-words must be tested, e.g. when
such a word scores less than T with itself).
Find in t all exact matches with each of the
above strings.
17
Extending Potential Matches
Once a seed is found, BLAST attempts to find
a local alignment that extends the seed.
Seeds on the same diagonal are combined
(as in FASTA), then extended as far as
possible in a greedy manner without gap.
During the extension phase, the
search stops when the score passes
below some lower bound computed by
BLAST (to save time).
For the best ungap alignment do a
banded SW an assign a probabilistic
score.
s
t
18
19
20
21
22
23
24
25
Why use probability to define and/or
interpret a scoring function ?
• Similarity is probabilistic in nature because
biological changes like mutation, recombination, and
selection, are not deterministic.
• We could answer questions such as:
• How probable two sequences are similar?
• Is the similarity found significant or random?
• How to change a similarity score when, say,
mutation rate of a specific area on the
chromosome becomes known ?
26
A Probabilistic Model
For
now, we will focus on alignment without indels.
For now, we assume each position (nucleotide
/amino-acid) is independent of other positions.
We consider two options:
M: the sequences are Matched (related)
R: the sequences are Random (unrelated)
27
Unrelated Sequences
Our
random model of unrelated sequences is simple
Each position is sampled independently from a
distribution over the alphabet
We assume there is a distribution q() that
describes the probability of letters in such
positions.
Then:
P( s[1..n], t[1..n] | R) q( s[i]) q(t[i])
i
28
Related Sequences
We
assume that each pair of aligned positions
(s[i],t[i]) evolved from a common ancestor
Let p(a,b) be a distribution over pairs of letters.
p(a,b) is the probability that some ancestral letter
evolved into this particular pair of letters.
P( s[1..n], t[1..n] | M ) p( s[i], t[i])
i
29
Odd-Ratio Test for Alignment
p( s[i ], t[i ])
P ( s, t | M )
p ( s[i ], t[i ])
Q
P( s, t | R) q ( s[i ]) q (t[i ])
q( s[i ]) q (t[i ])
i
i
i
If Q > 1, then the two strings s and t are more likely to
be related (M) than unrelated (R).
If Q < 1, then the two strings s and t are more likely to
be unrelated (R) than related (M).
30
Log Odd-Ratio Test for Alignment
Taking logarithm of Q yields
Score(s[i],t[i])
P ( s, t | M )
p ( s[i ], t[i ])
p ( s[i ], t[i ])
log
log
log
P ( s, t | R )
q ( s[i ]) q (t[i ])
i
i q ( s[i ]) q (t[i ])
If log Q > 0, then s and t are more likely to be related.
If log Q < 0, then they are more likely to be unrelated.
How can we relate this quantity to a score function ?
31
Probabilistic Interpretation of Scores
We
define the scoring function via
p (a , b )
(a , b ) log
q (a )q (b )
Then,
the score of an alignment is the log-ratio
between the two models:
Score > 0
Score < 0
Model is more likely
Random is more likely
32
Estimating Probabilities
we are given a long string s[1..n] of
letters from
We want to estimate the distribution q(·) that
generated the sequence
How should we go about this?
Suppose
We build on the theory of parameter estimation in
statistics using either maximum likelihood
estimation or the Bayesian approach .
33
Estimating q()
Suppose
from
we are given a long string s[1..n] of letters
s can be the concatenation of all sequences in our
database
We
want to estimate the distribution q()
n
Na
L
(
q
|
s
)
q
(
s
[
i
])
q
(
a
)
Likelihood function:
i 1
That
a
is, q is defined per letter
34
Estimating q() (cont.)
Likelihood function: L(q | s )
How do we define q?
n
q(s[i]) q(a)
i 1
a
ML parameters
MAP parameters
(Maximum Likelihood)
(Maximum A posteriori Probability)
Na
q(a)
n
Na
Na 1
q(a)
n | |
35
Estimating p(·,·)
Intuition:
Find pair of aligned sequences s[1..n], t[1..n],
Estimate probability of pairs:
p (a , b )
Na ,b
n
Number of times a is
aligned with b in (s,t)
Again,
s and t can be the concatenation of many
aligned pairs from the database
36
Problems in Estimating p(·,·)
How
do we find pairs of aligned sequences?
How far is the ancestor ?
earlier divergence low sequence similarity
later divergence high sequence similarity
Does one letter mutate to the other or are they both
mutations of a common ancestor having yet
another residue/nucleotide acid ?
37
47
BLOSUM 62
50