Scoring functions: PAM and BLOSUM

Download Report

Transcript Scoring functions: PAM and BLOSUM

Computational Genomics
Lecture #3a (revised 24/3/09)
1.
2.
Intro to ML and Scoring functions
PAM and BLOSUM AA scoring matrices
Background Readings: Section 2.7, 2.8 in Biological Sequence
Analysis, Durbin et al., 2001.
Sections 15.7-15.11 in Algorithms on Strings, Trees, and Sequences,
Gusfield, 1997.
This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir.
Changes
made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
.
Scoring Functions

So far, we discussed dynamic programming
algorithms for
 global alignment
 local alignment
 other, related versions
 All
of these assumed a scoring function:
σ : (  {})  (  {})  
that determines the value of substitutions,
insertions, and deletions.
2
Where does the scoring function come from ?
An additive scoring function is defined by specifying a function
( ,  ) such that
 (x,y) is the score of replacing x by y (including x=y)
 (x,-) is the score of deleting x
 (-,x) is the score of inserting x
But how do we come up with the “correct” score ?
Idea: By encoding experience of what are similar
sequences for the task at hand. Similarity depends
on time, evolution trends, and sequence types.
3
Why probability setting is appropriate to
define and interpret a scoring function ?
• Similarity is probabilistic in nature because biological
changes like mutation, recombination, and selection, are
random events.
• We could answer questions such as:
• How probable it is for two sequences to be similar?
• Is the similarity found significant or spurious?
• How to change a similarity score when, say,
mutation rate of a specific area on the chromosome
becomes known ?
4
A Probabilistic Model
 For
starters, will focus on alignment without indels
(implying aligned sequences are equal length).
 For now, we assume each position (nucleotide /aminoacid) is independent of other positions.
This is not a very realistic assumption, BUT it makes
our life a lot easier, and is a good start.
 We consider two options:
M: the sequences are Matched (related)
R: the sequences are Random (unrelated)
5
Unrelated Sequences (R)


Our random model R of unrelated sequences is
simple
 Each position is sampled independently from a
fixed distribution q() over the alphabet 
 We assume there is a distribution q() that
describes the probability of single letters.
Then:
P(s[1..n], t[1..n] | R )   q(s[i])q( t[i])
i
6
Related Sequences (M)
 We
assume that each pair of aligned positions
(s[i],t[i]) evolved from a common ancestor
 Let p(a,b) be a distribution over pairs of letters.
 p(a,b) is the probability that some ancestral letter
evolved into this particular pair of letters.
P(s[1..n], t[1..n] | M)   p(s[i], t[i])
i
Compare to:
P(s[1..n], t[1..n] | R )   q(s[i])q(t[i])
i
7
Odd-Ratio Test for Alignment
p(s[i], t[i])

P(s, t | M)
p(s[i], t[i])
Q


P(s, t | R )  q(s[i])q(t[i])
q(s[i])q(t[i])
i
i
i
If Q > 1, then the two strings s and t are more likely to
be related (M) than unrelated (R).
If Q < 1, then the two strings s and t are more likely to
be unrelated (R) than related (M).
8
Log Odd-Ratio Test for Alignment
Taking logarithm of Q yields
Score(s[i],t[i])
P ( s, t | M )
p ( s[i ], t[i ])
p ( s[i ], t[i ])
log
 log 
  log
P ( s, t | R )
q ( s[i ]) q (t[i ])
i
i q ( s[i ]) q (t[i ])
If log Q > 0, then s and t are more likely to be related.
If log Q < 0, then they are more likely to be unrelated.
(usually we want some constant positive threshold to
“declare” relatedness).
How can we relate this quantity to a scoring function ?
9
Probabilistic Interpretation of Scores
 We
define the scoring function via
p ( a, b)
 (a, b)  log
q(a)q(b)
 Then,
the score of an alignment is the log-ratio
between the two models:
 Score > 0
 Score < 0


Model is more likely
Random is more likely
10
Probabilistic Interpretation of Scores

We’ve defined the scoring function via
p ( a, b)
score (a, b)  log
q(a)q(b)



For example, suppose q(a)=0.1, q(b)=0.1,
f(a,b)=0.001, f(a,a)=0.08, f(b,b)=0.07.
Then score (a,b)=log 0.1 < 0;
score (a,a)=log 8 > 0;
score (b,b)=log 7 > 0;
Logarithms are convenient in order to have an additive
scoring function under assumption of independence
(log of product = sum of logs).
11
Modeling Assumptions
 It
is important to note that this interpretation
depends on our modeling assumption!!
 For
example, if we assume that the letter in each
position depends on the letter in the preceding
position, then the likelihood ratio will have a
different form.
 If
we assume, for proteins, some joint distribution of
letters that are nearby in 3D space after protein
folding, then likelihood ratio will again be different.
12
Estimating Probabilities
we are given a long string s[1..n] of
letters from 
 We want to estimate the single letters distribution,
q(·), that best corresponds to the sequence
 How should we go about this?
 Suppose
The theory of parameter estimation in statistics
Deals with this problem in detail.
But in our case, there is no need for heavy tools.
13
Estimating q()



Suppose we are given a long string s[1..n] of
letters from 
s is the concatenation of many sequences in our database
We want to estimate the distribution q()
Likelihood function: L(s | q ) 
n
 q(s[i])   q(a )
i 1

Na
a
That is, q is defined per single letters
14
Estimating q() (cont.)
Likelihood function: L(s | q) 
How do we define q?
ML parameters
(Maximum Likelihood)
n
 q(s[i])   q(a)
i 1
Na
a
Na
q(a) 
n
Namely q(a) is simply the observed
frequency of a in the sequence
15
Estimating p(·,·)
Intuition:
 Find pair of aligned sequences s[1..n], t[1..n],
 Estimate probability of pairs:
p (a , b ) 
Na ,b
n
Number of times a is
aligned with b in (s,t)
 The
sequences s and t are be the concatenation of
many aligned pairs of sequences from the database
16
Problems in Estimating p(·,·)
 How
do we find pairs of aligned sequences?
 How far is the ancestor of the aligned sequences?
 earlier divergence  low sequence similarity
 recent divergence  high sequence similarity
 Does one letter mutate to the other or are they both
mutations of a common ancestor having yet
another residue/nucleotide acid ?
17
Estimating p(·,·) for amino acids
Definition: An accepted point mutation is an
observed substitution in a gapless alignment of
closely related (homologous) protein sequences. For
example, Hemoglobin alpha chain in humans and
other mammalians.
Note that we cannot claim such mutation actually
occurred (if we observe an A to B substitution, the
“evolutionary truth” could be A to C to B (one
substitution, two hidden mutations).
18
PAM-1 matrices (M. Dayhoff, 1970)
We take pairs of sequences that are no more than
“1 unit of evolutionary distance apart”. This is
interpreted as having substitutions in 1 percent
(1/100) of the aligned sites (all other are matches,
as we assumed no gaps).
In this ensemble of alignments, we count the
frequencies of single letters, q() , and the
frequencies of substitutions, p(·,·) .
The (a,b) entry in the PAM1 matrix is then
p ( a, b)
PAM1 (a, b)  log
q(a)q(b)
19
PAM-1 matrices
Values in matrix usually multiplied by 10 & rounded.
Example: A portion of PAM1 matrix
1.
2.
3.
4.
5.
6.
7.
8.
A R N D C Q E
A 7 -15 -11 -11 -14 -12 -9
R -15 9 -14 -27 -16 -9 -26
N -11 -14 9 -5 -28 -11 -10
D -11 -27 -5 9 -32 -10 -4
C -14 -16 -28 -32 10 -31 -31
Q -12 -9 -11 -10 -31 9 -6
E -9 -26 -10 -4 -31 -6 9
Most off diagonal entries are negative, but not all.
For example score(B,N)=score(B,D)=7 (out of our
small portion).
20
PAM-n matrices
Generalize PAM1: Take pairs of sequences that are
no more than “n unit of evolutionary distance apart”.
Interpret this as having substitutions in n percent
(n/100) of the aligned sites (all other are matches,
as we assumed no gaps).
Nice idea, but does not quite work.
Sequences with more than 3 or 4 percent
substitutions cannot be aligned without gaps.
In addition, we may want evolutionary units that are
larger than 100 (e.g. PAM250 exists and useful).
21
PAM-n matrices
Let M ( A, B)  Pr( B | A) be the conditional probability
that in 1 evolutionary time unit, an A mutates to a B.
In our terminology,
M ( A, B)  Pr( B | A)
 Pr( B  A) / Pr( A)  q ( A, B) / p ( A)
What is then the probability that in n evolutionary
time units, an A mutates to a B ?
We should start with an A (an event whose
probability is q(A) ),
and then go through all possible paths that lead from
A to B in n steps.
22
PAM-n matrices
We should start with an A (an event whose
probability is q(A) ),
and then go through all possible paths that lead from
A to B in n steps. The probability of such process
is given exactly by the A,B entry of the n-th power
of the matrix M.
Therefore, the probability of starting with A and
n
moving to B in n steps equals q(A) M (A,B).
23
PAM-n matrices
Therefore, the probability of starting with A and
n
moving to B in n steps equals q(A) M (A,B).
Recall the random (unrelated sequences) model, where
the probability of having S[i]=A and T[i]=B equals
q(A) q(B).
So PAMn (A,B), which is the log odds score, equals
n
n
log( q(A) M (A,B) / q(A)q(b) ) = log(M (A,B) / q(b) )
Notice that for n=1 this is the same as PAM1.
24
PAM-250 matrices
Values in matrix usually multiplied by 10 & rounded.
Example: A portion of PAM250 matrix
1.
2.
3.
4.
5.
6.
7.
8.
A
R
N
D
C
Q
E
A R
2 -2
-2 6
0 0
0 -1
-2 -4
0 1
0 -1
N D
0 0
0 -1
2 2
2 4
-4 -5
1 2
1 3
C
-2
-4
-4
-5
12
-5
-5
Q E
0 0
1 -1
1 1
2 3
-5 -5
4 2
2 4
Many more off diagonal entries are positive
(compare to PAM1).
25
The BLOSUM substitution matrices
BLOSUM stands for BLOcks Substitution
Matrices. The basic principles are similar to PAM,
but the sequences used to compute substitution
frequencies differ.
BLOSUM uses a database containing multiple
sequence alignments from related (by function)
but possibly remote (evolutionary) proteins.
Conserved blocks (local alignments) are detected,
and substitution frequencies are computed based on
them.
BLOSUMn reflects more conserved sequences
for higher values of n.
26
BLOSUM vs. PAM
BLOSUM 62 is the default matrix in BLAST 2.0.
27