Transcript class04

Defining Scoring Functions,
Multiple Sequence Alignment
Lecture #4
Background Readings: Chapter 6 in Biological Sequence Analysis,
Durbin et al., 2001.
Chapter 3.4, 3.5 in Introduction to Computational Molecular
Biology, Setubal and Meidanis, 1997.
This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger and Shlomo Moran.
Reminder: A Probabilistic Model for
Defining the Scoring Function
• Assume alignment without indels.
• Assume each position (nucleotide / amino acid) is independent of other positions.
• Consider two options:
M: the sequences are Matched (related)
R: the sequences are Random (unrelated)

2
R: Unrelated Sequences
• Our random model of unrelated sequences is simple.
• Each position is sampled independently from a distribution over the alphabet.
• We assume there is a distribution q(·) that describes the probability of letters in such positions.
• Then:
P(s[1..n], t[1..n] | R) = ∏i q(s[i]) q(t[i])
3
M: Related Sequences
• We assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor.
• Let p(a,b) be a distribution over pairs of letters.
p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.
P(s[1..n], t[1..n] | M) = ∏i p(s[i], t[i])
4
Odds-Ratio Test for Alignment
Q = P(s,t | M) / P(s,t | R) = ∏i p(s[i], t[i]) / ∏i q(s[i]) q(t[i]) = ∏i [ p(s[i], t[i]) / (q(s[i]) q(t[i])) ]
If Q > 1, then the two strings s and t are more likely to
be related (M) than unrelated (R).
If Q < 1, then the two strings s and t are more likely to
be unrelated (R) than related (M).
5
Log Odds-Ratio Test for Alignment
Taking the logarithm of Q yields
log Q = log [ Pr(s,t | M) / Pr(s,t | R) ] = ∑i log [ p(s[i], t[i]) / (q(s[i]) q(t[i])) ] = ∑i Score(s[i], t[i]),
where Score(s[i], t[i]) = log [ p(s[i], t[i]) / (q(s[i]) q(t[i])) ].
If log Q > 0, then s and t are more likely to be related.
If log Q < 0, then they are more likely to be unrelated.
Next we estimate Pr((a,b) | M) = p(a,b) for amino acids.
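A minimal Python sketch of this log-odds scoring for an ungapped alignment; the distributions p, q and the toy numbers are assumptions chosen only for illustration.

import math

def log_odds_score(s, t, p, q):
    # log Q = sum_i log [ p(s[i],t[i]) / (q(s[i]) q(t[i])) ]
    assert len(s) == len(t)
    return sum(math.log(p[(a, b)] / (q[a] * q[b])) for a, b in zip(s, t))

# Toy two-letter alphabet in which matches are more likely under model M.
q = {'A': 0.5, 'B': 0.5}
p = {('A', 'A'): 0.35, ('B', 'B'): 0.35, ('A', 'B'): 0.15, ('B', 'A'): 0.15}
print(log_odds_score('AAB', 'AAB'))   # > 0: more likely related
print(log_odds_score('AAB', 'BBA'))   # < 0: more likely unrelated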
6
Estimating p(·,·) for proteins
(Setubal and Meidanis, pp. 81-84)
We need to define for each pair (a,b) of amino
acids (possibly a=b) the probability p(a,b) that a
is aligned with b given that the sequences are
related.
p(a,b) should be determined by the data we have.
The next slides present one way to define p(a,b) .
7
Estimating p(·,·) for proteins
First, we define for each letter a a probability qa,
which is the relative frequency of the amino-acid
a in a collection of proteins which is considered
reliable for this purpose.
Specifically, qa = na/n, where na is the number of occurrences of letter a and n is the total number of letters in the collection, so n = ∑a na.
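A sketch of computing the background frequencies qa from a collection of sequences (the toy collection below is hypothetical):

from collections import Counter

def background_frequencies(proteins):
    # qa = na / n over the whole collection
    counts = Counter()
    for seq in proteins:
        counts.update(seq)
    n = sum(counts.values())
    return {a: na / n for a, na in counts.items()}

print(background_frequencies(['ACDEA', 'CCDE']))   # {'A': 2/9, 'C': 3/9, 'D': 2/9, 'E': 2/9}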
8
Estimating p(·,·) for proteins
For a ≠ b, we use as data a collection of accepted mutations. Formally, it is a multiset of the form
{(a,b): a, b are amino acids and a ≠ b}.
We consider the mutations as unordered pairs.
The collection of accepted mutations contains
mutations in alignments of closely related protein
sequences. For example, Hemoglobin alpha chain in
humans and other organisms (homologous proteins).
9
Estimating p(·,·) for proteins
We use the collection of accepted mutations, whose elements we consider as unordered pairs of amino acids. Since the pairs are unordered, we denote them by {a,b}. Some parameters:
Mutation counts:
fab = fba is the number of mutations {a,b}, a ≠ b, in the above collection.
f is the number of mutations (unordered pairs) in the collection.
fab / f is the conditional probability:
Pr({x,y} = {a,b} | {x,y} is a mutation)
= Pr({x,y} = {a,b} | x ≠ y).
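A sketch of these counts in Python (the list of accepted mutations is hypothetical):

from collections import Counter

def mutation_statistics(accepted_mutations):
    # fab = number of unordered mutation pairs {a,b}, a != b; f = total count
    f_ab = Counter(frozenset(pair) for pair in accepted_mutations if pair[0] != pair[1])
    f = sum(f_ab.values())
    return f_ab, f

pairs = [('A', 'S'), ('S', 'A'), ('A', 'T'), ('L', 'I')]   # hypothetical data
f_ab, f = mutation_statistics(pairs)
print(f_ab[frozenset(('A', 'S'))] / f)   # estimate of Pr({x,y} = {A,S} | x != y) = 0.5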
10
Estimating p(·,·) for proteins
p(a,b) in the M model, for a ≠ b, is given by:
p(a,b) = Pr({x,y} = {a,b} | M)
       = Pr({x,y} = {a,b} | x ≠ y) · Pr(x ≠ y | M)
       = (fab / f) · Pr(x ≠ y | M).
We still have to define the relative mutability Pr(x ≠ y | M). This is the probability that, in the “match” model M, a random pair (x,y) is a mutation. The relative mutability is a user-defined parameter, denoted r.
11
PAM-1 matrices
[1-Percent Accepted Mutation].
In the PAM-1 matrix it is assumed that the relative mutability r is 1%.
That is, 1/100 of the amino acids are mutated. This is modeled by letting n, the total number of pairs of amino acids, be 100f. Using this, we can now compute faa, the number of {a,a} pairs.
[HW: prove that, by our assumptions, 99f of these pairs are of the form {x,x}, and hence
faa = n·qa − (1/2) ∑{b: b≠a} fab ].
In this model, p(a,b) is given as the relative frequency of
the (unordered) pairs {a,b}:
p(a,b) = Pr({x,y} = {a,b} | M) = fab / n = fab / (100f).
12
The log odds-ratio for PAM-1:
Q(a,b) = P((a,b) | M) / P((a,b) | R) = (fab / 100f) / (qa qb) = fab / (100f qa qb)
• The entries in the PAM-1 matrix are the logarithms of the odds-ratio:
[PAM-1]ab = log[ p(a,b | M) / p(a,b | R) ] = log[ fab / (100f qa qb) ]
The actual values are 10 times the logarithm, rounded to integer.
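A sketch of one off-diagonal PAM-1 entry under this model. The counts below are hypothetical, and the base-10 logarithm is an assumption (the slide only says “10 times the logarithm”); this is not Dayhoff’s full construction.

import math

def pam1_entry(a, b, f_ab, f, q):
    # entry = 10 * log( f_ab / (100 f qa qb) ), rounded to an integer
    odds = f_ab[frozenset((a, b))] / (100 * f * q[a] * q[b])
    return round(10 * math.log10(odds))

q = {'A': 0.3, 'S': 0.7}                    # hypothetical background frequencies
f_ab = {frozenset(('A', 'S')): 6}           # hypothetical mutation counts
f = 10
print(pam1_entry('A', 'S', f_ab, f, q))     # 10 * log10(6/210), rounded: -15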
13
Evolutionary distance
The choice that r = 1% of amino acids change is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated.
This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?
14
Model of Evolution
We make some assumptions:
1. Each position changes independently of the rest
2. The probability of mutations is the same in each
position
3. Evolution does not “remember”
[Figure: sequence positions observed at times t, t+ε, t+2ε, t+3ε, t+4ε along a time axis; e.g. one position mutates T → T → T → C → G while another goes A → A → C → C → G.]
15
Model of Evolution
• How do we model such a process?
• This process is called a Markov chain.
A chain is defined by the transition probability
• P(Xt+ε = b | Xt = a) - the probability that the next state is b given that the current state is a.
• We often describe these probabilities by a matrix:
M[ε]ab = P(Xt+ε = b | Xt = a)
16
Multi-Step Changes
• Based on Pab, we can compute the probabilities of changes over two time periods:
P(Xt+2ε = b | Xt = a) = ∑c P(Xt+2ε = b | Xt+ε = c, Xt = a) · P(Xt+ε = c | Xt = a)
Using conditional independence (no memory):
= ∑c P(Xt+2ε = b | Xt+ε = c) · P(Xt+ε = c | Xt = a) = ∑c Pac Pcb
Thus P[2ε] = P[ε] P[ε].
• By induction (HW exercise): P[nε] = P[ε]^n
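A quick numeric check of P[nε] = P[ε]^n with a toy 2-state chain (the transition numbers are hypothetical):

import numpy as np

# P[eps]_ab = Pr(next state is b | current state is a)
P_eps = np.array([[0.99, 0.01],
                  [0.02, 0.98]])

P_2eps = P_eps @ P_eps                        # two steps
P_4eps_a = np.linalg.matrix_power(P_eps, 4)   # four steps via the 1-step matrix
P_4eps_b = np.linalg.matrix_power(P_2eps, 2)  # four steps via the 2-step matrix
assert np.allclose(P_4eps_a, P_4eps_b)        # both routes agree here by construction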
17
More on Markov Chains – next week
18
Comments regarding PAM
• Historically, researchers have used PAM-250 (the only one published in the original paper).
• The original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
• PAM used to be the most popular scoring rule, but there are some problems with PAM matrices.
19
Degrees of freedom in PAM definition
With r = 1/100, the 1-PAM matrix is obtained from the matrix
M[ε]ab = fab / (100f qa qb)
With r = 1/50, the basic matrix is different, namely:
M'[2ε]ab = fab / (50f qa qb)
Thus we have two different ways to estimate the matrix M[4ε]:
Use the PAM-1 matrix to the fourth power: M[4ε] = M[ε]^4
Or
Use the r = 1/50 matrix to the second power: M[4ε] = M'[2ε]^2
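A small numeric illustration that the two routes need not agree; the 2-letter matrices below are hypothetical stand-ins for M[ε] and M'[2ε], not derived from real counts.

import numpy as np

M_eps  = np.array([[0.99, 0.01],
                   [0.01, 0.99]])   # toy stand-in for M[eps] (1% change)
M_2eps = np.array([[0.98, 0.02],
                   [0.02, 0.98]])   # toy stand-in for M'[2 eps] (2% change)

route_a = np.linalg.matrix_power(M_eps, 4)    # estimate of M[4 eps] as M[eps]^4
route_b = np.linalg.matrix_power(M_2eps, 2)   # estimate of M[4 eps] as M'[2 eps]^2
print(np.abs(route_a - route_b).max())        # small but nonzero: the estimates differ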
20
Problems in building distance matrices
• How do we find pairs of aligned sequences?
• How far is the ancestor?
  earlier divergence → low sequence similarity
  later divergence → high sequence similarity
E.g., M[250] is known not to reflect long-period changes well (see p. 43 in Durbin et al).
• Does one letter mutate to the other, or are they both mutations of a third letter?
21
BLOSUM Outline
(Durbin et al., pp. 43-44)
PAM matrices do not capture the difference between short-time and long-time mutations. BLOcks SUbstitution Matrices (BLOSUM, 1992) attempt to model long-time mutations.
• Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas, but better statistics and modeling.
• Procedure:
  • Cluster together sequences in a family whenever more than L% identical residues are shared.
  • Count the number of substitutions across different clusters (in the same family).
  • Estimate frequencies using the counts.
• Practice: Blosum50 (i.e., L=50) and Blosum62 are widely used. (See pages 43-44 in Durbin et al.)
22
Multiple Sequence Alignment
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
Possible alignment:
A G G T - C -
- G - T T C G
T G - A A C -
Another possible alignment:
A G G T - C -
G T T - - C G
- T G A A - C
23
Multiple Sequence Alignment
Aligning more than two sequences.
Definition: Given strings S1, S2, …, Sk, a multiple (global) alignment maps them to strings S’1, S’2, …, S’k that may contain gaps, where:
1. |S’1|= |S’2|=…= |S’k|
2. The removal of gaps from S’i leaves Si
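A small sketch that checks this definition for a candidate alignment (the helper name and the example strings are illustrative):

def is_multiple_alignment(originals, aligned, gap='-'):
    # 1. all aligned strings have the same length
    # 2. removing the gaps from S'_i gives back S_i
    same_length = len({len(a) for a in aligned}) == 1
    restores = all(a.replace(gap, '') == s for s, a in zip(originals, aligned))
    return same_length and restores

print(is_multiple_alignment(['AGGTC', 'GTTCG', 'TGAAC'],
                            ['AGGT-C-', '-G-TTCG', 'TG-AAC-']))   # True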
24
Multiple alignments
We use a matrix to represent the alignment of k sequences, K = (x1,...,xk). We assume no column consists solely of gaps.
The common scoring functions give a score to each column, and set: score(K) = ∑i score(column(i))
x1: M Q _ I L L L
x2: M L R - L L -
x3: M K _ I L L L
x4: M P P V L I L
The scoring function is symmetric - the order of arguments need not matter: score(I,_,I,V) = score(_,I,I,V). Still, for k = 10, a scoring function has > 2^k - 1 > 1000 entries to specify.
25
SUM OF PAIRS
A common scoring function is SP – sum of scores of the
projected pairwise alignments: SPscore(K)=∑i<j score(xi,xj).
Note that we need to specify score(-,-) because a column may have several blanks (as long as not all entries are blanks).
M Q _ I L L L
M L R - L L -
M K _ I L L L
M P P V L I L
In order for this score to be written as ∑i score(column(i)),
we set score(-,-) = 0. Why ?
Because these entries appear in the sum of columns but not
in the sum of projected pairwise alignments (lines).
26
SUM OF PAIRS
Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all C(k,2) = k(k-1)/2 projected pairwise alignments induced by A, where the pairwise alignment function score(xi,xj) is additive.
M Q _ I L L L
M L R - L L -
M K _ I L L L
M P P V L I L
27
Example
Consider the following alignment:
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance σ(x,x) = 0 and σ(x,y) = 1 for x ≠ y,
this alignment has an SP value of
3 + 4 + 5 = 12
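A short sketch that reproduces this value by summing column scores, with score(-,-) = 0 as explained above:

def pair_score(a, b):
    # sigma(x,x) = 0, sigma(x,y) = 1 for x != y; note score('-','-') = 0
    return 0 if a == b else 1

def sp_value(rows):
    # sum of pairs, column by column, over all C(k,2) row pairs
    k, total = len(rows), 0
    for col in zip(*rows):
        for i in range(k):
            for j in range(i + 1, k):
                total += pair_score(col[i], col[j])
    return total

print(sp_value(['ac-cdb-', '-c-adbd', 'a-bcdad']))   # 12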
28
Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the
dynamic programming algorithm that finds an alignment that
maximizes
SP-score(K) = ∑i<j score(xi,xj).
Instead of a 2-dimensional table, we now have a k-dimensional
table to fill.
For each cell i =(i1,..,ik), compute an optimal multiple alignment
for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik).
The adjacent cells are those whose indices differ from i by zero or one in each coordinate.
Each cell depends on 2^k - 1 adjacent cells.
29
The idea via k = 2
Recall the notation:
V[i,j] = d(s[1..i], t[1..j])
and the following recurrence for V:
V[i+1, j+1] = max { V[i,j] + σ(s[i+1], t[j+1]),
                    V[i, j+1] + σ(s[i+1], -),
                    V[i+1, j] + σ(-, t[j+1]) }
[Figure: the four cells V[i,j], V[i,j+1], V[i+1,j], V[i+1,j+1] of the DP table.]
Note that the new cell index (i+1, j+1) differs from the previous indices by one of the 2^k - 1 = 3 non-zero binary vectors (1,1), (1,0), (0,1).
30
The idea for arbitrary k
Order the cells i = (i1,..,ik) by increasing order of the sum ∑j ij.
Set s(0,..,0) = 0, and for i > (0,...,0):
s(i) = max_b { s(i - b) + score(column(i, b)) }
where
• b ranges over all non-zero binary k-tuples.
• The cell i - b is the non-negative difference of i and b.
• column(i,b) is the k-tuple (c1,…,ck) where:
cj = xj(ij) if bj = 1, and cj = ‘-’ otherwise.
(Reflecting that bj = 1 iff the sequence xj was extended by a letter in the corresponding case.)
31
Complexity of the DP approach
Number of cells: n^k.
Number of adjacent cells: O(2^k).
Computation of the SP score for each column(i,b) is O(k^2).
Total run time is O(k^2 · 2^k · n^k), which is utterly unacceptable!
Not much hope for a polynomial algorithm, because the problem has been shown to be NP-complete.
We need heuristics to reduce the running time.
32
Time saving heuristics: Relevance tests
Heuristic: Avoid computing score(i) for irrelevant cells.
x1: M Q _ I L L L
x2: M L R - L L -
x3: M K _ I L L L
x4: M P P V L I L
Let L be a lower bound on the optimal SP score of a multiple
alignment of the k sequences (L can be obtained from an arbitrary
multiple alignment, computed in any way).
33
Time saving heuristics: Relevance tests
Heuristic: Avoid computing score(i) for irrelevant cells.
Relevance test for cell i = (..,iu,..,iv,..): For each two indices 1 ≤ u < v ≤ k, compute the optimal score of an alignment of s = xu and t = xv that goes via cell i. If the sum of these scores over all pairs (u,v) is smaller than L, then i is irrelevant.
An alignment of s = xu and t = xv that goes via cell i is an alignment of s and t that has a prefix aligning (s[1],…,s[iu]) with (t[1],…,t[iv]).
How do we find an optimal such alignment?
34
Recall the Linear Space algorithm
• F[i,j] = d(s[1..i], t[1..j]) (the forward table, denoted V[i,j] earlier)
• B[i,j] = d(s[i+1..n], t[j+1..m]) (the backward table)
• F[i,j] + B[i,j] = score of the best alignment through (i,j)
These computations are done in linear space.
Build such a table a_xu,xv(iu, iv) for every two sequences s = xu and t = xv, 1 ≤ u, v ≤ k. This entry encodes the optimum through (iu, iv).
[Figure: the pairwise DP table of s versus t, with the cell (iu, iv) marked.]
35
Time saving heuristics:
Relevance test
Let L be a lower bound on SP-score(x1,..,xk).
We say that cell i = (i1,..,ik) is irrelevant if the score of any alignment through i is less than L.
Definition: L(i) = ∑1≤u<v≤k a_xu,xv(iu, iv)
Claim 1: If L > L(i), then i = (i1,..,ik) is irrelevant.
But can we go over all cells and determine whether they are relevant or not? No. Instead, start with (0,…,0) and add relevant entries to the list until reaching (n1,…,nk).
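A sketch of the relevance test, assuming the pairwise tables a_xu,xv have already been computed (as on the previous slide) and are passed in as nested lists; the names are illustrative.

def is_relevant(cell, a, L):
    # a[(u, v)][iu][iv] = F[iu][iv] + B[iu][iv] for the pair (xu, xv):
    # the best score of a pairwise alignment of xu, xv that goes through (iu, iv)
    k = len(cell)
    upper = sum(a[(u, v)][cell[u]][cell[v]]
                for u in range(k) for v in range(u + 1, k))
    # if even this optimistic bound is below L, no optimal alignment goes through cell
    return upper >= L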
36
Multiple Sequence Alignment –
Approximation Algorithm
In the tutorial you saw an O(k^2 n^2) multiple alignment algorithm for the SP-score that errs by a factor of at most 2(1 - 1/k) < 2.
37
Star Alignments
Rather than summing up all pairwise alignments, select a fixed sequence x0 as a center, and set
Star-score(K) = ∑j>0 score(x0, xj).
The algorithm to find an optimal alignment: at each step, add another sequence aligned with x0, keeping old gaps and possibly adding new ones.
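A sketch of the star-score, given some optimal pairwise scorer; picking the center that maximizes the total score is a common heuristic and an assumption beyond what the slide states.

def star_score(center, others, pairwise_score):
    # Star-score(K) = sum over j > 0 of score(x0, xj), for the fixed center x0
    return sum(pairwise_score(center, x) for x in others)

def best_center(seqs, pairwise_score):
    # heuristic choice of x0: index of the sequence with the highest total score
    return max(range(len(seqs)),
               key=lambda c: star_score(seqs[c], seqs[:c] + seqs[c + 1:], pairwise_score))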
38
Tree Alignments
Assume that there is a tree T=(V,E) whose leaves are
the sequences.
• Associate a sequence with each internal node.
• Tree-score(K) = ∑(i,j)∈E score(xi, xj).
Finding the optimal assignment of sequences to the internal nodes is NP-hard.
We will meet this problem again in the study of phylogenetic trees.
39