Scoring Functions, Reminder

Download Report

Transcript Scoring Functions, Reminder

Computational Genomics
Lecture #2b
1.
2.
Scoring functions and DNA and AAs
Multiple sequence alignment
Background Readings: Chapters 2.5, 2.7 in the text book, Biological
Sequence Analysis, Durbin et al., 2001.
Chapters 3.5.1- 3.5.3, 3.6.2 in Introduction to Computational
Molecular Biology, Setubal and Meidanis, 1997.
Chapter 15 in Gusfield’s book.
Much of this class has been edited from Nir Friedman’s lecture which is available at
www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
Scoring Functions, Reminder
• So far, we discussed dynamic programming
algorithms for
– global alignment
– local alignment
• All of these assumed a scoring function:
 : (  { })  (  { })  
that determines the value of perfect matches,
substitutions, insertions, and deletions.
Where does the scoring function come
from ?
We have defined an additive scoring function by
specifying a function ( ,  ) such that
– (x,y) is the score of replacing x by y
– (x,-) is the score of deleting x
– (-,x) is the score of inserting x
But how do we come up with the “correct” score ?
Answer: By encoding experience of what are similar
sequences for the task at hand. Similarity depends
on time, evolution trends, and sequence types.
Why probability setting is
appropriate to define and interpret a
scoring function ?
• Similarity is probabilistic in nature because biological
changes like mutation, recombination, and selection, are
random events.
• We could answer questions such as:
• How probable it is for two sequences to be similar?
• Is the similarity found significant or spurious?
• How to change a similarity score when, say,
mutation rate of a specific area on the chromosome
becomes known ?
A Probabilistic Model
• For starters, will focus on alignment without
indels.
• For now, we assume each position (nucleotide
/amino-acid) is independent of other positions.
• We consider two options:
M: the sequences are Matched (related)
R: the sequences are Random (unrelated)
Unrelated Sequences
• Our random model R of unrelated sequences
is simple
– Each position is sampled independently from a
distribution over the alphabet 
– We assume there is a distribution p() that
describes the probability of letters in such
positions.
• Then:
P(s[1..n], t[1..n]| R)   p(s[i])p(t[i])
i
Related Sequences
• We assume that each pair of aligned positions
(s[i],t[i]) evolved from a common ancestor
• Let q(a,b) be a distribution over pairs of letters.
• q(a,b) is the probability that some ancestral letter
evolved into this particular pair of letters.
Pr(s[1..n], t[1..n] | M ) 
 q(s[i], t[i])
i
Compare to:
Pr(s[1..n], t[1..n] | R ) 
 p(s[i])p(t[i])
i
Odd-Ratio Test for Alignment
q(s[i], t[i])

Pr(s, t | M)
q(s[i], t[i])
Q


Pr(s, t | R )  p(s[i])p(t[i])
p(s[i])p(t[i])
i
i
i
If Q > 1, then the two strings s and t are more likely to
be related (M) than unrelated (R).
If Q < 1, then the two strings s and t are more likely to
be unrelated (R) than related (M).
Log Odd-Ratio Test for
Alignment
Taking logarithm of Q yields
Score(s[i],t[i])
P(s, t | M)
q(s[i], t[i])
q(s[i], t[i])
log
 log 
  log
P(s, t | R)
p(s[i])p(t[i])
i
i p(s[i]) p(t[i])
If log Q > 0, then s and t are more likely to be related.
If log Q < 0, then they are more likely to be unrelated.
How can we relate this quantity to a score function ?
Probabilistic Interpretation of
Scores
• We define the scoring function via
q ( a, b)
 (a, b)  log
p(a) p(b)
• Then, the score of an alignment is the logratio between the two models:

– Score < 0 
– Score > 0
Model is more likely
Random is more likely
Modeling Assumptions
• It is important to note that this interpretation
depends on our simplified modeling assumption!!
• For example, if we assume that the letter in each
position depends on the letter in the preceding
position, then the likelihood ratio will have a
different form.
• If we assume, for proteins, some joint distribution
of letters that are nearby in 3D space after
protein folding, then likelihood ratio will again be
different.
Estimating Probabilities
• Suppose we are given a long string s[1..n]
of letters from 
• We want to estimate the distribution q(·)
that generated the sequence
• How should we go about this?
We build on the theory of parameter
estimation in statistics using either
maximum likelihood estimation or the
Bayesian approach (later on).
Estimating q()
• Suppose we are given a long string s[1..n]
of letters from 
– s can be the concatenation of all sequences in our
database
• We want to estimate the distribution
q()
n
Na
Likelihood function: L(s | q)   q(s[i])   q(a )
i 1
• That is, q is defined per single letters
a
Estimating q() (cont.)
n
Na
L(s
|
q)

q(s[i])

q(a)
Likelihood function:


i 1
How do we define q?
ML parameters
(Maximum Likelihood)
Intuitively
Na
q(a) 
n
a
Estimating p(·,·)
Intuition:
• Find a pair of aligned sequences s[1..n],
t[1..n],
• Estimate probability of pairs:
Na ,b
p (a , b ) 
n
Number of times a is
aligned with b in (s,t)
• The sequences s and t can be the
concatenation of many aligned pairs from the
database
Problems in Estimating p(·,·)
• How do we find pairs of aligned sequences?
• How far is the ancestor ?
– earlier divergence  low sequence similarity
– recent divergence  high sequence similarity
• Does one letter mutate to the other or are
they both mutations of a common ancestor
having yet another residue/nucleotide acid ?
Scoring Matrices
Deal with DNA first (simpler)
then AA (not too bad either)
What is it & why ?
• Let alphabet contain N letters
– N = 4 and 20 for nucleotides and amino acids
• N x N matrix
• (i,j) shows the relationship between i-th and j-th
letters.
– Positive number if letter i is likely to mutate into letter j
– Negative otherwise
– Magnitude shows the degree of proximity
• Symmetric
Scoring Matrices for DNA
A
C
G
T
A
1
0
0
0
C
0
1
0
0
G
0
0
1
0
T
0
0
0
1
identity
A
C
G
T
A
1
-3
-3
-3
C
-3
1
-3
-3
G
-3
-3
1
-3
T
-3
-3
-3
1
BLAST
A
C
G
T
A
1
-5
-1
-5
C
-5
1
-5
-1
G
-1
-5
1
-5
T
-5
-1
-5
1
Transitions &
transversions
The BLOSUM45 Matrix
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
A
5
-2
-1
-2
-1
-1
-1
0
-2
-1
-1
-1
-1
-2
-1
1
0
-2
-2
0
R
-2
7
0
-1
-3
1
0
-2
0
-3
-2
3
-1
-2
-2
-1
-1
-2
-1
-2
N
-1
0
6
2
-2
0
0
0
1
-2
-3
0
-2
-2
-2
1
0
-4
-2
-3
D
-2
-1
2
7
-3
0
2
-1
0
-4
-3
0
-3
-4
-1
0
-1
-4
-2
-3
C
-1
-3
-2
-3
12
-3
-3
-3
-3
-3
-2
-3
-2
-2
-4
-1
-1
-5
-3
-1
Q
-1
1
0
0
-3
6
2
-2
1
-2
-2
1
0
-4
-1
0
-1
-2
-1
-3
E
-1
0
0
2
-3
2
6
-2
0
-3
-2
1
-2
-3
0
0
-1
-3
-2
-3
G
0
-2
0
-1
-3
-2
-2
7
-2
-4
-3
-2
-2
-3
-2
0
-2
-2
-3
-3
H
-2
0
1
0
-3
1
0
-2
10
-3
-2
-1
0
-2
-2
-1
-2
-3
2
-3
I
-1
-3
-2
-4
-3
-2
-3
-4
-3
5
2
-3
2
0
-2
-2
-1
-2
0
3
L
-1
-2
-3
-3
-2
-2
-2
-3
-2
2
5
-3
2
1
-3
-3
-1
-2
0
1
K
-1
3
0
0
-3
1
1
-2
-1
-3
-3
5
-1
-3
-1
-1
-1
-2
-1
-2
M
-1
-1
-2
-3
-2
0
-2
-2
0
2
2
-1
6
0
-2
-2
-1
-2
0
1
F
-2
-2
-2
-4
-2
-4
-3
-3
-2
0
1
-3
0
8
-3
-2
-1
1
3
0
P
-1
-2
-2
-1
-4
-1
0
-2
-2
-2
-3
-1
-2
-3
9
-1
-1
-3
-3
-3
S
1
-1
1
0
-1
0
0
0
-1
-2
-3
-1
-2
-2
-1
4
2
-4
-2
-1
T
0
-1
0
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
2
5
-3
-1
0
W
-2
-2
-4
-4
-5
-2
-3
-2
-3
-2
-2
-2
-2
1
-3
-4
-3
15
3
-3
Y
-2
-1
-2
-2
-3
-1
-2
-3
2
0
0
-1
0
3
-3
-2
-1
3
8
-1
V
0
-2
-3
-3
-1
-3
-3
-3
-3
3
1
-2
1
0
-3
-1
0
-3
-1
5
Scoring Matrices for Amino Acids
• Chemical similarities
–
–
–
–
Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
Polar, Hydrophilic (S, T, C, Y, N, Q)
Electrically charged (D, E, K, R, H)
Requires expert knowledge
• Genetic code: Nucleotide substitutions
– E: GAA, GAG
– D: GAU, GAC
– F: UUU, UUC
• Actual substitutions
– PAM
– BLOSUM
Scoring Matrices: Actual
Substitutions
•
•
•
•
Manually align proteins
Look for amino acid substitutions
Entry ~ log (freq(observed) / freq(expected))
Log-odds matrices
BLOSUM
BLOcks Substitution Matrices
Henikoff & Henikoff, 1992
Next slides taken from lecture notes by Tamer Kahveci,
CISE Department, University of Florida
(www.cise.ufl.edu/~tamer/teaching/ fall2004/lectures/03-CAP5510-Fall04.ppt
BLOSUM Matrix
• Begin with a set of protein sequences and obtain
aligned blocks.
– ~2000 blocks from 500 families of related proteins
• A block is the ungapped alignment of a highly
conserved region of a family of proteins.
• MOTIF program is used to find blocks
• Substitutions in these blocks are used to compute
BLOSUM matrix
block 1
WWYIR
WFYVR
WYYVR
WYFIR
…
block 2
block 3
CASILRKIYIYGPV
CASILRHLYHRSPA
AAAVARHIYLRKTV
AASICRHLYIRSPA
GVSRLRTAYGGRKNRG
GVGSITKIYGGRKRNG
GVGRLRKVHGSTKNRG
GIGSFEKIYGGRRRRG
…
Constructing the Matrix
• Count the frequency of occurrence of each amino acid.
This gives the background distribution pa
• Count the number of times amino acid a is aligned with amino
acid b: fab
– A block of width w and depth s contributes ws(s-1)/2 pairs.
– Denote by np the total number of pairs.
• Compute thei joccurrence
probability of each pair: qab = fab/ np
>= i
• Compute thea≠bexpected probability of occurrence of each pair
eab = 2papb, if a ≠ b
papb otherwise
• Compute twice (?) the log likelihood ratios, normalize, and
round to nearest integer.
2* log2 qab / eab
Computation of BLOSUM-X
• The amount of similarity in blocks has a great effect
on the BLOSUM score. BLOSUM-X is generated by
taking only blocks with %X identity.
• For example, a BLOSUM62 matrix is calculated from
protein blocks with 62% identity.
• So BLOSUM80 represents closer sequences (more
recent divergence) than BLOSUM62.
• On the web, Blast uses BLOSUM80, BLOSUM62
(the default), or BLOSUM45.
a b
BLOSUM 62 Matrix
Check scores for
M I L V
-small hydrophobic
N D E Q
-acid, hydrophilic
H R K
-basic
F Y W
-aromatic
S T P A G
-small hydrophilic
C
-sulphydryl
PAM vs. BLOSUM
Equivalent PAM and BLOSSUM matrices:
PAM100
PAM120
PAM160
PAM200
PAM250
=
=
=
=
=
Blosum90
Blosum80
Blosum60
Blosum52
Blosum45
BLOSUM62 is the default matrix to use.
And Now
Ladies and Gentlemen
Boys and Girls
the holy grail
Multiple Sequence Alignment
Multiple Sequence Alignment
S1=AGGTC
S2=GTTCG
Possible alignment
S3=TGAAC
Possible alignment
AGG T - C GTT - - C G
- TGA A A C
AGG T - C - G - T T C G
TG - A A C -
Multiple Sequence Alignment
Aligning more than two sequences.
Definition: Given strings S1, S2, …,Sk a multiple (global) alignment
map them to strings S’1, S’2, …,S’k that may contain blanks, where:
1. |S’1|= |S’2|=…= |S’k|
2. The removal of spaces from S’i leaves Si
Multiple alignments
We use a matrix to represent the alignment of k
sequences, K=(x1,...,xk). We assume no columns
consists solely of blanks.
The common scoring functions give a score to each column,
and set: score(K)= ∑i score(column(i))
x1
M Q _ I L L L
x2
x3
x4
M L R - L
M K _ I L
M P P V L
L
L
I
L
L
For k=10, a scoring function has 2k -1 > 1000 entries to specify.
The scoring function is symmetric - the order of arguments need
not matter: score(I,_,I,V) = score(_,I,I,V).
SUM OF PAIRS
A common scoring function is SP – sum of scores of the
projected pairwise alignments: SPscore(K)=∑i<j score(xi,xj).
Note that we need to
specify the score(-,-)
because a column may
have several blanks (as
long as not all entries are
blanks).
M
M
M
M
Q
L
K
P
_
R
_
P
I
I
V
L
L
L
L
L
L
L
I
L
L
L
In order for this score to be written as ∑i score(column(i)),
we set score(-,-) = 0. Why ?
Because these entries appear in the sum of columns but not
in the sum of projected pairwise alignments (lines).
SUM OF PAIRS
Definition: The sum-of-pairs (SP) value for a multiple global
alignment A of k strings is the sum of the values of
k 
all   projected
2
pairwise alignments induced by A where the pairwise alignment
function score(xi,xj) is additive.
M
M
M
M
Q
L
K
P
_
R
_
P
I
I
V
L
L
L
L
L
L
L
I
L
L
L
Example
Consider the following alignment:
ac-cdb-c-adbd
a-bcdad
Using the edit distance  x, x   0 and  x, y   1 for x  y ,
this alignment has a SP value of
3+
+44 + 5 = 12
Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the
dynamic programming algorithm that finds an alignment that
maximizes
SP-score(K) = ∑i<j score(xi,xj).
Instead of a 2-dimensional table, we now have a k-dimensional
table to fill.
For each vector i =(i1,..,ik), compute an optimal multiple alignment
for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik).
The adjacent entries are those that differ in their index by one or
zero. Each entry depends on 2k-1 adjacent entries.
The idea via K=2
Recall the notation:
V [i , j ]  d (s [1..i ],t [1..j ])
and the following recurrence for V:
V [i , j ]   (s [i  1 ],t [ j  1 ]) 


V [i  1 , j  1 ]  maxV [i , j  1 ]   (s [i  1 ],) 
V [i  1 , j ]   ( ,t [ j  1 ]) 


V[i,j]
V[i,j+1]
V[i+1,j]
V[i+1,j+1]
Note that the new cell index
(i+1,j+1) differs from previous
indices by one of 2k-1 non-zero
binary vectors (1,1), (1,0), (0,1).
Multiple Sequence Alignment
Given k strings of length n, there is a generalization of the dynamic
programming algorithm that finds an optimal SP alignment.
Computational Cost:
• Instead of a 2-dimensional table we now have a k-dimensional
table to fill.
• Each dimension’s size is n+1. Each entry depends on 2k-1
adjacent entries.
Number of evaluations of
scoring function : O(2knk)
Complexity of the DP approach
Number of cells nk.
Number of adjacent cells O(2k).
Computation of SP score for each column(i,b) is o(k2)
Total run time is O(k22knk) which is totally unacceptable !
Maybe one can do better?
But MSA is Intractable
Not much hope for a polynomial algorithm because the
problem has been shown to be NP complete (proof is quite
Tricky and recent. Some previous proofs were bogus).
Look at Isaac Elias presentation of NP completeness proof.
Need heuristic or approximation to reduce time.
Multiple Sequence Alignment –
Approximation Algorithm
Now we will see an O(k2n2) multiple alignment
algorithm for the SP-score that approximate
the optimal solution’s score by a factor of at
most 2(1-1/k) < 2.
Star Alignments
Rather then summing up all pairwise alignments,
select a fixed sequence S1 as a center, and set
Star-score(K) = ∑j>0score(S1,Sj).
The algorithm to find optimal alignment: at each step,
add another sequence aligned with S1, keeping old
gaps and possibly adding new ones.
Multiple Sequence Alignment –
Approximation Algorithm
Polynomial time algorithm:
assumption: the function δ is a distance
•
•
•
 ( x, y)   ( y, x)  0
 ( x, x )  0
 ( x, z)   ( y, z)   ( x, y)
function:
(triangle inequality)
Let D(S,T) be the value of the minimum global alignment
between S and T.
Multiple Sequence Alignment –
Approximation
Algorithm
(cont.)
Polynomial time algorithm:
The input is a set Γ of k strings Si.
1. Find “center string” S1 that minimizes  D S1 ,S
S
2. Call the remaining strings S2, …,Sk.
3. Add a string to the multiple alignment that initially
contains only S1 as follows:
• Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1.
Add Si by running dynamic programming algorithm on
S’1 and Si to produce S’’1 and S’i.
• Adjust S’2, …,S’i-1 by adding spaces to those columns
where spaces were added to get S’’1 from S’1.
• Replace S’1 by S’’1.
Multiple Sequence Alignment –
Approximation Algorithm (cont.)
Time analysis:
k 
• Choosing S1 – running dynamic programming algorithm  
2
2
2
times – O(k n )
• When Si is added to the multiple alignment, the length of S1
is at most in, so the time to add all k strings is
k 1

2 2


O
in

n

O
k
n

i 1

Multiple Sequence Alignment –
Approximation Algorithm (cont.)
Performance analysis:
• M - The alignment produced by this algorithm.
• d(i,j) - the distance M induces on the pair Si,Sj.
k
k
• vM    d i, j 
i 1 j 1
j i
• M* - optimal alignment.
For all i, d(1,i)=D(S1,Si)
(we performed optimal alignment between S’1 and Si and  (,)  0 )
Multiple Sequence Alignment –
Approximation Algorithm (cont.)
Performance
analysis:
Triangle
inequality
k
k
vM    d i, j    d 1, i   d 1, j 
k
k
i 1 j 1
j i
i 1 j 1
j i
k
k
l 2
l 2
 2(k  1) d 1, l   2(k  1) DS1 , Sl 
 
v M *   d * i, j    DSi , S j  
k
k
k
i 1 j 1
j i
k
i 1 j 1
j i
Definition
of S1
  DS1 , S j   k  DS1 , S j 
k
k
i 1 j 1
j i
k
j 2
v( M ) 2(k  1)

2
*
v( M )
k
Multiple Sequence Alignment –
Approximation Algorithm
Algorithm relies heavily on scoring function
being a distance. It produced an alignment
whose SP score is at most twice the minimum.
What if scoring function was similarity?
Can we get an efficient algorithm whose score
is half the maximum? Third of maximum? …
We dunno !
Tree Alignments
Assume that there is a tree T=(V,E) whose leaves are
the sequences.
• Associate a sequence in each internal node.
• Tree-score(K) = ∑(i,j)Escore(xi,xj).
Finding the optimal assignment of sequences to the
internal nodes is NP Hard.
We will meet again this problem in the study of
Phylogenetic trees (it is related to the parsimony problem).
Multiple Sequence Alignment Heuristics
Example - 4 sequences A, B, C, D.
A.
A
B
C
D
Perform all 6 pair wise
alignments. Find scores.
Build a “similarity tree”.
distant
B.
B
D
A
C
similar
Multiple alignment following the tree from A.
B
D
A
C
Align most similar pairs allowing
gaps to optimize alignment.
Align the next most similar pair.
Now, “align the alignments”, introducing gaps if
necessary to optimize alignment of (BD) with (AC).
The tree-based progressive method for multiple sequence
alignment, used in practice (Clustal)
(a) a tree (dendrogram) obtained by cluster analysis
(b) pairwise alignment of 2 sequences’ alignments.
(a)
DEPGG3
DEBYG3
DEZYG3
DEBSG F
L R R - A R T A S A
L - R G A R A A A E
DEHUG3
L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G
(b)
(modified from Speed’s ppt presentation,
see p. 81 in Kanehisa’s book)
Visualization of Alignment Helps