Transcript F(i,j)
Probability Theory and Basic
Alignment of String Sequences
Chapter 1.1-2.3
S. Maarschalkerweerd & A. Tjhang
1
Overview
Probability Theory
-Maximum Likelihood
-Bayes Theorem
Pairwise Alignment
-The Scoring Model
-Alignment Algorithms
S. Maarschalkerweerd & A. Tjhang
2
Probability Theory
S. Maarschalkerweerd & A. Tjhang
3
Probability Theory
What is a probabilistic model?
Simple example:
What is probability of base sequence x1x2…xn?
p(xi),
p(x1), p(x2)…p(xn) independent of
each other
If pC = 0.3; pT = 0.2 and sequence is CTC:
P(CTC)=0.3*0.2*0.3=0.018
S. Maarschalkerweerd & A. Tjhang
4
Maximum Likelihood Estimation
Estimate parameters of the model from large sets
of examples (training set)
– For example: P(T) and P(C) are estimated from their
frequency in a database of residues
Avoid overfitting
– Database too small, model also fits to noise in the
training set
S. Maarschalkerweerd & A. Tjhang
5
Probability Theory
Conditional Probability
-P(X,Y) = P(X|Y) P(Y) (joint probability)
-P(X) = Y P(X,Y) = Y P(X|Y) P(Y)
(marginal probability)
S. Maarschalkerweerd & A. Tjhang
6
Bayes’ Theorem
P(Y|X) P(X)
P(X|Y) =
P(Y)
- Posterior probability
Example:
P(X)=Probability tumor visible on x-ray
P(C)=Probability breast-cancer = 0.01
P(X|C) = 0.9; P(X|¬C) = 0.05
- On the x-ray a tumor is seen. What is the
probability that the woman has breast-cancer?
S. Maarschalkerweerd & A. Tjhang
7
Pairwise Alignment
S. Maarschalkerweerd & A. Tjhang
8
Pairwise Alignment
Goal: determine whether 2 sequences are related
(homologous).
Issues regarding pairwise alignment:
1.
2.
3.
4.
What sorts of alignment should be considered?
The scoring system used to rank alignments.
The algorithm used to find optimal (or good) scoring
alignments.
The statistical methods to evaluate significance of an
alignment score.
S. Maarschalkerweerd & A. Tjhang
9
Example
You need a ‘smart’ scoring model to
distinguish b from c.
S. Maarschalkerweerd & A. Tjhang
10
The Scoring Model
S. Maarschalkerweerd & A. Tjhang
11
The Scoring Model
When sequences are related, then both
sequences have to be from a common
ancestor.
– Due to mutation sequences can change.
Substitutions
Gaps (insertions or deletions)
– Natural selection ensures that some mutations
are seen more often than others. (Survival of
the fittest)
S. Maarschalkerweerd & A. Tjhang
12
The Scoring Model
Total score of an alignment:
– Sum of terms for each aligned pair of residues
– Terms for each gap
Take the sum of those terms
S. Maarschalkerweerd & A. Tjhang
13
Substitution Matrices
We need a matrix with the scores for every
possible pair of residues (e.g. bases or amino
acids)
We can compute these score by:
s(a,b) =
pab
log(q q )
a b
pab = probability that residues a and b have been derived
independently from some unknown original residue c.
qa = frequency of a
S. Maarschalkerweerd & A. Tjhang
14
BLOSUM50
S. Maarschalkerweerd & A. Tjhang
15
Gap Penalties
(g) = -gd (linear score)
(g) = -d-(g-1)e (affine score)
– d = gap-open penalty
– e = gap-extension penalty
– g = gap length
P(gap) = f(g)i ingapqxi
S. Maarschalkerweerd & A. Tjhang
16
Alignment Algorithms
S. Maarschalkerweerd & A. Tjhang
17
Alignment Algorithms
Needleman-Wunsch (global alignment)
Smith-Waterman (local alignment)
Repeated matches
Overlap matches
Hybrid match conditions
S. Maarschalkerweerd & A. Tjhang
18
Dynamic Programming
Enormous amount of possible alignments
Algorithm for finding optimal alignment:
Use Dynamic Programming
Save sub-results for later reuse, avoiding
calculation of same problem
S. Maarschalkerweerd & A. Tjhang
19
Needleman-Wunsch Algorithm
Global alignment
For sequences of size n and m, make (n+1)x(m+1)
matrix
Fill in from top left to bottom right
F(i-1, j-1) + s(xi,yj)
F(i,j) = max
F(i-1, j) – d
F(i, j-1) – d
Keep pointer to cell that is used to derive F(i,j)
Takes O(nm) time and memory
{
S. Maarschalkerweerd & A. Tjhang
20
Matrix
0
-8
-8
-2
-8
-8
S. Maarschalkerweerd & A. Tjhang
-2
21
Matrix
Traceback
S. Maarschalkerweerd & A. Tjhang
22
Smith-Waterman Algorithm
Local alignment
Two differences with Needleman-Wunsch:
0
F(i-1, j-1) + s(xi,yj)
1. F(i,j) = max
F(i-1, j) – d
F(i, j-1) – d
2. Local alignment can end anywhere, so choose
highest value in matrix from where traceback
starts (not necessarily bottom right cell)
{
S. Maarschalkerweerd & A. Tjhang
23
Matrix
S. Maarschalkerweerd & A. Tjhang
24
Smith-Waterman Algorithm
Expected score for a random match s(a,b)
must be negative
There must be some s(a,b) greater than 0 or
no alignment is found
S. Maarschalkerweerd & A. Tjhang
25
Repeated Matches
Many local alignments possible if one or
both sequences are long. Smith-Waterman
only finds one of them
Find parts of sequence in the other sequence
Not every alignment is useful
threshold
S. Maarschalkerweerd & A. Tjhang
26
Repeated Matches
F(i,j) = max
{
F(i,0) = max
{
F(i, 0)
F(i-1, j-1) + s(xi,yj)
F(i-1, j) – d
F(i, j-1) – d
F(i-1, 0)
F(i-1, j) – T, j = 1,…m;
S. Maarschalkerweerd & A. Tjhang
27
Matrix
Threshold T = 20
S. Maarschalkerweerd & A. Tjhang
28
Overlap Matches
Find match between start of a sequence and end of
a sequence (can be the same)
Alignment begins on left-hand or top border of the
matrix and ends on right-hand or bottom border
S. Maarschalkerweerd & A. Tjhang
29
Overlap Matches
F(0,j) = 0, for j = 1,…,m
F(i,0) = 0, for i = 1,…,n
F(i,j) = max
{
F(i-1, j-1) + s(xi,yj)
F(i-1, j) – d
F(i, j-1) – d
S. Maarschalkerweerd & A. Tjhang
30
Matrix
S. Maarschalkerweerd & A. Tjhang
31
Hybrid Match Conditions
Different types of alignment can be created by
– adjusting rhs of this formula:
F(i,j) = max {….
– adjusting the traceback
Example:
– We want to align two sequences from the beginning of
both the sequences until local alignment has been
found.
S. Maarschalkerweerd & A. Tjhang
32
Summary
Probability theory is important for sequence
analysis
Goal: determine whether 2 sequences are
related
For that, we need to find an optimal alignment
between those sequences using algorithms
Scoring model is required to rank different
alignments
Different algorithms for different types of
alignments
– use dynamic programming
S. Maarschalkerweerd & A. Tjhang
33