Transcript F(i,j)

Probability Theory and Basic
Alignment of String Sequences
Chapter 1.1-2.3
S. Maarschalkerweerd & A. Tjhang
1
Overview

Probability Theory
-Maximum Likelihood
-Bayes Theorem

Pairwise Alignment
-The Scoring Model
-Alignment Algorithms
S. Maarschalkerweerd & A. Tjhang
2
Probability Theory
S. Maarschalkerweerd & A. Tjhang
3
Probability Theory

What is a probabilistic model?
 Simple example:
What is probability of base sequence x1x2…xn?
 p(xi),
p(x1), p(x2)…p(xn) independent of
each other
If pC = 0.3; pT = 0.2 and sequence is CTC:
P(CTC)=0.3*0.2*0.3=0.018
S. Maarschalkerweerd & A. Tjhang
4
Maximum Likelihood Estimation

Estimate parameters of the model from large sets
of examples (training set)
– For example: P(T) and P(C) are estimated from their
frequency in a database of residues

Avoid overfitting
– Database too small, model also fits to noise in the
training set
S. Maarschalkerweerd & A. Tjhang
5
Probability Theory

Conditional Probability
-P(X,Y) = P(X|Y) P(Y) (joint probability)
-P(X) = Y P(X,Y) = Y P(X|Y) P(Y)
(marginal probability)
S. Maarschalkerweerd & A. Tjhang
6
Bayes’ Theorem

P(Y|X) P(X)
P(X|Y) =
P(Y)
- Posterior probability
 Example:
P(X)=Probability tumor visible on x-ray
P(C)=Probability breast-cancer = 0.01
P(X|C) = 0.9; P(X|¬C) = 0.05
- On the x-ray a tumor is seen. What is the
probability that the woman has breast-cancer?
S. Maarschalkerweerd & A. Tjhang
7
Pairwise Alignment
S. Maarschalkerweerd & A. Tjhang
8
Pairwise Alignment


Goal: determine whether 2 sequences are related
(homologous).
Issues regarding pairwise alignment:
1.
2.
3.
4.
What sorts of alignment should be considered?
The scoring system used to rank alignments.
The algorithm used to find optimal (or good) scoring
alignments.
The statistical methods to evaluate significance of an
alignment score.
S. Maarschalkerweerd & A. Tjhang
9
Example

You need a ‘smart’ scoring model to
distinguish b from c.
S. Maarschalkerweerd & A. Tjhang
10
The Scoring Model
S. Maarschalkerweerd & A. Tjhang
11
The Scoring Model

When sequences are related, then both
sequences have to be from a common
ancestor.
– Due to mutation sequences can change.
 Substitutions
 Gaps (insertions or deletions)
– Natural selection ensures that some mutations
are seen more often than others. (Survival of
the fittest)
S. Maarschalkerweerd & A. Tjhang
12
The Scoring Model

Total score of an alignment:
– Sum of terms for each aligned pair of residues
– Terms for each gap

Take the sum of those terms
S. Maarschalkerweerd & A. Tjhang
13
Substitution Matrices

We need a matrix with the scores for every
possible pair of residues (e.g. bases or amino
acids)
 We can compute these score by:
s(a,b) =
pab
log(q q )
a b
pab = probability that residues a and b have been derived
independently from some unknown original residue c.
qa = frequency of a
S. Maarschalkerweerd & A. Tjhang
14
BLOSUM50
S. Maarschalkerweerd & A. Tjhang
15
Gap Penalties
(g) = -gd (linear score)
 (g) = -d-(g-1)e (affine score)

– d = gap-open penalty
– e = gap-extension penalty
– g = gap length

P(gap) = f(g)i ingapqxi
S. Maarschalkerweerd & A. Tjhang
16
Alignment Algorithms
S. Maarschalkerweerd & A. Tjhang
17
Alignment Algorithms

Needleman-Wunsch (global alignment)
 Smith-Waterman (local alignment)
 Repeated matches
 Overlap matches
 Hybrid match conditions
S. Maarschalkerweerd & A. Tjhang
18
Dynamic Programming

Enormous amount of possible alignments
 Algorithm for finding optimal alignment:
Use Dynamic Programming
 Save sub-results for later reuse, avoiding
calculation of same problem
S. Maarschalkerweerd & A. Tjhang
19
Needleman-Wunsch Algorithm






Global alignment
For sequences of size n and m, make (n+1)x(m+1)
matrix
Fill in from top left to bottom right
F(i-1, j-1) + s(xi,yj)
F(i,j) = max
F(i-1, j) – d
F(i, j-1) – d
Keep pointer to cell that is used to derive F(i,j)
Takes O(nm) time and memory
{
S. Maarschalkerweerd & A. Tjhang
20
Matrix
0
-8
-8
-2
-8
-8
S. Maarschalkerweerd & A. Tjhang
-2
21
Matrix
Traceback
S. Maarschalkerweerd & A. Tjhang
22
Smith-Waterman Algorithm

Local alignment
 Two differences with Needleman-Wunsch:
0
F(i-1, j-1) + s(xi,yj)
1. F(i,j) = max
F(i-1, j) – d
F(i, j-1) – d
2. Local alignment can end anywhere, so choose
highest value in matrix from where traceback
starts (not necessarily bottom right cell)
{
S. Maarschalkerweerd & A. Tjhang
23
Matrix
S. Maarschalkerweerd & A. Tjhang
24
Smith-Waterman Algorithm

Expected score for a random match s(a,b)
must be negative
 There must be some s(a,b) greater than 0 or
no alignment is found
S. Maarschalkerweerd & A. Tjhang
25
Repeated Matches

Many local alignments possible if one or
both sequences are long. Smith-Waterman
only finds one of them
 Find parts of sequence in the other sequence
 Not every alignment is useful
threshold
S. Maarschalkerweerd & A. Tjhang
26
Repeated Matches
F(i,j) = max
{
F(i,0) = max
{
F(i, 0)
F(i-1, j-1) + s(xi,yj)
F(i-1, j) – d
F(i, j-1) – d
F(i-1, 0)
F(i-1, j) – T, j = 1,…m;
S. Maarschalkerweerd & A. Tjhang
27
Matrix
Threshold T = 20
S. Maarschalkerweerd & A. Tjhang
28
Overlap Matches

Find match between start of a sequence and end of
a sequence (can be the same)

Alignment begins on left-hand or top border of the
matrix and ends on right-hand or bottom border
S. Maarschalkerweerd & A. Tjhang
29
Overlap Matches
F(0,j) = 0, for j = 1,…,m
 F(i,0) = 0, for i = 1,…,n


F(i,j) = max
{
F(i-1, j-1) + s(xi,yj)
F(i-1, j) – d
F(i, j-1) – d
S. Maarschalkerweerd & A. Tjhang
30
Matrix
S. Maarschalkerweerd & A. Tjhang
31
Hybrid Match Conditions

Different types of alignment can be created by
– adjusting rhs of this formula:
F(i,j) = max {….
– adjusting the traceback

Example:
– We want to align two sequences from the beginning of
both the sequences until local alignment has been
found.
S. Maarschalkerweerd & A. Tjhang
32
Summary





Probability theory is important for sequence
analysis
Goal: determine whether 2 sequences are
related
For that, we need to find an optimal alignment
between those sequences using algorithms
Scoring model is required to rank different
alignments
Different algorithms for different types of
alignments
– use dynamic programming
S. Maarschalkerweerd & A. Tjhang
33