Transcript Document

C
E
N
T
R
E
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Introduction to bioinformatics
2007
Lecture 5
Pair-wise Sequence Alignment
Bioinformatics
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense
except in the light of Biology”
The Genetic Code
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS
* *
* **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc---***
**** **** **
******
Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biologial
experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the
unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
Is similarity really interesting?
Evolutionary and functional relationships
Reconstruct evolutionary relation:
•Based on sequence
-Identity (simplest method)
-Similarity
•Homology (common ancestry: the ultimate goal)
•Other (e.g., 3D structure)
Functional relation:
Sequence Structure
Function
Searching for similarities
Common ancestry is more interesting:
Makes it more likely that genes share
the same function
Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool:
When (an unknown) gene X is homologous to (a
known) gene G it means that we gain a lot of
information on X: what we know about G can be
transferred to X as a good suggestion.
How to go from DNA to protein
sequence
A piece of double stranded DNA:
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’
How to go from DNA to protein
sequence
6-frame translation using the codon table (last lecture):
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
Bioinformatics tool
Algorithm
Data
tool
Biological
Interpretation
(model)
Example today: Pairwise sequence
alignment needs sense of evolution
Global dynamic programming
MDAGSTVILCFVG
M
D
A
A
S
T
I
L
C
G
S
Evolution
Search matrix
MDAGSTVILCFVGMDAAST-ILC--GS
Amino Acid Exchange
Matrix
Gap penalties
(open,extension)
How to determine similarity?
• Frequent evolutionary events:
1. Substitution
2. Insertion, deletion We’ll only use these
3. Duplication
4. Inversion
• Evolution at work
Common ancestor,
usually extinct
Z
X
available
Y
Alignment - the problem
algorithmsforgenomes
||||||||||
algorithms4genomes
Algorithmsforgenomes
|||||||
algorithms4genomes
algorithmsforgenomes
||||||||||? |||||||
algorithms4--genomes
Operations:
– substitution,
– Insertion
– deletion
GTACT--C
GGA-TGAC
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS
* *
* **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc---***
**** **** **
******
Dynamic programming
Scoring alignments
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum of the scores of all
alignment columns
Dynamic programming
Scoring alignments
– Substitution (or match/mismatch)
affine
• DNA
concave
linear
penalty
• proteins
– Gap penalty
,
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum
of the scores over all alignment columns
Gap length
•/
General alignment score:
Sa,b = l s(ai, bj )  k N  gp(k )
k
Substitution Matrices: DNA
define a score for match/mismatch of letters
Simple:
A
C
G
T
A
1
-1
-1
-1
C
-1
1
-1
-1
G
-1
-1
1
-1
T
-1
-1
-1
1
Used in genome alignments:
A
C
G
T
A
91
-114
-31
-123
C
-114
100
-125
-31
G
-31
-125
100
-114
T
-123
-31
-114
91


Amino acids are not equal:
1. Some are similar and easily
substituted:
• biochemical properties
• structure
2. Some mutations occur more
often due to similar codons
The two above give us substitution
matrices
http://www.cimr.cam.ac.uk/links/codon.htm
http://www.people.virginia.edu/~rjh9u/aminacid.html
Substitution matrices for a.a.
orange: nonpolar and hydrophobic.
green: polar and hydrophilic
magenta box are acidic
light blue box are basic
BLOSUM 62 substitution matrix
http://www.carverlab.org/testing/epp.html
Constant vs. affine gap scoring
Gap
intro
extension
Constant
-2
-2
Affine
-3
-1
Scoring
…and +1 for match
Seq1
Seq2
G
-
Const
-2 –2 1
Affine
T
-
–4
A
A
1
T
G
G
-
–2 –2 (SUM = -7)
-4
(SUM = -7)
A
T
T
G
A
-
-2 –2 1
-2 –2 (SUM = -7)
-3 –3 1
-3 –3 (SUM = -11)
Dynamic programming
Scoring alignments
T D W V T A L K
T D W L - - I K
2020
10
Amino Acid Exchange Matrix
1
Affine gap penalties (open,
extension)
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px +
+s(L,I)+s(K,K)
Amino acid exchange matrices
2020
How do we get one?
And how do we get associated gap penalties?
First systematic method to derive a.a.
exchange matrices by Margaret Dayhoff et
al. (1968) – Atlas of Protein Structure
A
2
R -2
6
N
0
0
2
D
0 -1
2
PAM250 matrix
4
C -2 -4 -4 -5 12
Q
0
1
1
2 -5
4
E
0 -1
1
3 -5
2
4
G
1 -3
0
1 -3 -1
0
2
1 -3
1 -2
H -1
2
3
5
6
I -1 -2 -2 -2 -2 -2 -2 -3 -2
5
L -2 -3 -3 -4 -6 -2 -3 -4 -2
2
1
0 -5
1
amino acid
exchange matrix
(log odds)
0 -2
6
K -1
3
M -1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F -4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
9
1
0 -1 -1 -3
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
2
T
1 -1
0
0 -2 -1
0
0 -1
0 -1 -3
0
1
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
Y -3 -4 -2 -4
V
0 -2 -3 -1 -2 -5
0 -2
0 -4 -4 -5
Positive exchange values
denote mutations that are
more likely than randomly
expected, while negative
numbers correspond to
avoided mutations compared
to the randomly expected
situation
5
P
W -6
0 -1 -1
0 -2 -3
0 -1 -1 -4 -2
6
3
0 -6 -2 -5 17
7 -5 -3 -3
0 10
0 -2 -2 -2 -2 -2 -2 -1 -2
4
2 -2
2 -1 -1 -1
0 -6 -2
4
A
I
L
M
T
V
R
N
D
C
Q
E
G
H
K
F
P
S
W
Y
Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have
similar:
• physico-chemical properties
• structure
2. Some mutations between amino acids occur more
often due to similar codons
The two above observations give us ways to define
substitution matrices
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
2n
~
=
n
22n
(2n)!
(n!)2
n
2 sequences of 300 a.a.: ~1088 alignments
2 sequences of 1000 a.a.: ~10600 alignments!
Technique to overcome the
combinatorial explosion:
Dynamic Programming
• Alignment is simulated as Markov process,
all sequence positions are seen as
independent
• Chances of sequence events are independent
– Therefore, probabilities per aligned position
need to be multiplied
– Amino acid matrices contain so-called log-odds
values (log10 of the probabilities), so
probabilities can be summed
To say the same more statistically…
•To perform statistical analyses on messages or sequences, we need a
reference model.
•The model: each letter in a sequence is selected from a defined
alphabet in an independent and identically distributed (i.i.d.) manner.
•This choice of model system will allow us to compute the statistical
significance of certain characteristics of a sequence, its subsequences,
or an alignment.
•Given a probability distribution, Pi, for the letters in a i.i.d. message,
the probability of seeing a particular sequence of letters i, j, k, ... n is
simply Pi Pj Pk···Pn.
•As an alternative to multiplication of the probabilities, we could sum
their logarithms and exponentiate the result. The probability of the
same sequence of letters can be computed by exponentiating log Pi +
log Pj + log Pk+ ··· + log Pn.
•In practice, when aligning sequences we only add log-odds values
(residue exchange matrix) but we do not exponentiate the final score.
Sequence alignment
History of Dynamic
Programming algorithm
1970 Needleman-Wunsch global pair-wise
alignment
Needleman SB, Wunsch CD (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53.
1981 Smith-Waterman local pair-wise
alignment
Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences.
J. Mol. Biol. 147, 195-197.
Pairwise sequence alignment
Global dynamic programming
MDAGSTVILCFVG
M
D
A
A
S
T
I
L
C
G
S
Evolution
Amino Acid Exchange
Matrix
Search matrix
MDAGSTVILCFVGMDAAST-ILC--GS
Gap penalties
(open,extension)
Global dynamic programming
j-1 j
i-1
i
H(i,j) = Max
This is a recursive
formula
Value from residue
exchange matrix
H(i-1,j-1) + S(i,j)
H(i-1,j) - g
H(i,j-1) - g
diagonal
vertical
horizontal
Global alignment
The Needleman-Wunsch algorithm
• Goal: find the maximal scoring alignment
• Scores: m match, -s mismatch, -g for insertion/deletion
• Dynamic programming
– Solve smaller subproblem(s)
– Iteratively extend the solution
• The best alignment for X[1…i] and Y[1…j] is called
M[i, j]
X1 … Xi-1 - Xi
Y1 … Yj-1 Yj -
The algorithm




Goal: find the maximal scoring alignment
Scores: m match, -s mismatch, -g for
insertion/deletion
The best alignment for X[1…i] and Y[1…j]
is called M[i, j]
3 ways to extend the alignment
X[1…i-1] X[i] X[1…i]
X[1…i-1] X[i]
Y[1…j-1] Y[j] Y[1…j-1]Y[j] Y[1…j]
M[i,j] = M[i-1, j-1] +- m
s
M[i, j-1] - g
M[i-1, j] - g
The algorithm – final equation
Corresponds to:
*
M[i-1, j-1] + score(X[i],Y[j])
M[i, j] = max
M[i, j-1] – g
M[i-1, j] – g
j-1
i-1
i
*
-g
j
-g
X1…Xi-1
Xi
Y1…Yj-1
X1…XiY j
Y1…Yj-1
Yj
X1…Xi-1
Xi
Y1…Yj -
Example:
global alignment of two
sequences
• Align two DNA sequences:
– GAGTGA
– GAGGCGA (note the length difference)
• Parameters of the algorithm:
– Match: score(A,A) = 1
– Mismatch: score(A,T) = -1
– Gap: g = -2
M[i-1, j-1] ± 1
M[i, j] = max
M[i, j-1] – 2
M[i-1, j] – 2
The algorithm. Step 1: init
M[i-1, j-1] ± 1
• Create the matrix
• Initiation
– 0 at [0,0]
– Apply the
equation…
M[i, j] = max
M[i, j-1] – 2
M[i-1, j] – 2
j
i
0
-
1
G
2
A
3
G
4
G
5
C
6
G
7
A
0
1
2
3
4
5
6
-
G
A
G
T
G
A
0
The algorithm. Step 1: init
M[i-1, j-1] ± 1
M[i, j] = max
M[i, j-1] – 2
M[i-1, j] – 2
j
• Initiation of the
matrix:
– 0 at pos [0,0]
– Fill in the first row
using the “” rule
– Fill in the first
column using “”
i
-
G
A
G
T
G
A
-
0
-2
-4
-6
-8
-10 -12
G
-2
A
-4
G
-6
G
-8
C
-10
G
-12
A
-14
The algorithm. Step 2: fill in
M[i-1, j-1] ± 1
M[i, j] = max
M[i, j-1] – 2
M[i-1, j] – 2
j
• Continue filling in of
the matrix,
remembering from
which cell the result
comes (arrows)
i
-
G
A
G
T
G
A
-
0
-2
-4
-6
-8
-10 -12
G
-2
1
-1
-3
A
-4
-1
2
G
-6
G
-8
C
-10
G
-12
A
-14
The algorithm. Step 2: fill in
M[i-1, j-1] ± 1
M[i, j] = max
M[i, j-1] – 2
M[i-1, j] – 2
j
• We are done…
• Where’s the
result?
The lowest-rightmost
cell
i
-
G
A
G
T
G
A
-
0
-2
-4
-6
-8
-10 -12
G
-2
1
-1
-3
-5
-7
-9
A
-4
-1
2
0
-2
-4
-6
G
-6
-3
0
3
1
-1
-3
G
-8
-5
-2
1
2
2
0
C
-10
-7
-4
-1
0
1
1
G
-12
-9
-6
-3
-2
1
0
A
-14 -11
-8
-5
-4
-1
2
The algorithm. Step 3: traceback
• Start at the last cell of
the matrix
• Go against the direction
of arrows
• Sometimes the value
may be obtained from
more than one cell
(which one?)
M[i-1, j-1] ± 1
M[i, j] = max
M[i, j-1] – 2
M[i-1, j] – 2
j
i
-
G
A
G
T
G
A
-
0
-2
-4
-6
-8
-10 -12
G
-2
1
-1
-3
-5
-7
-9
A
-4
-1
2
0
-2
-4
-6
G
-6
-3
0
3
1
-1
-3
G
-8
-5
-2
1
2
2
0
C
-10
-7
-4
-1
0
1
1
G
-12
-9
-6
-3
-2
1
0
A
-14 -11
-8
-5
-4
-1
2
The algorithm. Step 3: traceback
• Extract the
alignments
The ‘low’ and the ‘high’ road
j
a)
GAGT-GA
GAGGCGA
b)
-
GA-GTGA
GAGGCGA
i
You can also jump from the high to
the low road in the middle
(following the arrow): to what
alignment does that lead?
G
A
G
T
G
A
-
0
a -2
-4
-6
-8
-10 -12
G
-2
b1
-1
-3
-5
-7
-9
A
-4
-1
2
0
-2
-4
-6
G
-6
-3
0
3
1
-1
-3
G
-8
-5
-2
1
2
2
0
C
-10
-7
-4
-1
0
1
1
G
-12
-9
-6
-3
-2
1
0
A
-14 -11
-8
-5
-4
-1
2
Affine gaps
• Penalties:
go - gap opening (e.g. -8)
ge - gap extension (e.g. -1)
M[i-1, j-1]
X1… Xi
Y1… Yj
M[i, j] = score(X[i], Y[j]) + max
Ix[i-1, j-1]
Iy[i-1, j-1]
X1… Y1… Yj
Ix[i, j] = max
Iy[i, j] = max
M[i, j-1] + go
Ix[i, j-1] + ge
M[i-1, j] + go
Iy[i-1, j] + gx
• @ home: think of boundary values M[*,0], I[*,0] etc.
X1…Xi Y1…Yj-1
Yj
X1… - Y1…Yj-1
Yj
Semi-global pairwise alignment
• Global alignment: all gaps are penalised
• Semi-global alignment: N- and C-terminal (5’ and
3’) gaps (end-gaps) are not penalised
End-gaps
MSTGAVLIY--TS-------GGILLFHRTSGTSNS
End-gaps
Variation on global alignment
• Global alignment: previous algorithms is called global
alignment, because it uses all letters from both sequences.
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
• Semi-global alignment: don’t penalize for end gaps
CAGCA-CTTGGATTCTCGG
---CAGCGTGG-------– Applications of semi-global:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
– Danger! – really bad alignments for divergent seqs
seq X:
seq Y:
Semi-global pairwise alignment
Applications of semi-global:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
Danger: if gap penalties high -- really bad
alignments for divergent sequences
Protein sequences have N- and
C-terminal amino acids that are
often small and hydrophilic
Semi-global alignment
• Ignore 5’ or Nterminal end gaps
– First row/column set
to 0
• Ignore C-terminal or
3’ end gaps
– Read the result from
last row/column
-
G
A
G
T
G
-
0
0
0
0
0
0
A
0
-1
1
-1
-1
-1
G
0
1
-2
2
0
-2
T
0
-1
0
0
3
1
Take-home messages
• Homology
• Why are we interested in similarity?
• Pairwise alignment: global alignment and
semi-global alignment
• Three types of gap penalty regimes: linear,
affine and concave
• Make sure you can calculate alignment
using the DP algorithm
• a heuristic
– Heuristics: A rule of thumb that often helps in
solving a certain class of problems, but makes
no guarantees.
Perkins, DN (1981) The Mind's Best Work
Global dynamic programming
PAM250, Gap =6 (linear)
S
P
E
A
R
E
0
-6
-12
-18
-24
-30
-36
S
-6
2
-4
-10
-16
-22
-28
0
H
-12
-4
2
-3
-9
-14
-20
-24
3
0
A
-18
-10
-3
0
-1
-7
-13
-18
-1
4
K
-24
-16
-9
-3
-1
2
-4
-12
E
-30
-22
-15
-5
-3
-2
6
-6
-30
-24
-18
-12
-6
0
S
P
E
A
R
E
S
2
1
0
1
0
0
H
-1
0
1
-1
2
1
A
1
1
0
2
-2
K
0
-1
0
-1
E
0
-1
4
0
These values are copied from
the PAM250 matrix (see earlier
slide)
Higgs & Attwood, p. 124
The extra bottom row and rightmost column give the
penalties that would need to be applied due to end gaps
Global dynamic programming
Linear, Affine or Concave gap penalties
j-1
All penalty schemes
are possible because
the exact length of
the gap is known
i-1
Gap
opening
penalty
Si,j = si,j + Max
Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px}
Si-1,j-1
Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px}
Gap extension penalty
Global dynamic programming
Gapo=10, Gape=2
D
W
V
T
A
L
K
0
-12
-14
-16
-18
-20
-22
-24
T
-12
8
-9
-6
-5
-9
-11
-14
5
D
-14
0
9
2
2
3
-5
-3
-34
10
6
W
-16
-13
25
11
5
4
9
0
-21
6
14
5
V
-18
-10
-4
37
21
19
19
15
-16
7
5
13
L
-20
-14
-2
23
46
31
37
26
1
K
-22
-12
-9
17
33
53
39
50
14
-34
-29
-1
17
39
27
50
D
W
V
T
A
L
K
T
8
3
8
11
9
9
8
D
12
1
6
8
8
4
8
W
1
25
2
3
2
6
V
6
2
12
8
8
L
4
6
10
9
K
8
5
6
8
These values are copied from
the PAM250 matrix (see earlier
slide), after being made nonnegative by adding 8 to each
PAM250 matrix cell (-8 is the
lowest number in the PAM250
matrix)
The extra bottom row and rightmost column give the
final global alignment scores
Easy DP recipe for using affine
gap penalties
j-1
i-1
• M[i,j] is optimal alignment (highest scoring
alignment until [i,j])
• Check
– preceding row until j-2: apply appropriate gap penalties
– preceding row until i-2: apply appropriate gap penalties
– and cell[i-1, j-1]: apply score for cell[i-1, j-1]
DP is a two-step process
• Forward step: calculate scores
• Trace back: start at highest score and
reconstruct the path leading to the highest
score
– These two steps lead to the highest scoring
alignment (the optimal alignment)
– This is guaranteed when you use DP!
Global dynamic programming
Semi-global dynamic programming
- two examples with different gap penalties These values are copied from
the PAM250 matrix (see earlier
slide), after being made nonnegative by adding 8 to each
PAM250 matrix cell (-8 is the
lowest number in the PAM250
matrix)
Global score is 65 –
10 – 1*2 –10 – 2*2
Local dynamic programming
(Smith & Waterman, 1981)
LCFVMLAGSTVIVGTR
E
D
A
S
T
I
L
C
G
S
Negative
numbers
Amino Acid
Exchange Matrix
Search matrix
AGSTVIVG
A-STILCG
Gap penalties
(open, extension)
Local dynamic programming
(Smith & Waterman, 1981)
j-1
i-1
Gap
opening
penalty
Si,j = Max
Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}
Si,j + Si-1,j-1
Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}
0
Gap extension penalty
Local dynamic programming
Dot plots
• Way of representing (visualising) sequence
similarity without doing dynamic
programming (DP)
• Make same matrix, but locally represent
sequence similarity by averaging using a
window
Comparing two sequences
We want to be able to choose the best alignment between two
sequences.
A simple method of visualising similarities between two sequences
is to use dot plots. The first sequence to be compared is assigned to
the horizontal axis and the second is assigned to the vertical axis.
Dot plots can be filtered by
window approaches (to
calculate running averages)
and applying a threshold
They can identify
insertions, deletions,
inversions