Transcript Document

Sequence comparaison
A little bit of theory does not
hurt!
Ahmed Rebaï
Definitions
Comparative method:
analyze similarities and
differences between sequence residues in order to infer
structural, functionnal or evolutionary relationship
Sequence alignment:
procedure of comparing 2 or
more sequences by searching for a series of individual
residues or residue patterns that are in the same order in
the sequences
Global alignment:
stretched over the entire
sequence length to include as many matching residues as
possible up to the sequence ends
Local alignment:
stops at the ends of regions of
identity or strong similarity and priority is given to find
conserved patterns than to extend the alignment to
include neighbouring residues (we will see how!)
Homology
Measuring Similarity
Measuring the extent of similarity between
two sequences
 Based on percent sequence identity
 Based on conservation
Percent Sequence Identity
The extent to which two nucleotide or
amino acid sequences are invariant
AC C TG A G – AG
AC G TG – G C AG
mismatch
indel
70% identical
Making a Scoring Matrix
Scoring matrices are created based on
biological evidence.
Alignments can be thought of as two
sequences that differ due to mutations in
the sequence.
Some of these mutations have little effect
on the organism’s function, therefore
some penalties will be less harsh than
others.
Scoring Matrix: Example
A
R
N
K
A
5
-2
-1
-1
R
-
7
-1
3
N
-
-
7
0
K
-
-
-
6
• Notice that although
R and K are different
amino acids, they
have a positive score.
• Why? They are both
positively charged
amino acids will not
greatly change
function of protein.
Conservation
Amino acid changes that preserve the
physico-chemical properties of the original
residue
 Polar to polar
 aspartate  glutamate
 Nonpolar to nonpolar
 alanine  valine
 Similarly behaving residues
 leucine to isoleucine
Alignment scores
A global score is calculated as:




a score is attributed to each position of the
alignment according to:
 Match/mismatch/gap in DNA sequences
 Substitution of an aminoacids by another
 gap for protein and DNA sequences
We add scores for the whole alignment:
S=  S_elem +  S_Gap
Gap scores are negative because these are
penalities
By choosing these scores we can either favour
alignments with few gaps (or ungapped) or with
many gaps
Scoring Indels: Naive Approach
A fixed penalty σ is given to every indel:
 -σ when there is 1 indel, -2σ for 2
consecutive indels, -3σ for 3
consecutive indels, etc.
Can be too severe penalty for a series of
100 consecutive indels
Affine Gap Penalties
In nature, very often indels come as a
unit, not just at 1 nucleotide at a time.
In nature, this
is more likely.
Normal scoring would
give the same score
for both alignments
Gap scores
The Gap score depends on its length
P= x+ y l
x: Gap-opening penality
y: elementary gap extension penality
y<< x (x/10) for example by default:
x=-11, y=-1 (blastp) and x=-5, y=-2 (Blastn)
 By this choice a large gap is more likely
than many small gaps of te same length
(which is consistent with biological
phenomena)
Score matrix or substitution matrix
for DNA sequences: 1 for a match, -2
for a mismatch
for protein sequences: 20x20 matrix



Based on model of protein evolution
Based on analysis of highly conserved
segments in proteins (Blocks)
Based on physico-chemical or structural
properties
PAM Matrices (Dayhoff et al., 1978)
list the likelihood of change from one aa to
another in homologus protein sequences
during evolution
The change of an aa A by B is assumed the
same regardless of any pevious change at
that site and the position of A in the protein
Changes are in closely related protein and
are assumed to represent aa substitution
tha do not change the protein function and
are called «accepted mutations »
PAM: point accepted mutation
1 PAM = PAM1 = 1% average change of all
amino acid positions
 After 100 PAMs of evolution, not every residue
will have changed
 some residues may have mutated several
times
 some residues may have returned to their
original state
 some residues may mot changed at all
PAM matrix
Relative mutability of aa is evaluated by
counting (in a group of related sequences) the
number of changes of each aa and dividing by
the « exposure to mutation » of that aa
Exposure is calculated as the product of the
frequency of occurrence of the aa in the group
of seq. and the total number of aa changes the
occured in that group per 100 sites
Calculated from etimations of aa susbtitution
that occured in a group of evolving porteins
using 1572 changes in 71 groups of proteins
sequences that were at least 85% similar.
PAM matrix
The most mutable aa are Asn, Ser, Asp and
Glu and the least mutable are Cys and Try
One we get PAM1 matrix we should multiply
it by itself N times if we want to predict
changes for more distantly related proteins
that have undergone N mutations
The commonl used is PAM250 which
represents a level of 250 changes for 100 aa
expected in 2500 my. Such sequences are
expected to have about 20% similarity
PAM matrix
260 changes were observed between Phe and
Tyr among the 1572 changes
The scoe of changing Phe to Tyr wa 0.0021
(0.9946 for not changing Phe)
PAM-250 the proba of Phe to Tyr chnage is
0.15 an that of no change is 0.32
The frequency of Phe was 0.04 so the relative
frequency of of change is 0.15/0.04=3.75 .
10Log10(3.75)=5.7. The score for a change of
Tyr to phe was found 8.3 so use the average
value of these two which is 7
PAM250
PAM matrix
Values are between -8 to 17
0 indicates the the frequency of the
subsitution between these two aa is expected
by chance
negative means that the frequency of
substitution is less than expected by chance
For example no cnages between Gly (G) to
Trp (W) resulting in a score of -7
BLOSUM (Henikoff et al, 1993)
Blocks Substitution Matrix
Scores derived from observations of the
frequencies of substitutions in blocks of
local alignments in related proteins
Matrix name indicates evolutionary
distance
 BLOSUM62 was created using
sequences sharing no more than 62%
identity
BLOSUM matrix
An alternative that does not involve an
explicit model of evolutionary change is
to just find blocks of similar sequence
and count how often one amino acid is
substituted by another
Good for general comparisons without
consideration of evolutionary distance
Comparison PAM-BLOSUM
Choice of a Matrix!
BLOSUM90
PAM30
Rat versus
mouse RBP
BLOSUM80
PAM120
BLOSUM62
PAM180
BLOSUM45
PAM240
Rat versus
bacterial
lipocalin
Methods for aligning 2 sequences
Dot matrix analysis
Dynamic programming algorithm
Word or k-tuple methods such that used in
the programs BLAST and FASTA
Bayesian sequence alignment algorithms
The Dot-plot: a first step !
Dot plot
Dynamic programming
Compoutational method that provides
de very best (optimal) alignment
between two sequences
Compare every pair of residues in the
two sequences and includes: matches,
mismatches and gaps
ATGT- AATGCATG
|||| |||
|
ATGTGAAT
--
A
Alignment as a Path in the Edit
Graph
0 1
A
A
0 1
2
T
T
2
2
_
C
3
3
G
G
4
(0,0) , (1,1)
4
T
T
5
5
T
_
5
6
A
A
6
7
T
_
6
7
_
C
7
Alignment as a Path in the Edit
Graph
0 1
A
A
0 1
2
T
T
2
2
_
C
3
3
G
G
4
4
T
T
5
5
T
_
5
6
A
A
6
(0,0) , (1,1) , (2,2)
7
T
_
6
7
_
C
7
Alignment as a Path in the Edit
Graph
0 1
A
A
0 1
2
T
T
2
2
_
C
3
3
G
G
4
4
T
T
5
5
T
_
5
6
A
A
6
7
T
_
6
7
_
C
7
(0,0) , (1,1) , (2,2), (2,3),
(3,4)
Alignment as a Path in the Edit
Graph
0 1
A
A
0 1
2
T
T
2
2
_
C
3
3
G
G
4
4
T
T
5
5
T
_
5
6
A
A
6
7
T
_
6
7
_
C
7
- End Result (0,0) , (1,1) , (2,2), (2,3),
(3,4), (4,5), (5,5), (6,6),
(7,6), (7,7)
Alignments in Edit Graph
and
represent
indels in v and w
• Score 0.
represent exact
matches.
• Score 1.
Alignments in Edit Graph
The score of the
alignment path in the
graph is 5.
Alignment as a Path in the Edit Graph
Every path in the edit
graph corresponds to
an alignment:
Alignment as a Path in the Edit Graph
Old Alignment
0122345677
v= AT_GTTAT_
w= ATCGT_A_C
0123455667
New Alignment
0122345677
v= AT_GTTAT_
w= ATCG_TA_C
0123445667
Alignment as a Path in the Edit
Graph
0122345677
v= AT_GTTAT_
w= ATCGT_A_C
0123455667
(0,0) , (1,1) , (2,2), (2,3),
(3,4), (4,5), (5,5), (6,6),
(7,6), (7,7)
Alignment: Dynamic Programming
Use this scoring algorithm
si,j =
max
si-1, j-1+1 if vi = wj
si-1, j
si, j-1
Dynamic Programming Example
• There are no
matches in the
beginning of the
sequence
• Label column i=1 to
be all zero, and row
j=1 to be all zero
Dynamic Programming Example
Si,j =
max
Si-1, j-1 value from NW +1, if v = w
i
j
Si-1, j  value from North (top)
Si, j-1
 value from West (left)
Alignment: Backtracking
Arrows
show where the score
originated from.
if from the top
if from the left
if vi = wj
Backtracking Example
Find a match in row and column 2.
i=2, j=2,5 is a match (T).
j=2, i=4,5,7 is a match (T).
Since vi = wj, S(i,j) = S(i-1,j-1) +1
S(2,2)
S(2,5)
S(4,2)
S(5,2)
S(7,2)
=
=
=
=
=
[S(1,1)
[S(1,4)
[S(3,1)
[S(4,1)
[S(6,1)
=
=
=
=
=
1]
1]
1]
1]
1]
+
+
+
+
+
1
1
1
1
1
Backtracking Example
Continuing with the
scoring algorithm
gives this result.
Local vs. Global Alignment
• The Global Alignment Problem tries to find
the longest path between vertices (0,0)
and (n,m) in the edit graph.
• The Local Alignment Problem tries to find
the longest path among paths between
arbitrary vertices (i,j) and (i’, j’) in the
edit graph.
Local vs. Global Alignment (cont’d)
Global Alignment
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | |||
|| | | | | ||||
|
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
Local Alignment—better alignment to find
conserved segment
tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local Alignments: Why?
Two genes in different species may be
similar over short conserved regions and
dissimilar over remaining regions.
Example:
 Homeobox genes have a short region
called the homeodomain that is highly
conserved between species.
 A global alignment would not find the
homeodomain because it would try to
align the ENTIRE sequence
The Local Alignment Problem
Goal: Find the best local alignment
between two strings
Input : Strings v, w and scoring matrix δ
Output : Alignment of substrings of v & w
whose alignment score is maximum
among all possible alignment of all
possible substrings
The Local Alignment Recurrence
The largest value of si,j over the whole edit
graph is the score of the best local
alignment.
The recurrence is shown below:
0
si,j = max
si-1,j-1 + δ (vi, wj)
s i-1,j + δ (vi, -)
s i,j-1 + δ (-, wj)
Notice there is only
this change from the
original recurrence of
a Global Alignment
Affine Gap Penalty Recurrences
si,j =
s i-1,j - σ
s i-1,j –(ρ+σ)
si,j =
Continue Gap in v (insertion)
s i,j-1 - σ
s i,j-1 –(ρ+σ) Start Gap in v (insertion):from middle
max
max
si,j =
max
Continue Gap in w (deletion)
Start Gap in w (deletion): from middle
Match or Mismatch
si-1,j-1 + δ (vi, wj)
End deletion: from top
s i,j
End insertion: from bottom
s i,j