O - BIOTEC - Biotechnology Center TU Dresden

Download Report

Transcript O - BIOTEC - Biotechnology Center TU Dresden

Sequence comparison
based on
Chapter 4
Lesk, Introduction to Bioinformatics
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Contents
Motivation
Sequence comparison and alignments
Dot plots
Dynamic programming
Substitution matrices
Dynamic programming: Local and global alignments
and gaps
BLAST
Significance of alignments
Multiple sequence alignments
By Michael Schroeder, Biotec
2
Motivation
From where are we?
Recent Africa vs. Multi-regional Hypothese
In 1999 Encephalitis caused by the West Nile Virus
broke out in New York. How did the virus come to
New York?
How did the nucleus get into the eucaryotic cells?
To answer such questions we will need sequence
comparison and phylogenetic trees
By Michael Schroeder, Biotec
3
Sequence alignment
Substitutions, insertions and deletions can be
interpreted in evolutionary terms
But: distinguish chance similarity and real biological
relationship
CCGTAA
TCGTAA
TTGTAA
TCGTAC
TAGTAC
By Michael Schroeder, Biotec
CCGTAT
TCGTAG
4
Evolution
Convergent evolution: same sequence evolved from
different ancestors
Back evolution - mutate to a previous sequence
CCGTAA
TCGTAA
TAGTAA
TAGTAC
By Michael Schroeder, Biotec
TCGTAC
TAGTAC
CCGTAT
TCGTAG
CCGTAA
5
Sequence alignments
Given two or more sequences, we wish to
Measure their similarity
Determine the residue-residue correspondences
Observe patterns of conservation and variability
Infer evolutionary relationships
By Michael Schroeder, Biotec
6
What is the best alignment?
-------gctgaacg
ctataatc------Without gaps:
gctgaacg
ctataatc
With gaps:
gctga-a--cg
--ct-ataatc
Another one:
gctg-aa-cg
-ctataatcFormally: The best alignments have only a minimal
number of mismatches (insertions, deletions,
replace)
We need a method to systematically explore and to
compute alignments
Uninformative:
By Michael Schroeder, Biotec
7
Scores for an alignment
Sequence Identity: Percentage of matches
Score each match, mismatch, gap opening, gap extension
[attg]
a
t
t
g
[accc]
a
c
Example
match +1
mismatch -1
Gap opening -3
Gap extension -1
Uninformative: 0%, score= -19
Without gaps: 25%, score= -4
With gaps: 0%, score= -19
Another one: 50%, score= -8
By Michael Schroeder, Biotec
-------gctgaacg
ctataatc------gctgaacg
ctataatc
gctga-a--cg
--ct-ataatc
gctg-aa-cg
-ctataatc8
Scores for an alignment
Sequence Identity: Percentage of matches
Score each match, mismatch, gap opening, gap extension
[attg]
a
t
t
g
[accc]
a
c
Example
match +2
mismatch -1
Gap opening -1
Gap extension -1
Uninformative: 0%, score= -15
Without gaps: 25%, score= -2
With gaps: 0%, score= -11
Another one: 50%, score= 5
By Michael Schroeder, Biotec
-------gctgaacg
ctataatc------gctgaacg
ctataatc
gctga-a--cg
--ct-ataatc
gctg-aa-cg
-ctataatc9
Dot plots
By Michael Schroeder, Biotec
10
Dot plots
A convenient way of comparing 2 sequences visually
Use matrix, put 1 sequence on X-axis, 1 on Y-axis
Cells with
identical characters filled with a ‘1’,
non-identical with ‘0’
(simplest scheme - could have weights)
By Michael Schroeder, Biotec
11
Dot plots
D O R O T H Y C R O W F O O T H O D G K I
N
D
O
R
O
T
H
Y
H
O
D
G
K
I
N
By Michael Schroeder, Biotec
12
Dot plots
D O R O T H Y C R O W F O O T H O D G K I
D D
O
O
O
R
R
O
O
O
T
T
H
H
Y
Y
H
H
O
O
O
D D
G
K
I
N
By Michael Schroeder, Biotec
N
D
O
O O
O
O
O O
O
R
T
H
H
O
O O
O
D
G
K
I
N
13
Interpreting dot plots
What do identical sequences look like?
What do unrelated sequences look like?
What do distantly related sequences look like?
What does reverse sequence look like?
Relevant for detections of stems in RNA structure
What does a palindrome look like?
Relevant for restriction enzymes
What do repeats look like?
What does a protein with domains A and B and another
one with domains B and C look like?
By Michael Schroeder, Biotec
14
Dot plot for identical sequences
D
D
O
R
O
T
H
H
O
D
O
D
G
K
I
N
D
O
R
O
O
O
O
R
O
O
T
T
H
H
Y
H
Y
H
H
O
D
Y
O
D
G
K
I
N
By Michael Schroeder, Biotec
O
H
O
D
G
K
I
N
15
Dotplot for unrelated sequences
D O R O T H Y H O D G K I
O
O
O
O
T
T
T
T
O
O
N
O
O
D D
D
I
I
E
T
T
E
R
By Michael Schroeder, Biotec
R
16
Dotplot for distantly related sequences
D
O
R
O
T
T
H
Y
H
O
D
G
K
I
N
T
I
I
M
O
O
O
T
O
T
H
H
Y
H
Y
J
E
N
N
K
K
I
I
N
N
By Michael Schroeder, Biotec
17
Dotplot for reverse sequences
Relevant to identify stems in RNA structures
Plot sequence against its reverse complement
By Michael Schroeder, Biotec
18
Dotplot for reverse sequences
D
O
R
O
T
H
Y
H
O
D
G
K
I
N
N
I
I
K
K
G
D
G
D
O
D
O
O
O
H
H
Y
H
Y
H
H
T
H
T
O
O
R
O
O
O
O
R
O
D
N
O
D
By Michael Schroeder, Biotec
19
Dotplot of a Palindrome
M
M
A
A
M
A
M
M
A
D
A
D
A
M
D
A
M
By Michael Schroeder, Biotec
A
M
20
Dotplot of repeats
T
T
W
E
T
W
T
T
W
E
N
T
Y
T
E
N
T
T
O
T
E
N
T
T
T
Y
T
W
W
N
T
Y
T
T
W
T
T
W
O
W
O
T
W
O
T
T
W
E
T
T
W
E
N
E
N
T
W
E
N
N
T
Y
T
T
Y
O
T
Y
O
N
E
E
T
E
W
T
N
W
Y
T
O
W
N
T
Y
T
E
T
N
N
E
By Michael Schroeder, Biotec
O
N
N
E
E
21
Dotplot of Repeats/Palindrome
M
M
A
A
M
A
M
A
A
A
D
A
M
A
M
M
A
A
D
A
M
A
M
I
A
M
M
M
M
I
M
A
M
A
D
A
A
D
A
M
M
D
A
M
I
M
D
M
D
A
M
By Michael Schroeder, Biotec
A
D
A
A
M
M
A
M
22
Dotplot for shared domain
D
D
O
O
O
T
H
Y
H
O
D
G
K
I
N
D
O
R
O
R
O
O
O
O
R
O
T
T
H
H
Y
H
Y
M
I
I
L
L
E
R
R
By Michael Schroeder, Biotec
23
Dot plot
Result
By Michael Schroeder, Biotec
dorothycrowfoothodgkin
d*
*
o * *
* ** *
r *
*
o * *
* ** *
t
*
*
h
*
*
y
*
h
*
*
o * *
* ** *
d*
*
g
*
k
*
i
*
n
*
24
Dotplots
Window size 15
Dot if
6 matches in window
By Michael Schroeder, Biotec
25
Window size 15
Dot if
6 matches in window
Cacain and
Caricain, two
proteases from
papaya
By Michael Schroeder, Biotec
26
Cacain and Caricain, two proteases from papaya
>gi|1942644|pdb|1MEG|
64
Length = 216
Crystal Structure Of A Caricain D158e Mutant In Complex With E-
Score = 271 bits (693), Expect = 1e-73
Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%)
Query: 1
Sbjct: 1
Query: 61
Sbjct: 61
IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60
+PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS
LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60
YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
+GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G
KT GV +VQP N+G
HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120
Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176
LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV
G YILIKNS
Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180
Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212
WGT WGE GYIRIKR GNS GVCGLY SS+YP KN
Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216
1
61
121
181
lpenvdwrkk
hgckggyppy
llnaiakqpv
wgtawgekgy
gavtpvrhqg
aleyvakngi
svvveskgrp
irikrapgns
By Michael Schroeder, Biotec
scgscwafsa
hlrskypyka
fqlykggife
pgvcglykss
vatveginki rtgklvelse qelvdcerrs
kqgtcrakqv ggpivktsgv grvqpnnegn
gpcgtkveha vtavgygksg gkgyilikns
yyptkn
27
Window size 15
Dot if
6 matches in window
Cacain and
Cruzain, a
protease from
Trypanosoma
cruzi
By Michael Schroeder, Biotec
28
Cacain and Cruzain, a protease from Trypanosoma
>gi|2624670|pdb|1AIM|
Fluoromethylketone
Length = 215
Cruzain Inhibited By Benzoyl-Tyrosine-Alanine-
Score = 121 bits (303), Expect = 3e-28
Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%)
Query: 2
Sbjct: 2
Query: 62
Sbjct: 62
PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61
P VDWR +GAVT VK+QG CGSCWAFSA+ +E
+
L
SEQ L+ CD+
PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61
GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115
GC+GG
+A + + Q
++ ++YPY
EG+
C +
A T V
Q
GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121
Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171
Q A
++ N PV+V + A+
+ Y GG+
+DH V VGY +
Y
Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178
Query: 172 LIKNSWGTGWGENGYIRIKRGT 193
+IKNSW T WGE GYIRI +G+
Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200
By Michael Schroeder, Biotec
29
Window size 15
Dot if
6 matches in window
Cacain and
Cathepsin, a
human
protease
By Michael Schroeder, Biotec
30
Cacain and Cathepsin, a human protease
gi|7546546|pdb|1EF7|B
Length = 242
Chain B, Crystal Structure Of Human Cathepsin X
Score = 52.0 bits (123), Expect = 2e-07
Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%)
Query: 1
Sbjct: 1
Query: 52
Sbjct: 61
IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51
+P+ DWR
V
+ +NQ
CGSCWA ++
+
I I+
S Q
LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60
ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103
++DC
C GG
S
Q+GI
Y+
+ C
+ K +
NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119
Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157
A +
+ +V Y
+ +
AN P+S + A +
Y GGI+
+
Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178
Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199
+H V+ G+
G Y +++NSWG WGE G++RI
K G G Y +
Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229
By Michael Schroeder, Biotec
31
Window size 5
Dot if
2 matches in window
Cacain and
Cathepsin, a
human
protease
By Michael Schroeder, Biotec
32
Window size 1
Dot if
1 match in window
Cacain and
Cathepsin, a
human
protease
By Michael Schroeder, Biotec
33
Dynamic programming
By Michael Schroeder, Biotec
34
From Dotplots to Alignments
DOROTHYCROWFOOTHODGKIN
DOROTHY--------HODGKIN
Obvious best alignment:
D O R O T
D
H
O W F
O O T
H O D G K I
D
O
O
O
R
O
N
D
R
O
O
O O
O
O
O O
O
R
O
T
T
H
T
H
Y
H
Y
H
H
O
D
Y C R
O
O
D
G
K
I
H
O
O O
O
D
G
K
I
N
By Michael Schroeder, Biotec
N
35
From Dotplots to Alignments
Find “best” path from top left corner to bottom right
Moving “east” corresponds to “-” in the second
sequence
Moving “south” corresponds to “-” in the first
sequence
Moving “southeast” corresponds to
a match (if the characters are the same) or
a mismatch (otherwise)
Can we automate this?
By Michael Schroeder, Biotec
36
From Dotplots to Alignments
Algorithm (Dynamic Programming):
Insert a row 0 and column 0 initialised with 0
Starting from the top left, move down row by row from row 1 and
right column by column from column 1 visiting each cell
Consider
The value of the cell north
The value of the cell west
The value of the cell northwest if the row/column character
mismatch
1 + the value of the cell northwest if the row/column
character match
Put down the maximum of these values as the value for the
current cell
Trace back the path with the highest values from the bottom right
to the top left and output the alignment
By Michael Schroeder, Biotec
37
From Dotplots to Alignments
0
1
T
2
G
3
C
4
A
5
T
6
A
0
1A
2T
3C
4T
5G
6A
7T
By Michael Schroeder, Biotec
38
From Dotplots to Alignments
0
0
0
1A
2T
3C
4T
5G
6A
7T
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
Insert a row 0 and column 0 initialised with 0
By Michael Schroeder, Biotec
39
From Dotplots to Alignments
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
0
0
0
0
0
0
0
0
By Michael Schroeder, Biotec
2
G
0
0
3
C
0
0
4
A
0
1
5
T
0
1
6
A
0
1
•Consider
•Value north
•Value west
•Value northwest if the row/column character mismatch
•1 + value northwest if the row/column character match
•Put down the maximum of these values for current celll
40
From Dotplots to Alignments
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
By Michael Schroeder, Biotec
41
Reading the Alignment
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
-tgcat-aat-c-tgat
By Michael Schroeder, Biotec
42
Reading the Alignment:
there are more than one possibility
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
2
G
3
C
4
A
5
T
6
A
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
1
1
1
2
2
2
0
0
1
2
2
2
2
2
0
1
1
2
2
2
3
3
0
1
2
2
3
3
3
4
1
2
2
3
3
4
4
---tgcata
atctg-atBy Michael Schroeder, Biotec
43
Formally:
Longest Common Subsequence LCS
What is the length s(V,W) of the longest common
subsequence of two sequences
V=v1..vn and W=w1..wm ?
Find sequences of indices
1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ m
such that vit = wjt for 1 ≤ t ≤ k
How? Dynamic programming:
si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and
si-1,j
si,j = max
si,j-1
si-1,j-1 + 1, if vi = wj
Then s(V,W) = sn,m is the length of the LCS
{
By Michael Schroeder, Biotec
44
Example LCS
0
1
T
2
G
3
C
4
A
5
T
6
A
0
1A
2T
3C
4T
5G
6A
7T
By Michael Schroeder, Biotec
45
Example LCS:
0
0
0
1A
2T
3C
4T
5G
6A
7T
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
Initialisation:
si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m
By Michael Schroeder, Biotec
46
Example LCS:
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
0
0
1
1
1
Computing each cell:
si,j = max
By Michael Schroeder, Biotec
{
si-1,j
si,j-1
si-1,j-1 + 1, if vi = wj
47
Example LCS:
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
Computing each cell:
si,j = max
By Michael Schroeder, Biotec
{
si-1,j
si,j-1
si-1,j-1 + 1, if vi = wj
48
LCS Algorithm
LCS(V,W)
For i = 0 to n
si,0 = 0
For j = 0 to m
s0,j = 0
For i = 1 to n
For j = 1 to m
Complexity:
LCS has quadratic complexity:
O(n m)
If vi = wj and si-1,j-1 +1 ≥ si-1,j and si-1,j-1 +1 ≥ si,j-1 Then
si,j = si-1,j-1 +1
bi,j = North West
Else if si-1,j ≥ si,j-1 Then
si,j = si-1,j
bi,j = North
Else
si,j = si,j-1
bi,j = West
Return s and b
By Michael Schroeder, Biotec
49
Printing the alignment of LCS
PRINT-LCS(b,V,i,j)
If i=0 or j=0 Then Return
If bi,j = North West Then
PRINT-LCS(V,b,i-1,j-1)
Print vi
Else if bi,j = North Then
PRINT-LCS(V,b,i-1,j)
Else
PRINT-LCS(V,b,i,j-1)
By Michael Schroeder, Biotec
50
Rewards/Penalities
We can use different schemes:
-1 for insert/delete/mismatch
+1 for match
…Consider
-1 + the value of the cell north
-1 + the value of the cell west
-1 + the value of the cell northwest if the row/column
character mismatch
+1 + the value of the cell northwest if the row/column
character match
Put down the maximum of these values as the value for
the current cell
By Michael Schroeder, Biotec
51
Reading the Alignment
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
-1
1
0
1
0
-1
1
-1
0
-1
0
2
1
0
-1
-1
1
0
1
1
0
1
0
0
0
0
2
1
0
2
1
1
0
1
3
1
1
1
0
0
1
2
---tgcata
atctg-atBy Michael Schroeder, Biotec
52
Rewards/Penalities
Let’s refine the schemes:
Transition mutations are more common:
purine<->purine, a<->g
pyrimidine<->pyrimidine, t<->c
Transversions (purine<->pyrimidine) are less
common
Use a subsitutation matrix to rate mismatches:
A
A
G C
2 -2 0 -2
T -2 2 -2 0
G
-2 for insert/delete
Mismatch/match according to substitution matrix
T
0 -2 2 -2
C -2 0 -2 2
…Consider
-2 + the value of the cell north
-2 + the value of the cell west
Corresponding value of the substion matrix
+ the value of the cell northwest
Put down the maximum of these values as the
value for the current cell
By Michael Schroeder, Biotec
53
Reading the Alignment
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
-2
2
0
2
0
-2
2
0
0
0
0
4
2
0
-2
0
2
0
2
2
2
2
0
0
0
0
4
2
0
4
2
2
0
2
6
2
2
2
0
2
2
4
---tgcata
atctg-atBy Michael Schroeder, Biotec
54
Substitution matrixes
By Michael Schroeder, Biotec
55
How to derive a substitution matrix for
amino acids?
Amino acids can be classified by physiochemical
properties
Hydrophobic
A
G
P
I
L
Polar
S
T
N
Q
V
Acidic
D
E
Y
C
W
M
F
By Michael Schroeder, Biotec
K
H
R Basic
Aromatic
56
Cys
Ser
Thr
Pro
Ala
Gly
Asn
Asp
Glu
Gln
His
Arg
Lys
Met
Ile
Leu
Val
Phe
Tyr
Trp
12
0
-2
-3
-2
-3
-4
-5
-5
-5
-3
-4
-5
-5
-2
-6
-2
-4
0
-8
C
PAM 250 matrix
2
1
1
1
1
1
0
0
-1
-1
0
0
-2
-1
-3
-1
-3
-3
-2
S
3
0
1
0
0
0
0
-1
-1
-1
0
-1
0
-2
0
-3
-3
-5
T
6
1
-1
-1
-1
-1
0
0
0
-1
-2
-2
-3
-1
-5
-5
-6
P
2
1
0
0
0
0
-1
-2
-1
-1
-1
-2
0
-4
-3
-6
A
5
0
1
0
-1
-2
-3
-2
-3
-3
-4
-1
-5
-5
-7
G
By Michael Schroeder, Biotec
2
2
1
1
2
0
1
-2
-2
-3
-2
-4
-2
-4
N
4
3
2
1
-1
0
-3
-2
-4
-2
-6
-4
-7
D
>0, likely mutation
0, random mutation
<0, unlikely
4
2
1
-1
0
-2
-2
-3
-2
-5
-4
-7
E
4
3
1
1
-1
-2
-2
-2
-5
-4
-5
Q
6
2
0
-2
-2
-2
-2
-2
0
-3
H
6
3
0
-2
-3
-2
-4
-4
2
R
5
0 6
-2 2 5
-3 4 2
-2 2 4
-5 0 1
-4 -2 -1
-3 -4 -5
K M I
6
2 4
2 -1 9
-1 -2 7 10
-2 -6 0 0 17
L V F Y W
57
Cys
12
0
-2
-3
-2
-3
-4
-5
-5
-5
-3
-4
-5
-5
-2
-6
-2
-4
0
-8
Ser
0
2
1
1
1
1
1
0
0
-1
-1
0
0
-2
-1
-3
-1
-3
-3
-2
Thr
-2
1
3
0
1
0
0
0
0
-1
-1
-1
0
-1
0
-2
0
-3
-3
-5
Pro
-3
1
0
6
1
-1
-1
-1
-1
0
0
0
-1
-2
-2
-3
-1
-5
-5
-6
Ala
-2
1
1
1
2
1
0
0
0
0
-1
-2
-1
-1
-1
-2
0
-4
-3
-6
Gly
-3
1
0
-1
1
5
0
1
0
-1
-2
-3
-2
-3
-3
-4
-1
-5
-5
-7
Asn
-4
1
0
-1
0
0
2
2
1
1
2
0
1
-2
-2
-3
-2
-4
-2
-4
Asp
-5
0
0
-1
0
1
2
4
3
2
1
-1
0
-3
-2
-4
-2
-6
-4
-7
Glu
-5
0
0
-1
0
0
1
3
4
2
1
-1
0
-2
-2
-3
-2
-5
-4
-7
Gln
-5
-1
-1
0
0
-1
1
2
2
4
3
1
1
-1
-2
-2
-2
-5
-4
-5
His
-3
-1
-1
0
-1
-2
2
1
1
3
6
2
0
-2
-2
-2
-2
-2
0
-3
Arg
-4
0
-1
0
-2
-3
0
-1
-1
1
2
6
3
0
-2
-3
-2
-4
-4
2
Lys
-5
0
0
-1
-1
-2
1
0
0
1
0
3
5
0
-2
-3
-2
-5
-4
-3
Met
-5
-2
-1
-2
-1
-3
-2
-3
-2
-1
-2
0
0
6
2
4
2
0
-2
-4
Ile
-2
-1
0
-2
-1
-3
-2
-2
-2
-2
-2
-2
-2
2
5
2
4
1
-1
-5
Leu
-6
-3
-2
-3
-2
-4
-3
-4
-3
-2
-2
-3
-3
4
2
6
2
2
-1
-2
Val
-2
-1
0
-1
0
-1
-2
-2
-2
-2
-2
-2
-2
2
4
2
4
-1
-2
-6
Phe
-4
-3
-3
-5
-4
-5
-4
-6
-5
-5
-2
-4
-5
0
1
2
-1
9
7
0
Tyr
0
-3
-3
-5
-3
-5
-2
-4
-4
-4
0
-4
-4
-2
-1
-1
-2
7
10
0
Trp
-8
-2
-5
-6
-6
-7
-4
-7
-7
-5
-3
2
-3
-4
-5
-2
-6
0
0
17
Average
StDev
C
S
T
P
A
G
N
D
E
Q
H
R
K
M
I
L
V
F
Y
W
-2.8
-0.5
-0.7
-1.2
-0.9
-1.6
-0.7
-1.1
-1.1
-0.8
-0.3
-0.7
-0.9
-0.8
-0.8
-1.4
-0.8
-1.9
-1.5
-3.1
4
1.5
1.7
2.6
1.9
2.7
2.1
3
2.8
2.6
2.3
2.6
2.5
2.6
2.4
3
2.3
4.1
3.8
5.4
By Michael Schroeder, Biotec
58
Cys
12
0
-2
-3
-2
-3
-4
-5
-5
-5
-3
-4
-5
-5
-2
-6
-2
-4
0
-8
Ser
0
2
1
1
1
1
1
0
0
-1
-1
0
0
-2
-1
-3
-1
-3
-3
-2
Thr
-2
1
3
0
1
0
0
0
0
-1
-1
-1
0
-1
0
-2
0
-3
-3
-5
Pro
-3
1
0
6
1
-1
-1
-1
-1
0
0
0
-1
-2
-2
-3
-1
-5
-5
-6
Ala
-2
1
1
1
2
1
0
0
0
0
-1
-2
-1
-1
-1
-2
0
-4
-3
-6
Gly
-3
1
0
-1
1
5
0
1
0
-1
-2
-3
-2
-3
-3
-4
-1
-5
-5
-7
Asn
-4
1
0
-1
0
0
2
2
1
1
2
0
1
-2
-2
-3
-2
-4
-2
-4
Asp
-5
0
0
-1
0
1
2
4
3
2
1
-1
0
-3
-2
-4
-2
-6
-4
-7
Glu
-5
0
0
-1
0
0
1
3
4
2
1
-1
0
-2
-2
-3
-2
-5
-4
-7
Gln
-5
-1
-1
0
0
-1
1
2
2
4
3
1
1
-1
-2
-2
-2
-5
-4
-5
His
-3
-1
-1
0
-1
-2
2
1
1
3
6
2
0
-2
-2
-2
-2
-2
0
-3
Arg
-4
0
-1
0
-2
-3
0
-1
-1
1
2
6
3
0
-2
-3
-2
-4
-4
2
Lys
-5
0
0
-1
-1
-2
1
0
0
1
0
3
5
0
-2
-3
-2
-5
-4
-3
Met
-5
-2
-1
-2
-1
-3
-2
-3
-2
-1
-2
0
0
6
2
4
2
0
-2
-4
Ile
-2
-1
0
-2
-1
-3
-2
-2
-2
-2
-2
-2
-2
2
5
2
4
1
-1
-5
Leu
-6
-3
-2
-3
-2
-4
-3
-4
-3
-2
-2
-3
-3
4
2
6
2
2
-1
-2
Val
-2
-1
0
-1
0
-1
-2
-2
-2
-2
-2
-2
-2
2
4
2
4
-1
-2
-6
Phe
-4
-3
-3
-5
-4
-5
-4
-6
-5
-5
-2
-4
-5
0
1
2
-1
9
7
0
Tyr
0
-3
-3
-5
-3
-5
-2
-4
-4
-4
0
-4
-4
-2
-1
-1
-2
7
10
0
Trp
-8
-2
-5
-6
-6
-7
-4
-7
-7
-5
-3
2
-3
-4
-5
-2
-6
0
0
17
Average
StDev
C
S
T
P
A
G
N
D
E
Q
H
R
K
M
I
L
V
F
Y
W
-2.8
-0.5
-0.7
-1.2
-0.9
-1.6
-0.7
-1.1
-1.1
-0.8
-0.3
-0.7
-0.9
-0.8
-0.8
-1.4
-0.8
-1.9
-1.5
-3.1
4
1.5
1.7
2.6
1.9
2.7
2.1
3
2.8
2.6
2.3
2.6
2.5
2.6
2.4
3
2.3
4.1
3.8
5.4
By Michael Schroeder, Biotec
59
PAM 250: Interpretation
Immutable:
Cysteine (Avg=-2.8): known to have several unique,
indispensable functions
attachment site of heme group in cytochrome and of iron
sulphur FeS in ferredoxins
Cross links in proteins such as chymotrypsin or ribonuclease
Seldom without unique function
Glycine (Avg=-1.6): small size maybe advantageous
Mutable:
Serine often functions in active site, but can be easily replaced
Self-alignment:
Tryptophan with itself scores very high, as W occurs rarely
By Michael Schroeder, Biotec
60
Point Accepted Mutations PAM
Substitution matrix using explicit evolutionary model of how
amino acids change over time
Use parsimony method to determine frequency of mutations
Entry in PAM matrix: Likelihood ratio for residues a and b: Probability
a-b is a mutation / probability a-b is chance
PAM x: Two sequences V, W have evolutionary distance of x PAM if
a series of accepted point mutations (and no insertions/deletions)
converts V into W averaging to x point mutation per 100 residues
Mutations here = mutations in the DNA
Because of silent mutations and back mutations n can be >100
PAM 250 most commonly used
By Michael Schroeder, Biotec
61
PAM and Sequence Similarity
PAM
0
30
80
110
200
250
% identiy
100
75
60
50
25
20
By Michael Schroeder, Biotec
62
PAM
Dayhoff, Eck, Park: A model of evolutionary change in proteins,
1978
Accepted point mutation = substitution of an amino acid
accepted by natureal selection
Assumption: X replacing Y as likely as Y replacing X
Used cytochrome c, hemoglobin, myoglobin, virus coat
proteins, chymotrypsinogen, glyceraldehyde 3-phosphate
dehrydogenase, clupeine, insulin, ferredoxin
Sequences which are too distantly related have been
omitted as they are more likely to contain multiple mutations
per site
By Michael Schroeder, Biotec
63
PAM: Step 1
Step 1: Construct a multiple alignment
Example
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
By Michael Schroeder, Biotec
64
PAM: Step 2
Create a phylogenetic tree (parsimony method)
ACGCTAFKI
A->G
I->L
GCGCTAFKI
A->G
GCGCTGFKI
By Michael Schroeder, Biotec
ACGCTAFKL
A->L
GCGCTLFKI
C->S
ASGCTAFKL
G->A
ACACTAFKL
65
PAM: Step 3
Relative mutability mi
Probability that residue i will mutate
Relative mutability depends on
Substatutability: Relative
ri mutability should increase with
increasing substatutability
Residue probability: Relative mutability should decrease
with increasing residue probability
From relative mutability final PAM score is derived
By Michael Schroeder, Biotec
66
BLOSUM
Different approach to PAM
BLOcks SUbstitution Matrix (based on BLOCKS
database)
Generation of BLOSUM x
Group highly similar sequences and replace them by a
representative sequences.
Only consider sequences with no more than x % similarity
Align sequences (no gaps)
For any pair of amino acids a,b and for all columns c of the
alignment, let q(a,b) be the number of co-occurrences of a,b in all
columns c.
Let p(a) be the overall probability of a occurring
BLOSUM entry for a,b is log2 ( q(a,b) / ( p(a)*p(b) ) )
BLOSUM 50 and BLOSUM 62 widely used
By Michael Schroeder, Biotec
67
LCS Algorithm (Longest Common
Subsequence) Revisited
Algorithm (Dynamic Programming) with Substitution Matrix:
Insert a row 0 and column 0 initialised with 0
Starting from the top left, move down row by row from row 1 and
right column by column from column 1 visiting each cell
Consider
The value of the cell north
The value of the cell west
The value of the cell northwest if the row/column character
mismatch
s + the value of the cell northwest, where s is the value
in the subsitution matrix for the residues in row/column
Put down the minimum of these values as the value for the
current cell
Trace back the path with the highest values from the bottom right
to the top left and output the alignment
By Michael Schroeder, Biotec
68
LCS Revisited: Formally
What is the length s(V,W) of the longest common subsequence
of two sequences
V=v1..vn and W=w1..wm ?
Find sequences of indices
1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ m
such that vit = wjt for 1 ≤ t ≤ k
How? Dynamic programming:
si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and
si-1,j
si,j = max
si,j-1
si-1,j-1 + t, where t is the value for vi and
wj in
the substitution matrix
Then s(V,W) = sn,m is the length of the LCS
{
By Michael Schroeder, Biotec
69
Dynamic programming revisited:
local and global alignments and gap
By Michael Schroeder, Biotec
70
Evolution and Alignments
Alignments can be interpreted in evolutionary terms
Identical letters are aligned.
Interpretation: part of the same ancestral sequence
and not changed
Non-identical letters are aligned (substitution)
Interpretation: Mutation
Gaps
Interpretation: Insertions and deletions (indels)
By Michael Schroeder, Biotec
71
Evolution and Alignments
Specific problems aligning DNA:
“Frame shift”:
DNA triplets code amino acids
Indel of one nucleotide shifts the whole sequence of
triplets
Thus may have a global effect and change all coded
amino acids
Silent mutation:
Substitution in DNA leaves transcribed amino acid
unchanged
Non-sense mutation:
Substitution to stop-codon
By Michael Schroeder, Biotec
72
Local and Global Alignments
Global alignment (Needleham-Wunsch) algorithm finds
overall best alignment
Example: members of a protein family, e.g. globins are very
conserved and have the same length in different organisms from
fruit fly to humans
Local alignment (Smith-Waterman) algorithm finds locally best
alignment
most widely used, as
e.g. genes from different organisms retain similar
exons, but may have different introns
e.g. homeobox gene, which regulates embryonic
development occurs in many species, but very different
apart from one region called homeodomain
e.g. proteins share some domains, but not all
By Michael Schroeder, Biotec
73
Local Alignment
LCS s(V,W) computes globally best alignment
Often it is better to maximise locally, i.e. compute
maximal s(vi…vi’ , wj… wi’ ) for all substrings of V
and W
Can we adapt algorithm?
Global alignment = longest path in matrix s from (0,0)
to (n,m)
Local alignment = longest path in matrix s from any
(i,j) to any (i’,j’)
Modify definition of s adding vertex of weight 0 from
source to every other vertex, creating a free “jump” to
any starting position (i,j)
By Michael Schroeder, Biotec
74
Local Alignment
Modify the definition of s as follows:
si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and
0
si-1,j
si,j = max
si,j-1
si-1,j-1 + t, where t is the value for vi wj
in the substitution matrix
Then s(V,W) = max { si,j } is the length of the local LCS
{
This computes longest path in edit graph
Several local alignment may have biological
significance (consider e.g. two multi-domain proteins
whose domains are re-ordered
By Michael Schroeder, Biotec
75
Aligning with Gap Penalties
Gap is sequence of spaces in alignment
So far, we consider only insertion and deletion of single
nucleotides or amino acids creating alignments with many gaps
So far, score of a gap of length l is l
Because insertion/deletion of monomers is evolutionary slow
process, large numbers of gaps do not make sense
Instead whole substrings will be deleted or inserted
We can generalise score of a gap to a score function A + B l,
where A is the penalty to open the gap and B is the penalty
to extend the gap
By Michael Schroeder, Biotec
76
Aligning with Gap Penalties
High gap penalties result in shorter, lower-scoring
alignments with fewer gaps and
Lower gap penalties give higher-scoring, longer
alignments with more gaps
Gap opening penalty A mainly influences number
of gaps
Gap extension penalty B mainly influences length
of gaps
E.g. if interested in close relationships, then choose
A, B above default values, for distant relationships
decrease default values
By Michael Schroeder, Biotec
77
Aligning with Gap Penalties
Adapt the definition of s as follows:
s-deli,j = max
s-insi,j = max
{
{
s-deli-1,j - B
si-1,j – (A+B)
s-insi,j-1 - B
si,j-1 – (A+B)
0
s-deli,j
si,j = max
s-insi,j
si-1,j-1 + t, where t is the value for vi, wj
in the substitution matrix
Then s(V,W) = max { si,j } is the length of the local LCS
with gap penalties A and B
{
By Michael Schroeder, Biotec
78
FASTA and BLAST
By Michael Schroeder, Biotec
79
Motivation
As in dotplots, the underlying data structure for dynamic
programming is a table
Given two sequences of length n dynamic programming
takes time proportional to n2
Given a database with m sequences, comparing a query
sequence to the whole database takes time proportional to m
n2
What does this mean?
Imagine you need to fill in the tables by hand and it takes 10
second to fill in one cell
Assume there are 1.000.000 sequences each 100 amino acids
long
How long does it take?
By Michael Schroeder, Biotec
80
1.000.000 x 100 x 100 x 10 sec = 1011 sec =
27.777.778h = 1157407days = 3170 years
Even if a computer does not take 10 sec, but just
0.1ms to fill in one cell, it would still be 12 days.
We cannot do something about the database size,
but can we do something about the table size?
By Michael Schroeder, Biotec
81
An idea: Prune the search space
By Michael Schroeder, Biotec
82
Another idea
Did we formulate the
problem correctly?
Do we need the alignments
for all sequences in the
database?
No, only for “reasonable”
hits  introduce a
threshold
A “reasonable” alignment
will contain short stretches
of perfect matches
Find these first, then
extend them to connect
them as best possible
By Michael Schroeder, Biotec
83
FASTA and BLAST
FASTA and BLAST faster than dynamic
programming (5 times and 50 times respectively)
Underlying idea for a heuristic:
High-scoring alignments will contain short stretches
of identical letters, called words
FASTA and BLAST first search for matches of words of
a given length and score threshold:
BLAST for words of length 3 for proteins and 11 for
DNA
FASTA for words of length 2 for proteins and 6 for
DNA
Next, matches are extended to local (BLAST) and
global (FASTA) alignments
By Michael Schroeder, Biotec
84
FASTA and BLAST
More formally:
If the strings V=v1..vm and W=w1..wm match with at
most k mismatches, then they share an p-tuple for
p = m/(k+1),
i.e. vi..vi+l-1 =wj..wj+l-1 for some 1 ≤ i,j ≤ m-p+1
FILTRATION ALGORITHM, which detects all
matching words of length m with up to k mismatches
Potential match detection: Find all matches of p-tuples
of V,W (can be done in linear time by inserting them into
a hash table)
Potential match verification: Verify each potential
match by extending it to the left and right until either the
first k+1 mismatches are found or the beginning or end
of the sequences are found
By Michael Schroeder, Biotec
85
Example for BLAST
Search SWISSPROT for Immunoglobulin:
SWISS_PROT:C79A_HUMAN P11912
By Michael Schroeder, Biotec
86
Example for BLAST
Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for P11912
Database: All non-redundant SwissProt sequences
1,292,592 sequences; 412,925,052 total letters
By Michael Schroeder, Biotec
87
Example for BLAST
Distribution of Hits:
By Michael Schroeder, Biotec
88
Example for BLAST: Top Hits
Score E Sequences producing significant alignments:
Score
gi|547896|sp|P11912|C79A_HUMAN B-cell antigen receptor comp...
473
gi|728993|sp|P40293|C79A_BOVIN B-cell antigen receptor comp...
312
gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor comp...
278
gi|728994|sp|P40259|C79B_HUMAN B-cell antigen receptor comp...
55
gi|125781|sp|P01618|KV1_CANFA IG KAPPA CHAIN V REGION GOM
38
gi|125361|sp|P17948|VGR1_HUMAN Vascular endothelial growth ...
37
gi|549319|sp|P35969|VGR1_MOUSE Vascular endothelial growth ...
36
gi|114764|sp|P15530|C79B_MOUSE B-cell antigen receptor comp...
36
gi|1718161|sp|P53767|VGR1_RAT Vascular endothelial growth f...
35
gi|125735|sp|P01681|KV01_RAT Ig kappa chain V region S211
35
gi|1730075|sp|P01625|KV4A_HUMAN IG KAPPA CHAIN V-IV REGION LEN 34
gi|1718188|sp|P52583|VGR2_COTJA Vascular endothelial growth...
33
gi|125833|sp|P06313|KV4B_HUMAN IG KAPPA CHAIN V-IV REGION J... 33
gi|125806|sp|P01658|KV3F_MOUSE IG KAPPA CHAIN V-III REGION ...
33
gi|125808|sp|P01659|KV3G_MOUSE IG KAPPA CHAIN V-III REGION ...
33
gi|1172451|sp|Q05793|PGBM_MOUSE Basement membrane-specific ...
33
gi|125850|sp|P01648|KV5O_MOUSE Ig kappa chain V-V region HP...
33
gi|125830|sp|P06312|KV40_HUMAN Ig kappa chain V-IV region p...
33
gi|2501738|sp|Q06639|YD03_YEAST Putative 101.7 kDa transcri...
33
By Michael Schroeder, Biotec
E-Value
e-133
3e-85
5e-75
1e-07
0.019
0.042
0.052
0.064
0.080
0.095
0.26
0.28
0.30
0.30
0.30
0.33
0.36
0.38
0.41
89
Example for BLAST: Alignment
>gi|126779|sp|P11911|C79A_MOUSE
B-cell antigen receptor complex
associated protein alpha-chain
precursor (IG-alpha) (MB-1 membrane glycoprotein)
(Surface-IGM-associated protein) (Membrane-bound
immunoglobulin associated protein) (CD79A)
Length = 220
Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)
Query: 1
Sbjct: 1
Query: 61
Sbjct: 56
MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
MPGG
+ LL LS
LGPGCQAL +
P SL V+LGE+A
C N+
MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55
NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
N N+TWW L N TWPP LGPG+
G L
VNK+ G
C+V E N
++SCG
NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114
Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D D
Sbjct: 115
TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174
Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
+YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175
DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220
By Michael Schroeder, Biotec
90
Example for BLAST
Lineage Report
root
. cellular organisms
. . Eukaryota
[eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria
[animals]
. . . . . Coelomata
[animals]
. . . . . . Gnathostomata
[vertebrates]
. . . . . . . Tetrapoda
[vertebrates]
. . . . . . . . Amniota
[vertebrates]
. . . . . . . . . Eutheria
[mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------. . . . . . . . . . Bos taurus (bovine) .....................
. . . . . . . . . . Mus musculus (mouse) ....................
. . . . . . . . . . Canis familiaris (dogs) .................
. . . . . . . . . . Rattus norvegicus (brown rat) ...........
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) .
. . . . . . . . . Coturnix japonica ------------------------. . . . . . . . . Gallus gallus (chickens) ..................
. . . . . . . . Xenopus laevis (clawed frog) ---------------. . . . . . . Heterodontus francisci -----------------------. . . . . . Drosophila melanogaster ------------------------. . . . . Caenorhabditis elegans ---------------------------. . . . Saccharomyces cerevisiae (brewer's yeast) ----------. . . Marchantia polymorpha --------------------------------. . Agrobacterium tumefaciens str. C58 ---------------------. Human adenovirus type 3 ----------------------------------. Human adenovirus type 7 ...................................
By Michael Schroeder, Biotec
473 33 hits [mammals]
312 2 hits [mammals]
278 31 hits [mammals]
37 1 hit [mammals]
35 7 hits [mammals]
29 1 hit [mammals]
33 2 hits [birds]
31 4 hits [birds]
30 2 hits [amphibians]
28 1 hit [sharks and rays]
30 2 hits [flies]
29 1 hit [nematodes]
33 1 hit [ascomycetes]
29 1 hit [liverworts]
28 1 hit [a-proteobacteria]
30 1 hit [viruses]
30 1 hit [viruses]
91
How good is an alignment?
Be careful: Fitch/Smith found 17 alignments for alpha- and
beta-chains in chicken haemoglobins
Only one is the correct one (according to the structure)
Given an alignment, how good is it :
Percentage of matching residues, i.e. number of matches divided
by length of smallest sequence
Advantage: independent of sequence length
E.g.
AT–C –TGAT
4/6 = 66.67%
–TGCAT –A–
More general: also consider gaps, extensions,…
By Michael Schroeder, Biotec
92
Blast Raw Score
R = a I + b X - c O - d G, where
I is the number of identities in the alignment and a is
the reward for each identity
X is the number of mismatches in the alignment and
b is the “reward” for each mismatch
O is the number of gaps and c is the penalty for each
gap
G is the number of “-” characters in the alignment
and d is the penalty for each
The values for a,b,c,d appear at the bottom of a Blast
report. For BLASTn they are a=1, b=-3, c=5, d=2
By Michael Schroeder, Biotec
93
Example
Query: 1 atgctctggccacggcacttgcgga
||||||||||||||| |||| |||
Sbjt:107 atgctctggccacggatcttgtgga
tcccagggtgatctgtgcacctgcgata 53
|||||
|||| |||||||||||||||
tccca---tgatatgtgcacctgcgata 156
R = 1 x 46 + -3 x 4 - 5 x 1 - 2 x 3 = 23
So, given the scores: how significant is the
alignment?
By Michael Schroeder, Biotec
94
Significance of an alignment
Significance of an alignment needs to be defined with respect
to a control population
Pairwise alignment: How can we get control population?
Generate sequences randomly?  Not a good model of real
sequences
Chop up both sequences and randomly reassemble them
Database search: How can we get control population?
Control = whole database
Align sequence to control population and see how good result
is in comparison
This is captured by Z scores, P-values and E-values
By Michael Schroeder, Biotec
95
Z-score
Z-score normalises the score S:
Let m be mean of population and std its standard
deviation, then Z-score = (S – m) / std
Z-score of 0  no better than average, hence might
have occurred by chance
The higher the Z-score the better
By Michael Schroeder, Biotec
96
P-value
P-value: probability of obtaining a score ≥ S
Range: 0 ≤ P ≤ 1
Let m be the number of sequences in the control
population with score ≥ S
Let p be the size of the control population
Then P-value = m / p
Rule of thumb:
P ≤ 10-100 exact match,
10-100 ≤ P ≤ 10-50 nearly identical (SNPs)
10-50 ≤ P ≤ 10-10 homology certain
10-5 ≤ P ≤ 10-1 usually distant relative
P > 10-1 probably insignificant
By Michael Schroeder, Biotec
97
E-values
E-value takes also the database into account
E-value = expected frequency of a score ≥ S
Range: 0 ≤ E ≤ m, where m is the size of the database
Relationship to P: E = m P
E values are calculated from
the bit score
the length of the query
the size of the database
By Michael Schroeder, Biotec
98
Precision and Recall
How good are BLAST and FASTA?
True positives, tp = hits which are biologically meaningful
False positives, fp = hits which are not biologically meaningful
True negatives, tn = non-hits which are not biologically meaningful
False negatives, fn = non-hits which are biologically meaningful
Minimise fp and fn
Recall:
tp/(tp+fn)
Precision: tp/(tp+fp)
(meaningful hits / all meaningful)
(meaningful hits / all hits)
But: since no objective data available difficult to judge BLAST
and FASTA’s sensitivity and specificity
By Michael Schroeder, Biotec
99
Multiple Sequence Alignments
By Michael Schroeder, Biotec
100
Multiple Sequence Alignment
Align more than two sequences
Choice of sequences
If too closely related then large redundant
If very distantly related then difficult to generate good alignment
Additionally use colour for residues with similar properties
Yellow
Green
Small polar
Hydrophobic
Magenta
Red
Blue
Polar
Negatively charged
Positively charged
By Michael Schroeder, Biotec
GLy,Ala,Ser,Thr
Cys,Val,Ile,Leu,
Pro,Phe,Tyr,Met,Trp
Asn,Gln,His
Asp,Glu
Lys, Arg
101
Thioredoxins: WCGPC[K or R] motif
By Michael Schroeder, Biotec
102
Thioredoxins: Gly/Pro = turn
By Michael Schroeder, Biotec
103
Thioredoxins: every second
hydrophobic = beta strand
By Michael Schroeder, Biotec
104
Thioredoxins:
ca. every 4th hydrophobic = alpha helix
By Michael Schroeder, Biotec
105
By Michael Schroeder, Biotec
106
Summary
Evolutionary model: Indels and substitutions
Homologues vs. similarity
Dot plots
Easy visual exploration, but not scalable
Dynamic programming
Local, global, gaps
Substitution matrices (PAM, BLOSUM)
BLAST and FASTA
Scores and significance
Multiple Sequence Alignments
By Michael Schroeder, Biotec
107