Transcript O - Biotec

Sequence comparison
and Phylogeny
based on
Chapter 4
Lesk, Introduction to Bioinformatics
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Contents
 Motivation
 Sequence comparison and alignments
 Dot plots
 Dynamic programming
 Substitution matrices
 Dynamic programming: Local and global alignments
and gaps
 BLAST
 Significance of alignments
 Multiple sequence alignments
 Phylogenetic trees
By Michael Schroeder, Biotec, 2004
2
Motivation
 From where are we?
 Recent Africa vs. Multi-regional Hypothese
 In 1999 Encephalitis caused by the West Nile Virus
broke out in New York. How did the virus come to
New York?
 How did the nucleus get into the eucaryotic cells?
 To answer such questions we will need sequence
comparison and phylogenetic trees
By Michael Schroeder, Biotec, 2004
3
Sequence Similarity Searches
 Sequence similarity can be clue to common evolutionary
ancestor…
 E.g. globin genes in chimpanzees and humans
 … or common function
 E.g. v-sys onco genes in simian sarcoma virus leading to cancer
in monkeys and the seemingly unrelated growth stimulating
hormone PDGF, which stimulates cell growth (first success of
similarity idea, 1983)
 In general:
 If an unknown sequences is found, deduce its
function/structure indirectly by finding similar sequences, whose
function/structure is known
 Assumption: Evolution changes sequences “slowly” often
maintaining main features of a sequence’s function/structure
By Michael Schroeder, Biotec, 2004
4
Sequence alignment
 Substitutions, insertions and deletions can be
interpreted in evolutionary terms
 But: distinguish chance similarity and real biological
relationship
CCGTAA
TCGTAA
TCGTAC
TTGTAA
TAGTAC
By Michael Schroeder, Biotec, 2004
CCGTAT
TCGTAG
5
Evolution
 Convergent evolution: same sequence evolved from
different ancestors
 Back evolution - mutate to a previous sequence
CCGTAA
TCGTAA
TAGTAA
TAGTAC
TCGTAC
TAGTAC
By Michael Schroeder, Biotec, 2004
CCGTAT
TCGTAG
CCGTAA
6
Similarity vs. Homology
 Any sequence can be similar
 Sequences homologues if evolved from common
ancestor
 Homologous sequences:
 Orthologs: similar biological function
 Paralogs: different biological function (after gene
duplication), e.g. lysozyme and α-lactalbumin, a
mammalian regulatory protein
 Assumption: Similarity indicator for homology
 Note, altered function of the expressed protein will
determine if the organism will survive to reproduce, and
hence pass on the altered gene
By Michael Schroeder, Biotec, 2004
7
Sequence alignments
 Given two or more sequences, we wish to
 Measure their similarity
 Determine the residue-residue correspondences
 Observe patterns of conservation and variability
 Infer evolutionary relationships
By Michael Schroeder, Biotec, 2004
8
What is the best alignment?
 Uninformative:
-------gctgaacg
ctataatc------ Without gaps:
gctgaacg
ctataatc
 With gaps:
gctga-a--cg
--ct-ataatc
 Another one:
gctg-aa-cg
-ctataatc Formally: The best alignment have only a minimal
number of mismatches (insertions, deletions,
replace)
 We need a method to systematically explore and to
compute alignments
By Michael Schroeder, Biotec, 2004
9
Scores for an alignment
 Percentage of matches
 Score each match, mismatch, gap opening, gap extension
 Example




match +1
mismatch -1
Gap opening -3
Gap extension -1
 Uninformative: 0%, score= -21
 Without gaps:25%,score= -4
 With gaps: 0%, score= -23
 Another one: 50%, score=-12
By Michael Schroeder, Biotec, 2004
-------gctgaacg
ctataatc------gctgaacg
ctataatc
gctga-a--cg
--ct-ataatc
gctg-aa-cg
-ctataatc-
10
Scores for an alignment
 Percentage of matches
 Score each match, mismatch, gap opening, gap extension
 Example




match +2
mismatch -1
Gap opening -1
Gap extension -1
 Uninformative: 0%, score= -17
 Without gaps:25%,score= -2
 With gaps: 0%, score= -15
 Another one: 50%, score=0
By Michael Schroeder, Biotec, 2004
-------gctgaacg
ctataatc------gctgaacg
ctataatc
gctga-a--cg
--ct-ataatc
gctg-aa-cg
-ctataatc-
11
Dot plots
By Michael Schroeder, Biotec, 2004
12
Dot plots
 A convenient way of comparing 2 sequences visually
 Use matrix, put 1 sequence on X-axis, 1 on Y-axis
 Cells with
 identical characters filled with a ‘1’,
 non-identical with ‘0’
 (simplest scheme - could have weights)
By Michael Schroeder, Biotec, 2004
13
Dot plots
D O R O T H Y C R O W F O O T H O D G K I
N
D
O
R
O
T
H
Y
H
O
D
G
K
I
N
By Michael Schroeder, Biotec, 2004
14
Dot plots
D O R O T H Y C R O W F O O T H O D G K I
D D
O
O
O
R
R
O
O
O
T
T
H
H
Y
Y
H
H
O
O
O
D D
G
K
I
N
By Michael Schroeder, Biotec, 2004
N
D
O
O O
O
O
O O
O
R
T
H
H
O
O O
O
D
G
K
I
N
15
Interpreting dot plots
 What do identical sequences look like?
 What do unrelated sequences look like?
 What do distantly related sequences look like?
 What does reverse sequence look like?
 Relevant for detections of stems in RNA structure
 What does a palindrome look like?
 Relevant for restriction enzymes
 What do repeats look like?
 What does a protein with domains A and B and another
one with domains B and C look like?
By Michael Schroeder, Biotec, 2004
16
Dot plot for identical sequences
D
D
O
R
O
T
H
H
O
D
O
D
G
K
I
N
D
O
R
O
O
O
O
R
O
O
T
T
H
H
Y
H
Y
H
H
O
D
Y
O
D
G
K
I
N
By Michael Schroeder, Biotec, 2004
O
H
O
D
G
K
I
N
17
Dotplot for unrelated sequences
D O R O T H Y H O D G K I
O
O
O
O
T
T
T
T
O
O
N
O
O
D D
D
I
I
E
T
T
E
R
R
By Michael Schroeder, Biotec, 2004
18
Dotplot for distantly related sequences
D
O
R
O
T
T
H
Y
H
O
D
G
K
I
N
T
I
I
M
O
O
O
T
O
T
H
H
Y
H
Y
J
E
N
N
K
K
I
I
N
N
By Michael Schroeder, Biotec, 2004
19
Dotplot for reverse sequences
D
O
R
O
T
H
Y
H
O
D
G
K
I
N
N
I
I
K
K
G
D
G
D
O
D
O
O
O
H
H
Y
H
Y
H
H
T
H
T
O
O
R
O
O
O
O
R
O
D
N
O
D
By Michael Schroeder, Biotec, 2004
20
Dotplot for reverse sequences
 Relevant to identify stems in RNA structures
 Plot sequence against its reverse complement
By Michael Schroeder, Biotec, 2004
21
Palindromes and restriction enzymes
 Madam, I'm Adam
 Able was I ere I saw Elba (supposedly said by Napoleon)
 Doc note I dissent, a fast never prevents a fatness, I diet on cod.
 Because DNA is double stranded and the strands run
antiparallel, palindromes are defined as any double stranded
DNA in which reading 5’ to 3’ both are the same
 The HindIII cutting site:
– 5'-AAGCTT-3'
– 3'-TTCGAA-5'
 The EcoRI cutting site:
– 5'-GAATTC-3'
– 3'-CTTAAG-5'
By Michael Schroeder, Biotec, 2004
22
Dotplot of a Palindrome
M
M
A
A
M
A
M
M
A
D
A
D
A
M
D
A
M
By Michael Schroeder, Biotec, 2004
A
M
23
Dotplot of repeats
T
T
W
E
T
W
T
T
W
E
N
T
Y
T
E
N
T
T
O
T
E
N
T
T
T
Y
T
W
W
N
T
Y
T
T
W
T
T
W
O
W
O
T
W
O
T
T
W
E
T
T
W
E
N
E
N
T
W
E
N
N
T
Y
T
T
Y
O
T
Y
O
N
E
E
T
E
W
T
N
W
Y
T
O
W
N
T
Y
T
E
T
N
N
E
By Michael Schroeder, Biotec, 2004
O
N
N
E
E
24
Dotplot of Repeats/Palindrome
M
M
A
A
M
A
M
A
A
A
D
A
M
A
M
M
A
A
D
A
M
A
M
I
A
M
M
M
M
I
M
A
M
A
D
A
A
D
A
M
M
D
A
M
I
M
D
M
D
A
M
By Michael Schroeder, Biotec, 2004
A
D
A
A
M
M
A
M
25
Dotplot for shared domain
D
D
O
O
O
T
H
Y
H
O
D
G
K
I
N
D
O
R
O
R
O
O
O
O
R
O
T
T
H
H
Y
H
Y
M
I
I
L
L
E
R
R
By Michael Schroeder, Biotec, 2004
26
Dot plot
Result
dorothycrowfoothodgkin
d*
*
o * *
* ** *
r *
*
o * *
* ** *
t
*
*
h
*
*
y
*
h
*
*
o * *
* ** *
d*
*
g
*
k
*
i
*
n
*
By Michael Schroeder, Biotec, 2004
27
Dot plot
Result
dorothycrowfoothodgkin
d*
*
o * *
* ** *
r *
*
o * *
* ** *
t
*
*
h
*
*
y
*
h
*
*
o * *
* ** *
d*
*
g
*
k
*
i
*
n
*
By Michael Schroeder, Biotec, 2004
28
Dotplots
 Window size 15
 Dot if
 6 matches in window
By Michael Schroeder, Biotec, 2004
29
 Window size 15
 Dot if
 6 matches in window
By Michael Schroeder, Biotec, 2004
30
>gi|1942644|pdb|1MEG|
64
Length = 216
Crystal Structure Of A Caricain D158e Mutant In Complex With E-
Score = 271 bits (693), Expect = 1e-73
Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%)
Query: 1
Sbjct: 1
Query: 61
Sbjct: 61
IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60
+PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS
LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60
YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
+GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G
KT GV +VQP N+G
HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120
Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176
LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV
G YILIKNS
Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180
Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212
WGT WGE GYIRIKR GNS GVCGLY SS+YP KN
Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216
1
61
121
181
lpenvdwrkk
hgckggyppy
llnaiakqpv
wgtawgekgy
gavtpvrhqg
aleyvakngi
svvveskgrp
irikrapgns
By Michael Schroeder, Biotec, 2004
scgscwafsa
hlrskypyka
fqlykggife
pgvcglykss
vatveginki rtgklvelse qelvdcerrs
kqgtcrakqv ggpivktsgv grvqpnnegn
gpcgtkveha vtavgygksg gkgyilikns
yyptkn
31
 Window size 15
 Dot if
 6 matches in window
By Michael Schroeder, Biotec, 2004
32
>gi|2624670|pdb|1AIM|
Fluoromethylketone
Length = 215
Cruzain Inhibited By Benzoyl-Tyrosine-Alanine-
Score = 121 bits (303), Expect = 3e-28
Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%)
Query: 2
Sbjct: 2
Query: 62
Sbjct: 62
PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61
P VDWR +GAVT VK+QG CGSCWAFSA+ +E
+
L
SEQ L+ CD+
PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61
GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115
GC+GG
+A + + Q
++ ++YPY
EG+
C +
A T V
Q
GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121
Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171
Q A
++ N PV+V + A+
+ Y GG+
+DH V VGY +
Y
Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178
Query: 172 LIKNSWGTGWGENGYIRIKRGT 193
+IKNSW T WGE GYIRI +G+
Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200
By Michael Schroeder, Biotec, 2004
33
 Window size 15
 Dot if
 6 matches in window
By Michael Schroeder, Biotec, 2004
34
gi|7546546|pdb|1EF7|B
Length = 242
Chain B, Crystal Structure Of Human Cathepsin X
Score = 52.0 bits (123), Expect = 2e-07
Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%)
Query: 1
Sbjct: 1
Query: 52
Sbjct: 61
IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51
+P+ DWR
V
+ +NQ
CGSCWA ++
+
I I+
S Q
LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60
ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103
++DC
C GG
S
Q+GI
Y+
+ C
+ K +
NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119
Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157
A +
+ +V Y
+ +
AN P+S + A +
Y GGI+
+
Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178
Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199
+H V+ G+
G Y +++NSWG WGE G++RI
K G G Y +
Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229
By Michael Schroeder, Biotec, 2004
35
 Window size 5
 Dot if
 2 matches in window
By Michael Schroeder, Biotec, 2004
36
 Window size 1
 Dot if
 1 match in window
By Michael Schroeder, Biotec, 2004
37
Dynamic programming
By Michael Schroeder, Biotec, 2004
38
From Dotplots to Alignments
 Obvious best alignment:
D O R O T H
D
Y C R O W F O O T H O D G K I
D
O
O
O
R
O
N
D
R
O
O
O O
O
O
O O
O
R
O
T
T
H
T
H
Y
H
Y
H
H
O
D
DOROTHYCROWFOOTHODGKIN
DOROTHY--------HODGKIN
O
O
D
G
K
I
H
O
O O
O
D
G
K
I
N
By Michael Schroeder, Biotec, 2004
N
39
From Dotplots to Alignments
 Find “best” path from top left corner to bottom right
 Moving “east” corresponds to “-” in the second
sequence
 Moving “south” corresponds to “-” in the first
sequence
 Moving “southeast” corresponds to a match if the
characters are the same or a mismatch otherwise
 Can we automate this?
By Michael Schroeder, Biotec, 2004
40
From Dotplots to Alignments
 Algorithm (Dynamic Programming):
 Insert a row 0 and column 0 initialised with 0
 Starting from the top left, move down row by row from row 1 and
right column by column from column 1 visiting each cell
 Consider
 The value of the cell north
 The value of the cell west
 The value of the cell northwest if the row/column character
mismatch
 1 + the value of the cell northwest if the row/column
character match
 Put down the maximum of these values as the value for the
current cell
 Trace back the path with the highest values from the bottom right
to the top left and output the alignment
By Michael Schroeder, Biotec, 2004
41
From Dotplots to Alignments
0
1
T
2
G
3
C
4
A
5
T
6
A
0
1A
2T
3C
4T
5G
6A
7T
By Michael Schroeder, Biotec, 2004
42
From Dotplots to Alignments
0
0
0
1A
2T
3C
4T
5G
6A
7T
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
 Insert a row 0 and column 0 initialised with 0
By Michael Schroeder, Biotec, 2004
43
From Dotplots to Alignments
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
0
0
1
1
1
•Consider
•Value north
•Value west
•Value northwest if the row/column character mismatch
•1 + value northwest if the row/column character match
•Put down the maximum of these values for current celll
By Michael Schroeder, Biotec, 2004
44
From Dotplots to Alignments
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
By Michael Schroeder, Biotec, 2004
45
Reading the Alignment
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
-tgcat-aat-c-tgat
By Michael Schroeder, Biotec, 2004
46
Reading the Alignment:
there are more than one possibility
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
---tgcata
atctg-atBy Michael Schroeder, Biotec, 2004
47
Formally:
Longest Common Subsequence LCS
 What is the length s(V,W) of the longest common
subsequence of two sequences
V=v1..vn and W=w1..wm ?
 Find sequences of indices
1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ m
such that vit = wjt for 1 ≤ t ≤ k
 How? Dynamic programming:
 si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and

si-1,j
si,j = max
si,j-1
si-1,j-1 + 1, if vi = wj
 Then s(V,W) = sn,m is the length of the LCS
{
By Michael Schroeder, Biotec, 2004
48
Example LCS
0
1
T
2
G
3
C
4
A
5
T
6
A
0
1A
2T
3C
4T
5G
6A
7T
By Michael Schroeder, Biotec, 2004
49
Example LCS:
0
0
1A
2T
3C
4T
5G
6A
7T
0
0
0
0
0
0
0
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
 Initialisation:
si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and
By Michael Schroeder, Biotec, 2004
50
Example LCS:
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
0
0
1
1
1
 Computing each cell:
si,j = max
By Michael Schroeder, Biotec, 2004
{
si-1,j
si,j-1
si-1,j-1 + 1, if vi = wj
51
Example LCS:
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
1
1
1
2
2
2
0
1
2
2
2
2
2
1
1
2
2
2
3
3
1
2
2
3
3
3
4
1
2
2
3
3
4
4
 Computing each cell:
si,j = max
By Michael Schroeder, Biotec, 2004
{
si-1,j
si,j-1
si-1,j-1 + 1, if vi = wj
52
LCS Algorithm
 LCS(V,W)
 For i = 1 to n
 si,0 = 0
 For j = 1 to m
 s0,j = 0
 For i = 1 to n
 For j = 1 to m
 Complexity:
 LCS has quadratic complexity:
O(n m)
 If vi = wj and si-1,j-1 +1 ≥ si-1,j and si-1,j-1 +1 ≥ si,j-1 Then
 si,j = si-1,j-1 +1
 bi,j = North West
 Else if si-1,j ≥ si,j-1 Then
 si,j = si-1,j
 bi,j = North
 Else
 si,j = si,j-1
 bi,j = West
 Return s and b
By Michael Schroeder, Biotec, 2004
53
Printing the alignment of LCS
 PRINT-LCS(b,V,i,j)
 If i=0 or j=0 Then Return
 If bi,j = North West Then
PRINT-LCS(V,b,i-1,j-1)
Print vi
 Else if bi,j = North Then
PRINT-LCS(V,b,i-1,j)
 Else
PRINT-LCS(V,b,i,j-1)
By Michael Schroeder, Biotec, 2004
54
Rewards/Penalities
 We can use different schemes:
 -1 for insert/delete/mismatch
 +1 for match
 …Consider
 -1 + the value of the cell north
 -1 + the value of the cell west
 -1 + the value of the cell northwest if the row/column
character mismatch
 +1 + the value of the cell northwest if the row/column
character match
 Put down the maximum of these values as the value for
the current cell
By Michael Schroeder, Biotec, 2004
55
Reading the Alignment
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
-1
1
0
1
0
-1
1
-1
0
-1
0
2
1
0
-1
-1
1
0
1
1
0
1
0
0
0
0
2
1
0
2
1
1
0
1
3
1
1
1
0
0
1
2
---tgcata
atctg-atBy Michael Schroeder, Biotec, 2004
56
Rewards/Penalities
 Let’s refine the schemes:
 Transition mutations are more common:
 purine<->purine, a<->g
 pyrimidine<->pyrimidine, t<->c
 Transversions (purine<->pyrimidine) are less
common
 Use a subsitutation matrix to rate mismatches:
A
A
G C
2 -2 0 -2
T -2 2 -2 0
G
 -2 for insert/delete
 Mismatch/match according to substitution matrix
T
0 -2 2 -2
C -2 0 -2 2
 …Consider
 -2 + the value of the cell north
 -2 + the value of the cell west
 Corresponding value of the substion matrix
+ the value of the cell northwest
 Put down the maximum of these values as the
value for the current cell
By Michael Schroeder, Biotec, 2004
57
Reading the Alignment
0
0
1A
2T
3C
4T
5G
6A
7T
0
1
T
0
2
G
0
3
C
0
4
A
0
5
T
0
6
A
0
0
0
0
0
0
0
0
-2
2
0
2
0
-2
2
0
0
0
0
4
2
0
-2
0
2
0
2
2
2
2
0
0
0
0
4
2
0
4
2
2
0
2
6
2
2
2
0
2
2
4
---tgcata
atctg-atBy Michael Schroeder, Biotec, 2004
58
Substitution matrixes
By Michael Schroeder, Biotec, 2004
59
How to derive a substitution matrix for
amino acids?
 Amino acids can be classified by physiochemical
properties
Hydrophobic
A
G
P
I
L
Polar
S
T
N
Q
V
Acidic
D
E
Y
C
W
M
F
By Michael Schroeder, Biotec, 2004
K
H
R Basic
Aromatic
60
Cys
Ser
Thr
Pro
Ala
Gly
Asn
Asp
Glu
Gln
His
Arg
Lys
Met
Ile
Leu
Val
Phe
Tyr
Trp
12
0
-2
-3
-2
-3
-4
-5
-5
-5
-3
-4
-5
-5
-2
-6
-2
-4
0
-8
C
PAM 250 matrix
2
1
1
1
1
1
0
0
-1
-1
0
0
-2
-1
-3
-1
-3
-3
-2
S
3
0
1
0
0
0
0
-1
-1
-1
0
-1
0
-2
0
-3
-3
-5
T
6
1
-1
-1
-1
-1
0
0
0
-1
-2
-2
-3
-1
-5
-5
-6
P
2
1
0
0
0
0
-1
-2
-1
-1
-1
-2
0
-4
-3
-6
A
5
0
1
0
-1
-2
-3
-2
-3
-3
-4
-1
-5
-5
-7
G
2
2
1
1
2
0
1
-2
-2
-3
-2
-4
-2
-4
N
By Michael Schroeder, Biotec, 2004
4
3
2
1
-1
0
-3
-2
-4
-2
-6
-4
-7
D
>0, likely mutation
0, random mutation
<0, unlikely
4
2
1
-1
0
-2
-2
-3
-2
-5
-4
-7
E
4
3
1
1
-1
-2
-2
-2
-5
-4
-5
Q
6
2
0
-2
-2
-2
-2
-2
0
-3
H
6
3
0
-2
-3
-2
-4
-4
2
R
5
0 6
-2 2 5
-3 4 2
-2 2 4
-5 0 1
-4 -2 -1
-3 -4 -5
K M I
6
2 4
2 -1 9
-1 -2 7 10
-2 -6 0 0 17
L V F Y W
61
Cys
12
0
-2
-3
-2
-3
-4
-5
-5
-5
-3
-4
-5
-5
-2
-6
-2
-4
0
-8
Ser
0
2
1
1
1
1
1
0
0
-1
-1
0
0
-2
-1
-3
-1
-3
-3
-2
Thr
-2
1
3
0
1
0
0
0
0
-1
-1
-1
0
-1
0
-2
0
-3
-3
-5
Pro
-3
1
0
6
1
-1
-1
-1
-1
0
0
0
-1
-2
-2
-3
-1
-5
-5
-6
Ala
-2
1
1
1
2
1
0
0
0
0
-1
-2
-1
-1
-1
-2
0
-4
-3
-6
Gly
-3
1
0
-1
1
5
0
1
0
-1
-2
-3
-2
-3
-3
-4
-1
-5
-5
-7
Asn
-4
1
0
-1
0
0
2
2
1
1
2
0
1
-2
-2
-3
-2
-4
-2
-4
Asp
-5
0
0
-1
0
1
2
4
3
2
1
-1
0
-3
-2
-4
-2
-6
-4
-7
Glu
-5
0
0
-1
0
0
1
3
4
2
1
-1
0
-2
-2
-3
-2
-5
-4
-7
Gln
-5
-1
-1
0
0
-1
1
2
2
4
3
1
1
-1
-2
-2
-2
-5
-4
-5
His
-3
-1
-1
0
-1
-2
2
1
1
3
6
2
0
-2
-2
-2
-2
-2
0
-3
Arg
-4
0
-1
0
-2
-3
0
-1
-1
1
2
6
3
0
-2
-3
-2
-4
-4
2
Lys
-5
0
0
-1
-1
-2
1
0
0
1
0
3
5
0
-2
-3
-2
-5
-4
-3
Met
-5
-2
-1
-2
-1
-3
-2
-3
-2
-1
-2
0
0
6
2
4
2
0
-2
-4
Ile
-2
-1
0
-2
-1
-3
-2
-2
-2
-2
-2
-2
-2
2
5
2
4
1
-1
-5
Leu
-6
-3
-2
-3
-2
-4
-3
-4
-3
-2
-2
-3
-3
4
2
6
2
2
-1
-2
Val
-2
-1
0
-1
0
-1
-2
-2
-2
-2
-2
-2
-2
2
4
2
4
-1
-2
-6
Phe
-4
-3
-3
-5
-4
-5
-4
-6
-5
-5
-2
-4
-5
0
1
2
-1
9
7
0
Tyr
0
-3
-3
-5
-3
-5
-2
-4
-4
-4
0
-4
-4
-2
-1
-1
-2
7
10
0
Trp
-8
-2
-5
-6
-6
-7
-4
-7
-7
-5
-3
2
-3
-4
-5
-2
-6
0
0
17
Average
StDev
C
S
T
P
A
G
N
D
E
Q
H
R
K
M
I
L
V
F
Y
W
-2.8
-0.5
-0.7
-1.2
-0.9
-1.6
-0.7
-1.1
-1.1
-0.8
-0.3
-0.7
-0.9
-0.8
-0.8
-1.4
-0.8
-1.9
-1.5
-3.1
4
1.5
1.7
2.6
1.9
2.7
2.1
3
2.8
2.6
2.3
2.6
2.5
2.6
2.4
3
2.3
4.1
3.8
5.4
By Michael Schroeder, Biotec, 2004
62
Cys
12
0
-2
-3
-2
-3
-4
-5
-5
-5
-3
-4
-5
-5
-2
-6
-2
-4
0
-8
Ser
0
2
1
1
1
1
1
0
0
-1
-1
0
0
-2
-1
-3
-1
-3
-3
-2
Thr
-2
1
3
0
1
0
0
0
0
-1
-1
-1
0
-1
0
-2
0
-3
-3
-5
Pro
-3
1
0
6
1
-1
-1
-1
-1
0
0
0
-1
-2
-2
-3
-1
-5
-5
-6
Ala
-2
1
1
1
2
1
0
0
0
0
-1
-2
-1
-1
-1
-2
0
-4
-3
-6
Gly
-3
1
0
-1
1
5
0
1
0
-1
-2
-3
-2
-3
-3
-4
-1
-5
-5
-7
Asn
-4
1
0
-1
0
0
2
2
1
1
2
0
1
-2
-2
-3
-2
-4
-2
-4
Asp
-5
0
0
-1
0
1
2
4
3
2
1
-1
0
-3
-2
-4
-2
-6
-4
-7
Glu
-5
0
0
-1
0
0
1
3
4
2
1
-1
0
-2
-2
-3
-2
-5
-4
-7
Gln
-5
-1
-1
0
0
-1
1
2
2
4
3
1
1
-1
-2
-2
-2
-5
-4
-5
His
-3
-1
-1
0
-1
-2
2
1
1
3
6
2
0
-2
-2
-2
-2
-2
0
-3
Arg
-4
0
-1
0
-2
-3
0
-1
-1
1
2
6
3
0
-2
-3
-2
-4
-4
2
Lys
-5
0
0
-1
-1
-2
1
0
0
1
0
3
5
0
-2
-3
-2
-5
-4
-3
Met
-5
-2
-1
-2
-1
-3
-2
-3
-2
-1
-2
0
0
6
2
4
2
0
-2
-4
Ile
-2
-1
0
-2
-1
-3
-2
-2
-2
-2
-2
-2
-2
2
5
2
4
1
-1
-5
Leu
-6
-3
-2
-3
-2
-4
-3
-4
-3
-2
-2
-3
-3
4
2
6
2
2
-1
-2
Val
-2
-1
0
-1
0
-1
-2
-2
-2
-2
-2
-2
-2
2
4
2
4
-1
-2
-6
Phe
-4
-3
-3
-5
-4
-5
-4
-6
-5
-5
-2
-4
-5
0
1
2
-1
9
7
0
Tyr
0
-3
-3
-5
-3
-5
-2
-4
-4
-4
0
-4
-4
-2
-1
-1
-2
7
10
0
Trp
-8
-2
-5
-6
-6
-7
-4
-7
-7
-5
-3
2
-3
-4
-5
-2
-6
0
0
17
Average
StDev
C
S
T
P
A
G
N
D
E
Q
H
R
K
M
I
L
V
F
Y
W
-2.8
-0.5
-0.7
-1.2
-0.9
-1.6
-0.7
-1.1
-1.1
-0.8
-0.3
-0.7
-0.9
-0.8
-0.8
-1.4
-0.8
-1.9
-1.5
-3.1
4
1.5
1.7
2.6
1.9
2.7
2.1
3
2.8
2.6
2.3
2.6
2.5
2.6
2.4
3
2.3
4.1
3.8
5.4
By Michael Schroeder, Biotec, 2004
63
PAM 250: Interpretation
 Immutable:
 Cysteine (Avg=-2.8): known to have several unique,
indispensable functions
 attachment site of heme group in cytochrome and of iron
sulphur FeS in ferredoxins
 Cross links in proteins such as chymotrypsin or ribonuclease
 Seldom without unique function
 Glycine (Avg=-1.6): small size maybe advantageous
 Mutable:
 Serine often functions in active site, but can be easily replaced
 Self-alignment:
 Tryptophan with itself scores very high, as W occurs rarely
By Michael Schroeder, Biotec, 2004
64
Point Accepted Mutations PAM
 Substitution matrix using explicit evolutionary model of how
amino acids change over time
 Use parsimony method to determine frequency of mutations
 Entry in PAM matrix: Likelihood ratio for residues a and b: Probability
a-b is a mutation / probability a-b is chance
 PAM x: Two sequences V, W have evolutionary distance of x PAM if
a series of accepted point mutations (and no insertions/deletions)
converts V into W averaging to x point mutation per 100 residues
 Mutations here = mutations in the DNA
 Because of silent mutations and back mutations n can be >100
 PAM 250 most commonly used
By Michael Schroeder, Biotec, 2004
65
PAM and Sequence Similarity
PAM
0
30
80
110
200
250
% identiy
100
75
60
50
25
20
By Michael Schroeder, Biotec, 2004
66
PAM
 Dayhoff, Eck, Park: A model of evolutionary change in proteins,
1978
 Accepted point mutation = substitution of an amino acid
accepted by natureal selection
 Assumption: X replacing Y as likely as Y replacing X
 Used cytochrome c, hemoglobin, myoglobin, virus coat
proteins, chymotrypsinogen, glyceraldehyde 3-phosphate
dehrydogenase, clupeine, insulin, ferredoxin
 Sequences which are too distantly related have been
omitted as they are more likely to contain multiple mutations
per site
By Michael Schroeder, Biotec, 2004
67
PAM: Step 1
 Step 1: Construct a multiple alignment
 Example
 ACGCTAFKI
 GCGCTAFKI
 ACGCTAFKL
 GCGCTGFKI
 GCGCTLFKI
 ASGCTAFKL
 ACACTAFKL
By Michael Schroeder, Biotec, 2004
68
PAM: Step 2
 Create a phylogenetic tree (parsimony method)
ACGCTAFKI
A->G
I->L
GCGCTAFKI
A->G
GCGCTGFKI
ACGCTAFKL
A->L
GCGCTLFKI
By Michael Schroeder, Biotec, 2004
C->S
ASGCTAFKL
G->A
ACACTAFKL
69
PAM: Step 3
 Note, the following variables
 Residue frequency ri is the number of amino acid i
occurring in the sequences, e.g. rA = 10 and rG=10
 Number of residues r is the number of overall amino
acids in all sequences, e.g. r=63
 Substitutability si is the number of substitutions in the
tree involving amino acid i , e.g. sA=4
 Substituion frequency si,j is the number of
substitutions involving amino acid i and j (i.e. the
number of ij and ji ), e.g. sA,G = 3
 Number of substitutions s is the number of overall
substitutions, s=6
By Michael Schroeder, Biotec, 2004
70
PAM: Step 4
 Relative mutability is the number of times the amino
acid is substituted by any other amino acid in the tree
divided by the total number of substitutions that could
have affected the residue
 Note, it is assumed that substitutions in both
directions are equally likely
How many residues, how
many substitutions overall?
 mi = 100 x ( si x ri ) / ( 2 s x r )
How many i? How many
substitutions with i?
 Example mA = 100 x ( 4 x 10 ) / ( 2 x 6 x 63 ) = 5.3
By Michael Schroeder, Biotec, 2004
71
PAM: Step 5
 Compute mutation probability
 Mi,j = mj x si,j / sj
 Example: MG,A = 5.3 x 3 / 4 = 3.975
By Michael Schroeder, Biotec, 2004
72
PAM: Step 6
 Finally the entry in the PAM Matrix:
 Ri,j = log ( Mi,j / ( ri / r ) )
 Example: RG,A = log ( 3.975 / (10/63) ) = 1.4
By Michael Schroeder, Biotec, 2004
73
PAM: Step 6
 For the entries on the diagonal
 mj =relative mutability  1-mj relative immutability
 Rj,j = log ( (1-mj ) / ( rj / r ) )
By Michael Schroeder, Biotec, 2004
74
BLOSUM
 Different approach to PAM
 BLOcks SUbstitution Matrix (based on BLOCKS
database)
 Generation of BLOSUM x
 Group highly similar sequences and replace them by a
representative sequences.
 Only consider sequences with no more than x % similarity
 Align sequences (no gaps)
 For any pair of amino acids a,b and for all columns c of the
alignment, let q(a,b) be the number of co-occurrences of a,b in all
columns c.
 Let p(a) be the overall probability of a occurring
 BLOSUM entry for a,b is log2 ( q(a,b) / ( p(a)*p(b) ) )
 BLOSUM 50 and BLOSUM 62 widely used
By Michael Schroeder, Biotec, 2004
75
LCS Algorithm (Longest Common
Subsequence) Revisited
 Algorithm (Dynamic Programming) with Substitution Matrix:
 Insert a row 0 and column 0 initialised with 0
 Starting from the top left, move down row by row from row 1 and
right column by column from column 1 visiting each cell
 Consider
 The value of the cell north
 The value of the cell west
 The value of the cell northwest if the row/column character
mismatch
 s + the value of the cell northwest, where s is the value
in the subsitution matrix for the residues in row/column
 Put down the minimum of these values as the value for the
current cell
 Trace back the path with the highest values from the bottom right
to the top left and output the alignment
By Michael Schroeder, Biotec, 2004
76
LCS Revisited: Formally
 What is the length s(V,W) of the longest common subsequence
of two sequences
V=v1..vn and W=w1..wm ?
 Find sequences of indices
1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ m
such that vit = wjt for 1 ≤ t ≤ k
 How? Dynamic programming:
 si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and

si-1,j
si,j = max
si,j-1
si-1,j-1 + t, where t is the value for vi and wj in
the substitution matrix
 Then s(V,W) = sn,m is the length of the LCS
{
By Michael Schroeder, Biotec, 2004
77
Dynamic programming revisited:
local and global alignments and gap
By Michael Schroeder, Biotec, 2004
78
Evolution and Alignments
 Alignments can be interpreted in evolutionary terms
 Identical letters are aligned.
Interpretation: part of the same ancestral sequence
and not changed
 Non-identical letters are aligned (substitution)
Interpretation: Mutation
 Gaps
Interpretation: Insertions and deletions (indels)
By Michael Schroeder, Biotec, 2004
79
Evolution and Alignments
 Specific problems aligning DNA:
 “Frame shift”:
DNA triplets code amino acids
Indel of one nucleotide shifts the whole sequence of
triplets
Thus may have a global effect and change all coded
amino acids
 Silent mutation:
Substitution in DNA leaves transcribed amino acid
unchanged
 Non-sense mutation:
Substitution to stop-codon
By Michael Schroeder, Biotec, 2004
80
Local and Global Alignments
 Global alignment (Needleham-Wunsch) algorithm finds
overall best alignment
 Example: members of a protein family, e.g. globins are very
conserved and have the same length in different organisms from
fruit fly to humans
 Local alignment (Smith-Waterman) algorithm finds locally best
alignment
 most widely used, as
e.g. genes from different organisms retain similar
exons, but may have different introns
e.g. homeobox gene, which regulates embryonic
development occurs in many species, but very different
apart from one region called homeodomain
e.g. proteins share some domains, but not all
By Michael Schroeder, Biotec, 2004
81
Local Alignment
 LCS s(V,W) computes globally best alignment
 Often it is better to maximise locally, i.e. compute
maximal s(vi…vi’ , wj… wi’ ) for all substrings of V and
W
 Can we adapt algorithm?
 Global alignment = longest path in matrix s from (0,0)
to (n,m)
 Local alignment = longest path in matrix s from any
(i,j) to any (i’,j’)
 Modify definition of s adding vertex of weight 0 from
source to every other vertex, creating a free “jump” to
any starting position (i,j)
By Michael Schroeder, Biotec, 2004
82
Local Alignment
 Modify the definition of s as follows:
 si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and

0
si-1,j
si,j = max
si,j-1
si-1,j-1 + t, where t is the value for vi wj
in the substitution matrix
 Then s(V,W) = max { si,j } is the length of the local LCS
{
 This computes longest path in edit graph
 Several local alignment may have biological
significance (consider e.g. two multi-domain proteins
whose domains are re-ordered
By Michael Schroeder, Biotec, 2004
83
Aligning with Gap Penalties
 Gap is sequence of spaces in alignment
 So far, we consider only insertion and deletion of single
nucleotides or amino acids creating alignments with many gaps
 So far, score of a gap of length l is l
 Because insertion/deletion of monomers is evolutionary slow
process, large numbers of gaps do not make sense
 Instead whole substrings will be deleted or inserted
 We can generalise score of a gap to a score function A + B l,
where A is the penalty to open the gap and B is the penalty
to extend the gap
By Michael Schroeder, Biotec, 2004
84
Aligning with Gap Penalties
 High gap penalties result in shorter, lower-scoring
alignments with fewer gaps and
 Lower gap penalties give higher-scoring, longer
alignments with more gaps
 Gap opening penalty A mainly influences number
of gaps
 Gap extension penalty B mainly influences length
of gaps
 E.g. if interested in close relationships, then choose
A, B above default values, for distant relationships
decrease default values
By Michael Schroeder, Biotec, 2004
85
Aligning with Gap Penalties
 Adapt the definition of s as follows:
 s-deli,j = max
 s-insi,j = max

{
{
s-deli-1,j - B
si-1,j – (A+B)
s-insi,j-1 - B
si,j-1 – (A+B)
0
s-deli,j
si,j = max
s-insi,j
si-1,j-1 + t, where t is the value for vi, wj
in the substitution matrix
Then s(V,W) = max { si,j } is the length of the local LCS
with gap penalties A and B
{
By Michael Schroeder, Biotec, 2004
86
FASTA and BLAST
By Michael Schroeder, Biotec, 2004
87
Motivation
 As in dotplots, the underlying data structure for dynamic
programming is a table
 Given two sequences of length n dynamic programming
takes time proportional to n2
 Given a database with m sequences, comparing a query
sequence to the whole database takes time proportional to m
n2
 What does this mean?
 Imagine you need to fill in the tables by hand and it takes 10
second to fill in one cell
 Assume there are 1.000.000 sequences each 100 amino acids
long
 How long does it take?
By Michael Schroeder, Biotec, 2004
88
 1.000.000 x 100 x 100 x 10 sec = 1011 sec =
27.777.778h = 1157407days = 3170 years
 Even if a computer does not take 10 sec, but just
0.1ms to fill in one cell, it would still be 12 days.
 We cannot do something about the database size,
but can we do something about the table size?
By Michael Schroeder, Biotec, 2004
89
An idea: Prune the search space
By Michael Schroeder, Biotec, 2004
90
Another idea
 Did we formulate the
problem correctly?
 Do we need the alignments
for all sequences in the
database?
 No, only for “reasonable”
hits  introduce a
threshold
 A “reasonable” alignment will
contain short stretches of
perfect matches
 Find these first, then
extend them to connect
them as best possible
By Michael Schroeder, Biotec, 2004
91
FASTA and BLAST
 FASTA and BLAST faster than dynamic programming
(5 times and 50 times respectively)
 Underlying idea for a heuristic:
 High-scoring alignments will contain short stretches
of identical letters, called words
 FASTA and BLAST first search for matches of words of
a given length and score threshold:
BLAST for words of length 3 for proteins and 11 for
DNA
FASTA for words of length 2 for proteins and 6 for
DNA
 Next, matches are extended to local (BLAST) and
global (FASTA) alignments
By Michael Schroeder, Biotec, 2004
92
FASTA and BLAST
 More formally:
If the strings V=v1..vm and W=w1..wm match with at
most k mismatches, then they share an p-tuple for
p = m/(k+1),
i.e. vi..vi+l-1 =wj..wj+l-1 for some 1 ≤ i,j ≤ m-p+1
 FILTRATION ALGORITHM, which detects all matching
words of length m with up to k mismatches
 Potential match detection: Find all matches of p-tuples
of V,W (can be done in linear time by inserting them into
a hash table)
 Potential match verification: Verify each potential
match by extending it to the left and right until either the
first k+1 mismatches are found or the beginning or end
of the sequences are found
By Michael Schroeder, Biotec, 2004
93
Example for BLAST
 Search SWISSPROT for Immunoglobulin:
SWISS_PROT:C79A_HUMAN P11912
By Michael Schroeder, Biotec, 2004
94
Example for BLAST
 Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for P11912
Database: All non-redundant SwissProt sequences
1,292,592 sequences; 412,925,052 total letters
By Michael Schroeder, Biotec, 2004
95
Example for BLAST
 Distribution of Hits:
By Michael Schroeder, Biotec, 2004
96
Example for BLAST: Top Hits

Score E Sequences producing significant alignments:
Score
gi|547896|sp|P11912|C79A_HUMAN B-cell antigen receptor comp...
473
gi|728993|sp|P40293|C79A_BOVIN B-cell antigen receptor comp...
312
gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor comp...
278
gi|728994|sp|P40259|C79B_HUMAN B-cell antigen receptor comp...
55
gi|125781|sp|P01618|KV1_CANFA IG KAPPA CHAIN V REGION GOM
38
gi|125361|sp|P17948|VGR1_HUMAN Vascular endothelial growth ...
37
gi|549319|sp|P35969|VGR1_MOUSE Vascular endothelial growth ...
36
gi|114764|sp|P15530|C79B_MOUSE B-cell antigen receptor comp...
36
gi|1718161|sp|P53767|VGR1_RAT Vascular endothelial growth f...
35
gi|125735|sp|P01681|KV01_RAT Ig kappa chain V region S211
35
gi|1730075|sp|P01625|KV4A_HUMAN IG KAPPA CHAIN V-IV REGION LEN 34
gi|1718188|sp|P52583|VGR2_COTJA Vascular endothelial growth...
33
gi|125833|sp|P06313|KV4B_HUMAN IG KAPPA CHAIN V-IV REGION J... 33
gi|125806|sp|P01658|KV3F_MOUSE IG KAPPA CHAIN V-III REGION ...
33
gi|125808|sp|P01659|KV3G_MOUSE IG KAPPA CHAIN V-III REGION ...
33
gi|1172451|sp|Q05793|PGBM_MOUSE Basement membrane-specific ...
33
gi|125850|sp|P01648|KV5O_MOUSE Ig kappa chain V-V region HP...
33
gi|125830|sp|P06312|KV40_HUMAN Ig kappa chain V-IV region p...
33
gi|2501738|sp|Q06639|YD03_YEAST Putative 101.7 kDa transcri...
33
By Michael Schroeder, Biotec, 2004
E-Value
e-133
3e-85
5e-75
1e-07
0.019
0.042
0.052
0.064
0.080
0.095
0.26
0.28
0.30
0.30
0.30
0.33
0.36
0.38
0.41
97
Example for BLAST: Alignment
>gi|126779|sp|P11911|C79A_MOUSE
B-cell antigen receptor complex
associated protein alpha-chain
precursor (IG-alpha) (MB-1 membrane glycoprotein)
(Surface-IGM-associated protein) (Membrane-bound
immunoglobulin associated protein) (CD79A)
Length = 220
Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)
Query: 1
Sbjct: 1
Query: 61
Sbjct: 56
MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
MPGG
+ LL LS
LGPGCQAL +
P SL V+LGE+A
C N+
MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55
NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
N N+TWW L N TWPP LGPG+
G L
VNK+ G
C+V E N
++SCG
NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114
Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D D
Sbjct: 115
TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174
Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
+YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175
DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220
By Michael Schroeder, Biotec, 2004
98
Example for BLAST

Lineage Report



























root
. cellular organisms
. . Eukaryota
[eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria
[animals]
. . . . . Coelomata
[animals]
. . . . . . Gnathostomata
[vertebrates]
. . . . . . . Tetrapoda
[vertebrates]
. . . . . . . . Amniota
[vertebrates]
. . . . . . . . . Eutheria
[mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------. . . . . . . . . . Bos taurus (bovine) .....................
. . . . . . . . . . Mus musculus (mouse) ....................
. . . . . . . . . . Canis familiaris (dogs) .................
. . . . . . . . . . Rattus norvegicus (brown rat) ...........
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) .
. . . . . . . . . Coturnix japonica ------------------------. . . . . . . . . Gallus gallus (chickens) ..................
. . . . . . . . Xenopus laevis (clawed frog) ---------------. . . . . . . Heterodontus francisci -----------------------. . . . . . Drosophila melanogaster ------------------------. . . . . Caenorhabditis elegans ---------------------------. . . . Saccharomyces cerevisiae (brewer's yeast) ----------. . . Marchantia polymorpha --------------------------------. . Agrobacterium tumefaciens str. C58 ---------------------. Human adenovirus type 3 ----------------------------------. Human adenovirus type 7 ...................................
By Michael Schroeder, Biotec, 2004
473 33 hits [mammals]
312 2 hits [mammals]
278 31 hits [mammals]
37 1 hit [mammals]
35 7 hits [mammals]
29 1 hit [mammals]
33 2 hits [birds]
31 4 hits [birds]
30 2 hits [amphibians]
28 1 hit [sharks and rays]
30 2 hits [flies]
29 1 hit [nematodes]
33 1 hit [ascomycetes]
29 1 hit [liverworts]
28 1 hit [a-proteobacteria]
30 1 hit [viruses]
30 1 hit [viruses]
99
How good is an alignment?
 Be careful: Fitch/Smith found 17 alignments for alpha- and
beta-chains in chicken haemoglobins
 Only one is the correct one (according to the structure)
 Given an alignment, how good is it :
 Percentage of matching residues, i.e. number of matches divided
by length of smallest sequence
 Advantage: independent of sequence length
 E.g.
AT–C –TGAT
4/6 = 66.67%
–TGCAT –A–
 More general: also consider gaps, extensions,…
By Michael Schroeder, Biotec, 2004
100
Blast Raw Score
 R = a I + b X - c O - d G, where
 I is the number of identities in the alignment and a is
the reward for each identity
 X is the number of mismatches in the alignment and
b is the “reward” for each mismatch
 O is the number of gaps and c is the penalty for each
gap
 G is the number of “-” characters in the alignment
and d is the penalty for each
 The values for a,b,c,d appear at the bottom of a Blast
report. For BLASTn they are a=1, b=-3, c=5, d=2
By Michael Schroeder, Biotec, 2004
101
Example
Query: 1 atgctctggccacggcacttgcgga
||||||||||||||| |||| |||
Sbjt:107 atgctctggccacggatcttgtgga
tcccagggtgatctgtgcacctgcgata 53
|||||
|||| |||||||||||||||
tccca---tgatatgtgcacctgcgata 156
 R = 1 x 46 + -3 x 4 - 5 x 1 - 2 x 3 = 23
 So, given the scores: how significant is the
alignment?
By Michael Schroeder, Biotec, 2004
102
Significance of an alignment
 Significance of an alignment needs to be defined with respect
to a control population
 Pairwise alignment: How can we get control population?
 Generate sequences randomly?  Not a good model of real
sequences
 Chop up both sequences and randomly reassemble them
 Database search: How can we get control population?
 Control = whole database
 Align sequence to control population and see how good result
is in comparison
 This is captured by Z scores, P-values and E-values
By Michael Schroeder, Biotec, 2004
103
Z-score
 Z-score normalises the score S:
 Let m be mean of population and std its standard
deviation, then Z-score = (S – m) / std
 Z-score of 0  no better than average, hence might
have occurred by chance
 The higher the Z-score the better
By Michael Schroeder, Biotec, 2004
104
P-value
 P-value: probability of obtaining a score ≥ S
 Range: 0 ≤ P ≤ 1
 Let m be the number of sequences in the control
population with score ≥ S
 Let p be the size of the control population
 Then P-value = m / p
 Rule of thumb:
P ≤ 10-100 exact match,
10-100 ≤ P ≤ 10-50 nearly identical (SNPs)
10-50 ≤ P ≤ 10-10 homology certain
10-5 ≤ P ≤ 10-1 usually distant relative
P > 10-1 probably insignificant
By Michael Schroeder, Biotec, 2004
105
E-values
 E-value takes also the database into account
 E-value = expected frequency of a score ≥ S
 Range: 0 ≤ E ≤ m, where m is the size of the database
 Relationship to P: E = m P
 E values are calculated from
 the bit score
 the length of the query
 the size of the database
By Michael Schroeder, Biotec, 2004
106
BLAST Bit score
 The bit score normalizes the raw score S to make
score under different settings comparable
 The bit score is obtained from the raw score as
follows
 S = ( lambda x R - ln(K) ) / ( ln(2),
where lambda = 1.37 and K=0.711
 Example
 S = ( 1.37 x 23 - ln(0.711) ) / ln(2) = 46
By Michael Schroeder, Biotec, 2004
107
E-value
 The E-value is then calculated as follows:
E = m x n x 2 -S , where
 m is the effective length of the query
 n is the effective length of the database
 S is the bit score
 (effective length takes into account that an alignment
cannot start at the end of a sequence)
 Example:
 m=34 (19 nucleotides fewer than the 53 submitted)
 n=5,854,611,841
 Result: E=0.003
By Michael Schroeder, Biotec, 2004
108
Precision and Recall
 How good are BLAST and FASTA?




True positives, tp = hits which are biologically meaningful
False positives, fp = hits which are not biologically meaningful
True negatives, tn = non-hits which are not biologically meaningful
False negatives, fn = non-hits which are biologically meaningful
 Minimise fp and fn
 Recall:
tp/(tp+fn)
 Precision: tp/(tp+fp)
(meaningful hits / all meaningful)
(meaningful hits / all hits)
 But: since no objective data available difficult to judge BLAST
and FASTA’s sensitivity and specificity
By Michael Schroeder, Biotec, 2004
109
Multiple Sequence Alignments
By Michael Schroeder, Biotec, 2004
110
Multiple Sequence Alignment
 Align more than two sequences
 Choice of sequences
 If too closely related then large redundant
 If very distantly related then difficult to generate good alignment
 Additionally use colour for residues with similar properties
 Yellow
 Green
Small polar
Hydrophobic
 Magenta
 Red
 Blue
Polar
Negatively charged
Positively charged
By Michael Schroeder, Biotec, 2004
GLy,Ala,Ser,Thr
Cys,Val,Ile,Leu,
Pro,Phe,Tyr,Met,Trp
Asn,Gln,His
Asp,Glu
Lys, Arg
111
Thioredoxins: WCGPC[K or R] motif
By Michael Schroeder, Biotec, 2004
112
Thioredoxins: Gly/Pro = turn
By Michael Schroeder, Biotec, 2004
113
Thioredoxins: every second
hydrophobic = beta strand
By Michael Schroeder, Biotec, 2004
114
Thioredoxins:
ca. every 4th hydrophobic = alpha helix
By Michael Schroeder, Biotec, 2004
115
By Michael Schroeder, Biotec, 2004
116
Profiles, PSI-Blast, HMM
By Michael Schroeder, Biotec, 2004
117
Profiles
 Derive profile from multiple sequence alignment
 Useful to
 Align distantly related sequences
 Conserved regions, which may indicate active site
 Classify subfamilies within homologues
 How can profile be used to search
 Insist on profile (such as WGCPC)? Too strict
 Use frequence distribution of profile…
By Michael Schroeder, Biotec, 2004
118
Consider frequencies
A
25
C D
E
F
G H
1
I
K
L
M N
P
Q R
T
2
26
V
W Y
13
16
27
16
28
29
S
7
1
7
1
5 3
16
30
1
4
2
1
 Score for





VDFSAS = 13+16+16+7+16+7
ADATAA = 1+16+0+1+16+0
Not good to pick up distant relationships
Better: combine with substitution matrix
Result: position specific substitution matrix
By Michael Schroeder, Biotec, 2004
119
PSI-Blast
 Globin familiy (oxygen transport ) of proteins occurs
in many species
 Proteins have same function and structure and
 But there are pairs of members of the family sharing
less than 10% identical residues
A
50%
50%
B
C
Only 10%
 PSI-BLAST idea: score via intermediaries may be
better than score from direct comparison
By Michael Schroeder, Biotec, 2004
120
PSI-BLAST
 PSI-BLAST
 1. BLAST
 2. Collect top hits
 3. Build multiple sequence alignment from significant
local matches
 4. Build profile
 5. Re-probe database with profile
 6. Go back to 2.
By Michael Schroeder, Biotec, 2004
121
PSI-BLAST
 But beware of PSI-BLAST:
 False positives propagate and spread through
iterations
 If protein A consists of domains D and E, and protein B
of domains E and F and protein C of domain F, then
PSI-BLAST will relate A and C although they do not
share any domain
By Michael Schroeder, Biotec, 2004
122
Hidden Markov Model
 Procedure to generate sequences
 State transition systems with three types of states
Deletion
Insertion
Match, which emits residues
 Follow probability distribution for successor state
 Train model on multiple sequence alignment
del
start
ins
match
By Michael Schroeder, Biotec, 2004
del
del
ins
ins
match
match
end
123
Summary
 Evolutionary model: Indels and substitutions
 Homologues vs. similarity
 Dot plots
 Easy visual exploration, but not scalable
 Dynamic programming
 Local, global, gaps
 Substitution matrices (PAM, BLOSUM)
 BLAST and FASTA
 Scores and significance
 Multiple Sequence Alignments
 Profiles, PSI-BLAST, HMM
By Michael Schroeder, Biotec, 2004
124
Phylogeny
By Michael Schroeder, Biotec, 2004
125
Motivation
 How did the nucleus get into the eucaryotic cells?
 From where are we?
 Recent Africa vs. Multi-regional Hypothese
 In 1999 Encephalitis caused by the West Nile Virus
broke out in New York. How did the virus come to
New York?
By Michael Schroeder, Biotec, 2004
126
How did the nucleus get into the
eucaryotic cells?
 Simple experiment:
 Blast classes genes
with related functions
in yeast (Eucaryote)
 against Bacteria
and
 against Archaea
 And count number of
significant hits
By Michael Schroeder, Biotec, 2004
127
How did the nucleus get into the
eucaryotic cells?
 Mitochondria und
Energy metabolism:
 Significantly more hits
in bacteria
 Cell organisation:
 Significantly more hits
in Archaea
Blue = Bacteria
Grey = Archaea
 Fundamental Result
without any experiment!
By Michael Schroeder, Biotec, 2004
128
Phylogeny
 Taxonomists aim to classify and group
organisms
 E.g. Aristoteles, De Partibus Animalium
 Ought we, for instance, to begin by discussing
each separate species – man, lion, ox, and the
like – taking each kind in hand independently of
the rest, or ought we rather to deal first with the
attributes which they have in common in virtue
of some common element of their nature, and
proceed from this as a basis for consideration of
them separately other
By Michael Schroeder, Biotec, 2004
129
Schools of Taxonomists
 Goal: create taxonomy
 Approach:
 Phenotype
 Phylogeny
 3 schools:
 Phenotype only
 Evolutionary
Taxonomists:
Phenotype (+ Phylogeny)
 Cladists:
Phylogeny (+Phenotype)
By Michael Schroeder, Biotec, 2004
130
Practical Application:
Westnile virus in NY
 Westnile virus mainly in
Africa
 Transmitted by insects and
birds
 How did the virus get to NY
in 1999
 Hundreds of DNA samples
taken
 All 99.8% identical 
single entry to NY!
 Phylogenetic tree allows to
deduce origin
By Michael Schroeder, Biotec, 2004
131
Example:
Westnil virus in NY
 How can the trees be
constructed?
By Michael Schroeder, Biotec, 2004
132
Three Methods to Generate
Phylogenetic Trees
 Distance-based
 Hierarchical clustering
 Character-based
 Parsimony
 Maximum likelihood
By Michael Schroeder, Biotec, 2004
133
Distance-based Approach
 Single Alignment
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
||||||||||||||| |||| ||||||||
|||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata
 Score: 46 matches, 3 mismatches, 1 gap, 3 gap extensions,
 z.B. Score = 46x1 - 3x1 - 1x2 - 3x1 = 38
 Approach:
 Define distance between two sequences, e.g. percentage of
mismatches in their alignment
 Construct tree, which groups sequences with minimal
distances iteratively together
By Michael Schroeder, Biotec, 2004
134
1
1
2
3
4
5
0
2
6
10
9
0
5
9
8
0
4
5
0
3
2
3
4
5
(1,2)
Hierarchical Clustering
0
(1,2) 3
4
5
0
5
9
8
0
4
5
4
0
3
3
3
4
5
5
0
(1,2)
3
(4,5)
0
5
8
1
0
4
0
3
(4,5)
0
(1,2)
(3,(4,5))
2
(1,2)
(1,2)
(3,(4,5))
0
5
1
2
3
4
5
0
By Michael Schroeder, Biotec, 2004
135
Hierarchical Clustering
 Given a distance matrix D=(dij) with 1≤ i,j ≤ n
 Result: A binary tree of clusters
 Init:
 ToDo = {}
 For all i in { 1,…, n } do
 Let ti be a tree without children, i.e. a leaf
 ToDo := ToDo  { ti }
 Main loop
 While |ToDo | > 1 do
 Find i,j such that dij is minimal
 Add a new column and row labelled k := (i,j) to D
 For all indices h of D apart from k,i,j do
 dh,k = dk,h := min { dh,i , dh,j }
// min = single linkage
 Let tk be a new tree with children ti and tj
 ToDo := ( ToDo  { tk } ) - { ti ,tj }
 Remove columns and rows i,j from D
 Complexity: O(n2)
By Michael Schroeder, Biotec, 2004
136
Hierarchical Clustering:
 How to define distance between clusters?
 Single linkage:
 dh,k = dk,h := min { dh,i , dh,j }
 Example: Distance (A,B) to C is 1
A
B
 Complete linkage:
C
A
B
C
0
1
2
0
1
0
 dh,k = dk,h := max { dh,i , dh,j }
 Example: Distance (A,B) is C is 2
 Average linkage:
 dh,k = dk,h := 0.5 dh,i + 0.5 dh,j
 Example: Distance (A,B) to C is 1.5
 Are dendrograms always the same
independent of the linkage method?
By Michael Schroeder, Biotec, 2004
A B C
A B C
137
Parsimony-method
 Approach: Generate “smallest” tree
containing all the sequences as leaves
Seq
a
b
c
d
1
G
G
G
G
2
G
G
G
A
3
G
G
A
T
4
G
A
T
C
5
G
G
A
A
a GGGGGG
6
G
T
G
T
3 G->A
4 G->T
5 G->A
4 G->A
6 G->T
b GGGAGT c GGATAG
By Michael Schroeder, Biotec, 2004
2 G->A
3 T->A
4 T->C
6 G->T
d GATCAT
138
Parsimony
 Generate smallest tree
 Informative vs. non-informative sites
 Build pairs with fewest possible substitutions
 Example:
 3 possible trees:
((a,b),(c,d)) or ((a,c),(b,d)) or ((a,d),(b,c))
 1,2,3,4 are not informative
 5,6 are informative
5: ((a,b),(c,d))
6: ((a,c),(b,d))
By Michael Schroeder, Biotec, 2004
Seq
a
b
c
d
1
G
G
G
G
2
G
G
G
A
3
G
G
A
T
4
G
A
T
C
5
G
G
A
A
6
G
T
G
T
139
Maximum likelihood
 Assigns quantitative probabilities to mutation
events
 Reconstructs ancestors for all nodes in the tree
 Assigns branch lengths based on probabilities of the
mutational events
 For each possible tree topology, the assumed
substitution rates are varied to find the parameters
that give the highest likelihood of producing the
observed data
By Michael Schroeder, Biotec, 2004
140
Problems
 Character-based methods tend to be better (based
on paleontological data)
 All make assumptions:
 No back mutations
 Same evolutionary rate
By Michael Schroeder, Biotec, 2004
141
Assessing Quality: Bootstrapping
 Given a tree obtained from one of the methods above
 Generate Multiple Alignment
 For a number of interations
Generate new sequences by selecting columns
(possibly the same column more than once) form
the multiple alignment
Generate tree for the new sequences
Compare this new tree with the given tree
For each cluster in the given tree, which also approach
in the new tree, the bootstrap value is increased
 Bootstrap-Value = Percentage of trees containing the
same cluster
By Michael Schroeder, Biotec, 2004
142
From where are we?
 Recent-Africa Hypothesis
 Homo Sapiens came 100-200.000 years ago from
Africa
 Multi-regional Hypothesis
 Ancestors of Homo Sapiens left Africa ca. 2.000.000
years ago
 Which one’s right?
By Michael Schroeder, Biotec, 2004
143
Experiment
 Mitochondrial DNA form 53
humans in different regions
sequenced
 Outgroup = Mitochondrial
DNA of chimpanzee
By Michael Schroeder, Biotec, 2004
144
A nice phylogeny (Nature 2004)
Nature October 2004 Volume 431 No. 7012
By Michael Schroeder, Biotec, 2004
145
Why Mitochondria?
 Simple genetic structure
 No repetitions
 No Pseudo genes
 No Introns
 No recombination
By Michael Schroeder, Biotec, 2004
146
Molecular Clock
 Based on genetic and paleontological the most recent common
ancestor (mrca) of chimpanzee and homo sapiens dates back
5.000.000 years
 Molecular clock:







 1.7 x10-8 nucleotide changes per site and year
Assumption: equal distribution, no silent mutations
Diversity in Afrikca: 3.7 x10-3 nucleotide changes per site and year
Diversity outside Africa: 1.7 x10-3 nucleotide changes per site and year
Estimated expansion1925 generations ago = ca. 40.000 years
Mrca of all humans: 171.500 +/- 50.000 years ago
Mrca of African and non-African: 52,000 +/- 27.500 years ago
 Experiment supports recent-Africa hypothesis
By Michael Schroeder, Biotec, 2004
147
Summary
 Schools of taxonomists
 Assumptions made
 Methods
 Distance-based
 Character-based
By Michael Schroeder, Biotec, 2004
148