Why align sequences?
Lots of sequences with unknown structure and function. A few sequences
with known structure and function
If they align, they are similar
If they are similar, then they might have same structure or function
If one of them has known structure/function, then alignment to the other
yields insight about how the structure or function works
©CMBI 2005
Sequence Alignment
The purpose of a sequence alignment is to line up all residues in
the sequences that were derived from the same residue position in
the ancestral gene or protein
[Figure: sequences A and B]
gap = insertion or deletion
Alignment
To carry over information from a well-studied protein sequence and
its structure to a newly discovered protein sequence, we need a
sequence alignment that represents the protein structures as they are
today: a structural alignment.
Alignment
The implicit meaning of placing amino acid residues below each
other in the same column of a protein (multiple) sequence
alignment is that they are at the “same” position in the 3D
structures of the corresponding proteins!!
Two very simple examples:
1) the 3 active site residues of the serine protease we saw earlier
2) Cys-bridges:
STCTKGALKLPVCRK
TSCTEG--RLPGCKR
Things one can do with a good alignment
Carry information from a well studied to a less well studied protein.
Such information can be:
Phosphorylation sites
Glycosylation sites
Stabilizing mutations
Membrane anchors
Ion binding sites
Ligand binding residues
Cellular localization
Significance of alignment
One can only transfer information if the similarity between the two
sequences is significantly high.
Schneider (group of Sander) determined the “threshold curve” for
transferring structural information from one known protein structure
to another protein sequence:
If the sequences are > 80 aa long, then > 25% sequence identity is
enough to reliably transfer structural information.
If the sequences are shorter, a higher percentage of
identity is needed.
Structure is much more conserved than sequence!
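A minimal sketch of how the rule of thumb above could be checked for an existing pairwise alignment. The function names, the choice to count identity over gap-free columns only, and the use of the number of gap-free columns as the "length" are assumptions made for illustration. The full threshold curve is not reproduced here, only the > 80 aa / > 25% statement.

```python
def identity_counts(aligned_a, aligned_b, gap="-"):
    """Return (identical, compared) over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted (an assumption)
        compared += 1
        if a == b:
            identical += 1
    return identical, compared


def passes_rule_of_thumb(aligned_a, aligned_b, min_length=80, min_identity=25.0):
    """True if more than min_length columns are compared at more than min_identity percent."""
    identical, compared = identity_counts(aligned_a, aligned_b)
    if compared == 0:
        return False
    return compared > min_length and 100.0 * identical / compared > min_identity


# The short Cys-bridge example from an earlier slide: far too short for the rule.
print(identity_counts("STCTKGALKLPVCRK", "TSCTEG--RLPGCKR"))
print(passes_rule_of_thumb("STCTKGALKLPVCRK", "TSCTEG--RLPGCKR"))
```

For short fragments like this one the function returns False regardless of identity, which matches the statement that shorter sequences need a higher percentage.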
Significance of alignment (2)
Aligning sequences by hand
Most information that enters the alignment procedure comes from
the physico-chemical properties of the amino acids.
Examples: which is the better alignment (left or right)?
1)
CPISRTWASIFRCW
CPISRT---LFRCW
CPISRTWASIFRCW
CPISRTL---FRCW
2)
CPISRTRASEFRCW
CPISRTK---FRCW
CPISRTRASEFRCW
CPISRT---KFRCW
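One way to make the two examples above concrete is to score each candidate alignment with a toy scheme: 2 points for an identical pair, 1 point for a pair from the same physico-chemical group, 0 otherwise. The grouping and the point values below are assumptions chosen for illustration, not a published substitution matrix, and gap penalties are left out because both candidates in each example contain the same gap.

```python
# Hypothetical physico-chemical groups; C, G and P are left ungrouped on purpose.
GROUPS = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "positive": "KRH",
    "negative": "DE",
    "polar": "STNQ",
}
GROUP_OF = {res: name for name, members in GROUPS.items() for res in members}


def toy_score(aligned_a, aligned_b, gap="-"):
    """Score a pairwise alignment: identical = 2, same group = 1, otherwise 0."""
    total = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gaps are not scored in this sketch
        if a == b:
            total += 2
        elif GROUP_OF.get(a) is not None and GROUP_OF.get(a) == GROUP_OF.get(b):
            total += 1
    return total


# Example 1: first and second candidate alignment
print(toy_score("CPISRTWASIFRCW", "CPISRT---LFRCW"))
print(toy_score("CPISRTWASIFRCW", "CPISRTL---FRCW"))
# Example 2: first and second candidate alignment
print(toy_score("CPISRTRASEFRCW", "CPISRTK---FRCW"))
print(toy_score("CPISRTRASEFRCW", "CPISRT---KFRCW"))
```

Under these assumptions the candidate that pairs I with L (example 1) or R with K (example 2) scores one point higher than the alternative.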
Aligning sequences by hand (2)
The alignment procedure depends on the information available:
1) Use “only” the identity of each amino acid and its physico-chemical properties.
This is more or less what alignment programs do.
2) Also explicitly use the secondary structure preferences of the amino
acids.
3) Use 3D information if one or more of the structures in the alignment are
known.
In most cases you will start with an alignment program (e.g. CLUSTAL)
and then use your knowledge of the amino acids to improve the
alignment, for instance by correcting the position of gaps.
Helix
Helix preferences
       -4   -3   -2   -1    1    2    3    4    5  total
        -    -    -    -    H    H    H    H    H
ALA   143  148   99   58  189  205  187  241  268   1538
CYS    24   31   29   22   14   17   18   33   17    205
ASP    98  110  121  260   98  197  167   49   86   1186
GLU    91  100   71   71  152  287  269   70  147   1258
PHE    53   70   90   29   68   46   49  107   65    577
GLY   207  246  166  192   96  127   99   65   60   1258
HIS    48   50   39   46   28   36   38   24   30    339
ILE    94   81  133   19   79   45   68  161   99    779
LYS    99   98   80   46   98  105   69   80  154    829
LEU   105  111  188   50  140   84  113  281  209   1281
MET    37   20   51   13   26   22   54   61   67    351
ASN   103   83   89  206   46   62   55   37   77    758
PRO   143  136  121   99  240   78   40    0    0    857
GLN    48   58   40   38   83   93  124   76  101    661
ARG    82   63   59   51   71   75   61  114  109    685
SER   112  128   98  292  105  126   99   48   76   1084
THR   106   99  119  253   91   80  115   72   67   1002
VAL   141  107  132   37  117   74  120  208  120   1056
TRP    29   25   29   14   30   26   28   30   29    240
TYR    66   65   75   33   58   44   56   72   48    517
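A small sketch of how the helix preference numbers can be read, assuming they are observation counts at the four positions before a helix (-4 to -1) and the first five positions inside it (1 to 5). Only four rows are copied here to keep the example short; the function names are hypothetical.

```python
POSITIONS = [-4, -3, -2, -1, 1, 2, 3, 4, 5]

# Four rows taken from the helix preference numbers on this slide.
COUNTS = {
    "ALA": [143, 148, 99, 58, 189, 205, 187, 241, 268],
    "PRO": [143, 136, 121, 99, 240, 78, 40, 0, 0],
    "SER": [112, 128, 98, 292, 105, 126, 99, 48, 76],
    "ASP": [98, 110, 121, 260, 98, 197, 167, 49, 86],
}


def position_fractions(residue):
    """Fraction of the residue's observations found at each position."""
    counts = COUNTS[residue]
    total = sum(counts)
    return {pos: n / total for pos, n in zip(POSITIONS, counts)}


def most_frequent_position(residue):
    """Position at which the residue is observed most often."""
    fractions = position_fractions(residue)
    return max(fractions, key=fractions.get)


for name in COUNTS:
    print(name, most_frequent_position(name))
```

Read this way, Ser and Asp are seen most often just before the helix starts, Pro at the first helix position and never at positions 4 and 5, and Ala deeper inside the helix.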
Helix preferences and alignment
1) S G V S P D Q L A A L K L I L E L A L K
2) G T S L E T A L L M Q I A Q K L I A G
[Figure: the residues of both sequences annotated with their preferred helix positions (-4 to 5)]
Final alignment:
S G V S P D Q L A A L K L I L E L A L K
- G T S L E T A L L M Q I A Q K L I A G
A ‘real’ example of threading
[Figure: structures 1 and 2]
If you know that in structure 1 the Ala is pointing outside and the Ser is
pointing inside:
Where does the Arg in structure 2 go?
(and what will CLUSTAL choose?)
An even more real example
  1   2   3   4   5   6   7   8   9  10  11
ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL
VAL CYS ARG THR PRO --- --- --- GLU ALA ILE
VAL CYS ARG --- --- --- THR PRO GLU ALA ILE
Multiple sequence alignment
Multiple sequence alignments can confirm or improve pair-wise
sequence alignments:
CWPVAASYGR
CWPT---YGR
CWPTA-SYGR
CWPTLGLFGR
CWPVAASYGR
?
CWPTA-SYGR
Multiple sequence alignment
Multiple sequence alignments can reveal structural information:
ASCTRGCIKLPTCKKMGRCTGY
STCTKGALKLPVCRKMGKSSAY
ATSTHGCMKLPCSRRFGKCSSY
TSCTEGCLRLPGCKRFGRCTSY
TTCTKGLLKLPGCKRFGKSSAY
ASSTKGCMKLPVSRRFGRCTAY
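One possible reading of the six sequences above is that cysteines which appear and disappear together may be partners in the same Cys-bridge. The sketch below groups alignment columns (0-based) by the set of sequences that carry a Cys there. The function name is hypothetical, and the bridge interpretation is an assumption, not something the alignment proves.

```python
from collections import defaultdict

ALIGNMENT = [
    "ASCTRGCIKLPTCKKMGRCTGY",
    "STCTKGALKLPVCRKMGKSSAY",
    "ATSTHGCMKLPCSRRFGKCSSY",
    "TSCTEGCLRLPGCKRFGRCTSY",
    "TTCTKGLLKLPGCKRFGKSSAY",
    "ASSTKGCMKLPVSRRFGRCTAY",
]


def cysteine_patterns(alignment):
    """Map each set of Cys-carrying sequence indices to the columns sharing that set."""
    patterns = defaultdict(list)
    for col in range(len(alignment[0])):
        holders = frozenset(i for i, seq in enumerate(alignment) if seq[col] == "C")
        if holders:
            patterns[holders].append(col)
    return dict(patterns)


for holders, columns in cysteine_patterns(ALIGNMENT).items():
    print(sorted(holders), columns)
```

Columns 2 and 12 share one set of sequences and columns 6 and 18 share another, which would be consistent with two separate bridges. The single Cys in column 11 of the third sequence does not fit either pattern and is worth a closer look by hand.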
Multiple sequence alignment
Multiple sequence alignments can validate PROSITE search
results.
In N-{P}-[ST]-{P} the N is the glycosylation site.
The chance of finding N-{P}-[ST]-{P} is rather high.
So how can you be sure? Look at the multiple sequence alignment:
ASLRNASTVVTIGDTITGNLTLASYHW
GSIKNGSSVITLPGTMEGNLSTTTYHY
ATLRNASTVMEINGTITGDLTLASFHW
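The check described above can be sketched with a regular expression: N-{P}-[ST]-{P} becomes `N[^P][ST][^P]`, wrapped in a lookahead so that overlapping matches are not missed. The function names are hypothetical. Because the three sequences contain no gaps, sequence indices (0-based) are used directly as alignment columns; an alignment with gaps would need an extra index mapping.

```python
import re

# PROSITE pattern N-{P}-[ST]-{P} written as a regex; the lookahead allows overlaps.
MOTIF = re.compile(r"(?=N[^P][ST][^P])")

ALIGNMENT = [
    "ASLRNASTVVTIGDTITGNLTLASYHW",
    "GSIKNGSSVITLPGTMEGNLSTTTYHY",
    "ATLRNASTVMEINGTITGDLTLASFHW",
]


def motif_sites(sequence):
    """0-based positions of the N in every match of the motif."""
    return [m.start() for m in MOTIF.finditer(sequence)]


def conserved_sites(alignment):
    """Positions where the motif is found in every sequence of a gap-free alignment."""
    site_sets = [set(motif_sites(seq)) for seq in alignment]
    return sorted(set.intersection(*site_sets))


for seq in ALIGNMENT:
    print(seq, motif_sites(seq))
print("conserved:", conserved_sites(ALIGNMENT))
```

Only the motif at position 4 is present in all three sequences. The other hits appear in one or two sequences only, so under this reading they are weaker candidates for a real glycosylation site.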
Summary
Bioinformatics is all about obtaining information. Everything you
can find in a database saves you doing experiments.
Sequence alignment is important for carrying over information
between ‘similar proteins’.
To align sequences, you need to understand the amino acids.