Aligning Sequences
You have learned about:
Data & databases
Tools
Amino Acids
Protein Structure
Today we will discuss: Aligning sequences
After this, you are ready to carry out
a bioinformatics research project!
©CMBI 2009
Why align sequences?
The problem:
There are lots of sequences with unknown structure and/or function
There are a few sequences with known structure and/or function
Alignment can help:
• If sequences align well, they are likely to be similar
• If they are similar, then they very likely share structural and/or
functional aspects
• If one of them has known structure/function, then alignment gives us
insight into structural and/or functional aspects of the aligned
sequence(s)
TRANSFER OF INFORMATION!
Sequence Alignment (1)
A sequence alignment is a representation of a whole series of
evolutionary events, which left traces in the sequences.
Things that are more likely to happen during evolution should be most
prominently observed in your alignment.
The purpose of a sequence alignment is to line up all residues in the
sequence that were derived from the same residue position in the
ancestral gene or protein.
Sequence Alignment (2)
[Figure: alignment of sequences A and B]
gap = insertion or deletion
Structural alignment
To carry over information from a well studied protein sequence and
its structure to a newly discovered protein sequence, we need a
sequence alignment that represents the protein structures today, a
structural alignment.
The implicit meaning of placing amino acid residues below each
other in the same column of a protein (multiple) sequence
alignment is that they are at the equivalent position in the 3D
structures of the corresponding proteins!!
Examples
1) the 3 active site residues H, D, S, of the serine protease we saw
earlier
2) Cysteine bridges (disulfide bridges):
STCTKGALKLPVCRK
TSCTEG--RLPGCKR
Transfer of information
Such information can be:
Phosphorylation sites
Glycosylation sites
Stabilizing mutations
Membrane anchors
Ion binding sites
Ligand binding residues
Cellular localization
Typically what one finds in the feature (FT) records of Swissprot!
Significance of alignment
One can only transfer information if the similarity between the two
sequences is sufficiently high.
Schneider (group of Sander) determined the “threshold curve” for
transferring structural information from one known protein structure
to another protein sequence:
If the sequences are > 80 aa long, then >25% sequence identity is
enough to reliably transfer structural information.
If the sequences are smaller in length, a higher percentage of
identity is needed.
Structure is much more conserved than sequence!
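As a rough illustration of this rule of thumb, the sketch below computes the percent identity of a pairwise alignment and applies the simplified cut-off (more than 80 residues, more than 25% identity). The function names are hypothetical. Counting identity only over columns without gaps, and using the number of aligned residue pairs as the length, are assumptions; the threshold curve for shorter sequences is not modelled.

```python
def percent_identity(seq_a: str, seq_b: str) -> tuple[float, int]:
    """Return (percent identity, number of aligned residue pairs).

    Both strings must be rows of the same alignment (equal length,
    '-' marks a gap). Columns containing a gap are ignored (assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs), len(pairs)


def can_transfer_structure(seq_a: str, seq_b: str) -> bool:
    """Simplified rule: > 80 aligned residues and > 25% identity."""
    identity, n_aligned = percent_identity(seq_a, seq_b)
    return n_aligned > 80 and identity > 25.0


# The short cysteine-bridge example from an earlier slide.
identity, n_aligned = percent_identity("STCTKGALKLPVCRK", "TSCTEG--RLPGCKR")
print(f"{identity:.1f}% identity over {n_aligned} aligned residues")
print(can_transfer_structure("STCTKGALKLPVCRK", "TSCTEG--RLPGCKR"))
```

The short example has a high identity but far fewer than 80 aligned residues, so the simplified rule does not allow a transfer on identity alone.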
Significance of alignment (2)
Aligning sequences by hand
Most information that enters the alignment procedure comes from
the physico-chemical properties of the amino acids.
Examples: which is the better alignment (left or right)?
1)
CPISRTWASIFRCW      CPISRTWASIFRCW
CPISRT---LFRCW      CPISRTL---FRCW
2)
CPISRTRASEFRCW      CPISRTRASEFRCW
CPISRTK---FRCW      CPISRT---KFRCW
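One way to make the comparison concrete is to score each candidate gap placement by how similar the paired residues are. The sketch below is a minimal illustration, not what any particular alignment program does: the property groups and the 2/1/0 scores are assumptions chosen for this example, and gaps are not penalised because both candidates contain the same gap.

```python
# Assumed, deliberately coarse grouping by physico-chemical properties;
# real programs use substitution matrices instead.
GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
    set("C"), set("G"), set("P"),  # special cases
]


def pair_score(a: str, b: str) -> int:
    """2 for identical residues, 1 for the same property group, else 0."""
    if a == "-" or b == "-":
        return 0  # gaps are not penalised in this sketch
    if a == b:
        return 2
    return 1 if any(a in g and b in g for g in GROUPS) else 0


def alignment_score(seq_a: str, seq_b: str) -> int:
    """Sum the pair scores over all columns of a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


# left = first listed candidate, right = second listed candidate
examples = {
    "1 left": ("CPISRTWASIFRCW", "CPISRT---LFRCW"),
    "1 right": ("CPISRTWASIFRCW", "CPISRTL---FRCW"),
    "2 left": ("CPISRTRASEFRCW", "CPISRTK---FRCW"),
    "2 right": ("CPISRTRASEFRCW", "CPISRT---KFRCW"),
}
for name, (top, bottom) in examples.items():
    print(name, alignment_score(top, bottom))
```

Under this assumed scheme the placement that pairs similar residues (I with L, R with K) scores one point higher than the placement that pairs dissimilar ones (W with L, E with K).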
Aligning sequences by hand (2)
Procedure of aligning depends on information available:
1) Use “only” identity of amino acid and its physico-chemical properties.
This is more or less what alignment programs do.
2) Also use explicitly the secondary structure preference of the amino
acids.
Example: aligning 2 helices when sequence identity is low
3) Use 3D information if one or more of the structures in the alignment are
known.
In most cases you will start with an alignment program (e.g. CLUSTAL)
and then use your knowledge of the amino acids to improve the
alignment, for instance by correcting the position of gaps.
Helix
Positional preferences in helices (1)
position    -4   -3   -2   -1    1    2    3    4    5   total
state        -    -    -    -    H    H    H    H    H
ASP         98  110  121  260   98  197  167   49   86    1186
Position 1: first position in the helix
Dataset of good helices from PDB files
Count all Asp residues in & before helices
Identify preferential positions for Asp residues
Positional preferences in helices (2)
Fill this table for all 20 amino acids
Use this information when aligning helices that have a low
percentage of sequence identity
position    -4   -3   -2   -1    1    2    3    4    5   total
state        -    -    -    -    H    H    H    H    H
ALA        143  148   99   58  189  205  187  241  268    1538
CYS         24   31   29   22   14   17   18   33   17     205
ASP         98  110  121  260   98  197  167   49   86    1186
GLU         91  100   71   71  152  287  269   70  147    1258
TRP         29   25   29   14   30   26   28   30   29     240
TYR         66   65   75   33   58   44   56   72   48     517
(…)
Position 1: first position in the helix
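The table can be turned into a list of preferential positions per amino acid. The sketch below uses the six rows shown on the slide and flags every position where a residue is counted more often than its own average over the nine positions. That criterion is an assumption made for illustration; a more careful analysis would also correct for how common each amino acid is overall. The names `COUNTS` and `preferred_positions` are hypothetical.

```python
POSITIONS = [-4, -3, -2, -1, 1, 2, 3, 4, 5]

# Counts per position for the six rows shown on the slide.
COUNTS = {
    "ALA": [143, 148, 99, 58, 189, 205, 187, 241, 268],
    "CYS": [24, 31, 29, 22, 14, 17, 18, 33, 17],
    "ASP": [98, 110, 121, 260, 98, 197, 167, 49, 86],
    "GLU": [91, 100, 71, 71, 152, 287, 269, 70, 147],
    "TRP": [29, 25, 29, 14, 30, 26, 28, 30, 29],
    "TYR": [66, 65, 75, 33, 58, 44, 56, 72, 48],
}


def preferred_positions(residue: str) -> list[int]:
    """Positions where the residue occurs more often than its own average."""
    counts = COUNTS[residue]
    mean = sum(counts) / len(counts)
    return [pos for pos, n in zip(POSITIONS, counts) if n > mean]


for residue in COUNTS:
    print(residue, sum(COUNTS[residue]), preferred_positions(residue))
```

For Asp this flags position -1 (just before the helix) and positions 2 and 3, while Glu is flagged at positions 1, 2, 3 and 5.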
Aligning 2 helices when sequence identity is low
Helix 1:
S G V S P D Q L A A L K L I L E L A L K
Helix 2:
G T S L E T A L L M Q I A Q K L I A G
Aligning 2 helices when sequence identity is low (2)
S G V S P D Q L A A L K L I L E L A L K
G T S L E T A L L M Q I A Q K L I A G
[Figure: helix positions (-4 to 5) annotated below each residue of both helices]
Final alignment:
S G V S P D Q L A A L K L I L E L A L K
- G T S L E T A L L M Q I A Q K L I A G
Use of 3D structure info (1)
[Figure: structures 1 and 2]
If you know that in structure 1 the Ala is pointing outside and the Ser is
pointing inside:
Where does the Arg in structure 2 go?
(and what will CLUSTAL choose?)
Use of 3D structure info (2)
      1   2   3   4   5   6   7   8   9   10  11
A     ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL
B1    VAL CYS ARG THR PRO --- --- --- GLU ALA ILE
B2    VAL CYS ARG --- --- --- THR PRO GLU ALA ILE
An even more real example
      1   2   3   4   5   6   7   8   9   10  11
A     ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL
B1    VAL CYS ARG THR PRO --- --- --- GLU ALA ILE
B2    VAL CYS ARG --- --- --- THR PRO GLU ALA ILE
[Figure: each alignment column labelled with its A/B1/B2 residues in one-letter code: IVV, CCC, RRR, LT-, PP-, G--, S-T, A-P, EEE, AAA, VII]
We have seen that alignments …
1) Are crucial for being able to transfer information
2) Can be optimized by using secondary structure preferences
(e.g. helix positioning)
3) Can be optimized by using 3D structure info
Multiple sequence alignments
If we have more than two sequences aligned, the alignment is called a
multiple sequence alignment (MSA)
MSAs can:
1) confirm or improve pair-wise sequence alignments
2) reveal structural information (e.g. cys-bridges)
3) validate PROSITE search results
MSA for improvement of pair-wise alignments
CWPVAASYGR
CWPT---YGR
CWPTA-SYGR
CWPTLGLFGR
MSA and cysteine bridges
Multiple sequence alignments can reveal structural information:
  1   2     3     4
ASCTRGCIKLPTCKKMGRCTGY
STCTKGALKLPVCRKMGKSSAY
ATSTHGCMKLPCSRRFGKCSSY
TSCTEGCLRLPGCKRFGRCTSY
TTCTKGLLKLPGCKRFGKSSAY
ASSTKGCMKLPVSRRFGRCTAY
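The pattern in this alignment can be checked mechanically: cysteines that form a bridge are expected to be present or absent together. The sketch below groups alignment columns by which sequences carry a cysteine there and reports column pairs with identical presence patterns. The function name is hypothetical, and treating "same presence pattern" as a hint for a bridge is an assumption; it suggests candidate pairs, it does not prove a disulfide bond.

```python
from collections import defaultdict

msa = [
    "ASCTRGCIKLPTCKKMGRCTGY",
    "STCTKGALKLPVCRKMGKSSAY",
    "ATSTHGCMKLPCSRRFGKCSSY",
    "TSCTEGCLRLPGCKRFGRCTSY",
    "TTCTKGLLKLPGCKRFGKSSAY",
    "ASSTKGCMKLPVSRRFGRCTAY",
]


def correlated_cysteine_columns(alignment: list[str]) -> list[tuple[int, int]]:
    """Pairs of 1-based columns whose cysteines occur in the same sequences."""
    by_pattern = defaultdict(list)
    for col in range(len(alignment[0])):
        pattern = tuple(seq[col] == "C" for seq in alignment)
        # Skip columns with no or only one cysteine, and columns where every
        # sequence has one: those cannot be told apart by this test.
        if 1 < sum(pattern) < len(alignment):
            by_pattern[pattern].append(col + 1)
    pairs = []
    for cols in by_pattern.values():
        pairs.extend((a, b) for i, a in enumerate(cols) for b in cols[i + 1:])
    return sorted(pairs)


print(correlated_cysteine_columns(msa))  # [(3, 13), (7, 19)]
```

In this alignment the cysteines in columns 3 and 13 always occur together, as do those in columns 7 and 19, which suggests cysteine 1 pairs with cysteine 3 and cysteine 2 with cysteine 4.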
MSA to validate PROSITE results (1)
PROSITE glycosylation pattern:
N-{P}-[ST]-{P}
where N is the glycosylation site.
PROSITE Syntax:
A-[BC]-X-D(2,5)-{EFG}-H
Means:
A        A
[BC]     B or C
X        Anything
D(2,5)   2-5 D’s
{EFG}    Not E, F or G
H        H
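The syntax maps almost directly onto regular expressions. The sketch below translates the elements listed above into a Python regular expression. It is a minimal illustration with a hypothetical function name, and it deliberately supports only these elements (single residues, `[..]`, `{..}`, `X` and repeat counts), not the full PROSITE pattern language.

```python
import re

# One pattern element: a residue, [allowed], or {forbidden}, optionally
# followed by a repeat count such as (3) or (2,5).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        if core in ("X", "x"):
            core = "."                       # anything
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # not these residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


print(prosite_to_regex("A-[BC]-X-D(2,5)-{EFG}-H"))  # A[BC].D{2,5}[^EFG]H
print(prosite_to_regex("N-{P}-[ST]-{P}"))           # N[^P][ST][^P]
```

The resulting string can be passed to `re.search` or `re.finditer` to locate the pattern in a protein sequence.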
MSA to validate PROSITE results (2)
The chance of finding N-{P}-[ST]-{P} is rather high.
So how can you be sure? Look at the multiple sequence alignment:
ASLRNASTVVTIGDTITGNLTLASYHW
GSIKNGSSVITLPGTMEGNLSTTTYHY
ATLRNASTVMEINGTITGDLTLASFHW
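The check can be sketched in code: find every match of the pattern in each sequence and keep only the alignment columns where all sequences match. The regular expression below is a hand translation of N-{P}-[ST]-{P}, and the function name is hypothetical. Because this alignment contains no gaps, string positions are used directly as alignment columns; with gaps, positions would first have to be mapped to columns. Conservation across the MSA is treated here as supporting evidence, not proof.

```python
import re

# N-{P}-[ST]-{P}; the lookahead lets overlapping matches be reported too.
GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")

msa = [
    "ASLRNASTVVTIGDTITGNLTLASYHW",
    "GSIKNGSSVITLPGTMEGNLSTTTYHY",
    "ATLRNASTVMEINGTITGDLTLASFHW",
]


def pattern_columns(sequence: str) -> set[int]:
    """1-based columns at which the pattern starts in one sequence."""
    return {m.start() + 1 for m in GLYCOSYLATION.finditer(sequence)}


hits = [pattern_columns(seq) for seq in msa]
conserved = set.intersection(*hits)

for seq, cols in zip(msa, hits):
    print(seq, sorted(cols))
print("present in every sequence:", sorted(conserved))
```

Only the site starting at column 5 is found in all three sequences. The match at column 19 is missing in the third sequence (D instead of N), and the match at column 13 occurs in the third sequence only.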
What you have learned today
(and will need for your own project)
• A good sequence alignment is necessary for carrying over
information between proteins.
• Putting amino acids below each other in a sequence alignment
implies that you predict that they are at equivalent positions in both
proteins.
• If the aligned sequences are > 80 aa long, then >25% sequence
identity is enough to reliably transfer structural information.
• You need to use all structural information available to you to optimize
the sequence alignment. This can be real 3D data, but can also be
“just” your own knowledge about the properties and preferences of
the amino acids.