introtossequencesimilaritysearching_sept2011


Pairwise Alignments and
Sequence Similarity-Based
Searching
Aidan Budd, EMBL Heidelberg
"Anatomy" of a Sequence Alignment
[Figure: two sequences, WKKLGSNVG and WGKVKNVD, labelled "sequences", with individual letters labelled "residues"]
Residues:
Monomers within a polymer (polypeptide or polynucleotide) chain
Sequences:
List of residues in a polymer chain...
...listed in the same order they occur within the polymer
"Anatomy" of a Sequence Alignment
[Figure: the sequences WKKLGSNVG and WGKVKNVD, with 1:1 residue correspondences/relationships marked between them]
1:1 residue correspondences/relationships
Correspondences between
• a single residue in one sequence and
• a single residue in another sequence
"Anatomy" of a Sequence Alignment
[Figure: WKKLGSNVG (top) and WGKVKNVD (bottom), with 1:1 residue correspondences/relationships marked]
Residue has no equivalent in the top sequence, i.e. no residue in the top sequence has a 1:1 relationship with this residue
Could perhaps say there is a "1:2" relationship between this residue and these residues
However, alignments focus on 1:1 relationships
Sequence Alignment Within a Grid
WKKLGSNVG
WGKVKNVD
Often represented using a grid/matrix:
W K K L G S - N V G
W G K V - - K N V D
One sequence per row
Residues in the same column are 'equivalent'
Gap characters (usually "-") indicate that the sequence contains no residues 'equivalent' to other residues in that column
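The grid view can be sketched in a few lines of Python. This is only an illustration of the idea on this slide, not part of the original course material; the function names `columns` and `ungapped` are made up for the example.

```python
GAP = "-"

def columns(alignment):
    """Return the alignment columns as tuples, one entry per row (sequence)."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment grid must have the same length")
    return list(zip(*alignment))

def ungapped(row):
    """Recover the original sequence by dropping the gap characters."""
    return row.replace(GAP, "")

# The two-row grid shown on the slide, written without the spaces.
alignment = ["WKKLGS-NVG", "WGKV--KNVD"]
cols = columns(alignment)

# Columns without a gap hold a 1:1 pair of 'equivalent' residues.
paired = [c for c in cols if GAP not in c]
# Columns with a gap hold a residue that has no equivalent in the other row.
unpaired = [c for c in cols if GAP in c]

print(len(cols), len(paired), len(unpaired))
print([ungapped(row) for row in alignment])
```

Removing the gap characters from a row always gives back the unaligned sequence, which is a quick sanity check when editing an alignment by hand.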
Multiple Sequence Alignment
W K K L G S N V G
W G K V - - N V D
- A K V - - - V D
Pairwise Sequence Alignment
Raise your hands:
•Who has ever seen a "pairwise alignment" before?
•Who has ever used/encountered one in their research?
Pairwise sequence alignments are a crucial component of many bioinformatic analyses and tools
In particular, they lie at the heart of the tools most commonly used to predict the function of protein/DNA/RNA molecules, i.e. to generate hypotheses about the function of key biological entities that can be tested by wet-lab experiments
Building Sequence Alignments
How They Used to Align Sequences...
Courtesy of Geoff Barton, Dundee
Pairwise Alignments
How we can build them today "by hand" using JalView
Why it's useful to look at this, despite the many automatic methods:
Sometimes we're right and the automatic methods are wrong, and we can spot this
Useful to get you thinking about how we choose a good from a bad alignment, and to identify sequences that are easier or more difficult to align (as these are the same kinds of sequences that automatic tools find easy/difficult to align, so we know better what to expect from such tools)
Pairwise Alignments
How we can build them today "by hand" using JalView
http://tinyurl.com/protBioinf2011
During exercises, write down:
1. Features that describe a relatively "good" alignment, thinking in terms of
• sizes of gaps
• numbers of gaps
• properties of residues in the same column (the same as each other? different?)
2. Instructions on how to change a "bad" alignment into a better one
3. Characteristics of sequences that are more difficult/take more time to align than others
Describe what you've written to your neighbour - have you come up with the same answers?
Pairwise Alignments
1. Features that describe a relatively "good" alignment:
• small, few gaps
• residues in the same column are either the same amino acid, or ones with similar physical/chemical properties
2. We build a "good" alignment by inserting as few, and as short, gaps as possible into the alignment, so that as many columns as possible contain the same/similar residues
3. Sequences are more difficult/take more time to align than others when:
• They are more divergent (e.g. when the best alignment contains fewer "identical" residues)
• They are longer
• They have very different lengths
• They are DNA rather than the corresponding coding sequences
• They contain "repeated" regions
Part of the reason why these situations cause problems is that, rather than there clearly being one best alignment, we find several that are "equally"
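The "good alignment" features listed on this slide (few gaps, short gaps, identical residues in a column) can be counted directly. The sketch below is a hypothetical helper written for illustration, not a tool used in the course; it only counts identical residues and ignores "similar" ones, which would need a substitution matrix.

```python
import re

def alignment_stats(row_a, row_b, gap="-"):
    """Count gaps and identical columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    gap_run = re.compile(re.escape(gap) + "+")
    return {
        # number of separate gaps (runs of consecutive gap characters)
        "gap_runs": sum(len(gap_run.findall(row)) for row in (row_a, row_b)),
        # total number of gap characters, i.e. the summed size of all gaps
        "gap_characters": row_a.count(gap) + row_b.count(gap),
        # columns in which both rows carry the same residue
        "identical_columns": sum(
            1 for a, b in zip(row_a, row_b) if a == b and a != gap
        ),
    }

stats = alignment_stats("WKKLGS-NVG", "WGKV--KNVD")
print(stats)
```

Comparing these counts for two candidate alignments of the same pair of sequences gives a rough, manual version of the comparison the exercise asks for.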
1:1 Relationships Between Residues
in the Same Column
Thinking in terms of (i) protein structure and (ii) evolution, what defines
the relationship that you expect residues in the same column of a
good/correct alignment to share with each other?
Think about this on your own for a minute.
Perhaps there's just a single word you might want to use to describe this
sought-after relationship?
Try and write down a definition of this relationship
Try and explain your answers to your neighbours
Alternative Interpretations of
MSAs (Evolutionary and
Structural)
“Equivalence”/similarity of residues
Residues in the same column are either:
• Structurally equivalent/similar
• Evolutionarily equivalent/related/homologous
Different applications assume different types of equivalence
Different types of similarity are not necessarily equivalent
Structural Similarity
[Figure: bacterial toxins 1ji6 and 1i5p, shown unaligned and structurally aligned]
Structural Similarity
Chain 1: 68 ELIGLQANIREFNQQVDNF
            1111111111111111111
Chain 2: 70 ELQGLQNNFEDYVNALNSW
Residues with a similar structural context may lie almost on top of each other within a structural alignment. Clearly, the dark green and red side chains have more similar structural contexts than they do with the adjacent light-coloured side chains
Structural Similarity
Chain 1: 16 KVGSLIGKR---ILSELWGIIFPSGST
            111111111   11111111111 111
Chain 2: 16 VVGVPFAGALTSFYQSFLNTIWP-SDA
Structural equivalence
Some regions of the structures do not have structurally equivalent residues in the other structure
Alignment gaps are a sure sign of such residues
Placing such residues in the same column as residues from other sequences is a misalignment - to be avoided!
1i5p: DNFLNPTQN----PVPLSITSSVN
      111111       11111111111
1ji6: NSWKKTPLSLRSKRSQDRIRELFS
“Evolutionary Equivalence”
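[Figure: evolutionary history starting from the ancestral sequence AGWYTI. Events shown: two copies of the gene generated; mutation/substitution Y-W (giving AGWWTI); substitution G-A (giving AAWYTI); insertion of QQQ (giving AAQQQWYTI). Sequences shown at the tips: AGWWTI, AGWYTI, AAWYTI and AAQQQWYTI, with AGWWTI written as AG---WWTI when aligned against AAQQQWYTI.]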
Residues in the same alignment column should trace their history
back to the same residue in the ancestral sequence with any
changes due only to point substitutions
Quiz - Evolutionary Interpretation of
Alignments
Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?
X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
Y:
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG
Z:
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG
Quiz - Evolutionary Interpretation of
Alignments
"True" alignment given history described above
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG
PRANK:
RGIPGEVLGAQPG
KGIPGDPAFGDPG
---KGEPGIGLPG
Quiz - Evolutionary Interpretation of
Alignments
CLUSTALX:
K---GEPGIGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
MAFFT:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
PRANK:
RGIPGEVLGAQPG
KGIPGDPAFGDPG
---KGEPGIGLPG
Different automatic MSA software gives different results
All are different from the "true" alignment (assuming the scenario of transformation on the previous slide is true)...
... because that scenario is very unlikely under the models of evolutionary transformation incorporated within these tools
X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
Y:
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG
Z:
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG
Interpreting Alignments
• Special 1:1 relationship between residues in the same column
  • Structural: very similar structural context
  • Evolutionary: any difference between residues in the same column is due to point substitution (not to any other kind of mutation e.g. deletion followed by insertion)
• Structural and Evolutionary equivalence need not necessarily be the same
• Not all residues have 1:1 equivalents in other sequences
Non-Equivalence of Evolutionary and
Structural Alignments
Demonstration 1:
Structural equivalence without evolutionary equivalence
Structural alignment of SH3-interaction motifs from nef and ncf1
[Figure: aligned ncf1/nef1 SH3 interaction motifs, from the structures nef1/fyn1 (PDB:1efn) and ncf1/ncf4 (PDB:1w70)]
Non-Equivalence of Evolutionary and
Structural Alignments
Demonstration 2b:
Sequences differ by ONE amino acid residue and have
different folds
[Figure: structures of GA95 and GB95]
Alexander PA, He Y, Chen Y, Orban J, Bryan PN. A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci U S A. 2009 Dec 15;106(50):21149-54. PMID: 19923431
Quiz - Numbers of Insertions
The minimum number of insertion events required to account for
the section of haemoglobin alignment shown above is?
(a) 2
(b) 1
(c) 0
(d) 3
Quiz - Numbers of Insertions
The minimum number of insertion events required to account for
the section of haemoglobin alignment shown above is?
If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions
If an alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion
Quiz - Numbers of Insertions
The minimum number of insertion events required to account for
the section of haemoglobin alignment shown above is?
We can ALWAYS explain observed sequence length diversity
with:
•0 insertions (all length variation due to deletion)
•0 deletions (all length variation due to insertion)
•a combination of insertions and deletions
Perhaps we should instead focus on inferring the most likely scenario?
(Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)
Identifying Good Alignments
Distinguishing Better from Worse
Alignments
An "objective" way of choosing which of two alignments (between the same pair of sequences) is "better" (more likely to be correct) would help us identify "good" alignments
Scoring schemes/rules have been developed for this purpose
These aim to assign higher scores to alignments that are more similar to true alignments
[Figure: two alignments, one labelled "higher score" and one labelled "lower score"]
Scoring Schemes for Aligned
Residues
Commonly, scoring schemes (for columns containing no gaps) assign a score for each column...
...the score for the full alignment is then the sum of the individual column scores
For each column, the score assigned depends on which residues are found in the column
Aligned residue scores are taken from an amino acid substitution matrix - see next slide...
Below we calculate, in this way, a score for two alignments of the same sequences

SeqA  N    A    N
SeqB  N    G    -
      3.8  0.5
score = 3.8 + 0.5 = 4.3

SeqA  N    A    N
SeqB  N    -    G
      3.8       0.4
score = 3.8 + 0.4 = 4.2
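The column-sum scoring described here can be sketched as follows. The three scores are the ones quoted on the slide (N/N = 3.8, A/G = 0.5, N/G = 0.4); the `SCORES` table and the function name are illustrative only, and a real analysis would load a complete substitution matrix. As on the slide, columns containing a gap are simply not scored in this sketch.

```python
# Scores for the residue pairs that appear on the slide (unordered pairs).
SCORES = {
    frozenset(("N", "N")): 3.8,
    frozenset(("A", "G")): 0.5,
    frozenset(("N", "G")): 0.4,
}

def score_alignment(row_a, row_b, gap="-"):
    """Sum the per-column scores of all gap-free columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0.0
    for a, b in zip(row_a, row_b):
        if gap in (a, b):
            continue  # gap columns are ignored here; no gap penalty is applied
        total += SCORES[frozenset((a, b))]
    return total

first = score_alignment("NAN", "NG-")   # columns N/N and A/G are scored
second = score_alignment("NAN", "N-G")  # columns N/N and N/G are scored
print(round(first, 1), round(second, 1))
```

Under this scheme the first alignment scores slightly higher than the second, matching the two totals worked out on the slide.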
Amino Acid Substitution Matrices
Matrix provides a score for alignment of each different pair of amino acids
[Figure: Gonnet PAM250 matrix]

SeqA  N    A    N
SeqB  N    G    -
      3.8  0.5
score = 3.8 + 0.5 = 4.3

SeqA  N    A    N
SeqB  N    -    G
      3.8       0.4
score = 3.8 + 0.4 = 4.2
Amino Acid Substitution Matrices
Values in some cells in this matrix are positive
Although more cells contain negative values
[Figure: Gonnet PAM250 matrix]
Amino Acid Substitution Matrices
Values are obtained by analysing alignments between sequences that we are very confident:
• have similar 3D structures
• are evolutionarily related by processes of:
  • point substitution
  • insertion
  • deletion
[Figure: Gonnet PAM250 matrix]
Amino Acid Substitution Matrices
One set of matrices (the BLOSUM series) is built by analysing many ungapped regions ("blocks") of alignments of fairly similar sequences
Values are obtained by analysing alignments between sequences that we are very confident:
• have similar 3D structures
• are evolutionarily related by processes of:
  • point substitution
  • insertion
  • deletion
Example block (http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A):
DE  Neutrophil cytosol factor 2 signature
BL  PR00499; width=18; seqs=27; 99.5%=1042; strength=1
NCF2_BOVIN|O77775    ( 218) APLQPQAAEPPPRPKTPE   9
NCF2_HUMAN|P19878    ( 218) APLQPQAAEPPPRPKTPE   9
O70145|O70145_MOUSE  ( 218) APLQPQSAEPPPRPKTPE  10
YKA7_CAEEL|P34258    ( 117) IPLKEAFTALPPRPAAPS  40
Q8NFC7               ( 218) APLQPQAAEPPPRPKTPE   9
Q9BV51               ( 218) APLQPQAAEPPPRPKTPE   9
Q95MN2|Q95MN2_RABIT  ( 218) APLQPQAAEPPPRPKTPE   9
Q95L70|Q95L70_BISBI  ( 218) APLQPQAAEPPPRPKTPE   9
Q9N0E9|Q9N0E9_TURTR  ( 218) APLQPQAAEPPPRPKTPE   9
Q6GMC8|Q6GMC8_XENLA  ( 219) APLQPQANNPPSRPKTPE  22
Q59F14|Q59F14_HUMAN  ( 246) APLQPQVRQSDLLGAQAG  95
Q61QT5|Q61QT5_CAEBR  ( 237) IPLKEAFSAPPPRPAAPS  37
Q5R5J0|Q5R5J0_PONPY  ( 218) APLQPQAAEPPPRPKTPE   9
O08635|O08635_MOUSE  (  20) QIFKNQDPVLPPRPKPGH  20
Q3TC92|Q3TC92_MOUSE  ( 232) APLQPQSAEPPPRPKTPE  10
Q3U5S4|Q3U5S4_MOUSE  ( 218) APLQPQSAEPPPRPKTPE  10
Q6DFH8|Q6DFH8_XENLA  (  52) YVIKRQQPDLPPRPKPGH  43
Q499C5|Q499C5_XENTR  ( 219) APLQPQASNPPPRPKTPE  13
Q60FB5|Q60FB5_ONCMY  ( 218) APLQPQVEEVPTRPKVPE  25
Q32N10|Q32N10_HUMAN  ( 387) QVFKNQDPVLPPRPKPGH  16
Q5HYK7|Q5HYK7_HUMAN  ( 387) QVFKNQDPVLPPRPKPGH  16
Amino Acid Substitution Matrices
[Figure: the PR00499 block (Neutrophil cytosol factor 2 signature; width=18; seqs=27) from the previous slide, http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A]
Key parameters estimated in analysis:
• frequency (qij) with which residues i and j are found in the same column (averaged over all analysed blocks)
• frequency with which each pair of residues are found in the same column if all sequences are randomised
  • where pi is the frequency with which residue i is present in the complete dataset, this is: pi * pj
If two residues i and j are more often found in the same column in blocks compared to randomised sequences then
qij / (pi * pj) > 1
and
log(qij / (pi * pj)) > 0
Amino Acid Substitution Matrices
[Figure: the PR00499 block (Neutrophil cytosol factor 2 signature; width=18; seqs=27) from the previous slides, http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A]
Key parameters estimated in analysis:
• frequency (qij) with which residues i and j are found in the same column (averaged over all analysed blocks)
• frequency with which each pair of residues are found in the same column if all sequences are randomised
  • where pi is the frequency with which residue i is present in the complete dataset, this is: pi * pj
If two residues i and j are less often found in the same column in blocks compared to randomised sequences then
qij / (pi * pj) < 1
and
log(qij / (pi * pj)) < 0
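The log-odds idea on these slides can be illustrated with a small sketch. It counts ordered residue pairs within each column (so that the expected frequency under randomisation is pi * pj, as written on the slide), and uses log base 2, which is an arbitrary choice here. This is a simplification made for illustration: it uses only four rows of the example block and applies no sequence weighting or clustering, so it is not the procedure used to derive the published BLOSUM matrices. The function name `log_odds` is hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def log_odds(block):
    """Return log2(qij / (pi * pj)) for every residue pair seen in a column."""
    pair_counts = Counter()
    residue_counts = Counter()
    for column in zip(*block):
        residue_counts.update(column)
        # every ordered pair of residues taken from two different rows
        pair_counts.update(permutations(column, 2))
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    scores = {}
    for (i, j), count in pair_counts.items():
        q_ij = count / n_pairs                 # observed pair frequency
        p_i = residue_counts[i] / n_residues   # background frequency of i
        p_j = residue_counts[j] / n_residues   # background frequency of j
        scores[(i, j)] = math.log2(q_ij / (p_i * p_j))
    return scores

# Four rows taken from the example block on the slide.
block = [
    "APLQPQAAEPPPRPKTPE",
    "APLQPQSAEPPPRPKTPE",
    "IPLKEAFTALPPRPAAPS",
    "QIFKNQDPVLPPRPKPGH",
]
scores = log_odds(block)
positive = sorted(pair for pair, s in scores.items() if s > 0)
negative = sorted(pair for pair, s in scores.items() if s < 0)
print(len(positive), "pairs seen more often than expected")
print(len(negative), "pairs seen less often than expected")
```

Pairs that never share a column get no entry at all in this sketch, because the logarithm of a zero frequency is undefined; handling that case is one of several things a real matrix derivation has to address.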
aa Substitution Matrix Values
Values in matrix cells are proportional to log(qij / (pi * pj))
What range of scores would you expect alignments that are more similar to alignments found in blocks, compared to alignments between random sequences, to have:
A. scores < 0
B. 0 < scores < 1
C. scores > 0

SeqA  N    A    N
SeqB  N    G    -
      3.8  0.5
score = 3.8 + 0.5 = 4.3
Sequence Similarity Searching
[Figure: a query sequence is compared against each sequence in a database of sequences]
Build optimal alignment between query sequence and each database sequence
Calculate score for each optimal alignment
For each alignment, calculate how many alignments between randomised (structurally dissimilar/evolutionarily unrelated) sequences you would expect with a score the same or greater than the score for this alignment
This number is the E-value
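The definition above can be mimicked empirically: shuffle the database sequences many times, score each shuffled sequence against the query, and count how many random scores per search reach the observed score. The sketch below is a toy illustration under strong assumptions and is not how BLAST computes E-values: it uses a hypothetical +1/-1 match/mismatch score, ungapped local alignments only, and a made-up three-sequence database.

```python
import random

def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best score of any ungapped local alignment between a and b."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):  # one diagonal per offset
        run = 0
        for i in range(len(a)):
            j = i - offset
            if 0 <= j < len(b):
                run = max(0, run + (match if a[i] == b[j] else mismatch))
                best = max(best, run)
    return best

def empirical_e_value(query, database, observed_score, n_shuffles=200, seed=0):
    """Average number of shuffled-database hits per search scoring >= observed_score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_shuffles):
        for seq in database:
            residues = list(seq)
            rng.shuffle(residues)  # randomised sequence, same composition
            if best_ungapped_score(query, "".join(residues)) >= observed_score:
                hits += 1
    return hits / n_shuffles

query = "WKKLGSNVG"
database = ["WGKVKNVD", "AKVVD", "APLQPQAAEPPPRPKTPE"]  # made-up toy database
for threshold in (1, 2, 3, 4):
    print(threshold, empirical_e_value(query, database, threshold))
```

The estimate can only fall or stay the same as the score threshold rises, and it can never exceed the number of database sequences in this toy setup.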
Interpreting E-values
E-value of 0.001 means that you would expect this score to be found for an alignment between randomised sequences once in every 1000 similar searches
For each of the values below, decide whether it is impossible that it be reported as the result of such a search:
A. 1
B. 0
C. 0.00001
D. 1000000000
E. -1
Interpreting E-values
You have carried out a BLAST search with a query sequence seqA against a database dbB, such that there are no sequences in dbB that are "related" to seqA
What is the most likely E-value for the highest-scoring alignment?
Interpreting E-values
You have carried out a BLAST search with a query sequence seqA against a database dbB, such that there are no sequences in dbB that are "related" to seqA
You carry out several variants of this search, as described below. For each search, decide:
• the most likely value of the E-value for the highest-scoring alignment
• the most likely ratio of the score of the highest-scoring alignment to that of the initial search (5 times larger? 5 times smaller? etc.)
Variants:
• delete the second half of seqA, search against dbB
• query seqA against a database (dbC) which contains two copies of every sequence in dbB
• query seqA against a database (dbD) in which every second sequence in dbB has been removed
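One way to reason about such variants, not stated on the slides, is the Karlin-Altschul relation that underlies BLAST statistics: E = K * m * n * exp(-lambda * S), where m is the query length, n the total database length and S the raw alignment score. The sketch below only shows how E changes with the size of the search space for a fixed score; the values of K and lambda are placeholder assumptions, since the real values depend on the scoring system.

```python
import math

def expected_random_hits(score, query_length, database_length, k=0.13, lam=0.32):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholders, not values for any real matrix.
    """
    return k * query_length * database_length * math.exp(-lam * score)

base = expected_random_hits(score=40, query_length=300, database_length=1_000_000)
doubled_db = expected_random_hits(score=40, query_length=300, database_length=2_000_000)
half_query = expected_random_hits(score=40, query_length=150, database_length=1_000_000)

print(doubled_db / base)   # effect of doubling the database at a fixed score
print(half_query / base)   # effect of halving the query at a fixed score
```

Note that this holds the score fixed. In the exercise the highest random score itself is likely to shift when the search space changes, which is the second thing the slide asks you to think about.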