BLOSUM and PAM Matrix


Measuring the degree of similarity:
PAM and BLOSUM Matrices
Lecture 13
Introduction
• Measurement of matching
• Nucleic acid and amino acid substitutions
• The BLOSUM Matrix
• The PAM Matrix
• Appropriate use of BLOSUM and PAM Matrix
• Measurement of alignment gaps
Measurement of matching
• The dot plot gives a visual representation of a
sequence alignment. So how do we measure the
alignment?
• One way is to count the matches and mismatches
and take the difference between them.
– Hamming distance: the number of mismatched positions
between two strings of equal length.
– agtc
– cgta Distance is 2 (give another example)
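The Hamming count above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `hamming_distance` is a hypothetical helper, not something named in the lecture.

```python
def hamming_distance(seq1: str, seq2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Compare the sequences position by position and count the mismatches.
    return sum(a != b for a, b in zip(seq1, seq2))


print(hamming_distance("agtc", "cgta"))  # 2, as on the slide
```

Only substitutions are counted, which is why the sequences must be of equal length.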
Measurement of matching
• If the sequences (strings) are not of equal length,
then use:
– The Levenshtein distance: the minimum number of
edit operations (alter/insert/delete) required to
turn one string into another:
• ag-tcc
• cgctca What is the Levenshtein distance?
• But what about the biological plausibility of this
approach? Strings are not the same as
sequences! (hint: amino acid alignment)
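One common way to compute the Levenshtein distance is a row-by-row table of edit costs. The sketch below assumes every alter, insert and delete costs 1; the function name `levenshtein_distance` is a hypothetical helper chosen for illustration.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Minimum number of single-character edits turning source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters reaches the empty prefix
        for j, t_char in enumerate(target, start=1):
            alter = previous[j - 1] + (s_char != t_char)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(alter, delete, insert))
        previous = current
    return previous[-1]


# Use this to check your answer to the question on the slide.
print(levenshtein_distance("agtcc", "cgctca"))
```

Note that every edit costs the same here, which is exactly the biological weakness the last bullet points at: a real substitution matrix weights some changes as more plausible than others.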
Nucleic Acid mutations
• It is known that transitions (purine <-> purine or
pyrimidine <-> pyrimidine, e.g. a<->g and c<->t) are more
common than transversions (purine <-> pyrimidine, e.g. a<->c).
• In sequence alignment we are trying to
determine the degree of similarity and not
dissimilarity, but the Hamming and Levenshtein
distances measure dissimilarity.
• One approach would be to count the number of
matches, but we then need to include the
bias associated with the possible substitutions.
Nucleic acid scoring table
• Based on known rates we could propose a simple
table like the following:
– Each match scores 1000
– The transition A<->G scores 100
– All other substitutions score 10. (Strictly, T<->C is also a
transition; this simplified table scores it with the transversions.)
• The values correspond to the relative chance of each
substitution (or, on the diagonal, of no substitution).
        A      G      T      C
A    1000    100     10     10
G     100   1000     10     10
T      10     10   1000     10
C      10     10     10   1000
Nucleic acid scoring table
• Using this table we can attempt to calculate the similarity:
look at each aligned position and determine the
score of Seq 1 against Seq 2.
• Seq 1: agtc
• Seq 2: cgta
• The scores are 10, 1000, 1000, 10. Since the positions are, we assume,
independent elements (events), we have to multiply them to get the
overall score: 10 × 1000 × 1000 × 10 = 10^8.
– log A + log B = log(A × B)
– So by taking the log of each value we only have to add
the values: log10 of the above product is 1 + 3 + 3 + 1 = 8.
• What would be the table if log values were used?
Nucleic Acid Matrix
     A   G   T   C
A    3   2   1   1
G    2   3   1   1
T    1   1   3   1
C    1   1   1   3
So in this case all we have to do is add the values. Note that this is an example to
illustrate the concept; it is not an actual substitution matrix for nucleic acids (bases)
[one can be found on the internet], but Lesk (2008, p. 255) gives an example of one.
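The equivalence between multiplying the raw scores and adding their logs can be checked with a short Python sketch. It uses only the illustrative values from the tables above (1000, 100, 10); the function names are hypothetical helpers.

```python
import math


def raw_score(x: str, y: str) -> int:
    """Illustrative score from the slide: match, a<->g, or anything else."""
    if x == y:
        return 1000
    if {x, y} == {"a", "g"}:
        return 100
    return 10


def log_score(x: str, y: str) -> int:
    """The same table after taking log10 of each entry (3, 2 or 1)."""
    return round(math.log10(raw_score(x, y)))


def product_score(seq1: str, seq2: str) -> int:
    """Multiply the raw scores of each aligned pair."""
    return math.prod(raw_score(a, b) for a, b in zip(seq1, seq2))


def additive_score(seq1: str, seq2: str) -> int:
    """Add the log scores of each aligned pair."""
    return sum(log_score(a, b) for a, b in zip(seq1, seq2))


print(product_score("agtc", "cgta"))   # 100000000, i.e. 10^8
print(additive_score("agtc", "cgta"))  # 8
```

Adding log scores gives the same ranking of alignments as multiplying the raw values, but the numbers stay small and the arithmetic is simpler.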
Measurement of sequence similarity plays a much greater role in assessing
proteins.
Why do you think the similarity of proteins is more critical than that of nucleic
acids? (hint: the genetic code and AA properties)
Measuring Protein similarity
• Deriving a matrix for proteins is more complex
because:
• There are 20 amino acids, so there is a much larger set of
substitutions.
• The amino acids have properties that affect the
structure, and so the functionality, of the protein.
• Therefore substitutions can be conserved or semi-conserved.
• Observation shows that conserved substitutions are more common:
• conserved, e.g. hydrophobic <-> hydrophobic
• semi-conserved, e.g. hydrophilic <-> hydrophobic
PAM 1 matrix
PAM (point accepted mutation, often expanded as percent accepted mutation) 1
corresponds to one accepted point mutation per 100 residues; in other words a first
round of divergence. Each score depends on the expected rate of occurrence of the
substitution.
Clearly A <-> A, no change, has a high score.
Hydrophobic <-> hydrophobic: V <-> A (13), while V <-> I is (57)
Hydrophilic <-> hydrophilic: K <-> T (11); K <-> R (37)
Hydrophilic <-> hydrophobic: K <-> V (1)
Dayhoff PAM (250) Matrix
• The most commonly used PAM matrix is PAM 250.
• It represents a greater degree of evolutionary divergence and
corresponds to multiplying the PAM 1 matrix by itself 250 times
(matrix multiplication, i.e. raising PAM 1 to the 250th power).
• To derive the values you use: observed rate of mutation / random
mutation rate (based on the AA frequencies). In other words, the ratio
of observed to expected (no bias, positive bias or negative bias).
• The log (base 10) of this ratio is multiplied by 10 to give the
results in the PAM 250 table.
• Therefore a C <-> S value of 2 corresponds to a ratio of about 1.6:
the substitution occurred 1.6 times more often than if it were random.
log10(1.6) ≈ 0.2; multiplying this by 10 gives a value of 2.
• The values in the PAM 250 are obviously lower, but the distribution is
about the same: why?
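Both steps, the repeated matrix multiplication and the log-odds scoring, can be sketched in Python. The 2 × 2 matrix `TOY_PAM1` below is a made-up stand-in for the real 20 × 20 PAM 1 probabilities, and all function names are hypothetical; treat this as an illustration of the arithmetic rather than Dayhoff's actual procedure.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Multiply a matrix by itself n times, starting from the identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def log_odds_score(observed, expected):
    """10 x log10(observed / expected), rounded to a whole number."""
    return round(10 * math.log10(observed / expected))


# Made-up two-residue "PAM 1": 99% chance of no change per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250[0][0])          # well below 0.99 after 250 steps
print(log_odds_score(1.6, 1.0))  # 2
```

After 250 steps the chance of seeing no change is much lower than in the single step, which is one way to approach the question in the last bullet.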
BLOSUM 62 matrix
• Another matrix, the BLOSUM matrix, used a larger data set (as there was more
information available in 1992 than in 1978).
• Moreover, BLOSUM looked at substitutions within blocks of conserved sequences,
as opposed to point mutations on individual sequences in both conserved and
variable regions. [What was the logic behind excluding the variable regions?]
• Unlike the PAM 250 matrix, which is the PAM 1 multiplied by itself 250 times, the
BLOSUM 62 probabilities are derived directly from blocks in which sequences
sharing at least 62% identity are clustered together.
• Like the PAM matrix, it scores conserved substitutions more highly:
– Hydrophobic to hydrophobic: V <-> A (0); V <-> I (3)
– Hydrophilic to hydrophilic: K <-> T (-1); K <-> R (2)
– Hydrophobic to hydrophilic: K <-> V (-2)
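To show how such values are used, the sketch below scores a short ungapped alignment by adding the matrix entry for each aligned pair. The lookup holds only the five BLOSUM 62 values quoted on this slide, so it is deliberately incomplete; `SLIDE_BLOSUM62`, `pair_score` and `alignment_score` are hypothetical names.

```python
# Only the pairs quoted on the slide; a full BLOSUM 62 has 20 x 20 entries.
SLIDE_BLOSUM62 = {
    frozenset("VA"): 0,
    frozenset("VI"): 3,
    frozenset("KT"): -1,
    frozenset("KR"): 2,
    frozenset("KV"): -2,
}


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score; the order of the residues is ignored."""
    return SLIDE_BLOSUM62[frozenset((x, y))]


def alignment_score(seq1: str, seq2: str) -> int:
    """Add the scores of each aligned pair in an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# V<->I (3) + K<->R (2) + K<->V (-2)
print(alignment_score("VKK", "IRV"))  # 3
```

Because the entries are log-odds values, adding them plays the same role as adding the log scores in the nucleic acid example.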
PAM and BLOSUM Matrices
• In the PAM matrix, as the number increases so does
the evolutionary distance, while it is the reverse
in the BLOSUM matrix.
• According to Baxevanis (2005), the following
represents the equivalence and most
appropriate use of both matrices:
– PAM250 and BLOSUM 45
– PAM160 and BLOSUM 62
PAM and BLOSUM Matrix
Matrix                 Best in determining
PAM 40 / BLOSUM 90     Short similar (conserved) alignments
PAM 250                Longer, more divergent alignments
PAM 160 / BLOSUM 80    Detecting members of protein families
BLOSUM 62              Finding all potential similarities
Adapted from Baxevanis 2005
An excellent review of scoring matrices can be found in Henikoff and
Henikoff (2000).
Measurement of alignment gaps
• Gaps represent insertions and deletions.
• They need to be limited so that the alignment remains biologically
plausible.
• Baxevanis (2005) suggests that no more than “one in 20
is a good rule of thumb”.
• Baxevanis (2005) proposed that the use of gaps in
alignments be penalised; in other words, each gap reduces
the measured similarity.
• The penalty associated with using gaps
depends on:
– Opening the gap
– Extending the gap
– The length of the gap
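One common way to combine these three factors is an affine gap penalty: a fixed cost for opening a gap plus a smaller cost for each further position it is extended. The sketch below assumes that scheme; the values `gap_open=10` and `gap_extend=1` are hypothetical defaults chosen for illustration, not figures from the lecture.

```python
import re


def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Penalty for one gap: opening cost plus a cost per extra position."""
    if length < 1:
        raise ValueError("a gap must cover at least one position")
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned: str, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Add up the penalties for every run of '-' in an aligned sequence."""
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", aligned))


print(affine_gap_penalty(1))           # 10: open only
print(affine_gap_penalty(3))           # 12: open plus two extensions
print(total_gap_penalty("ag-tc--a"))   # 21: one gap of 1 and one gap of 2
```

The total is subtracted from the substitution score of the alignment, so one long gap costs less than several short ones of the same combined length.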
The BLAST Algorithm
• The most widely used approach to determining
similarity is the BLAST algorithm.
• Basically, the algorithm is a combination of the
dot plot and one of the scoring matrices, such
as BLOSUM or PAM.
• It is used to determine the best region of local
alignment between the query sequence and
target sequences (refer to dot plot example 1
in lecture 12).
Potential Exam Questions
• Discuss how to derive both the PAM and BLOSUM matrices and
why it is necessary to use different variants of each in
different types of similarity analysis.
• The dot plot and the PAM and BLOSUM matrices are
important tools in the measurement of amino acid sequence
similarity. Discuss the best variant of each that should be
used in the determination of sequence alignment similarity.
• Distinguish between the two main types of scoring matrices
[PAM and BLOSUM] and explain how they are used to
measure the amount of similarity between two sequences.
References
• Baxevanis, A.D. 2005. Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins,
chapter 11. Wiley.
• Lesk, A. 2008. Introduction to Bioinformatics,
3rd edition. Oxford University Press.