Transcript pam&blosum

1
PAM & Blosum
Park, Jong Hwa
MRC-DUNN
Hills Road Cambridge
CB2 2XY
England
Bioinformatics in Biosophy
:
Next
02/06/2001
Family business and Matrix
1. Proteins can always be grouped together.
2. As they can be grouped (clustered), there
must be a way to represent a whole group of
proteins.
3. How are we going to represent the groups
of proteins to represent some important
information effectively?
4. In Bioinformatics, multiple sequence
alignment is one of the most common way
of representing the whole groups. (Why?)
5. Once alignments are made, there are so
many things to do, to know the protein
families better. (Why?)
6. Clustering proteins in various other ways.
Some Blocks in Alignments
From sequence Blocks  PAM matrix
We use PAM units to measure the amount of evolutionary
distance between two amino acid sequences (at least 85%
identical)
Two sequences S1 and S2 are said to be one PAM unit
diverged if a series of accepted point mutations (and no
insertions of deletions) has converted S1 to S2 with an
average of one accepted point-mutation event per 100
amino acids in the two sequences.
The term ``accepted'' here means a mutation that was
incorporated into the protein and passed to its progeny.
Therefore, either the mutation did not change the function
of the protein or the change in the protein was beneficial
to the organism.
PAM
Note that two sequences which are one PAM unit
diverged do not necessarily differ in one
percent, as often mistakenly thought, because a
single position may undergo more than a single
mutation. The difference between the two
notions grows as the number of units does.
2 problems with PAM
There are two main problems with the notion of the PAM units:
1. First, practically all the sequences we can obtain today are
extracted from extant organisms. We almost do not know any
protein sequences where one is actually derived from the other.
The lack of ancestral protein sequences is handled by assuming
that amino acid mutations are reversible and equally likely in
either direction. This assumption, together with the additivity
property of the PAM units derived from its definition, imply that
given two amino acid sequences: Si and Sj whose mutual ancestor
is Sij we have:
d(Si,Sj) = d(Si,Sij) + d(Sij,Sj)
when d(i,j) is the PAM distance between amino acid sequences i
and j.
The InDel problem
2. The second problem, which is more difficult to overcome, is that
we disregard here insertions and deletions which may occur
during evolution, hence we can not be sure of the correct
correspondence between sequence positions. In order to know
the exact correspondence one has to be able to identify the true
historical gaps, or, at least to identify large intervals along the two
sequences where the correspondence is correct. This can not
always be done with certainty, especially when the two
sequences are distantly diverged.
From seq. Blocks to BLOSUM
http://www.math.tau.ac.il/
~rshamir/algmb/scribe/ht
ml/lec03/node10.html
The first stage of building a matrix is
eliminating sequences, which are
identical in more than x% of their
amino acid sequence. This is done to
avoid bias of the result in favor of a
certain protein. The elimination is
done either by removing sequences
from the block, or by finding a cluster
of similar sequences and replacing it
by a new sequence that represents the
cluster. The matrix built from blocks
with no more the x% of similarity is
called BLOSUM-x (e.g. the matrix
built using sequences with no more
then 50% similarity is called
BLOSUM-50.)
Building Blocks Matrix (Blosum)
The second stage is counting the pairs of amino acids in each
column of the multiple alignment. For example in a column
with the acids AABACA (as in the first column in the block in
figure 3.2), there are 6 AA pairs, 4 AB pairs, 4 AC, and one BC.
The probability qi,j for a pair of amino acids in the same
column to be Ai and Aj is calculated, as well as the probability
pi of a certain amino acid to be Ai.
In the third stage the log odd ratio
is calculated.
As final result we consider the rounded 2si,j, this value is stored in
the (i,j) entry of the BLOSUM-x matrix.
We take the log value of the probability in order to allow computing the total
score of all substitutions using summation rather than multiplication.
Making Blosum
In contrast to the PAM matrices, more sequences are examined
in the process of computing the BLOSUM matrix. Moreover,
the sequences are of specific nature of resemblance, and
therefore the two sets of matrices differ.
Comparing the efficiency of two matrices is done by calculating
the ratio between the number of pairs of similar sequences
discovered by a certain matrix but not discovered by another
one and the number of pairs missed by the first but found by
the other.
According to this comparison BLOSUM-62 is found to be
better than other BLOSUM-x matrices as well as than PAMx matrices.
Making New FASTA exchange matrix
1. Get some structural or sequence alignments
2. Find non-gap blocks in the alignments
3. Count the occurrences of mutations for single
amino acid
4. Calculate the log odd ratio of expected
frequency vs observed frequency
5. Produce a FASTA format Matrix
6. Use your own matrix to run a FASTA program
(Fasta -s option).
7. fasta -s my.mat query.spfa lib.spfa
Making your own Matrix 2
http://bioinfo.mbb.yale.edu/align/scop/tables/allStrAlignments.txt
Let’s make a Multiple Residue Column (MRC) Matrix.
1. Get the alignment file: bioinfo-238/teaching mat/practical_
2.
3.
4.
5.
resource/
Parse the alignments to get dipeptide entries and occurrences
Make 400x400 matrix with occurrences(frequency)
Calculate the log odd ratio of all the dipeptides
Display the rankings of similar non-identical dipeptides.
How to improve exchange Matrix?
• Use sequence alignment from the best people or
structure/sequence align algorithm
• Use very large set of alignments
• Apply some protein family specific matrix (for
example, transmembrane proteins)
• Use multiple residue columns to make the
Matrix
PAM to quantify homology automatically
• Problem: we have two sequences
–
–
SEQVENCE
STRVCTVT
How similar or Homologous are they?
1. A simple way: Compositional Identity
2. Another Simple way: Sequence Identity
3. Advanced way: Similarity Score
4. Most advanced: Statistical Significance
Simple Sequence Similarity
1. A simple way: Compositional Identity
Compositional ID: SEQVENCE <> STRVCTVT
Shows how similar the sequences without giving order
information
1. Another Simple way: Sequence Identity
Simple to understand, but far from accurate
To have more meaningful similarity, we can use
PAM, BLOSUM, your own MATRICES,,,
Similarity Scores for a pair of sequences
1. Advanced way: Similarity Score
Align the two sequences
Calculate the score based on Matrices used.
2. Most advanced: Statistical Significance
1. Produces some random sequences for one of the
sequences and compare them to the other
sequences many times (~100) and check what is
the control level alignment score.
2. Onc can use reversed sequences instead of
random sequences.
How to count similarity?
• If two sequences are very similar, we can just
count the identical and similar sequences and
calculate score
– ABCDEKKKMLOVE
– ** *.**:.: *
– ABEDCKKRNLIKE
• However, biological sequences have
Insertions and Deletions.
– ABCDEKKKABCDEMLOVE
– ABEDCKKR-----NLOVE
How do we know the correct Insertions and Deletions
positions?
Normally it is not easy when sequences are very distant.
Below twilight zone => Nearly impossible to have correct alignment.
What is the ‘correct’ alignment?  This is a non-sense for
diverged sequences
Around and below 40% sequence identity, most sequence alignments
are WRONG.
The wrong regions are mostly Insertion and Deletion locations (gaps)
The best we can do is to find where the gaps are and give up perfect
Alignment.
Putting gaps correctly in any sequence alignment is NOT solved
completely.
Finding gaps as best as we can
• Usually, the best method is using NNN.
• NNN is time consuming.
• So, Needleman & Wunsch made a computer
algorithm. Simple Dynamic Programming
(J. Mol. Biol. 48, 443-453 [1970])
Let's do a simple example, adapted from Needleman & Wunsch's original paper.
First, place the sequences on a matrix of cells. At each
cell where the amino acids are identical, enter a value of 1. All the other cells are
implicitly given a score of 0 (zero).
A D
A 1
S
C
S
N
R
C
K
C
R
D
1
P
C N S R Q C L C R P M
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Ref: http://www.finchcms.edu/biochem/Walters/nw.html
Now, starting at the C-terminal ends of the sequences and working
toward the origins, add to each cell the maximum value from
among all the cells downstream from it (not including cells
directly below or directly to the right. Let's do a few cycles
of this and see how the matrix develops. Start with the last
column and last row, adding in the zeros.
A D
A 1
S
C
S
N
R
C
K
C
R
D
1
P 0 0
C N S R Q C L C R P M
0
1
0
1
1
1
0
1
0
1
0
1
1
0
1
1
1
0
0
1
1
1
0
1
1
0
0
0 0 0 0 0 0 0 0 0 1 0
A
S
C
S
N
R
C
K
C
R
D
P
A
8
7
6
6
5
4
3
3
2
2
1
0
D
7
7
6
6
5
4
3
3
2
1
2
0
C
6
6
7
6
5
4
4
3
3
1
1
0
N
6
6
6
5
6
4
3
3
2
1
1
0
S
5
6
5
6
5
4
3
3
2
1
1
0
R
4
4
4
4
4
5
3
3
2
2
1
0
Q
4
4
4
4
4
4
3
3
2
1
1
0
C
3
3
4
3
3
3
4
3
3
1
1
0
L
3
3
3
3
3
3
3
3
2
1
1
0
C
2
2
3
2
2
2
3
2
3
1
1
0
R
1
1
1
1
1
2
1
1
1
2
1
0
P
0
0
0
0
0
0
0
0
0
0
0
1
M
0
0
0
0
0
0
0
0
0
0
0
0
• ADCNS-RQCLCR-PM
• | | | | | || |
• ASC-SNR-CKCRDP• ADC-NSRQCLCR-PM
• | | | | | || |
• ASCSN-R-CKCRDPhttp://www.perlmonth.com/columns/modules/Perl/Align/NW/NW.htm
Additional ref: Smith, T.F. and Waterman, M.S. 1981.
``Identification of common molecular subsequences'' Journal of
Molecular Biology. 147: 195-197
FASTA
• FASTA is a program for rapid alignment of
pairs of protein and DNA sequences. Rather
than comparing individual residues in the
two sequences, FASTA instead looks for
matching sequence patterns or words, called
k-tuples, and then attempts to build a local
alignment based upon these word matches.
FASTA
In the initial search for regions of similarity, FASTA uses a
computer method known as hash coding. In this method, a
lookup table showing the positions of each sequence word of
length k, called a k-tuple, is constructed for each sequence.
The relative positions of each word in the two sequences is
then calculated by subtracting the position in the first
sequence from that in the second. Words that have the
same offset position reveal a region of alignment between the
two sequences. The number of comparisons increases linearly
in proportion to average sequence length. In contrast, the time
taken in dot matrix and dynamic programming methods
increases as the square of the average sequence length. The ktuple length is user-defined and is usually 1 or 2 for protein
sequences (i.e. either the positions of each of the individual
20 amino acids or the positions of each of the 400 possible
dipeptides are located).
FASTA
position 1 2 3 4 5 6 7 8 9 10 11
protein 1 n c s p t a . . . . .
protein 2 . . . . . a c s p r k
position in
offset
amino acid
protein A protein B
pos A - posB
----------------------------------------------------a
6
6
0
c
2
7
-5
k
11
n
1
p
4
9
-5
r
10
s
3
8
-5
t
5
----------------------------------------------------Note the common offset for the 3 amino acids c,s and p
A possible alignment is thus quickly found protein 1
protein 2
n c s p t a
| | |
a c s p r k