Expect Values

Download Report

Transcript Expect Values

Scoring Matrices
Limitations to Needleman-Wunsch
The problem with Needleman-Wunsch
is the amount of processor memory
resources it requires.
Because of this, it is not favored for
practical use, despite the guarantee
of an optimal alignment.
What is the problem?
There are about 10 88 possible alignments
for two sequences with 300 nucleotides
long( There are only about 10 80 elementary
particles in the universe.
•
• It is not possible to solve the alignment
problem with brute force.
•Therefore, we need some smart methods
(or algorithms to overcome this problem
Limitations to Needleman-Wunsch
The other difficulty is that the
concept of global alignment is not
used in pairwise sequence comparison
searches.
Global Alignment vs. Local
Alignment
Global
Needleman-Wunsch Method
Local
Dot Plots
Smith-Waterman
FastA
BLAST
Global alignment:
The global alignment optimizes the alignment over
the full length of the sequences.
LGPSTKDFGKISESREFDN
LNQLERSFGKINMRLE-DA
Local Alignment:
---------------FGKI-------------------------FGKI----------• In local alignment ,stretches with the highest
density of matches are given the highest priority.
• The alignment tends to stop at the ends of
regions of identity or strong similarity.
Purpose of Smith Waterman
Algorithm
•Smith-Waterman dynamic programming
algorithm, finds the most similar
subsequences of two sequences, that has
been generally recognized as the most
sensitive sequence.
•The search sequences in protein and DNA
databases searches for similarity to the
query sequence by using Smith-Waterman
algorithm as the core sequence comparison
method.
Smith-Waterman searches
A more sensitive brute force
approach to searching
much slower than BLAST or FASTA
uses dynamic programming
SSEARCH is a GCG program for
Smith-Waterman searches
Differences
Needleman- Wunsch
• Global alignments
• Requires alignments
score for a pair of
residues to be >=0
• No gap penalty required
Smith - Waterman
• Local alignments
• Residue alignment score
may be positive or
negative
• Requires a gap penalty
to work effectively
• Score can increase,
decrease or stay level
between two cells of a
pathway.
Scoring Matrix/Substitution
Matrix
To score quality of an alignment
 Contains scores for pairs of residues (amino
acids or nucleic acids) in a sequence
alignment
 For protein/protein comparisons:
a 20 x 20 matrix of similarity scores where
identical amino acids and those of similar
character (e.g. Ile, Leu) give higher scores
compared to those of different character
(e.g. Ile, Asp).
 Symmetric, so often only half is shown.

Substitution Matrices
Not all amino acids are equal



Some are more easily substituted than others
Some mutations occur more often
Some substitutions are kept more often
Mutations tend to favor some substitutions


Some amino acids have similar codons
They are more likely to be changed from DNA
mutation
Selection tends to favor some substitutions


Some amino acids have similar properties or
structure
They are more likely to be kept
Substitution Matrix
A substitution matrix describes the
likelihood that two residue types
would mutate to each other in
evolutionary time.
This is used to estimate how well two
residues of given types would match
if they were aligned in a sequence
alignment.
Substitution Matrix
An amino acid substitution matrix is a
symmetrical 20*20 matrix, where each
element contains the score for substituting
a residue of type i with a residue of type j
in a protein, where i and j are one of the
20 amino-acid residue types.
Same residues should obviously have high
scores, but if we have different residues in
a position, how should that be scored?
Scoring Matrices
Scoring matrices tell how similar
amino acids are.
There are two main sets of scoring
matrices: PAM and BLOSUM.
PAM is based on evolutionary
distances
BLOSUM is based on
structure/function similarities
Substitution Matrix Scoring
The same residues in a position give the
score value 1, and different residues give 0.
The same residues give a score 1, similar
residues (for example: Tyr/Phe, or Ile/Leu)
give 0.5, and all others 0.
One may calculate, using well established
sequence alignments, the frequencies
(probabilities) that a particular residue in a
position is exchanged for another.
Similarity Searching
•It is easy to score if an amino acid is identical to another
(the score is 1 if identical and 0 if not). However, it is not
easy to give a score for amino acids that are somewhat
similar.
•Should they get a 0 (non-identical) or a 1 (identical) or
something in between?
O
O
H2N
Leucine
CH
C
CH2
CH
CH3
CH3
OH
H2N
CH
C
CH
CH3
CH2
CH3
OH
Isoleucine
Scoring Similarity
1) Can only score aligned sequences
2) DNA is usually scored as identical or not
3) Modified scoring for gaps - single vs.
multiple base gaps (gap extension)
4) AAs have varying degrees of similarity



a. # of mutations to convert one to another
b. chemical similarity
c. observed mutation frequencies
5) PAM matrix calculated from observed
mutations in protein families
Dayhoff Matrix
This was done originally be Margaret
Dayhoff. Her matrices are called the PAM
(Point Accepted Mutation) matrices, which
describe the exchange frequencies after
having accepted a given number of point
mutations over the sequence.
Typical values are PAM 120 (120 mutations
per 100 residues in a protein) and PAM
250.
There are many other substitution
matrices: BLOSUM, Gonnet, etc.
Dayhoff Matrix




Derived from how often different amino acids
replace other amino acids in evolution.
Created from a dataset of closely similar protein
sequences (less than 15% amino acid difference).
These could be unambiguously aligned.
A mutation probability matrix was derived where
the entries reflect the probabilities of a
mutational event.
This matrix is called PAM 1. An evolutionary
distance of 1 PAM (point accepted mutation) means
there has been 1 point mutation per 100 residues
Importance of Scoring Matrices
Scoring matrices appear in all analyses
involving sequence comparisons.
The choice of matrix can strongly influence
the outcome of the analysis.
Scoring matrices implicitly represent a
particular theory of relationships.
Understanding theories underlying a given
scoring matrix can aid in making proper
choice.
Scoring Matrix Conventions
Scoring matrices are conventionally
numbered with numeric indices
corresponding to the rows and columns of
the matrix.
For example, M11 refers to the entry at
the first row and the first column.
In general, Mij refers to the entry at
the ith row and the jth column.
Scoring Matrices
To use this for sequence alignment,
we simply associate a numeric value to
each letter in the alphabet of the
sequence.
For example, if the matrix is:
{A,C,T,G} then A = 1, C = 2, etc. Thus,
one would find the score for a match
between A and C at M12.
The Filled-in F matrix for global alignment of x=AAGT
and Y=AGCGT(using BLOSUM50 substitution matrix)
Y/
X
D
A
G
C
G
T
D
A
A
G
T
0 -8 -16 -24 -32
-8
5 -3 -11 -19
-16 -3
5
5 -3
-24 -11 -3
2
4
-32 -19 -11
5
0
-40 -27 -19 -3 10
Global alignment
using BLOSUM50 substitution matrix
Y/X
D
A
G
C
G
T
D
0
-8
A
-8
5
A
-16
-3
G
-24
-11
T
-32
-19
-16
-24
-32
-40
-3
-11
-19
-27
5
-3
-11
-19
5
2
5
-3
-3
4
0
10
alignment: AAG _T
AGCGT
Amino Acid Scoring Matrices
There are two major scoring matrices for
amino acid sequence comparisons
PAM-derived from sequences known to be
closely related (Eg. Chimpanzee and human).
Ranges from PAM1 to PAM500
 BLOSUM-derived from sequences not closely
related (Eg. E. coli and human). Ranges from
BLOSUM 10-BLOSUM 100

PAM250 Matrix
The Point-Accepted-Mutation
(PAM) model
•This model implies that amino acids (AA) mutate
independently of each other with a probability which
depends only on the AA.
•Since there are 20 AA, the transition probabilities
are described by a 20X20-mutation matrix, denoted
by M. A standard M defines a 1-PAM change.
•Point Accepted Mutation (PAM) Distance: A 1-PAM
unit changes 1% of the amino acids on average:
where fi is the frequency of AAi, and Mii is the
frequency of no change in amino acid i.
The Point-Accepted-Mutation
(PAM) model
Started by Margaret Dayhoff, 1978
A series of matrices describing the
extent to which two amino acids have
been interchanged in evolution
PAM-1 was obtained by aligning very
similar sequences. Other PAMs were
obtained by extrapolation
The Point-Accepted-Mutation (PAM) model
of evolution and the PAM scoring matrix
A 2-PAM unit is equivalent to two 1-PAM unit
evolution (or M2).
A k-PAM unit is equivalent to k 1-PAM unit
evolution (or Mk).
Example 1:
CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV
|||||||||||||| |||||||||||||||||||||||||||||||||||
CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV
length = 50
1 mismatch
PAM distance = 2
The Point-Accepted-Mutation (PAM) model
of evolution and the PAM scoring matrix
Observed %
Sequence Difference
1
5
10
20
40
50
60
70
80
Evolutionary Distance
In PAMs
1
5
11
23
56
80
112
159
246
Assumptions in the PAM model
1. Replacement at any site depends
only on the amino acid at that site and
the probability given by the table
(Markov model).
2. Sequences that are being compared
have average amino acid composition.
Steps to building the first PAM
Aligned sequences that were at least 85%
identical.
2. Reconstructed phylogenetic trees and
inferred ancestral sequences. 71 trees
containing 1,572 aa exchanges were used.
3. Tallied aa replacements "accepted" by
natural selection, in all pairwise
comparisons (each Aij is the number of
times amino acid j was replaced by amino
acid i in all comparisons).
1.
Steps to building PAM
4. Computed amino acid “mutability”, mj
(the propensity of a given amino acid, j,
to be replaced by any other amino acid)
5. Combined data from 3 & 4 to produce a
Mutation Probability Matrix for one
PAM of evolutionary distance, according
to the following formula:
Replacements
Steps to building PAM
6. Take the log odds ratio to obtain
each score:
Sij = log (Mij/fi) Where fi is the
normalized frequency of aai in the
sequences used.
7. Note: must multiply the Mij/fi by a
factor of 10 prior to avoid fractions.
Sources of error in PAM model
1. Many sequences depart from average aa composition.
2. Rare replacements were observed too infrequently to
determine probabilities accurately (for 36 aa pairs (out
of 400 aa pairs) no replacements were observed!).
3. Errors in 1 PAM are magnified when extrapolated to
250 PAM. (Mijk = k PAM)
4. The idea that each amino acid is acting
independently is an imperfect representation of
evolution. Actually, distantly related sequences usually
have islands (blocks) of conserved residues implying that
replacement is not equally probable over entire
sequence.
The bottom line on PAM
Frequency of alignment
Frequency of occurrence
The probability that two amino acids, i and j are
aligned by evolutionary descent divided by the
probability that they are aligned by chance
BLOSUM Matrix (BLOcks
SUbstitution Matrices)
Blocks Sum-created from BLOCKS database
A series of matrices describing the extent to
which two amino acids are interchangeable in
conserved structures of proteins
The number in the series represents the
threshold percent similarity between
sequences, for consideration for calculation
(For example, BLOSUM62 means 62% of the aa’s
were similar)
BLOSUM Matrices
BLOSUM is built from distantly
related sequences within conserved
blocks whereas PAM is built from
closely related sequences
BLOSUM is built from conserved
blocks of aligned protein segments
found in the BLOCKS database (the
BLOCKS database is a secondary
database that depends on the
PROSITE Family database)
BLOSUM Matrices (cont.1)
Version 8.0 of the Blocks Database consists of
2884 blocks based on 770 protein families
documented in PROSITE. PROSITE supplies
documentation for each family.
Hypothetical entry in red box in BLOCK record:
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
Building BLOSUM Matrices
1. To build the BLOSUM 62 matrix one must
eliminate sequences that are identical in more than
62% of their amino acid sequences. This is done
by either removing sequences from the Block or by
finding a cluster of similar sequences and replacing
it with a single representative sequence.
2. Next, the probability for a pair of amino acids to
be in the same column is calculated. In the
previous page this would be the probability of
replacement of A with A, A with B, A with C, and B
with C. This gives the value qij
3. Next, one calculates the probability that a certain
amino acid frequency exists, fi.
Building BLOSUM Matrices
(cont.)
4. Finally, we calculate the log odds ratio si,j= log2
(qij/fi). This value is entered into the matrix.
Which BLOSUM to use?
BLOSUM
80
62
35
Identity
80%
62% (usually default value)
35%
If you are comparing sequences that are very similar, use
BLOSUM 80. Sequences that are more divergent (dissimilar)
than 20% are given very low scores in this matrix.
Which Scoring Matrix to use?
PAM-1
BLOSUM-100
Small evolutionary
distance
High identity
within short
sequences
PAM-250
BLOSUM-20
Large evolutionary
distance
Low identity within
long sequences
The PAM 250 Scoring Matrix
GCG Wisconsin Package GAP
GAP is the implementation of the NeedlemanWunsch algorithm in the GCG program package.
The NW algorithm will present you with a single
globally optimal alignment, not all possible
optimal alignments - different alignments may
exist that give the same score.
GAP presents you with one member of the family
of best alignments that align the full length of
one sequence to the full length of a second
sequence.
There may be many members of this family, but
no other member has a higher score.
GCG Wisconsin Package GAP
The primary use of a global alignment algorithm is
when you really want the whole of two sequences to
be aligned, without truncation.
GAP could completely bypass a region of high local
homology, if a better (or even just as good) path can
be found in a different way.
This is problematic if one short sequence is aligned
against a longer one with internal repeats.
If there is weak or unknown similarity between two
sequences, a local alignment algorithm (BESTFIT) is
the better choice.
Use GAP only when you believe the similarity is over
the whole length.
Global Alignment vs. Local
Alignment
Global alignment is used when the overall
gene sequence is similar to another
sequence-often used in multiple sequence
alignment.

Clustal W algorithm
Local alignment is used when only a small
portion of one gene is similar to a
small portion of another gene.



BLAST
FASTA
Smith-Waterman algorithm