Pairwise Sequence Analysis-III
• Amino-acid substitution matrices
• PAM matrices
– Derivation
– Limitation
• BLOSUM matrices
Lecture 4 CS566
Amino-acid substitution matrix
• Goal
– To find the log-odds score log [Pjoint(x,y) / Pindependent(x,y)],
where Pindependent(x,y) = P(x) P(y)
– To find probabilistic measures of “interchangeability”
between amino acids
• Concepts
– Accepted mutation
• Replacement that does not disrupt function
– Markov chain (1st order)
• Next state (amino-acid) in time decided entirely by current
value of state
• “Odds of winning for team in play-offs” (Does not matter how
the team got there!)
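The log-odds goal stated above can be sketched in a few lines. This is only an illustration: the function name, the choice of base 2, and all probabilities are assumptions, not values from the lecture.

```python
import math

def log_odds(p_joint, p_x, p_y, base=2.0):
    """Log of the ratio between the probability of seeing x aligned with y
    and the probability expected if x and y occurred independently."""
    return math.log(p_joint / (p_x * p_y), base)

# Hypothetical probabilities, for illustration only.
score_enriched = log_odds(p_joint=0.02, p_x=0.1, p_y=0.1)   # pair seen more often than chance
score_depleted = log_odds(p_joint=0.005, p_x=0.1, p_y=0.1)  # pair seen less often than chance
```

A positive score marks a pair that is aligned more often than chance would predict; a negative score marks a pair that is aligned less often.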
Point Accepted Mutation (PAM) Matrix
• Pioneering work by Margaret Dayhoff et al.
(1978)
• Based on an evolutionary model
• PAM n matrix
– Scores based on allowing for an average of n
substitutions per 100 residues
– The larger the value of n, the greater the evolutionary
distance between the sequences
PAM Matrix Generation
• Assumptions
– Observed differences are atomic substitutions (“What you see
is what you got”): A=>G and not A=>S=>G
– Holds for sets of highly related sequences (>85%
similarity)
PAM Matrix Generation
• Build phylogenetic (“family”) tree for each set of
sequences to establish sequence of atomic
changes
• Count residue populations and substitutions
• Estimate probability of replacements for each
pair of residues
• Normalize to 1% average replacement to
generate the mutation probability matrix
• Generate PAM1 matrix
• Generate other PAM matrices (e.g., PAM250)
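The counting and normalization steps above can be sketched on a toy alphabet. Everything here is an assumption made for illustration: the three-letter alphabet, the counts, the variable names, and the convention that M[x][y] is the probability that x is replaced by y. It is a minimal sketch, not the original Dayhoff implementation.

```python
# Toy 3-letter alphabet; all counts below are hypothetical.
residues = ["A", "B", "C"]
# occurrences[r]: how often residue r appears in the aligned sequences
occurrences = {"A": 500, "B": 300, "C": 200}
# subs[(x, y)]: accepted substitutions between x and y, counted symmetrically
subs = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}

def count(x, y):
    """Symmetric lookup of the substitution count for the pair x, y."""
    return subs.get((x, y), subs.get((y, x), 0))

total = sum(occurrences.values())
freq = {r: occurrences[r] / total for r in residues}

# Relative mutability: substitutions involving r per occurrence of r.
mutability = {}
for r in residues:
    involved = sum(count(r, y) for y in residues if y != r)
    mutability[r] = involved / occurrences[r]

# Scale factor chosen so that, on average, 1% of residues are replaced.
scale = 0.01 / sum(freq[r] * mutability[r] for r in residues)

# M[x][y]: probability that x is replaced by y over one PAM unit.
M = {x: {} for x in residues}
for x in residues:
    for y in residues:
        if x != y:
            M[x][y] = scale * count(x, y) / occurrences[x]
    M[x][x] = 1.0 - scale * mutability[x]

# Average fraction of residues replaced; 0.01 by construction.
average_replacement = sum(freq[x] * (1.0 - M[x][x]) for x in residues)
```

Each row of M sums to 1, and the frequency-weighted chance of any change is 1%, which is what makes this toy matrix a PAM 1 mutation matrix in the sense of the steps above.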
Phylogenetic trees
[Figure: phylogenetic tree with nodes labeled C or D]
• Tree for a set of 4 sequences that have either C or D
at a certain position in the alignment
• Each substitution is typically double-counted, as C=>D as well as D=>C
• Counts to keep track of
– Frequency of each residue
– Frequency of each kind of substitution
– Frequency of each residue’s involvement in substitution
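The three counts listed above can be sketched for a single alignment position. The tree below is hypothetical (it is not a reconstruction of the figure), and the edge-list representation and variable names are assumptions.

```python
from collections import Counter

# Hypothetical tree at one alignment position. Each edge joins two nodes and
# is written as the pair of residues found at its two ends.
edges = [("C", "C"), ("C", "D"), ("C", "D"), ("D", "C"), ("D", "D")]
# Residue observed in each of the 4 present-day sequences (the leaves).
leaves = ["C", "D", "C", "D"]

residue_counts = Counter(leaves)   # frequency of each residue
substitution_counts = Counter()    # frequency of each kind of substitution
involvement = Counter()            # how often each residue takes part in a substitution

for a, b in edges:
    if a != b:
        # Double counting: the direction of change is not assumed,
        # so the edge is recorded as a=>b and as b=>a.
        substitution_counts[(a, b)] += 1
        substitution_counts[(b, a)] += 1
        involvement[a] += 1
        involvement[b] += 1
```

Double counting makes the substitution table symmetric, so C=>D and D=>C always receive the same count.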
PAM n% mutation matrix generation
• Multiply the PAM 1 mutation matrix by itself n times
(i.e., raise it to the n-th power) to obtain the PAM n% matrix
• Helps to model “what you see is not
what you got” by representing longer
evolutionary distances
– PAM 250 implies 250% average substitutions,
i.e., an average of 2.5 substitution events per
aligned position, and NOT a completely
different pair of protein sequences (repeated and reverse
substitutions leave many positions unchanged)
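The repeated multiplication can be sketched with plain nested lists. The 3-residue PAM 1 matrix is hypothetical, the helper names are assumptions, and rows are taken to be the original residue (so each row sums to 1).

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def pam_n(pam1, n):
    """Raise the PAM 1 mutation matrix to the n-th power."""
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result

# Hypothetical PAM 1 matrix: each residue has a 1% chance of changing per step.
pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.004, 0.006, 0.990],
]

pam250 = pam_n(pam1, 250)
# Probability that each residue is still (or again) itself after 250 steps.
unchanged = [pam250[i][i] for i in range(3)]
```

The diagonal of pam250 is smaller than that of pam1 but stays well above zero, which matches the point that 250% average substitution does not mean every residue differs.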
PAM n% matrix generation
• A given PAM n% mutation matrix is
converted to log-odds form by dividing
each entry by the relative abundance of
the replacing residue, taking the log, rounding, and
averaging the x=>y and y=>x scores
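The conversion to log-odds form can be sketched as follows. The two-letter alphabet, the abundances, and the matrix entries are hypothetical. The 10 × log10 scaling is the one usually associated with the Dayhoff matrices, and this sketch averages the two directions before rounding so that the result is an integer; both choices are assumptions rather than something the slide specifies.

```python
import math

residues = ["A", "B"]
freq = {"A": 0.6, "B": 0.4}  # hypothetical relative abundances
# M[x][y]: probability that x is replaced by y at the chosen PAM distance.
M = {"A": {"A": 0.8, "B": 0.2},
     "B": {"A": 0.3, "B": 0.7}}

def raw_score(x, y):
    """Scaled log of the mutation probability relative to chance."""
    return 10 * math.log10(M[x][y] / freq[y])

scores = {}
for x in residues:
    for y in residues:
        # Average the x=>y and y=>x scores, then round to an integer.
        scores[(x, y)] = round((raw_score(x, y) + raw_score(y, x)) / 2)
```

The averaging step guarantees a symmetric scoring matrix, so aligning x with y scores the same as aligning y with x.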
Point Accepted Mutation (PAM) Matrix
• Limitation
– Based on only one type of mutational event
– Ignores rarer types of mutations that are
observed only over longer periods of time
– Because of the above, model does not fit as
well for the more divergent sequences
BLOSUMx matrices
• Matrix scores for different evolutionary
distances derived independently
• Much larger dataset (better sampling)
• Conserved, ungapped regions of related sequences are
collected into BLOCKS; x is the % identity threshold used
to cluster sequences within a block
• Intrablock substitutions used to
characterize log odds
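A simplified sketch of how substitutions observed within a block can be turned into log-odds scores. The block, the variable names, and the half-bit scaling (2 × log2) are assumptions for illustration, and the clustering of sequences at x% is deliberately left out.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical ungapped block: one row per sequence, one column per position.
block = ["AVL", "AVI", "AVL", "GVI"]

# Count every residue pair observed within each column.
pair_counts = Counter()
for column in zip(*block):
    for a, b in combinations(column, 2):
        pair_counts[tuple(sorted((a, b)))] += 1

total_pairs = sum(pair_counts.values())
q = {pair: n / total_pairs for pair, n in pair_counts.items()}  # observed pair frequencies

# Background frequency of each residue implied by the pair frequencies.
p = Counter()
for (a, b), value in q.items():
    if a == b:
        p[a] += value
    else:
        p[a] += value / 2
        p[b] += value / 2

def score(a, b):
    """Log-odds score for an observed pair (raises KeyError for unseen pairs)."""
    pair = tuple(sorted((a, b)))
    expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return round(2 * math.log2(q[pair] / expected))
```

Because the scores come directly from pairs observed at the chosen similarity level, no extrapolation from a single short-distance matrix is involved.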
Choice of appropriate matrix
• Matrix should be chosen based on percent
similarity of sequences being analyzed
– PAM250 for 20% similarity
– PAM120 for 40% similarity
– PAM80 for 50% similarity
– PAM60 for 60% similarity
• BLOSUM?
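The PAM pairings listed above can be wrapped in a small lookup. Picking the entry with the nearest listed similarity is an assumption of this sketch, as are the names; the slide gives only the four pairings.

```python
# (percent similarity, matrix name) pairs taken from the list above.
PAM_BY_SIMILARITY = [(20, "PAM250"), (40, "PAM120"), (50, "PAM80"), (60, "PAM60")]

def choose_pam(percent_similarity):
    """Return the PAM matrix whose listed similarity is closest to the input."""
    nearest = min(PAM_BY_SIMILARITY, key=lambda entry: abs(entry[0] - percent_similarity))
    return nearest[1]

matrix_for_distant_pair = choose_pam(22)  # low similarity, so a high PAM number
matrix_for_close_pair = choose_pam(58)    # high similarity, so a low PAM number
```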