Amino Acid Scoring Matrices

Download Report

Transcript Amino Acid Scoring Matrices

Amino Acid Scoring
Matrices
Jason Davis
Overview


Protein synthesis/evolution
Computational sequence alignment
Smith-Waterman Algorithm
 BLAST


Amino Acid Scoring Matrices
PAM – Point Accepted Mutations
 BLOSUM – BLOck SUbstitution Matrix
 mPAM


Metric Conversions
Proteins


3-dimensional stuctures
Composed of amino acids chained
together




Can be represented as a 2dimensional sequence
20 different amino acids exist
Usually 100-1500 amino acids long
Have many different shapes and
functions

Function depends on both 3d shape
and aa sequence
Protein Synthesis

DNA: strand composed of 4 different base pairs


A, T, C, G
20 amino acids: 3 base pairs needed to encode
each amino acid

Degenerate coding
Signalling
Transcription/Translation
Protein
Protein Evolution

Protein ‘families’




Set of homologous proteins
Same function, different
composition
Similar structure
Identifying families


Pairwise sequence alignment
Multiple sequence alignment


NP-hard
Other approaches

Structural, experimental
Pairwise Sequence Alignment

Input




Global Alignment



2 sequences p, q of lengths m,n
20x20 Amino Acid Substitution Matrix
Insertion (gap) cost
Find optimal set of insertions such that the resulting
alignment (length < m+n) is optimal w.r.t. amino acid
substitution matrix
Difficult, less useful
Local Alignment

Find significant ‘hotspot’ in the alignment
Sequence Alignment Algorithms

Dynamic Programming Approaches



Global and Local variations
Provably Optimal
O(nm) space and time



‘banded’ heuristics can reduce the state space
FSA extensions allow varying penalties for gap openings and
gap extensions
Heuristics Approaches


Blast, Fasta
Sublinear time – look for statistical significance in small local
alignments between sequences
Substitution Matrices - PAM


Dayhoff, Schwartz, Orcutt (1978)
Step 1: extrapolate mutation probabilites from 1 step in
evolutionary time




Pick a set of protein families (71)
Restrict proteins in each family to sequences with similarity
above a certain threshold (>85%)
Build a phylogenetic tree for each family
Extrapolate frequencies Aab that amino acids a, b evolved
from same amino acid


Aab and Aba assumed to be the same
Convert frequencies to probabilities

p(a|b) = Bab = Aab/∑cAac
Substitution Matrices – PAM (2)

Step 2 – Infer greater evolutionary times

Dayhoff defined a PAM1 matrix to have 1%
expected substitutions


For each row, scale off-diagonals and adjust diagonals to
keep the matrix row stochastic
To infer larger evolutionary times, we can view
formed matrix C as a 20-state Markov Chain

Cn is the result of performing n-steps in the Markov
Process
Substitution Matrices – PAM (3)

Create odds ratio of

1) the event that 2 amino acids i,j, evolved from the same ancestor, x



2) the event that the 2 amino acids align at random


fi = observed frequency of amino acid i
p(i,j have same ancestor) = ∑xfx Pr{x→i} Pr{x→j}
= ∑xfx (CN)ix (CN)jx
= ∑x (CN)ix fx (CN)jx
= ∑x (CN)ix fj (CN)xj
= fj (C2N)ij
p(independent alignment of i,j) = fi * fj
Final log odds ratio:



Dij = average[log((CN)ij / fi), log(CN)ji / fj))
The log allows for an additive model
Final numbers are rounded to nearest integer
PAM250

Different values on the
diagonal correspond do
mutability potential
BLOSUM








Henikoff & Henikoff, 1992
Uses aligned, ungapped blocks within protein families
that have similarity greater than some level L%
qa = ∑bAab / ∑c,d Acd
pab = Aab / ∑c,d Acd
S(a,b) = log(pab / qaqb)
Final entries are rounded
Blosum62 (L=62), Blosum50 (L=50)
More direct approach, usually yields better results
Log-Odds Similarity Matrix
Properties


Negative numbers needed for Smith-Waterman local alignment
algorithm
Nice probabilistic interpretation


Amino acid substitutions assumed independent
Attempts to metricize these matrices


Taylor, Jones 93: used various algebraic manipulations to arrive at a metric
matrix with minimal disortion
Dij = a – Sij



Larger values of a yielded better metrics at the cost of high dimensionality
Constant Shift Embedding
Linial, et. al. constructed a near metric over aligned segments of
length 50


D(u,v) = S(u,u) + S(v,v) – 2*S(u,v)
10-7 error rate
mPAM


Metric substitution model
Measures the expected time per 250 mutations among
100 amino acids



Same rate as PAM250
Exponential distribution assumed: f(t) = 1 – e-λt
Given pairwise substitution rates p(a,b)




Solve for λ: f(1) = 1-e- λ = p(a,b)
Expected time t of an event occuring in an exponential distribution
is 1/ λ
mPAM(a,b) = round(1/ λ)
Two values needed to be adjusted to form a metric

Rounding error?
mPAM (2)

Seller’s Theorem:


Optimized for BLAST-like lookup


If a pairwise alignment is found using a metric, resulting
alignment scores are also metrics
Smaller alignments
Difficult to compare with other similarity matrices

Dynamic programming algorithms rely on negative values in
the similarity matrix

Probabilistic interpretation: larger positive alignments are statistically
significant
mPAM Disadvantages

d(x,x) = 0

This does not capture the relative mutability among different
amino acids




PAM/BLOSUM capture this with different positive values along the
diagonal
Do amino acids substitute according to an exponential
distribution?
Amino Acid Substitution may be inherently non-metric
Comparison to BLOSUM?