Slides 4 - University of Florida


CAP5510 – Bioinformatics
Substitution Patterns
Tamer Kahveci
CISE Department
University of Florida
1
Goals
• Understand how mutations occur
• Learn models for predicting the number of
mutations
• Understand why scoring matrices are used
and how they are derived
• Learn major scoring matrices
2
Why Substitution Patterns?
• Mutations happen because of mistakes in DNA
replication and repair.
• Our genetic code changes due to mutations
– Insert, delete, replace
• Three types of mutations
– Advantageous
– Disadvantageous
– Neutral
• We only observe substitutions that passed
selection process
3
Mutation Rates
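• A parent organism diverges into Organism A and Organism B over time T
• R = K/(2T)
– K: number of substitutions observed between A and B
– The factor 2 accounts for both lineages evolving independently for time T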
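The rate formula can be sketched in a few lines of Python. The function name and the example numbers (K = 0.2 substitutions per site, T = 80 million years) are hypothetical and chosen only for illustration.

```python
def substitution_rate(k, t):
    """Return R = K / (2T).

    k: number of substitutions separating organisms A and B
    t: time since both diverged from the parent organism
    """
    if t <= 0:
        raise ValueError("divergence time must be positive")
    # Both lineages accumulate changes for time t, hence the factor 2.
    return k / (2 * t)


# Hypothetical numbers: 0.2 substitutions per site over 80 million years.
rate = substitution_rate(0.2, 80e6)
print(rate)
```

With these assumed inputs the rate is on the order of 10^-9 substitutions per site per year.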
4
Functional Constraints
• Functional sites are less likely to mutate
– Noncoding = 3.33 (subs/10^9 yr)
– Coding = 1.58 (subs/10^9 yr)
• Indels about 10 times less likely than
substitutions
5
Nucleotide Substitutions and Amino
Acids
• Synonymous substitutions do not change amino acids
• Nonsynonymous do change
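• Degeneracy (a property of a codon position)
– Fourfold degenerate: any base at the position gives the same amino acid, e.g. third position of gly = {GGG, GGA, GGU, GGC}
– Twofold degenerate: two of the four bases give the same amino acid, e.g. third position of asp = {GAU, GAC}, glu = {GAA, GAG}
– Non-degenerate: every change alters the amino acid, e.g. first position of phe = UUU, leu = CUU, ile = AUU, val = GUU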
• Example substitution rates in human and mouse
– Fourfold degenerate: 2.35
– Twofold degenerate: 1.67
– Non-degenerate: 0.56
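As a rough illustration of degeneracy, the sketch below counts how many of the four bases at a codon position encode the same amino acid, using the standard genetic code. The function names are hypothetical; the codon table is built from the usual compact TCAG ordering.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def normalize(codon):
    """Accept RNA or DNA spelling (U or T)."""
    return codon.upper().replace("U", "T")


def is_synonymous(codon1, codon2):
    """True if both codons encode the same amino acid."""
    return CODON_TABLE[normalize(codon1)] == CODON_TABLE[normalize(codon2)]


def site_degeneracy(codon, position):
    """Number of bases (1 to 4) at `position` that keep the amino acid.

    1 means non-degenerate, 2 twofold degenerate, 4 fourfold degenerate.
    """
    codon = normalize(codon)
    amino_acid = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
        for base in BASES
    )


print(site_degeneracy("GGU", 2), site_degeneracy("GAU", 2), site_degeneracy("UUU", 0))
```

Positions are zero-based, so `position=2` is the third codon position.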
6
Predicting Substitutions
How can we count the true
number of substitutions ?
7
Jukes-Cantor Model
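• Each nucleotide can change into any other nucleotide with the same probability
– $P(A \to A', 1) = x$ for each $A' \neq A$
– $P(A \to A, 1) = 1 - 3x$
• Exercise: compute $P(A \to A', 2)$ and $P(A \to A, 2)$
• Recurrence: $P(A \to A, t+1) = 3\,P(A \to A', t)\,P(A' \to A, 1) + P(A \to A, t)\,P(A \to A, 1)$
• Solution: $P(A \to A, t) \approx \frac{1}{4} + \frac{3}{4} e^{-4xt}$
• Estimated number of substitutions: $K = -\frac{3}{4} \ln\left(1 - \frac{4}{3} f\right)$
– $f$: fraction of observed substitutions
• Limitation: oversimplification
Figure: nucleotides A, G, C, T; every substitution has the same probability x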
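The Jukes-Cantor relations can be checked numerically. The sketch below iterates the one-step recurrence, compares it with the exponential closed form, and evaluates the distance correction K. The function names and the values x = 0.001 and t = 100 are assumptions made for illustration.

```python
import math


def jc_same_prob_iterative(x, t):
    """P(A->A, t) obtained by applying the one-step recurrence t times."""
    p_same = 1.0  # at t = 0 the nucleotide is certainly unchanged
    for _ in range(t):
        # By symmetry, each of the three other nucleotides is equally likely.
        p_other = (1.0 - p_same) / 3.0
        p_same = 3 * p_other * x + p_same * (1 - 3 * x)
    return p_same


def jc_same_prob_closed_form(x, t):
    """Approximation P(A->A, t) ~ 1/4 + (3/4) exp(-4xt)."""
    return 0.25 + 0.75 * math.exp(-4 * x * t)


def jc_distance(f):
    """Estimated substitutions per site, K = -(3/4) ln(1 - 4f/3)."""
    if not 0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * f / 3)


print(jc_same_prob_iterative(0.001, 100), jc_same_prob_closed_form(0.001, 100))
print(jc_distance(0.3))
```

For small x the two probabilities agree closely, and K exceeds f because multiple substitutions at the same site hide earlier changes.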
8
Two Parameter Model
• Transition:
– purine <-> purine (A, G), pyrimidine <-> pyrimidine (C, T)
• Transversion:
– purine <-> pyrimidine
• Transitions are more likely than transversions.
• Use different probabilities for transitions and transversions.
Figure: purines (A, G) and pyrimidines (C, T)
9
Two Parameter Model
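• $P(A \to A, 1) = 1 - x - 2y$, where x is the transition probability and y is the probability of each transversion
• Compute $P(A \to A, 2)$:
$P(A \to A, 2) = (1 - x - 2y)\,P(A \to A, 1) + x\,P(A \to G, 1) + y\,P(A \to C, 1) + y\,P(A \to T, 1)$
• $P(A \to A, t) = \frac{1}{4} + \frac{1}{4} e^{-4yt} + \frac{1}{2} e^{-2(x+y)t}$
• $K = \frac{1}{2} \ln\frac{1}{1 - 2P - Q} + \frac{1}{4} \ln\frac{1}{1 - 2Q}$
– P, Q: fractions of transitions and transversions observed
Figure: A <-> G with probability x (transition); A <-> C and A <-> T with probability y (transversions)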
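A minimal sketch of the two-parameter distance: it counts the observed fractions of transitions (P) and transversions (Q) between two aligned, ungapped DNA sequences and plugs them into the formula for K. The function names and the two example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites showing a transition / transversion."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # change within the same chemical class
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions / len(seq1), transversions / len(seq1)


def two_parameter_distance(p, q):
    """K = 1/2 ln(1/(1 - 2P - Q)) + 1/4 ln(1/(1 - 2Q))."""
    return 0.5 * math.log(1 / (1 - 2 * p - q)) + 0.25 * math.log(1 / (1 - 2 * q))


p, q = transition_transversion_fractions("AACCGGTTAA", "AGCCGGTAAA")
print(p, q, two_parameter_distance(p, q))
```

When Q = 2P (all substitution types equally frequent), this expression reduces to the Jukes-Cantor estimate with f = P + Q.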
10
More Parameters ?
• Assign a different probability for each pair
of nucleotides
• Not harder to compute than simpler
models
• Not necessarily better than simpler models
11
Amino Acid substitutions (1)
• Harder to model than nucleotides
– An amino acid can be substituted for another in more
than one way
– The number of nucleotide substitutions needed to
transform one amino acid to another may differ
• Pro = CCC, leu = CUC, ile = AUC
– The likelihood of nucleotide substitutions may differ
• Asp = GAU, asn = AAU, his = CAU
– Amino acid substitutions may have different effects on
the protein function
12
Amino Acid substitutions (2)
• Mutation rates may vary greatly among
genes
– Nonsynonymous substitution may affect
functionality with smaller probability in some
genes
• Molecular clock (Zuckerkandl, Pauling)
– Mutation rates may differ between organisms, but
they remain almost constant over time.
13
Scoring Matrices
14
What is it & why ?
• Let alphabet contain N letters
– N = 4 and 20 for nucleotides and amino acids
• N x N matrix
• (i,j) shows the relationship between ith and jth
letters.
– Positive number if letter i is likely to mutate into letter j
– Negative otherwise
– Magnitude shows the degree of proximity
• Symmetric
15
The BLOSUM45 Matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -2  -2   0
R  -2   7   0  -1  -3   1   0  -2   0  -3  -2   3  -1  -2  -2  -1  -1  -2  -1  -2
N  -1   0   6   2  -2   0   0   0   1  -2  -3   0  -2  -2  -2   1   0  -4  -2  -3
D  -2  -1   2   7  -3   0   2  -1   0  -4  -3   0  -3  -4  -1   0  -1  -4  -2  -3
C  -1  -3  -2  -3  12  -3  -3  -3  -3  -3  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   6   2  -2   1  -2  -2   1   0  -4  -1   0  -1  -2  -1  -3
E  -1   0   0   2  -3   2   6  -2   0  -3  -2   1  -2  -3   0   0  -1  -3  -2  -3
G   0  -2   0  -1  -3  -2  -2   7  -2  -4  -3  -2  -2  -3  -2   0  -2  -2  -3  -3
H  -2   0   1   0  -3   1   0  -2  10  -3  -2  -1   0  -2  -2  -1  -2  -3   2  -3
I  -1  -3  -2  -4  -3  -2  -3  -4  -3   5   2  -3   2   0  -2  -2  -1  -2   0   3
L  -1  -2  -3  -3  -2  -2  -2  -3  -2   2   5  -3   2   1  -3  -3  -1  -2   0   1
K  -1   3   0   0  -3   1   1  -2  -1  -3  -3   5  -1  -3  -1  -1  -1  -2  -1  -2
M  -1  -1  -2  -3  -2   0  -2  -2   0   2   2  -1   6   0  -2  -2  -1  -2   0   1
F  -2  -2  -2  -4  -2  -4  -3  -3  -2   0   1  -3   0   8  -3  -2  -1   1   3   0
P  -1  -2  -2  -1  -4  -1   0  -2  -2  -2  -3  -1  -2  -3   9  -1  -1  -3  -3  -3
S   1  -1   1   0  -1   0   0   0  -1  -2  -3  -1  -2  -2  -1   4   2  -4  -2  -1
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -1  -1   2   5  -3  -1   0
W  -2  -2  -4  -4  -5  -2  -3  -2  -3  -2  -2  -2  -2   1  -3  -4  -3  15   3  -3
Y  -2  -1  -2  -2  -3  -1  -2  -3   2   0   0  -1   0   3  -3  -2  -1   3   8  -1
V   0  -2  -3  -3  -1  -3  -3  -3  -3   3   1  -2   1   0  -3  -1   0  -3  -1   5
16
Scoring Matrices for DNA
Identity:
    A   C   G   T
A   1   0   0   0
C   0   1   0   0
G   0   0   1   0
T   0   0   0   1

BLAST:
    A   C   G   T
A   1  -3  -3  -3
C  -3   1  -3  -3
G  -3  -3   1  -3
T  -3  -3  -3   1

Transitions & transversions:
    A   C   G   T
A   1  -5  -1  -5
C  -5   1  -5  -1
G  -1  -5   1  -5
T  -5  -1  -5   1
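To see how the three DNA matrices differ in practice, the sketch below builds each one from its match, transition, and transversion scores and sums the scores over an ungapped alignment. The helper names and the example sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def make_dna_matrix(match, transition, transversion):
    """Build a symmetric 4x4 scoring matrix as a dict keyed by (a, b)."""
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                matrix[(a, b)] = match
            elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                matrix[(a, b)] = transition
            else:
                matrix[(a, b)] = transversion
    return matrix


IDENTITY = make_dna_matrix(1, 0, 0)
BLAST = make_dna_matrix(1, -3, -3)
TRANSITION_TRANSVERSION = make_dna_matrix(1, -1, -5)


def score_ungapped(seq1, seq2, matrix):
    """Sum the matrix entries over the aligned columns."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# One A <-> G transition and four matches.
for name, m in [("identity", IDENTITY), ("BLAST", BLAST),
                ("ts/tv", TRANSITION_TRANSVERSION)]:
    print(name, score_ungapped("ACGTA", "GCGTA", m))
```

The same mismatch is penalized differently by each matrix, which is the point of distinguishing transitions from transversions.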
17
Scoring Matrices for Amino Acids
• Chemical similarities
– Non-polar, hydrophobic (G, A, V, L, I, M, F, W, P)
– Polar, hydrophilic (S, T, C, Y, N, Q)
– Electrically charged (D, E, K, R, H)
– Requires expert knowledge
• Genetic code: Nucleotide substitutions
– E: GAA, GAG
– D: GAU, GAC
– F: UUU, UUC
• Actual substitutions
– PAM
– BLOSUM
18
Scoring Matrices: Actual
Substitutions
• Manually align proteins
• Look for amino acid substitutions
• Entry ~ log(freq(observed)/freq(expected))
• Log-odds matrices
19
PAM Matrices
(Dayhoff 1972)
20
PAM
• PAM = “Point Accepted Mutation”: we are
interested only in mutations that have
been “accepted” by natural selection
• An accepted mutation is a mutation that
occurred and was positively selected by
the environment; that is, it did not cause
the demise of the particular organism
where it occurred.
21
Interpretation of PAM matrices
• PAM-1 : one substitution per 100 residues (a
PAM unit of time)
• “Suppose I start with a given polypeptide sequence M at
time t, and observe the evolutionary changes in the
sequence until 1% of all amino acid residues have
undergone substitutions at time t+n. Let the new
sequence at time t+n be called M’. What is the
probability that a residue of type j in M will be replaced
by i in M’?”
• PAM-K : K PAM time units
22
PAM Matrices (1)
• Starts with a multiple sequence alignment
of very similar (>85% identity) proteins.
Assumed to be homologous
• Compute the relative mutability, $m_i$, of
each amino acid
– e.g. $m_A$ = how many times was alanine
substituted with anything else, on
average?
23
Relative Mutability
• ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
• Across all pairs of sequences, there are 28
A → X substitutions
• There are 10 ALA residues, so $m_A = 28/10 = 2.8$
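The counts on this slide can be reproduced with a short script. It compares every pair of sequences column by column, counts the mismatches involving the chosen residue, and divides by the number of times that residue occurs. The function name is hypothetical; the alignment is the one shown above.

```python
from itertools import combinations

ALIGNMENT = [
    "ACGCTAFKI",
    "GCGCTAFKI",
    "ACGCTAFKL",
    "GCGCTGFKI",
    "GCGCTLFKI",
    "ASGCTAFKL",
    "ACACTAFKL",
]


def relative_mutability(alignment, residue):
    """Return (substitutions, occurrences, mutability) for one residue."""
    substitutions = 0
    for column in zip(*alignment):
        # Every pair of sequences contributes one comparison per column.
        for a, b in combinations(column, 2):
            if a != b and residue in (a, b):
                substitutions += 1
    occurrences = sum(seq.count(residue) for seq in alignment)
    return substitutions, occurrences, substitutions / occurrences


print(relative_mutability(ALIGNMENT, "A"))
```

For alanine this gives 28 substitutions over 10 residues, matching the slide.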
24
Pam Matrices (2)
• Construct a phylogenetic tree for the sequences in the
alignment
Figure: phylogenetic tree over ACGCTAFKI, GCGCTAFKI, GCGCTGFKI, ACGCTAFKL, GCGCTLFKI, ASGCTAFKL, ACACTAFKL; branches are labeled with the substitutions A → G, A → G, I → L, A → L, C → S, G → A
• Calculate substitution frequencies $F_{X,Y}$
– e.g. $F_{G,A} = 3$
• Substitutions may have occurred either way, so A → G
also counts as G → A.
25
Mutation Probabilities
• $M_{i,j}$ represents the probability of a $j \to i$
substitution:
$M_{i,j} = \frac{m_j F_{i,j}}{\sum_i F_{i,j}}$
• Example: $M_{G,A} = \frac{2.8 \times 3}{4} = 2.1$
Figure: the phylogenetic tree from the previous slide, with branch substitutions A → G, A → G, I → L, A → L, C → S, G → A
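The example value can be recomputed from the branch labels of the tree. The sketch below counts each substitution in both directions, then applies the formula for the mutation probability. The function names are hypothetical, and the result is the unscaled quantity used on the slide; Dayhoff's full procedure also applies a scaling constant so that PAM-1 corresponds to 1% change.

```python
from collections import Counter

# Substitutions read off the branches of the example tree.
BRANCH_SUBSTITUTIONS = [
    ("A", "G"), ("A", "G"), ("I", "L"), ("A", "L"), ("C", "S"), ("G", "A"),
]


def substitution_counts(branch_substitutions):
    """F[(i, j)]: each observed substitution is counted in both directions."""
    counts = Counter()
    for a, b in branch_substitutions:
        counts[(a, b)] += 1
        counts[(b, a)] += 1
    return counts


def mutation_probability(i, j, counts, mutability_j):
    """M_ij = m_j * F_ij / sum over i of F_ij."""
    total_j = sum(n for (_, source), n in counts.items() if source == j)
    return mutability_j * counts[(i, j)] / total_j


F = substitution_counts(BRANCH_SUBSTITUTIONS)
print(F[("G", "A")], mutation_probability("G", "A", F, mutability_j=2.8))
```

Here `F[("G", "A")]` is 3 (two A → G and one G → A), and the denominator is 4 because alanine is also involved in one A → L substitution.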
26
The PAM Matrix
• The entries of the scoring matrix are the
$M_{i,j}$ values divided by the frequency of
occurrence, $f_i$, of residue $i$.
• $f_G$ = 10 GLY / 63 residues = 0.1587
• $R_{G,A} = \log_{10}(2.1/0.1587) = \log_{10}(13.23) \approx 1.122$
• Log-odds matrix
• Diagonal entries are $M_{j,j} = 1 - m_j$
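A small sketch of the log-odds step, assuming a base-10 logarithm and using the slide's inputs (a mutation probability of 2.1 for A → G and 10 glycines among 63 residues). The function name is hypothetical.

```python
import math


def pam_log_odds(m_ij, f_i):
    """Log-odds entry R_ij = log10(M_ij / f_i)."""
    if m_ij <= 0 or f_i <= 0:
        raise ValueError("both arguments must be positive")
    return math.log10(m_ij / f_i)


f_g = 10 / 63  # frequency of glycine in the example alignment
r_ga = pam_log_odds(2.1, f_g)
print(round(f_g, 4), round(r_ga, 3))
```

A positive entry means the substitution is observed more often than expected from the residue frequency alone.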
27
Computation of PAM-K
• Assume that changes at time T+1 are
independent of the changes at time T.
• Markov chain
• $P(A \to B) = \sum_X P(A \to X)\,P(X \to B)$
• PAM-K = (PAM-1)^K
• PAM-250 is most commonly used
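The Markov-chain step amounts to raising the one-step probability matrix to the K-th power. The sketch below does this with plain lists. The 4-state matrix is a toy stand-in with an assumed off-diagonal probability of 0.01, not a real 20 x 20 PAM-1 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, k):
    """Return m multiplied by itself k times (identity for k = 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


# Toy one-step matrix: each state changes to any other with probability x.
x = 0.01
toy_step = [[1 - 3 * x if i == j else x for j in range(4)] for i in range(4)]

toy_2 = mat_power(toy_step, 2)
toy_250 = mat_power(toy_step, 250)
print(toy_2[0][0], toy_250[0][0])
```

After many steps every row approaches the uniform distribution, which illustrates why large K suits distantly related sequences: little of the original signal remains.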
28
PAM - Discussion
• PAM-K with small K is better for closely related
sequences; large K is better for distantly related
sequences
• Biased towards closely related sequences since it starts
from highly similar sequences (BLOSUM solves this)
• If Mi,j is very small, we may not have a large enough
sample to estimate the real probability. When we
multiply the PAM matrices many times, the error is
magnified.
• Mutation rate may change from one gene to another
29
BLOSUM Matrices
Henikoff & Henikoff 1992
30
BLOSUM Matrix
• Begin with a set of protein sequences and obtain blocks.
– ~2000 blocks from 500 families of related proteins
– More data than PAM
• A block is the ungapped alignment of a highly conserved region of a
family of proteins.
• MOTIF program is used to find blocks
• Substitutions in these blocks are used to compute BLOSUM matrix
block 1:
WWYIR
WFYVR
WYYVR
WYFIR
block 2:
CASILRKIYIYGPV
CASILRHLYHRSPA
AAAVARHIYLRKTV
AASICRHLYIRSPA
…
block 3:
GVSRLRTAYGGRKNRG
GVGSITKIYGGRKRNG
GVGRLRKVHGSTKNRG
GIGSFEKIYGGRRRRG
…
31
Constructing the Matrix
• Count the frequency of occurrence of each amino acid. This gives
the background distribution $p_a$
• Count the number of times amino acid a is aligned with amino acid
b: $f_{ab}$
– A block of width w and depth s contributes $ws(s-1)/2 = n_p$ pairs
• Compute the occurrence probability of each pair
– $q_{ab} = f_{ab} / n_p$
• Compute the probability of occurrence of amino acid a
– $p_a = q_{aa} + \sum_{b \neq a} q_{ab}/2$
• Compute the expected probability of occurrence of each pair
– $e_{ab} = 2 p_a p_b$ if $a \neq b$, $p_a p_b$ otherwise
• Compute the log likelihood ratios, normalize, and round.
– $2 \log_2(q_{ab} / e_{ab})$
32
Constructing the Matrix: Example
• Example: one alignment column containing 9 A and 1 S
• $f_{AA} = 36$, $f_{AS} = 9$
• Observed frequencies of pairs
– $q_{AA} = f_{AA}/(f_{AA} + f_{AS}) = 36/45 = 0.8$
– $q_{AS} = 9/45 = 0.2$
• Expected frequencies of letters
– $p_A = q_{AA} + q_{AS}/2 = 0.9$
– $p_S = q_{AS}/2 = 0.1$
• Expected frequencies of pairs
– $e_{AA} = p_A \times p_A = 0.81$
– $e_{AS} = 2 \times p_A \times p_S = 0.18$
• Matrix entries
– $M_{AA} = 2 \log_2(q_{AA}/e_{AA}) = -0.04 \approx 0$
– $M_{AS} = 2 \log_2(q_{AS}/e_{AS}) = 0.3 \approx 0$
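The worked example can be reproduced step by step. The sketch below handles a single alignment column (a block of width 1), which is a simplification: a real BLOSUM computation pools pair counts over many blocks and clusters similar sequences first. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(column):
    """f_ab: number of sequence pairs aligning residue a with residue b."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


def blosum_scores(column):
    """Return (f, q, p, scores) for one alignment column."""
    f = pair_counts(column)
    n_pairs = sum(f.values())
    q = {pair: n / n_pairs for pair, n in f.items()}

    # p_a = q_aa + sum over b != a of q_ab / 2
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    scores = {}
    for (a, b), q_ab in q.items():
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(q_ab / e_ab)
    return f, q, dict(p), scores


f, q, p, scores = blosum_scores("AAAAAAAAAS")
print(f[("A", "A")], f[("A", "S")])
print(round(scores[("A", "A")], 2), round(scores[("A", "S")], 1))
```

Pairs that never occur in the column (S with S here) are absent from the result, since their log-odds score is undefined.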
33
Computation of BLOSUM-K
• Different levels of the BLOSUM matrix can be created by
differentially weighting the degree of similarity between
sequences. For example, a BLOSUM62 matrix is
calculated from protein blocks such that if two
sequences are more than 62% identical, then the
contribution of these sequences is weighted to sum to
one. In this way the contributions of multiple entries of
closely related sequences are reduced.
• Larger numbers are used to measure recent divergence;
the default is BLOSUM62
34
BLOSUM 62 Matrix
Check scores for
– MILV: small hydrophobic
– NDEQ: acid, hydrophilic
– HRK: basic
– FYW: aromatic
– STPAG: small hydrophilic
– C: sulphydryl
35
PAM vs. BLOSUM
Equivalent PAM and BLOSUM matrices:
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
BLOSUM62 is the default matrix to use.
36
PAM vs. BLOSUM
PAM
• Built from global alignments
• Built from a small amount of data
• Counting is based on minimum replacement or maximum parsimony
• Performs better for finding global alignments and remote homologs
• Higher PAM series means more divergence
BLOSUM
• Built from local alignments
• Built from a vast amount of data
• Counting is based on groups of related sequences counted as one
• Better for finding local alignments
• Lower BLOSUM series means more divergence
37