Transcript Similarity
Alignment methods
April 17, 2007
Quiz 1—Question on databases
Learning objectives: Understand the difference between
identity, similarity and homology. Understand how PAM
scoring matrices are constructed. Understand the difference
between global alignment and local alignment. Become familiar
with the Dotter software program.
Workshop: Import sequences of interest from GenBank,
place them in FASTA format, and align the sequences using
the Dotter program.
Homework #4 due on Tues, April 24 at the beginning of
class.
Purpose of finding differences and similarities
of amino acids in two proteins.
Infer structural information
Infer functional information
Infer evolutionary relationships
Evolutionary Basis of Sequence
Alignment
1. Similarity: Quantity that relates how much
two amino acid sequences are alike.
2. Identity: Quantity that describes how much
two sequences are alike in the strictest terms.
3. Homology: a conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
Evolutionary Basis of Sequence
Alignment (Cont. 1)
Why are there regions of identity?
1) Conserved function: residues participate in the
reaction.
2) Structural: for example, conserved cysteine
residues that form a disulfide linkage.
3) Historical: residues that are conserved solely
due to a common ancestor gene.
Identity Matrix
     A  C  I  L
A    1
C    0  1
I    0  0  1
L    0  0  0  1
Simplest type of scoring matrix
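As a minimal sketch of how the identity matrix above is used, the following scores two sequences position by position and reports percent identity. It assumes the two sequences are already aligned, have equal length, and contain no gaps; the function names are illustrative only.

```python
def identity_score(a: str, b: str) -> int:
    """Identity matrix lookup: 1 if the residues are identical, 0 otherwise."""
    return 1 if a == b else 0


def percent_identity(seq1: str, seq2: str) -> float:
    """Percent of aligned positions that hold identical residues."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(identity_score(a, b) for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


# Three of the four positions match (I vs L is scored 0).
example = percent_identity("ACIL", "ACLL")
print(example)
```

Note that the I/L pair contributes nothing here, even though the two residues are chemically similar.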
Similarity
It is easy to score if an amino acid is identical to another (the
score is 1 if identical and 0 if not). However, it is not easy to
give a score for amino acids that are somewhat similar.
[Figure: chemical structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or
something in between?
One is mouse trypsin and the other is crayfish trypsin.
They are homologous proteins. The sequences share 41% identity.
Evolutionary Basis of Sequence
Alignment (Cont. 2)
Note: it is possible that two proteins share a high degree of
similarity but have two different functions. For example,
human gamma-crystallin is a lens protein that has no known
enzymatic activity. It shares a high percentage of identity with
E. coli quinone oxidoreductase. These proteins likely had a
common ancestor but their functions diverged.
Analogous to railroad car and diner function.
Orthologs vs Paralogs
Two proteins that have a common ancestor
that exist in different species are said to be
orthologs.
Two proteins with a common ancestor that
exist in the same species are said to be
paralogs.
Modular nature of proteins
The previous alignment was global. However,
many proteins do not display global patterns of
similarity. Instead, they possess local regions of
similarity.
Proteins can be thought of as assemblies of
modular domains. It is thought that this may, in
some cases, be due to an evolutionary process
known as exon shuffling.
Modular nature of proteins (cont. 1)
Gene A: Exon 1a, Exon 2a
Duplication of Exon 2a
Gene A: Exon 1a, Exon 2a, Exon 2a
Exchange with Gene B
Gene B: Exon 1b, Exon 2b, Exon 2b
Gene A: Exon 1a, Exon 2a, Exon 3 (Exon 2b from Gene B)
Gene B: Exon 1b, Exon 2b, Exon 3 (Exon 2a from Gene A)
Scoring Matrices
Importance of scoring matrices
Scoring matrices appear in all analyses involving
sequence comparisons.
The choice of matrix can strongly influence the
outcome of the analysis.
Scoring matrices implicitly represent a particular
theory of relationships.
Understanding theories underlying a given scoring
matrix can aid in making proper choice of which
matrix to use.
Scoring Matrices
When we consider scoring matrices, we
encounter the convention that matrices have
numeric indices corresponding to the rows and
columns of the matrix.
For example, M11 refers to the entry at the first
row and the first column. In general, Mij refers
to the entry at the ith row and the jth column. To
use this for sequence alignment, we simply
associate a numeric value to each letter in the
alphabet of the sequence.
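The indexing convention can be sketched as follows: each letter is mapped to a row/column number, and the score for a pair of letters is read from the matrix at those indices. The four-letter alphabet and the identity values are taken from the identity matrix slide; the names `ALPHABET`, `INDEX` and `score` are made up for this example. Python counts from 0, so the slide's M11 corresponds to `MATRIX[0][0]`.

```python
# Toy alphabet; a real protein matrix would use all 20 amino acids.
ALPHABET = "ACIL"

# Associate a numeric index with each letter of the alphabet.
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}

# Identity scoring matrix: 1 on the diagonal, 0 elsewhere.
MATRIX = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def score(a: str, b: str) -> int:
    """Look up the matrix entry for letters a (row) and b (column)."""
    return MATRIX[INDEX[a]][INDEX[b]]


print(score("I", "I"), score("I", "L"))
```

Swapping in a different scoring matrix only changes the contents of `MATRIX`; the lookup stays the same.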
Two major scoring matrices for amino acid
sequence comparisons
PAM: derived from sequences known to be
closely related (e.g. proteins from
chimpanzees and humans). PAM1 was
created from empirical data and the other PAMs
were mathematically derived.
BLOSUM: derived from sequences that are not
closely related (e.g. E. coli and human), using
data stored in the BLOCKS database.
The Point-Accepted-Mutation (PAM) model
of evolution and the PAM scoring matrix
Started by Margaret Dayhoff, 1978
A series of matrices describing the extent to
which two amino acids have changed
during evolution.
Proteins were aligned by eye and then the
number of times an amino acid was
substituted in different species was counted.
Protein families used to construct
Dayhoff’s scoring matrix
Protein                 PAMs per 100 million years
IgG kappa C region      37
Kappa casein            33
Serum Albumin           26
Cytochrome C            0.9
Histone H3              0.14
Histone H4              0.10
Numbers of accepted point mutations, multiplied by 10
Axes: original amino acid, replacement amino acid
Entries for alanine (A) paired with each other amino acid:
R 30    N 109   D 154   C 33    Q 93    E 266   G 579
H 21    I 66    L 95    K 57    M 29    F 20    P 345
S 772   T 590   W 0     Y 20    V 365
[Table: remaining entries of the 20 x 20 lower-triangular matrix are not legible in this copy]
Calculation of relative mutability of
amino acid
1. Find how often each amino acid changes to another
amino acid at a given position in the proteins.
2. Divide that frequency of change by the frequency with
which the original amino acid (“j”) occurs in all
proteins studied. This is called the “mutability”.
3. Determine the constant that scales the alanine
mutability to 100.
4. Multiply the 19 other amino acid mutabilities by the same
constant. This is called the relative mutability.
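The calculation can be sketched in a few lines. The counts below are invented purely for illustration and are not Dayhoff's data; the function name and the dictionaries are likewise hypothetical.

```python
def relative_mutability(changes, occurrences, reference="A"):
    """Scale mutabilities so that the reference amino acid (alanine) is 100.

    changes[aa]     -- number of observed changes away from amino acid aa
    occurrences[aa] -- number of times aa occurs in the proteins studied
    """
    # Mutability: changes divided by how often the amino acid occurs.
    mutability = {aa: changes[aa] / occurrences[aa] for aa in changes}
    # Constant that brings the alanine mutability to 100.
    scale = 100.0 / mutability[reference]
    # Apply the same constant to every amino acid.
    return {aa: m * scale for aa, m in mutability.items()}


# Made-up counts for three amino acids (illustration only).
toy_changes = {"A": 100, "N": 75, "W": 10}
toy_occurrences = {"A": 1000, "N": 500, "W": 500}

relative = relative_mutability(toy_changes, toy_occurrences)
print({aa: round(value) for aa, value in relative.items()})
```

With these toy numbers the residue that changes more often per occurrence than alanine ends up above 100, and the one that rarely changes ends up well below it.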
Relative mutabilities of amino acids
Asn  134        His  66
Ser  120        Arg  65
Asp  106        Lys  56
Glu  102        Pro  56
Ala  100        Gly  49
Thr   97        Tyr  41
Ile   96        Phe  41
Met   94        Leu  40
Gln   93        Cys  20
Val   74        Trp  18
Why are the mutabilities different?
High mutability: a similar amino acid can
replace the residue (e.g. Asp for Glu).
Low mutability: the residue is unique and cannot
readily be replaced (e.g. Cys, Trp).
Creation of a mutation probability
matrix
Used accepted mutation data from earlier
slide and the mutability of each amino acid
in nature to create a mutation probability
matrix.
Mij shows the probability that an original
amino acid j (in columns) will be replaced
by amino acid i (in rows) over a defined
evolutionary interval. For PAM1, 1% of the
amino acids have changed.
PAM1 mutational probability matrix
. . .
Values of each column will sum to 10,000
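One way the construction could be sketched, assuming the usual description of Dayhoff's procedure (the slides do not give the formula): each off-diagonal entry is the column's share of accepted mutations scaled by the mutability of the original amino acid, and the diagonal is whatever probability is left. The three-letter alphabet, counts, mutabilities and frequencies below are invented for illustration.

```python
def mutation_probability_matrix(counts, mutability, freqs, change=0.01):
    """Build M[i][j]: probability that original j is replaced by i.

    counts     -- symmetric table of accepted mutations (diagonal unused)
    mutability -- relative mutability of each amino acid
    freqs      -- frequency of each amino acid (sums to 1)
    change     -- overall fraction of residues changed (0.01 for PAM1)
    """
    n = len(counts)
    # Scale factor chosen so the composition-weighted change equals `change`.
    lam = change / sum(f * m for f, m in zip(freqs, mutability))
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_total = sum(counts[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * mutability[j]  # residue j is unchanged
            else:
                M[i][j] = lam * mutability[j] * counts[i][j] / col_total
    return M


# Invented toy data for a three-letter alphabet (illustration only).
toy_counts = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
toy_mutability = [100.0, 50.0, 20.0]
toy_freqs = [0.5, 0.3, 0.2]

toy_M = mutation_probability_matrix(toy_counts, toy_mutability, toy_freqs)
column_sums = [sum(toy_M[i][j] for i in range(3)) for j in range(3)]
print([round(10000 * s) for s in column_sums])
```

Multiplying every entry by 10,000, as on the slide, gives columns that each sum to 10,000.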
The Point-Accepted-Mutation (PAM) model
of evolution and the PAM scoring matrix
A 1-PAM unit is equivalent to 1 mutation found in a
stretch of 2 sequences each containing 100 amino acids
that are aligned
Example 1:
..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
  |||||||||||||| |||||||||||||||||||||||||||||||||||
..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
length = 100, 1 Mismatch, PAM distance = 1
A k-PAM unit is equivalent to k 1-PAM units (or $M^k$, the 1-PAM matrix M multiplied by itself k times).
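Raising a 1-PAM matrix to the k-th power can be sketched with plain matrix multiplication. The 2 x 2 matrix below is a made-up stand-in for PAM1 on a two-letter alphabet (its columns sum to 1 and about 1% of residues change per step); it is not real data.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, k):
    """Return M multiplied by itself k times (identity matrix for k = 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(M, result)
    return result


# Hypothetical 1-PAM matrix: column j holds the fate of original residue j.
toy_pam1 = [[0.99375, 0.025],
            [0.00625, 0.975]]

toy_pam2 = mat_power(toy_pam1, 2)
toy_pam250 = mat_power(toy_pam1, 250)
print([[round(x, 3) for x in row] for row in toy_pam250])
```

After many steps the columns of the toy matrix become nearly identical: the original residue is almost forgotten, which is why high-numbered PAM matrices suit distantly related sequences.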
The Point-Accepted-Mutation (PAM) model
of evolution and the PAM scoring matrix
Observed % difference    Evolutionary distance in PAMs
 1                         1
 5                         5
10                        11
20                        23
40                        56
50                        80
60                       112
70                       159
80                       246
Final Scoring Matrix is the Log-Odds Scoring Matrix
$S(a,b) = 10 \log_{10}(M_{ab} / P_b)$
a, b: the original and replacement amino acids
$M_{ab}$: entry of the mutational probability matrix
$P_b$: frequency of amino acid b
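A minimal sketch of the log-odds conversion follows. It assumes the convention used earlier for the mutation probability matrix (columns hold the original amino acid, rows the replacement) and divides by the frequency of the replacement residue, which is my understanding of Dayhoff's formulation. The two-letter matrix and frequencies are invented for illustration.

```python
import math


def log_odds_matrix(M, freqs):
    """Convert mutation probabilities to scores: 10 * log10(M[i][j] / freqs[i]).

    M[i][j]  -- probability that original residue j is replaced by residue i
    freqs[i] -- background frequency of residue i (chance of seeing i at random)
    """
    n = len(M)
    return [[10.0 * math.log10(M[i][j] / freqs[i]) for j in range(n)]
            for i in range(n)]


# Hypothetical two-letter example: a common residue (80%) and a rare one (20%).
toy_M = [[0.99375, 0.025],
         [0.00625, 0.975]]
toy_freqs = [0.8, 0.2]

toy_S = log_odds_matrix(toy_M, toy_freqs)
print([[round(s, 1) for s in row] for row in toy_S])
```

In this toy case the rare residue gets the larger positive score when conserved, and the two off-diagonal scores come out equal, so the scoring matrix is symmetric. Positive scores mean the pair is seen more often than expected by chance, negative scores less often.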
Summary of PAM Scoring Matrix
PAM = a unit of evolution (1 PAM = 1 point mutation/100
amino acids)
Accepted Mutation means fixed point mutation
Comparison of 71 groups of closely related proteins
yielding 1,572 changes. (>85% identity)
Different PAM matrices are derived from the PAM 1
matrix by matrix multiplication.
The matrices are converted to log odds scoring matrices.
(Frequency of change divided by probability of chance
alignment converted to log base 10.)
A PAM 250 matrix is roughly equivalent to 20% identity in
two sequences.
The Dotter Program
• Program consists of three components:
•Sliding window
•A table that gives a score for each amino acid match
•A graph that converts the score to a dot of certain density.
The higher the density the higher the score.
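The three components can be sketched as below. This is not Dotter's actual algorithm: it assumes an odd-width window centred on each diagonal position, uses simple identity scoring in place of a real scoring table, and prints characters of increasing density instead of grey-scale dots. All names are hypothetical.

```python
def dot_matrix(seq1, seq2, window=3):
    """Score every (i, j) pair by summing matches along the diagonal window."""
    half = window // 2
    scores = []
    for i in range(len(seq1)):
        row = []
        for j in range(len(seq2)):
            total = 0
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                # Positions that fall off either sequence contribute nothing.
                if 0 <= a < len(seq1) and 0 <= b < len(seq2):
                    total += 1 if seq1[a] == seq2[b] else 0  # score table
            row.append(total)
        scores.append(row)
    return scores


def render(scores, window=3, shades=" .:#"):
    """Convert each score to a character; denser characters mean higher scores."""
    top = len(shades) - 1
    return "\n".join(
        "".join(shades[s * top // window] for s in row) for row in scores
    )


toy_scores = dot_matrix("ACIL", "ACIL", window=3)
toy_plot = render(toy_scores, window=3)
print(toy_plot)
```

Comparing a sequence with itself gives a dense main diagonal; regions of local similarity between two different proteins show up as shorter diagonal streaks.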
Two proteins that are similar in
certain regions
[Figure: comparison of tissue plasminogen activator (PLAT) with coagulation factor 12 (F12). Labels mark a region of similarity and note that a single region on F12 is similar to two regions on PLAT.]
FASTA format
>gi|1244762|gb|AAA98563.1| p53 tumor suppressor homolog
MSQGTSPNSQETFNLLWDSLEQVTANEYTQIHERGVGYEYHEAEPDQTSLEISAYRIAQPDPYGRSESYD
LLNPIINQIPAPMPIADTQNNPLVNHCPYEDMPVSSTPYSPHDHVQSPQPSVPSNIKYPGEYVFEMSFAQ
PSKETKSTTWTYSEKLDKLYVRMATTCPVRFKTARPPPSGCQIRAMPIYMKPEHVQEVVKRCPNHATAKE
HNEKHPAPLHIVRCEHKLAKYHEDKYSGRQSVLIPHEMPQAGSEWVVNLYQFMCLGSCVGGPNRRPIQLV
FTLEKDNQVLGRRAVEVRICACPGRDRKADEKASLVSKPPSPKKNGFPQRSLVLTNDITKITPKKRKIDD
ECFTLKVRGRENYEILCKLRDIMELAARIPEAERLLYKQERQAPIGRLTSLPSSSSNGSQDGSRSSTAFS
TSDSSQVNSSQNNTQMVNGQVPHEEETPVTKCEPTENTIAQWLTKLGLQAYIDNFQQKGLHNMFQLDEFT
LEDLQSMRIGTGHRNKIWKSLLDYRRLLSSGTESQALQHAASNASTLSVGSQNSYCPGFYEVTRYTYKHT
ISYL
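A FASTA record is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch, the parser below reads such text into (header, sequence) pairs; the function name is made up, and the example record is truncated to the first 20 residues of the sequence above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated example: only the first 20 residues, split over two lines.
example_text = """>gi|1244762|gb|AAA98563.1| p53 tumor suppressor homolog
MSQGTSPNSQ
ETFNLLWDSL
"""

records = parse_fasta(example_text)
print(records[0][0])
print(len(records[0][1]))
```

For real work an established parser such as Biopython's `SeqIO` is the safer choice; the sketch only shows what the format contains.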
Workshop 3