No Slide Title

Download Report

Transcript No Slide Title

Protein homology I: Evolution and
comparison of protein sequences
Biochem 565, Fall 2007
09/10/07
Cordes
Outline
1. What is homology?
2. Sequence conservation and covariation
3. Generalized substitution matrices
4. Pairwise alignment--global vs. local
5. Sequence identity and homology
6. Sequence similarity and homology-use of substitution matrices
7. Alignment scores and statistics
8. Limitations of pairwise alignment
9. Remote homologies--use of evolutionary profiles
Homology between different proteins--key terms
boxes represent
protein-coding genes
A1
gene duplication
A1
A2
speciation
orthologs
paralogs
A1
homologous
paralogous
orthologous
A2
A1
A2
descended from a common ancestor, e.g. “A1 and A2 are
homologous”. Also sometimes defined as “Similar due to
descent from a common ancestor.” Homology is either/or-there is no such thing as “percent homology”!
Homologous is not a synonym for “similar”! It is, however,
possible for only a part of two sequences to be
homologous, for instance one domain in multidomain prot.
related by gene duplication
related by speciation
Sequence conservation among homologous proteins
If one aligns the sequences of multiple proteins which are related by
common ancestry and thus form a family, one can then examine the extent
of amino-acid sequence evolution both in general and at particular positions
in the alignment.
bars here
reflect level
some positions universally conserved
of conservation
some are highly variable
coloring reflects conserved, chemically similar amino acid types, e.g.
charged residues such as Arg, Lys
Multiple alignment of six bacteriophage Cro proteins (made with ClustalX).
Overall rates of protein sequence evolution vary widely
Table. Substitution rates in genes encoding orthologous rodent and human proteins.
Units are substitution rates per site per billion years.
protein
nonsynonymous rate
synonymous rate
KA/KS
histone 3
actin a
insulin
myoglobin
b-globin
urokinase
0.00
0.01
0.13
0.56
0.80
1.28
6.38
3.68
4.02
4.44
3.05
3.92
0
0.002
0.03
0.126
0.262
0.362
Nonsynonymous changes are nucleotide substitutions which lead to amino acid
changes. Synonymous changes are those which do not. KA/KS is the ratio of
nonsynonymous to synonymous changes in the gene, and is a measure of the
functional selection on a protein. Proteins with highly constrained function tend to
evolve more slowly and have lower KA/KS values. This includes critical proteins
with multiple levels of function and regulation, such as histones.
adapted from Protein Evolution by L. Patthy, Blackwell Science, 1999 and from
Fundamentals of Molecular Evolution by Li & Graur, Sinauer, 1991
Orthologous vs. paralogous proteins
As a general rule, orthologous proteins tend to perform the same
function in different species, while paralogous proteins tend to have
diversified somewhat in function.
For example, the chymotrypsin serine proteases are orthologous to
each other, and they retain not only the same general function
(proteolysis using a catalytic triad including a serine), but also the same
substrate specificity (cleavage at positions following aromatic side
chains).
The chymotrypsins are paralogous to the trypsins and the elastases.
These proteins share the same general serine protease function but
have evolved different substrate specificities. These proteins also have
paralogs which have lost all protease activity.
Since the functional constraints are different, sequence conservation
patterns might vary depending upon whether one is comparing orthologs
or paralogs, or close paralogs vs. distant paralogs.
Classic phylogenetic studies of sequence
conservation: the globins
The globins are the best studied family in
terms of sequence conservation, partly
because they were one of the first families
for which multiple members were
sequenced, and partly because some of the
earliest protein structures (in fact, the
earliest) solved were globins. The classic
papers of Perutz, Kendrew and Watson
were the first to correlate sequence
conservation with aspects of protein
structure and function. They drew their
conclusion based on only a few aligned
sequences. Later globin studies, such as
that of Bashford, Chothia and Lesk,
expanded the analyses of globin sequence
conservation to include hundreds of
sequences.
Perutz, Kendrew & Watson J Mol Biol 13, 669 (1965)
Bashford, Chothia & Lesk J Mol Biol 196, 199 (1987)
Scapharca inaequivalvis
oxygenated hemoglobin
Conservation of functional residues
There were only 2 perfectly
conserved residues among the 8
known globin structures at the
time of the Bashford et al study.
These are residues critical in
binding of heme and/or interaction
w/heme-bound oxygen. It will
often be found that the best
conserved residues in related
Phe 43
proteins are those involved in
heme
critical aspects of the general
function.
His 87
Residues involved in more specific aspects of function may or may not be
conserved, depending upon the relationship between the proteins under
consideration. For example, residues involved in substrate specificity for
serine proteases may be conserved among orthologs, such as the
chymotrypsins, but not between paralogs, such as chymotrypsins and
trypsins.
Conservation at buried positions
• core residues, which are usually hydrophobic, often tolerate
conservative substitutions, i.e. to other hydrophobics
• overall core volume is well-conserved (Lim & Ptitsyn, 1970) though
individual core positions tolerate variation in volume
• this reflects what we know about packing and the effects of core
mutations on stability--thus sequence conservation is partly related to
maintaining a stable structure
portion of alignment of
prokaryotic and eukaryotic globins
Y140
yellow = small neutral/polar
green = hydrophobic
red/pink = polar/acidic
blue = basic
buried
H156
human hemoglobin
beta chain
Conservation at solvent-exposed positions
• solvent-exposed (surface) positions are mutable and usually tolerate
mutation to many residue types including hydrophobics. Bashford et al.,
however, noted that for globins at least, some surface positions do not
tolerate large hydrophobics. Since polar-to-hydrophobic mutations on protein
surfaces do not reduce stability, this conservation could reflect constraints
on solubility. Indeed, it is clear that the overall polar character of the
surface is conserved for soluble, globular proteins, even though a certain
number of hydrophobics may be tolerated.
Y140
yellow = small neutral/polar
green = hydrophobic
red/pink = polar/acidic
blue = basic
examples
of surface
residues
H156
human hemoglobin
beta chain
Conservation of loops and turns
• “Spacer” regions between secondary structures, such as loops and
turns, are often hypermutable and vary not only in sequence but in
length, tolerating insertion and deletion events (Insertions and deletions
are much less often found within secondary structure elements. Why?)
part of alignment of animal hemoglobin a and b chains
human a chain
Are the a and b chains related to each other by paralogy or orthology?
Covariation analysis
Substitution patterns at different positions in a sequence alignment are
not necessarily independent. This is sometimes referred to as
covariation or correlated evolution.
name
A
B
C
D
sequence
YADLGRIKS
YSDLGSEKE
IDDFGEIAA
IDDFGVIGT
For example, in the mini multiple
alignment shown at left, the identity of
the residue at the 4th position is
correlated to the identity of the
residue at the 1st position.
A statistical perturbation analysis can be used to characterize this
covariation. An alignment of related sequences is “perturbed” by
only considering sequences at which, for example, the first position is
Y. The effect of this perturbation on the residue distribution observed
at other positions is then measured. If the distribution changes
significantly, covariation between sequence changes at the first site
and other sites in the alignment is inferred.
Covariation and hydrophobic core packing
The hydrophobic core residues in related
proteins tend to be covariant due to
constraints on core packing. One sees
compensatory volume changes at different
positions.
Davidson and coworkers found that for 266
aligned SH3 domain sequences, the
strongest covariation was observed for a
cluster of central hydrophobic residues.
For example, substitution of a smaller residue
(Ala->Gly) at 39 was strongly correlated to
substitution of a larger residue (Ile->Phe) at
50.
Hydrophobic core of SH3
domains, with most frequently
covarying residues shown in
yellow
S.M. Larson, A.A. DiNardo and A.R. Davidson, J Mol Biol 303, 433 (2000)
Some recent studies (Suel
et al) have suggested a
connection between
covarying clusters of
residues and transduction
of signals between distant
sites in proteins.
For example, G-protein
coupled receptors bind a
ligand on one side of a
membrane, and then
transduce that signal to the
other side through
conformational change.
Suel et al showed that
the main clusters of
covarying residues tended
to connect the ligand and
G-protein binding sites.
ligand
covarying
networks
(brown)
membrane
G-protein binding sites
Suel et al. Nat Struct Biol 2003
Sequence logos: a common way of representing
sequence conservation
convert
alignment
to logo
Note how the logo simplifies
reading the conservation, but
doesn’t throw away information like
simply calculating a “consensus”
sequence would.
Logos represent sequence conservation in an easy to read format, with letter heights essentially
representing the frequency with which a residue type occurs at a position in an alignment, relative to the
frequency with which it would occur at random. The units of the y-axis are “bits” of information, which is
to say that if a residue did not occur more often than expected at random, it would not offer us any
information and the letter height would be zero. Note that the letter heights only become very high when
a residue really dominates in the alignment, like Ala at the fifth position here.
weblogo server: http://weblogo.berkeley.edu
sequence logos paper: Schneider and Stevens, Nucleic Acids Res 18, 6097 (1990).
Generalized substitution matrices
Conservation patterns observed in related proteins have been used
to construct generalized substitution matrices (e.g. BLOSUM, PAM,
Gonnet) which reflect the gross average likelihood of a mutation
occurring and being accepted in a protein.
(handouts in class will have more notes on these)
cysteine,
tryptophan
least mutable
polar more
mutable than
hydrophobic.
Polar more
likely to be
substituted
by polar,
hydrophobic by hydrophobic
the PAM 250 matrix
Inferring homology between proteins
The simplest way of identifying homology is by sequence comparison.
If two protein sequences are sufficiently similar (we’ll talk about what
similarity means in a moment), they can be statistically inferred to be
homologous. In addition, if a sequence obeys conservation patterns
observed in a known family of related sequences, it can be inferred to
be a member of that family.
For sequences of statistically borderline similarity, structural and
functional comparison, if such information is available, can be used as
a supplement to establish common ancestry. If similarity between two
sequences is really statistically weak, very strong structural and
functional similarity can still make a convincing argument for
homology.
Finally, gene context can play a role--for example, do two genes
occupy the same location within an operon in different organisms?
We will next focus on identification of homology through sequence
comparison. We will begin with simple pairwise comparison.
Pairwise alignment of sequences--global and local
GLOBAL ALIGNMENT
F R T Y I A E W Q R T E P G A D H
F Q T Y A A D Y - R T E P S S D H
*
* *
*
* * * *
* *
entire length of sequence aligned--about 60% identity
over 17 residues. Note that allowance for gaps improves
the % identity. The best alignment would be determined by using
some optimization algorithm in combination with a scoring
scheme, e.g. +1 for every identity and 0 for every mismatch or gap
(identity matrix).
- - - - - - - - - R T E P G A D H
LOCAL ALIGNMENT
- - - - - - - - - R T E P S S D H
* * * *
* *
only the best matching portion(s) of sequence is (are) included
in the alignment--75% percent identity over 8 residues. How does
a local alignment algorithm decide where to stop? By lengthening
the alignment only insofar as it increases the score. For example,
one could increase the score by +2 for every identical amino acid,
while assigning a penalty of -1 for every mismatch or gap. Such
penalties would prevent the alignment from extending to dissimilar
regions
Pairwise alignment of sequences--global vs. local
Local alignment is more versatile than global and is thus more widely used.
It can be used to align proteins that are not related throughout their lengths
but share a conserved domain, as well as proteins with very unevenly
distributed sequence similarity. Many many such cases exist. Thus, when
one has no prior knowledge of what to expect, local alignment routines are
preferable. This will especially be the case if one is using pairwise
alignment to search a database for sequences that are related to a query
sequence. Thus, alignment algorithms for database searching essentially
always use local alignment. It should be noted that the scoring scheme
used can be tailored to favor longer or shorter local alignments.
Global alignment is usually used to align sequences that are approximately
the same length and are already known to be related.
Once we’ve aligned all or part of a pair of sequences, how do we decide
whether they are homologous?
Percent sequence identity and homology
Common rule-of-thumb: 30% identity indicates homology. This is
too simplistic!
high level of identity between
unrelated proteins is common
at short alignment lengths
don’t worry about this
20-30% identity
called the “twilight zone”:
difficult to assess
relatedness from identity
from Brenner et al.
PNAS 95, 6073 (1998)
the 30% identity threshold for identification of
homology only works for long alignments, i.e.
>100-150 amino acids
Sequence identity and homology: false positives
sequence identity is 39% over
64 residues, yet the two proteins
are unrelated--this would be a false
positive using a 30% cutoff rule. Use
of a length-dependent cutoff would
help.
Note also that gaps are allowed in this
alignment--identity would be lower if gaps
were not allowed. However, gaps are
common among true homologs.
from Brenner et al. PNAS 95, 6073 (1998)
Sequence identity and homology: poor coverage
the two proteins have the
same fold,both bind heme
and oxygen in same place:
good independent
structural/functional evidence
for homology...
Yet alignments of their
sequences reveal only 24%
identity. There are also many
examples of related globins
and other proteins with much
lower identity than this.
1MBO and 1HBB
hemoglobin and myoglobin
Any reasonable sequence identity criterion, whether it is a flat percent
cutoff or a length-dependent cutoff, will give incomplete coverage--in
other words, it will fail to identify many distant but true relationships.
“Sequence similarity”
Sequence identity is one specific way of assessing sequence similarity, and
it’s not a very good one. If you just use sequence identity, you are throwing
away a lot of information. Specifically, as we have recently learned, not all
mutations are equally likely to occur and be accepted during the course of
evolution. Knowledge of what substitutions commonly occur among related
proteins can be put to use in aligning sequences and identifying
relationships.
Various methods have been developed which use such knowledge to
assess sequence similarity. The most widely used and familiar of these
methods work by using generalized amino acid substitution matrices (aka
scoring matrices) in tandem with effective local pairwise alignment
algorithms. This is coupled with a statistical assessment of the significance
of the alignment score obtained between two sequences using a given
matrix. As we’ll see, it’s possible (and often worth doing) to get much more
sophisticated than that, but that’s where it begins.
Scoring alignments using substitution matrices
would moving F over two
positions to the left improve
or worsen the score?
G
G
6
D A
E R
2 -1
Y
Y
7
M
- M
Q P
5 -11 -1
F
L
0
R
R
5
D W I
D W G
6 11 -4 =
gap extension penalty
gap opening penalty
substitution penalties are just
elements from BLOSUM 62 matrix
25
overall
score is
sum of
scores at
each position
In theory, the score is related to the odds that the alignment represents a
an actual homologous relationship between two proteins. Because scoring
matrices are in log-odds form, the overall alignment score is the sum of the
scores at each position rather than the product.
Common pairwise local alignment methods
Smith-Waterman dynamic programming algorithm:
See handout for description of dynamic programming.
Mathematically guaranteed to find highest scoring alignment for a
given set of input parameters. Tradeoff is that it is slow, although
computer speed is getting to the point where this is less of a problem.
The global version of Smith-Waterman is called Needleman-Wunsch.
If one were simply comparing any 2 sequences to see if they are
homologous, Smith-Waterman would be the method of choice.
BLAST (Basic Local Alignment Search Tool)
FASTA
These two are very similar--both achieve a speed advantage over
Smith-Waterman by initially looking for short “words” of 2 or 3 residues
that (nearly) exactly match. Alignments are then built from these initial
seed matches. The tradeoff for the speed advantage is that some
homologies may be missed. Because of their speed, BLAST and
FASTA are used in searches of large databases for homologues.
Variables in local alignment-based search algorithms
scoring matrix
the generalized log odds substitution matrix used to
score alignment--BLOSUM and PAM are the most
commonly used. BLOSUM 62 is default on BLAST and
BLOSUM 50 on most FASTA servers
gap penalties
gap opening penalty (for initiating a gap)
gap extension penalty (adding elements to existing gap)
“word size”
(“ktup” parameter in FASTA). BLAST and FASTA are so
fast partly because they start by looking for short “words”
that match exactly and build up a longer alignment from
these words. The size of the starting words can be
varied with this parameter (the shorter the word the more
it slows down the program)
filter
filters sequence to get rid of “low complexity” regions.
Such regions can lead to false positives due to their
compositional bias.
Statistical significance of alignment scores:
The extreme value distribution
Raw alignment scores by themselves are not particularly meaningful. In
order to assess the statistical significance of an alignment, i.e. the chances
that it represents a real relationship, one must understand what the
distribution of alignment scores would be for random pairs of sequences of
similar length and composition. Such scores obey what is called an
extreme value distribution, which is like a normal distribution but has a
positively skewed tail. The exact characteristics of the distribution will
depend upon the scoring matrix, the gap penalties employed, the
composition of the sequences, etc.
what is probability
P that a random
alignment will have
example of extreme value
a given score or
distribution
higher?
# of occurrences vs.
alignment score
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
Statistical significance of alignment scores:
Z-scores, P-values and E-values
A Z-score is the number of standard deviations between the alignment
score and the mean of a normal distribution. The FASTA algorithm
reports Z scores in its output.
A P-value is the probability that an alignment between two random
sequences will have a score equal to or greater than the observed
score, as calculated from the extreme value distribution. The E-value or
expect value represents the number of times that the observed score or
higher would be observed when searching a database of D sequences.
For cases where P < 0.1, E ~ D*P. Both FASTA and BLAST report Evalues for alignments.
Basically, to be confident that a match between two sequences
represents true homology, you generally want an E-value < 0.01. That
means there’s a 1 in 100 chance that you have a false positive.
It has been shown (Brenner et al. 1998) that FASTA and BLAST Evalues do a pretty good job of distinguishing true and false positives.
Sample BLAST output
alignment score
>RCRO_BPP22
MYKKDVIDHFGTQRAVAKALGISDAAVSQWKEVIPEKDAYRLEIVTAGALKYQENAYRQAA
E-value
GenBank identifier
gi|4539473|emb|CAB39982.1| (AJ237660) Cro protein [Bacterio...
gi|12515040|gb|AAG56161.1|AE005346_8 (AE005346) unknown pro...
gi|13361674|dbj|BAB35631.1| (AP002557) putative regulatory ...
gi|12514991|gb|AAG56121.1|AE005343_11 (AE005343) putative r...
gi|118633|sp|P06965|DICC_ECOLI REPRESSOR PROTEIN OF DIVISIO...
gi|6093941|sp|Q37907|RCRO_BPD3 REGULATORY PROTEIN CRO (ANTI...
gi|9635583|ref|NP_061566.1| Cro [Pseudomonas phage D3] >gi|...
gi|118631|sp|P06966|DICA_ECOLI REPRESSOR PROTEIN OF DIVISIO...
gi|13559845|ref|NP_112055.1| cII [Bacteriophage HK620] >gi|...
gi|7531033|sp|O84102|ACPS_CHLTR HOLO-[ACYL-CARRIER PROTEIN]...
100
47
47
42
40
33
32
32
29
29
4e-21
4e-05
4e-05
0.001
0.005
0.46
0.86
1.5
8.9
9.3
>gi|12515040|gb|AAG56161.1|AE005346_8 (AE005346) unknown protein encoded within
prophage CP-933O[Escherichia coli O157:H7 EDL933] Length = 84
Score = 47.0 bits (110), Expect = 4e-05
Identities = 25/53 (47%), Positives = 32/53 (60%)
Query: 1
Sbjct: 5
“positives” means positions
at which scoring matrix
element is positive
MYKKDVIDHFGTQRAVAKALGISDAAVSQWKEVIPEKDAYRLEIVTAGALKYQ 53
M K +V+ +FG
A ALG S
VS W E +P K A ++ VTAGALKY+
MKKSEVLGYFGGVVKTAAALGTSKTTVSMWGEDVPWKWALLIQAVTAGALKYE 57
percent positives is sometimes
also called “percent similarity”
BLAST and FASTA can identify some homologues in
the “twilight zone”--20 to 30% identity
Score = 43.5 bits (101), Expect = 0.001
Identities = 36/145 (24%), Positives = 56/145 (37%), Gaps = 2/145 (1%)
Query: 2
Sbjct: 4
Query: 62
Sbjct: 62
LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL 61
L+ E
V +W KV D G
+ L RL
+P T
F+ F L T
+ + +
LTPEEKSAVTALWGKVNVDEVGG--EALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61
KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG 121
K HG VL A
L
+ +
L++ H K + +
+
++ VL
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121
Query: 122 DFGADAQGAMNKALELFRKDIAAKY 146
+F
Q A K +
+A KY
Sbjct: 122 EFTPPVQAAYQKVVAGVANALAHKY 146
BLAST alignment of
hemoglobin and myoglobin
Even though sequence identity here is low, the E-value is statistically significant
Pairwise alignments will not detect all homologies
Although pairwise local alignment routines employing generalized
substitution matrices do a good job of avoiding false positives, their
coverage is imperfect (though it’s better than just using % identity).
That is, there will be many relationships that they will miss because
the sequences have diverged too far.
big challenge in bioinformatics
is finding these “remote homologs”
homology identified
independently in this
trial database by
structural/functional
similarity
EPQ means errors per query,
ideally like E < 0.01 (1 in 100
chance of false positive)
from Brenner et al. PNAS 95, 6073 (1998)
SSEARCH is a program
employing the Smith-Waterman
algorithm
Multiple alignment of sequences
Conservation patterns observed in families of homologous sequences carry
much more useful information than do single sequences, both from the point of
view of understanding structure and function for a family, as well as for
identifying whether a particular sequences is homologous to a particular family.
Obtaining this information depends upon the ability to generate alignments of
multiple related sequences:
We aren’t going to have time to talk about methods for multiple alignment.
Some of the better known methods/websites, such as ClustalX for global
multiple alignment, will be listed as links on the course website. I recommend
Chapter 4 of David Mount’s Bioinformatics for thorough coverage of the topic.
We’re going to focus instead on what one can do with multiply aligned
sequences.
Position-dependent scoring matrices or “profiles”
of sequence families can be generated
from multiple alignments
row in matrix is constructed by
weighting a generalized
substitution matrix by the
appearance of the different
amino acids in the alignment.
For example, this row might be
made from an equal weight of
the E, G, V and L columns in,
say, a PAM250 matrix.
Gribskov, McLachlan & Eisenberg, PNAS 84, 4355, 1987
The resulting matrix contains
position-dependent information
about sequence conservation
within a particular family of
sequences, as opposed to
a generalized scoring matrix,
which is constructed by
averaging general sequence
conservation tendencies among
many families of related
sequences
Examples of models generated from multiple alignments
profiles
these two are almost
the same thing
position-specific scoring matrices (PSSM)
hidden Markov models (HMM)
These models can be generated for lengthy sequences or for short
ungapped conserved regions (blocks or motifs)
PSI-BLAST (Position-Specific Iterated)
query
sequence
initial
BLAST
search
utility obviously
depends on getting
some seed hits
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
hits with
significant
similarity
(e.g. E < 0.005)
multiple
alignment
of hits
PSSM
iterated BLAST search
using the PSSM as query
the utility of PSI-BLAST in finding more remote homologues than
simple pairwise searches has been demonstrated. An example of
a similar program that uses a Hidden Markov model rather than a
PSSM is SAM-T99 (now SAM-T02)
Example of utility of PSI BLAST
initial BLAST with cutoff of E <0.01 brings up only
BRCT domains from other BRCA1s (orthologues)
two BRCT domains
from BRCA1 used
as query
few false
positives
were found
using E<0.01
cutoff
repeated rounds of PSI-BLAST bring up many others
and reveal first plant protein to contain BRCTs
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
Searching profile databases
query sequence
database of HMMs, PSSMs
A number of researchers have used similarity searches to cluster the
known proteins into homologous groups, and then generated profiles
for each cluster using HMMs or PSSMs. Servers now allow one to do
similarity searches of these database profiles using a single query
sequence. This is qualitatively the reverse of what is done in PSIBLAST, in which one generates a profile and uses it to match
individual database sequences.
Some of these profiles represent motifs or short ungapped “blocks”,
whereas others are the length of entire domains. Among the best
known collections of domain profiles are SMART and Pfam. These
two form part of what is now called the Conserved Domain Database
(CDD). BLAST searches with the NCBI server will now automatically
do a search against the CDD unless you opt not to.