
Protein homology I: Evolution and
comparison of protein sequences
Biochem 565, Fall 2008
09/17/08
Cordes
Outline
1. Homology and kinds of homology
2. Mutations and sequence conservation
3. Pairwise alignment--global vs. local
4. Sequence identity and homology
5. Sequence similarity and homology--use of substitution matrices
6. Alignment scores and statistics
7. Limitations of pairwise alignment
8. Remote homologies--use of evolutionary profiles
Evolutionary relationships between proteins
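[Figure: boxes represent protein-coding genes. A gene A1 undergoes gene duplication to give A1 and A2; speciation then leaves a copy of A1 and A2 in each species. Copies related by speciation are orthologs; copies related by gene duplication are paralogs.]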
key terms to describe evolutionary relationships among proteins
homologous
descended from a common ancestor, e.g. “A1 and A2 are
homologous”. Also sometimes defined as “Similar due to
descent from a common ancestor.” Homology is either/or: there is no such thing as “percent homology”!
Homologous is not a synonym for “similar”! It is, however,
possible for only a part of two sequences to be
homologous, for instance one domain in a multidomain protein.
paralogous
related by gene duplication
orthologous
related by speciation
Orthologous and paralogous proteins
As a general rule, orthologous proteins tend to perform the same
function in different species, while paralogous proteins tend to have
diversified somewhat in function--duplication is a very common way in
which evolution gives proteins the freedom to develop new functions.
For example, the chymotrypsin serine proteases are orthologous to
each other, and they retain not only the same general function
(proteolysis using a catalytic triad including a serine), but also the same
substrate specificity (cleavage at positions following aromatic side
chains).
The chymotrypsins are paralogous to the trypsins and the elastases.
These proteins share the same general serine protease function but
have evolved different substrate specificities. These proteins also have
paralogs which have lost all protease activity.
Homology at the domain level
• proteins often have a modular organization
• a single polypeptide chain may be divisible into smaller independent units of tertiary structure called domains
• different domains in a protein are also often associated with different functions carried out by the protein, though some functions occur at the interface between domains
• domains are a more fundamental unit of protein homology than a full protein--it is possible for two proteins to have one or more domains that are homologous combined with one or more that aren’t. In other words, domains can be “shuffled” in evolution.
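[Figure: domain organization of P53 tumor suppressor, with residue numbers 1, 60, 100, 300, 324, 355, 363 and 393 marked along the chain. Domains in order along the chain: activation domain, sequence-specific DNA binding domain, tetramerization domain, non-specific DNA-binding domain.]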
Simple mutations in protein-coding gene sequences
diagram shows DNA and translated protein sequence for two sequences related by mutations:

MetGluGlyTyrCysValAla...
ATGGAAGGGTACTGCGTGGCA...
ATGGAGGGGTACAGC---GCA...
MetGluGlyTyrSer---Ala...

nonsynonymous substitutions: change in codon and in translated amino acid (here TGC to AGC, Cys to Ser)
silent or synonymous substitutions: change in codon but not in translated amino acid (here GAA to GAG, both Glu)
deletions and insertions (indels): if they occur in multiples of three, they lead to deletion/insertion of amino acids (here the GTG codon for Val is deleted). Otherwise they produce frameshifts, which change the entire downstream sequence.
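The three kinds of change in the diagram can be told apart codon by codon. The following is a minimal sketch, assuming the standard genetic code and two in-frame DNA sequences that are already aligned and of equal length; the name classify_codon_changes is made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_changes(dna1, dna2):
    """Compare two aligned, in-frame DNA sequences codon by codon.

    Returns a list of (codon_number, codon1, codon2, kind) for every codon
    that differs, where kind is 'synonymous', 'nonsynonymous' or 'indel'.
    """
    if len(dna1) != len(dna2):
        raise ValueError("aligned sequences must have equal length")
    changes = []
    for i in range(0, len(dna1) - len(dna1) % 3, 3):
        c1, c2 = dna1[i:i + 3], dna2[i:i + 3]
        if c1 == c2:
            continue
        if "-" in c1 or "-" in c2:
            kind = "indel"
        elif CODON_TABLE[c1] == CODON_TABLE[c2]:
            kind = "synonymous"
        else:
            kind = "nonsynonymous"
        changes.append((i // 3 + 1, c1, c2, kind))
    return changes


changes = classify_codon_changes("ATGGAAGGGTACTGCGTGGCA",
                                 "ATGGAGGGGTACAGC---GCA")
for change in changes:
    print(change)
```

Applied to the two DNA sequences from the diagram, this reports a synonymous change at codon 2 (GAA to GAG), a nonsynonymous change at codon 5 (TGC to AGC) and an indel at codon 6.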
Acceptance and rejection of mutations
Depends upon many factors, among which are:
• Is the mutation a substitution, indel, frameshift?
• If it is a substitution, is the mutation nonsynonymous
or synonymous?
• If it is nonsynonymous, is it “conservative”? Does it
preserve the approximate physicochemical properties of
the amino acid mutated, or does it change them radically?
• What protein does it occur in? Some proteins are more
essential and more tightly constrained by natural
selection than others.
• Where does it occur in the protein? Is it important for
function or the stability of the structure, or both/neither?
Synonymous and nonsynonymous substitutions
Table. Substitution rates in genes encoding orthologous rodent and human proteins.
Rates are in substitutions per site per billion years.

protein      nonsynonymous rate   synonymous rate   KA/KS
histone 3    0.00                 6.38              0
actin α      0.01                 3.68              0.002
insulin      0.13                 4.02              0.03
myoglobin    0.56                 4.44              0.126
β-globin     0.80                 3.05              0.262
urokinase    1.28                 3.92              0.362

KA/KS is the ratio of nonsynonymous to synonymous changes in the gene, and is a
measure of the functional selection on a protein. In general, synonymous
changes are more likely to be accepted than nonsynonymous changes, but
how much more likely varies a lot: the sequences of proteins with highly
constrained function tend to evolve more slowly and have lower KA/KS
values. This includes critical proteins with multiple levels of function and
regulation, such as histones.
adapted from Protein Evolution by L. Patthy, Blackwell Science, 1999 and from
Fundamentals of Molecular Evolution by Li & Graur, Sinauer, 1991
Generalized substitution matrices
The likelihood of a nonsynonymous substitution occurring and being
accepted also depends upon whether the mutation is “conservative”, meaning
that it preserves similar properties, or “nonconservative”. Substitutions
observed in alignments of related proteins have been used to construct generalized
substitution matrices (e.g. BLOSUM, PAM, Gonnet) which reflect the average
likelihood of a mutation occurring and being accepted in a protein.
[Figure: the PAM 250 matrix (Margaret Dayhoff), annotated as follows]
• Cys, Trp: least mutable, most unique in properties
• Polar more mutable than hydrophobic.
• Polar more likely to be substituted by polar, hydrophobic by hydrophobic
Position-specific conservation and sequence variation
Multiple alignments of
members of families of related
proteins, color coded by
categories of amino acids, can
reveal conservation at specific
positions in the sequence.
Color coding in this alignment:
Orange: conserved small
Green: conserved aliphatic
Red: conserved basic
Blue: conserved aromatic
[Figure: multiple alignment with a conservation bar graph. Labels: position in sequence alignment; names of family members.]
Level of the bar indicates level of conservation, or lack of tolerance
to mutation. Some positions are variable, others are not.
Position-specific conservation and sequence variation
Multiple alignment
Alternative representation:
a sequence logo
Logos represent sequence conservation in an easy to read format, with letter heights essentially
representing the frequency with which a residue type occurs at a position in an alignment, relative to the
frequency with which it would occur at random. The units of the y-axis are “bits” of information, which is
to say that if a residue did not occur more often than expected at random, it would not offer us any
information and the letter height would be zero. Note that the letter heights only become very high when
a residue really dominates in the alignment, like Ala at the fifth position here.
weblogo server: http://weblogo.berkeley.edu
sequence logos paper: Schneider and Stephens, Nucleic Acids Res 18, 6097 (1990).
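The letter heights described above can be sketched numerically. This is a minimal illustration, assuming a uniform background over the 20 amino acids and ignoring the small-sample correction that real logo software applies; the function names are made up for this example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies (uniform background assumed).
    """
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each letter: its frequency times the column information."""
    n = len(column)
    info = column_information(column, alphabet_size)
    return {res: (c / n) * info for res, c in Counter(column).items()}


# A fully conserved column carries log2(20) bits; a column in which every
# amino acid appears equally often carries none.
print(column_information("AAAAAAAA"))
print(column_information("ACDEFGHIKLMNPQRSTVWY"))
print(letter_heights("AAAAAAAC"))
```

Under these assumptions a letter only becomes tall when one residue dominates the column, which is the behavior the slide points out for Ala at the fifth position.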
Classic studies of sequence conservation: the globins
The globins are the best studied family in
terms of sequence conservation, partly
because they were one of the first families
for which multiple members were
sequenced, and partly because some of the
earliest protein structures (in fact, the
earliest) solved were globins. The classic
papers of Perutz, Kendrew and Watson
were the first to correlate sequence
conservation with aspects of protein
structure and function. They drew their
conclusion based on only a few aligned
sequences. Later globin studies, such as
those of Bashford, Chothia and Lesk,
expanded the analyses of globin sequence
conservation to include hundreds of
sequences.
Perutz, Kendrew & Watson J Mol Biol 13, 669 (1965)
Bashford, Chothia & Lesk J Mol Biol 196, 199 (1987)
Scapharca inaequivalvis
oxygenated hemoglobin
Conservation of functional residues
There were only 2 perfectly
conserved residues among the 8
known globin structures at the
time Bashford et al. did their study.
These are residues critical in
binding of heme and/or interaction
with heme-bound oxygen. It will
often be found that the best
conserved (least tolerant of
mutation) residues in related
proteins are those involved in
critical aspects of the general
function.
[Figure labels: Phe 43, heme, His 87]
Residues involved in more specific aspects of function may or may not be
conserved, depending upon the relationship between the proteins under
consideration. For example, residues involved in substrate specificity for
serine proteases may be conserved among orthologs, such as the
chymotrypsins, but not between paralogs, such as chymotrypsins and
trypsins.
Conservation at buried (interior) positions
• Core or buried residues, which are usually hydrophobic, often
tolerate conservative substitutions, i.e. to other hydrophobics
• overall core volume is well-conserved (Lim & Ptitsyn, 1970) though
individual core positions tolerate variation in volume
• this reflects what we know about the packing in protein interiors and
the effects of interior mutations on stability--thus, sequence
conservation is partly related to maintaining a stable structure!
[Figure: portion of alignment of prokaryotic and eukaryotic globins, colored yellow = small, green = hydrophobic, pink/red = neutral polar/acidic, blue = basic, shown beside a helix of the human hemoglobin beta chain (labels: Tyr 140, His 156). Buried residues on one face of this helix are in the interior.]
Conservation at solvent-exposed positions
• Solvent-exposed (surface) positions are mutable and usually tolerate
mutation to many residue types including hydrophobics. Bashford et al.,
however, noted that for globins at least, some surface positions do not
tolerate large hydrophobics. Since polar-to-hydrophobic mutations on protein
surfaces do not reduce stability, this conservation could reflect constraints
on solubility. Indeed, it is clear that the overall polar character of the
surface is conserved for soluble, globular proteins, even though a certain
number of hydrophobics may be tolerated.
[Figure: the same globin alignment, colored yellow = small, green = hydrophobic, pink/red = neutral polar/acidic, blue = basic, shown beside a helix of the human hemoglobin beta chain (labels: Tyr 140, His 156). Examples of surface residues: residues on the other face of this helix are exposed to solvent.]
Conservation of loops and turns
• Loops and turns that connect regular secondary structures are often
hypermutable and vary not only in sequence but in length, tolerating insertion
and deletion events (which are not well-tolerated within regular secondary
structure elements).
[Figure: part of alignment of animal hemoglobin α and β chains, shown beside the human hemoglobin α chain]
Covariation analysis
Substitution patterns at different positions in a sequence alignment are
not necessarily independent. This is sometimes referred to as
covariation or correlated evolution.
name  sequence
A     YADLGRIKS
B     YSDLGSEKE
C     IDDFGEIAA
D     IDDFGVIGT

For example, in the mini multiple alignment shown above, the identity of
the residue at the 4th position is correlated to the identity of the
residue at the 1st position.
A statistical perturbation analysis can be used to characterize this
covariation. An alignment of related sequences is “perturbed” by
considering only sequences in which, for example, the first position is
Y. The effect of this perturbation on the residue distribution observed
at other positions is then measured. If the distribution changes
significantly, covariation between sequence changes at the first site
and other sites in the alignment is inferred.
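The perturbation idea can be tried on the four-sequence mini alignment. The sketch below is only an illustration: the slide does not say which statistic is used to measure the change in distribution, so the total variation distance used here is an assumption, and the function names are made up.

```python
from collections import Counter

# The mini multiple alignment from the slide (sequences A to D).
MINI_ALIGNMENT = ["YADLGRIKS", "YSDLGSEKE", "IDDFGEIAA", "IDDFGVIGT"]


def residue_distribution(seqs, pos):
    """Fraction of sequences carrying each residue at a 0-based position."""
    counts = Counter(s[pos] for s in seqs)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}


def perturbation_shift(seqs, perturbed_pos, residue, observed_pos):
    """Perturb the alignment by keeping only sequences with `residue` at
    `perturbed_pos`, then measure how far the residue distribution at
    `observed_pos` moves (total variation distance, 0 = unchanged)."""
    subset = [s for s in seqs if s[perturbed_pos] == residue]
    if not subset:
        raise ValueError("no sequence has that residue at the perturbed position")
    before = residue_distribution(seqs, observed_pos)
    after = residue_distribution(subset, observed_pos)
    residues = set(before) | set(after)
    return 0.5 * sum(abs(before.get(r, 0.0) - after.get(r, 0.0)) for r in residues)


# Fix Y at the 1st position and look at the 4th and 3rd positions.
print(perturbation_shift(MINI_ALIGNMENT, 0, "Y", 3))
print(perturbation_shift(MINI_ALIGNMENT, 0, "Y", 2))
```

Fixing Y at the 1st position shifts the distribution at the 4th position (from half L, half F to all L), while the invariant 3rd position does not move at all. A real analysis would need many more sequences and a significance test.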
Covariation and hydrophobic core packing
The hydrophobic core residues in related
proteins tend to be covariant due to
constraints on core packing. One sees
compensatory volume changes at different
positions.
Davidson and coworkers found that for 266
aligned SH3 domain sequences, the
strongest covariation was observed for a
cluster of central hydrophobic residues.
For example, substitution of a smaller residue
(Ala->Gly) at 39 was strongly correlated to
substitution of a larger residue (Ile->Phe) at
50.
Hydrophobic core of SH3
domains, with most frequently
covarying residues shown in
yellow
S.M. Larson, A.A. DiNardo and A.R. Davidson, J Mol Biol 303, 433 (2000)
Some recent studies (Suel
et al) have suggested a
connection between
covarying clusters of
residues and transduction
of signals between distant
sites in proteins.
For example, G-protein
coupled receptors bind a
ligand on one side of a
membrane, and then
transduce that signal to the
other side through
conformational change.
Suel et al showed that
the main clusters of
covarying residues tended
to connect the ligand and
G-protein binding sites.
[Figure: receptor in the membrane, with the ligand, the G-protein binding sites, and the covarying networks (brown) labeled]
Suel et al. Nat Struct Biol 2003
Inferring homology between proteins
The simplest way of identifying homology is by sequence comparison.
If two protein sequences are sufficiently similar (we’ll talk about what
similarity means in a moment), they can be statistically inferred to be
homologous. In addition, if a sequence obeys conservation patterns
observed in a known family of related sequences, it can be inferred to
be a member of that family.
For sequences of statistically borderline similarity, structural and
functional comparison, if such information is available, can be used as
a supplement to establish common ancestry. If similarity between two
sequences is really statistically weak, very strong structural and
functional similarity can still make a convincing argument for
homology.
Finally, gene context can play a role--for example, do two genes
occupy the same location within an operon in different organisms?
We will next focus on identification of homology through sequence
comparison. We will begin with simple pairwise comparison.
Pairwise alignment of sequences--global and local
GLOBAL ALIGNMENT
F R T Y I A E W Q R T E P G A D H
F Q T Y A A D Y - R T E P S S D H
*   * *   *       * * * *     * *
entire length of sequence aligned--about 60% identity
over 17 residues. Note that allowance for gaps improves
the % identity. The best alignment would be determined by using
some optimization algorithm in combination with a scoring
scheme, e.g. +1 for every identity and 0 for every mismatch or gap
(identity matrix).
LOCAL ALIGNMENT
- - - - - - - - - R T E P G A D H
- - - - - - - - - R T E P S S D H
                  * * * *     * *
only the best matching portion(s) of sequence is (are) included
in the alignment--75% identity over 8 residues. How does
a local alignment algorithm decide where to stop? By lengthening
the alignment only insofar as it increases the score. For example,
one could increase the score by +2 for every identical amino acid,
while assigning a penalty of -1 for every mismatch or gap. Such
penalties would prevent the alignment from extending to dissimilar
regions.
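The stopping rule for local alignment can be made concrete with a small dynamic programming sketch. This is a simplified illustration using the +2 / -1 scheme mentioned above with a flat per-residue gap penalty; it is not the scoring used by any particular program, and the function name is made up.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of strings a and b (simple scoring scheme).

    Returns (score, aligned_a, aligned_b). Cell scores are never allowed
    to drop below zero, which is what lets the alignment start and stop
    wherever extending it would lower the score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("RTEPGADH", "RTEPSSDH"))
```

With these parameters the two 8-residue segments from the slide align without gaps for a score of 10 (six identities at +2, two mismatches at -1). Which region a local alignment covers depends on the scoring scheme, so other parameter choices can give longer or shorter alignments.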
Pairwise alignment of sequences--global vs. local
Local alignment is more versatile than global and is thus more widely used.
It can be used to align proteins that are not related throughout their lengths
but share a conserved domain, as well as proteins with very unevenly
distributed sequence similarity. Many such cases exist. Thus, when
one has no prior knowledge of what to expect, local alignment routines are
preferable. This will especially be the case if one is using pairwise
alignment to search a database for sequences that are related to a query
sequence. Thus, alignment algorithms for database searching essentially
always use local alignment. It should be noted that the scoring scheme
used can be tailored to favor longer or shorter local alignments.
Global alignment is usually used to align sequences that are approximately
the same length and are already known to be related.
Once we’ve aligned all or part of a pair of sequences, how do we decide
whether they are homologous?
Percent sequence identity and homology
Common rule-of-thumb: 30% identical residues between two aligned
protein sequences indicates homology. This is too simplistic and only
works if the 30% is measured over a long stretch of amino acids!
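As a small illustration of why alignment length matters, percent identity and a length-aware version of the rule of thumb can be sketched as follows. The function names are made up, and the default minimum length of 100 is an assumption taken from the low end of the ">100-150 amino acids" guideline on this slide; it is not a published cutoff.

```python
def percent_identity(aligned1, aligned2):
    """Percent of alignment columns holding identical residues.

    Both strings must be the same length, with '-' marking gaps; gap
    columns count toward the alignment length but never as identities.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for x, y in zip(aligned1, aligned2) if x == y and x != "-")
    return 100.0 * identical / len(aligned1)


def passes_rule_of_thumb(aligned1, aligned2, min_identity=30.0, min_length=100):
    """Apply the 30% rule only when the alignment is long enough."""
    return (len(aligned1) >= min_length
            and percent_identity(aligned1, aligned2) >= min_identity)


# The 17-column global alignment from the earlier slide.
top = "FRTYIAEWQRTEPGADH"
bottom = "FQTYAADY-RTEPSSDH"
print(round(percent_identity(top, bottom), 1))
print(passes_rule_of_thumb(top, bottom))
```

The example alignment is about 59% identical, yet it is far too short for the rule of thumb to say anything about homology.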
high level of identity between
unrelated proteins is common
at short alignment lengths
do not worry about this line
20-30% identity
called the “twilight zone”:
difficult to assess
relatedness from identity
from Brenner et al.
PNAS 95, 6073 (1998)
the 30% identity threshold for identification of
homology only works for long alignments, i.e.
>100-150 amino acids
Sequence identity and homology: false positives
sequence identity is 39% over
64 residues, yet the two proteins
are unrelated--this would be a false
positive using a 30% cutoff rule. Use
of a length-dependent cutoff would
help.
Note also that gaps are allowed in this
alignment--identity would be lower if gaps
were not allowed. However, gaps are
common among true homologs.
from Brenner et al. PNAS 95, 6073 (1998)
Sequence identity and homology: poor coverage
the two proteins have the
same fold, and both bind heme
and oxygen in the same place:
good independent
structural/functional evidence
for homology...
Yet alignments of their
sequences reveal only 24%
identity. There are also many
examples of related globins
and other proteins with much
lower identity than this.
[Figure: structures of hemoglobin and myoglobin (PDB entries 1HBB and 1MBO)]
Any reasonable sequence identity criterion, whether it is a flat percent
cutoff or a length-dependent cutoff, will give incomplete coverage--in
other words, it will fail to identify many distant but true relationships.
“Sequence similarity” and homology
Sequence identity is one specific way of assessing sequence similarity, and
it’s not a very good one. If you just use sequence identity, you are throwing
away a lot of information. As we have just learned, not all mutations are
equally likely to occur and be accepted during the course of evolution.
Knowledge of what substitutions commonly occur among related proteins
can be put to use both in aligning sequences and in using sequence
similarity to identify homology/common ancestry.
Various methods have been developed which use such knowledge to
assess sequence similarity. The most widely used and familiar of these
methods work by using generalized amino acid substitution matrices (aka
scoring matrices) in tandem with effective computational alignment
algorithms that find the best (highest scoring) alignment. This is coupled
with a statistical assessment of the significance of the alignment score
obtained between two sequences using a given matrix.
Percent similarity in sequence alignments
          G   D   A   Y   M   -   -   V   R   D   W   I
          G   +       Y   M           +   R   D   W
          G   E   R   Y   M   Q   P   L   R   D   W   G
score:    6   2  -1   7   5           2   5   6  11  -4

• middle line, letters: identical amino acids
• middle line, +: similar amino acids (positive matrix element)
• score line: substitution matrix element assessing probability of mutations exchanging the two aligned amino acids
These two sequences have 50% identity, but 67% similarity
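The two percentages can be recomputed from the example. This is a minimal sketch: the scores in SLIDE_SCORES are the matrix elements read off this slide, so the lookup table only covers the residue pairs in this one alignment, and the function names are made up for illustration.

```python
# Matrix elements as shown on the slide for each aligned pair.
SLIDE_SCORES = {
    ("G", "G"): 6, ("D", "E"): 2, ("A", "R"): -1, ("Y", "Y"): 7,
    ("M", "M"): 5, ("V", "L"): 2, ("R", "R"): 5, ("D", "D"): 6,
    ("W", "W"): 11, ("I", "G"): -4,
}


def pair_score(scores, x, y):
    """Look up a symmetric substitution score for residues x and y."""
    return scores[(x, y)] if (x, y) in scores else scores[(y, x)]


def identity_and_similarity(top, bottom, scores):
    """Percent identity and percent similarity over all alignment columns.

    A column counts as similar when its matrix element is positive, so
    identical residues also count as similar. Gap columns count toward
    the length only.
    """
    identical = similar = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            continue
        if x == y:
            identical += 1
        if pair_score(scores, x, y) > 0:
            similar += 1
    n = len(top)
    return 100.0 * identical / n, 100.0 * similar / n


identity, similarity = identity_and_similarity("GDAYM--VRDWI", "GERYMQPLRDWG", SLIDE_SCORES)
print(identity, round(similarity))
```

Six of the twelve columns are identical and two more (D/E and V/L) have positive scores, which gives the 50% identity and 67% similarity quoted on the slide.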
Scoring alignments using substitution matrices
          G   D   A   Y   M   -   -   F   R   D   W   I
          G   E   R   Y   M   Q   P   L   R   D   W   G
score:    6   2  -1   7   5 -11  -1   0   5   6  11  -4   = 25

• -11 is the gap opening penalty and -1 is the gap extension penalty
• substitution penalties are just elements from a substitution matrix
• overall score is sum of scores at each position
A more sophisticated way to assess similarity is to actually “score” the
alignment using the substitution matrix. One must also apply penalties for
introducing and lengthening gaps in the alignment.
In theory, the raw alignment score is related to the odds or probability that
the alignment represents an actual homologous relationship between two
proteins. Because scoring matrices are in logarithmic odds form, the
overall alignment score is the sum of the scores at each position rather than
the product.
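Because the matrix elements are log odds, scoring an alignment is just a sum. The sketch below adds up the example from this slide, using the matrix elements and the gap penalties (-11 to open, -1 to extend) as shown there; the lookup table covers only the pairs in this alignment, and the function name is made up for illustration.

```python
# Matrix elements as shown on the slide for each aligned pair.
SLIDE_SCORES = {
    ("G", "G"): 6, ("D", "E"): 2, ("A", "R"): -1, ("Y", "Y"): 7,
    ("M", "M"): 5, ("F", "L"): 0, ("R", "R"): 5, ("D", "D"): 6,
    ("W", "W"): 11, ("I", "G"): -4,
}


def score_alignment(top, bottom, scores, gap_open=-11, gap_extend=-1):
    """Sum substitution scores and gap penalties over an alignment.

    The first column of a gap costs gap_open and each further column
    costs gap_extend. For simplicity this sketch does not distinguish a
    gap in the top sequence from an adjacent gap in the bottom one.
    """
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
            total += scores[(x, y)] if (x, y) in scores else scores[(y, x)]
    return total


print(score_alignment("GDAYM--FRDWI", "GERYMQPLRDWG", SLIDE_SCORES))
```

The sum is 6 + 2 - 1 + 7 + 5 - 11 - 1 + 0 + 5 + 6 + 11 - 4 = 25, matching the overall score on the slide.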
Common pairwise alignment methods
Smith-Waterman dynamic programming algorithm:
Mathematically guaranteed to find highest scoring alignment for a
given set of input parameters. Tradeoff is that it is slow, although
computer speed is getting to the point where this is less of a problem.
The global version of Smith-Waterman is called Needleman-Wunsch.
If one were simply comparing any 2 sequences to see if they are
homologous, Smith-Waterman would be the method of choice.
BLAST (Basic Local Alignment Search Tool)
FASTA
These two are very similar--both achieve a speed advantage over
Smith-Waterman by initially looking for short “words” of 2 or 3 residues
that (nearly) exactly match. Alignments are then built from these initial
seed matches. The tradeoff for the speed advantage is that some
homologies may be missed. Because of their speed, BLAST and
FASTA are used in searches of large databases for homologues. This
is a very common application--I have a protein, and I want to ask, is it
related to anything about which anything is known?
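The seeding step can be sketched in a few lines. This is a deliberately simplified illustration that looks only for exactly matching words; as noted above, the real programs also accept near matches and then extend and score the seeds, none of which is shown here. The function names are made up.

```python
from collections import defaultdict


def index_words(seq, k=3):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_matches(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-residue word matches exactly."""
    index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


# Words RTE and TEP are shared, so two seeds are found on the same diagonal.
print(seed_matches("RTEPGADH", "RTEPSSDH", k=3))
```

Indexing the query once and scanning each database sequence against that index is what makes the word search fast; the price is that a true relationship with no matching word is never examined.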
Variables in local alignment-based search algorithms
scoring matrix
the generalized log odds substitution matrix used to
score alignment--BLOSUM and PAM are the most
commonly used. BLOSUM 62 is default on BLAST and
BLOSUM 50 on most FASTA servers
gap penalties
gap opening penalty (for initiating a gap)
gap extension penalty (adding elements to existing gap)
“word size”
(“ktup” parameter in FASTA). BLAST and FASTA are so
fast partly because they start by looking for short “words”
that match exactly and build up a longer alignment from
these words. The size of the starting words can be
varied with this parameter (the shorter the word the more
it slows down the program)
filter
filters sequence to get rid of “low complexity” regions.
Such regions can lead to false positives due to their
compositional bias.
Statistical significance of alignment scores:
The extreme value distribution
Raw alignment scores by themselves are not particularly meaningful. In
order to assess the statistical significance of an alignment, i.e. the chances
that it represents a real relationship, one must understand what the
distribution of alignment scores would be for random pairs of sequences of
similar length and composition. Such scores obey what is called an
extreme value distribution, which is like a normal distribution but has a
positively skewed tail. The exact characteristics of the distribution will
depend upon the scoring matrix, the gap penalties employed, the
composition of the sequences, etc.
[Figure: example of extreme value distribution, plotted as # of occurrences vs. alignment score. Question posed on the figure: what is the probability P that a random alignment will have a given score or higher?]
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
Statistical significance of alignment scores:
Z-scores, P-values and E-values
A Z-score is the number of standard deviations between the alignment
score and the mean of a normal distribution. The FASTA algorithm
reports Z scores in its output.
A P-value is the probability that an alignment between two random
sequences will have a score equal to or greater than the observed
score, as calculated from the extreme value distribution. The E-value or
expect value represents the number of times that the observed score or
higher would be observed when searching a database of D sequences.
For cases where P < 0.1, E ~ D*P. Both FASTA and BLAST report E-values for alignments.
Basically, to be confident that a match between two sequences
represents true homology, you generally want an E-value < 0.01. That
means there’s roughly a 1 in 100 chance that you have a false positive.
It has been shown (Brenner et al. 1998) that FASTA and BLAST E-values do a pretty good job of distinguishing true and false positives.
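The relationship between score, P-value and E-value can be sketched numerically. The code assumes the usual Gumbel form of the extreme value distribution, with a location parameter mu and a scale parameter lam; the parameter values in the example are invented for illustration, since in practice they depend on the scoring matrix, gap penalties and sequence composition. The E-value uses the E ~ D*P approximation given above.

```python
import math


def extreme_value_p(score, mu, lam):
    """P(random score >= score) for a Gumbel-type extreme value distribution.

    P = 1 - exp(-exp(-lam * (score - mu))); expm1 keeps precision when P is tiny.
    """
    return -math.expm1(-math.exp(-lam * (score - mu)))


def e_value(p, database_size):
    """Expected number of chance hits, using E ~ D * P (valid for small P)."""
    return database_size * p


# Hypothetical parameters, for illustration only.
MU, LAM, D = 30.0, 0.3, 100_000
for score in (40, 60, 80, 100):
    p = extreme_value_p(score, MU, LAM)
    print(score, p, e_value(p, D))
```

The point of the exercise is the shape: P falls off roughly exponentially with score in the tail, and the same P-value becomes less convincing as the database grows.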
Sample BLAST output
>RCRO_BPP22
MYKKDVIDHFGTQRAVAKALGISDAAVSQWKEVIPEKDAYRLEIVTAGALKYQENAYRQAA

Hit list (score = alignment score):
GenBank identifier                                              score  E-value
gi|4539473|emb|CAB39982.1| (AJ237660) Cro protein [Bacterio...    100  4e-21
gi|12515040|gb|AAG56161.1|AE005346_8 (AE005346) unknown pro...     47  4e-05
gi|13361674|dbj|BAB35631.1| (AP002557) putative regulatory ...     47  4e-05
gi|12514991|gb|AAG56121.1|AE005343_11 (AE005343) putative r...     42  0.001
gi|118633|sp|P06965|DICC_ECOLI REPRESSOR PROTEIN OF DIVISIO...     40  0.005
gi|6093941|sp|Q37907|RCRO_BPD3 REGULATORY PROTEIN CRO (ANTI...     33  0.46
gi|9635583|ref|NP_061566.1| Cro [Pseudomonas phage D3] >gi|...     32  0.86
gi|118631|sp|P06966|DICA_ECOLI REPRESSOR PROTEIN OF DIVISIO...     32  1.5
gi|13559845|ref|NP_112055.1| cII [Bacteriophage HK620] >gi|...     29  8.9
gi|7531033|sp|O84102|ACPS_CHLTR HOLO-[ACYL-CARRIER PROTEIN]...     29  9.3

>gi|12515040|gb|AAG56161.1|AE005346_8 (AE005346) unknown protein encoded within
prophage CP-933O[Escherichia coli O157:H7 EDL933] Length = 84
Score = 47.0 bits (110), Expect = 4e-05
Identities = 25/53 (47%), Positives = 32/53 (60%)

Query: 1  MYKKDVIDHFGTQRAVAKALGISDAAVSQWKEVIPEKDAYRLEIVTAGALKYQ 53
          M K +V+ +FG     A ALG S   VS W E +P K A  ++ VTAGALKY+
Sbjct: 5  MKKSEVLGYFGGVVKTAAALGTSKTTVSMWGEDVPWKWALLIQAVTAGALKYE 57

“positives” means positions at which scoring matrix element is positive
percent positives is sometimes also called “percent similarity”
BLAST and FASTA can identify some homologues in
the “twilight zone”--20 to 30% identity
BLAST alignment of hemoglobin and myoglobin:

Score = 43.5 bits (101), Expect = 0.001
Identities = 36/145 (24%), Positives = 56/145 (37%), Gaps = 2/145 (1%)

Query: 2   LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL 61
           L+  E   V  +W KV  D  G   + L RL   +P T   F+ F  L T   +  +  +
Sbjct: 4   LTPEEKSAVTALWGKVNVDEVGG--EALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61

Query: 62  KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG 121
           K HG  VL A    L    + +     L++ H  K  +  +    +   ++ VL
Sbjct: 62  KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121

Query: 122 DFGADAQGAMNKALELFRKDIAAKY 146
           +F    Q A  K +      +A KY
Sbjct: 122 EFTPPVQAAYQKVVAGVANALAHKY 146

Even though sequence identity here is low, the E-value is statistically significant
Comparing pairs of sequences will not detect all homologies
Matrix-scored pairwise alignments with robust statistics like E-values do a
good job of avoiding false positives--however their coverage is imperfect
(though it’s better than just using % identity). That is, there will be many
relationships that they will miss because the sequences have drifted too far
apart!
[Figure: bar chart of relationships in a trial database, annotated as follows]
• white bars: pairs of “remote homologs” missed by pairwise alignment; homology identified independently in this trial database by known structural/functional similarity
• black bars: relationships successfully identified by sequence comparison. Most are pairs with more than 20% identical sequences.
• EPQ means errors per query, ideally like E < 0.01 (1 in 100 chance of false positive)
• SSEARCH: Smith-Waterman algorithm
from Brenner et al. PNAS 95, 6073 (1998)
Multiple alignment of sequences
Conservation patterns observed in families of homologous sequences carry
much more useful information than do single sequences, both from the point of
view of understanding structure and function for a family, as well as for
identifying whether a particular sequence is homologous to a particular family.
Obtaining this information depends upon the ability to generate alignments of
multiple related sequences.
We aren’t going to have time to talk about methods for multiple alignment.
Some of the better known methods/websites, such as ClustalX for global
multiple alignment, will be listed as links on the course website. I recommend
Chapter 4 of David Mount’s Bioinformatics for thorough coverage of the topic.
We’re going to focus instead on what one can do with multiply aligned
sequences.
Position-dependent scoring matrices or “profiles”
of sequence families can be generated
from multiple alignments
row in matrix is constructed by
weighting a generalized
substitution matrix by the
appearance of the different
amino acids in the alignment.
For example, this row might be
made from an equal weight of
the E, G, V and L columns in,
say, a PAM250 matrix.
Gribskov, McLachlan & Eisenberg, PNAS 84, 4355, 1987
The resulting matrix contains
position-dependent information
about sequence conservation
within a particular family of
sequences, as opposed to
a generalized scoring matrix,
which is constructed by
averaging general sequence
conservation tendencies among
many families of related
sequences
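The construction of one profile row can be sketched as follows. This is a toy illustration of the weighting idea only: the small matrix below covers just four residues and its values are illustrative rather than taken from PAM250, the weighting is a plain frequency average, and the function names are made up.

```python
from collections import Counter

# Toy symmetric substitution scores for four residues (illustrative values).
TOY_MATRIX = {
    ("E", "E"): 5, ("E", "G"): -2, ("E", "V"): -2, ("E", "L"): -3,
    ("G", "G"): 6, ("G", "V"): -3, ("G", "L"): -4,
    ("V", "V"): 4, ("V", "L"): 1,
    ("L", "L"): 4,
}


def pair_score(matrix, x, y):
    """Look up a symmetric substitution score for residues x and y."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def profile_row(column, matrix, alphabet):
    """Score each residue in `alphabet` against one alignment column.

    Each score is the average of the general substitution scores against
    the residues observed in the column, weighted by how often each one
    appears there.
    """
    n = len(column)
    counts = Counter(column)
    return {
        res: sum((c / n) * pair_score(matrix, res, obs) for obs, c in counts.items())
        for res in alphabet
    }


# A column in which E, G, V and L each appear once: equal weight for each.
print(profile_row("EGVL", TOY_MATRIX, "EGVL"))
# A column that is always E: the row is just the E column of the matrix.
print(profile_row("EEEE", TOY_MATRIX, "EGVL"))
```

Repeating this for every column of the multiple alignment gives one row per position, which is what makes the resulting matrix position-dependent.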
Examples of models generated from multiple alignments
• profiles
• position-specific scoring matrices (PSSM)
• hidden Markov models (HMM)
Profiles and PSSMs are almost the same thing.
These models can be generated for lengthy sequences or for short
ungapped conserved regions (blocks or motifs)
PSI-BLAST (Position-Specific Iterated)
1. query sequence
2. initial BLAST search (utility obviously depends on getting some seed hits)
3. hits with significant similarity (e.g. E < 0.005)
4. multiple alignment of hits
5. PSSM
6. iterated BLAST search using the PSSM as query, giving a new set of hits (back to step 3)
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
The utility of PSI-BLAST in finding more remote homologues than
simple pairwise searches has been demonstrated. An example of
a similar program that uses a hidden Markov model rather than a
PSSM is SAM-T99 (now SAM-T02).
Example of utility of PSI BLAST
two BRCT domains from BRCA1 used as query
• initial BLAST with cutoff of E < 0.01 brings up only BRCT domains from other BRCA1s (orthologues)
• repeated rounds of PSI-BLAST bring up many others and reveal first plant protein to contain BRCTs
• few false positives were found using E < 0.01 cutoff
Altschul et al. Nucleic Acids Research 25, 3389 (1997)
Searching profile databases
[Figure: a query sequence is searched against a database of HMMs, PSSMs]
A number of researchers have used similarity searches to cluster the
known proteins into homologous groups, and then generated profiles
for each cluster using HMMs or PSSMs. Servers now allow one to do
similarity searches of these database profiles using a single query
sequence. This is qualitatively the reverse of what is done in PSI-BLAST, in which one generates a profile and uses it to match
individual database sequences.
Some of these profiles represent motifs or short ungapped “blocks”,
whereas others are the length of entire domains. Among the best
known collections of domain profiles are SMART and Pfam. These
two form part of what is now called the Conserved Domain Database
(CDD). BLAST searches with the NCBI server will now automatically
do a search against the CDD unless you opt not to.