Transcript Power Point
Proteins dictate function in an organism:
What happens as proteins evolve?
Budding yeast
Saccharomyces cerevisiae
(sugar fungus)
Fission yeast
Schizosaccharomyces pombe
In our project, we'll be determining if functional homologs of S. cerevisiae
Met proteins are present in S. pombe
This semester: Five genes from S. pombe will be transferred to S. cerevisiae
What organism should the class study after we finish S. pombe genes?
A look at the molecular phylogeny should help
Is there any correlation between the kinds of amino acid
substitutions observed over evolution and their chemistry?
How are bioinformatics tools used to analyze the conservation of
protein sequences?
How can I identify regions of proteins that are most strongly
conserved and most likely to be important for function?
For proteins to maintain their
function, they don't tolerate drastic
changes to their shapes
Amino acid substitutions that
significantly perturb the structure of
a protein or alter its chemistry can
cause the protein to lose function
Met16p from S. cerevisiae
complexed with PAP (2OQ2)
Recall that the final folded form of a protein is
determined by its primary sequence
R (“reactive”) groups form a variety of bonds
important for structure and function
Cysteine is one of the most
evolutionarily constrained
amino acids
Cys-254 is in close proximity to
the end-product, PAP,
suggesting that it plays a role
in catalysis
Custom view of Met16p highlights Cys
Protein: backbone view
PAP: ball-and-stick
Cysteine: space-fill
Amino acids can be grouped according to the chemistry and size of their R groups
(Figure: diagram of overlapping amino acid groups. Group labels: charged, acidic, basic, polar, small, aromatic, neutral, hydrophobic)
Acidic: Glu (E), Asp (D)
Basic: Arg (R), His (H), Lys (K)
Other amino acids shown: Asn (N), Gln (Q), Gly (G), Thr (T), Cys (C), Tyr (Y), Ser (S), Trp (W), Ala (A), Ile (I), Val (V), Phe (F), Leu (L), Met (M), Pro (P)
Most amino acids are abbreviated by their first letter:
(Abundant, hydrophobic ones get preference)
A  Ala  alanine
C  Cys  cysteine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
L  Leu  leucine
M  Met  methionine
P  Pro  proline
S  Ser  serine
T  Thr  threonine
V  Val  valine
Phonetic abbreviations:
F  Phe  phenylalanine
R  Arg  arginine
Oddballs:
(Charged, aromatic, some polar)
D  Asp  aspartic acid
E  Glu  glutamic acid
K  Lys  lysine
N  Asn  asparagine
Q  Gln  glutamine
W  Trp  tryptophan
Y  Tyr  tyrosine
The one-letter code needs to be part
of a 21st-century biologist’s vocabulary
Studying the evolutionary conservation of amino acids in sequences
provides a sense of the importance of the amino acid to protein function
BLOSUM62 (BLOcks SUbstitution Matrix) was derived from substitutions observed in
aligned blocks of related proteins; sequences at least 62% identical were
clustered so that close relatives are not over-counted
Matrix assigns scores for substitutions:
Maximum score for the same amino acid
(completely conserved, possibly essential)
Positive scores are awarded for common
amino acid substitutions, in decreasing
order, based on their occurrence in proteins
Negative scores indicate unlikely substitutions
Note the high score for Cys!
The biochemical connection:
Higher scores are frequently correlated with conservative amino acid
substitutions, based on amino acid chemistry and size
Is there any correlation between the kinds of amino acid
substitutions observed over evolution and their biochemistry?
How are bioinformatics tools used to analyze the conservation of
protein sequences?
How can I identify regions of proteins that are most strongly
conserved and most likely to be important for function?
BLAST
BLAST is an acronym for Basic Local Alignment Search Tool, a
computer algorithm for finding homologous sequences in databases
BLASTN compares nucleic acid sequences
BLASTP compares protein sequences
BLOSUM62 is the default scoring matrix for BLASTP
BLOSUM62 scores relate the observed frequency of a particular substitution in
aligned blocks of related proteins to the probability that it occurs by chance
(sequences at least 62% identical are clustered when the frequencies are counted)
Score = k log10 ( Pij / (Qi * Qj) )
k is a scaling factor used to produce integral values
Pij is the observed frequency of two amino acids (i and j)
replacing each other in homologous sequences
Qi and Qj are probabilities of finding i and j randomly in a
sequence
Positive and negative scores suggest amino acid changes have been
selected for (positive) or against (negative) during evolution
Magnitude of the score suggests the strength of the selection
Score of zero suggests that a
particular substitution can be
explained by chance alone
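The log-odds formula above can be tried out in a few lines of Python. This is a minimal sketch: the function name `log_odds_score`, the choice k = 1, and the example frequencies are all hypothetical, chosen only to show how the sign of the score arises. They are not real BLOSUM data.

```python
import math

def log_odds_score(p_ij, q_i, q_j, k=1.0):
    """Score = k * log10(Pij / (Qi * Qj)), following the slide's formula."""
    return k * math.log10(p_ij / (q_i * q_j))

# Hypothetical frequencies, for illustration only (not real BLOSUM data).
# With Qi = Qj = 0.1, the frequency expected by chance is 0.01.
more_than_chance = log_odds_score(p_ij=0.02, q_i=0.1, q_j=0.1)   # observed 2x chance
exactly_chance = log_odds_score(p_ij=0.01, q_i=0.1, q_j=0.1)     # observed = chance
less_than_chance = log_odds_score(p_ij=0.005, q_i=0.1, q_j=0.1)  # observed 0.5x chance

print(more_than_chance, exactly_chance, less_than_chance)
```

A substitution seen more often than chance gives a positive score, one seen less often gives a negative score, and one seen exactly as often as chance gives a score of (approximately, in floating point) zero.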
BLASTP begins with a query sequence (e.g. your MET sequence)
The query sequence is broken into "words" that will act as seeds in alignments
(Figure: query sequence divided into words)
BLAST searches for matches (or synonyms) in target entries in the database
(Figure: target sequence with two word matches)
If a target entry has two or more matches to "words" from the query, the
alignment is extended in both directions looking for additional similarity
(Figure: target sequence with two word matches, extended in both directions)
"Words" are integral to the BLASTP search
BLASTP uses a sliding window to identify words
Consider the sequence:
E A G L E S
BLASTP would break this down into a series of four 3-letter words:
E A G
  A G L
    G L E
      L E S
Tip!
Use a monospaced (non-proportional) font such as
Courier when working with database entries.
The fonts are uglier, but the letters have a
constant spacing that generates nice columns!
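The sliding-window step described above can be sketched in Python. The function name `split_into_words` and the `word_size` parameter are illustrative choices for this sketch, not part of BLASTP itself.

```python
def split_into_words(sequence, word_size=3):
    """Slide a window of word_size along the sequence, one residue at a time."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]

words = split_into_words("EAGLES")
print(words)  # ['EAG', 'AGL', 'GLE', 'LES']
```

A sequence of length n yields n - 2 overlapping 3-letter words, which is why the 6-letter example gives four words.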
Next: words are given a numerical score
BLASTP uses the BLOSUM62 matrix as its default for assigning
values to words
E A G    5 + 4 + 6 = 15
A G L    4 + 6 + 4 = 14
G L E    6 + 4 + 5 = 15
L E S    4 + 5 + 4 = 13
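The word scores above are simply sums of the BLOSUM62 self-match values for each letter. A minimal sketch, in which the dictionary holds only the five values used on the slide and the names `SELF_SCORE` and `word_score` are illustrative:

```python
# BLOSUM62 self-match (diagonal) scores for the residues in EAGLES,
# as used on the slide.
SELF_SCORE = {"E": 5, "A": 4, "G": 6, "L": 4, "S": 4}

def word_score(word):
    """Score of a word aligned against itself: sum of per-residue self scores."""
    return sum(SELF_SCORE[residue] for residue in word)

scores = {word: word_score(word) for word in ["EAG", "AGL", "GLE", "LES"]}
print(scores)  # {'EAG': 15, 'AGL': 14, 'GLE': 15, 'LES': 13}
```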
BLASTP next checks for word synonyms (1-letter replacements) with a
score greater than a default threshold of 10
Of the 57 possible one-letter synonyms for each word (3 positions x 19 alternative
amino acids), only a small handful are statistically likely to appear in
homologous proteins
Word     Synonyms (score)
E A G    K A G (11), E S G (12), E C G (11), E T G (11), E V G (11)
A G L    S G L (11), A G I (12)
G L E    G I E (13), G L D (12), G L Q (12)
L E S    I E S (11)
BLASTP will search for all of these words
and synonyms in the protein database
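The synonym search can be sketched as follows. This is an illustration under stated assumptions rather than BLASTP's actual implementation: only the BLOSUM62 rows for the five residues in EAGLES are included, and the names `BLOSUM62_ROWS`, `pair_score` and `one_letter_synonyms` are made up for this example. Because it checks every one-letter replacement, it may report more synonyms than the handful shown on the slide.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the five residues that occur in EAGLES;
# values are listed in the order of AMINO_ACIDS.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
}

def pair_score(original, replacement):
    """BLOSUM62 score for aligning `original` (from the query word) with `replacement`."""
    return BLOSUM62_ROWS[original][AMINO_ACIDS.index(replacement)]

def one_letter_synonyms(word, threshold=10):
    """Return {synonym: score} for one-letter replacements scoring above threshold."""
    base = sum(pair_score(residue, residue) for residue in word)
    hits = {}
    for pos, original in enumerate(word):
        for replacement in AMINO_ACIDS:
            if replacement == original:
                continue
            # Swap the self score at this position for the substitution score.
            score = base - pair_score(original, original) + pair_score(original, replacement)
            if score > threshold:
                hits[word[:pos] + replacement + word[pos + 1:]] = score
    return hits

for word in ["EAG", "AGL", "GLE", "LES"]:
    print(word, one_letter_synonyms(word))
```

Each synonym is scored against the original query word, so a conservative replacement such as L to I costs only two points (4 down to 2), while most other replacements push the word below the threshold.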
Sequences must have at least two words for further consideration
BLASTP uses word matches as a nucleus and extends them in both directions,
looking for additional similarity
(Figure: word match on the target sequence, seeded by the original search word)
Q A S T L Y E - A G L E S E A T T N - - R R E I   Query
+ A + T   + +   + G L E S E A   + +     R + E +   Summary
N A A T Y W D A S G L E S - - - S Q I I R K E L   Target
As BLASTP extends the alignment out from the match, it calculates a
running score – extension stops when the score drops below a threshold value
Penalties are assigned for gaps and mismatches
Plus signs in summary line indicate a positive BLOSUM62 value
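The idea of extending from a word match while tracking a running score can be sketched as below. Several things here are assumptions made for the sake of a short example: the extension is ungapped and runs to the right only, a simple match/mismatch scheme (+5 / -2) stands in for BLOSUM62, and the stopping rule is modelled as a fixed drop (`drop_off`) below the best score seen so far. The function name and the toy sequences are hypothetical.

```python
def extend_right(query, target, q_start, t_start, seed_len,
                 match=5, mismatch=-2, drop_off=6):
    """Ungapped rightward extension from an exact seed match.

    Returns (best_score, best_length), where the seed counts toward both.
    """
    score = match * seed_len          # assumes the seed itself matches exactly
    best_score, best_len = score, seed_len
    length = seed_len
    i, j = q_start + seed_len, t_start + seed_len
    while i < len(query) and j < len(target):
        score += match if query[i] == target[j] else mismatch
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        elif best_score - score > drop_off:
            break                     # running score has fallen too far: stop
        i += 1
        j += 1
    return best_score, best_len

# Toy sequences sharing the seed "GLE" at position 0 (illustrative only)
result = extend_right("GLESEATTN", "GLESSQIIR", q_start=0, t_start=0, seed_len=3)
print(result)  # (20, 4)
```

In this toy case the extension gains one more match (S) and then runs into mismatches, so the best-scoring segment is the four residues G L E S.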
Is there any correlation between the kinds of amino acid
substitutions observed over evolution and their biochemistry?
How are bioinformatics tools used to analyze the conservation of
protein sequences?
How can I identify regions of proteins that are most strongly
conserved and most likely to be important for function?
Highly conserved protein sequences are often essential for function
You will compare sequences of homologous proteins from model organisms
Escherichia coli K-12 (gram negative)
Bacillus subtilis str. 168 (gram positive)
Caenorhabditis elegans
Arabidopsis thaliana
Mus musculus
Phylogeny.fr provides tools for
preparing multiple sequence
alignments and phylogenetic trees
Multiple sequence alignments show regions of conservation
Identical amino acids are shown in blue – conservative changes in grey
TreeDyn generates a phylogenetic tree
Length of branches reflects time since divergence from a node
Bootstrap values predict reliability of nodes in the tree (max = 1.0)
Length corresponds to 600 million years
The WebLogo program provides a graphical depiction of
multiple sequence alignments
The size of each amino acid letter reflects the frequency with which that
amino acid is found at the position; note the positions of
amino acids with high BLOSUM scores
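The per-position frequencies that a sequence logo displays can be computed directly from an alignment. The sketch below uses a made-up toy alignment and an illustrative function name, `column_frequencies`. It covers only the frequency calculation, not the drawing of the logo or any additional scaling the WebLogo program applies.

```python
from collections import Counter

# Hypothetical toy alignment: equal-length, already-aligned sequences
alignment = ["GLESC", "GIESC", "GLDSC", "GLEAC"]

def column_frequencies(aligned_sequences):
    """For each column, return {residue: fraction of sequences with that residue}."""
    n = len(aligned_sequences)
    return [{residue: count / n for residue, count in Counter(column).items()}
            for column in zip(*aligned_sequences)]

frequencies = column_frequencies(alignment)
for position, freqs in enumerate(frequencies, start=1):
    print(position, freqs)
```

Columns where a single residue has a frequency of 1.0 (here the first and last positions) are the completely conserved ones, and are the first places to look for residues that matter for function.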