
Using phylogenetic profiles
to predict protein function
and localization
As discussed by Catherine Grasso
Papers
• Pellegrini, et al. Assigning protein functions
by comparative genome analysis: Protein
phylogenetic profiles. (1999) PNAS 96, 4285-4288.
• Marcotte, et al. Localizing proteins in the cell
from their phylogenetic profiles. (2000) PNAS
97, 12115-12120.
Basic Idea:
Sequence alignment is a good way to infer
protein function, when two proteins do the
exact same thing in two different
organisms.
Proteins with > 30% sequence identity
have the same fold, and typically the same
function.
Basic Idea:
But can we decide if two proteins function
in the same pathway, such as histidine
biosynthesis, or the same biomolecular
structure, such as the flagella or ribosome,
even if they don’t do the exact same
thing?
Yes. Assume that if the two proteins
function together they must evolve in a
correlated fashion: so every organism that
has a homolog of one of the proteins must
also have a homolog of the other protein.
Phylogenetic Profile
For a given protein, BLAST against N
sequenced genomes.
Construct a vector with N coordinates.
If protein has a homolog in the organism n,
set coordinate n to 1. Otherwise set it to 0.
Protein P1: 0 0 1 0 1 1 0 0
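A minimal sketch of the profile construction described above, assuming the BLAST step has already been run and summarized as the set of genomes in which a homolog was found. The genome names `g1` to `g8` and the function name `binary_profile` are hypothetical, chosen only to reproduce the P1 example.

```python
def binary_profile(genomes, genomes_with_homolog):
    """Return a 0/1 vector with one coordinate per genome.

    Coordinate n is 1 if the protein has a homolog in genome n, else 0.
    """
    return [1 if g in genomes_with_homolog else 0 for g in genomes]


# Hypothetical genome labels; the order fixes the coordinate order.
GENOMES = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

# Assumed BLAST outcome for P1: homologs found in genomes 3, 5 and 6.
P1_HITS = {"g3", "g5", "g6"}

P1 = binary_profile(GENOMES, P1_HITS)
print("Protein P1:", P1)
```

The genome order must be the same for every protein, otherwise profiles cannot be compared coordinate by coordinate.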
Functional Link
Assign a degree of functional linkage
between P1 and P2 based on the number
of positions (or bits) at which their profiles
differ.
Protein P1: 0 0 1 0 1 1 0 0
Protein P2: 0 1 1 0 1 1 0 0
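Counting the positions at which two profiles differ is a Hamming distance. The sketch below applies it to the P1 and P2 example; the function name `profile_distance` is an illustrative choice, and how the papers turn this count into a linkage decision is not shown here.

```python
def profile_distance(p, q):
    """Number of positions (bits) at which two equal-length profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(p, q))


P1 = [0, 0, 1, 0, 1, 1, 0, 0]
P2 = [0, 1, 1, 0, 1, 1, 0, 0]

# A smaller distance suggests a stronger functional link.
print("Bits that differ between P1 and P2:", profile_distance(P1, P2))
```

P1 and P2 differ only at the second position, so their distance is 1.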
What They Did:
Computed phylogenetic profiles for 4,290
proteins in E. coli.
Aligned each protein sequence Pi with the
proteins from 16 other fully sequenced
genomes.
Genome n is defined as including a homolog
of Pi if one of the proteins it codes aligns to
Pi with a score that is deemed statistically
significant.
Conclusions
• Comparing profiles is a useful tool for
identifying the complex or pathway in which
a protein participates.
• As the number of fully sequenced genomes
increases, scientists will be able to construct
longer, more informative profiles.
• In 1999, 100 more genomes were due to be
completed in the next few months.
• Suggests that as eukaryotic genomes come
out, profiles will be a useful tool for studying
pathways in higher organisms.
Evolutionary Origin of Eukaryotic Cell
Mitochondria, chloroplasts and perhaps
other organelles descended from microbes
captured by progenitors of eukaryotic
cells.
You exist because of a bad case of
indigestion!
Evolutionary Origin of Eukaryotic Cell
This endosymbiosis was stabilized by the
shifting of organelle genes into the nuclear
genome and by the establishment of transport
systems to shuttle organellar proteins
from the cytoplasm into the organelles.
Contemporary mitochondrial genomes
encode only a few genes (<20), primarily
large integral membrane proteins which
can’t be transported.
Evidence
Proteins of these organelles have
molecular properties resembling
prokaryotic rather than eukaryotic proteins:
1. Average lengths
2. Domain composition
3. Amino acid composition
4. Homologs among prokaryotes
Phylogenetic profiles
Will show that proteins with similar
phylogenetic profiles localize to similar
subcellular locations.
Actually, will primarily show this for the
mitochondria.
Calculating phylogenetic profiles
In this study, the value at each position of
the profile is equal to -1/log E, where E is
the BLAST expectation value of best
matching protein in a genome.
Calculated only for E < 1 x 10^-6; the value is
set to 1.0 otherwise. So zero is a perfect match
and one is no match.
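The mapping from a BLAST expectation value to a profile entry can be sketched as follows. Two points are assumptions, not stated on the slide: the logarithm is taken base 10, and an E-value reported as exactly 0 is mapped to 0.0 (the limit of -1/log E as E goes to zero). The names `profile_value` and `E_CUTOFF` are illustrative.

```python
import math

E_CUTOFF = 1e-6  # significance cutoff given in the text


def profile_value(e_value):
    """Profile entry for one genome from the best hit's E-value.

    Returns -1/log10(E) when E < E_CUTOFF, and 1.0 otherwise
    (including when there is no hit at all, passed as None).
    """
    if e_value is None or e_value >= E_CUTOFF:
        return 1.0
    if e_value == 0.0:
        return 0.0  # assumption: treat E = 0 as a perfect match
    return -1.0 / math.log10(e_value)  # assumption: base-10 logarithm


# Hypothetical best-hit E-values for one protein across five genomes.
best_hits = [1e-50, 1e-10, 0.5, None, 0.0]
profile = [profile_value(e) for e in best_hits]
print(profile)
```

Under these assumptions, significant hits fall between 0 and about 0.17, so they stay well separated from the 1.0 assigned to genomes without a significant match.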
Three Categories
Prokaryote Derived: Only has homologs
in prokaryotes.
Eukaryote Derived: Only has homologs in
eukaryotes.
Organism Specific: Has no homologs.
Why split these categories? Should have
different functions and roles in
mitochondria.
Linear Discriminant Functions
[Figure: linear discriminant function with threshold t separating the MP and Non-MP classes.]
Varying t increases prediction accuracy at the expense of coverage.
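The trade-off controlled by the threshold t can be illustrated with a toy example. This is a sketch under stated assumptions: the discriminant score is a weighted sum of profile entries, "accuracy" is taken to mean the fraction of predicted MP that are truly MP, and "coverage" the fraction of true MP that are predicted. The scores and labels below are made up for illustration and are not data from the paper.

```python
def discriminant_score(profile, weights, bias=0.0):
    """Linear discriminant: weighted sum of profile entries plus a bias."""
    return sum(w * x for w, x in zip(weights, profile)) + bias


def accuracy_and_coverage(scores, is_mp, t):
    """Call a protein MP when its score exceeds t, then evaluate the calls."""
    predicted = [s > t for s in scores]
    true_pos = sum(1 for p, y in zip(predicted, is_mp) if p and y)
    n_predicted = sum(predicted)
    n_mp = sum(is_mp)
    accuracy = true_pos / n_predicted if n_predicted else float("nan")
    coverage = true_pos / n_mp if n_mp else float("nan")
    return accuracy, coverage


# Toy scores and labels (True = known MP), purely illustrative.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
is_mp = [True, True, False, True, False, True, False]

for t in (0.5, 0.75):
    acc, cov = accuracy_and_coverage(scores, is_mp, t)
    print(f"t={t}: accuracy={acc:.2f}, coverage={cov:.2f}")
```

On this toy data, raising t from 0.5 to 0.75 removes the one false call but also drops a true MP, so accuracy rises while coverage falls.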
Testing Algorithm
First, predicted the location of yeast
proteins of known location (open
diamonds).
Second, a jackknife test was performed.
Repeated 100 times with different random
sets (filled diamonds). Coverage 58% at
50% accuracy.
Third, used yeast proteins as training set
and worm proteins as test set. Coverage
65% at 50% accuracy.
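The second test above repeats the evaluation on 100 different random sets. A minimal sketch of that kind of repeated random split follows. The slide does not say how large each held-out set was, so `holdout_fraction` is an assumed parameter, and the function name and protein labels are hypothetical.

```python
import random


def repeated_holdout(items, n_repeats=100, holdout_fraction=0.5, seed=0):
    """Return n_repeats (train, test) splits of items, each drawn at random."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    items = list(items)
    n_test = max(1, int(len(items) * holdout_fraction))
    splits = []
    for _ in range(n_repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        # First n_test items are held out; the rest are used for training.
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return splits


proteins = [f"protein_{i}" for i in range(10)]  # hypothetical labels
splits = repeated_holdout(proteins)
print(len(splits), "splits; first test set:", sorted(splits[0][1]))
```

Each split would be used to fit the discriminant on the training part and score it on the held-out part, with accuracy and coverage then averaged over the repeats.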
Prediction
Applied algorithm to all yeast proteins.
Estimate ~630 total mitochondrion-targeted genes in yeast, or 10% of
genome.
Applied algorithm to all worm proteins.
Estimate ~660 total mitochondrion-targeted genes in worms, or 4% of
genome.
Verifications
• Tested whether functions of newly predicted
mitochondrial proteins matched functions of known
mitochondrial proteins better than the functions of a
random set of proteins. (Jaccard coefficient, pie
charts)
• Measured the fraction of predicted mitochondrial
proteins with predicted transmembrane segments or
signal peptides.
• 2D gel of whole rat liver and human placental
mitochondria reveals ~250-350 visible proteins.
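The first verification compares sets of functions using the Jaccard coefficient, which is the size of the intersection of two sets divided by the size of their union. The sketch below shows only the coefficient itself; the function categories are invented for illustration, and exactly how the study grouped functions is not given on the slide.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical function categories, for illustration only.
known_mito = {"respiration", "translation", "transport"}
predicted_mito = {"translation", "transport", "lipid metabolism"}
random_set = {"cell cycle", "transport", "signaling"}

print("predicted vs known:", jaccard(predicted_mito, known_mito))
print("random vs known:   ", jaccard(random_set, known_mito))
```

A coefficient that is higher for the predicted set than for the random set would indicate that the predictions resemble known mitochondrial proteins in function.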
Conclusions
There is information in the phylogenetic
profiles, but it is quite noisy.
Yields approximate numbers of genes
migrated to the nuclear genomes from the
mitochondria.
Gives even more evidence for
endosymbiotic theory.
However, verifications did not confirm
results as much as one might like.
Perhaps the fundamental assumption is flawed.