AFP for Structural Genomics and Metagenomics

Download Report

Transcript AFP for Structural Genomics and Metagenomics

Predicting the function of a protein
form either a sequence or a structure
(is not trivial)
Adam Godzik
The Sanford-Burnham Medical Research
Institute
Summary - overview




Homology based methods
Analogy based methods
Physics based methods
Why function prediction?
What we mean by function

Multilevel definition



Phenotype
Cellular function
Molecular function
(activity)



Substrates
Inhibitors
cofactors

Several attempts to
develop a unified
function
classification



EC classification for
enzymes 4.2.31.101
Merops (proteases),
CAZY (hydrolases)
Gene ontology
Two, complementary views of the
evolution and diversity of life
Organisms (species)
Genes (proteins)
Both are amazingly large and
diverse

Organisms (species)


About 1.5M known today, 10-100
million species estimated to exists,
depending on the definition of
species and other assumptions
Their relations can be described in a
tree of life, at least for eukaryotes.

Proteins


With 20 amino acid alphabet, the
number of possible protein
sequences is very large (20100 i.e.
1.2*10130 short proteins(!))
Total number: >10billions?



Bacterial and archeal tree of life is
much more controversial, some even
dispute the concepts of species for
bacteria

10-100M species, with ~4K genes in a
bacterial and ~10K in an eukaryotic
genome
Over 25 million known today, i.e.
~0.2%

Representative sample?
From the 25 million proteins known
today



Direct experimental data is available for
few thousand proteins
Indirect experimental data are available
for perhaps few hundred thousand
Structures of ~60 thousands have been
solved
protein universe seems to be very large.
But is it random?
Many proteins (like species) are
close relatives







Histone H1 (human) - histone H1
(chicken)
SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
| | || || || ||| ||| | |||||||||||||||||| ||| |||||| ||
SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT
similarity: 77% id, BLAST e.value 0.0
function: two H1 histones from different
species (orthologs)
Their functions and structures are
obviously very similar
We can organize the protein universe into
neighborhoods (families)?
How many protein families are still out
there?
Number of clusters (thousands)
Rate of discovery
250
200
size >=3
150
size >=5
size >=10
100
size >=20
50
0
0
1
2
3
4
5
6
7
Number of sequences (millions)

Number of protein clusters (modeling families) grows
linearly in number of protein sequences (and exponentially
in time) – cumulative total
From Yooseph et al, PloS Biology, (2007) 5:e16
How far can we go?







Histone H5 - histone H1
TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA
| | | | | | | |
|
|||
|
| | |||| ||||||||
SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS
similarity: 40% seq id, BLAST e.value
10-15
function: two histones (paralogs)
Structures still very similar, functions
somewhat different, but obviously
similar
This is surely too far?

Histone H5 - TRANSCRIPTION FACTOR
E2F-4

PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
| |
|
|
|
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

similarity :7% seq id, BLAST e.value 1

Is it?



Structure – obviously
similar (2.4 Å RMSD
over 80 aa)
function – clearly
related (both bind DNA)
More subtle similarity
can be detected with
more sophisticated
methods
We can keep adding more layers
most “function assignments”
are provided by predicted
homology


Unknown protein
GLLTTKFVSLLQEAKDGVLDLKL
AADTLAVRQKRRIYDITNVLEGIG
Similarity
LIEKKSKNSIQW


->
homology
similarity
?
prediction
Well studied protein
SRRSASHPTYSEMIAAAIRAEKS
RGGSSRQSIQKYIKSHYKVGHN
ADLQIKLSIRRLLAA
Similarity -> homology based
annotations

Recognition of close
and/or distant
homologs based on
similarity



Sequence
Sequence/profile,
profile/profile
Structure

Problems

How to predict
differences? Even
homologous proteins
evolve and change!
Prediction by homology
Recognition
Alignment
Are there any well characterized
proteins similar to my protein?
Can we assume they are homologous?
What is the position-by-position
target/template equivalence
Modeling
Structure of my protein is
similar to the other one
Function
prediction
Function of my protein is
similar to the other one
We could predict
Structure of a
complex
Role in the whole
organism
activity
3D structure
Important distinction

Similarity



Homology
Two proteins have similar
sequences/structures/function
s if by some metric the s/s/f
of one protein is more similar
to the s/s/f of another than to
a randomly chosen protein

Two proteins are
homologous if they
have evolved from a
common ancestor
Common error

Two proteins are 65% homologous


What we really meant
The sequences of two proteins are 65% similar, therefore
we can safely assume they are homologous, why else
they would be so similar?
If life would be easy, this is how it
would look like
similar
homologous
not similar
unrelated
Not (obviously) similar, but
(probably) homologous

Histon H5 and transcription factor E2F4, identity 7%, similar fold, similar
function (DNA binding)

PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
| |
|
|
|
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

Similar, but not homologous


phosphoribosyltransferase and viral coat protein, identity: 42%, different
folds, different functions
.
.
.
.
.
99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173
: ||. ||| ||
|.
|| | : |
| | | || | || |:|
| ||.| |
214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

Similarity vs. homology
similar
homologous
not similar
homologous
similar
not homologous
not similar
unrelated
Can we return to this simple
picture by redefining similarity?
similar
homologous
not similar
unrelated
Are these two protein families
related?


New protein (target)

KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEI

EAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLM
IDEVGEMMHSNIEKAKLCLQ
Known protein (template)
VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTS
SRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF
How to compare two families?
Score =

alignment
 
Fam1
Fam 2
Ali MM ij B jm ?
Profile-profile similarity
Compare as
vectors in 21
dimensional
space (FFAS)
How to validate a protocol
1. Recognition

Folding benchmarks




from structural clustering of PDB (several sets, 700 pairs
used here)compared to sequence based clustering of the
same group of proteins
correct predictions vs. wrong predictions
CASP meetings, CAFASP, LiveBench
published and/or publicly available predictions,
fold prediction servers, available prediction
programs
Summary - overview




Homology based methods
Analogy based methods
Physics based methods
Why function prediction?
Similarity -> analogy based
annotations

Recognition of
potential analogs
based on similarity in




Genome organization
(non homologous
replacements)
Genomic fingerprints
Expression patterns
Specific features


Charge distribution
Presence of specific
patterns

Problems

Is this similarity
related to function?
TM0449 (thy1) – from prediction to
proof

TM0449

Hypothetical, uncharacterized
protein

Multiple homologs in pathogenic
and thermophilic bacteria

Novel fold
evidence




Phylogenetic profile complementing
thymidylate synthase
A homolog complements TS in
Dictyostelium
Confirmed experimentally
3D motif search finds an identical
arrangement binding phosphate in a
different protein
Summary - overview




Homology based methods
Analogy based methods
Physics based methods
Why function prediction?
“Ab initio function prediction” –
substrate docking
We know the structure of one protein in
the family and functions of some others –
is the function conserved?
Newly
solved
target
Gallery of models
We can analyze conservation of surface
features by mapping them on the sphere

And then compare maps between
homologs
And come up with new (predicted)
functions
Phospholipid vs. retinol vs. short peptide binding
Summary - overview




Homology based methods
Analogy based methods
Physics based methods
Why function prediction?
Why my interest in function
prediction?

Structural genomics: the structure is
often the easiest experimental
information to obtain (after sequence)
Function vs function

We witnessed dramatic
technological advances in
sequencing and now structure
determination, function
analysis remain a painstaking,
manual effort.
We used to know a lot about
function even before we
started working on a protein.
Well, not anymore
1970
1990
2005
2010 ?
1 year

Function discovery
Structure determination
Sequencing
Structure determination is now
done on an assembly line
target
selection
expression
cloning
1X
3X
purification
5X
2X
2X
imaging
1X
harvesting
1X
crystallization
1X
1X
bl xtal mounting
xtal screening
1X
1X
publication
annotation
data collection
tracing
2X
2X
struc. validation
1X
PDB
phasing
1X
1X
struc. refinement
7X
Even few years ago functional
annotation seemed trivial
target
selection
expression
cloning
1X
3X
purification
5X
2X
2X
imaging
1X
harvesting
1X
crystallization
1X
1X
bl xtal mounting
xtal screening
1X
1X
publication
annotation
data collection
tracing
2X
2X
struc. validation
1X
PDB
phasing
1X
1X
struc. refinement
7X
After few years, the reality
seems to be very different
target
selection
expression
cloning
1X
purification
imaging
harvesting
crystallization
1X
1X
bl xtal mounting
xtal screening
data collection
annotation
struc. validation
1X
PDB
tracing
2X
2X
publication
phasing
1X
1X
struc. refinement
7X
“reverse order” of function and structure
determination and it’s challenges

The classical way






1. A function is discovered
and studied
2. The gene responsible in
this function is identified
3. Function is confirmed
4. Product of this gene is
isolated, crystallized solved.
5. we have a whole story!
Structure “rationalizes”
function and provides
molecular details

Post-genomic



1. a new, uncharacterized gene
is found in a genome
2. predictions or high-throughput
methods prioritize this gene for
further studies
3. the protein is studied in detail


Structure is solved in a high
throughput center
Structure is the first
experimental information about
the “hypothetical” protein
We now have hundreds of structures
of proteins with unknown functions
Summary


For some, function prediction is a practical,
day to day problem
Analogy based approaches dominate the field




Homology seen from sequence similarity
structural similarities
Potential active sites, clefts, surface features
Many useful tools exists, but they are very
scattered and not very user-friendly
Summary (2)




Avoid overconfidence - “easy” predictions
contain many surprises
Only synergy of several independent lines of
reasoning can give a correct answer
Elimination of “easy”, but inconsistent
predictions is critical
So far, AFP doesn’t even come close to
expert analysis