Transcript document

Motivation for Reference Genome Effort
Fully and reliably annotated genomes:
• empower scientific research
• are essential for use in automatic inference.
We comprehensively capture experimental data from the most active
research communities, producing high-confidence functional
descriptions that leverage the power of the comparative method
for inference.
Deliverables of Reference Genome Effort
1. Proteome sets
2. Annotation best practices documentation
3. Annotation software tool
4. Reference annotations for inference of function in other species
Evolutionary relationships are
the “glue” in RefGenome
• Goal
– identify genes in reference genomes that may have the
same or similar functions, so that comprehensive curation
can be done simultaneously
• Why?
– Different model organisms have different strengths for
investigating gene function, and these can often inform each
other
– Most genes did not first evolve within a given extant species:
they were INHERITED from a common ancestor shared with
other species. Genes in different organisms have similar
functions because they were inherited, and haven’t changed
much since the common ancestor.
Current process
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build “ortholog clusters”
• Selection of “annotation set”, including independent ortholog identification at each MOD
• Individual MODs annotate in-depth each gene in set
• ISS annotations made independently by each MOD
New process: coordinate and centralize where possible
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build trees
• Gp2protein files used to build “ortholog clusters”
• Trees and clusters used to define ref. genome annotation sets
• Individual MODs annotate in-depth each gene in set
• Inferences made to ancestral proteins
• Inferences made to extant proteins
New process (continued)
• Select “gene set for concurrent annotation” from a central resource with more complete information
New process (continued)
• Make homology-based annotations concurrently and consistently in the context of an evolutionary tree
Update on progress:
comprehensive gene sets from each MOD
• Short term solution implemented as of 9/4
– Gp2protein files are now approximately complete
• Most sets were OK as deposited by the MOD
• A few sets had to be augmented (missing genes filled in
from Ensembl or Entrez Gene), one set had to be
reduced by selecting a single “representative” protein
sequence per gene
• Long term solution: UniProt?
• SwissProt record includes all alternatively spliced exons,
which is ideal for evolutionary modeling of protein-coding
gene history
• We have already shared the gp2protein files with
SwissProt, and they are comparing to UniProt “complete
proteome” sets
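The reduction step mentioned above (one “representative” protein sequence per gene) can be sketched as follows. This is only an illustration: the slides do not say how the representative was chosen, so the "longest sequence" rule, the record layout, and all identifiers here are assumptions.

```python
from collections import defaultdict

def pick_representatives(records):
    """Return {gene_id: protein_id}, keeping one protein per gene.

    `records` is an iterable of (gene_id, protein_id, sequence) tuples.
    ASSUMPTION: the representative is the longest sequence; the source
    does not state the actual selection rule.
    """
    by_gene = defaultdict(list)
    for gene_id, protein_id, sequence in records:
        by_gene[gene_id].append((protein_id, sequence))

    representatives = {}
    for gene_id, proteins in by_gene.items():
        # Longest sequence wins; the protein ID breaks ties deterministically.
        protein_id, _ = min(proteins, key=lambda p: (-len(p[1]), p[0]))
        representatives[gene_id] = protein_id
    return representatives

# Hypothetical identifiers and sequences, for illustration only.
records = [
    ("GENE:1", "PROT:1a", "MKT"),
    ("GENE:1", "PROT:1b", "MKTAYIAK"),
    ("GENE:2", "PROT:2a", "MSL"),
]
print(pick_representatives(records))
# {'GENE:1': 'PROT:1b', 'GENE:2': 'PROT:2a'}
```

A different rule (for example, a curator-designated canonical isoform) would only change the `key` function.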
Proposal made at this meeting
• Write a white paper describing the “complete
protein-coding gene set” needs/requirements
for the RefGenome project
• Michael will approach Amos and discuss
options for working together
Example: NEDD4
• Selected for electronic jamboree Oct. 2008
• Human NEDD4 was “core” target
• OrthoMCL identified “orthologs” in
– Drosophila
– C. elegans
– Mouse (2)
– Human (2)
– Zebrafish
– Rat
• Curators at SGD identified an ortholog in
yeast from a published paper
Figure: orthologs (green) and paralogs (orange) of human NEDD4 (red)
• Duplications at base of metazoa: WWP1/2 and SMURF1/2 diverge; NEDD4 conserved
• Duplication at base of chordata: HACE1 diverges; NEDD4 conserved
• Duplication at base of reptilia?
Figure: OrthoMCL cluster containing human NEDD4/NEDD4L (blue) and curator-identified yeast ortholog (light blue)
• Tree labels: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia
Figure: orthologs (green) and paralogs (orange) of human NEDD4 (red), and “conserved orthologs” of NEDD4/NEDD4L (yellow)
• Tree labels: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia
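The ortholog/paralog distinction in these tree figures follows from the event at the last common ancestor of two genes: a speciation node gives orthologs, a duplication node gives paralogs. A minimal sketch of that rule is below. The tree, node names, and gene labels are invented for illustration and are not the NEDD4 tree from the slides.

```python
# Toy gene tree as child -> parent links (hypothetical labels, not real data).
PARENT = {
    "yeast_A": "root",
    "dup1": "root",
    "spec_A": "dup1",
    "spec_B": "dup1",
    "human_A": "spec_A",
    "fly_A": "spec_A",
    "human_B": "spec_B",
    "fly_B": "spec_B",
}
# Event type recorded at each internal node.
EVENT = {
    "root": "speciation",
    "dup1": "duplication",
    "spec_A": "speciation",
    "spec_B": "speciation",
}

def ancestors(node):
    """Return the path from `node` up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def last_common_ancestor(a, b):
    """Return the deepest node that lies on both root paths."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def relationship(a, b):
    """Classify two distinct leaf genes by the event at their last common ancestor."""
    event = EVENT[last_common_ancestor(a, b)]
    return "ortholog" if event == "speciation" else "paralog"

print(relationship("human_A", "fly_A"))    # ortholog
print(relationship("human_A", "human_B"))  # paralog
print(relationship("human_A", "yeast_A"))  # ortholog
```

Note that `human_A` and `fly_B` come out as paralogs here even though they are in different species, because the duplication predates the speciation that separates them.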
Update on progress
Gene trees and “homology set” selection tool
• Gene trees have been built for all existing PANTHER
families, from all RefGenome species, plus 35 other
“phylogenetically informative” species
• Tree Curation Tool has been updated by Paul’s and
Suzi’s groups in collaboration
– Retrieves and displays tree, and UniProt information for
each sequence
– Displays OrthoMCL clustering results; scalable to any
number of different clustering algorithms
– “Pre-alpha” prototype has been installed and is being tested
by Pascale
• GOC has obtained supplemental funding to support
– Adding multiple homology clustering algorithms
– A “protein family curator”
Proposal made at this meeting
• Lead RefGenome Curator and Protein Family
Curator work together to define set of genes
to be annotated concurrently
• No need for review by individual MODs
New process (continued)
• Review and sign off on r.g. experimental annotations
Annotation inference based on homology
• We need to make homology inferences correctly and
consistently
– Infer only from annotations with experimental evidence
– Use explicit evolutionary model: inheritance (maybe with
modification) from a common ancestor!
• Homology inference is actually two inferences
– 1. the common ancestor has the same annotation as its
descendant that has been characterized
– 2. another (unannotated) descendant has the same annotation as
its ancestor
• Need traceable, versioned evidence trail:
– Inferred annotation -> tree -> experimental annotation(s)
-> literature
Figure: gene tree with example annotations (two nodes marked “?”)
• GO process: cellular response to UV
• GO process: positive regulation of synaptogenesis
• GO function: ubiquitin-protein ligase activity
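The two-step inference and the evidence trail described on this slide can be sketched in code. This is a simplified illustration under stated assumptions: the `Annotation` record, the toy tree, the `TREE_VERSION` string, and the `REF:1` reference are all hypothetical, and a real curation tool would let a curator decide case by case rather than propagate automatically.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    gene: str
    go_term: str
    evidence: str        # "experimental" or "inferred"
    reference: str = ""  # literature reference, for experimental annotations
    trail: tuple = ()    # evidence trail, for inferred annotations

TREE_VERSION = "toy-tree-v1"                              # hypothetical version label
CHILDREN = {"anc1": ["human_X", "mouse_X", "fly_X"]}      # hypothetical tree

def infer_via_ancestor(ancestor, annotations):
    """Return (ancestral, extant) inferred annotations for one ancestral node."""
    descendants = CHILDREN[ancestor]
    # Only annotations with experimental evidence may seed an inference.
    support = {}
    for a in annotations:
        if a.gene in descendants and a.evidence == "experimental":
            support.setdefault(a.go_term, []).append(a)

    ancestral, extant = [], []
    for go_term, sources in support.items():
        source_trail = tuple((s.gene, s.reference) for s in sources)
        # Inference 1: the ancestor has the annotation of its characterized descendant(s).
        anc = Annotation(ancestor, go_term, "inferred",
                         trail=(TREE_VERSION,) + source_trail)
        ancestral.append(anc)
        # Inference 2: descendants lacking experimental support inherit from the ancestor.
        characterized = {s.gene for s in sources}
        for gene in descendants:
            if gene not in characterized:
                extant.append(Annotation(gene, go_term, "inferred",
                                         trail=(ancestor,) + anc.trail))
    return ancestral, extant

known = [
    Annotation("human_X", "ubiquitin-protein ligase activity", "experimental", "REF:1"),
    Annotation("mouse_X", "ubiquitin-protein ligase activity", "inferred"),
]
ancestral, extant = infer_via_ancestor("anc1", known)
for a in extant:
    print(a.gene, a.trail)
```

Each inferred annotation carries a trail that reads ancestor -> tree version -> (experimental gene, literature reference), so a reviewer can walk back from any inference to its source. The existing inferred annotation on `mouse_X` is ignored as a seed, matching the "infer only from annotations with experimental evidence" rule.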
Proposal made at this meeting
• Protein family curator makes first pass
at homology inferences
– Confers with individual MODs as
necessary
• Iterative: protein family curator prepares
list of inferred annotations for each
MOD; each MOD reviews and can
suggest changes
Annotation process
Trees and clusters used to define ref. genome annotation sets
1. Protein family curator (Princeton/Pascale) suggests protein set based on report/examination of trees
2. MOD curators annotate all experimental data to completion
3. Protein family curator mediates annotation review
Review and sign off on r.g. experimental annotations
Inferences made to ancestral proteins (protein family curator)
Inferences made to extant proteins (protein family curator; reviewed by protein family and MOD curators)
Done!
Transformations
Princeton / P-POD update
• New run with protein sets used by PANTHER
under way
• Implementing algorithms for generation of
consensus clusters and other ortholog
prediction methods
• New P-POD features
– P-POD search
– P-POD results/disambiguation
– P-POD-Notung
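One simple way to form a consensus across several ortholog clustering methods is a vote over co-clustered gene pairs. The sketch below shows that idea only; it is not P-POD's algorithm, which the slides do not describe, and the method names and gene labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def consensus_pairs(clusterings, min_support):
    """Return the gene pairs placed in the same cluster by at least `min_support` methods.

    `clusterings` maps a method name to a list of clusters (sets of gene IDs).
    """
    votes = Counter()
    for clusters in clusterings.values():
        pairs = set()
        for cluster in clusters:
            # Sort so each pair has one canonical orientation.
            pairs.update(combinations(sorted(cluster), 2))
        votes.update(pairs)  # each method votes at most once per pair
    return {pair for pair, count in votes.items() if count >= min_support}

# Hypothetical output of three clustering methods over four genes.
clusterings = {
    "method_a": [{"g1", "g2", "g3"}, {"g4"}],
    "method_b": [{"g1", "g2"}, {"g3", "g4"}],
    "method_c": [{"g1", "g2", "g3", "g4"}],
}
print(sorted(consensus_pairs(clusterings, min_support=2)))
# [('g1', 'g2'), ('g1', 'g3'), ('g2', 'g3'), ('g3', 'g4')]
print(sorted(consensus_pairs(clusterings, min_support=3)))
# [('g1', 'g2')]
```

Pairwise votes do not by themselves form clean clusters (here `g3` pairs with both `g2` and `g4`, but `g2` and `g4` do not pair), so a further step such as taking connected components or cliques would be needed to turn them into consensus clusters.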
New process (continued)
• UniProt complete proteome project?
• How to most efficiently incorporate input from all MOD curators?
• Pascale picks a focal gene
• How are resulting homology-based annotations delivered to MODs?