GO_EUGENE_2009_pipeline1

Protein Family Annotation Pipeline: update
Protein Family Curators: Pascale Gaudet (dictyBase), Mike Livstone (Princeton), Kara Dolinski (Princeton)
1. Protein family is chosen for annotation
2. MOD curators review and sign off on experimental annotations
(Pascale will talk about these steps)
3. Protein family curators annotate ancestors, extant proteins
4. MOD curators review, sign off, and add inferred annotations to their MOD
(Kara will talk about these steps)
…then we will talk about a couple of examples.
1. Protein family is chosen for annotation
• Setting curation priorities: *
  - as before (disease, 'hot genes', metabolism, unknown and conserved)
  - suggestions by annotators
• Once PAINT/PANTHER is ready, generate reports of good candidates based on the experimental evidence available. A tracker would be helpful (see next slide).
* This is an action item: to be discussed
Setting annotation priorities

PANTHER family ID   # RG orthologs   # RG species represented   # genes compreh. annotated   Date family annotated   Most recent compreh. annotation
PANTHER001          35               8                          30                           3/2009                  4/2009
PANTHER002          12               12                         12                           Not done                4/2009
PANTHER003          4                4                          0                            Not done                1/2009
PANTHER004          10               6                          2                            Not done                1/2009
PANTHER005          12               8                          3                            Not done                10/2007
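
One way such a candidate report could be generated from the table is sketched below. The ranking rule (families not yet annotated, ordered by the fraction of RG orthologs that are comprehensively annotated) is an assumption for illustration, not a rule stated on the slides. The names FamilyStatus and candidate_families are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FamilyStatus:
    family_id: str
    rg_orthologs: int                     # "# RG orthologs" column
    rg_species: int                       # "# RG species represented" column
    genes_comprehensive: int              # "# genes compreh. annotated" column
    date_family_annotated: Optional[str]  # None stands for "Not done"

    @property
    def coverage(self) -> float:
        """Fraction of RG orthologs that are comprehensively annotated."""
        return self.genes_comprehensive / self.rg_orthologs


# Rows copied from the example table on the slide.
FAMILIES = [
    FamilyStatus("PANTHER001", 35, 8, 30, "3/2009"),
    FamilyStatus("PANTHER002", 12, 12, 12, None),
    FamilyStatus("PANTHER003", 4, 4, 0, None),
    FamilyStatus("PANTHER004", 10, 6, 2, None),
    FamilyStatus("PANTHER005", 12, 8, 3, None),
]


def candidate_families(families):
    """Families not yet annotated, best-covered first (assumed ranking rule)."""
    todo = [f for f in families if f.date_family_annotated is None]
    return sorted(todo, key=lambda f: f.coverage, reverse=True)


if __name__ == "__main__":
    for fam in candidate_families(FAMILIES):
        print(f"{fam.family_id}: {fam.coverage:.0%} comprehensively annotated")
```

With the example rows, PANTHER002 comes first because all 12 of its RG orthologs are already comprehensively annotated.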
2. MOD curators review and sign off on experimental annotations
1. Distributing the list: PANTHER family ID
2. Annotators 'comprehensively annotate'
3. One-month deadline for “sign off”; include the gene list in monthly targets to keep these on the radar?
Curation targets
Distributing the list: PANTHER family ID -> Gene IDs
PANTHER family IDs: PANTHER005, PANTHER0010, PANTHER0012, PANTHER0013, PANTHER0014, PANTHER0015, ….
Gene IDs:
C. elegans: WBGene:1238787, WBGene:123878, WBGene:23487243, WBGene:3827487
H. sapiens: Q12345
M. musculus: MGI:123456, MGI:234356
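
A minimal sketch of how a distributed gene list in this "species: ID, ID, ..." layout could be split into per-organism lists, so that each MOD receives only its own genes. The function name parse_gene_list is hypothetical, and the IDs are the placeholder IDs from the slide.

```python
def parse_gene_list(lines):
    """Map each species label to its list of gene IDs.

    Each line is expected to look like "Species: ID, ID, ...".
    Whitespace inside an ID (e.g. "MGI: 123456") is removed.
    """
    genes_by_species = {}
    for line in lines:
        # Split at the first colon only; IDs contain colons themselves.
        species, _, id_text = line.partition(":")
        ids = ["".join(part.split()) for part in id_text.split(",")]
        genes_by_species[species.strip()] = [gene_id for gene_id in ids if gene_id]
    return genes_by_species


# Placeholder IDs as shown on the slide.
SLIDE_LINES = [
    "C. elegans: WBGene:1238787, WBGene:123878, WBGene:23487243, WBGene:3827487",
    "H. sapiens: Q12345",
    "M. musculus: MGI:123456, MGI:234356",
]

if __name__ == "__main__":
    for species, gene_ids in parse_gene_list(SLIDE_LINES).items():
        print(species, gene_ids)
```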
Comprehensive Curation Done
Annotators must be able to mark individual genes as 'comprehensively annotated' in the PAINT tool/tracker system.
Enter ID: WBGene:1238787
Done!
Summary
Gene Name   Annotator   Date
top2-1      Kimberly    3-31-2009
msh-2       Kimberly    3-31-2009
msh-6       Kimberly    3-31-2009
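
The sign-off record could be modelled roughly as follows. This is only an in-memory sketch: the Tracker and SignOff classes are hypothetical stand-ins, not the PAINT tool or the SourceForge tracker, and genes are keyed by whatever identifier the annotator enters.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SignOff:
    annotator: str
    signed_on: date


@dataclass
class Tracker:
    """Hypothetical in-memory stand-in for the PAINT tool/tracker system."""
    sign_offs: dict = field(default_factory=dict)

    def mark_done(self, gene, annotator, signed_on):
        """Record that a gene is 'comprehensively annotated'."""
        self.sign_offs[gene] = SignOff(annotator, signed_on)

    def is_done(self, gene):
        return gene in self.sign_offs

    def pending(self, genes):
        """Genes from a curation target list that still await sign-off."""
        return [g for g in genes if g not in self.sign_offs]


if __name__ == "__main__":
    tracker = Tracker()
    tracker.mark_done("top2-1", "Kimberly", date(2009, 3, 31))
    tracker.mark_done("msh-2", "Kimberly", date(2009, 3, 31))
    print(tracker.pending(["top2-1", "msh-2", "msh-6"]))
```

A pending list per family would also give the "# genes compreh. annotated" count needed for the priorities table.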
Setting annotation priorities (table repeated from the earlier slide)
PAINT will display annotation status
3. Protein family curators annotate ancestors, extant proteins
- We have started protein family annotation at this step, beginning with families that have proteins that are “done” (based on notes in the SourceForge tracker).
- We will go through an example or two (HPRT1, MSH2) to show our process. Even in these “done” cases, we have provided feedback to MODs on existing protein annotation, which led to improvements in accuracy/consistency.
4. MOD curators review/approve, and add inferred annotations to their MOD
How to make the process easy for the MODs:
Pre-PAINT:
- Email link to tree, summary with GO term(s) and IDs, proteins, WITH (will use one of the proteins in the family), reference
- Is a gene association file also required?
Post-PAINT:
- Incorporation into PANTHERdb
- Email link to tree, summary with GO term(s) and IDs, proteins, WITH (will use persistent PANTHER identifier for the ancestor node), reference
- Gene association file
GAF file 1
ID             GO term      With                Evidence   REF
PANTHERNODE1   GO:1234457   Mouse|human|yeast   ISS        Panther method
PANTHERNODE1   GO:7654321   Mouse|human         ISS        Panther method
PANTHERNODE3   GO:1234457   Mouse|human|yeast   ISS        Panther method
GAF file 2
ID                 GO term      With           Evidence   REF
Dicty_gene_A       GO:1234457   PANTHERNODE1   ISS        Panther method
cerevisiae_gene1   GO:1234457   PANTHERNODE1   ISS        Panther method
mouse_gene-1       GO:1234457   PANTHERNODE1   ISS        Panther method
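
The relationship between the two files can be sketched as follows: each ancestor-node row in file 1 yields one row per extant gene under that node in file 2, with the ancestor node ID in the With column. Only the five columns shown on the slides are modelled (a real gene association file has more columns). The DESCENDANTS mapping is an assumption read off the example, and the sketch propagates every term on a node, whereas the slide shows only one of them.

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: str
    go_term: str
    with_: str
    evidence: str
    ref: str


# Ancestor-node annotations, as in "GAF file 1".
ANCESTOR_ROWS = [
    Row("PANTHERNODE1", "GO:1234457", "Mouse|human|yeast", "ISS", "Panther method"),
    Row("PANTHERNODE1", "GO:7654321", "Mouse|human", "ISS", "Panther method"),
    Row("PANTHERNODE3", "GO:1234457", "Mouse|human|yeast", "ISS", "Panther method"),
]

# Assumed membership: which extant genes descend from which ancestor node.
DESCENDANTS = {
    "PANTHERNODE1": ["Dicty_gene_A", "cerevisiae_gene1", "mouse_gene-1"],
}


def inferred_rows(ancestor_rows, descendants):
    """Build "GAF file 2" style rows: extant gene, GO term, WITH = ancestor node."""
    rows = []
    for anc in ancestor_rows:
        for gene in descendants.get(anc.id, []):
            rows.append(Row(gene, anc.go_term, anc.id, anc.evidence, anc.ref))
    return rows


if __name__ == "__main__":
    for row in inferred_rows(ANCESTOR_ROWS, DESCENDANTS):
        print("\t".join(row))
```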
- Do we need a way to ensure annotation propagation to the MODs?
- Issue of maintenance: we need to be alerted when there is new data, or when an annotation on which inferences were based is removed (see next slide)
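
One possible shape for such an alert is sketched below: compare two snapshots of the experimental annotations and report what is new, plus any inference whose supporting evidence has disappeared. The function name, the data layout, and the example gene and node names are all assumptions for illustration. No such mechanism is described on the slides.

```python
def maintenance_alerts(supporting, previous, current):
    """Compare two snapshots of experimental annotations.

    supporting: dict mapping an inference (node, GO term) to the set of
                experimental annotations (gene, GO term) it was based on.
    previous, current: sets of experimental annotations (gene, GO term).
    """
    removed = previous - current
    added = current - previous
    broken = {
        inference: sorted(evidence & removed)
        for inference, evidence in supporting.items()
        if evidence & removed
    }
    return {"added": sorted(added), "broken_inferences": broken}


if __name__ == "__main__":
    # Hypothetical snapshots using the placeholder names from the GAF slides.
    previous = {("mouse_gene-1", "GO:1234457"), ("cerevisiae_gene1", "GO:1234457")}
    current = {("mouse_gene-1", "GO:1234457"), ("Dicty_gene_A", "GO:7654321")}
    supporting = {("PANTHERNODE1", "GO:1234457"): set(previous)}
    print(maintenance_alerts(supporting, previous, current))
```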
Setting annotation priorities (table repeated from the earlier slide)
Panther update: tree topology changed
Examples
MutS superfamily (PTHR11361)
GO terms: mismatched DNA binding; ATPase activity
MutS2 (bacteria, archaea, plants)
MutS (bacteria)
MSH1 (eukaryotes, not animals)
MSH2 (eukaryotes)
MSH3 (eukaryotes)
MSH4 (eukaryotes)
MSH5 (eukaryotes)
MSH6 (eukaryotes)
Propagate both terms to all proteins in the superfamily
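
Propagating terms from the superfamily root to every member amounts to a walk down the tree. A minimal sketch follows, assuming a flat placeholder topology (root with the eight subfamilies as direct children). The real PTHR11361 tree is nested and is not reproduced here, and the Node class is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    terms: set = field(default_factory=set)


def propagate(node, inherited=frozenset()):
    """Add every term annotated on an ancestor to all nodes beneath it."""
    node.terms |= inherited
    for child in node.children:
        propagate(child, frozenset(node.terms))


SUBFAMILIES = ["MutS2", "MutS", "MSH1", "MSH2", "MSH3", "MSH4", "MSH5", "MSH6"]

# Flat placeholder topology; both terms are annotated on the root only.
root = Node(
    "PTHR11361",
    [Node(name) for name in SUBFAMILIES],
    {"mismatched DNA binding", "ATPase activity"},
)
propagate(root)

if __name__ == "__main__":
    for child in root.children:
        print(child.name, sorted(child.terms))
```

The same walk works unchanged on a nested tree, where a term annotated on an internal node reaches only the proteins below that node.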
Phosphohexomutase Family (PTHR22573): Phosphoglucomutase activity
[Tree figure. Clade labels: PGM5, PGM1. Other labels: Active site, IEA.]
Sequences shown in the tree: Dm:FBgn0003076, Hs:A6N1Q7, Hs:Q5JTY7, Hs:Q15124, Mm:1925668, Rn:1307969, Rn:1584213, Hs:P36871, Mm:97565, Hs:XP_001126256, Sc:PGM1, Sc:PGM2, Hs:XP_001130346, Ec:P36938, Ec:P31120, Mm:1918224, Mm:97564, Sc:PGM3
Phosphohexomutase Family (PTHR22573): Phosphoglucomutase activity, active site substitutions (one per slide):
Active site N->C/S/L
Active site G->A/T
Active site Y->F/I