PPT - Bioinformatics Research Group at SRI International

Natalia Roberts
University of Delaware
Pathways Tools Workshop
October 28, 2010
Outline
 Introduction to the PRotein Ontology
 Consortium
 Framework
 Curation
 PRO Website
 PRO Entry
 Search and Browse
 Annotation
PRO Consortium
 Initially PRO to address proteins
 Extended to protein complexes
Cathy Wu
Barry Smith
Peter D’Eustachio
Carol Bult
Judith Blake
PRO within OBO Foundry
 Ontology for semantic integration of heterogeneous biological data
 OBO Foundry establishes rules and best practices to create a suite of
orthogonal interoperable reference ontologies
[Figure: OBO Foundry ontology coverage, with PRO at the molecule level]
PRotein Ontology (PRO)
Ontology to formally represent proteins and protein complexes
Ontology for Protein Evolution (ProEvo)
 captures the protein classes reflecting evolutionary relationships at full-length protein levels.
Ontology for Protein Forms (ProForm)
 captures the different protein forms of a specific gene arising from genetic variations, alternative splicing, cleavage, post-translational modifications.
 Example forms of Protein A: Protein A unmodified, Protein A phospho 1, Protein A phospho 2, Protein A cleaved
Ontology for protein complexes (ProComp)
 formally defines the protein complexes in terms of the specific components.
 Example: Complex Z has_part Protein A phospho 1; Complex Z has_part Protein B
Why PRO?
 Allows specification of relationships between PRO and other ontologies, such as GO and Disease Ontology
 Provides a structure to support formal, computer-based inferences based on shared attributes among homologous proteins
 Provides a stable unique identifier for any protein type
 Provides formalization and precise annotation of specific protein forms/classes, allowing accurate and consistent data mapping, integration and analysis
ProEvo
Ontology for Protein Evolution (PROEvo)
captures the protein classes reflecting evolutionary relationships at full-length protein levels.
[Figure: Smad family tree with gene duplication and speciation events. Domains shown: MH1 domain (PF03165) and MH2 domain (PF03166). Branches: BMP (SMAD1, SMAD5, SMAD9), TGFB (SMAD2, SMAD3), I-Smads (SMAD6, SMAD7), Co-Smad (SMAD4). Leaves are labeled h and m.]
Family: a PRO term at this level refers to proteins that can be traced back to a common ancestor over the entire length of the protein.
Gene: a PRO term at this level refers to the protein products of a distinct gene.
In the ontology...
Functional classes are not described by PRO (for now)
 Cathepsins: proteases, most of which become activated at the low pH found in lysosomes.
 Crystallins: constituents of the eye lens.
 Heat shock proteins: a class of functionally related proteins whose expression is increased when cells are exposed to elevated temperatures or other stress.
 Protein kinases: enzymes that modify other proteins by chemically adding phosphate groups.
ProForm
The Need for Representation of Various Protein Forms
Gene | Protein Form | Distinctive Attributes
SMAD2 | Long isoform phosphorylated (PRO:000000468) | NOT has_function GO:0003677 DNA binding
SMAD2 | Short isoform phosphorylated (PRO:000000469) | has_function GO:0003677 DNA binding
CUL1 | Unmodified form (PRO:000002507) | NOT part_of GO:0019005 SCF ubiquitin ligase complex
CUL1 | Neddylated form (PRO:000000542) | part_of GO:0019005 SCF ubiquitin ligase complex
CD14 | Membrane form (PRO:000002149) | located_in GO:0005886 plasma membrane
CD14 | Soluble form (PRO:000002147) | located_in GO:0005615 extracellular space
ROCK1 | Full length (PRO:000002529) | has_function GO:0004674 protein serine/threonine kinase activity
ROCK1 | Cleaved form (PRO:000000563) | Increased has_function GO:0004674 protein serine/threonine kinase activity
CREBBP | Variant R -> P (1378) (PRO:000000266) | agent_in MIM:180849, RUBINSTEIN-TAYBI SYNDROME; SO:1000118, loss_of_function_of_polypeptide
Attribute types illustrated: Function, Association, Localization, Modification, Disease
[Figure: PROEvo and PROForm hierarchy for the Smad family. PROEvo: Smad, with BMP (SMAD1, SMAD5, SMAD9), TGFB (SMAD2, SMAD3), I-Smads and Co-Smad. PROForm: Isoform 1 and Isoform 2, each with an unmodified form and modified (PTM/cleaved) forms.]
PROForm captures the different protein forms of a specific gene arising from genetic variations, alternative splicing, cleavage, post-translational modifications.
Sequence: a PRO term at this level refers to the protein products with a distinct sequence upon initial translation.
Modification: a PRO term at this level refers to the protein products with some change that occurs after initial translation.
In the ontology
ProComp
 Complexes that differ in subunit composition (within or in different species). E.g. mCD14/LPS (membrane CD14) vs sCD14/LPS (soluble CD14)
 Complexes where subunits are post-translationally modified. E.g. phosphorylated IRF3 dimer
Figure adapted from http://www.reactome.org/cgibin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=166016&
In the ontology
TGF-beta signaling – comparison between PID and Reactome
[Figure: TGF-beta signaling pathway drawn from extracellular growth and stress signals through the cytoplasm to the nucleus (DNA binding and transcription regulation). Components shown include LAP, TGF-b, Furin, Ca2+, TGF-beta receptor I and II, Smad 2, Smad 4, Shc, XIAP, TAK1, CaM, Ski, MEKK1, ERK1/2, MAPKKK, the p38 MAPK pathway, the JNK cascade and degradation. Protein forms are labeled with PRO IDs: PRO:000000397, PRO:000000616, PRO:000000410, PRO:000000618, PRO:000000523, PRO:000000468, PRO:000000481, PRO:000000366, PRO:000000650, PRO:000000651, PRO:000000652.]
Legend:
 S P, T P, Y P: Phosphorylation (P) at Serine (S), Threonine (T), Tyrosine (Y)
 K U: Ubiquitination (U) at Lysine (K)
 Components are marked as common in both Reactome & PID or as only included in Reactome. All others are in PID.
 Not all components in the pathway from both databases are listed.
Framework
Categories/Levels of Distinction
 Family: a PRO term at this level refers to proteins
that can be traced back to a common ancestor over the
entire length of the protein.
 Gene: a PRO term at this level refers to the protein
products of a distinct gene.
 Sequence: a PRO term at this level refers to the
protein products with a distinct sequence upon
initial translation.
 Modification: a PRO term at this level refers to the
protein products with some change that occurs
after initial translation.
Categories in ProEvo & ProForm
[Figure: the Smad hierarchy annotated with category levels. Levels shown: Protein, Family, Gene product, Sequence, Modification, plus Organism-gene, Organism-sequence and Organism-modification. PROEvo covers Smad, BMP (SMAD1, SMAD5, SMAD9), TGFB (SMAD2, SMAD3), I-Smads and Co-Smad; PROForm covers Isoform 1 and Isoform 2 with their unmodified and modified (PTM/cleaved) forms.]
Distribution files
 pro.obo: Protein ontology framework (ProEvo, ProForm, ProComp)
 PAF.txt: annotation
Some concepts
 Ortho-isoform: isoforms encoded by orthologous genes that are believed to have arisen prior to speciation and divergence of the primary sequence.
Example: PRO:000000048 TGF-beta receptor type-2 isoform 1 (also known as Isoform RII-1)
 Ortho-modified form: forms with post-translational modifications on equivalent residues in ortho-isoforms.
Example: PRO:000000615 GTP-binding protein RhoA isoform 1 prenylated 1
A GTP-binding protein RhoA isoform 1 prenylated form where a geranylgeranyl moiety has been added to the Cys residue within the C-terminal Cxxx motif. Example: UniProtKB:P61586-1, has_modification MOD:00113 S-geranylgeranyl-L-cysteine, Cys-190. [PMID:16773203, PRO:CNA]
What is curated in the ontology?
 Name: taken from the data source, following established naming guidelines.
 Synonyms: imported from the data source; others are added by request and during curation.
 Definitions: we try to create standard definitions of the form "A is a B that C’s".
Both text and logical definitions are created when possible
PRO:000025762 serine palmitoyltransferase complex A (mouse)
def: "A serine palmitoyltransferase complex that is heterotrimeric and whose components are encoded in the genome of mouse." [PRO:CJB]
is_a: GO:0017059 ! serine C-palmitoyltransferase complex
relationship: has_part PRO:000025361 {cardinality="1"} ! serine palmitoyltransferase 1 (mouse)
relationship: has_part PRO:000025362 {cardinality="1"} ! serine palmitoyltransferase 2 (mouse)
relationship: has_part PRO:000025363 {cardinality="1"} ! serine palmitoyltransferase 3 (mouse)
relationship: only_in_taxon taxon:10090 ! Mus musculus
 Cross-ref: when there is a database object that corresponds to the term.
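The logical definition of the complex above is carried by its relationship lines. As a minimal sketch (not the PRO team's tooling), the snippet below pulls the has_part components and their cardinalities out of that stanza with a regular expression. The function name parse_relationships and the regex are illustrative assumptions; the stanza text is copied from the slide.

```python
import re

# Stanza lines copied from the slide (text definition omitted for brevity).
STANZA = """\
id: PRO:000025762
name: serine palmitoyltransferase complex A (mouse)
is_a: GO:0017059 ! serine C-palmitoyltransferase complex
relationship: has_part PRO:000025361 {cardinality="1"} ! serine palmitoyltransferase 1 (mouse)
relationship: has_part PRO:000025362 {cardinality="1"} ! serine palmitoyltransferase 2 (mouse)
relationship: has_part PRO:000025363 {cardinality="1"} ! serine palmitoyltransferase 3 (mouse)
relationship: only_in_taxon taxon:10090 ! Mus musculus
"""

# Hypothetical pattern: relation, target ID, optional cardinality, optional label.
REL = re.compile(
    r'^relationship:\s+(?P<rel>\S+)\s+(?P<target>\S+)'
    r'(?:\s+\{cardinality="(?P<card>\d+)"\})?'
    r'(?:\s+!\s+(?P<label>.*))?$'
)


def parse_relationships(stanza):
    """Return (relation, target_id, cardinality_or_None, label) tuples."""
    rels = []
    for line in stanza.splitlines():
        m = REL.match(line.strip())
        if m:
            card = m.group("card")
            rels.append((m.group("rel"), m.group("target"),
                         int(card) if card else None, m.group("label")))
    return rels


relationships = parse_relationships(STANZA)
# Keep only the subunits of the complex.
components = [target for rel, target, _, _ in relationships if rel == "has_part"]
print(components)
```

A dedicated OBO parser would be the safer choice for a full pro.obo file; the regex here only covers the line shapes shown on the slide.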
Example of how we define modified forms (ProForm Curation)
[Figure: Feature and Reference sections of the UniProtKB record for human smad2, showing phosphorylation at Ser-465/Ser-467 by TGF-beta receptor and at Ser-240/Ser-465/Ser-467 through Ca++-mediated signaling.]
[Term]
id: PRO:000000650
name: smad2 isoform 1 phosphorylated 1
def: "A smad2 isoform 1 phosphorylated form that has been phosphorylated in the last two Ser residues within the SSxS C-terminal motif by TGF-beta pathway activation." [PMID:8980228, PMID:9346966]
comment: Category=modification.
synonym: "TGF-beta receptor-activated smad2" RELATED []
is_a: PRO:000000574 ! smad2 isoform 1 phosphorylated form
[Term]
id: PRO:000000652
name: smad2 isoform 1 phosphorylated 3
def: "A smad2 isoform 1 phosphorylated form that has been phosphorylated at a [S/T] residue within the MH1-MH2 domain linker region in response to decorin-induced Ca(2+) signaling. This form is also phosphorylated in the last two Ser residues within the SSxS C-terminal motif." [PMID:11027280]
comment: Category=modification.
is_a: PRO:000000574 ! smad2 isoform 1 phosphorylated form
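The stanzas above store the PRO category in the comment tag and the parent in is_a. A minimal sketch of reading both, assuming the "Category=..." comment convention seen on these slides holds for the term in question; parse_term, category and parent_ids are hypothetical helper names.

```python
from collections import defaultdict

# Tags copied from the PRO:000000652 stanza above (definition omitted).
TERM = """\
[Term]
id: PRO:000000652
name: smad2 isoform 1 phosphorylated 3
comment: Category=modification.
is_a: PRO:000000574 ! smad2 isoform 1 phosphorylated form
"""


def parse_term(stanza):
    """Collect tag -> list of values; a tag such as is_a may repeat."""
    tags = defaultdict(list)
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] marker
        tag, _, value = line.partition(": ")
        tags[tag].append(value)
    return dict(tags)


def category(term):
    """Return the value after 'Category=' in the comment tag, or None."""
    for comment in term.get("comment", []):
        if comment.startswith("Category="):
            return comment[len("Category="):].rstrip(".")
    return None


def parent_ids(term):
    """Return the IDs in is_a lines, dropping the '! label' part."""
    return [value.split(" ! ")[0] for value in term.get("is_a", [])]


term = parse_term(TERM)
print(category(term), parent_ids(term))
```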
What is Annotated?
 Domain, especially ProEvo level:
 GO terms
 PSI-MOD terms for protein modifications
 SO for sequence variants
 MIM for sequence variants
PRO homepage
http://pir.georgetown.edu/pro/pro.shtml
PRO distribution files
ftp://ftp.pir.georgetown.edu/databases/ontology/pro_obo/
pro.obo
Ontology in OBO format
format-version: 1.2
date: 02:04:2009 14:51
saved-by: arighic
auto-generated-by: OBO-Edit 1.101
default-namespace: pro
remark: release: 5.0, version 1
[Term]
id: PRO:000000001
name: protein
def: "A biological macromolecule that is composed of amino acids linked in a linear sequence (a polypeptide chain) and is genetically encoded. Proteins descended from a common ancestor can be classified into families and superfamilies composed of products of evolutionarily-related genes. The domain architecture of a protein is described by the order of its constituent domains. Proteins with the same domains in the same order are defined as homeomorphic." [PRO:WCB]
[Term]
id: PRO:000000002
name: E3 ubiquitin ligase SCF complex, Skp1 subunit
def: "A protein with a core domain composition consisting of an N-terminal Skp1 family, tetramerisation domain (PF03931) followed by a Skp1 family, dimerization domain (PF01466). Skp1 proteins bind several F-box-containing proteins, and are involved in the ubiquitin protein degradation pathway." [PRO:CNA]
comment: Category=family.
xref: PIRSF:PIRSF028729
is_a: PRO:000000001 ! protein
[Term]
id: PRO:000000003
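The header of pro.obo is a list of tag: value lines that ends at the first [Term] stanza. As a rough sketch, the snippet below reads the header shown on the slide and parses its date, which in OBO format 1.2 is written as dd:MM:yyyy HH:mm. The helper name parse_header is an assumption for illustration.

```python
from datetime import datetime

# Header lines copied from the slide.
HEADER = """\
format-version: 1.2
date: 02:04:2009 14:51
saved-by: arighic
auto-generated-by: OBO-Edit 1.101
default-namespace: pro
remark: release: 5.0, version 1
"""


def parse_header(text):
    """Read tag: value pairs until the first stanza marker such as [Term]."""
    header = {}
    for line in text.splitlines():
        if line.startswith("["):
            break
        tag, sep, value = line.partition(": ")
        if sep:
            header[tag] = value.strip()
    return header


header = parse_header(HEADER)
# OBO 1.2 dates are day:month:year, so this is 2 April 2009.
saved = datetime.strptime(header["date"], "%d:%m:%Y %H:%M")
print(header["remark"], saved.date())
```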
PRO distribution files
PAF.txt: PRO association file (tab delimited, similar to GAF)
Guidelines: ftp://ftp.pir.georgetown.edu/databases/ontology/pro_obo/PAF_guidelines.pdf
Column | Column Title | Description
1 | PRO_ID | PRO identifier, mandatory
2 | Object_term | Name of the PRO term
3 | Object_synonym | Other names by which the described object is known
4 | Modifier | Flags that modify the interpretation of an annotation
5 | Relation | Relation to the corresponding annotation.
6 | Ontology_ID | ID for the corresponding annotation.
7 | Ontology_term | Term name for the corresponding ontology ID.
8 | Relative_to | Modifiers increased, decreased and altered require an entry in this column to indicate what the change is relative to.
9 | Interaction_with | To indicate binding partner.
10 | Evidence_source | PubMed ID or database source for the evidence.
11 | Evidence_code | Same as evidence code for GO annotations
12 | Taxon | Taxon identifier for the species that the annotation is extracted from.
13 | Inferred_from | Use only for evidence code: IPI and ISS for PRO.
14 | DB_ID | One or more unique identifiers for a single source cited as an authority for the attribution of the ontology term.
15 | Protein_region | To indicate part of the protein sequence.
16 | Modified_residue(s), MOD_ID | To indicate the residue(s) that has a post-translational modification and the type of modification.
17 | Date | Date on which the annotation was made.
18 | Assigned_by | The database which made the annotation.
19 | Equivalent forms | List the equivalent form in other organisms.
20 | Comments | Curator comments, free text.
Modifier values: NOT, contributes_to, increased, decreased, altered
Relation values: part_of, located_in, has_part, has_agent, has_function, participates_in, agent_in, has_modification
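Because the PAF is tab delimited, each row can be read into a record keyed by column title. The sketch below assumes the 20 titles appear in the order listed on this slide (worth checking against PAF_guidelines.pdf before relying on it) and uses values from the c-myc example row elsewhere in this deck. read_paf is a hypothetical helper, not part of any PRO distribution.

```python
import csv
import io

# Column titles in the order listed on the slide (assumed to match PAF.txt).
PAF_COLUMNS = [
    "PRO_ID", "Object_term", "Object_synonym", "Modifier", "Relation",
    "Ontology_ID", "Ontology_term", "Relative_to", "Interaction_with",
    "Evidence_source", "Evidence_code", "Taxon", "Inferred_from", "DB_ID",
    "Protein_region", "Modified_residue(s), MOD_ID", "Date", "Assigned_by",
    "Equivalent forms", "Comments",
]


def read_paf(handle):
    """Yield one dict per annotation row, keyed by column title."""
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank lines and anything that is not an annotation row,
        # for example a header line.
        if not fields or not fields[0].startswith("PRO:"):
            continue
        # Pad short rows so every column title is present.
        fields = fields + [""] * (len(PAF_COLUMNS) - len(fields))
        yield dict(zip(PAF_COLUMNS, fields))


# One row built from the c-myc example; columns without a value stay empty.
SAMPLE = "\t".join([
    "PRO:000000536", "c-myc isoform 1 glycosylated 1", "", "", "has_function",
    "GO:0003700", "transcription factor activity", "", "", "PMID:11904304",
    "IDA", "TaxID:9606", "", "UniProtKB:P01106-1", "", "Thr-58, MOD:00806",
    "", "", "", "",
]) + "\n"

rows = list(read_paf(io.StringIO(SAMPLE)))
print(rows[0]["Relation"], rows[0]["Ontology_ID"])
```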
PRO distribution files
PAF.txt: PRO association file (tab delimited, similar to GAF)
Guidelines: ftp://ftp.pir.georgetown.edu/databases/ontology/pro_obo/PAF_guidelines.pdf
PRO_ID | Object_term | Modifier | Relation | Ontology_ID | Ontology_term | Evidence_source | Evidence_code | Taxon | DB_ID | Modified_residue, MOD_ID
PRO:000000536 | c-myc isoform 1 glycosylated 1 | | has_function | GO:0003700 | transcription factor activity | PMID:11904304 | IDA | TaxID:9606 | UniProtKB:P01106-1 | Thr-58, MOD:00806
PRO:000000536 | c-myc isoform 1 glycosylated 1 | | participates_in | GO:0006357 | regulation of transcription from RNA polymerase II promoter | PMID:11904304 | IDA | TaxID:9606 | UniProtKB:P01106-1 | Thr-58, MOD:00806
PRO:000000536 | c-myc isoform 1 glycosylated 1 | | has_modification | MOD:00806 | O-(N-acetylaminoglucosyl)-L-threonine | PMID:11904304 | | TaxID:9606 | UniProtKB:P01106-1 | Thr-58, MOD:00806
PRO:000000538 | c-myc isoform 1 phosphorylated 2 | | has_function | GO:0003700 | transcription factor activity | PMID:7623799 | EXP | TaxID:9606 | UniProtKB:P01106-1 | Ser-62, MOD:00046|Thr-58, MOD:00047
PRO:000000538 | c-myc isoform 1 phosphorylated 2 | | located_in | GO:0005634 | nucleus | PMID:14563837|PMID:15503302 | IDA | TaxID:10090 | UniProtKB:P01108-1 | Ser-62, MOD:00046|Thr-58, MOD:00047
PRO:000000538 | c-myc isoform 1 phosphorylated 2 | | has_modification | MOD:00046 | O-phospho-L-serine | PMID:14563837 | | TaxID:10090 | UniProtKB:P01108-1 | Ser-62, MOD:00046|Thr-58, MOD:00047
PRO:000000538 | c-myc isoform 1 phosphorylated 2 | | has_modification | MOD:00047 | O-phospho-L-threonine | PMID:14563837 | | TaxID:10090 | UniProtKB:P01108-1 | Ser-62, MOD:00046|Thr-58, MOD:00047
The slide also shows the modifier NOT on one annotation; its row cannot be recovered from this transcript.
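In the example rows, the "Modified_residue, MOD_ID" field packs several sites into one value, separating entries with "|" and pairing each residue with its PSI-MOD ID after a comma. A minimal sketch of splitting that field follows, assuming every entry has the "Ser-62, MOD:00046" shape seen here; the names ModifiedResidue and parse_modified_residues are illustrative.

```python
import re
from typing import NamedTuple


class ModifiedResidue(NamedTuple):
    amino_acid: str  # three-letter code, e.g. "Ser"
    position: int    # residue number in the reference sequence
    mod_id: str      # PSI-MOD identifier, e.g. "MOD:00046"


# Assumed entry shape, based only on the examples in this deck.
ENTRY = re.compile(r"^([A-Z][a-z]{2})-(\d+),\s*(MOD:\d{5})$")


def parse_modified_residues(field):
    """Split 'Ser-62, MOD:00046|Thr-58, MOD:00047' into structured entries."""
    residues = []
    for entry in field.split("|"):
        m = ENTRY.match(entry.strip())
        if m is None:
            raise ValueError(f"unrecognized entry: {entry!r}")
        residues.append(ModifiedResidue(m.group(1), int(m.group(2)), m.group(3)))
    return residues


sites = parse_modified_residues("Ser-62, MOD:00046|Thr-58, MOD:00047")
for site in sites:
    print(site.amino_acid, site.position, site.mod_id)
```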
PRO scope and statistics
 Current scope: human, mouse and E. coli proteins.
 Comprehensive coverage for gene level and general modified forms
In PRO Release 13.0, version 0:
There are 25700 PRO terms
Category | # terms
family | 281
gene | 18184
organism-gene | 128
sequence | 1153
modification | 5662
complex | 13
Annotation
 Curated papers: 929
 Annotation to GO terms: 2006
 Annotation to MOD terms: 361
PRO Entry
http://purl.obolibrary.org/obo/PRO_000000563
1-Ontology
2-Features
3-Mapping
4-Annotation
Browse
1-Number of terms
2-Sorting
3-Information tabs
4-Add text to find
Browse (cont)
 ‘Find’ is an exact text match to a term
 Add text to search
 Use the information tabs to display information of interest
Quick Browse
Quick Browse lets you browse terms related to
a given theme:
-Terms for proteins with a given modification
-Terms for saliva biomarkers
-Terms with link to a given database
-Terms with orthoisoforms
Quick Browse:Orthoisoforms
Quick Browse:Saliva biomarkers
PRO Advanced Search
•Boolean searches: AND, OR, NOT
•Null/not null search
Examples for search fields in:
http://pir.georgetown.edu/pro/searchPRO.pdf
Example: all terms derived from a given gene
 Link to family database.
 Proteins in this class are from vertebrates
Example: search for PRO terms that are modified forms and are annotated with GO term nucleus
http://pir.georgetown.edu/cgi-bin/pro/textsearch_pro
 Save results as a tab-delimited file
 Show the selected terms in the hierarchy
 Indicate level in the hierarchy
 Use Display Option to add/remove columns; click Apply to see the new column(s)
 Annotation column added
Upcoming Functionalities:
 Batch retrieval
 ID mapping
PRO ID | MGI ID
PRO:000000124 | 1927343
PRO:000000123 | 1270849
PRO:000000121 | 107926
Annotation
Annotation tool for PRO: RACE-PRO
 Obtain a PRO ID for the protein objects of interest
 Define a protein object (based on literature, experimental data)
 Add annotation to that protein object
1 protein form = 1 RACE-PRO
How does it work?
 Input your personal information (only for internal use)
 Complete the form with sequence information and annotation
 Submit when ready (otherwise you can save for later)
 The PRO curation team will take the data, revise it, and create
the corresponding PRO node in the ontology
 Users will be informed by email about the new PRO IDs
and when they will be public
Annotation tool for PRO: RACE-PRO
1) Fill in your personal information. This allows saving your data as well as future communication.
2) Define the protein object. This allows retrieving or pasting a sequence, defining a subsequence, and/or a post-translational modification.
 Insert UniProtKB accessions (including isoforms) and Retrieve, e.g. O75475-2
 OR paste a sequence:
MTRDFKPGDLIFAKMKGYPHWPARVDEVPDGAVKPPTNKLPIFFFGTHETAFLGPKDIFP
YSENKEKYGKPNKRKGFNEGLWEIDNNPKVKFSSQQAATKQSNASSDVEVEEKETSV
SKETDHEEKASNEDVTKAVDITTPKAARRGRKRKAEKQVETEEAGVVTTATASVNLKV
 Select a region (e.g. if a cleaved product)
 Select PTMs: use numbering in reference to the sequence displayed in box 1
 Use this section to indicate names assigned to this protein form as indicated in the source
Modifiers: used to modify a relation between a PRO term and
another term. It includes the GO qualifiers NOT, contributes_to
plus increased, decreased, and altered (to be used with the
relative to column).
Relation to the specific annotation. For some database/ontology
there is a single relation and that is displayed
Add ID for the specific database/ontology.
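The modifier rule described above (increased, decreased and altered are used together with the relative to column) is easy to check before submitting. The sketch below is one possible check, not RACE-PRO's own validation; check_modifier is a hypothetical helper, and using the full-length form PRO:000002529 as the reference for the cleaved ROCK1 form is an assumption for illustration.

```python
# Modifier values named on the slides.
ALLOWED_MODIFIERS = {"NOT", "contributes_to", "increased", "decreased", "altered"}
# Modifiers that need a reference form in the relative to column.
NEEDS_RELATIVE_TO = {"increased", "decreased", "altered"}


def check_modifier(modifier, relative_to):
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    if modifier and modifier not in ALLOWED_MODIFIERS:
        problems.append(f"unknown modifier: {modifier}")
    if modifier in NEEDS_RELATIVE_TO and not relative_to:
        problems.append(f"modifier '{modifier}' requires a relative to entry")
    return problems


# Assumed example: increased kinase activity of a cleaved form,
# stated relative to the full-length form.
print(check_modifier("increased", "PRO:000002529"))  # no problems
print(check_modifier("increased", ""))               # missing relative to
```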
 Save means you are not finished. Use the reference number to come back and finish.
 Submit means you are done; your entry will be reviewed.
What to expect when you are done?
 You will receive an email with the ref number in the subject when your entry is under review
 A PRO curator will be assigned to review your entry and create the corresponding PRO node.
 You will receive an email with the PRO ID and the terms for your final check
How to link to PRO

To link to a PRO entry, please use the persistent URL
http://purl.obolibrary.org/obo/PRO_xxxxxxxxx
where PRO_xxxxxxxxx is the corresponding PRO ID with an
underscore (_) instead of a colon (:).
Example: link to PRO:000000447 would be
http://purl.obolibrary.org/obo/PRO_000000447.
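The linking rule above is a simple string substitution. A minimal sketch, assuming PRO IDs always have the nine-digit form used throughout this deck; pro_purl is a hypothetical helper name.

```python
import re

PURL_BASE = "http://purl.obolibrary.org/obo/"
# Assumed ID shape: "PRO:" followed by nine digits, as in PRO:000000447.
PRO_ID = re.compile(r"^PRO:(\d{9})$")


def pro_purl(pro_id):
    """Build the persistent URL by swapping the colon for an underscore."""
    m = PRO_ID.match(pro_id.strip())
    if m is None:
        raise ValueError(f"not a PRO ID: {pro_id!r}")
    return f"{PURL_BASE}PRO_{m.group(1)}"


print(pro_purl("PRO:000000447"))
```

Validating the ID first avoids building links from mistyped or foreign identifiers such as GO or MOD IDs.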
Some ongoing collaborations
Ontology Providers
 Dendritic Cell Ontology (DC_CL)
Semantic Resources
 Royal Society of Chemistry (RSC)
 Science Collaboration Framework
 Semantic Web Applications in Neuromedicine (SWAN)
Process-Modeling Resources
 Reactome, MouseCyc
 EcoCyc
 Pathway Logic
Molecule-Modeling Resources
 Int’l Union of Basic and Clinical Pharmacology (IUPhar)
PRO Consortium Team (so far…)
 Principal Investigators
Cathy Wu (UD/GU-PIR)
Judith Blake (Jackson Lab-MGI)
Carol Bult (Jackson Lab-MGI)
Barry Smith (SUNY Buffalo-NCBO)
Peter D’Eustachio (NYU/Reactome)
 Co-Investigators, Curators and Developers
Darren Natale (GU-PIR)
Harold Drabkin (Jackson Lab-MGI)
Alexei Evsikov (Jackson Lab-MGI)
Michael Caudy (NYU)
Cecilia Arighi (UD-PIR)
Jian Zhang (GU-PIR)
Hongzhan Huang (UD-PIR)
Natalia Roberts (UD-PIR)
Jules Nchoutmboube
Funded by NIGMS
Questions?