Transcript Slide 1

Protein Information Resource
(PIR) for Functional Annotation:
Protein Family Classification, Literature
Mining and Protein Ontology
In-Silico Analysis of Proteins
Celebrating the 20th anniversary of Swiss-Prot
Fortaleza, Brazil
August 4, 2006
Cathy H. Wu, Ph.D.
Director, Protein Information Resource
Professor, Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Wu CH, Zhao S, Chen HL. (1996)
A protein class database organized with PROSITE
protein groups and PIR superfamilies.
Journal of Computational Biology, 3 (4), 547-562.
Protein Information Resource (PIR)
Integrated Protein Informatics Resource for Genomic/Proteomic Research




http://pir.georgetown.edu

UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
PIRSF Family Classification System: Protein Classification and Functional Annotation
iProClass Integrated Protein Database: Data Integration and Protein Mapping
iProLINK Literature Mining Resource: Annotation Extraction
Other Projects: NIAID Proteomics, caBIG Grid-Enablement
PIR Protein Sequence Database




The PIR-International Protein Sequence
Database (PIR-PSD) grew out of the
Atlas of Protein Sequence and Structure
(1965-1978), Vol 1-5, Suppl 1-3.
Margaret Dayhoff collected all the known
protein sequences to study protein
evolution.
The first Atlas contained 65 proteins, the
final volume had 1081 proteins.
The PIR-PSD was produced from 1984 (Release 1, 2900 proteins) to 2004 (Release 80, 283,416 proteins).
PIR-PSD has been integrated with UniProt since 2002.
[Chart: Number of Sequences (0 to 300,000) by PIR-PSD Release Number (1 to 76); annotation: Joined UniProt (Jan 2002)]
UniProt Activities at PIR




Integration of PIR-PSD into UniProtKB
• Incorporation of unique PIR entries
• Incorporation of PIR annotations: references, experimental features with literature evidence tag
Functional annotation of UniProtKB proteins
• Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins
• Development of rule-based annotation system & PIRNR (name rule) / PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site feature)
Production of UniRef100/90/50 databases
Creation of UniProt web site and help system => Unified UniProt
web site & user community interaction
PIRSF Classification System
Protein Classification and Functional Annotation




PIRSF: Evolutionary relationships of proteins from super- to sub-families
Curated families with name rules and site rules
Curation platform with classification/visualization tools
Dissemination: UniProtKB annotations, InterPro
families, PIRSF reports, PIRSF curation platform
Domain Superfamily
• One common Pfam
domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and
common domain architecture
PIRSF Homeomorphic
Subfamily
• 0 or more levels
• Functional specialization
PIRSF003033: Ku70 autoantigen
PF02735: Ku70/Ku80 beta-
barrel domain
PIRSF800001: Ku70/80 autoantigen
PIRSF016570: Ku80 autoantigen
PIRSF006493: Ku, prokaryotic type
PIRSF500001: IGFBP-1
PF00219: Insulin-like growth
factor binding protein
(IGFBP)
PIRSF001969: IGFBP
PIRSF018239: IGFBP-related protein, MAC25 type
…
PIRSF500006: IGFBP-6
PIRSF017318: CM of AroQ class, eukaryotic type
PIRSF001501: CM of AroQ class, prokaryotic type
iProClass Integrated Protein Database
Data Integration and Protein Mapping




Data integration from >90 databases
Underlying data warehouse for protein ID/name/bibliography mapping &
pre-computed BLAST results
Integration of protein family, function, structure for functional annotation
Rich link (link + summary) for value-added reports of UniProt proteins
iProClass Integrated Protein Knowledgebase links to:
Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
Family: PIRSF, InterPro, Pfam, Prosite, COG, …
Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
Protein Expression: Swiss-2DPAGE, PMG, …
Disease/Variation: OMIM, HapMap, …
Modification: RESID, PhosphoBase, …
Interaction: DIP, BIND, …
Ontology: GO, …
Taxonomy: NCBI Taxon, NEWT
Literature: PubMed
Other labels: NCBI X-Refs, Additional Refs, Gene Ontology, EC, KEGG Pathway, Structure Homolog, PTM
iProLINK Text Mining Resource
Annotation Extraction and Literature-Based Protein Annotation



Curated datasets and literature corpus for development of literature mining
and annotation extraction tools
RLIMS-P text-mining tool for extracting protein phosphorylation data
BioThesaurus of gene/protein names to resolve synonym and ambiguity
iProLINK: integrated Protein Literature, INformation and Knowledge
http://pir.georgetown.edu/iprolink
Literature Mining & Protein Curation
NLP Text Mining Research: Bibliography Mapping, Text Categorization, Annotation Extraction, Named Entity Recognition
Literature-Based Curation
Bibliography Display
• Mapping of PubMed IDs to Proteins
• Papers Categorized by Annotations
Literature Corpus
• Mapping to Proteins/Features
• Annotation-Tagged
• Name-Tagged
Dictionary and Ontology
• Protein Names and Synonyms
• PIRSF Family Names in DAG
Guidelines
• Protein/Family Naming Guidelines
• Name Tagging Guidelines
Bibliography: PubMed
Databases: UniProt, PIRSF, iProClass, GO
NIAID Biodefense Proteomic Program

Goals




Characterize proteomes of pathogens and host cells
Identify proteins associated with the biology of the microbes
Elucidate mechanisms of microbial pathogenesis
Understand immune responses and non-immune mediated host responses
[Diagram: Data Integration at NIAID Admin Center]
Multiple Data Types from Proteomics Research Centers (PRC)
Data Exchange Format, Controlled Vocabulary, Ontology
Protein ID and Peptide/Protein Sequence Mapping (PIRSF, iProClass, UniProt)
Master Protein Directory & Complete Proteomes at GU-PIR
Integrated Data at VBI
Diagram labels: Adm Ctr, PRC, Data Type, Organism
http://pir.georgetown.edu/proteomics/
Rich annotation - capture experimental data and scientific
conclusion; integrate with major databases
NCI caBIG Initiative

caBIG (cancer Biomedical Informatics Grid)

Cancer research platform to enable sharing of research infrastructure, data, tools
• Designed and built by an open federation of organizations
• Based on common standards and open source/open access principles
One of four caBIG grid reference projects
• PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research
caBIG Workspaces
• Integrative Cancer Research
PIR Developer Project: Grid Enablement of PIR
PIR Adopter Project: SEED Genome Annotation
caGrid Architecture
PIR Adopter Project: GeneConnect ID mapping
• Vocabularies and Common Data Elements
PIR Participant Project: Protein models, objects, vocabularies, ontologies
UniProt Knowledgebase:
Accurate, Consistent, and Rich Annotation of
Protein Sequence and Function



Family Classification-Driven and Rule-Based Curation
• Functional inference of uncharacterized hypothetical proteins
• Systematic detection and correction of genome annotation errors
• Improvement of under- or over-annotated proteins
Text Mining-Assisted and Literature-Based Curation
• Annotation extraction from scientific literature
• Attribution of experimental evidence
Ontology and Controlled Vocabulary-Based Curation
• Standardization of protein/gene/family names and annotation terms
• Annotation of specific protein entities
PIR Superfamily Classification


Tree of Life and Evolution of
Protein Families (Dayhoff)
The protein superfamily concept
(1976) was based on sequence
similarity, where sequences were
categorized into superfamilies,
families, subfamilies, and entries
using different % identity
thresholds.
PIRSF Classification System






A network classification system from superfamily to subfamily levels to
reflect the evolutionary relationships of full-length proteins and domains
Basic unit is homeomorphic family: Full-length similarity, common domain
architecture
Provide annotation of generic biochemical and specific biological functions
Basis for evolutionary and comparative genomics research
Basis for accurate and consistent automated protein annotation (protein
name, biochemical and biological functions, functional sites)
Basis for standardization of protein names and development of ontology
for protein evolution
Domain Superfamily
• One common Pfam
domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and
common domain architecture
PIRSF Homeomorphic
Subfamily
• 0 or more levels
• Functional specialization
PIRSF003033: Ku70 autoantigen
PF02735: Ku70/Ku80 beta-
barrel domain
PIRSF800001: Ku70/80 autoantigen
PIRSF016570: Ku80 autoantigen
PIRSF006493: Ku, prokaryotic type
PIRSF500001: IGFBP-1
PF00219: Insulin-like growth
factor binding protein
(IGFBP)
PIRSF001969: IGFBP
PIRSF018239: IGFBP-related protein, MAC25 type
PIRSF017318: CM of AroQ class, eukaryotic type
PIRSF001501: CM of AroQ class, prokaryotic type
PF01817: Chorismate
mutase (CM)
…
PIRSF500006: IGFBP-6
PIRSF026640: Periplasmic CM
PIRSF001500: Bifunctional CM/PDT (P-protein)
PIRSF001499: Bifunctional CM/PDH (T-protein)
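The hierarchy in this diagram can be pictured as a small parent-to-child graph. The Python sketch below is illustrative only: the identifiers and names are copied from the slide, but the parent/child edges are an assumed reading of the flattened diagram, and `NAMES`, `CHILDREN`, and `descendants` are hypothetical names rather than part of any PIR software.

```python
from collections import deque

# Names copied from the slide.
NAMES = {
    "PF02735": "Ku70/Ku80 beta-barrel domain",
    "PIRSF800001": "Ku70/80 autoantigen",
    "PIRSF003033": "Ku70 autoantigen",
    "PIRSF016570": "Ku80 autoantigen",
    "PIRSF006493": "Ku, prokaryotic type",
}

# Assumed edges: domain superfamily -> PIRSF nodes -> narrower PIRSF nodes.
CHILDREN = {
    "PF02735": ["PIRSF800001", "PIRSF006493"],
    "PIRSF800001": ["PIRSF003033", "PIRSF016570"],
}


def descendants(node):
    """Return every node reachable from `node`, in breadth-first order."""
    seen, queue, order = set(), deque([node]), []
    while queue:
        current = queue.popleft()
        for child in CHILDREN.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


for node_id in descendants("PF02735"):
    print(node_id, NAMES[node_id])
```

A dictionary of child lists keeps the sketch short; because PIRSF is described as a network rather than a strict tree, a node could appear under more than one parent, which the `seen` set tolerates.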
PIRSF Classification/Curation Workflow
Workflow diagram labels:
• Unclassified UniProtKB proteins; Unassigned Proteins; New Proteins; Orphans
• Automatic Procedure: Automatic Clustering, Map Domains on Clusters, Automatic Placement
• Computer-assisted Manual Curation: Merge/Split Clusters, Add/Remove Members
• Uncurated Homeomorphic Clusters; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies
• Hierarchies (Superfamilies/Subfamilies); Name, Refs, Abstract, Domain Arch.; Protein Name Rules/Site Rules; Build and Test HMMs
Workflow steps:
• Computational generation of homeomorphic clusters
• Computational domain mapping and annotation of preliminary clusters
• Automatic placement of new proteins into families
• Computer-assisted expert analysis to define homeomorphic families
• Family hierarchy created as needed
• Expert annotation
• Name rules and optional site rules created
• Seed members to generate family HMMs
PIRSF Classification Tools



Iterative BlastClust Tree with Annotation Table
Multiple Alignment and Phylogenetic Tree
PIRSF Classification in DAG Editor
HPS
KGPDC
Phylogenetic Tree
Classification/Annotation
ISMB: PIRSF Protein
Classification System Demo
Alignment
PIRSF Analysis/Visualization Tools



Taxonomy Distribution and Phylogenetic Pattern
Domain Display
Family Hierarchy (DAG Browser)
PIRSF Family Report: curated family name, description of family, sequence analysis tools
Classification and Functional Annotation
Example: Phosphofructokinase (PFK) classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue.
Classification Tree (families): ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274
E. coli (P06998): Gly105, Gly125
ATP-PFK: Gly105 + Gly125
PPi-PFK: Gly/Asp105 + Lys125
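The residue pattern on this slide lends itself to a tiny decision function. The sketch below is a hedged illustration, not the PIRSF curation logic: it assumes the two residues have already been mapped to the positions equivalent to 105 and 125 (for example through an alignment to the E. coli sequence), and `predict_pfk_type` is a hypothetical name.

```python
def predict_pfk_type(residue_105, residue_125):
    """Classify a PFK by its one-letter residues at the positions
    equivalent to 105 and 125, following the pattern on the slide."""
    if residue_105 == "G" and residue_125 == "G":
        return "ATP-PFK"   # Gly105 + Gly125
    if residue_105 in ("G", "D") and residue_125 == "K":
        return "PPi-PFK"   # Gly/Asp105 + Lys125
    return "undetermined"  # pattern not covered by the slide


print(predict_pfk_type("G", "G"))
print(predict_pfk_type("D", "K"))
```

Any combination the slide does not list is reported as undetermined rather than guessed.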
Family-Based Rules for Annotation
Functional Site Rule: tags active site, binding, other residue-specific information
Functional Name Rule: gives name, EC, GO, other function-specific information
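A minimal sketch of how family-based rules might be applied, assuming a name rule fires on family membership alone while a site rule additionally checks that the expected residue is present. The `NameRule` and `SiteRule` classes, the `annotate` function, and the placeholder EC/GO values are all hypothetical; the slide does not describe the actual rule format.

```python
from dataclasses import dataclass, field


@dataclass
class NameRule:
    family_id: str
    protein_name: str
    ec: list = field(default_factory=list)  # placeholder values only
    go: list = field(default_factory=list)  # placeholder values only


@dataclass
class SiteRule:
    family_id: str
    position: int          # 1-based position in the member sequence
    expected_residue: str  # one-letter amino-acid code
    feature: str           # e.g. "active site" or "binding"


def annotate(protein, name_rules, site_rules):
    """Return annotations for a protein dict with 'family' and 'sequence' keys."""
    result = {"name": None, "ec": [], "go": [], "sites": []}
    for rule in name_rules:
        if rule.family_id == protein["family"]:
            result["name"] = rule.protein_name
            result["ec"] = list(rule.ec)
            result["go"] = list(rule.go)
    for rule in site_rules:
        if rule.family_id != protein["family"]:
            continue
        index = rule.position - 1
        sequence = protein["sequence"]
        # Propagate the site only when the expected residue is really there.
        if 0 <= index < len(sequence) and sequence[index] == rule.expected_residue:
            result["sites"].append((rule.position, rule.feature))
    return result
```

Checking the residue before propagating a site feature is the design choice that keeps a family-level rule from over-annotating members that have lost the functional residue.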
iProLINK Literature Mining Resource
1. UniProtKB bibliography mapping in iProClass
2. RLIMS-P: rule-based NLP method for extracting protein phosphorylation data
3. Substring-based machine learning method for PTM text categorization
4. BioThesaurus of protein/gene names with UniProtKB association
5. Named-entity tagging guide
Literature Corpus for Text Mining


Literature survey and manual tagging for evidence attribution
Training and benchmarking sets for information retrieval and extraction


Protein phosphorylation data used to develop RLIMS-P for extracting
phosphorylation information
The five PTM datasets used to develop a machine learning algorithm for
text categorization
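To make the idea of rule-based extraction concrete, here is a deliberately simplified Python sketch that pulls a kinase, a substrate, and an optional site out of one sentence pattern. It is not RLIMS-P and does not reproduce its rules; the single regular expression and the `extract_phosphorylation` name are assumptions for illustration.

```python
import re

# One hand-written pattern: "<kinase> phosphorylates <substrate> [at|on <site>]".
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+) phosphorylates (?P<substrate>[A-Za-z0-9-]+)"
    r"(?: (?:at|on) (?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return (kinase, substrate, site) tuples; site is None when not stated."""
    return [
        (m.group("kinase"), m.group("substrate"), m.group("site"))
        for m in PATTERN.finditer(sentence)
    ]


print(extract_phosphorylation("PKA phosphorylates CREB at Ser133."))
```

A real system needs many more patterns (passive voice, nominalizations, multi-word protein names), which is why a tagged literature corpus is useful for development and benchmarking.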
Online RLIMS-P
1. Summary table: PMIDs & top-ranking annotation
2. Report: Full annotation with evidence tagging and PMID mapping to UniProtKB entry
3. Name mapping searches BioThesaurus
BioThesaurus
[Pipeline diagram] Name Extraction => Raw Thesaurus => Name Filtering (Highly Ambiguous, Nonsensical Terms) => Semantic Typing (UMLS) => BioThesaurus (UniProtKB Entries: Protein/Gene Names & Synonyms)
Name sources:
• NCBI: Entrez Gene, RefSeq, GenPept
• Genome: FlyBase, WormBase, MGD, SGD, RGD
• UniProt: UniProtKB, UniRef90/50, PIR-PSD
• iProClass
• Other: HUGO, EC, OMIM
Comprehensive collection of protein/gene names from 23 databases
Associate names (~3.2 million) with UniProtKB entries (>2 million)
Web-based searches to retrieve synonymous names, resolve
ambiguous names, evaluate name coverage
FTP download for automatic dictionary-based named entity tagging
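The two lookups described here, retrieving synonyms and flagging ambiguous names, can be sketched with a plain name index. This is a minimal illustration under stated assumptions: the record layout, the `ENTRY_*` identifiers, and the function names are hypothetical and do not reflect the BioThesaurus download format.

```python
from collections import defaultdict

# Hypothetical (entry_id, name) records; the IDs are placeholders.
RECORDS = [
    ("ENTRY_A", "TIMP-3"),
    ("ENTRY_A", "Metalloproteinase inhibitor 3"),
    ("ENTRY_B", "CLIM1"),
    ("ENTRY_C", "CLIM1"),
]


def build_index(records):
    """Map each lower-cased name to the set of entry IDs that carry it."""
    index = defaultdict(set)
    for entry_id, name in records:
        index[name.lower()].add(entry_id)
    return index


def synonyms(records, entry_id):
    """All names recorded for one entry, sorted alphabetically."""
    return sorted({name for eid, name in records if eid == entry_id})


def is_ambiguous(index, name):
    """A name is ambiguous when it points to more than one entry."""
    return len(index.get(name.lower(), ())) > 1


index = build_index(RECORDS)
print(synonyms(RECORDS, "ENTRY_A"))
print(is_ambiguous(index, "CLIM1"))
```

Lower-casing the key is one simple normalization choice; a dictionary-based tagger would typically normalize punctuation and spacing as well.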
Online BioThesaurus
Name ambiguity of CLIM1
1. Search protein entries sharing the same names
2. Retrieve BioThesaurus report
Annotation error detection
BioThesaurus Report
Gene/Protein Name Mapping
1. Search Synonyms (example: synonyms for Metalloproteinase inhibitor 3)
2. Resolve Name Ambiguity (example: name ambiguity of TIMP-3)
3. Underlying ID Mapping
Protein Ontology (PRO)




PRotein Ontology (PRO) in OBO (Open Biomedical Ontologies) Framework
Two sub-ontologies:
• Ontology for Protein Evolution (ProEvo) for the classification of proteins on the basis of evolutionary relationships
• Ontology for Protein Modified Forms (ProMod) to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification)
Why PRO?
• Allow the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology
• Facilitate precise protein annotation of specific proteins/classes
The PRO prototype is illustrated using human proteins from the TGF-beta signaling pathway (http://pir.georgetown.edu/pro).
PRO Conceptual Framework
ProEvo
Root level: evolutionary unit
Unit Level: domain, protein
• The two types of evolutionary units
• Not substituted by any other terms
Domain Family Level (structure): structure domain
• Related by structural similarity
• Source: SCOP Superfamily
Domain Family Level (sequence): sequence domain
• Related by sequence similarity
• Source: Pfam domain
Protein Family Level: homeomorphic protein
• Evolutionarily-related full-length protein
• May contain finer-grain sub-categories
• Sources: PIRSF family/subfamily, Panther subfamily
ProMod
Gene level: gene product
• All protein products encoded by one gene
• Source: UniProtKB
Transcript level: genetic variant, splice variant, reference protein
• Possible transcript forms
• Source: UniProtKB
Post-translation level: cleaved product, modified product
• Protein as modified after translation
• Source: UniProtKB
Linked ontologies
GO (Gene Ontology): molecular function, biological process, cellular component
HGNC/MGI Gene Name: gene name
PSI-MOD Modification: protein modification
DO/UMLS Disease Ontology/Term: disease
Relations shown: is_a, has_part, lacks, has_function, lacks_function, has_ancestral_property, participates_in, part_of (for complexes), located_in (for compartments), encoded_by, derives_from, has_modification, agent_of
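The relations in this framework can be held as subject-relation-object triples and traversed. The sketch below is only an illustration: the term and relation names come from the slide, but which terms each edge connects is an assumed reading of the diagram, and `TRIPLES`, `objects`, and `is_a_closure` are hypothetical names.

```python
# Assumed edges; the slide names the terms and relations but the
# extracted text does not show which terms each arrow connects.
TRIPLES = [
    ("protein", "is_a", "evolutionary unit"),
    ("domain", "is_a", "evolutionary unit"),
    ("homeomorphic protein", "is_a", "protein"),
    ("splice variant", "is_a", "gene product"),
    ("modified product", "derives_from", "reference protein"),
    ("modified product", "has_modification", "protein modification"),
]


def objects(subject, relation):
    """Objects linked to `subject` through `relation`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def is_a_closure(term):
    """Follow is_a links upward and return each ancestor once."""
    found, stack = [], [term]
    while stack:
        for parent in objects(stack.pop(), "is_a"):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


print(is_a_closure("homeomorphic protein"))
print(objects("modified product", "has_modification"))
```

Keeping is_a separate from relations such as derives_from and has_modification lets a query about classification ignore edges that describe how one protein form arises from another.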
Protein Ontology (PRO)
Acknowledgements
PIR Team
Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan
Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata
Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank
Collaborators
UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams
NIAID: Margaret Moore (SSS), Bruno Sobral (VBI)
Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U)
Funding Support
NHGRI/NIGMS (UniProt)
NCI caBIG
NIAID (Proteomic Admin Center)
NSF: iProClass, text mining