Poster - Protein Information Resource


PIRSF protein family classification system
Anastasia Nikolskaya, Sehee Chung, Hongzhan Huang, Raja Mazumder, Darren Natale, Lai-Su Yeh, Cathy Wu
Protein Information Resource, Georgetown University Medical Center
[email protected]
http://pir.georgetown.edu/
Family-driven Protein Annotation
PIRSF Classification System
Abstract


PIRSF: A network structure from Superfamilies to Subfamilies
Reflects evolutionary relationships of full-length proteins


Basic unit = Homeomorphic Family
Homologous (Common Ancestry): Inferred by sequence similarity
Homeomorphic: Full-length sequence similarity and common domain
architecture
Hierarchical Structure: Flexible number of levels with varying degrees of
sequence conservation
Network Structure: Interconnection between different hierarchies
Advantages:


Annotation of both generic biochemical and specific biological functions
Accurate propagation of annotation and development of standardized
protein nomenclature and ontology
Pfam Domain
PIRSF Homeomorphic Family:
• Exactly one level
• Full-length sequence similarity and common domain architecture
PIRSF Superfamily:
• One or more common domains
PIRSF Homeomorphic Subfamily:
• 0 or more levels
• Functional specialization
PF02735: Ku70/Ku80 beta-barrel domain
• PIRSF800001: Ku70/80 autoantigen
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF006493: Ku, prokaryotic type
PF00219: Insulin-like growth factor binding protein (IGFBP)
• PIRSF001969: IGFBP
• IGFBP subfamilies: PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6
• PIRSF018239: IGFBP-related protein, MAC25 type
PF01817: Chorismate mutase (CM)
• PIRSF017318: CM of AroQ class, eukaryotic
• PIRSF001501: CM of AroQ class, prokaryotic
• PIRSF001500: Bifunctional CM/PDT (P-protein)
• PIRSF001499: Bifunctional CM/PDH (T-protein)
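The "network structure" idea (one PIRSF reachable from more than one hierarchy) can be illustrated with a small sketch. It uses the chorismate mutase (PF01817) families listed above together with the prephenate dehydrogenase (PF02153) entries that appear elsewhere on the poster. Attaching PIRSF006786 and PIRSF005547 to PF02153 is an assumption made for illustration, and the function names are hypothetical, not part of any PIR software.

```python
from collections import defaultdict

# Pfam domain -> PIRSFs named on the poster that contain it.
# Placing the two PDH-only families under PF02153 is an assumption.
DOMAIN_TO_PIRSF = {
    "PF01817": ["PIRSF017318", "PIRSF001501", "PIRSF001500", "PIRSF001499"],
    "PF02153": ["PIRSF001499", "PIRSF006786", "PIRSF005547"],
}


def parents_of(mapping):
    """Invert the mapping: PIRSF -> sorted list of domains it is linked to."""
    inverted = defaultdict(list)
    for domain, families in mapping.items():
        for family in families:
            inverted[family].append(domain)
    return {family: sorted(domains) for family, domains in inverted.items()}


def shared_nodes(mapping):
    """PIRSFs that belong to more than one domain hierarchy."""
    parents = parents_of(mapping)
    return sorted(f for f, domains in parents.items() if len(domains) > 1)


if __name__ == "__main__":
    print(shared_nodes(DOMAIN_TO_PIRSF))
```

In this toy data the bifunctional CM/PDH family is the only node with two parents, which is what turns two separate trees into a network.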
Automatic clustering
Preliminary Curation (4,500 PIRSFs): Membership, Signature Domains
Full Curation (2,300 PIRSFs): Family Name, Description, Bibliography
PIRSF Name Rules
Map domains on Families
Computer-assisted Manual Curation
Add/remove members
Final Homeomorphic Families
Protein name rule/site rule
Build and test HMMs


Monitor such variables to ensure accurate propagation

Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)
Propagate other properties that describe function:
Name Rules
• Define conditions under which names propagate to individual proteins
• Enable further specificity based on taxonomy or motifs
• Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement)
Name Rule types:
“Zero” Rule
• Default rule (only condition is membership in the appropriate family)
• Information is suitable for every member
“Higher-Order” Rule
• Has requirements in addition to membership
• Can have multiple rules that may or may not have mutually exclusive conditions

Site Rules
• Define conditions under which features propagate to individual proteins

PIRSF Classification Name
• Reflects the function when possible
• Indicates the maximum specificity that still describes the entire group
• Standardized format
• Name tags: validated, tentative, predicted, functionally heterogeneous

PIRSF006786: PDH, feedback inhibition-insensitive
PIRSF005547: PDH, feedback inhibition-sensitive

PIRSF Report (figure callouts):
• Curated family name
• Description of family
• Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF
• Sequence analysis tools
• Phylogenetic tree and alignment view allows further sequence analysis
• Defined rules for annotation

Example Name Rules
Rule ID: PIRNR000881-1
Rule Conditions: PIRSF000881 member and vertebrates
Propagated Information:
Name: S-acyl fatty acid synthase thioesterase
EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)

Rule ID: PIRNR000881-2
Rule Conditions: PIRSF000881 member and not vertebrates
Propagated Information:
Name: Type II thioesterase
EC: thiolester hydrolases (EC 3.1.2.-)

Rule ID: PIRNR025624-0
Rule Conditions: PIRSF025624 member
Propagated Information:
Name: ACT domain protein
Misnomer: chorismate mutase

Name Rule in Action at UniProt


Automatic annotations (AA) are in a separate field
AA are only visible from www.ebi.uniprot.org
Future:
Automatic name annotations will become the DE line if the DE line will improve as a result
AA will be visible from all consortium-hosted web sites

Note the lack of a zero rule for PIRSF000881 in the example name rules


PIR Site Rules
Position-Specific Site Features:
• active sites
• binding sites
• modified amino acids
Current requirements:
• at least one PDB structure
• experimental data on functional sites: CATRES database (Thornton)

PIRSF Report and DAG View (figure callouts):
• PIRSF in DAG View
• Mapping to other protein classification databases
• Integrated value-added information from other databases
• Name rules and site rules allow precise annotation of UniProt proteins within the PIRSF

Name Rule Propagation Pipeline
Affiliation of Sequence: Homeomorphic Family or Subfamily (whichever PIRSF is the lowest possible node)
• Name rule exists? No -> nothing to propagate. Yes -> next question.
• Protein fits criteria for any higher-order rule? Yes -> assign name from Name Rule 1 (or 2 etc). No -> next question.
• PIRSF has zero rule? Yes -> assign name from Name Rule 0. No -> nothing to propagate.
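The decision flow of the name rule propagation pipeline can be sketched in a few lines. This is a minimal illustration, not the PIR implementation: the rule names and conditions are transcribed from the poster's example name rules, while the `NameRule` class, the `propagate_name` function, and the protein fields `pirsf` and `is_vertebrate` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    rule_id: str
    name: str
    # Extra requirement beyond family membership; None marks a zero rule.
    condition: Optional[Callable[[dict], bool]] = None


# Rules keyed by PIRSF ID, copied from the poster's examples.
RULES = {
    "PIRSF000881": [
        NameRule("PIRNR000881-1", "S-acyl fatty acid synthase thioesterase",
                 lambda p: p["is_vertebrate"]),
        NameRule("PIRNR000881-2", "Type II thioesterase",
                 lambda p: not p["is_vertebrate"]),
    ],
    "PIRSF025624": [NameRule("PIRNR025624-0", "ACT domain protein")],
}


def propagate_name(protein: dict) -> Optional[str]:
    """Return the name to propagate, or None if there is nothing to propagate."""
    rules = RULES.get(protein["pirsf"])
    if not rules:
        return None  # no name rule exists for this PIRSF
    # Higher-order rules are checked first.
    for rule in rules:
        if rule.condition is not None and rule.condition(protein):
            return rule.name
    # Otherwise fall back to the zero rule, if the PIRSF has one.
    for rule in rules:
        if rule.condition is None:
            return rule.name
    return None


if __name__ == "__main__":
    print(propagate_name({"pirsf": "PIRSF000881", "is_vertebrate": True}))
```

Because PIRSF000881 has no zero rule, a member only receives a name through one of its two higher-order rules; PIRSF025624 members always receive the zero-rule name.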
Current:

PF02153: Prephenate dehydrogenase (PDH)
• PIRSF001499: Bifunctional CM/PDH (T-protein)

Account for functional variations within one PIRSF, including:
• Lack of active site residues necessary for enzymatic activity
• Certain activities relevant only to one part of the taxonomic tree
• Evolutionarily related proteins whose biochemical activities are known to differ
EC, GO terms, misnomer info, pathway
Name, refs, abstract, domain arch.
Create hierarchies (superfamilies/subfamilies)
Hierarchy
Curated Homeomorphic Families




PIRSF Classification Name

Preliminary Homeomorphic Families
Merge/split clusters
PIR Name Rules
Objective: Optimize for protein annotation
Definitions
Unassigned proteins
Automatic Procedure
PIRSF Report
PIRSF Homeomorphic Subfamily
PIRSF Homeomorphic Family
Computer Generated (Uncurated) Clusters (35,000 PIRSFs)
New proteins
Orphans
UniProtKB proteins
Automatic placement
The PIRSF protein classification system reflects evolutionary relationships
of full-length proteins and domains. PIRSF families are extensively
curated using a bioinformatics infrastructure implemented in a J2EE
framework. Expert manual curation covers membership, annotation of
specific biological functions, biochemical activities, and sequence
features. Novel functional predictions for uncharacterized “hypothetical”
proteins and protein families are routinely made in the annotation
process.
Fully curated families and their protein members provide a basis for rich
and accurate functional annotation of protein sequences in the UniProt
Knowledgebase.
The PIRSF database is accessible at http://pir.georgetown.edu/pirsf/
PIRSF Superfamily
• 0 or more levels
Creation and Curation of PIRSFs

Rule Definition (PIR Site Rules):
• Select template structure
• Align PIRSF seed members with structural template
• Edit MSA to retain conserved regions covering all site residues
• Build Site HMM from concatenated conserved regions
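The MSA-editing step above (retain conserved regions covering all site residues, then concatenate them) can be sketched as follows. This is an illustration under stated assumptions: the template is taken to be the first row of the alignment, site positions are 0-based indices into the ungapped template, and a fixed `flank` of columns around each site stands in for a curator's judgment about what counts as a conserved region. All function names are hypothetical. The concatenated alignment would then be handed to an HMM-building tool (HMMER's hmmbuild is one such tool; the poster does not say which was used).

```python
def site_columns(template_row, site_positions):
    """Map 0-based residue indices of the ungapped template to alignment columns."""
    wanted = set(site_positions)
    columns = []
    residue_index = -1
    for col, char in enumerate(template_row):
        if char != "-":
            residue_index += 1
            if residue_index in wanted:
                columns.append(col)
    return columns


def merged_windows(columns, flank, width):
    """Expand each site column by `flank` on both sides and merge overlaps."""
    windows = []
    for col in sorted(columns):
        start = max(0, col - flank)
        end = min(width, col + flank + 1)  # end is exclusive
        if windows and start <= windows[-1][1]:
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def concatenate_regions(msa, site_positions, flank=1):
    """Keep only the windows around site residues and join them per sequence."""
    columns = site_columns(msa[0], site_positions)
    windows = merged_windows(columns, flank, len(msa[0]))
    return ["".join(row[s:e] for s, e in windows) for row in msa]


if __name__ == "__main__":
    toy_msa = ["MK-HDLAGSW", "MRAHDL-GTW", "MKAHELAGSW"]  # made-up sequences
    for row in concatenate_regions(toy_msa, site_positions=[2, 7]):
        print(row)
```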
System Implementation
PCS Architecture
Clients: Web Browser; Applications (JavaWebStart)
HTTPD
Middle Tier: Servlet [Controller]; JSP, HTML, XML (XSLT) [Presentation]; Domain Objects [Model]; DAO Manager (SQL DAO, FLAT DAO, XML DAO); JDBC, FlatFile Adapter, XML Adapter
Data Source: DB2, MySQL, Oracle, Legacy Databases, XML Repositories
Graphical Analysis Tool Integration
PCS Web Interface: Shopping Cart View
PIRSF DAG Editor/Viewer Collaborative Curation Platform
Acknowledgements
UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
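Single-linkage clustering, as named in the list above, places two proteins in the same cluster whenever a chain of sufficiently similar pairs connects them. A minimal sketch using a union-find structure is shown below. The hit tuples `(query, subject, evalue)` and the `max_evalue` cutoff are assumptions for illustration; the poster does not state the thresholds used, and the length-coverage criterion mentioned for BlastClust is not modeled here.

```python
def single_linkage(ids, hits, max_evalue=1e-5):
    """Cluster ids by linking every pair whose hit passes the E-value cutoff."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            parent[find(query)] = find(subject)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    proteins = ["A", "B", "C", "D", "E"]  # placeholder identifiers
    blast_hits = [
        ("A", "B", 1e-30),
        ("B", "C", 1e-10),
        ("C", "D", 0.5),   # too weak: does not link the two groups
        ("D", "E", 1e-8),
    ]
    print(single_linkage(proteins, blast_hits))
```

A and C end up together even though they share no direct hit, which is the defining property of single linkage and one reason curator-guided merging and splitting of the resulting clusters is useful.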