Transcript Slide 1

Protein Bioinformatics – Advances
and Challenges
BY
Sona Vasudevan
Peter McGarvey
1
Outline
• What is Bioinformatics? Past & Present
• About PIR
• PIR resources
• UniProt resources
• PIR’s leading role in caBIG, Biodefense and Ontology
2
What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative
(BISTI) Working Definition (2000)

Bioinformatics: Research, development, or
application of computational tools and
approaches for expanding the use of biological,
medical, behavioral or health data, including
those to acquire, store, organize, archive,
analyze, or visualize such data.
Computer (Information) + Mouse (Biology) = Bioinformatics
3
“A science which hesitates to
forget its founders is lost.”
---- A. N. Whitehead
4
Evolution of Protein
databases
(Georgetown University)
Dr. Margaret Oakley
Dayhoff (1925 – 1983)
The origin of the single-letter code for the amino acids
5
Challenges we are facing today!
• Total number of sequences in NR: ~4,919,302
• Total number of environmental sequences: ~6,028,191 (NCBI)
• Number of domain families (Pfam): ~8957
• Number of domain families (SMART): ~665
• Number of structures (PDB): ~43339
• Number of COGs: ~4873 (Unicellular), ~4852 (Eukaryote)
6
Molecular Biology
Databases
NAR Molecular Biology Database Collection
[Chart: database number (0 to 800) by year, 1999 to 2005]
The DNA sequence database has exceeded 100 gigabases.
719 Databases in 14 categories
7
The birth of “omes” and the “omic” era in biology
8
Genomics
Proteomics
Functionomics
Unknomics
Metagenomics
9
10
Protein Information Resource
Integrated Protein Informatics Resource for Proteomics Research

• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Protein Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis
http://pir.georgetown.edu
11
UniProt Databases



UniParc: Comprehensive Sequence Archive with Sequence
History
UniProt: Knowledgebase with Full Classification and Functional
Annotation
UniRef: Non-redundant Reference Databases for Sequence
Search
[Diagram: relationship among source data, UniParc, UniProt and UniRef]
Source data: SwissProt, TrEMBL, PIR-PSD, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data
UniParc (Archive)
UniProt (Knowledgebase): Merging; Classification, Literature-Based & Automated Annotation
UniRef100 (NREF), UniRef90, UniRef50: Clustering at 100, 90, 50% Identity
12
UniProt Knowledgebase

Objective: Stable, Comprehensive, Fully Classified,
Richly and Accurately Annotated

Information Content
• Isoform Presentation
• Nomenclature
• Family Classification and Domain Identification
• Functional Annotation
Approaches
• Full Classification
• Automated Annotation
• Literature-Based Curation
• Database Cross-References
• Controlled Vocabularies & Ontologies
• Evidence Attribution
13
PIRSF Classification System

PIRSF:
• Reflects evolutionary relationships of full-length proteins
• A network structure from superfamilies to subfamilies
Definitions:
• Homeomorphic Family (HF): Basic Unit
• Homologous: Common ancestry, inferred by sequence similarity
• Homeomorphic: Full-length similarity & common domain architecture
• Hierarchy: Flexible number of levels with varying degrees of sequence conservation
• Network Structure: Allows multiple parents
Advantages:
• Annotate both general biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology
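One way to picture a classification network in which a node may have more than one parent is a simple parent map. The sketch below is only an illustration: the node names and the `ancestors` helper are hypothetical and are not taken from PIRSF.

```python
# Hypothetical node names; only the parent links matter for the illustration.
PARENTS = {
    "superfamily_A": [],
    "superfamily_B": [],
    "family_1": ["superfamily_A", "superfamily_B"],  # a node with two parents
    "subfamily_1a": ["family_1"],
}

def ancestors(node, parents=PARENTS):
    """Return every node reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

print(sorted(ancestors("subfamily_1a")))
```

Because a tree would allow only one parent per node, a mapping from node to a list of parents is the smallest structure that captures the "multiple parents" property.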
14
Credit AN Nikolskaya
PIRSF Classification System
Protein Classification and Functional Annotation
(http://pir.georgetown.edu/pirsf/)
• Comprehensive Classification of All UniProt Proteins
• Curated Families with Protein Name and Site Rules
• Classification and Visualization Tools
Taxonomy Distribution
and Phylogenetic Pattern
Iterative BlastClust Tree with Annotation
Table, MSA & Phylogenetic tree
15




Classification Tool: BlastClust
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
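As a rough illustration of single-linkage clustering, the sketch below puts two sequences in the same cluster whenever a chain of pairwise hits passing identity and length-coverage cutoffs connects them. The hit tuples stand in for parsed BLAST results; the function name, cutoff values, and data are hypothetical, and this does not reproduce BlastClust itself.

```python
from collections import defaultdict

def single_linkage_clusters(ids, hits, min_identity=30.0, min_coverage=0.8):
    """Cluster ids; a single passing hit is enough to merge two clusters."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical pairwise hits: (query, subject, percent identity, length coverage)
ids = ["P1", "P2", "P3", "P4", "P5"]
hits = [
    ("P1", "P2", 62.0, 0.95),
    ("P2", "P3", 41.0, 0.90),
    ("P3", "P4", 55.0, 0.40),  # fails the coverage cutoff
    ("P4", "P5", 25.0, 0.99),  # fails the identity cutoff
]
clusters = single_linkage_clusters(ids, hits)
print(clusters)
```

Rerunning with a looser or stricter cutoff changes which clusters merge, which is the kind of adjustment an iterative, curator-guided pass would explore.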
16
PIRSF-Based Protein Annotation
Classification-Driven Rule-Based Annotation
Provides Consistent Annotation and Database Integrity Check
Includes:
Site Rule (PIRSR): Position-Specific Site Feature (FT)
Name Rule (PIRNR): transfer name from PIRSF to individual
proteins
Protein Name (DE) with Synonym, EC, Misnomer
GO Term
Rule ID | Rule Condition | Rule Description (Name Rule Interface)
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-1 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase
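The three name rules above can be read as condition/result pairs. The following is a minimal sketch of how such rules might be evaluated, assuming a protein record that carries its PIRSF membership and a vertebrate flag. The `Protein` and `NameRule` classes, the `apply_name_rules` function, and the accession strings are hypothetical; only the rule IDs, conditions, names, EC numbers, and misnomer come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Protein:
    accession: str       # hypothetical identifier
    pirsf_id: str        # PIRSF family the protein belongs to
    is_vertebrate: bool  # taxonomic condition used by the rules

@dataclass
class NameRule:
    rule_id: str
    pirsf_id: str
    condition: Callable[[Protein], bool]
    name: str
    ec: Optional[str] = None
    misnomer: Optional[str] = None

RULES = [
    NameRule("PIRNR000881-1", "PIRSF000881", lambda p: p.is_vertebrate,
             "S-acyl fatty acid synthase thioesterase", ec="3.1.2.14"),
    NameRule("PIRNR000881-2", "PIRSF000881", lambda p: not p.is_vertebrate,
             "Type II thioesterase", ec="3.1.2.-"),
    NameRule("PIRNR025624-1", "PIRSF025624", lambda p: True,
             "ACT domain protein", misnomer="chorismate mutase"),
]

def apply_name_rules(protein, rules=RULES):
    """Return the first rule whose family and condition both match, else None."""
    for rule in rules:
        if rule.pirsf_id == protein.pirsf_id and rule.condition(protein):
            return rule
    return None

hit = apply_name_rules(Protein("X0001", "PIRSF000881", is_vertebrate=False))
print(hit.rule_id, hit.name)
```

Keeping the condition as a separate predicate mirrors the table's split between Rule Condition and Rule Description.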
17
Rule-based Annotation of Protein
Entries Using PIRSF
Structure
Binding/active sites
Identification of residues
18
Methodology

Defining a Rule
• Select template structure
• Align curated PIRSF seed members and structural template
• Structure-based sequence alignment of seeds
• Edit MSA retaining conserved regions covering all site residues
• Build Site HMM from concatenated conserved regions
Rule Condition
• Membership Check (PIRSF HMM threshold)
• Conserved Region Check (site HMM threshold)
• Site Residue Check (position-specific residue in HMMAlign)
Rule Propagation
• Propagate conserved feature annotation to all members that fit the rule
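The three rule-condition checks and the propagation step can be sketched as a single function. This is an illustration under stated assumptions: the HMM scores and the residue alignment are taken as already computed inputs, and the `SiteRule` class, the `propagate` function, the rule ID, and every threshold value are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SiteRule:
    rule_id: str
    family_threshold: float  # PIRSF HMM score cutoff (hypothetical value)
    site_threshold: float    # site HMM score cutoff (hypothetical value)
    site_residues: dict      # template alignment column -> expected residue
    feature: str             # annotation to propagate

def propagate(rule, family_score, site_score, aligned):
    """Return (position, residue, feature) tuples if all three checks pass.

    `aligned` maps a template alignment column to the candidate protein's
    (sequence position, residue) at that column.
    """
    # 1. Membership check against the family HMM threshold.
    if family_score < rule.family_threshold:
        return []
    # 2. Conserved region check against the site HMM threshold.
    if site_score < rule.site_threshold:
        return []
    # 3. Site residue check: every expected residue must be present.
    features = []
    for column, expected in rule.site_residues.items():
        match = aligned.get(column)
        if match is None or match[1] != expected:
            return []
        features.append((match[0], match[1], rule.feature))
    return features

rule = SiteRule("EXAMPLE-RULE", 100.0, 20.0, {12: "S", 45: "H"}, "Active site")
aligned = {12: (17, "S"), 45: (52, "H")}
annotations = propagate(rule, 150.0, 35.0, aligned)
print(annotations)
```

Returning an empty list on any failed check means a protein either receives the full set of site features or none, which avoids partial annotation.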
19
An example of PIR rule Integrated into SP record
PIR Rule
20
PIRSF Protein Classification provides
a platform for protein annotation

Improves Annotation Quality
• Annotation of biological function of whole proteins
• Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
• Correction of annotation errors
• Improvement of under- or over-annotated proteins
Standardization of Protein Names
21
Data Integration

Data Warehouse
• Local Copy of Databases in a Unified Database Schema
• Allows Local Control of Data; Update Problem
Hypertext Navigation
• Browsing Model with Hypertext Links
• Allows Direct Interaction; Easily Lost in Cyberspace
iProClass Approach
• Data Warehouse + Hypertext Navigation
• Rich Links (Links + Executive Summaries)
• Modular and Open Framework for Adding New Components in Distributed Networking Environment
22
iProClass Database
Integrated Protein Family, Function, Structure Information
• ~5,000,000 Protein Sequences
• Rich Links to >80 Databases
• Value-Added Views for UniProt
[Diagram: iProClass at the center, linked to source databases grouped by category]
iProClass: Protein Sequence, Superfamily/Domain/Motif, Protein Function/Pathway, Protein Interaction, Protein Modification, Protein Expression, Protein Structure, Gene
Protein Sequence: PIR-NREF, PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept
Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, GDB, OMIM, SGD, MGI, FlyBase, MIPS, TIGR
Family: PIR Superfamily, PIR-ASDB, InterPro, Pfam, PROSITE, COG, BLOCKS, ProClass, MetaFam
Function/Pathway: EC-IUBMB, KEGG, BRENDA, WIT, MetaCyc, EcoCyc, Gene Ontology
Structure: PDB, SCOP, CATH, PDBSum, MMDB, FSSP
Modification: RESID, PhosphoBase, PhosphorylationSite
Interaction: DIP, BIND
Taxonomy, Expression, Literature: NCBI Taxon, PMG, PubMed
23
iProClass Views
Family Report
Sequence
Report
24
PIR iProClass Searches
ID Mapping
Peptide Search
Text Search
BLAST Search
25
1. Albert Einstein College of Medicine: T. gondii, C. parvum
2. Caprion Pharmaceuticals: B. abortus
3. Harvard Institute of Proteomics: V. cholerae, B. anthracis
4. Myriad Genetics: B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory: S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps: SARS CoV, Influenza
7. University of Michigan: B. anthracis
[Diagram: DATA from Albert Einstein, PNNL, U of Michigan, Harvard, Myriad, Scripps and Caprion goes to the Resource Center (SSS, PIR, VBI)]
26
Organism
Research Center
Data Type
27
Master Protein Directory
28
Colonization
Pathway Proteins
Currently contains 3,733 ORF Clones out of 293,784
Proteins
[Screenshot callouts, overlapping in the source: Search for Proteins and Clones in Catalog by Classification or Similarity Searches; Related Information (Protein Summary, Sequences, Family Report); Order Reagent Clones from Repositories]
29
Mouse proteins detected in B. anthracis and S. typhimurium infected macrophages
NCI caBIG Initiative
cancer Biomedical Informatics Grid:
• Informatics platform to enable sharing of research, data and tools
• Designed and built by an open federation of organizations
• Facilitate connectivity via common standards and unifying architecture
• Open source and open access principles
Domain Workspaces
• Clinical Trial Management Systems
• Integrative Cancer Research
• Imaging
• Tissue Banks and Pathology Tools
Cross Cutting Workspaces
• Architecture
• Vocabularies and Common Data Elements
PIR Activities in caBIG™
• Integrative Cancer Research Workspace
  • Developer
    • Grid-enablement of PIR
  • Adopter
    • SEED Genome Annotation Tool (completed)
    • GeneConnect Genomic Identifier Mapping Service
• Vocabularies and Common Data Elements
  • Participant
33