Class: Protein functional Annotation and Family Classification

FUNCTIONAL ANALYSIS OF
PROTEIN SEQUENCES:
ANNOTATION AND FAMILY
CLASSIFICATION
Anastasia Nikolskaya
PIR (Protein Information Resource),
Georgetown University Medical Center
Overview
Problem:



Most new protein sequences come from genome sequencing projects
Many have unknown functions
Large-scale functional annotation of these sequences based simply on
BLAST best hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:


Homology-based (sequence analysis, structure analysis)
Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:


Highly curated and annotated protein classification system
Automatic annotation of sequences based on protein families
PIRSF Protein Classification System




Full-length protein family classification based on evolution
Highly annotated, optimized for annotation propagation
Functional predictions for uncharacterized proteins
Used to facilitate and standardize annotations in UniProt
Proteomics and Bioinformatics

Data: Gene expression profiling (genome-wide analysis of gene expression)
Data: Protein-protein interaction
Data: Structural genomics (3D structures of all protein families)
Data: Genome projects (sequencing)
….
Bioinformatics
Computational analysis and integration of these data
Making predictions (function, etc.), reconstructing pathways
What’s In It For Me?




When an experiment yields a sequence (or a set of
sequences), we need to find out as much as we can about
this protein and its possible function from available data
Especially important for poorly characterized or
uncharacterized (“hypothetical”) proteins
More challenging for large sets of sequences generated
by large-scale proteomics experiments
The quality of this assessment is often critical for
interpreting experimental results and forming hypotheses
for future experiments
Sequence -> function
Work with Protein, not DNA Sequence
[Figure: genomic DNA sequence (promoter, 5' UTR, exons 1-3, introns, 3' UTR) -> gene recognition -> protein sequence -> structure determination (protein structure) and family classification (protein family) -> function analysis (function, molecular evolution, gene network, metabolic pathway)]
The Changing Face of Protein Science
20th century:
- Few well-studied proteins
- Mostly globular, with enzymatic activity
- Biased protein set
21st century:
- Many “hypothetical” proteins (most new proteins come from genome
sequencing projects; many have unknown functions)
- Various, often with no enzymatic activity
- Natural protein set
Credit: Dr. M. Galperin, NCBI
Knowing the Complete Genome Sequence
Advantages:




All encoded proteins can be predicted and identified
The missing functions can be identified and analyzed
Peculiarities and novelties in each organism can be
studied
Predictions can be made and verified
Challenge:

Accurate assignment of known or predicted functions
(functional annotation)
Genomes compared: Escherichia coli, Methanococcus jannaschii, yeast, human
Category | E. coli | M. jannaschii | S. cerevisiae | H. sapiens
Characterized experimentally | 2046 | 97 | 3307 | 10189
Characterized by similarity | 1083 | 1025 | 1055 | 10901
Unknown, conserved | 285 | 211 | 1007 | 2723
Unknown, no similarity | 874 | 411 | 966 | 7965
From Koonin and Galperin, 2003, with modifications
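To make the size of the annotation problem concrete, the short sketch below computes, for each genome in the table above, the share of entries that fall into the two "unknown" categories. The counts are copied from the table; the dictionary keys and the function name are illustrative choices, not part of the original slides.

```python
# Counts copied from the table above (Koonin and Galperin, 2003, with modifications).
# Key names are illustrative labels for the four table rows.
counts = {
    "E. coli": {"experimental": 2046, "similarity": 1083,
                "unknown_conserved": 285, "unknown_unique": 874},
    "M. jannaschii": {"experimental": 97, "similarity": 1025,
                      "unknown_conserved": 211, "unknown_unique": 411},
    "S. cerevisiae": {"experimental": 3307, "similarity": 1055,
                      "unknown_conserved": 1007, "unknown_unique": 966},
    "H. sapiens": {"experimental": 10189, "similarity": 10901,
                   "unknown_conserved": 2723, "unknown_unique": 7965},
}


def unknown_fraction(row):
    """Fraction of entries in the two 'unknown' rows of the table."""
    total = sum(row.values())
    return (row["unknown_conserved"] + row["unknown_unique"]) / total


for organism, row in counts.items():
    total = sum(row.values())
    print(f"{organism}: {unknown_fraction(row):.1%} of {total} entries are unknowns")
```

In every one of the four genomes, roughly a quarter to a third of the entries are "unknowns", which is the group the later slides propose to rank by importance.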
Functional Annotation
for Different Groups of Proteins

Experimentally characterized
- Find up-to-date information, accurate interpretation

Characterized by similarity (“knowns”) = closely related to
experimentally characterized
- Avoid propagation of errors

Function can be predicted (no close sequence similarity, may be
distant similarity to characterized proteins)
- Extract maximum possible information, avoid errors and
overpredictions
- Most value-added (fill the gaps in metabolic pathways, etc.)

“Unknowns” (conserved or unique)
- Rank by importance
How are Protein Sequences Annotated?
“regular approach”
Protein
Sequence
Automatic assignment based on sequence
similarity (best BLAST hit):
gene name, protein name, function
Large-scale functional annotation of sequences
based simply on BLAST best hit has pitfalls;
results are far from perfect
Function
To avoid mistakes, need human intervention
(manual annotation)
Quality vs Quantity
Problems in Functional Assignments
for “Knowns”
- Misinterpreted experimental results (e.g. suppressors, cofactors)
- Biologically senseless annotations:
Arabidopsis: separation anxiety protein-like
Helicobacter: brute force protein
Methanococcus: centromere-binding protein
Plasmodium: frameshift
- “Goofy” mistakes of sequence comparison (e.g. abc1/ABC)
- Multi-domain organization of proteins
- Low sequence complexity (coiled-coil, transmembrane, non-globular regions)
- Enzyme evolution:
Divergence in sequence and function (minor mutation in active site)
Non-orthologous gene displacement: convergent evolution
Problems in Functional Assignments for
“Knowns”: multi-domain organization of proteins
[Figure: new sequence (ACT domain) -> BLAST -> chorismate mutase (chorismate mutase domain + ACT domain)]
In BLAST output, top hits are to chorismate mutases ->
the name “chorismate mutase” is automatically assigned to the
new sequence. ERROR! (The protein gets an erroneous name and EC
number, is assigned to an erroneous pathway, etc.)
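One way to catch this kind of error automatically is to check which domains of the database hit are actually covered by the alignment before transferring its name. The sketch below is a minimal illustration of that idea, not a description of any existing pipeline: the `Hit` class, the 0.8 cutoffs and all coordinates are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A database hit; all fields and coordinates here are hypothetical."""
    name: str
    subject_length: int
    subject_start: int   # alignment start on the hit, 1-based inclusive
    subject_end: int     # alignment end on the hit, 1-based inclusive
    subject_domains: dict  # domain name -> (start, end) on the hit


def covered_domains(hit, min_overlap=0.8):
    """Domains of the hit that the alignment covers by at least min_overlap."""
    covered = []
    for domain, (d_start, d_end) in hit.subject_domains.items():
        overlap = min(hit.subject_end, d_end) - max(hit.subject_start, d_start) + 1
        if overlap > 0 and overlap / (d_end - d_start + 1) >= min_overlap:
            covered.append(domain)
    return covered


def safe_to_transfer_name(hit, min_subject_coverage=0.8):
    """Only transfer the full-length name if most of the hit is aligned."""
    aligned = hit.subject_end - hit.subject_start + 1
    return aligned / hit.subject_length >= min_subject_coverage


# Hypothetical two-domain chorismate mutase hit aligned only over its ACT domain.
hit = Hit("chorismate mutase", subject_length=360,
          subject_start=280, subject_end=355,
          subject_domains={"CM": (1, 90), "ACT": (275, 355)})

print(covered_domains(hit))        # only the ACT domain is covered
print(safe_to_transfer_name(hit))  # the alignment spans too little of the hit
```

In this example the query matches only the regulatory ACT domain, so the catalytic name should not be propagated; a full-length family assignment avoids the problem altogether.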
Problems in
Functional Assignments for “Knowns”
Previous low quality annotations lead to
propagation of mistakes
Functional Prediction:
I. Sequence and Structure Analysis
(homology-based methods)
In non-obvious cases:
- Sophisticated database searches (PSI-BLAST, HMM)
- Detailed manual analysis of sequence similarities
- Structure-guided alignments and structure analysis
Often, only general function can be predicted:
- Enzyme activity can be predicted, the substrate remains unknown
(ATPases, GTPases, oxidoreductases, methyltransferases,
acetyltransferases)
- Helix-turn-helix motif proteins (predicted transcriptional
regulators)
- Membrane transporters
Using Sequence Analysis:
Hints

Proteins (domains) with different 3D folds are not
homologous (unrelated by origin). Proteins with
similar 3D folds are usually (but not always)
homologous

Those amino acids that are conserved in divergent
proteins within a (super)family are likely to be
functionally important (catalytic or binding sites, etc.).

Reaction chemistry often remains conserved even
when sequence diverges almost beyond recognition
Using Sequence Analysis:
Hints

Prediction of 3D fold (if distant homologs have known
structures!) and of general biochemical function is much
easier than prediction of exact biological function

Sequence analysis complements structural comparisons
and can greatly benefit from them

Comparative analysis allows us to find subtle sequence
similarities in proteins that would not have been noticed
otherwise
Credit: Dr. M. Galperin, NCBI
Structural Genomics: Structure-Based
Functional Predictions
Protein Structure Initiative: determine 3D structures of all protein families
Methanococcus jannaschii MJ0577 (hypothetical protein)
Contains bound ATP => ATPase or ATP-mediated
molecular switch
Confirmed by biochemical experiments
Crystal Structure is Not a Function!
Credit: Dr. M. Galperin, NCBI
Functional Prediction:
II. Computational Analysis Beyond Homology
- Phylogenetic distribution (comparative genomics)
Wide - most likely essential
Narrow - probably clade-specific
Patchy - most intriguing
Clues: specific to niche, pathway type
- Domain association – “Rosetta Stone”
- Genome context (gene neighborhood, operon
organization)
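The wide / narrow / patchy distinction can be sketched as a simple rule over presence/absence data. The function below is only an illustration under stated assumptions: the 0.9 cutoff for "wide", the clade grouping and the genome names are all hypothetical, and real comparative-genomics analyses use far richer criteria.

```python
def classify_distribution(present_in, genomes_by_clade, wide_cutoff=0.9):
    """Label a protein family's phylogenetic distribution.

    present_in: set of genome names in which the family is found.
    genomes_by_clade: dict mapping clade name -> set of genome names.
    wide_cutoff: hypothetical fraction of genomes needed to call it 'wide'.
    """
    all_genomes = set().union(*genomes_by_clade.values())
    fraction = len(present_in & all_genomes) / len(all_genomes)
    if fraction >= wide_cutoff:
        return "wide"      # most likely essential
    clades_hit = [c for c, g in genomes_by_clade.items() if present_in & g]
    if not clades_hit:
        return "absent"
    if len(clades_hit) == 1:
        return "narrow"    # probably clade-specific
    return "patchy"        # scattered across clades: most intriguing


# Hypothetical genome set grouped into three clades.
clades = {
    "clade_A": {"A1", "A2", "A3"},
    "clade_B": {"B1", "B2", "B3"},
    "clade_C": {"C1", "C2", "C3", "C4"},
}

print(classify_distribution({"A1", "A2"}, clades))
print(classify_distribution({"A1", "B2", "C4"}, clades))
```

A patchy result is where the "clues" on the slide come in: checking whether the genomes that share the family also share a niche or a pathway type.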
Using Genome Context for
Functional Prediction
SEED analysis tool (by FIG)
Embden-Meyerhof and Gluconeogenesis pathway:
6-phosphofructokinase (EC 2.7.1.11)
Functional Prediction: Problem Areas




Identification of protein-coding regions
Delineation of potential function(s) for distant
paralogs
Identification of domains in the absence of
close homologs
Analysis of proteins with low sequence
complexity
What to do with a new protein sequence

Basic:
- Domain analysis (SMART = most sensitive; Pfam = most complete; CDD)
- BLAST
- Curated protein family databases (PIRSF, InterPro, COGs)
- Literature (PubMed) via links from individual entries in the BLAST output
(look for Swiss-Prot entries first)

If not sufficient:
- PSI-BLAST
- Refined PubMed search using gene/protein names, synonyms,
function and other terms you found
- Genome neighborhood (prokaryotes)

Advanced:
- Multiple sequence alignments (manual)
- Structure-guided alignments and structure analysis
- Phylogenetic tree reconstruction
Case Study:
Prediction Verified: GGDEF domain





Proteins containing this domain: Caulobacter crescentus PleD
controls swarmer cell - stalk cell transition (Hecht and Newton,
1995). In Rhizobium leguminosarum, Acetobacter xylinum,
required for cellulose biosynthesis (regulation)
Predicted to be involved in signal transduction because it is found
in fusions with other signaling domains (receiver, etc)
In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide
regulator of cellulose synthase (signalling molecule). Multidomain
protein with GGDEF domain was shown to have diguanylate
cyclase activity (Tal et al., 1998)
Detailed sequence analysis tentatively predicts GGDEF to be a
diguanylate cyclase domain (Pei and Grishin, 2001)
Complementation experiments prove diguanylate cyclase activity
of GGDEF (Ausmees et al., 2001)
The Need for Classification
Problem:




Most new protein sequences come from genome sequencing projects
Many have unknown functions
Large-scale functional annotation of these sequences based simply on
BLAST best hit has pitfalls; results are far from perfect
Manual annotation of individual proteins is not efficient
Solution:


Highly curated and annotated protein classification system
Automatic annotation of sequences based on protein families
Facilitates:




Automatic annotation of sequences based on protein families
Systematic correction of annotation errors
Protein name standardization
Functional predictions for uncharacterized proteins
This all works only if the system is optimized for annotation
Levels of Protein Classification
Level | Example | Similarity | Evolution
Class | α/β | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication
Protein Evolution
Domain: Evolutionary/Functional/Structural Unit
Sequence changes: with enough similarity, one
can trace back to a common origin
Domain shuffling: what about these?
Consequences of Domain Shuffling
[Figure: domain architectures of the families PIRSF001501, PIRSF006786, PIRSF001499, PIRSF005547, PIRSF001424 and PIRSF001500, built from different combinations of the CM (AroQ type), PDH, PDT and ACT domains. Candidate names raised by the figure: CM? PDH? PDT? CM/PDH? CM/PDT?]
CM = chorismate mutase
PDH = prephenate dehydrogenase
PDT = prephenate dehydratase
ACT = regulatory domain
Whole Protein = Sum of its Parts?
PIRSF006256
Domain architecture: Acylphosphatase - ZnF - ZnF - YrdC - Peptidase M22
On the basis of domain composition alone, biological
function was predicted to be:
● RNA-binding translation factor
● maturation protease
Actual function:
● [NiFe]-hydrogenase maturation factor,
carbamoyltransferase
Full-length protein functional annotation is best done
using annotated full-length protein families
Practical classification of proteins:
setting realistic goals
We strive to reconstruct the natural classification
of proteins to the fullest possible extent
BUT
Domain shuffling rapidly degrades the continuity in the protein
structure (faster than sequence divergence degrades similarity)
THUS
The further we extend the classification, the finer
is the domain structure we need to consider
SO
We need to compromise between the depth of analysis and
protein integrity
OR …
Credit: Dr. Y. Wolf, NCBI
Complementary Approaches
Domain Classification
- Allows a hierarchy that can trace evolution to the deepest
possible level, the last point of traceable homology and
common origin
- Can usually annotate only general biochemical function
- Can map domains onto proteins
Full-length protein Classification
- Cannot build a hierarchy deep along the evolutionary tree
because of domain shuffling
- Can usually annotate specific biological function (preferred
to annotate individual proteins)
- Can classify proteins even when domains are not defined
Protein Classification Databases
Domain classification: Pfam, SMART, CDD
Full-length protein classification: PIRSF
Mixed: TIGRFAMs, COGs
Based on structural fold: SCOP
InterPro: integrates various types of classification databases
InterPro
Integrated resource for protein families, domains and sites.
Combines a number of databases: PROSITE, PRINTS,
Pfam, SMART, ProDom, TIGRFAMs, PIRSF
[Figure: InterPro view of the CM, PDT and ACT domains mapped onto SF001500, bifunctional chorismate mutase/prephenate dehydratase]
The Ideal System…

Comprehensive: each sequence is classified either as a member of a
family or as an “orphan” sequence

Hierarchical: families are united into superfamilies on the basis of
distant homology, and divided into subfamilies on the basis of close
homology

Allows for simultaneous use of the full-length protein and domain
information (domains mapped onto proteins)

Allows for automatic classification/annotation of new sequences
when these sequences are classifiable into the existing families

Expertly curated membership, family name, function, background, etc.

Evidence attribution (experimental vs predicted)
http://pir.georgetown.edu/
PIRSF Classification System

PIRSF:
- Reflects evolutionary relationships of full-length proteins
- A network structure from superfamilies to subfamilies

Definitions:
- Homeomorphic Family: basic unit
- Homologous: common ancestry, inferred by sequence similarity
- Homeomorphic: full-length similarity and common domain architecture
- Hierarchy: flexible number of levels with varying degrees of sequence
conservation
- Network Structure: allows multiple parents

Advantages:
- Annotate both general biochemical and specific biological functions
- Accurate propagation of annotation and development of standardized
protein nomenclature and ontology
PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or
more child nodes and zero or more parent nodes. Each homeomorphic family may
have as many domain superfamily parents as its members have domains.
Levels:
Domain Superfamily
• One common Pfam domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization
Example (Ku):
PF02735: Ku70/Ku80 beta-barrel domain
PIRSF800001: Ku70/80 autoantigen
PIRSF003033: Ku70 autoantigen
PIRSF016570: Ku80 autoantigen
PIRSF006493: Ku, prokaryotic type
Example (IGFBP):
PF00219: Insulin-like growth factor binding protein (IGFBP)
PIRSF001969: IGFBP
PIRSF500001: IGFBP-1
…
PIRSF500006: IGFBP-6
PIRSF018239: IGFBP-related protein, MAC25 type
Creation and Curation of PIRSFs
[Flowchart; labels grouped by stage]
Curation status:
- Computer-generated (uncurated) clusters
- Preliminary curation (4,700 PIRSFs): membership, signature domains
- Full curation (3,300 PIRSFs): family name, description, bibliography, PIRSF name rules
Automatic procedure:
- UniProtKB proteins; unassigned proteins; new proteins; orphans
- Automatic clustering -> preliminary homeomorphic families
- Automatic placement
- Map domains on families
Computer-assisted manual curation:
- Merge/split clusters, add/remove members -> curated homeomorphic families
- Name, refs, description; protein name rule/site rule -> final homeomorphic families
- Create hierarchies (superfamilies/subfamilies)
- Build and test HMMs
PIRSF Family Report:
Curated Protein Family Information
- Taxonomic distribution of a PIRSF can be used to infer the
evolutionary history of the proteins in the PIRSF
- Phylogenetic tree and alignment view allows further sequence
analysis
PIRSF Protein Classification:
Platform for Protein Analysis and
Annotation




Matching a protein sequence to a curated protein family
rather than searching against a protein database
Provides value-added information by expert curators,
e.g., annotation of uncharacterized hypothetical proteins
(functional predictions)
Improves automatic annotation quality
Serves as a protein analysis platform for broad range of
users
Family-Driven Protein Annotation
Objective: Optimize for protein annotation

PIRSF Classification Name
- Reflects the function when possible
- Indicates the maximum specificity that still describes the entire group
- Standardized format
- Name tags: validated, tentative, predicted, functionally heterogeneous

Hierarchy
- Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)

Name Rules
- Define conditions under which names propagate to individual proteins
- Enable further specificity based on taxonomy or motifs
- Names adhere to Swiss-Prot conventions (though we may make suggestions
for improvement)

Site Rules
- Define conditions under which features propagate to individual proteins
PIR Name Rules

- Account for functional variations within one PIRSF, including:
Lack of active site residues necessary for enzymatic activity
Certain activities relevant only to one part of the taxonomic tree
Evolutionarily related proteins whose biochemical activities are known to
differ
- Monitor such variables to ensure accurate propagation
- Propagate other properties that describe function:
EC, GO terms, misnomer info, pathway

Name Rule types:
- “Zero” Rule
Default rule (only condition is membership in the appropriate family)
Information is suitable for every member
- “Higher-Order” Rule
Has requirements in addition to membership
Can have multiple rules that may or may not have mutually exclusive conditions
Example Name Rules
Rule ID | Rule Conditions | Propagated Information
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase
PIRNR025624-0 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase
Note the lack of a zero rule for PIRSF000881
Name Rule Propagation Pipeline
Affiliation of Sequence: Homeomorphic Family or Subfamily
(whichever PIRSF is the lowest possible node)
1. Name rule exists?
No -> Nothing to propagate
Yes -> go to 2
2. Protein fits criteria for any higher-order rule?
Yes -> Assign name from Name Rule 1 (or 2 etc.)
No -> go to 3
3. PIRSF has zero rule?
Yes -> Assign name from Name Rule 0
No -> Nothing to propagate
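The decision flow of the propagation pipeline can be sketched in a few lines. This is an illustrative reading of the flowchart, not PIR's actual implementation: the `NameRule` class, the `propagate_name` function and the protein fields `pirsf` and `is_vertebrate` are hypothetical, while the rule IDs and names are taken from the example name rules shown earlier.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    rule_id: str
    name: str
    # Extra requirement beyond family membership; None marks a zero rule.
    condition: Optional[Callable[[dict], bool]] = None


# Rules from the "Example Name Rules" slide; the dict layout is an assumption.
RULES = {
    "PIRSF000881": [
        NameRule("PIRNR000881-1", "S-acyl fatty acid synthase thioesterase",
                 lambda p: p["is_vertebrate"]),
        NameRule("PIRNR000881-2", "Type II thioesterase",
                 lambda p: not p["is_vertebrate"]),
    ],
    "PIRSF025624": [NameRule("PIRNR025624-0", "ACT domain protein")],
}


def propagate_name(protein, rules=RULES):
    """Return the rule whose name should propagate, or None."""
    family_rules = rules.get(protein["pirsf"], [])
    if not family_rules:
        return None  # no name rule exists: nothing to propagate
    for rule in family_rules:  # higher-order rules are checked first
        if rule.condition is not None and rule.condition(protein):
            return rule
    for rule in family_rules:  # fall back to the zero rule, if any
        if rule.condition is None:
            return rule
    return None  # no rule fits: nothing to propagate


hit = propagate_name({"pirsf": "PIRSF000881", "is_vertebrate": True})
print(hit.rule_id, hit.name)
```

Checking higher-order rules before the zero rule mirrors the flowchart: the most specific applicable name wins, and the default name is used only when no extra condition is met.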
Name Rule in Action at UniProt
Current:
• Automatic annotations (AA) are in a separate field
• AA only visible from www.ebi.uniprot.org
Future:
• Automatic name annotations will become the DE line if the DE line
will improve as a result
• AA will be visible from all consortium-hosted web sites
PIR Site Rules

Position-Specific Site Features:
- active sites
- binding sites
- modified amino acids

Current requirements:
- at least one PDB structure
- experimental data on functional sites

Rule Definition:
- Select template structure
- Align PIRSF seed members with structural template
- Edit alignment to retain conserved regions covering all site residues
- Build Site HMM from concatenated conserved regions
Match Rule Conditions

Only propagate site annotation if all rule
conditions are met:
- Membership Check (PIRSF HMM threshold):
ensures that the annotation is appropriate
- Conserved Region Check (site HMM threshold)
- Residue Check (all position-specific residues in
HMMAlign)
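The three checks combine as a simple conjunction. The sketch below assumes the HMM scores and the alignment to the template have already been computed elsewhere; the function name, parameter names, cutoffs and residue positions are all hypothetical, and an exact residue match is a simplification of whatever a real site rule allows.

```python
def site_rule_applies(family_score, family_cutoff,
                      site_score, site_cutoff,
                      aligned_residues, required_residues):
    """Decide whether a site annotation may propagate to a protein.

    family_score / family_cutoff: score against the PIRSF family HMM and its threshold.
    site_score / site_cutoff: score against the site HMM and its threshold.
    aligned_residues: template position -> residue found in the target at that
        aligned position (assumed to come from an alignment to the site HMM).
    required_residues: template position -> residue the rule expects.
    """
    membership_ok = family_score >= family_cutoff       # membership check
    region_ok = site_score >= site_cutoff               # conserved region check
    residues_ok = all(aligned_residues.get(pos) == aa   # residue check
                      for pos, aa in required_residues.items())
    return membership_ok and region_ok and residues_ok


# Hypothetical rule requiring a catalytic Ser at 101 and His at 237.
required = {101: "S", 237: "H"}

print(site_rule_applies(310.0, 250.0, 45.0, 30.0, {101: "S", 237: "H"}, required))
print(site_rule_applies(310.0, 250.0, 45.0, 30.0, {101: "A", 237: "H"}, required))
```

The second call fails only the residue check: a family member that has lost an active-site residue keeps its membership but does not receive the site annotation.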
Rule-based Annotation of Protein Entries
Functional variations within one PIRSF (family or subfamily):
binding sites with different specificity
Monitor such variables for accurate propagation
Site Rules Feed Name Rules
Functional Site rule: tags
active site, binding, other
residue-specific information
Functional Annotation rule:
gives name, EC, other
activity-specific information
Overview
Problem:



Most new protein sequences come from genome sequencing projects
Many have unknown functions
Large-scale functional annotation of these sequences based simply on
BLAST best hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:


Homology-based (sequence analysis, structure analysis)
Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:


Highly curated and annotated protein classification system
Automatic annotation of sequences based on protein families
Facilitates:




Automatic annotation of sequences based on protein families
Systematic correction of annotation errors
Name standardization in UniProt
Functional predictions for uncharacterized proteins
Impact of Protein Bioinformatics and
Genomics

Single protein level
- Discovery of new enzymes and superfamilies
- Prediction of active sites and 3D structures
Pathway level
- Identification of “missing” enzymes
- Prediction of alternative enzyme forms
- Identification of potential drug targets
Cellular metabolism level
- Multisubunit protein systems
- Membrane energy transducers
- Cellular signaling systems
PIR Team

Dr. Cathy Wu, Director
Protein Science team
- Dr. Darren Natale (lead)
- Dr. Peter McGarvey
- Dr. Cecilia Arighi
- Dr. Anastasia Nikolskaya
- Dr. Winona Barker
- Dr. Sona Vasudevan
- Dr. Zhang-zhi Hu
- Dr. CR Vinayaka
- Dr. Raja Mazumder
- Dr. Lai-Su Yeh
Bioinformatics team
- Dr. Hongzhan Huang (lead)
- Yongxing Chen, M.S.
- Dr. Leslie Arminski
- Baris Suzek, M.S.
- Dr. Hsing-Kuo Hua
- Xin Yuan, M.S.
- Dr. Robel Kahsay
- Jian Zhang, M.S.
Students
- Natalia Petrova
UniProt Collaborators
- Dr. Rolf Apweiler (EBI)
- Dr. Amos Bairoch (SIB)
UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01