UniProt Non-redundant Reference Cluster (UniRef) Databases

Transcript UniProt Non-redundant Reference Cluster (UniRef) Databases

The PIRSF Protein Classification System
as a Basis for
Automated UniProt Protein Annotation
www.uniprot.org
The Phosphofructokinase (PFK) Case:
ATP versus PPi dependence
Correct protein annotation relies on both global
(whole protein) and local (domain and motif)
sequence similarities. We have developed a method
by which annotation of site-specific features can be
confidently
propagated
from
experimentallycharacterized proteins to uncharacterized proteins.
The method relies upon rules that identify the
specific amino acids in a protein chain eligible for
tagging with appropriate information. Rules are
specific for a particular protein family, and rely upon
the identification of active site, binding site,
modified or other functionally-important residues in
a template sequence.
What to do?
Functional Site rule: tags active
site, binding, other residuespecific information
?
Functional Annotation rule:
gives name, EC, other activityspecific information
Figure 8. Functional variation within one protein family: binding sites with different specificity drive choice
of applicable rule to ensure appropriate annotation. Members of the phosphofructokinase (PFK) family evolved
Figure 2. Importance of annotation based on site-specific features. Rule definition begins
A general approach for functional characterization of an unknown
protein is to infer function based on similarity to a “best-hit” protein
in sequence databases. This is powerful method is nonetheless
susceptible to error. In part, these errors can be avoided by using
a curated, hierarchical, whole-protein classification database.
The advantage is conferred by “strength in numbers.” Instead of
relying on the (hopefully) accurate annotation of a single
(hopefully related) protein (usually, the BLAST best hit), using
curated classification databases allows reliance on the collected
wisdom of multiple proteins, or at least the assurance that the
members are truly related.
The annotation power of protein classification databases is even
more powerful if a single database contains families with
progressively greater levels of similarity (that is, hierarchies).
Theoretically, one query could be confidently predicted to be a
member of a parent family, but not a child family, while a different
query might be confidently assigned to both levels. In such cases,
the most-specific possible annotation could be propagated.
Despite these precautions, use of protein classification databases
cannot resolve one particular source of error: asserting that a
query protein has a particular enzymatic activity, even though it
lacks the specific residues responsible for that activity. This is
because all current classification algorithms rely on global
similarity to make functional inferences. Here we describe a more
robust method—PIR Site Rules—for inferring function of
uncharacterized proteins. The method has the added benefit of
being able to explicitly flag the residues important for a given
activity (feature [FT] line), and allowing a rule-based annotation of
other data fields, such as protein name (definition [DE] line). After
a brief summary, the case of ATP- and PPi- dependent
phosphofructokinases will be presented.
Summary
• Position-Specific Features:
– active sites
– binding sites
– modified amino acids
• Current requirements:
– at least one PDB structure
– experimental data on
functional sites
• Rule Definition:
– Select template and align with
PIRSF seed members
– Edit MSA: retain conserved
regions covering all site
residues
– Build Site HMM from
concatenated conserved
regions
pir.georgetown.edu/pirsf
with knowledge about residues important for catalytic activity or binding. PFK is a key regulatory
enzyme in the Embden-Meyerhoff glycolytic pathway. Classification of PFK proteins revealed
that major functional specialization can occur as a result of even a single amino-acid residue
change. Two amino acid positions (105 and 125, E. coli numbering) are critical determinants of
ATP or PPi utilization (boxes), with one position especially key (arrow). The ability to use ATP
depends on the presence of a glycine at the indicated position. Accurate propagation of protein
function therefore depends on crafting rules that take advantage of the ability to distinguish
between thesepossibilities,as illustrated inthenext figure.
into ATP- or PPi-dependent forms. While propagating the name “Phosphofructokinase” to all members would not be
inaccurate, it fails to take full advantage of current knowledge. The residues that contribute to each dependency are
known (see Functional Site rules PIRSR000532-4 and PIRSR000532-5). Therefore, DE line annotation should
depend on which site rule “fits,” if any. The schematic above indicates the tests and results that occur to propagate
name information. Members of PIRSF000532 are tested against the relevant site rules (black arrows). A positive
result for the ATP-dependency rule (green arrow) means that the entry would be named “ATP-dependent
phosphofructokinase,” according to the PIR Name Rule (PIRNR) PIRNR000532-1, while a positive result for the PPidependency rule (blue arrow) means that the entry would be named “Pyrophosphate-dependent
phosphofructokinase” (PIRNR000532-2). Note that failure to match either one does not mean the activity is
missing. Thus, a fall-back rule can be created (the “zero rule,” dotted red arrow) that would propagate simply
“Phosphofructokinase” without any qualifier. In the case of the query Q6AG22, having failed both specificity rules,
the zero rule would apply.
Another possibility. Most ATP-PFKs have the G-G combination, while most PPi-PFK have the D-K combination.
Figure 3. PIR Site Rule (PIRSR) definition.
Important residues on a template sequence are
indicated, along with the appropriate annotation if a
query passes all match conditions. Two rules regulate
the annotation of ligand. A match to rule 4 means the
query is ATP-dependent, while a match to rule 5
means thequery is PPi-dependent.
Hence, only rules for these two possibilities were written. However, recent evidence indicates that the G-K
combination also functions as an ATP_PFK. Thus, a new rule can be written to cover this scenario, and accordingly
Q6AG22 would be annotated as an ATP_PFK.
PIRSR000532-4
PIRSR000532-5
Conclusion
Critical to our understanding of biology is accurate and up-to-date information. The process of evolution affords us the
ability to make inferences about the nature of the proteins that govern biological processes, since like proteins often
perform like (if not exact) functions. Unfortunately, this same process has been far from a smooth transition from state
to state. The result is that inferences made about one protein based on similarity to another protein using automated
methods are often suspect. This is more than a mere annoyance. The lack of rigorous methods for propagating
appropriate information hampers knowledge discovery by either reducing the associations that can be made, or by
producing associations that should not be made. However, the recent development of methods for better annotation
hold much promise for preventing—and even reversing—the previous trend toward rampant misinformation. The
combination of hierarchical, whole-protein classifications and rule-based large-scale annotation pipelines is a significant
step in theright direction
Figure 4. Global similarity check. A
Leifsonia protein was tested against
HMMs for protein families. Q6AG22
matches PIRSF000532, a family that
contains mostly ATP-PFKs, but also a
few PPi-PFKs.
Site Rules Status
Algorithm
• Match Rule Conditions
– Membership Check
• Ensures that the annotation is
appropriate
– Conserved Region Check (site
HMM threshold)
– Site Residue Check (all
position-specific residues in
HMMAlign)
Figure 5. Further confirmation. All the proteins hit by Q6AG22 using BLAST are members of
PIRSF000532 (only partial results are shown), hence the protein was added to this family. Note
that the best characterized matchesare PPi-PFKs, but….
• Propagate Information
– Feature annotation using
controlled vocabulary
– Evidence attribution
(experimental/computational)
– Attribute sources and strengths
of evidence
PPi-PFK
ATP-PFK
Figure 6. Looks can be deceiving. Iterative clustering using BlastClust makes the initial
observationthat thequery (red arrow)might bea PPi-PFK (blue arrows)less certain.
Q9KH71
DALIAIGGEDTLGVASKFSKLGLPMIGVPKTIDKD
query
DAIIAIGGEGTLTAARRLTDAGLRIVGVPKTIDND
P0A796
DALVVIGGDGSYMGAMRLTEMGFPCIGLPGTIDND
**::.***:.:
* :::. *: :*:* ***:*
PIRSR000532-5 comparison
FAIL
PIRSR000532-4 comparison
FAIL
Figure 7. Motif check. The query protein was tested against each of the rules governing ligand
Figure 1. Annotate carefully. Annotation is propagated only if all the required residues are present. A protein
(P29780) fails the alignment test (red oval), since the rule calls for a cysteine at that position (the query has a
glycine). The other two residues (green ovals) are a match. Nonetheless, no information will be propagated (red
arrow).
binding. Rule 4 (bottom) stipulates that annotation of ATP dependence requires a G-G
combination in key positions. The query passes for only the first position, and thus fails the test.
Rule 5 (top) stipulates that annotation of PPi dependence requires a D-K combination in key
positions.The querypasses for onlythe second position,and thus fails thetest.
• 301 PIR site rules covering 168 PIRSFs have been defined.
• Site information was imported from the Catalytic site residue dataset and
Catalytic Site Atlas.
• 32 PIR site rules covering 19 PIRSF families have been manually curated
and submitted to SIB for comments & suggestions.
• The SIB suggestions will be incorporated and logfiles for annotation
propagation to Swiss-Prot entries not already in HAMAP will be submitted to
SIB.
D.A. Natale, C.R. Vinayaka, and C.H. Wu. Large-scale, classification-driven, rule-based functional
annotation of proteins. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics.
Bioinformatics Volume, Subramaniam, S. (Ed.) John Wiley & Sons, Ltd. 2004.
Swiss Institute of Bioinformatics (SIB)
European Bioinformatics Institute (EMBL-EBI)
Contact [email protected]
Protein Information Resource (PIR)
UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01
HG02712-04. Additional support for the EBI's involvement in UniProt comes from the
European Commission contract FELICS (021902) and from the NIH grant 5 P41 HG0227306. UniProtKB/Swiss-Prot activities at the SIB are supported by the Swiss Federal
Government through the Federal Office of Education and Science. PIR activities are also
supported by the NIH grants for NIAID proteomic resource (HHSN266200400061C) and
grid enablement (NCI-caBIG-ICR), and National Science Foundation grants for protein
ontology (ITR-0205470) and BioTagger (IIS-0430743).
www.uniprot.org
UniProt Non-redundant Reference Cluster (UniRef)
Databases
UniProt Non-redundant Reference Cluster (UniRef) databases, UniRef100, UniRef90 and UniRef50 are automatically
generated from UniProt Knowledgebase and selected UniParc records. The databases provide complete coverage of
sequence space while hiding redundant sequences from view. The non-redundancy allows faster sequence similarity
searches by using UniRef90 and UniRef50
UniProtKB Sequences
UniProtKB Isoform Sequences
Selected UniParc Sequences from ENSEMBL, RefSeq and PDB
databases
String Comparison:
Identifying sub-fragments and identical
sequences
UniRef100
Identical sequences and sub-fragments with 11 or more residues
are placed into a single record
CD-HIT computation:
UniRef90
Clustering UniRef100 representative
sequences at 90% level
40% size Reduction
UniRef90
Members of related UniRef100s at 90% level form a UniRef90 cluster.
The representative is selected based on the quality of the entry, name,
organism and sequence length.
Title and identifier are derived from the representative sequence.
CD-HIT computation:
Clustering UniRef90 representative
sequences at 50% level
UniRef50
Members of related UniRef90s at 50% level form a UniRef90 cluster.
UniRef50
The representative is selected based on the quality of the entry, name,
organism and sequence length.
65% size Reduction
Title and identifier are derived from the representative sequence.
Generating data files for
distribution
UniRef Release
XML file
FASTA file
>UniRef90_P00439 Phenylalanine-4-hydroxylase related cluster
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK
UniRef Usages
<?xml version="1.0" encoding="ISO-8859-1" ?>
<UniRef90 xmlns="http://uniprot.org/uniref"
<entry id="UniRef90_P00439" updated="2006-05-16">
<name>Phenylalanine-4-hydroxylase related cluster</name>
<representativeMember>
<dbReference type="UniProtKB ID" id="PH4H_HUMAN">
<property type="UniProtKB accession" value="P00439"/>
<property type="UniProtKB accession" value="Q16717"/>
<property type="UniProtKB accession" value="Q8TC14"/>
<property type="UniRef100 ID" value="UniRef100_P00439"/>
<property type="protein name" value="Phenylalanine-4-hydroxylase"/>
<property type="source organism" value="Homo sapiens (Human)"/>
<property type="NCBI taxonomy" value="9606"/>
<property type="length" value="452"/>
</dbReference>
<sequence length="452" checksum="018F00EBBBDDCE2F">
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK
</sequence>
</representativeMember>
●Speed up similarity search
●Reducing bias in homology searches by providing more even sequence space
●Using he clusters for family classification
●Using the clusters to annotate EST and other sequence databases
●Using the clusters to check the consistency of UniProtKB annotations
Swiss Institute of Bioinformatics (SIB)
European Bioinformatics Institute (EMBL-EBI)
Contact [email protected]
Protein Information Resource (PIR)
UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01
HG02712-04. Additional support for the EBI's involvement in UniProt comes from the
European Commission contract FELICS (021902) and from the NIH grant 5 P41 HG0227306. UniProtKB/Swiss-Prot activities at the SIB are supported by the Swiss Federal
Government through the Federal Office of Education and Science. PIR activities are also
supported by the NIH grants for NIAID proteomic resource (HHSN266200400061C) and
grid enablement (NCI-caBIG-ICR), and National Science Foundation grants for protein
ontology (ITR-0205470) and BioTagger (IIS-0430743).

UniProt Non-redundant Reference Cluster (UniRef) Databases

Transcript UniProt Non-redundant Reference Cluster (UniRef) Databases

Directory