uniprot_lecture_v2.4

Protein Sequence Database:
UniProt
Jennifer McDowall
EBI is an Outstation of the European Molecular Biology Laboratory.
Overview
1) The UniProt databases
2) UniProt/SwissProt annotation
3) UniProt/TrEMBL automatic annotation
4) Using the uniprot.org website
5) Computational access
1) The UniProt databases
Source of protein sequence data
Submitters: large-scale sequencing projects, individual scientists, patent offices
• Protein sequencing is rare
• Most protein sequence is derived from nucleotide data
Nucleotide sequencing: submit to nucleotide sequence database, then derive protein sequence for the protein sequence database
Protein sequencing: submit directly to protein sequence database
Protein sequence is mainly derived data
submit
DNA sequence
ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT
transcribe
Derived mRNA sequence
AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
translate
Derived protein sequence
MRSDECCCAMSC
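The transcribe and translate steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a simplified "transcription" that only swaps T for U on the coding strand; the function names are hypothetical and not part of any UniProt tool. Translating the mRNA string shown on the slide with the standard code yields MRSDECCCAMSC (the fourth codon, GAU, encodes aspartate, D).

```python
# Standard genetic code, built from the compact U/C/A/G ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA in frame 0, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


slide_mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(slide_mrna))
```

Real gene models also need a predicted start, splice sites and stop, which this sketch ignores.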
Protein sequence is mainly derived data
The predicted start, predicted splice sites and predicted stop may not have direct evidence
submit
DNA sequence
ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT
transcribe
Derived mRNA sequence
AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
translate
Derived protein sequence
MRSDECCCAMSC
How to find the information you need?
High quality protein sequence
• Non-redundant data
• Splice isoforms, disease variants, PTMs
• Sequence archiving essential
Protein identification
• Stable identifiers
• Consistent nomenclature
Protein annotation
• Information on protein function, biological processes, molecular interactions, pathways
UniProt
Since 2002 a merger and collaboration of three databases:
Swiss-Prot & TrEMBL
PIR-PSD
Funded mainly by NIH (US) to be the highest quality,
most thoroughly annotated protein sequence database
http://www.uniprot.org/
UniProt Consortium
Where does the data come from?
Sequence sources: ENA (data exchanged daily), PDB, RefSeq, Ensembl, VEGA, patents, model organisms, more…
• All sequences go to UniParc (history of sequences)
• Sequences with known taxonomy go to UniProtKB/TrEMBL
• Metagenomic & environmental sequences go to UniMES
• UniProtKB/TrEMBL entries are manually annotated and redundancy is removed to give UniProtKB/SwissProt (high quality annotation)
• UniProtKB/TrEMBL and UniProtKB/SwissProt are grouped into UniRef clusters
• UniMES is grouped into UniMES clusters
4 components of UniProt
UniParc
• Complete history of sequences (no annotation)
• Cross-links to external sequence sources
UniProtKB
• Swiss-Prot: non-redundant, manual annotation
• TrEMBL: redundant, automatic annotation
UniMES
• Sequences from metagenomic projects
UniRef
• Combines sequences (speed searching)
• UniRef100, UniRef90, UniRef50
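The idea of a sequence archive that stores each distinct sequence once and cross-links it to every source can be sketched as follows. This is a toy illustration only: the checksum key, the record layout and the example identifiers are assumptions made for the sketch, not a description of how UniParc is implemented.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Illustrative key: checksum of the upper-cased sequence (an assumption)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def build_archive(records):
    """Group (source_db, source_id, sequence) records by identical sequence."""
    archive = {}
    for source_db, source_id, sequence in records:
        entry = archive.setdefault(
            sequence_key(sequence),
            {"sequence": sequence.upper(), "sources": []},
        )
        entry["sources"].append((source_db, source_id))
    return archive


# Hypothetical records: the identifiers below are made up for illustration.
records = [
    ("ENA", "hypothetical_cds_1", "MRSDECCCAMSC"),
    ("RefSeq", "hypothetical_np_1", "mrsdecccamsc"),  # same sequence, new source
    ("PDB", "hypothetical_chain_A", "MKTAYIAK"),
]
archive = build_archive(records)
for entry in archive.values():
    print(entry["sequence"], entry["sources"])
```

Only 100% identical sequences collapse here; the identity thresholds behind UniRef90 and UniRef50 would need an alignment step that this sketch does not attempt.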
Browsing a UniParc entry
• Accession
• Download data
• List of databases containing sequence
• Navigate to individual entries
• Deleted entries identified (greyed out)
• Sequence
Browsing a UniProtKB/SwissProt entry
• Download data
• General information
• Names (synonyms) and taxonomy
• Protein attributes
• Annotation
• Ontologies
• Protein interactions
• Splice variants
• Sequence features
• Sequence
• References
• Navigate to external data sources, e.g. Ensembl
Browsing a UniRef90 entry
Faster and more sensitive sequence search with no loss of information
• Cluster name
• Status (SwissProt and/or TrEMBL)
• List of entries in cluster
• Taxonomy of each entry
• % identity of sequences in cluster
Taxonomic distribution of species
All kingdoms:
• Bacteria (61%)
• Eukaryota (32%)
• Archaea (4%)
• Viruses (3%)
Within Eukaryota:
• Other mammals (27%)
• Fungi (18%)
• Viridiplantae (18%)
• Homo (12%)
• Other Vertebrata (10%)
• Other (8%)
• Insecta (5%)
• Nematoda (2%)
SwissProt – most represented species
Mainly model organisms
Protein Existence tag
!! Not sequence validation !!
Protein existence categories
Protein existence level          Total    Human
Evidence at protein level        13%      59%
Evidence at transcript level     12%      37.5%
Inferred from homology           70%      1%
Predicted                        5%       0.5%
Uncertain (mainly TrEMBL)        -        2%
2) UniProtKB/SwissProt annotation
Annotation sources for UniProtKB
• Manual curation
• Literature-based annotation
• Sequence analysis: protein classification (InterPro classification), signal prediction, transmembrane prediction, other predictions
• Automated annotation
Some data sources for annotation
• GO: functional info
• PRIDE: protein identification data
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
Features of UniProtKB
• Nomenclature
• Annotations
• Ontologies
• Sequence features
• Splice variants
• Sequence
• References
A wealth of external links
125 links!
Organism-specific DBs: AGD, ArachnoServer, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Enzyme & pathway DBs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Proteomic DBs: PeptideAtlas, PRIDE, ProMEX
Genome annotation DBs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Family and domain DBs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Phylogenomic DBs: HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB
Polymorphism DBs: dbSNP
Ontologies: GO
3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR
Gene expression DBs: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Protein family/group DBs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite
Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene
Protein-protein interaction DBs: DIP, IntAct, STRING
2D gel DBs: 2DBase-Ecoli, Aarhus/Ghent-2DPAGE (no server), ANU-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Others: BindingDB, DrugBank, NextBio, PMAPCutDB
SwissProt manual annotation
1. Protein sequence
• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...
2. Biological information
• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...
Problem #1: sequence correction
~20% of Swiss-Prot entries required correction
• Typical problems:
– Unsolved conflicts (sequencing errors)
– Erroneous gene model predictions
– Wrong initiation sites
– Frameshifts...
Sequence quality from genome projects
• Drosophila:
• Well-curated
• 1.8% of gene models incorrect
• Arabidopsis:
• Annotated when sequenced, but no update
• 19.5% of gene models incorrect
• Tetraodon nigroviridis:
• Automatic run through (no manual intervention)
• >90% of gene models incorrect
Sequence curation
Sequencing errors
Other examples of sequencing errors include:
premature stop codons, read-throughs, erroneous initiator methionines
Problem #2: proteome complexity
1 SwissProt entry = 1 gene (1 species)
• Genome: ~20,000 human protein-coding genes
• Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...)
• Proteome: >1,000,000 human proteins (post-translational modification)
Merging entries: annotation of sequence differences
Multiple entries for the same protein exist in TrEMBL (redundancy) because of:
1) Errors
• Erroneous gene model predictions; sequence errors
2) Natural variation
• Polymorphisms; alternative start sites; alternative splicing
Apart from 100% identical sequences, all merged sequences are analyzed by a curator so that the sequence differences can be annotated accordingly.
Example
Multiple alignment of the end of the available GCR sequences:
Annotation of the sequence differences (protein diversity):
Sequence curation
Alternative Splicing
Sequence curation
• Identification of amino acid variants
• ... and of PTMs
• ... and also domain annotation and binding sites
Sources of annotated information
UniProtKB/SwissProt gathers
information from multiple sources:
• Publications (literature/PubMed)
• Prediction programs (PROSITE, Anabelle)
• Contact with experts
• Other databases
• Nomenclature committees
Nomenclature
• Synonyms useful for literature searching
• Provides synonyms and cleavage products of bifunctional proteins
Annotation comments
>30 comment fields
Controlled vocabularies used whenever possible…
Disease association
• Mendelian Inheritance in Man provides information on genetic disease associations
• Pharmacogenomics database
Sequence annotation (Features)
…enable researchers to obtain a summary of what is known about a protein…
Feature (e.g. domain) highlighted on sequence
Gene Ontology
1. Biological Process: a commonly recognized series of events
• Cell division
• Mitosis
• Organelle fission
2. Molecular Function: an elemental activity or task or job
• Protein kinase activity
• Insulin binding
• Insulin receptor activity
3. Cellular Component: where a gene product is located
• Mitochondrion
• Mitochondrial matrix
• Mitochondrial membrane
Gene Ontology
Annotation for human Rhodopsin:
Imported annotation
Binary interactions are taken from the
database
Interactors of human p53
Evidence for annotation
UniProtKB/Swiss-Prot distinguishes between experimental and predicted data
Type of evidence                                       Evidence tag
1st: Experimental evidence                             Reference provided
2nd: Light experimental evidence                       Probable
3rd: Inferred by similarity with homologous protein    By similarity
4th: Inferred by sequence prediction                   Potential
Labels shown on the example entry: Proven, Potential, By similarity
Source references included
Versioning and archiving
Able to compare versions directly
3) UniProtKB/TrEMBL automatic annotation
UniProtKB/TrEMBL
!! Caution !!
Quality of UniProtKB/TrEMBL entries
depends upon quality of submissions
in original EMBL/GenBank/DDBJ entry.
Annotated proteins guide TrEMBL entries
Example for rhodopsin:
• 379 annotated UniProtKB/Swiss-Prot entries
• 9,186 un-annotated UniProtKB/TrEMBL entries
Un-annotated TrEMBL entries should not be left as skeleton entries with no information
Automatic annotation is added using Swiss-Prot and InterPro (function prediction database)
Automatic annotation
UniProtKB uses 2 prediction programs:
• SAAS: generates a set of decision trees using data mining (new set every UniProtKB release)
• UniRule: maintains a set of manual annotation rules
Inputs shown on the slide: Swiss-Prot, InterPro
Automatic annotation - InterPro
• Swiss-Prot: manually annotated sequence
• InterPro: protein signatures define groups of related proteins (same family or share domains)
• TrEMBL: uncharacterised sequence, annotated through the automatic annotation pipeline
Browsing a UniProtKB/TrEMBL entry
• Name (could be clone name)
• Taxonomy
• Automatic annotation (derived from InterPro)
• Ontologies (both automatic and manual curation)
4) Using the www.uniprot.org website
Useful Features
• Simple and modular advanced searching
• Integrated BLAST and Alignments
• Batch retrieval in a variety of formats
uniprot.org: anatomy of an entry
Entry Info
Link to UniSave
Link to UniRef
Variety of formats
Navigation bar
Customize order
Searching UniProt
Search tools include:
• Text Search
• Blast sequence search
• Additional search engines through
EBI (e.g. SSearch and FASTA)
http://www.uniprot.org/
Search
Powerful text search tool with autocompletion and refinement options: look for UniProt entries and documentation using biological information
• More search options
• Search sequence database, literature, taxonomy…
• Refine search
Search results
• Define type and order of search results
• Each result linked to the UniProt entry (SwissProt or TrEMBL)
• Select specific entries
• Keeps selected entries throughout session
• Can retrieve or BLAST sequence
• Can retrieve or align >2 sequences
BLAST
A tool with standard options to search sequences in UniProt databases by sequence similarity
• Search refinement (change parameters)
• Can query using protein or nucleotide sequences
Can query using identifier:
• UniProtKB accession (P00750)
• Specific version (P00750:2)
• Splice variant (P00750-2)
• Name (A4_HUMAN)
• UniParc accession (UPI0000000001)
• UniRef accession (UniRef100_P00750)
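The identifier forms listed above can be told apart with a few regular expressions. The sketch below is illustrative only: the function name and category labels are hypothetical, and the patterns are a best-effort reading of the examples on the slide together with the published UniProtKB accession format, not the website's own parser.

```python
import re

# UniProtKB accession pattern (6 or 10 characters).
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Checked in order; the first full match wins.
PATTERNS = [
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\S+")),
    ("specific version", re.compile(ACCESSION + r":\d+")),
    ("splice variant", re.compile(ACCESSION + r"-\d+")),
    ("UniProtKB accession", re.compile(ACCESSION)),
    ("entry name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify_identifier(identifier: str) -> str:
    """Return a label for the identifier form, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for example in ["P00750", "P00750:2", "P00750-2", "A4_HUMAN",
                "UPI0000000001", "UniRef100_P00750"]:
    print(example, "->", classify_identifier(example))
```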
Threshold = expectation (E) value
Provides cut-off between good and poor hits
Slide legend (values not recoverable): best; should verify; biological significance less likely
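To make the E-value threshold concrete, the sketch below uses the standard BLAST relation between a normalised bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n the database length. The slide does not state this formula, and the hit names, lengths and threshold in the example are hypothetical values chosen for illustration.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S') for a normalised bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


def keep_hits(hits, threshold: float):
    """Keep (name, e_value) hits whose E-value is at or below the threshold."""
    return [(name, e_value) for name, e_value in hits if e_value <= threshold]


# Hypothetical search: 300-residue query against 10 million database residues.
hits = [
    ("hit_A", expect_value(120.0, 300, 10_000_000)),
    ("hit_B", expect_value(40.0, 300, 10_000_000)),
    ("hit_C", expect_value(25.0, 300, 10_000_000)),
]
for name, e_value in keep_hits(hits, threshold=1.0):
    print(name, e_value)
```

A lower threshold keeps fewer, more reliable hits; a higher one admits matches that are more likely to arise by chance.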
Matrix = assigns probability score for each position
Controls sensitivity of search
Filtering = masks low complexity regions
Stretches of cysteines or hydrophobic regions can cause spurious matches
Replaces them with X’s
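The effect of low-complexity filtering can be imitated with a simple sliding-window entropy check. This is a rough sketch and not the filter BLAST itself uses: the window size, the entropy threshold and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * log2(n / total) for n in Counter(window).values())


def mask_low_complexity(sequence: str, window: int = 12,
                        threshold: float = 2.2) -> str:
    """Replace every residue in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(0, len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


# A diverse stretch followed by a run of cysteines.
print(mask_low_complexity("ACDEFGHIKLMN" + "C" * 12))
```

Sequences shorter than the window are returned unchanged, since no window can be evaluated.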
Gapped = allows gaps in sequence
• Yes = to find more distant homologues
• No = to find closest matches (strict)
Hits = limits number of results
BLAST results
• Can filter or customize results
• Shows length of query sequence aligned
• Select match to see alignment
• Further down the results page: details about matching protein sequences
• Can align checked sequences
BLAST results – pairwise alignment
• Alignment of selected sequence
• Colour alignment by annotation or properties
BLAST results – multiple alignment
• Alignment of selected sequences
• Can add additional sequences to alignment
• Colour alignment by annotation or properties
Align
ClustalW multiple alignment tool with amino acid highlighting options and a feature annotation highlighting option
Retrieve
UniProt-specific tool:
- retrieve a list of entries in several standard formats.
- then query retrieved sequences with UniProt search tool.
ID Mapping
Allows mapping between different databases for a given protein
Other tools
Sequence Similarity & Analysis: BLAST, FASTA, specialized searches
http://www.ebi.ac.uk/
http://www.ebi.ac.uk/Tools/sss/
5) Computational access
Computational access to UniProt
http://www.uniprot.org/
http://www.ebi.ac.uk/uniprot/
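As a rough sketch of what computational access can look like, the Python below builds a download address for a single entry and parses FASTA text. The slides give only the website addresses; the REST base URL used here is an assumption based on the current UniProt web service and may change, and the function names are hypothetical. The network call is wrapped in its own function so the parsing code can be used offline.

```python
from urllib.request import urlopen

# Assumption: base URL of the UniProt REST service (not given in the slides).
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession: str, fmt: str = "fasta") -> str:
    """Build the download address for one entry in the given format."""
    return f"{BASE_URL}/{accession}.{fmt}"


def parse_fasta(text: str):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_entry(accession: str):
    """Download one entry as FASTA and parse it (requires network access)."""
    with urlopen(entry_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


print(entry_url("P00750"))
```

Calling `fetch_entry("P00750")` would retrieve the accession used in the BLAST examples, provided the service is reachable at the assumed address.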
Acknowledgements
Rolf Apweiler
Ioanis Xenarios
Cathy H Wu
+100 annotators