UniProtKB/Swiss-Prot and ExPASy: Protein
sequence databases and proteomics tools
developed at the
Swiss Institute of Bioinformatics
Andrea Auchincloss ([email protected])
Tunis, March 19, 2007
A. Auchincloss
UniProtKB and ExPASy
Tunis, March 2007
Outline
• The Swiss Institute of Bioinformatics
• What is UniProt?
• UniProt Knowledgebase: Swiss-Prot and TrEMBL
• HPI, post-translational modifications, HAMAP
• UniRef and UniParc
• Databases for protein function and domains: PROSITE, InterPro etc.
• ExPASy; other tools
Swiss Institute of Bioinformatics
(SIB)
• Non-profit foundation created in 1998;
• Groups in Geneva, Lausanne and Basel;
• Federation of several groups (some of which
existed and collaborated long before the
foundation of the institute), about 170
researchers in 2006.
www.isb-sib.ch
SIB missions
• Development of databases and software tools;
• High-quality bioinformatics research program;
• Courses and seminars for the training of
bioinformatics research scientists. This includes a
master’s degree in proteomics and bioinformatics,
several weekly courses and a doctoral school
• Services to the Swiss Life Sciences community
(EMBnet node).
Swiss Institute of Bioinformatics:
20 research and service
groups
Proteins are organic compounds made of amino acids
arranged in a linear chain and joined by peptide bonds…
Wikipedia
Proteins are composed of 20 "standard" amino acids,
symbolised by a LETTER.
Different ‘views’ of a protein
Proteins can also work
together to perform a particular
function, and they often
associate to form complexes.
Proteins are essential parts of all living organisms and participate in
every process within cells.
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response, cell adhesion, cell
cycle, toxins….
Proteins are a necessary component in our diet, since animals cannot
synthesize all the amino acids and must obtain essential amino acids
from food.
Protein/Gene number
Organism         Number
Bacteria         182-8,591
S. cerevisiae    6,127
C. elegans       17,947
Drosophila       13,849
A. thaliana      ~25,674
Human            ~21,000
The universe in which protein
databases evolve
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
AMB, SP20
179,000,021,000
1st estimate: ~30 million species (1.5 million named)
2nd estimate:
20 million bacteria/archaea x 4,000 genes
5 million protists x 6,000 genes
3 million insects x 14,000 genes
1 million fungi x 6,000 genes
0.6 million plants x 20,000 genes
0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
0.2 million vertebrates x 21,000 genes
The calculation:
2x10^7 x 4000 + 5x10^6 x 6000 + 3x10^6 x 14000 + 10^6 x 6000 + 6x10^5 x 20000 + 2x10^5 x 20000 + 2x10^5 x 21000 + 21000 (you!)
Caveat: this is an estimate of the number of potential sequence entries, but
not that of the number of distinct protein entities in the biosphere.
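The second estimate can be checked with a few lines of Python. This is only a sketch that multiplies the species counts and genes-per-genome figures listed on the slide; the variable and function names are illustrative. With the figures as listed, the sum comes to 178,200,021,000, slightly below the 179,000,021,000 quoted earlier, so one of the inputs was presumably rounded or updated between versions of the slide.

```python
# (group, number of species, average genes per genome), as listed on the slide
ESTIMATE = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 5_000_000, 6_000),
    ("insects", 3_000_000, 14_000),
    ("fungi", 1_000_000, 6_000),
    ("plants", 600_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 200_000, 20_000),
    ("vertebrates", 200_000, 21_000),
]
YOU = 21_000  # the final "+21000 (you!)" term


def potential_sequence_entries(groups, extra=YOU):
    """Sum species x genes over all groups, plus the extra term."""
    return sum(species * genes for _, species, genes in groups) + extra


total = potential_sequence_entries(ESTIMATE)
print(f"{total:,}")
```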
What sequencing is underway right now?
Many eukaryotic & bacterial genomes (varying sizes)
Metagenomics (environmental samples)
~ 6 million sequences submitted/published in
December 2006,
~ 17 million sequences being generated at the Venter
Institute, 6 million proteins are being submitted from the
GOS (Global Ocean Sampling) trip
Protein sequences; what is sequenced?
Currently about 3.5 to 4.0 million ‘known’ protein sequences
More than 99% of these are derived by translation of
nucleotide sequences
Less than 1%: direct protein sequencing (Edman,
MS/MS…)
-> It is important that users know where the
protein sequence comes from…
(sequence & gene prediction quality)!
Level of DNA/RNA sequence quality
- DNA/RNA sequencing quality (genome or WGS, cDNA
or EST …)
- Gene prediction quality; programs used, is there manual
intervention afterwards?
For example:
Authors can specify the nature of the CDS in the nucleotide
databases by using qualifiers:
"/evidence=experimental" or "/evidence=not_experimental".
Very rarely done…
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, …
Public nucleic
acid databases
EMBL, GenBank, DDBJ
…if the submitters provide an
annotated Coding Sequence (CDS)
Public protein
sequence databases
CDS: CoDing Sequence
CDS provided by the submitters
The first Met !
CDS translation provided by EMBL
Data not submitted
Complete genome (submitted)
only ~ 1,858 CDS available!
Issue for the users:
the protein database jungle
Scientific publications
derived sequences
CoDing Sequences
provided by submitters
TrEMBL
UniProtKB
GenPept
RefSeq*
PRF
PIR
IPI
Swiss-Prot
Manually annotated
UniParc
EnsEMBL*
CCDS
* Also gene prediction
PDB
+ species-specific databases (EcoGene, TubercuList, TIGR…)
Major public protein sequence database ‘sources’
PIR, PDB, PRF
Integrated resources (‘cross-references’): UniProtKB: Swiss-Prot + TrEMBL
Separated resources: NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non redundant with Swiss-Prot (127,000 species)
GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Databank: 3D data and associated sequences
PRF: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)
Other protein sequence databases
CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species)
Consensus human and mouse sequences between 4 institutions…
Combining different approaches – ab initio, by similarity - and taking
advantage of the expertise acquired by different institutes, including
manual annotation…
EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species)
aligns some eukaryotic genomic sequences with all the sequences found
in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (→
known genes)- Also does some gene prediction (→ novel genes)
IPI: UniProtKB + RefSeq + EnsEMBL + (H-InvDB, TAIR, VEGA) (7 species)
provides a guide to the main databases that describe the human, mouse,
rat, zebrafish, Arabidopsis, chicken, and cow proteomes.
…
The UniProt consortium
European Bioinformatics Institute
European Molecular Biology Laboratory
Swiss Institute of
Bioinformatics
Protein
Information
Resource
The UniProt Consortium
UniProt (Universal Protein Resource): the world's most
comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
-UniProtKB (Swiss-Prot + TrEMBL)
-UniRef
-UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)
The Universal Protein resource components
UniProtKB (UniProt KnowledgeBase), produced by SIB and EBI
UniProtKB Release 9.7 consists of:
• UniProtKB/TrEMBL: computer annotated protein sequences; 3’600’000 entries, ~100’000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260’000 entries, ~10’000 species
UniRef (UniRef100, UniRef90, UniRef50), produced by PIR
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species.
Allows comprehensible BLAST similarity searches by providing sets of representative sequences.
UniParc (UniProt Archives), produced by EBI
~8’000’000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc.
The Universal Protein resource components (updated figures)
UniProtKB (UniProt KnowledgeBase), produced by SIB and EBI
• UniProtKB/TrEMBL: computer annotated protein sequences; 3,900,000 entries, ~127,000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries, ~11,000 species
UniRef (UniRef100, UniRef90, UniRef50), produced by PIR
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species.
Allows comprehensible BLAST similarity searches by providing sets of representative sequences.
UniParc (UniProt Archives), produced by EBI
~8,800,000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc.
UniProt web sites…
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site,
with a very powerful search engine….
http://beta.uniprot.org/
Test it! Logon:guest
Password: amazing
The UniProt groups from SIB,
EBI and PIR (Antibes, September 2004)
In Geneva (SIB):
2 Group Leaders
44 Annotators
4 Prosite annotators
22 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
1 GISAID
-----------------
85 people
At EBI (Swiss-Prot + EMBL + TrEMBL):
75 people (29 Annotators)
At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
-----------------
26 people
UniProtKB has biweekly releases; available
from about 100 servers, the main sources
being ExPASy and www.uniprot.org
UniProtKB
From EMBL (DNA) to
TrEMBL (protein)
EMBL → TrEMBL
EMBL entry: gene/protein name, taxonomy, reference, CDS
Automated extraction of the protein sequence (CDS), gene name, taxonomy and references.
Automated annotation (keywords and protein family).
! TrEMBL does not translate DNA sequences, nor does it use gene prediction programs: it only
takes the existing CDS proposed by the submitting authors in the EMBL/GenBank/DDBJ entry.
In particular, the proposed CDS and derived protein sequences can be experimentally
proven or derived from gene prediction programs (this is not obvious from the TrEMBL entry).
TrEMBL does not validate any sequences!
The quality of UniProtKB/TrEMBL data is directly
dependent on the information provided by the submitter of
the original nucleotide entry.
UniProtKB
From TrEMBL to Swiss-Prot
EMBL → TrEMBL → Swiss-Prot
EMBL → TrEMBL: automated extraction of the protein sequence (CDS), gene name and references. Automated annotation.
TrEMBL → Swiss-Prot: manual annotation of the sequence and associated biological information (derived from literature, external experts, databases…). Annotation of sequence differences (conflicts, variants, splicing…).
Average of 6 independent sequence reports for each human protein
Distinguishing Swiss-Prot and TrEMBL
– A TrEMBL entry is a computer-annotated record
derived from a coding sequence (CDS) in the
nucleotide sequence databases, not in Swiss-Prot,
after some redundancy removal and automated
annotation.
– A Swiss-Prot entry is a manually annotated record
for a given protein.
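In a local flat-file copy, the two sections can be told apart from the ID line of each entry. The sketch below assumes the ID line layout `ID   EntryName  Status;  Length AA.`, with the status `Reviewed` for Swiss-Prot and `Unreviewed` for TrEMBL; the entry name in the example is made up for illustration.

```python
def entry_section(id_line: str) -> str:
    """Return 'Swiss-Prot' or 'TrEMBL' for a UniProtKB flat-file ID line.

    Assumes the layout 'ID   EntryName  Status;  Length AA.' where the
    status is 'Reviewed' (Swiss-Prot) or 'Unreviewed' (TrEMBL).
    """
    fields = id_line.split()
    if len(fields) < 3 or fields[0] != "ID":
        raise ValueError(f"not an ID line: {id_line!r}")
    status = fields[2].rstrip(";")
    if status == "Reviewed":
        return "Swiss-Prot"
    if status == "Unreviewed":
        return "TrEMBL"
    raise ValueError(f"unknown status: {status!r}")


# Hypothetical ID lines, for illustration only
print(entry_section("ID   EXMPL_HUMAN             Reviewed;         123 AA."))
print(entry_section("ID   Q00000_HUMAN            Unreviewed;       123 AA."))
```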
UniProtKB
From TrEMBL to Swiss-Prot
Step 1: Sequence check
UniProtKB/Swiss-Prot
Non-redundant
1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS and amino acid) derived
from the same gene
-> decreases redundancy and improves sequence reliability
ii) Annotation of the sequence differences
(including conflicts, polymorphisms, splice variants etc..)
-> annotation of protein diversity
Redundancy…
UniProtKB/Swiss-Prot
~11,000 species
UniProtKB/TrEMBL
~127,000 species
260,000 + 3,800,000 ≠ 3,600,000
Redundancy in TrEMBL
&
Redundancy between TrEMBL and Swiss-Prot
In the future: redundancy is going to decrease:
"new" genome sequencing → "new" proteins
- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)
All alternatively spliced sequences are available for BLAST
searches, protein identification tools and are downloadable…
Human: ~2/3 of the human genes are alternatively spliced
- 6 genomic sequences (complete or partial)
- 1 protein sequence from PIR
Multiple alignment of the available clpB sequences
Within Swiss-Prot?
• A snapshot of the situation (December 2006):
– 28,200 entries with 82,000 sequence conflicts;
– 2,600 entries with corrected frameshifts;
– 15,100 entries with corrected initiation sites;
– 4,300 entries with other sequence ‘problems’.
• At least 43,000 entries (19% of Swiss-Prot) required a
minimal amount of annotation effort to obtain the “correct”
sequence.
Quality of protein information from genome
projects
• Proteins originating from different genome projects:
– Drosophila: what a curated (thanks to FlyBase) genome
effort should look like: only 1.8% of the gene models
conflict with what we have in UniProtKB/Swiss-Prot;
– Arabidopsis: a genome where lots of work was done to
annotate it when it was sequenced, but where nothing has
been done since (at least in the public view): 19.5% of
the gene models are erroneous;
– Tetraodon nigroviridis: a quick and dirty automatic run
through a genome with no manual intervention: >90% of
the gene models produce incorrect proteins.
– Bacteria and Archaea have almost no splicing, so
prediction is “easier”, however errors are still made…
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence
data are submitted;
• It is important to pursue our efforts in making sure we
provide to our users the most correct set of sequences for a
given organism.
New ‘Protein existence evidence’ tag
• As most protein sequences are derived from translation of
nucleotide sequences and are only predictions, the new PE
line indicates whether there is any evidence that proves the
existence of a protein;
• The ‘Protein existence evidence’ will have 5 different
qualifiers:
1. Evidence at protein level
2. Evidence at transcript level
3. Inferred from homology
4. Predicted
- Unassigned (used mostly in TrEMBL)
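The qualifiers above lend themselves to a small lookup table, for example when filtering a local set of entries. This is a minimal sketch based only on the list on this slide: the numeric keys follow the slide's numbering, anything else is treated as "Unassigned", and the function names are illustrative.

```python
# Qualifiers as listed on the slide; the fifth one has no number there.
PE_QUALIFIERS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
}
UNASSIGNED = "Unassigned"


def describe_pe(level):
    """Return the qualifier text for a PE level, or 'Unassigned'."""
    return PE_QUALIFIERS.get(level, UNASSIGNED)


def has_protein_or_transcript_evidence(level):
    """True for levels 1 and 2, i.e. evidence beyond homology or prediction."""
    return level in (1, 2)


for level in (1, 3, None):
    print(level, "->", describe_pe(level))
```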
Righting the wrongs
“Sequences are rarely deposited in a “mature” state; as with
all scientific research, DNA and protein annotation is a
continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10’000”
“Making people aware of errors is good and great; making
people aware that they’re responsible also for correcting
errors is even greater”
C. Hardley, EMBO reports, 4(9), 2003.
UniProtKB
From TrEMBL to Swiss-Prot
Step 2: Annotation:
literature
controlled vocabulary
Annotation
• The focal point of the efforts to maintain and develop
UniProtKB/Swiss-Prot;
• It is becoming more and more important as it:
– provides a summary of what is known about a protein;
– creates a template for automatic annotation for the many
organisms whose genome sequence is/will be available but
whose proteins will not be characterized;
– provides well annotated (corpus) entries to train literature
mining tools (text mining).
….
Source of data
- publications (> 1,700 journals cited)
-also external scientific expertise & other
databases
(…)
Comments: “structured free text”, 27 defined topics
Manually annotated
Information from papers,
specialized databases, computer prediction,
external experts, brain storming
Distinction between data obtained
experimentally and computerized inferences
UniProtKB
From TrEMBL to Swiss-Prot
Step 3: Sequence analysis
(bioinformatics tools)
The annotation platform
Annotators could not work without the help of our
software developers;
Anabelle: much more than a
domain annotation platform
We manually check the
results !
What else is in a UniProtKB/Swiss-Prot entry?
Cross-references; a central hub
Gasteiger E. et al, Curr. Issues Mol. Biol. 3:47-55(2001)
www.expasy.org/cgi-bin/lists?dbxref.txt
• Swiss-Prot was the first database with X-references;
• Explicitly X-referenced to 85 databases:
– DNA (EMBL/GenBank/DDBJ),
– 3D-structure (PDB)
– Family and domain (InterPro, HAMAP, PROSITE, Pfam,
etc.)
– genomic (OMIM, MGI, FlyBase, SGD, SubtiList, etc.)
– 2D-gel (e.g. SWISS-2DPAGE)
– specialized db (e.g. GlycoSuiteDB, PhosSite, MEROPS);
– literature (PubMed)
• Each UniProtKB/Swiss-Prot entry can be seen as a central hub
for the data available about the protein it describes
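Explicit cross-references are stored in the DR lines of the flat file, so the "hub" can be followed programmatically. The sketch below assumes the usual DR layout (`DR   DATABASE; identifier; further fields.`); the identifiers in the sample are invented placeholders, not real records.

```python
from collections import defaultdict


def cross_references(entry_text):
    """Group the DR lines of one flat-file entry by database name."""
    xrefs = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.startswith("DR   "):
            continue
        # Drop the line code and trailing full stop, then split on ';'
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        database = fields[0]
        identifiers = [f for f in fields[1:] if f]
        xrefs[database].append(identifiers)
    return dict(xrefs)


# Hypothetical DR lines, for illustration only
SAMPLE_ENTRY = """\
DR   EMBL; X00000; CAA00000.1; -; Genomic_DNA.
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.
DR   PROSITE; PS00000; EXAMPLE_SITE; 1.
"""

xrefs = cross_references(SAMPLE_ENTRY)
for database, records in xrefs.items():
    print(database, records)
```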
UniProtKB/Swiss-Prot explicit links
Organism-specific databases: AGD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN
Genome annotation databases: Ensembl, GenomeReviews, KEGG, TIGR
Sequence databases: EMBL, PIR, UniGene
Enzyme and pathway databases: BioCyc, Reactome
Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
3D structure databases: HSSP, PDB, SMR
PTM databases: GlycoSuiteDB, PhosSite
Protein family/group databases: GermOnline, MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC
Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp
Implicit cross-references on new
web server and ExPASy
Implicit X-references to 26 additional db added by the
ExPASy server on the www (i.e.: GeneCards,
ModBase, etc.)
These X-refs are not present as hard-coded DR lines in
the Swiss-Prot entry as it can be downloaded by ftp,
but are added on the fly when someone views an
entry on ExPASy. This can be done because enough
information is present in the UniProtKB entry to
access the related information in another db.
Example: All Swiss-Prot/TrEMBL entries are linked to the
BLOCKS domain db, via the Swiss-Prot/TrEMBL
accession number
Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further facilitate
information retrieval via controlled vocabularies
In a UniProtKB/Swiss-Prot entry, you
can expect to find:
• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function,
alternative products, PTM, tissue expression, disease, 3D-structures, etc.;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains,
PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of
various isoforms/variants.
Monitoring entry history: The UniProtKB
Sequence/Annotation Version archive
… and many useful links:
And on the new website
other tools are not yet available…
UniProt Knowledgebase
• Swiss-Prot: Manually annotated section
• TrEMBL: Automatically annotated
section
Distinguishing Swiss-Prot and
TrEMBL
Accession number: to be used when you cite a UniProt
entry anywhere (never cite the entry name (ID) alone)
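When checking citations or cleaning up identifier lists, a quick format test helps separate accession numbers from entry names. The sketch below uses the accession pattern given in the current UniProt documentation (which also covers the longer 10-character accessions introduced after this talk); the helper name is illustrative, and a format match does not prove that the entry exists.

```python
import re

# Accession number pattern as published in the UniProt help pages
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_accession(identifier: str) -> bool:
    """True if the string has the format of a UniProtKB accession number."""
    return ACCESSION_RE.fullmatch(identifier) is not None


for identifier in ("P12345", "A0A022YWF9", "CLPB_ECOLI"):
    kind = "accession" if looks_like_accession(identifier) else "not an accession"
    print(identifier, "->", kind)
```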
Non-Redundant Complete
Proteome Sets
• Text search UniProtKB keyword “Complete
proteome”, combined with an organism
name
• Or download precomputed sets (bacteria,
archaea, some eukaryotes):
ftp://ftp.expasy.org/databases/complete_proteomes/entries
• Or EBI Integr8 http://www.ebi.ac.uk/integr8/
Swiss-Prot annotation priorities
The main annotation programs:
• HAMAP (High quality Automated and Manual Annotation
of microbial Proteomes; bacteria, archaea, plastids);
• HPI (Human Proteomics Initiative);
• PPAP (Plant Proteome Annotation Project);
• FPAP (Fungal Proteome Annotation Project);
• Viral proteins;
• Tox-Prot (Toxin Annotation Project);
• ENZYMES (proteins with EC numbers);
• PTMs;
• 3D-structure;
• Protein-protein interactions;
• Quality assurance, includes controlled vocabularies.
Model organisms
• Organisms for which we want to have a more
in-depth coverage;
• Completeness, links with specialized
databases, specific documents;
• Examples: E.coli, B.subtilis, human, mouse,
fruitfly, C.elegans, yeast, S.pombe,
A.thaliana.
Human Proteomics Initiative
(HPI)
From genome to proteome
~ 21,000 human genes
→ alternative splicing of mRNA: 2-5 fold increase
~ 100,000 human transcripts
→ post-translational modifications of proteins (PTMs): 5-10 fold increase
~ 1,000,000 human proteins
Considerable increase in complexity
In the case of human genes, the Swiss-Prot/TrEMBL
redundancy is still very high:
15,803 + 53,100 ≠ about 20,000*
* human gene number estimation: 21,000-35,000
MS proteomics has verified more than 10% of human gene
products, but has not identified significant numbers of
unpredicted proteins
What is missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by
the submitters" or no DNA sequence)
• Confidential data (Patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …
Post-translational modifications
(PTMs)
PTM definition
A post-translational modification (PTM) is
a modification of a polypeptide chain involving the making or
the breaking of covalent bond(s) that occurs during (co-translational class) or after translation.
PTMs influence or even define protein function
• Phosphorylation, and possibly GlcNAcylation and S-nitrosylation, are
a means of transducing extracellular signals to the inside of the
cell.
• Methylation has a role in nuclear protein import.
• Lipid addition allows protein association with membranes (e.g. GPI-anchor, myristate, palmitate).
• Intrachain disulfide bonds and N-glycosylation influence protein
folding.
• Interchain disulfide bonds bind subunits together.
• Other PTMs are directly involved in the protein function, for
example the binding of cofactors (e.g. pyridoxal phosphate), or the
synthesis of a cofactor by the modification of amino acids present
in the protein (e.g. quinones).
PTM variety
[Table: PTM types (rows) versus the amino acid residues they modify (columns: Gly, Ala, Val, Leu, Ile, Lys, Arg, His, Asp, Glu, Asn, Gln, Cys, Ser, Thr, Met, Pro, Phe, Tyr, Trp); the individual check marks were lost in extraction.]
Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar
N-terminal modifications: acetylation, methylation, acylation, crosslinks
C-terminal modifications: GPI, amidation, crosslinks, methylation
Each protein can be modified at various sites… which gives a
high number of ‘alternative’ peptides.
283 different protein modifications are annotated in UniProtKB/Swiss-Prot…
in black: cytoplasmic modifications
in dark grey: both cytoplasmic and extracellular modifications, depending on the exact type
in light grey: extracellular modifications
Large scale experiments (LSE) for PTMs!
• PTM information can now be obtained from results
of proteomics large scale experiments (LSE);
• In the past 12 months we have added about 6’000
experimental PTMs using data originating from
some of these projects.
Proteomic studies have led to the updating of 2767 human
Swiss-Prot entries, mainly with PTM information
(UniProt release 10.0, March 2007):
Phosphorylation (83%)
Glycosylation (9%)
Subcellular location (4%)
Other PTMs (4%)
Bacteria and Archaea
(HAMAP)
In 2006, ≈130 new bacterial and archaeal genomes (not
WGS) were submitted to the DNA databases;
If on "average" 4,000 proteins/genome=>500,000 proteins!
How to cope????
High quality
Automated and
Manual
Annotation of
microbial
Proteomes
HAMAP
Lots of microbial genomes, lots of
proteins. What should we do with them
in UniProt?
http://www.expasy.org/unirule/MF_00319
Automatic annotation of proteins
belonging to specified families (1)
• This program requires the continuous
development and adaptation of software tools
as well as the development of a database of
annotation rules for each family (so far about
1,400).
Allows us to annotate automatically, yet with a very
high level of quality, proteins that belong to well
defined protein families;
Can be applied to both characterized proteins and
to some UPF’s (Uncharacterized Protein Family);
The families are based on UniProtKB/Swiss-Prot
entries, so we first do all the annotation steps
described earlier!
/www.expasy.ch/sprot/hamap/
Using HAMAP, we can currently annotate to Swiss-Prot quality
level between 10% and 50% of a complete microbial proteome
(next step: HAMAP for Fungi…)
Updates
• DNA sequence archives
– EMBL/GenBank/DDBJ is an archive
• All submitted data goes into the archive
• Submitters are responsible for the submitted sequences
and the accompanying annotation
• Nobody else can change them (including the curators at
EMBL/GenBank/DDBJ)
• Protein sequence databases
– UniProtKB/Swiss-Prot is NOT an archive
• Swiss-Prot chooses what goes into the database and
where to place it
• Swiss-Prot updates annotation and sequences when
necessary
**ZB SYP, 28-NOV-2003; ALB, 16-NOV-2004; MIM, 31-Jan-2006;
**ZB BER, 13-FEB-2006; LYG, 14-JUN-2006; LYG, 21-SEP-2006;
**ZB CHH, 05-DEC-2006;
User updates or annotation requests
Accessing & Searching UniProtKB
Direct access (keyword search)
• New search tool – we’ll use it later
• Sequence Retrieval System (SRS, Europe), will
disappear
• Entrez (NCBI, USA) – UniProtKB/Swiss-Prot (not
TrEMBL) is integrated in GenPept, but with a
changed format, and some information (e.g.
implicit cross-references) is missing
• Query tools on ExPASy & UniProt
(http://www.expasy.org/sprot/, http://www.uniprot.org)
Indirect access (sequence search)
• Bioinformatics & sequence analysis tools (Blast,
Fasta, GCG, Emboss, MS Identification tools…)
Downloading the UniProt
Knowledgebase
http://www.expasy.org/sprot/download.html
• Swiss-Prot and TrEMBL form a complete, non-redundant
database, the UniProt Knowledgebase
• Can be downloaded from
ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
• In “Swiss-Prot” format, fasta or xml format
• Complemented by sequences of alternative splice isoforms
• “everything” about “all” proteins! (at least all CDS submitted to
the public nucleotide sequence databases)
If you want to develop tools to work with
your local copy of UniProtKB:
Swissknife – a Perl parser for UniProtKB
Constantly updated according to latest format
changes
Advantage: you do not need to know how exactly
the information is stored in the flat file
• http://swissknife.sourceforge.net/
• ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
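To show what a parser such as Swissknife spares you, here is a minimal Python sketch (not Swissknife itself, which is Perl) that reads only the ID and AC lines of the flat file. It assumes the usual layout: a two-letter line code, the data starting at column 6, and "//" closing each entry. The sample text is hand-written for illustration; the second entry and the secondary accession are placeholders.

```python
from typing import Dict, Iterator, List

# Hand-written excerpt in the flat-file layout (placeholder values, not a real download).
SAMPLE = """\
ID   CYC_HUMAN               Reviewed;         105 AA.
AC   P99999; A4D166;
DE   Cytochrome c.
//
ID   TEST_HUMAN              Reviewed;         100 AA.
AC   Q00001;
//
"""


def parse_entries(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per entry with its entry name and accession numbers."""
    name = ""
    accessions: List[str] = []
    for line in text.splitlines():
        code, data = line[:2], line[5:]
        if code == "ID":
            name = data.split()[0]            # first token is the entry name
        elif code == "AC":
            # AC lines may repeat; accessions are separated by semicolons
            accessions.extend(a.strip() for a in data.split(";") if a.strip())
        elif code == "//":
            if name and accessions:
                yield {
                    "name": name,
                    "primary_ac": accessions[0],      # the stable identifier to cite
                    "secondary_acs": accessions[1:],
                }
            name, accessions = "", []


entries = list(parse_entries(SAMPLE))
for entry in entries:
    print(entry["primary_ac"], entry["name"])
```

A real parser also has to follow format changes between releases, which is the reason to prefer a maintained library.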
Take home message
• Swiss-Prot is the non-redundant, manually annotated and highly
cross-referenced section of the UniProt Knowledgebase
• Be aware of the differences between UniProtKB/TrEMBL and
UniProtKB/Swiss-Prot
– Computer vs. Human
– Redundant vs. Non-redundant
• Always cite the Accession number, not the entry name
– The AC is stable
– The entry name might change
We need your feedback and your expertise!
[email protected]
http://www.expasy.org/sprot/update.html
(and from every UniProtKB entry page on our servers)
The UniProt Consortium
UniProt (Universal Protein Resource): the world's most
comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
-UniProtKB (Swiss-Prot + TrEMBL)
-UniRef
-UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)
UniRef100, 90 and 50 clusters
One UniRef100 entry -> all identical sequences from
UniProtKB and some sections of UniParc (including
fragments, Swiss-Prot splice variants).
One UniRef90 entry -> sequences that are at least
90% identical.
One UniRef50 entry -> sequences that are at least
50% identical.
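The effect of the three thresholds can be illustrated with a toy sketch. This is not the procedure used to build UniRef: it assumes pre-aligned sequences of equal length, counts identical positions, and greedily assigns each sequence to the first cluster whose representative it matches. The sequences and function names are invented for the example.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects pre-aligned sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float) -> List[List[str]]:
    """Put each sequence in the first cluster whose representative (first member) it matches."""
    clusters: List[List[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])   # no representative was close enough: new cluster
    return clusters


# Invented toy sequences, 10 residues each.
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
at_100 = greedy_cluster(toy, 1.0)
at_90 = greedy_cluster(toy, 0.9)
at_50 = greedy_cluster(toy, 0.5)
print(len(at_100), len(at_90), len(at_50))
```

Lowering the threshold merges clusters, so a 50% set is smaller than a 90% set, which is smaller than a 100% set.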
UniRef100, 90 and 50 clusters
One cluster can contain sequences of several species,
clustering is done independently of the organism
Each cluster has a “representative”, “reference”
sequence, preferably that of the best-annotated
Swiss-Prot entry
UniRef identifiers are of the form UniRef100_P99999,
UniRef50_P00414 – not stable, as clusters are
recomputed with every biweekly release, and cluster
representatives can change!
UniRef is useful for comprehensive BLAST sequence
searches by providing sets of representative
sequences.
Implicit cross-link UniProtKB to UniRef:
new web view:
UniParc – the UniProt Archive
• 8.8 million sequences
• Sequences and cross-references (AC numbers)
• A comprehensive collection of the raw protein
sequences in public databases (including those not
submitted to the DNA databases):
Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI,
PDB, RefSeq, FlyBase, WormBase, Patent
Offices.
• UniParc can be used to track sequence versions
Use with extreme caution: also contains pseudogenes, incorrect
CDS predictions, etc., and is highly redundant!
UniParc tracks a protein
sequence and its integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data
UniParc entry UPI0000033477 part 2
TrEMBL entry probably to
be merged into Swiss-Prot
TrEMBL entry was merged
into Swiss-Prot
www.expasy.ch/prosite
PROSITE
A database of protein families and domains using two kinds of
motif descriptors:
Patterns or regular expressions :
•User friendly (easy to understand and to use)
•Well designed for the detection of biologically meaningful sites
such as residues playing a structural or functional role
•Can be used to scan a protein database in reasonable time
on any computer
Generalized profiles or weight matrices :
•Well adapted to cover the full length of the protein or domain
•Are able to detect highly divergent families or domains with
only a few well conserved positions
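Because patterns are essentially regular expressions, they can be translated mechanically. The sketch below handles a common subset of the pattern syntax (x for any residue, [..] allowed residues, {..} forbidden residues, (n) or (n,m) repeats, < and > anchors at the ends); the function name is hypothetical and the example is a C2H2 zinc-finger-style pattern used purely for illustration. It is not the ScanProsite implementation.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (common subset) into a Python regex."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):            # repeat count such as (3) or (2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                       # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                   # single residue or [allowed set]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
hit = re.search(regex, "MKCAACAAALAAAAAAAAHAAAHGG")   # invented test sequence
print(regex, hit.start() if hit else None)
```

Anchors written inside brackets and other rarer constructs are not covered by this sketch.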
Identification of protein domains and families
• There are two non-exclusive approaches for the determination
of the function of an uncharacterized protein:
– Comparison with a complete sequence database (BLAST)
– Scanning a database of patterns and profiles
• Most proteins can be grouped into families. Proteins belonging
to a particular family share functional attributes and are
derived from a common ancestor;
• Some regions in the sequence are more conserved than
others during evolution because they are important for the
function or the structure of the protein;
• Like fingerprints for police identification, signatures built out of
sequence patterns or profiles can be used to formulate
hypotheses about the function of uncharacterized proteins.
Definitions of conserved regions
Conserved regions can be classified into 5 different groups:
• Families: proteins that have the same domain arrangement,
be it one or many domains.
• Domains: specific combinations of secondary structures that
assume characteristic three-dimensional structures or folds.
• Repeats: structural units always found in two or more copies
that assemble into a specific fold. Assemblies of repeats might
also be thought of as domains.
• Motifs: short regions with conserved active or binding sites
that usually adopt a folded conformation only in association
with their ligands.
• Sites: functional residues (active sites, disulfide bridges,
post-translationally modified residues)
Conserved regions (2)
CSA_PPIASE
Binding cleft (motif)
Cys 181: active site residue
PPID family: 1 CSA_PPIASE domain + 3 TPR repeats
http://www.expasy.org/tools/scanprosite/
Functionally and structurally relevant
residues in PROSITE motif descriptors
A new concept to extract more information from
profiles
Principle :
• Combining the advantages of profiles (high
sensitivity) and patterns (position-specific
information)
• Tagging of amino acids at precise positions in
the profile and checking their presence in the
matched sequence
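The tagging principle can be sketched as follows. This is a simplified illustration, not the ProRule format: it assumes an ungapped match, so a tagged profile position maps to the sequence by a plain offset, whereas a real profile alignment may contain insertions and deletions. The rule, sequence and function name are invented.

```python
from typing import Dict, List, Tuple


def check_tagged_residues(
    sequence: str, match_start: int, tagged: Dict[int, str]
) -> List[Tuple[int, str, bool]]:
    """For each tagged offset in a match, report (1-based position, residue, is_allowed).

    match_start is the 0-based index where the match begins;
    tagged maps an offset inside the match to the acceptable residues.
    """
    report = []
    for offset, allowed in sorted(tagged.items()):
        position = match_start + offset
        residue = sequence[position] if position < len(sequence) else "-"
        report.append((position + 1, residue, residue in allowed))
    return report


# Hypothetical rule: offset 3 must be Cys (active site), offset 7 must be His or Asp (binding).
rule = {3: "C", 7: "HD"}
ok = check_tagged_residues("MAGSKLCTWRHEE", match_start=3, tagged=rule)
mutated = check_tagged_residues("MAGSKLSTWRHEE", match_start=3, tagged=rule)
print(ok)
print(mutated)
```

In the second call the profile would still match, but the check on the tagged active-site position fails, so that feature would not be annotated.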
ProRule
Aim:
• Provide users with biologically meaningful functional and
structural information:
active sites,
post-translational modification sites,
binding sites,
disulfide bonds,
transmembrane regions.
• Help the UniProtKB/Swiss-Prot annotation and provide
enhanced homogeneity:
domain name and boundaries,
keywords and linked GO terms,
EC numbers,
false negative PROSITE patterns.
www.expasy.ch/prosite/prorule.html
Sigrist et al.: Bioinformatics 21:4060-4066(2005)
Other methods for protein/domain
identification
Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: Hidden
Markov Models (HMM), Probabilistic models;
PRINTS: “Unweighted” matrices; protein fingerprints
BLOCKS: Weight matrix derived from ungapped alignments;
PIRSF, SUPERFAMILY: classification system based on
evolutionary relationship of whole proteins
ProDom: automatic compilation of homologous domains based
on recursive PSI-BLAST searches.
The InterPro project
www.ebi.ac.uk/interpro
Integrated Documentation Resource of Protein Families,
Domains and Functional Sites
• Unification of PROSITE, PRINTS, Pfam and ProDom into an
integrated resource of protein families, domains and
functional sites in 2000;
• Joint effort in creating a unified yet methodologically diverse
system for protein family/domain identification;
• Single set of “documents” linked to the various methods;
• Distributed with tools by anonymous FTP and through www
servers;
• Used to enhance the functional annotation of UniProtKB
(Swiss-Prot and TrEMBL)
• Has progressively incorporated other databases
Current status of InterPro
Release 14.1 (February 2007) was built from Pfam, PRINTS,
PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based
SUPERFAMILY, Gene3D and PANTHER, and the current
UniProt/Swiss-Prot + TrEMBL data.
(for details see http://www.ebi.ac.uk/interpro/release_notes.html)
InterPro release 14.1 contains 13,953 entries, representing 3,911
domains, 9,610 families, 232 repeats, 34 active sites, 20 binding
sites and 19 post-translational modification sites. Overall, there are
15,880,845 InterPro hits from 3,100,874 UniProtKB protein
sequences.
92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences
have one or more InterPro hits.
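As a quick reading aid, the release figures above imply an average number of hits per matched sequence. The short calculation below uses only the two totals quoted for release 14.1; the variable names are ours.

```python
# Totals quoted above for InterPro release 14.1.
hits = 15_880_845
sequences_with_hits = 3_100_874

mean_hits = hits / sequences_with_hits
print(f"{mean_hits:.2f} InterPro hits per matched UniProtKB sequence")
```

So a matched sequence carries about five InterPro hits on average.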
http://www.ebi.ac.uk/interpro/
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304
InterPro: Graphical domain representation
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18
The ExPASy www server
• First molecular biology server on the Web (August
1993); ~500 million accesses since;
• Dedicated to proteomics:
– Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.;
– Many 2D/MS protein identification/characterization and
sequence analysis tools;
• Mirror sites in Australia, Brazil, Canada, China and
Korea: http://{au|br|ca|cn|kr|www}.expasy.org
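The {au|br|ca|cn|kr|www} notation is shorthand for one host name per alternative. A small sketch that expands it (the helper name is hypothetical, and it only handles a single group):

```python
import re
from typing import List


def expand_alternatives(template: str) -> List[str]:
    """Expand a single {a|b|c} group in a URL template into one URL per alternative."""
    match = re.search(r"\{([^{}]+)\}", template)
    if match is None:
        return [template]
    head, tail = template[:match.start()], template[match.end():]
    return [head + alt + tail for alt in match.group(1).split("|")]


mirrors = expand_alternatives("http://{au|br|ca|cn|kr|www}.expasy.org")
for url in mirrors:
    print(url)
```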
ExPASy software tools
• Tools for the display and management of databases
(NiceProt, Swiss-Shop sequence alerting system,
etc.);
• Tools for sequence analysis (ScanProsite,
ProtParam, ProtScale, RandSeq, Translate, etc.);
• Proteomics tools (AACompIdent, FindMod,
FindPept, Aldente, PeptideMass, TagIdent, etc.);
• 3D-structure analysis and display tools (SWISS-MODEL, Swiss-PdbViewer)
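Several of the proteomics tools, PeptideMass for example, start from an in-silico enzymatic digest of the sequence. The sketch below shows the commonly quoted simplified trypsin rule (cleave after K or R unless the next residue is P). It is an illustration under that assumption, not the code behind the ExPASy tools, and the sequence is invented.

```python
import re
from typing import List


def tryptic_peptides(sequence: str) -> List[str]:
    """Cut after every K or R that is not immediately followed by P.

    Relies on re.split accepting zero-width matches (Python 3.7 or later).
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


peptides = tryptic_peptides("MKWVTFRPLLKAGR")   # invented test sequence
print(peptides)
```

The R in "FRP" is not cleaved because proline follows it, so the middle peptide stays in one piece.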
http://www.expasy.org/tools/
Identification: Aldente, TagIdent, AACompIdent, MultiIdent
Characterization: FindMod, GlycoMod, FindPept
Analysis: PeptideMass, GlycanMass, BioGraph, PeptideCutter, ProtScale, ProtParam
- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyper-links between tools and databases
http://www.expasy.org/links.html
Finding out about recent
developments:
UniProtKB/Swiss-Prot recent format changes:
http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes:
http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins:
http://www.expasy.org/swiss-flash/
What’s new on ExPASy:
http://www.expasy.org/history.html
References (1)
UniProtKB/Swiss-Prot:
http://www.expasy.org/sprot/sprot-ref.html
Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein
information. Nucleic Acids Res. 34:D187-191(2006).
Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its
biological context. Comptes Rendus Biologies 328:882-99(2005).
Bairoch A. Swiss-Prot: Juggling between evolution and stability.
Brief. Bioinform. 5:39-55(2004).
Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot
knowledgebase. Proteomics 4:1537-1550(2004).
Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein
database. Curr. Issues Mol. Biol. 3:47-55(2001).
References (2)
PROSITE:
Hulo N. et al. The PROSITE database. Nucleic Acids Res. 34:D227-D230(2006).
Sigrist C.J.A. et al. PROSITE: a documented database using patterns and
profiles as motif descriptors. Brief. Bioinform. 3:265-274(2002).
Gattiker A. et al. ScanProsite: a reference implementation of a PROSITE
scanning tool. Applied Bioinformatics 1:107-108(2002).
Sigrist C.J.A. et al. ProRule: a new database containing functional and
structural information on PROSITE profiles. Bioinformatics 21:4060-4066(2005).
ExPASy:
Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and
analysis. Nucleic Acids Res. 31:3784-3788(2003).
Useful general publications
• Nucleic Acids Res. Database issue 2006, vol.
34, supplement 1:
http://nar.oupjournals.org/content/vol34/suppl_1/
• Nucleic Acids Res. Web server issue 2005,
vol. 33, supplement 2:
http://nar.oupjournals.org/content/vol33/suppl_2/
• Book: Bioinformatics for Dummies, by J.-M.
Claverie and C. Notredame
Publisher: For Dummies; 2nd edition (December, 2006)
ISBN: 0764516965
Take home message
• We need your feedback!
[email protected]
Or via the website
Before the introduction to Swiss-Prot/ExPASy…
After the introduction to Swiss-Prot /ExPASy …
Some practical exercises:
http://education.expasy.org/cours/Tunis/
1. Finding databases
2. Comparing protein databases
3. Comparing BLAST programs
4. BLAST output
5. Bacterial start sites
6. UniRef
7. Different views of UniProtKB
8. Environmental sequences
9. Inter-database links & PROSITE
10. InterPro
11. Using UniProtKB/Swiss-Prot to create datasets