An introduction to informatics - Swiss

UniProtKB/Swiss-Prot: Questions, Answers and a few Tips
UniProtKB: Questions and answers
Fortaleza
31.VII.2006
Everything you always wanted to
know about UniProtKB/Swiss-Prot…
and others were not afraid to ask !
Two main contact points:
[email protected]
[email protected]
Some have problems finding
a protein…
Troubles finding a protein…
I cannot find the IgG protein from Lama pacas in your
server.
“Lama pacas” = Lama guanicoe pacos (Alpaca) (Lama pacos)
40 entries in UniProtKB (5 Swiss-Prot, 35 TrEMBL), but
no IgG;
98 entries at the EMBL database, no IgG;
In addition:
Ig are not annotated in UniProtKB/Swiss-Prot (currently
many Ig sequences are stored only in UniParc);
Lama pacos is not an annotation priority.
UniProtKB/Swiss-Prot annotation priorities
(see poster SP106)
Model-organism oriented annotation
1. Complete microbial proteomes and plastid-encoded proteins (HAMAP) (SP131&132)
2. Human proteins and their orthologs in other mammals (HPI) (SP129)
3. Plant proteins (A.thaliana and rice) (PPAP) (SP133)
4. Fungal proteomes (FPAP) (SP134)
5. Proteomes of representative subsets of viral strains (SP135)
6. Toxins and anti-microbial peptides (ToxProt) (SP139)
7. Drosophila proteome (SP137)
8. C.elegans proteome (SP138)
9. Xenopus proteome (SP136) …
Priorities shared by all organisms
1. Post-Translational Modifications (PTMs) (SP126)
2. 3D structures (SP128)
3. Protein-protein interactions (SP) …
Troubles finding a protein…
Dear Folks,
I cannot find an entry for human apolipoprotein
B100 in Swiss-Prot/TrEMBL.
Am I doing something wrong?
In the future, our search engines will cope
with dashes, Roman/Arabic figures, etc.
In the annotation process, we try to
add all synonyms found for a given
protein/gene in the literature and other
databases.
Troubles finding a protein…
I am trying to locate the entry for the human beta-2
adrenoreceptor protein, but I don't seem to get any
entries. Can you help me to locate this entry, please?
The missing synonym was added
Troubles finding a protein…
I could not find the information of protein gi/34906958
From the NCBI documentation, gi numbers are:
1. restricted to GenBank (not agreed upon with EMBL and DDBJ);
2. not stable identifiers.
Of note, cross-references to RefSeq will soon be available from UniProtKB.
http://www.pir.uniprot.org/search/idmapping.shtml
… eventually they find it !
help #12995
This is my new question:
DO all the Swiss-Prot proteins of human and
Arabidopsis have CDS nucleotide sequences in
database? What should I do to get them ?
From EMBL to UniProtKB/TrEMBL
[Figure labels: Ref., CDS]
In the current UniProt release (8.4, 25-Jul-2006),
8’133 of the 230’133 UniProtKB/Swiss-Prot entries (3.5%)
have no cross-references to EMBL/GenBank/DDBJ.
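The 3.5% figure follows directly from the two entry counts quoted for release 8.4. A minimal check, with variable names chosen here purely for illustration:

```python
# Counts quoted on the slide for UniProt release 8.4 (25-Jul-2006).
without_xref = 8_133      # Swiss-Prot entries lacking EMBL/GenBank/DDBJ cross-references
total_entries = 230_133   # all Swiss-Prot entries in that release

share = 100 * without_xref / total_entries
print(f"{share:.1f}%")    # prints "3.5%"
```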
http://www.ebi.ac.uk/swissprot/Submissions/submissions.html
I found that the UNIPROT entry for human
MAPKAKK3 is still a TREMBL entry (since 1996)
and could not be found in SWISSPROT. Is there a
specific reason why certain entries do not enter
the SWISSPROT section and get an 'correct
UNIPROT ID' ?
MAPKAKK3 is not a valid gene name;
the corresponding TrEMBL entry was not found and
could not be annotated.
Please use the update request form
(or cite accession numbers)!
UniProtKB:
From TrEMBL to Swiss-Prot
and ~60 superannotators at SIB and EBI,
supported by a dedicated programming team
Sequence merge & analysis
High performance bioinformatics tools
Sequence annotation
1 gene / 1 species = 1 Swiss-Prot entry
Alternative splicing ?
Same gene ?
Polymorphisms ?
Alternative initiation ?
RNA editing ?
Usage of an alternative promoter ?
Fragment ?
Sequencing errors ?
Selenocysteine ?
-> Annotation and documentation of all the differences
Literature information
(>1’700 journals cited)
Databases and
external scientific expertise
Annotation and sequence check
In order to avoid redundancy, once manually annotated and
integrated into Swiss-Prot, the entry is deleted from TrEMBL.
Dear Curator,
I am the main author of the paper describing two new
phosphorylation sites for human growth hormone
(P01241) published in Proteomics 4:587-598(2004). One
of two phosphorylation sites, ser 176 described by us
in the paper is not listed in the expasy web site.
If the curator simply missed the site, please make the
necessary update. If ser 176 was not included in the
table feature for other reasons, please let us know.
www.expasy.org
The reference has been added…
… and the modifications described
Searching UniProtKB/Swiss-Prot
I wish to retrieve separately, all the bacteria and
viruses protein sequences with virulence factors, but
what I manage to get when I type "virulence" as a
keyword are all the protein sequences with virulence
as a keyword. Are the sequences I got here only from
bacterial and virus? Any other organisms have this
virulence factors? How could I specified the
sequences, based on viral and bacterial virulence
factors?
I'll be really appreciated if you could help me. Thank
you.
Currently:
Sequence Retrieval System (SRS)
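Independently of SRS, the same restriction (a keyword combined with a top-level taxon) can be sketched offline on flat-file text. The sketch below is only an illustration: it assumes that keywords sit on KW lines and the taxonomic lineage on OC lines, both as semicolon-separated lists. The four entries and the names `split_entries`, `field` and `select` are hypothetical and do not come from UniProtKB or SRS.

```python
# Hypothetical toy entries, NOT real UniProtKB records.
TOY_FLAT_FILE = """\
ID   TOY1_BACTX     Reviewed;          10 AA.
OC   Bacteria; Proteobacteria.
KW   Toxin; Virulence.
//
ID   TOY2_VIRUX     Reviewed;          10 AA.
OC   Viruses; dsDNA viruses.
KW   Virulence.
//
ID   TOY3_FUNGX     Reviewed;          10 AA.
OC   Eukaryota; Fungi.
KW   Virulence.
//
ID   TOY4_BACTX     Reviewed;          10 AA.
OC   Bacteria; Firmicutes.
KW   Transport.
//
"""


def split_entries(flat_text):
    """Split flat-file text into entries; each entry ends with a '//' line."""
    entries, current = [], []
    for line in flat_text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        entries.append(current)
    return entries


def field(entry, code):
    """Join the content of all lines that carry the given two-letter line code."""
    return " ".join(line[5:].strip() for line in entry if line[:2] == code)


def select(flat_text, keyword, top_taxon):
    """Return the identifiers of entries having both the keyword and the taxon."""
    hits = []
    for entry in split_entries(flat_text):
        keywords = [k.strip().rstrip(".") for k in field(entry, "KW").split(";")]
        lineage = [t.strip().rstrip(".") for t in field(entry, "OC").split(";")]
        if keyword in keywords and top_taxon in lineage:
            hits.append(field(entry, "ID").split()[0])
    return hits


print(select(TOY_FLAT_FILE, "Virulence", "Bacteria"))  # ['TOY1_BACTX']
print(select(TOY_FLAT_FILE, "Virulence", "Viruses"))   # ['TOY2_VIRUX']
```

Running the two queries separately gives the bacterial and viral sets the user asked for; the fungal toy entry shows that the keyword alone is not restricted to bacteria and viruses.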
(PR#6943)
Dear Sir/Madame,
I have a question concerning selection of data from
UniProt protein database. I wonder if there are any
examples of two or more protein entries, which concern
exactly the same protein of two or more individuals
representing the same species. In other words, I would
like to know, if each protein of a given species is
represented by exactly one amino acids sequence. If
there are some proteins of a given species which are
represented by more than one amino acids sequence,
which line of the entry should I use to group such entries
together?
UniProtKB/Swiss-Prot is non-redundant:
One Swiss-Prot entry = all protein products encoded by one gene in
one species (including fragments, variations/polymorphisms,
splice variants, sequencing errors…)
Increase in complexity
Genome: ~ 25’000 human genes (with polymorphisms)
  alternative promoter usage, alternative splicing, mRNA editing, etc.
Transcriptome: ~ 100’000 human transcripts
  Post-translational modifications (PTMs)
Proteome: ~ 1’000’000 human proteins
- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)
Multiple alignment of the C-terminus of available GCR sequences
Annotation of the sequence differences
Sequencing error (frameshift) ?
Alternative splicing ?
Polymorphism ? Disease mutation ?
Sequencing error (conflict) ?
RNA editing ?
Where to find the annotation
about alternative splicing in
UniProtKB/Swiss-Prot ?
View « by default » on the ExPASy server
Identifier & accession nr. (ID, AC, DT)
Protein and gene names, Taxonomy (DE, GN, OC, OS, OG)
Cross-references (DR)
References (RN, RP, RC, RX, RA, RL)
Keywords (KW)
Sequence description (Feature Table)
Comments (CC)
Sequence (SQ)
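The two-letter line codes listed above make an entry easy to split into its sections. The following is a minimal sketch, assuming the usual flat-file layout (line code in the first two columns, content from the sixth column, sequence lines starting with blanks). The entry, its accession number and the function name `group_by_line_code` are hypothetical and serve only as an illustration.

```python
# Hypothetical toy entry, NOT a real UniProtKB record.
TOY_ENTRY = """\
ID   TOY_HUMAN      Reviewed;          12 AA.
AC   X00000;
DE   Hypothetical toy protein.
GN   Name=TOY1;
OS   Homo sapiens (Human).
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=2;
KW   Alternative splicing.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def group_by_line_code(entry_text):
    """Group the lines of one entry by their two-letter line code."""
    groups = {}
    for line in entry_text.splitlines():
        if line.startswith("//") or not line.strip():
            continue  # skip the entry terminator and blank lines
        code = line[:2]
        if code == "  ":
            code = "SQ"  # sequence data lines carry no code; file them under SQ
        groups.setdefault(code, []).append(line[5:].rstrip())
    return groups


groups = group_by_line_code(TOY_ENTRY)
print(list(groups))   # ['ID', 'AC', 'DE', 'GN', 'OS', 'CC', 'KW', 'SQ']
print(groups["CC"])   # the comment block, where alternative products are described
```

In this toy entry the alternative splicing information shows up in two places, the comments (CC) and the keywords (KW), which is why both groups are worth inspecting.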
P04150 (GCR_HUMAN)
…
All the alternative sequences are available
for Blast searches and protein identification
tools (on the ExPASy server).
Currently in UniProtKB/Swiss-Prot, for
Homo sapiens,
14’445 entries (~ as many genes)
7’975 alternative splicing isoforms
-> 22’420 human sequences described
not taking into account other diversity
generating events…
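The total is simply the number of entries plus the number of annotated splice isoforms. A minimal check, with variable names chosen here for illustration:

```python
# Human counts quoted on the slide.
entries = 14_445          # Swiss-Prot entries for Homo sapiens (~ as many genes)
splice_isoforms = 7_975   # additional sequences described through alternative splicing

described_sequences = entries + splice_isoforms
print(described_sequences)  # prints 22420
```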
How to download the
sequences ?
And if bioinformatics is not funded properly,
we could start a new business…
Dear Sirs,
We need deacetylase for the following purposes:
1. Deacetylation of fiber obtained from chitin.
2. Chitin deacetylation for obtaining chitosan
oligosaccharides.
Evidently, it will be different types of deacetylase, because
in case of the fiber decrease of molecular weight is not
allowed, while in case of chitin deacetylation it is allowable
and even desirable for oligomerisation of the product
during deacetylation.
We ask you to send us the example Deacetylase for
chitin and its price.
Dear,
At this moment I am looking for : bovine TGF beta1
I saw in web that you have this product with part# P18341
Could you inform me the price and delivery time ?