Practical exercises
Answers…
A. Auchincloss
UniProtKB and ExPASy
Tunis, March 2007
- Nucleic acid database in Japan: DDBJ: http://www.ddbj.nig.ac.jp/
- Microarray data: ArrayExpress: http://www.ebi.ac.uk/microarray/
- Mass spectrometry data: PRIDE: http://www.ebi.ac.uk/pride/, OPD http://www.ebi.ac.uk/pride/
- Protein-protein interaction: IntAct: http://www.ebi.ac.uk/intact/site/, DIP: http://dip.doe-mbi.ucla.edu/, JCB: http://www.imb-jena.de/jcb/ppi/
- Rat enamel 2D gel electrophoresis: http://biocadmin.otago.ac.nz/fmi/xsl/toothprint/home.xsl (last revision August 2006)
- CFTR mutation: http://www.genet.sickkids.on.ca/cftr/ (this web site was last updated March 2007)
Exercise 2
E. coli K12 recombinase A (recA) in different protein sequence databases
Find, if it exists, the entry corresponding to the E. coli (strain K12) recA protein
sequence in the following protein sequence databases:
- EMBL http://www.ebi.ac.uk/embl/
- RefSeq http://www.ncbi.nlm.nih.gov/RefSeq/
- UniProtKB http://www.expasy.org/sprot/ or http://beta.uniprot.org/ find sequence(s) in
UniProtKB/Swiss-Prot and sequence(s) in UniProtKB/TrEMBL
- PIR-PSD http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml
- PDB http://www.rcsb.org/pdb/home/home.do
- UniParc (use SRS or the UniParc query tool)
- EnsEMBL http://www.ensembl.org/index.html
- Find the UniProtKB/Swiss-Prot entry corresponding to the RefSeq entry NP_036231
Hints:
• You can use the query tool provided by each database.
• You can use SRS
• You can use the crosslinks (if they exist) to go from one database to another...
• You can use the mapping tool on the new UniProt web site http://beta.uniprot.org/
EMBL: U00096
RefSeq: NC_000913
UniProtKB: P0A7G6, Swiss-Prot only (there are 2 fragments in TrEMBL, but they are not from K12)
PIR-PSD: G65049; RQECA. Retrieved from UniProt
UniParc: UPI0000112C1C
PDB: 1AA3, 1N03, 1REA, 1U94 etc., easily retrieved from UniProt.
EnsEMBL: Not possible, bacteria are not in EnsEMBL!
Exercise 3
- Find the human erythropoietin protein sequence in UniProt.
- BLASTp it at ExPASy (http://www.expasy.org/tools/blast/); restrict the BLAST to human sequences (Homo sapiens).
- Look at the BLAST results and guess from which database(s) the protein sequences are derived. How many distinct human erythropoietin protein sequences do you get?
- Do the same at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/).
- How many distinct human erythropoietin protein sequences do you get?
BLASTp at ExPASy against UniProtKB
Only 2 entries: one annotated in Swiss-Prot, the other unannotated in TrEMBL.
Looking at the Swiss-Prot entry you see a lot of rich information.
BLASTp at NCBI against nr
At least 9 entries: RefSeq (ref, 1), GenPept (embl, gb, 6) and PDB (pdb, 2).
The Swiss-Prot entry has most of these cross-references, and more besides.
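Counting "distinct" sequences in a hit list means grouping the entries whose sequences are character-for-character identical. The following is a minimal sketch of that bookkeeping, assuming the hits have been saved as FASTA text. The identifiers and sequences in the toy input are invented for illustration and are not the erythropoietin hits from this exercise.

```python
from collections import defaultdict


def group_identical(fasta_text):
    """Group FASTA headers by identical sequence (case-insensitive)."""
    groups = defaultdict(list)
    header, chunks = None, []
    # A trailing ">" acts as a sentinel so the last record is flushed too.
    for line in fasta_text.splitlines() + [">"]:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                groups["".join(chunks).upper()].append(header)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    return dict(groups)


# Toy input: three made-up records, two of which share the same sequence.
toy_fasta = """>db1|toyA
MGVHECPAWL
>db2|toyB
MGVHECPAWL
>db3|toyC
MGVHECPAWM
"""

groups = group_identical(toy_fasta)
for sequence, headers in groups.items():
    print(len(headers), "entries share", sequence, "->", ", ".join(headers))
```

Entries from several source databases that collapse into one group are redundant copies of a single sequence, which is the situation seen in the nr hit list.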
Exercise 4:
Understanding BLAST output
Compare the results of BLASTp for entry O05891
- against UniProtKB (http://www.expasy.org/tools/blast/)
- against NCBI-nr (http://www.ncbi.nlm.nih.gov/BLAST/)
Look for the same best hits and compare the scores. Why are they different?
Keep the UniProtKB output; we will use it again in a minute.
BLASTp at ExPASy against UniProtKB
BLASTp at NCBI against nr
NCBI BLAST FAQ:
http://www.ncbi.nlm.nih.gov/blast/blast_FAQs.shtml
Q: What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one
can "expect" to see just by chance when searching a database of a particular
size. It decreases exponentially with the Score (S) that is assigned to a
match between two sequences. Essentially, the E value describes the
random background noise that exists for matches between sequences. For
example, an E value of 1 assigned to a hit can be interpreted as meaning
that in a database of the current size one might expect to see 1 match with a
similar score simply by chance. This means that the lower the E-value, or the
closer it is to "0" the more "significant" the match is.
The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
The E-value of equation (1) applies to the comparison of two proteins of
lengths m and n. How does one assess the significance of an alignment that
arises from the comparison of a protein of length m to a database containing
many different proteins, of varying lengths? One view is that all proteins in the
database are a priori equally likely to be related to the query. This implies that
a low E-value for an alignment involving a short database sequence should
carry the same weight as a low E-value for an alignment involving a long
database sequence. To calculate a "database search" E-value, one simply
multiplies the pairwise-comparison E-value by the number of sequences in
the database.
The E value depends on the size of the database.
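To make the dependence on database size concrete, here is a small sketch using one common formulation of the Karlin-Altschul relation in terms of the bit score S': E = m * n * 2^(-S'), where m is the query length and n the total length of the database. The lengths and the bit score below are made-up numbers, not values from the O05891 searches, and real BLAST servers apply effective-length corrections that are ignored here.

```python
def evalue_from_bits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: the same alignment (same bit score) searched
# against a smaller and a larger database.
query_len = 300
small_db = 1_000_000_000   # assumed total residues, database A
large_db = 2_000_000_000   # assumed total residues, database B

e_small = evalue_from_bits(50.0, query_len, small_db)
e_large = evalue_from_bits(50.0, query_len, large_db)
print("E (smaller database):", e_small)
print("E (larger database): ", e_large)
```

Under these assumptions the E-value scales linearly with database size, so an identical hit looks less significant in the larger database even though the alignment itself has not changed.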
Exercise 5:
Start site issues: bacteria
Take the UniProtKB BLASTp output for O05891.
Align the first 9 sequences using ClustalW (tool on the BLAST output page). What do you see? What is one possible interpretation?
Look at the entry in UniProt. What can you see to strengthen this interpretation?
MYCTF = Mycobacterium tuberculosis strain F11. It is not clear whether it is a WGS or a fully finished genome…
In either case there has probably been an error in the start codon prediction. In bacteria several codons besides ATG can start a protein, such as GTG (Val) and TTG (Leu). That is probably what happened here: another potential start a few residues upstream, which corresponds to the predictions for other Mycobacteria, was not noticed…
O05891 has the keyword "Direct protein sequencing", and one reference has the descriptor “Protein sequence of N-terminus…”
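A quick way to check for overlooked alternative starts is to walk upstream of the annotated start codon, one codon at a time and in frame, until a stop codon is reached. The sketch below does this for the three start codons mentioned above. The DNA string is a toy example, not the M. tuberculosis sequence, and real gene finders also weigh ribosome binding sites and homology, which this ignores.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(dna, annotated_start):
    """List in-frame candidate starts upstream of annotated_start.

    Walks backwards codon by codon and stops at the first in-frame
    stop codon. Returns (position, codon) pairs, nearest first.
    """
    dna = dna.upper()
    hits = []
    pos = annotated_start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            hits.append((pos, codon))
        pos -= 3
    return hits


# Toy sequence: stop, GTG, two ordinary codons, annotated ATG at index 12.
toy_dna = "TAA" + "GTG" + "AAA" + "CCC" + "ATG" + "GCT" + "TAA"
print(upstream_starts(toy_dna, 12))
```

Any hit that also matches the start predicted in closely related genomes is a candidate for the true start, which is the interpretation suggested by the alignment.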
Exercise 6:
BLASTp and UniRef
BLAST P04150 against UniProtKB, UniRef100, UniRef90 and UniRef50 (use BLAST at ExPASy) and compare the results.
In which cluster(s) do you find the alternatively spliced sequences, and how many are there?
The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences (including some from UniParc) into a single record to speed searches.
One UniRef100 entry -> all identical sequences (including fragments): reduction of 12% of DB.
One UniRef90 entry -> sequences that have at least 90% identity: reduction of 45% of DB.
One UniRef50 entry -> sequences that are at least 50% identical: reduction of 69% of DB.
Species independent!!
UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.
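The effect of the identity threshold can be illustrated with a toy greedy clustering: each sequence joins the first cluster whose representative it matches at or above the threshold, otherwise it starts a new cluster. This is only a sketch of the idea and not how UniRef is actually built. The identity measure is a naive position-by-position comparison of equal-length strings, and the sequences are invented.

```python
def identity(a, b):
    """Fraction of identical positions (toy measure, equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Greedy clustering; the first member of each cluster is its representative."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Invented sequences: an exact duplicate, a one-residue variant,
# a half-identical relative and an unrelated sequence.
toy = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(toy, threshold)))
```

Lowering the threshold merges more sequences into fewer clusters, which is why variants that sit in separate UniRef100 or UniRef90 clusters can end up together in a single UniRef50 cluster.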
First BLAST against UniProtKB
UniRef100:
+ more further down the output
They are not all in the same cluster (remember 12% reduction in DB size)
UniRef90:
Still not all in the same cluster (remember 45% reduction in DB size)
UniRef50:
All in the same cluster (remember 69% reduction in DB size)
Exercise 7
Different looks and tools for the same entry depending on the server...
Starting with the new UniProt server (http://beta.uniprot.org/):
a. Look for the amino acid sequence of human carbonic anhydrase 2.
b. Get the corresponding nucleic acid entries in EMBL and GenBank:
try to find a nucleic acid sequence derived from genomic DNA
sequencing and another one derived from cDNA sequencing.
c. From the UniProtKB/Swiss-Prot entry, look at the data available for
the variant Pro-92 and in particular its position in the 3D structure
(Use the “Astex viewer”).
Starting with the NCBI server (http://www.ncbi.nlm.nih.gov/):
a. Look for the amino acid sequence of human carbonic anhydrase 2
using ENTREZ protein at the NCBI server.
b. Find the UniProtKB/Swiss-Prot entry and, as above, get the
corresponding nucleic acid entries in EMBL and GenBank and find the
data available for the variant Pro-92.
UniProt: P00918, follow links
NCBI, Entrez protein, can also just type in P00918
Note differences in UniProt cross-reference presentation, and in information present about a given cross-reference
Feature table ordering is very different here: numerical only
Exercise 8
Environmental sequences: how to check the quality of a protein sequence...
a. Look at DQ284920 at EMBL (http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+top+-newId): where does the sequence come from? How reliable is the translated coding sequence (CDS)?
b. How many environmental sequences are found in the nucleic acid databases (use SRS (ENV))?
c. Look at DQ380558: can you find the protein sequence in UniProtKB? Where does the annotation come from (from which type of analysis)?
a.
b.
(March 11, 22:00)
c.
Exercise 9
Genomic databases (I)
a. Look for the Swiss-Prot entry of the E. coli gene gutQ (http://beta.uniprot.org/).
b. Follow the link to EcoGene (EcoGene Database of Escherichia coli sequence and function) and find the chromosomal location.
c. Get the next E. coli gene on the same strand.
d. Follow the link to Swiss-Prot.
e. Find the subcellular localisation of the protein.
f. Which regions and domains does the protein contain? Visualize them.
g. Have a look at the domain structure in the different domain databases. In PROSITE, get the list of proteins with at least one common domain.
EcoGene page
Or in this pull down list
Note: Currently the NiceProt view
Zinc metallo-hydrolase
Flavodoxin-like
Rubredoxin-like
From InterPro
or
…
From PROSITE
Exercise 10
Protein domain / family databases
a. How many different databases are used by InterPro?
b. Do an InterPro scan with the sequence on the next page.
c. How many different domains does the protein contain?
d. How many phosphopantetheine-binding domains does the protein contain?
e. How many different protein domain databases have a discriminator for the phosphopantetheine-binding domain? Are they using patterns, profiles or HMMs? What are the most frequent domains found in Mycobacterium tuberculosis H37Rv? (Go to the integr8 site, complete proteome: http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=30.) What percentage of proteins in M. tuberculosis have a phosphopantetheine-binding domain?
MVHATACSEI IRAEVAELLG VRADALHPGA NLVGQGLDSI RMMSLVGRWR RKGIAVDFAT
LAATPTIEAW SQLVSAGTGV APTAVAAPGD AGLSQEGEPF PLAPMQHAMW VGRHDHQQLG
GVAGHLYVEF DGARVDPDRL RAAATRLALR HPMLRVQFLP DGTQRIPPAA GSRDFPISVA
DLRHVAPDVV DQRLAGIRDA KSHQQLDGAV FELALTLLPG ERTRLHVDLD MQAADAMSYR
ILLADLAALY DGREPPALGY TYREYRQAIE AEETLPQPVR DADRDWWAQR IPQLPDPPAL
PTRAGGERDR RRSTRRWHWL DPQTRDALFA RARARGITPA MTLAAAFANV LARWSASSRF
LLNLPLFSRQ ALHPDVDLLV GDFTSSLLLD VDLTGARTAA ARAQAVQEAL RSAAGHSAYP
GLSVLRDLSR HRGTQVLAPV VFTSALGLGD LFCPDVTEQF GTPGWIISQG PQVLLDAQVT
EFDGGVLVNW DVREGVFAPG VIDAMFTHQV DELLRLAAGD DAWDAPSPSA LPAAQRAVRA
ALNGRTAAPS TEALHDGFFR QAQQQPDAPA VFASSGDLSY AQLRDQASAV AAALRAAGLR
VGDTVAVLGP KTGEQVAAVL GILAAGGVYL PIGVDQPRDR AERILATGSV NLALVCGPPC
QVRVPVPTLL LADVLAAAPA EFVPGPSDPT ALAYVLFTSG STGEPKGVEV AHDAAMNTVE
TFIRHFELGA ADRWLALATL ECDMSVLDIF AALRSGGAIV VVDEAQRRDP DAWARLIDTY
EVTALNFMPG WLDMLLEVGG GRLSSLRAVA VGGDWVRPDL ARRLQVQAPS ARFAGLGGAT
ETAVHATIFE VQDAANLPPD WASVPYGVPF PNNACRVVAD SGDDCPDWVA GELWVSGRGI
ARGYRGRPEL TAERFVEHDG RTWYRTGDLA RYWHDGTLEF VGRADHRVKI SGYRVELGEI
EAALQRLPGV HAAAATVLPG GSDVLAAAVC VDDAGVTAES IRQQLADLVP AHMIPRHVTL
LDRIPFTDSG KIDRAEVGAL LAAEVERSGD RSAPYAAPRT VLQRALRRIV ADILGRANDA
VGVHDDFFAL GGDSVLATQV VAGIRRWLDS PSLMVADMFA ARTIAALAQL LTGREANADR
LELVAEVYLE IANMTSADVM AALDPIEQPA QPAFKPWVKR FTGTDKPGAV LVFPHAGGAA
AAYRWLAKSL VANDVDTFVV QYPQRADRRS HPAADSIEAL ALELFEAGDW HLTAPLTLFG
HCMGAIVAFE FARLAERNGV PVRALWASSG QAPSTVAASG PLPTADRDVL ADMVDLGGTD
PVLLEDEEFV ELLVPAVKAD YRALSGYSCP PDVRIRANIH AVGGNRDHRI SREMLTSWET
HTSGRFTLSH FDGGHFYLND HLDAVARMVS ADVR
c. 6 domains, 1 PTM, 1 family detected
d. 2 phosphopantetheine-binding domains
Pfam: HMM; PROSITE: Profile
Exercise 11
Use of UniProtKB/Swiss-Prot for creating datasets and prediction tools.
- Find proteins with the following EC numbers: 3.5.1.1, 3.5.1.38
- Look for proteins which have been experimentally proven to have an active site.
- Align the sequences.
- From the alignment suggest a pattern based around the active threonine (do this manually).
- Scan your pattern against UniProtKB/Swiss-Prot (http://expasy.org/tools/scanprosite/). How many matches do you find?
- Compare your pattern with that found in the PROSITE database, PS00144 (http://www.expasy.org/cgi-bin/prosite-search-ac?PDOC00132). How many matches in UniProtKB/Swiss-Prot are there with PS00144?
- Can you do the same with the NCBI nr data?
PS00144, have to give the EC numbers
ATGGTIAG
Scan against SP, get 13 hits
PROSITE pattern gives 517 hits against UniProt, 45 against SP
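Scanning a manually derived pattern can be imitated locally by translating PROSITE pattern syntax into a regular expression. The sketch below handles the basic elements of that syntax: `x` for any residue, `[ABC]` for allowed residues, `{ABC}` for forbidden residues, `(n)` and `(n,m)` for repeats, and `<` / `>` for terminal anchors. The second pattern and the test sequences are invented for illustration. This is not the ScanProsite implementation and it will not reproduce the hit counts above.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    p = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if p.startswith("<"):          # pattern anchored at the N-terminus
        prefix, p = "^", p[1:]
    if p.endswith(">"):            # pattern anchored at the C-terminus
        suffix, p = "$", p[:-1]
    parts = []
    for element in p.split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        # "[ABC]" and single letters are already valid regex
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return prefix + "".join(parts) + suffix


# The manual pattern from the answer, written with PROSITE separators.
manual = prosite_to_regex("A-T-G-G-T-I-A-G")
# A hypothetical, looser pattern, only to exercise the syntax.
loose = prosite_to_regex("[LIVM]-x(2)-T-{P}-x(1,3)")

toy_sequences = ["MKATGGTIAGSS", "MKATGGTVAGSS"]   # invented sequences
hits = [s for s in toy_sequences if re.search(manual, s)]
print(manual, loose, hits)
```

A strict pattern such as the manual one misses sequences with conservative substitutions, which is one reason a curated pattern with residue alternatives retrieves more matches.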
Done March 2007, ANA