tuesday_lect_prot_DBs
Protein sequence databases
Petri Törönen
Shamelessly copied from material done by Eija Korpelainen
This also includes old material from my thesis
www.hytti.uku.fi/~toronen/Gradu_verkkoon.zip
and from CSC bio-opas
http://www.csc.fi/oppaat/bio/
http://www.csc.fi/oppaat/bio/bio-opas.pdf
Why protein sequences?
• most (laboratory) analysis is done with
nucleotide sequences
• therefore the analysis at the nucleotide
level is natural
But there are drawbacks
-degeneracy of the genetic code (several codons per amino acid) => same protein,
different nucleotide sequence!
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html
-similarity between different amino acids
Therefore not all of the similarity is visible at the
nucleotide level!
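A minimal sketch of the point above, using the standard genetic code. The two nucleotide sequences are made up for illustration: they share only a few positions, yet they translate to the same peptide.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count positions where two equally long sequences carry the same letter."""
    return sum(x == y for x, y in zip(a, b))


# Two made-up sequences that use different codons for Leu, Ser and Arg.
seq_a = "CTGAGCCGT"
seq_b = "TTATCAAGA"

print(translate(seq_a), translate(seq_b))
print(identical_positions(seq_a, seq_b), "of", len(seq_a), "nucleotides identical")
```

At the protein level the two sequences are identical, while a nucleotide comparison would see mostly mismatches.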
…more…
Protein databases also often include more
detailed information.
Protein (not the RNA) is often the actual
functional unit that has a biological function.
-note the exceptions like structural RNAs.
Protein databases
• SwissProt
• TrEMBL
• PIR-PSD
SwissProt and TrEMBL (Translated EMBL) have been unified into UniProt.
Note: SwissProt is still also available as a
separate entity.
Differences between databases
• Some include all the available information (more
or less reliable information)
– large coverage, everything is stored in the database
– low reliability, information has not been confirmed
– computer annotation => updating fast
• Some cover only the reliable information
– small coverage
– information is reliable
– expert curation => updating slow
• SwissProt – TREMBL – RemTREMBL
Why is SwissProt nice?
• Sequences are manually annotated and
checked
• No multiple entries for the same sequence
• Annotations include protein function,
modifications after translation, active sites
etc.
• Linked to many other databases
So how to search protein sequences
from available databases?
• Search with a protein name
• Search with a protein's
function/descriptive words
• Search with a protein/RNA sequence
The next slides cover the first two options…
Ways to access Swiss/UniProt
http://au.expasy.org/sprot/
Expasy server for Uniprot
Note that the page includes links to ’full text search’ and to
’advanced search’
http://www.ebi.uniprot.org/uniprot-srv/uniProtPowerSearch.do
Power Search to UniProt database
http://srs.csc.fi/
One of the SRS servers available on the WWW
http://srs.ebi.ac.uk
http://srs.embl-heidelberg.de:8000/srs5/
SRS
• Sequence Retrieval System
• Allows search from several databases
• not limited to SwissProt!
• AND, OR, BUTNOT type boolean operations
can be used in the search (useful with keywords)
=> Works with sequence name and with complex
keyword queries.
• Obtained results can be further processed:
– linking to new set of databases
– includes sequence analysis, sequence alignment
Select ’start a temporary project’
Select database(s). Here I select SwissProt
Note that also other databases can be searched with SRS!
Available databases vary between the different SRS servers.
These are available fields
that can be searched with
the search term
Enter the query for finding the sequence.
Here I search with the sequence name (csk_mouse).
Search goes through all the text fields (AllText) in the SwissProt files
obtained result
More information from here
Available information on the sequence.
• The obtained result demonstrates the detailed
information available from SwissProt
• Note that the stored information includes
– information on the organism
– gene name, gene description
– links to the articles discussing the seq.
– the comments part has a detailed description of
• function
• tissue localization
– the features part has a detailed description of
• domains
• various functional components
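A small sketch of how these parts can be pulled out of an entry by program. It assumes the Swiss-Prot flat-file layout, where every line starts with a two-letter code (ID, DE, GN, OS, CC for comments, FT for features) followed by the content. The record below is an invented placeholder, not a real database entry.

```python
from collections import defaultdict

# Invented toy record in the Swiss-Prot flat-file layout (NOT a real entry).
TOY_ENTRY = """\
ID   EXAMPLE_MOUSE           Reviewed;         120 AA.
DE   Hypothetical example kinase.
GN   Name=Example1;
OS   Mus musculus (Mouse).
CC   -!- FUNCTION: Placeholder function text.
CC   -!- TISSUE SPECIFICITY: Placeholder tissue text.
FT   DOMAIN          10..60
//
"""


def group_by_line_code(entry_text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # '//' terminates an entry
            break
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(TOY_ENTRY)
print("organism:", fields["OS"])
print("comments:", fields["CC"])
print("features:", fields["FT"])
```

Real entries contain many more line types and multi-line fields, so a dedicated parser is the safer choice for actual work.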
SRS Search with boolean operators (AND, OR, BUTNOT)
Queries can be combined with & (= AND), | (= OR), ! (= BUTNOT)
Different rows are also combined (by default) with AND
The example looks for proteins with organism name either mouse OR rat.
Also, the description field must include the words receptor AND kinase BUTNOT tyrosine.
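The logic of this example query can be sketched in a few lines. This is not how SRS works internally; the records are invented and the matching is simplified to whole words, just to show how the AND, OR and BUTNOT parts combine.

```python
# Invented records standing in for database entries (hypothetical data).
records = [
    {"id": "REC1", "organism": "mouse", "description": "activin receptor kinase"},
    {"id": "REC2", "organism": "rat", "description": "tyrosine kinase receptor"},
    {"id": "REC3", "organism": "human", "description": "receptor kinase"},
    {"id": "REC4", "organism": "rat", "description": "receptor serine/threonine kinase"},
]


def matches(record):
    """(organism = mouse | rat) & (description = receptor & kinase ! tyrosine)."""
    words = set(record["description"].lower().replace("/", " ").split())
    organism_ok = record["organism"] in {"mouse", "rat"}            # OR
    description_ok = (
        "receptor" in words and "kinase" in words                   # AND
        and "tyrosine" not in words                                 # BUTNOT
    )
    return organism_ok and description_ok                           # rows combined with AND


hits = [record["id"] for record in records if matches(record)]
print(hits)
```

REC2 is dropped by the BUTNOT term and REC3 by the organism row, which mirrors how the two query rows restrict each other.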
Further linking to other databases
Go to the results of the previous search.
We can link the obtained results with the other databases by going further from this link
Selection of sequences that have a
known 3D structure
1. The subfolder with protein databases is opened by selecting "protein function, structure and interactions databases"
2. The box next to the PDB database is selected with the mouse
3. Let's select here the filtering of the obtained results to the ones that have a link to a 3D structure
Summary
• protein databases show detailed information of
protein sequences
• UniProt/SwissProt is the recommended protein
database
-manually curated
-non-overlapping
• SRS is a method for searching information from
selected databases with search terms
• Word of warning: Sometimes SRS does not
work as nicely as hoped!
Search of the protein databases
with sequences
So what can be done if we have a sequence that we
know nothing about?
We can look for similar known proteins in databases.
This can be done directly with protein sequences.
(Database searching is probably covered in more detail later. Sorry for
the wrong order!)
Nucleotide to amino acids
If you have produced a nucleotide seq. in
the laboratory you might still want to compare
it to protein sequences for the reasons given
earlier (slide 3). You’ll have two
options:
1. Use tools (like BLASTX, FastX) that
automatically compare the nucleotide seq.
to amino acid databases.
These can search for sequence similarities going from one
reading frame to another. => Simple: you don’t have to worry
about translating the sequence (see below)
BLASTX and FastX are explained in more detail later
2. Translate the seq. using available tools
(for example http://www.ebi.ac.uk/emboss/transeq/ )
-required with tools that accept only protein sequence
-remember that you do not know the reading frame!
The correct reading frame can shift from one frame
to another (sequencing errors like insertion or
deletion of nucleotides)!!
Automatic tools comparing nucl.
seq. with protein database
• BLASTX
-looks for the most similar protein
sequences for your nucleotide
sequence by comparing all possible
reading frames.
-Member of BLAST program family
http://www.ncbi.nlm.nih.gov/BLAST/
If you do a query with a protein sequence, then use this
For nucleotide sequences, BLASTX can be found here
SEQUENCE:
>embl|AB029485|AB029485
Mus musculus ARIP1 mRNA
for activin receptor interacting
protein
protein database (SwissProt) can
be selected here
You can find the seq. by searching Google for AB029485
Next window is opened here
Web page that is shown while waiting for the results.
The colour figure shows where
the match to the database entry lies
in our query sequence.
The colour indicates how good the
score is.
The E value tells how many equally good
matches can be expected
by chance
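To give a feel for the E value, here is a rough sketch based on the usual Karlin-Altschul relation between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. The lengths used below are arbitrary illustration values, and real BLAST uses corrected "effective" lengths, so the numbers will not match a real report exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score: E = m * n * 2^(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Arbitrary illustration values: a 100-residue query against 1,000,000 database residues.
for score in (30, 40, 50):
    print(score, expected_hits(score, 100, 1_000_000))
```

Each extra bit of score halves the E value, and a larger database raises it, so the same alignment looks less significant when more sequences are searched.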
The alignment can be
viewed from this link
This is the link to the database that we searched,
giving the full information on the sequence
The alignment enables
manual evaluation
of the result
Changing the nucleotides to amino acids
http://www.ebi.ac.uk/emboss/transeq/
Transeq requires you to
paste the nucleotide
sequence, to select the
reading frame (1, 2 or 3) and
to select forward or reverse
direction
An example translation
obtained from randomly
typed g,a,c,t:
DQLTCQSTVSAGLAWLAGMA
The obtained sequences
from different reading frames
can be used to search
protein databases...
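What such a translation does can be sketched as follows, using the standard genetic code. The input is a short made-up sequence. The frame labels here follow a simple convention (reverse-complement the sequence, then start at base 1, 2 or 3), which is an assumption of this sketch and may not number the reverse frames exactly as Transeq does.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first base; incomplete trailing codons are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Complement every base and reverse the string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCATTGTAATGGGCCGC")  # made-up example sequence
for name, protein in frames.items():
    print(name, protein)
```

A `*` marks a stop codon. Frames that are full of stops are unlikely to be the real coding frame, which is one practical hint when the frame is unknown.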
Motif databases
• Motifs are conserved regions in functionally
similar proteins
• These are crucial parts for protein function
– protein cannot change them without changing the
function
• Analysis of sequences with motifs can be more
efficient when no close sequence relatives are
found
– recommended when normal sequence search gives
no results
What is a motif?
Regions with strong conservation between
aligned sequences
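As a toy illustration of this definition, the sketch below marks the columns of a small alignment where all sequences agree. The three aligned sequences are invented, and real motif databases use more tolerant measures than strict identity.

```python
# Invented toy alignment (all rows have the same length).
alignment = [
    "MKVLHGDSA",
    "MRVLHGETA",
    "MKILHGDTA",
]


def conserved_columns(rows):
    """Return the 0-based indices of columns where every row has the same residue."""
    return [
        index
        for index, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"  # ignore all-gap columns
    ]


def conservation_line(rows):
    """Show conserved residues and a dot for every variable column."""
    conserved = set(conserved_columns(rows))
    return "".join(
        rows[0][i] if i in conserved else "." for i in range(len(rows[0]))
    )


print(conserved_columns(alignment))
print(conservation_line(alignment))
```

The uninterrupted run of conserved columns in the middle is the kind of region that would be described as a motif.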
modified from Terri Attwood, 2002
modified from Eija Korpelainen...
Motif databases
BLOCKS
http://blocks.fhcrc.org/
PROSITE
http://au.expasy.org/prosite/
...and more...
http://au.expasy.org/tools/
Subgroup Pattern and profile searches shows the
list of protein motif analysis tools
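A pattern search of this kind can be sketched with regular expressions. The example converts a PROSITE-style pattern into a Python regex and scans a sequence with it. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the converter handles only the basic syntax elements and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert basic PROSITE pattern syntax to a Python regular expression.

    Handled: '-' separators, x (any residue), [..] allowed residues,
    {..} forbidden residues, (n) / (n,m) repeats, '<' and '>' anchors.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".")                      # x      -> .
    return regex.replace("<", "^").replace(">", "$")


def find_motif(sequence, pattern):
    """Return (0-based start, matched text) for all matches, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(match.start(), match.group(1)) for match in regex.finditer(sequence)]


sequence = "MKNGSAANPTLNVTQ"  # made-up protein sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif(sequence, "N-{P}-[ST]-{P}"))
```

The N at position 7 is skipped because it is followed by a proline, which the pattern forbids. Short patterns like this one match often by chance, so hits need to be checked against other evidence.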
INTERPRO
http://www.ebi.ac.uk/InterProScan/
Combines many motif
databases in one
search
can take DNA or protein
sequence.
Fragment of the BLASTX
test sequence
WW domains
Important for binding
proteins
PDZ domains
Important for
protein-interactions
Kinase associated
motifs