Bioinformatics databases & sequence retrieval
Content of lecture
I. Introduction
II. Bioinformatics data & databases
III. Sequence Retrieval with MRS
Celia van Gelder
CMBI
UMC Radboud
September 2014
I. Bioinformatics questions
Lookup
• Is the gene known for my protein (or vice versa)?
• What sequence patterns are present in my protein?
• To what class or family does my protein belong?
Compare
• Are there sequences in the database which resemble the protein I cloned?
• How can I optimally align the members of this protein family?
Predict
• Can I predict the active site residues of this enzyme?
• Can I predict a (better) drug for this target?
• How can I predict the genes located on this genome?
©CMBI 2009
Sequence similarity
Imagine you sequenced this human protein.
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
You know it is a serine protease.
Which residues belong to the active site?
Is its sequence similar to the mouse serine protease?
Sequence Alignment
MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
*::* .**** **. :. :   *:**:*** : .** * *.*  *********: ****** *::
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
******* ** *:******** *.***:**** ***.*::**  *********: **.**.****
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
**:*** ***  ******: * ********:* ******:*** ****:*::** *:*.***:**
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
********** ********** ******:*.  ******** . ***.****** **********
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
********** *. ***:*** *******: : ***** ** * *****::*** ******
=> Transfer of information
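In the alignment, a `*` in the consensus line marks a column where both sequences carry the same residue. The following is a minimal sketch of how such identities could be counted for two aligned rows. The function name `count_identities` and the variable names are illustrative assumptions, not part of any alignment program. The two example strings are two blocks from the last rows of the alignment.

```python
def count_identities(row_a: str, row_b: str, gap: str = "-") -> tuple[int, int]:
    """Return (identical columns, gap-free columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # a column containing a gap is not counted
        compared += 1
        if a == b:
            identical += 1
    return identical, compared


# Two 10-residue blocks from the last rows of the alignment
seq_1 = "DGAWLLAGII" + "SWGEGCAERN"
seq_2 = "DDHWLLTGII" + "SWGEGCAD-D"

identical, compared = count_identities(seq_1, seq_2)
print(f"{identical} of {compared} gap-free columns are identical")
```

A high fraction of identical columns is what justifies transferring information, such as active site residues, from a well-studied sequence to a newly sequenced one.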
II. Bioinformatics data and databases
mRNA expression profiles
MS data
Large amount of data
Growing very fast
Heterogeneous data types
EMBL DNA database
Note: in 2012, 247 million sequence entries and 429 billion nucleotides
©CMBI 2014
Genome projects
©CMBI 2013
Biological databases (1)
Primary databases
contain biomolecular sequences or structures (experimental data!) and
associated annotation information
Sequences
• Nucleic acid sequences: EMBL, GenBank, DDBJ
• Protein sequences: SwissProt, TrEMBL, UniProt
Structures
• Protein structures: PDB
• Structures of small compounds: CSD
Genomes
• Ensembl, UCSC
©CMBI 2010
Biological databases (2)
Secondary databases
Contain data derived from primary database(s)
• Patterns, motifs, domains: PROSITE, PFAM, PRINTS, INTERPRO, ...
• Disease mutations: OMIM / MIM
• SNPs: dbSNP
• Pathways: KEGG
Databases
Data must be in a defined format for software to recognize it
Every database can have its own format but some data elements are
essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data
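As a rough illustration of these five elements, the sketch below models one database entry as a Python dataclass. The class name `DatabaseEntry` and its field names are hypothetical and do not correspond to any real database schema. The example values use the 1CRN entry discussed later in this lecture, with placeholders where the slides give no value.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatabaseEntry:
    """Hypothetical container for the five essential elements of an entry."""
    accession: str                                        # 1. unique identifier
    depositor: str                                        # 2. name of depositor
    references: List[str] = field(default_factory=list)   # 3. literature references
    deposition_date: Optional[date] = None                # 4. deposition date
    data: str = ""                                        # 5. the real data


entry = DatabaseEntry(
    accession="1CRN",
    depositor="(depositor not shown on the slides)",  # placeholder, not real data
    deposition_date=date(1981, 4, 30),
    data="(coordinates omitted)",                     # placeholder, not real data
)
print(entry.accession, entry.deposition_date.isoformat())
```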
Quality of Data
SwissProt
• Data is only entered by annotation experts
EMBL, PDB
• “Everybody” can submit data
• No human intervention when submitted;
some automatic checks
SwissProt database
• Database of protein sequences
• 546,000 sequence entries (Sept 2014)
• SwissProt is manually annotated and reviewed
• Obligatory deposit of sequence in SwissProt before publication
• SwissProt is part of UniProt
• The other main part of UniProt is TrEMBL (translated EMBL).
TrEMBL is automatically annotated and is not reviewed.
Important records in SwissProt (1)
ID   HBA_HUMAN               Reviewed;         142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
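Each line of a SwissProt entry starts with a two-letter line code (ID, AC, DT, DE, ...) followed by the content. A minimal sketch of reading such a record by grouping lines on that code is shown here. The function name `group_by_line_code` is an illustrative assumption, and the code is not a complete SwissProt parser. The sample text is the HBA_HUMAN example from this slide.

```python
from collections import defaultdict
from typing import Dict, List

# Lines of the HBA_HUMAN example shown on the slide
ENTRY = """\
ID   HBA_HUMAN               Reviewed;         142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
"""


def group_by_line_code(text: str) -> Dict[str, List[str]]:
    """Group flat-file lines by their two-letter line code."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        code = line[:2]              # the line code occupies the first two columns
        content = line[2:].strip()   # the rest of the line is the content
        records[code].append(content)
    return dict(records)


records = group_by_line_code(ENTRY)
entry_name = records["ID"][0].split()[0]
accessions = [ac.strip() for ac in " ".join(records["AC"]).split(";") if ac.strip()]
print(entry_name, accessions[0], len(records["DT"]))
```

The first accession code in the AC line is the primary one. The remaining codes are older identifiers that still point to the same entry.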
Important records in SwissProt (2)
Cross references section:
Hyperlinks to all entries in other databases which are relevant for the
protein sequence HBA_HUMAN
genes & mRNA
protein domains
diseases
structures
Important records in SwissProt (3)
Features section:
post-translational modifications, signal peptides, binding sites, enzyme
active sites, domains, disulfide bridges, local secondary structure,
sequence conflicts between references, etc.
©CMBI 2011
And finally, the amino acid sequence!
EMBL database
Nucleotide database
EMBL: 470 million sequence entries comprising 998 billion nucleotides (Sept 2014)
EMBL records follow roughly the same scheme as SwissProt records
Obligatory deposit of sequence in EMBL before publication
Most EMBL sequences are never seen by a human
Protein Data Bank (PDB)
Databank for 3-dimensional structures of biomolecules (by X-ray & NMR):
• Protein
• DNA
• RNA
• Ligands
Obligatory deposit of coordinates in the PDB before publication
~84,000 entries (Sep 2012), of which ~6,000 are “unique” structures
PDB file is a keyword-organised flat-file (80 column)
1) human readable
2) every line starts with a keyword (3-6 letters)
3) platform independent
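Because every line starts with a keyword, a program can sort the lines of a PDB file by record type before looking at any details. The sketch below assumes the keyword sits in the first six columns of each line. The helper name `record_name` is an illustrative choice, and the sample lines are taken from the 1CRN records shown in this lecture.

```python
from collections import Counter

# A few lines from the 1CRN records shown in this lecture
PDB_LINES = [
    "COMPND    CRAMBIN",
    "SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED",
    "SSBOND   1 CYS      3    CYS     40",
    "SSBOND   2 CYS      4    CYS     32",
]


def record_name(line: str) -> str:
    """Return the keyword found in the first six columns of a PDB line."""
    return line[:6].strip()


# Count how many lines of each record type are present
counts = Counter(record_name(line) for line in PDB_LINES)
for name, number in counts.items():
    print(f"{name:<6} {number}")
```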
PDB important records (1)
PDB nomenclature
Filename = accession number = PDB code
Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
HEADER
describes molecule & gives deposition date
HEADER    PLANT SEED PROTEIN                      30-APR-81   1CRN
COMPND
name of molecule
COMPND    CRAMBIN
SOURCE
organism
SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED
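PDB records are laid out in fixed columns, so fields are read by position and not by splitting on spaces. The sketch below assumes the column layout of the current PDB format description for HEADER (classification in columns 11-50, deposition date in columns 51-59, ID code in columns 63-66). The function name `parse_header` is illustrative. The example line is built with explicit field widths so that the columns are exact.

```python
from typing import Dict

# HEADER line of 1CRN, built with explicit widths so every field
# starts in the column assumed by parse_header below
HEADER_LINE = f"{'HEADER':<10}{'PLANT SEED PROTEIN':<40}{'30-APR-81':<12}{'1CRN'}"


def parse_header(line: str) -> Dict[str, str]:
    """Read the HEADER fields by column position (assumed layout, see text)."""
    if not line.startswith("HEADER"):
        raise ValueError("not a HEADER record")
    return {
        "classification": line[10:50].strip(),  # columns 11-50
        "deposition_date": line[50:59],         # columns 51-59
        "id_code": line[62:66],                 # columns 63-66
    }


header = parse_header(HEADER_LINE)
print(header["id_code"], header["deposition_date"], header["classification"])
```

Reading by position matters because a classification such as PLANT SEED PROTEIN itself contains spaces.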
PDB important records (2)
SEQRES
Sequence of the protein. Be aware: 3D coordinates are not always present
for every amino acid listed in SEQRES!
SEQRES   1     46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE      1CRN  51
SEQRES   2     46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS      1CRN  52
SEQRES   3     46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR      1CRN  53
SEQRES   4     46  CYS PRO GLY ASP TYR ALA ASN                              1CRN  54
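SEQRES lists the sequence in three-letter codes, while sequence databases such as SwissProt use one-letter codes. A minimal sketch of the conversion for the 1CRN SEQRES records follows. The function name `seqres_to_sequence` is illustrative, and the sketch assumes only the 20 standard amino acids; any other token on the line, such as the entry code or a non-standard residue, is simply skipped.

```python
from typing import List

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# SEQRES records of 1CRN as shown on the slide
SEQRES_LINES = [
    "SEQRES   1     46  THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE",
    "SEQRES   2     46  ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS",
    "SEQRES   3     46  ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR",
    "SEQRES   4     46  CYS PRO GLY ASP TYR ALA ASN",
]


def seqres_to_sequence(lines: List[str]) -> str:
    """Convert SEQRES records to a one-letter amino acid sequence."""
    residues = []
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        # Skip the record name, the line number and the residue count
        for token in line.split()[3:]:
            if token in THREE_TO_ONE:
                residues.append(THREE_TO_ONE[token])
    return "".join(residues)


sequence = seqres_to_sequence(SEQRES_LINES)
print(len(sequence), sequence)
```

The resulting length can be compared with the residue count given in the SEQRES records (46 for 1CRN) as a quick sanity check.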
SSBOND
disulfide bridges
SSBOND   1 CYS      3    CYS     40
SSBOND   2 CYS      4    CYS     32
PDB important records (3)
and at the end of the PDB file the “real” data:
ATOM
one line for each atom with its unique name and its x,y,z coordinates
ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79      1CRN  70
ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80      1CRN  71
ATOM      3  C   THR     1      15.685  12.755   5.133  1.00  9.19      1CRN  72
ATOM      4  O   THR     1      15.268  13.825   5.594  1.00  9.85      1CRN  73
ATOM      5  CB  THR     1      18.170  12.703   5.337  1.00 13.02      1CRN  74
ATOM      6  OG1 THR     1      19.334  12.829   4.463  1.00 15.06      1CRN  75
ATOM      7  CG2 THR     1      18.150  11.546   6.304  1.00 14.23      1CRN  76
ATOM      8  N   THR     2      15.115  11.555   5.265  1.00  7.81      1CRN  77
ATOM      9  CA  THR     2      13.856  11.469   6.066  1.00  8.31      1CRN  78
ATOM     10  C   THR     2      14.164  10.785   7.379  1.00  5.80      1CRN  79
ATOM     11  O   THR     2      14.993   9.862   7.443  1.00  6.94      1CRN  80
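Each ATOM line holds the atom number, atom name, residue name, residue number, x, y and z coordinates, occupancy and B-factor in fixed columns. The sketch below reads the first two atoms of 1CRN by column position and computes the distance between them. The column positions follow the PDB format description (for example x in columns 31-38), and the function name `parse_atom` is an illustrative assumption.

```python
import math
from typing import Dict, Union

# First two ATOM records of 1CRN; the column positions matter
ATOM_LINES = [
    "ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79",
    "ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80",
]


def parse_atom(line: str) -> Dict[str, Union[int, float, str]]:
    """Read one ATOM record by column position."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20],          # columns 18-20
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
    }


atoms = [parse_atom(line) for line in ATOM_LINES]
n_atom, ca_atom = atoms

# Euclidean distance between the N and CA atoms of residue 1, in Angstrom
distance = math.dist(
    (n_atom["x"], n_atom["y"], n_atom["z"]),
    (ca_atom["x"], ca_atom["y"], ca_atom["z"]),
)
print(f"{n_atom['name']}-{ca_atom['name']} distance: {distance:.3f}")
```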
Structure Visualization
Structures from PDB can be visualized with:
1. Yasara / Yasaraview (www.yasara.org)
2. SwissPDBViewer (http://spdbv.vital-it.ch/)
3. Protein Explorer (http://www.umass.edu/microbio/rasmol/)
4. Cn3D (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)
Part III: Sequence Retrieval with MRS
Google: the best generic search and retrieval system
Google searches everywhere for everything
MRS: Maarten’s Retrieval System (http://mrs.cmbi.ru.nl)
MRS searches in selected data environments
MRS is the Google of the biological database world
Search engine (like Google)
• Input/Query = word(s)
• Output = entry/entries from database
Other programs exist: Entrez, SRS, ....
MRS Search Steps
• Select database(s) of choice
• Formulate your query
• Hit “Search”
• The result is a “query set” or “hitlist”
• Analyze the results
http://mrs.cmbi.ru.nl
MRS Database Selection
You can choose between selecting all databases or just one of them.
But think about your query first!!
MRS Search options
Simply type your keywords in the keyword field and choose SEARCH.
If you know the fields of the database you are searching in you can
specify your query further
But think about your query first!!
MRS Hitlist (1)
MRS Hitlist (2)
MRS Options
MRS creates a result, or a “query set”, or “hitlist”.
With the result you can do different things in MRS:
– View the hits
– Run Blast on a single hit sequence
– Align multiple hit sequences with Clustal
MRS - View Hits
Combine in MRS
AND or &
AND is implicit
OR or |
NOT or !
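The effect of these operators can be pictured with sets of hits: AND keeps entries found by both keywords, OR keeps entries found by either, and NOT removes entries found by the second keyword. The sketch below uses a small hypothetical keyword index; the index contents are made up for illustration, and this is not how MRS is implemented internally.

```python
# Hypothetical mini index: keyword -> set of matching entry names
INDEX = {
    "hemoglobin": {"HBA_HUMAN", "HBB_HUMAN", "HBA_MOUSE"},
    "human": {"HBA_HUMAN", "HBB_HUMAN", "INS_HUMAN"},
    "alpha": {"HBA_HUMAN", "HBA_MOUSE"},
}

# hemoglobin AND human (also written: hemoglobin & human, or just: hemoglobin human)
hits_and = INDEX["hemoglobin"] & INDEX["human"]

# alpha OR human (also written: alpha | human)
hits_or = INDEX["alpha"] | INDEX["human"]

# hemoglobin NOT human (also written: hemoglobin !human)
hits_not = INDEX["hemoglobin"] - INDEX["human"]

print(sorted(hits_and))
print(sorted(hits_or))
print(sorted(hits_not))
```

AND narrows a hitlist and OR widens it, which is why it pays to think about the query before searching.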
MRS - Options
Home brings you back to the start page of MRS, the page from which you can do keyword searches.
Blast brings you to the MRS page from which you can do Blast searches.
Status lists all the currently indexed databases.
Align brings you to the MRS page from which you can do Clustal alignments.
Databank: uniprot shows the database you selected.
Help provides some help.
Try it yourself with the exercises!
Ground rules for bioinformatics
Don't always believe what programs tell you - they're often misleading & sometimes wrong!
Don't always believe what databases tell you - they're often misleading & sometimes wrong!
Don't always believe what lecturers tell you - they're sometimes wrong!
Don't be a naive user: computers don’t do biology & bioinformatics, you do!
(freely adapted from Terri Attwood)