Web resources in structural biology

Structure databases, searches
and alignments
Marian Novotny
[email protected]
Molecular Bioinformatics X3
Outline
1. Structure databases - why do we need them?
- types of structural databases
- Protein Data Bank
- other useful databases
2. Searches - text searches
3. Structure searches (alignments) - why?
- how ?
- comparison of available tools
Structure databases
Why?
- data tend to get lost
- source of information for further analysis
- better access to data by general public
- validation of data is (sometimes) possible
Database is…
…. a structured collection of data held in computer storage;
esp. one that incorporates software to make it accessible in
a variety of ways; transf., any large collection of information.
Oxford English dictionary
…..a usually large collection of data organized especially for
rapid search and retrieval (as by a computer)
Merriam-Webster Online
Databases
Primary databases
Added-value databases
Derived databases
RCSB
MSD
PDBJ
NDB
CSD
OCA
PDBSum
EDS
Whatcheck
Jena Image library
ftp archive of flat files
Primary databases
- repositories of experimental data of macromolecular
structures (X-ray, NMR, electron microscopy…)
- RCSB (USA), MSD (Europe) and PDBJ (Japan) collaborate
to form wwPDB. Data can be submitted to any of these
databases. Databases interchange their new data on a regular
basis, so they have an identical content.
- primary databases differ in presentation of data and the
amount of extra services and links they provide
The Protein Data Bank (PDB)
- established in 1971 by Walter Hamilton at Brookhaven National
Laboratory
- seven structures were deposited at the beginning
- the database was distributed on magnetic tapes
- RCSB is now run by a consortium of three institutions (San Diego Supercomputer
Center, Rutgers University and the Center for Advanced Research in Biotechnology)
- 29326 structures (26.01.2005)
- distributed over internet
- released once a week
HEADER HYDROLASE 27-OCT-03 1UR9
TITLE INTERACTIONS OF A FAMILY 18 CHITINASE WITH THE DESIGNED
TITLE 2 INHIBITOR HM508, AND ITS DEGRADATION PRODUCT,
TITLE 3 CHITOBIONO-DELTA-LACTONE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CHITINASE B;
COMPND 3 CHAIN: A, B;
COMPND 4 EC: 3.2.1.14;
COMPND 5 ENGINEERED: YES;
COMPND 6 MUTATION: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: SERRATIA MARCESCENS;
SOURCE 3 STRAIN: BJL200;
SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: DH5 ALPHA;
SOURCE 6 OTHER_DETAILS: CLONED GENE
KEYWDS CHITINASE, INHIBITION, LACTONE, CHITIN DEGRADATION,
KEYWDS 2 HYDROLASE, GLYCOSIDASE
EXPDTA X-RAY DIFFRACTION
AUTHOR G.VAAJE-KOLSTAD,A.VASELLA,M.G.PETER,C.NETTER,D.R.HOUSTON,
AUTHOR 2 B.WESTERENG,B.SYNSTAD,V.G.H.EIJSINK,D.M.F.VAN AALTEN
REVDAT 1 27-APR-04 1UR9 0
JRNL AUTH G.VAAJE-KOLSTAD,A.VASELLA,M.G.PETER,C.NETTER,
JRNL AUTH 2 D.R.HOUSTON,B.WESTERENG,B.SYNSTAD,V.G.H.EIJSINK,
JRNL AUTH 3 D.M.F.VAN AALTEN
JRNL TITL INTERACTIONS OF A FAMILY 18 CHITINASE WITH THE
JRNL TITL 2 DESIGNED INHIBITOR HM508 AND ITS DEGRADATION
JRNL TITL 3 PRODUCT, CHITOBIONO-DELTA-LACTONE
JRNL REF J.BIOL.CHEM. V. 279 3612 2004
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH D.M.F.VAN AALTEN,D.KOMANDER,B.SYNSTAD,S.GASEIDNES,
REMARK 1 AUTH 2 M.G.PETER,V.G.H.EIJSINK
REMARK 1 TITL STRUCTURAL INSIGHTS INTO THE CATALYTIC MECHANSIM OF
REMARK 1 TITL 2 A FAMILY 18 EXOCHITINASE
REMARK 1 REF PROC.NAT.ACAD.SCI.USA V. 98 8979 2001
REMARK 1 REFN ASTM PNASA6 US ISSN 0027-8424
REMARK 1 REFERENCE 2
REMARK 1 AUTH D.M.F.VAN AALTEN,B.SYNSTAD,M.B.BRURBERG,E.HOUGH,
REMARK 1 AUTH 2 B.RIISE,V.G.H.EIJSINK,R.K.WIERENGA
REMARK 1 TITL STRUCTURE OF A TWO-DOMAIN CHITOTRIOSIDASE FROM
PDB FILE
PDB file format
Fields after the ATOM keyword, left to right: atom number, atom identifier, residue type, chain, residue number, X,Y,Z coordinates, occupancy, temperature factor, atom type
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    340  N   PHE A  43       3.853  28.346  32.161  1.00 10.57           N
ATOM    341  CA  PHE A  43       3.839  29.688  32.724  1.00 12.33           C
ATOM    342  C   PHE A  43       3.096  29.747  34.047  1.00 13.20           C
ATOM    343  O   PHE A  43       2.361  28.823  34.393  1.00 12.52           O
ATOM    344  CB  PHE A  43       3.228  30.659  31.700  1.00 10.99           C
ATOM    345  CG  PHE A  43       3.993  30.709  30.401  1.00  9.80           C
ATOM    346  CD1 PHE A  43       3.743  29.794  29.386  1.00  9.85           C
ATOM    347  CD2 PHE A  43       5.032  31.615  30.233  1.00 11.37           C
ATOM    348  CE1 PHE A  43       4.528  29.781  28.220  1.00 10.71           C
ATOM    349  CE2 PHE A  43       5.816  31.612  29.075  1.00 10.61           C
ATOM    350  CZ  PHE A  43       5.569  30.697  28.067  1.00 10.48           C
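Because every field of an ATOM record sits in fixed columns, a record can be read by slicing rather than by splitting on whitespace. The sketch below is a minimal illustration under that assumption; the function name `parse_atom_record` and the dictionary keys are invented here, and the column ranges follow the standard PDB layout of the ruler above.

```python
def parse_atom_record(line):
    """Split one fixed-width PDB ATOM record into named fields."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return {
        "serial": int(line[6:11]),        # columns 7-11: atom number
        "name": line[12:16].strip(),      # columns 13-16: atom identifier
        "res_name": line[17:20].strip(),  # columns 18-20: residue type
        "chain": line[21],                # column 22: chain
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "x": float(line[30:38]),          # columns 31-54: X, Y, Z
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66: temperature factor
        "element": line[76:78].strip(),   # columns 77-78: atom type
    }


record = "ATOM    340  N   PHE A  43       3.853  28.346  32.161  1.00 10.57           N"
atom = parse_atom_record(record)
print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"], atom["x"])
```

Slicing by column keeps working when neighbouring fields touch (for example a five-digit atom number directly followed by a four-character atom name), which whitespace splitting would not handle.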
PDB files - problems
- the PDB format uses fixed-width fields, so one entry is limited
to 99,999 atom records and the chain identifier is limited to a single
character, which is not enough for structures of huge complexes (e.g.
ribosomes and viruses)
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    340  N   PHE A  43       3.853  28.346  32.161  1.00 10.57           N
- parsing of PDB files is difficult: apart from the ATOM records
the file is almost unstructured (e.g. there are no rules for describing
structure determination in REMARK records)
The mmCIF and XML formats deal with these issues
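The size limits follow directly from the column widths: five columns for the atom number and one for the chain. A small writer makes this visible. This is a sketch only; `format_atom_record` is a name made up for this example, and it handles just the fields discussed on the slide.

```python
def format_atom_record(serial, name, res_name, chain, res_seq,
                       x, y, z, occupancy, b_factor, element):
    """Write one ATOM record, refusing values that overflow their columns."""
    if not 0 < serial <= 99999:
        raise ValueError("atom number does not fit in 5 columns")
    if len(chain) != 1:
        raise ValueError("chain identifier must be exactly one character")
    # Names shorter than four characters start in column 14, hence the
    # leading space; four-character names fill columns 13-16 completely.
    atom_name = name if len(name) == 4 else (" " + name).ljust(4)
    return (
        f"ATOM  {serial:5d} {atom_name} {res_name:>3} {chain}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
        f"          {element:>2}"
    )


line = format_atom_record(340, "N", "PHE", "A", 43,
                          3.853, 28.346, 32.161, 1.00, 10.57, "N")
print(line)
```

Passing an atom number of 100000 or a two-letter chain such as "AA" raises `ValueError`, which is the situation large ribosome or virus structures run into.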
Trust PDB?
The database centres can’t refuse to accept any data! Even if
curators of the PDB know the data contain serious errors.
So the PDB does contain a lot of errors, from sequence consistency
errors (you’ll deal with them) to completely wrong folds.
And even the best data are still only models that best fit the
experimental data.
Never trust the PDB!
Do you find this Trp normal?
Trp D 67 7GPB
Validation of structure files
- check statistics for bond lengths, angles, Ramachandran
plots….
- do statistics look similar to those of other proteins?
WhatCheck, Procheck
- how well does the model fit experimental data?
EDS
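A geometry check of the kind listed above can be sketched in a few lines: compute bond lengths from the coordinates and compare them with typical values. The coordinates are the backbone atoms of PHE A 43 from the ATOM records shown earlier; the reference lengths and the tolerance are rounded, assumed values chosen for illustration, not the tables used by WhatCheck or Procheck.

```python
import math

# Backbone atoms of PHE A 43, taken from the ATOM records shown earlier.
ATOMS = {
    "N": (3.853, 28.346, 32.161),
    "CA": (3.839, 29.688, 32.724),
    "C": (3.096, 29.747, 34.047),
    "O": (2.361, 28.823, 34.393),
}

# Approximate typical bond lengths in angstroms (assumed, rounded values).
TYPICAL = {("N", "CA"): 1.46, ("CA", "C"): 1.53, ("C", "O"): 1.23}


def bond_length(a, b):
    """Euclidean distance between two named atoms."""
    return math.dist(ATOMS[a], ATOMS[b])


def check_bonds(tolerance=0.05):
    """Map each bond to (observed length, whether it is within tolerance)."""
    report = {}
    for (a, b), expected in TYPICAL.items():
        observed = bond_length(a, b)
        report[(a, b)] = (observed, abs(observed - expected) <= tolerance)
    return report


for bond, (observed, ok) in check_bonds().items():
    print(bond, round(observed, 3), "ok" if ok else "unusual")
```

Real validation programs apply the same idea to all bonds, angles and torsions and judge deviations against statistics collected from many well-determined structures.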
Electron Density Server
PDBsum
PDBSum-Highlights
Text searches in structural databases
Find all the structures deposited by Gerard Kleywegt with resolution better than
2Å and published in Journal of Molecular Biology
Options:
PDB - SearchLite, SearchFields
MSD - MSDlight, MSDpro (Java), MSDmine
OCA
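The example query combines three criteria (author, resolution, journal), which is what the field-based search forms do behind the scenes. The sketch below shows the logic on a handful of made-up records; the `Entry` class, the `search` function, the placeholder identifiers and the records themselves are all hypothetical and do not reflect any server's interface or real database content.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: str      # placeholder identifier, not a real PDB code
    authors: tuple
    resolution: float  # in angstroms; smaller means better
    journal: str


def search(entries, author, better_than, journal):
    """Return the entries that satisfy all three criteria at once."""
    wanted = author.upper()
    return [
        e for e in entries
        if any(wanted in name.upper() for name in e.authors)
        and e.resolution < better_than
        and e.journal.upper() == journal.upper()
    ]


# Made-up records for illustration only.
ENTRIES = [
    Entry("demo-1", ("G.J.KLEYWEGT", "A.N.OTHER"), 1.8, "J.Mol.Biol."),
    Entry("demo-2", ("G.J.KLEYWEGT",), 2.5, "J.Mol.Biol."),
    Entry("demo-3", ("A.N.OTHER",), 1.5, "J.Mol.Biol."),
    Entry("demo-4", ("G.J.KLEYWEGT",), 1.9, "Structure"),
]

hits = search(ENTRIES, "Kleywegt", 2.0, "J.Mol.Biol.")
print([e.entry_id for e in hits])
```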
Search Fields
Summary
- three major repositories of structural data: RCSB, MSD and
PDBJ
- all three are part of wwPDB
- structural data are deposited in PDB files - problems
- new formats - mmCIF, XML
- validation tools are necessary - WHATCheck, EDS
- new services are developed to analyze the whole database
(MSD services)
- searches at various levels of depth/complexity - Searchlite,
Search Fields
- added-value databases - OCA, PDBSum
Structural alignment
Why structural alignment ?
we have sequence alignment - Clustal…
KTHLCV
KSHA-V
that gives us an idea about a correspondence of
amino acids of two (or more ) proteins
That makes it possible to infer information about the function
and evolution of the protein
If the sequences are similar enough !!!!
What is twilight zone ?
Sequence alignment unambiguously
distinguishes between protein pairs of
similar and non-similar structure only
when the pairwise sequence identity is high.
High sequence identity roughly means
over 40 %.
The signal gets blurred in the twilight
zone of 20-35 % sequence identity.
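Percent identity is easy to compute once two sequences are aligned. The sketch below uses the short KTHLCV / KSHA-V alignment from the earlier slide and the thresholds quoted above. Two choices are assumptions of this example: identity is counted over columns without gaps (other definitions divide by the full alignment length or by the shorter sequence), and values outside the two named ranges are simply reported as such.

```python
def percent_identity(a, b, gap="-"):
    """Identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)


def zone(identity):
    """Label an identity value using the ranges named on the slide."""
    if identity > 40:
        return "high identity"
    if 20 <= identity <= 35:
        return "twilight zone"
    return "outside the ranges named on the slide"


identity = percent_identity("KTHLCV", "KSHA-V")
print(identity, zone(identity))
```

An alignment of six residues is far too short to be significant on its own; the point here is only how the number is obtained.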
More of the twilight zone
More than 90 % sequence pairs with the sequence
identity lower than 25 % have different structures.
The significance of sequence alignments is length
dependent.
The longer the sequence, the lower the identity
required for it to be called significant. Nevertheless, the
threshold converges to 25% for alignments longer than 80
amino acids.
The ‘more similar than identical’ rule can reduce the
number of false positives.
Using intermediate sequences for finding links
between more distant families can also reduce the
number of false positives.
How far can the sequence identity drop?
Average sequence identity of random alignments - 5.6 %
Average sequence identity of remote homologues 8.5 %
How does it work?
From http://www.biochem.unizh.ch/antibody/Introduction/Institutsseminar97/source/slide2.htm
Structural alignment because:
Structures are better conserved than sequences
structural alignment can imply a functional
similarity that is not detectable from a sequence
alignment .
Might help to improve sequence alignment when
structures are available (phylogenetic studies,
homology modeling).
Will improve sequence alignment methods (use of
structural alignments’ substitution matrices, gap
penalties).
Will improve sequence prediction methods
Structural versus sequence alignment
1FWR_A  -------------------------MKNWKTSAESILTTGPVVPVIVVKKLEHAVPMAKA
2YPI_A  ARTFFVGGNFKLNGSKQSIKEIVERLNTASIPENVEVVICPPATYLDYSVSLVKKPQVTV

1FWR_A  LVAGGVRVLEVTLRTECAVDAIRAIAKEVPEAIVGAGTVLNPQQLAEVTE-------AGA
2YPI_A  GAQNAYLKASGAFTGENSVDQIKDVGAKWVILGHSERRSYFHEDDKFIADKTKFALGQGV

1FWR_A  QFAISPGLTEPLLKAATEGTIPLIPGISTVSELMLGMDYGLKEFQFFPAEANGGVKALQA
2YPI_A  GVILCIGETLEEKKAGKTLDVVERQLNAVLEEVKDWTNVVVAYEPVWAIGTGLAATPEDA

1FWR_A  IAGPFSQVRFCPKGGISPANYRDYLALKSVLCIGGSWLVPADALEAGDYDRITKLAREAV
2YPI_A  QDIHASIRKFLASKLGDKAASELRILYGGSANGSNAVTFKDKADVDGFLVGGASLKPEFV

1FWR_A  EGAKL-
2YPI_A  DIINSRN
Sequence 1  ------------ART---FFVGGNFKLNG-SKQSI-KEIVERLNTASI--PENVEVVICP
Sequence 2  MKNWKTSAESIL--TTGP--VVPVI--VVKKLEHAVP-MAKALVAG-GVR-----V-LEV

Sequence 1  ------PATYLDYSVSLV-KKPQVTVGAQ-N--AY-LKASGAFTGEN-S---VDQIKDVG
Sequence 2  TLRTECAVDAIRAIAKEVP-E--AIVGAGTVLN-PQ----------QLAEVT--E---AG

Sequence 1  AKWVILGH--SERRSYFHEDDKFIADKTKFALGQGVGVILCIGETLEEKKAGKTLDVVER
Sequence 2  AQFAIS-PGL-------------TEPLLKAATEGTIPLIPGIS--------------TVS

Sequence 1  QLNAV-LEEVKDW-TNVVVAYEP--VW--AIGTGLAATPEDA--QDI--HASI-RKFLA-
Sequence 2  ELMLGMD--YG-LK---EFQFFPAE-ANG-------G----VKA--LQA--IAG-P--FS

Sequence 1  SKLGDKAA-SELRILYGGSANGSN-AVTF---KDK-ADVDGFLVGGA-SLK---------
Sequence 2  -------QV---RFCPKGGIS-PANY--RDYL--ALKSVLCIGG-SWL-VPADALEAGDY

Sequence 1  --P--EFV--DIIN--SR-N
Sequence 2  DRITKL-AREA--VEGAKL-
Sequence versus structural alignment
  1   2   3   4   5   6   7   8   9  10  11  12  13  14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS
Is it difficult to make structural alignment?
Yes, it is.
Structural alignment is an NP-hard problem
(NP: nondeterministic polynomial time).
In other words, no efficient exact
solution is known.
Even if there were one, the result would be
correct from a technical point of view, but not
necessarily from a biological point of view.
General solution
Use a heuristic approach:
1. Represent the proteins A and B in some
coordinate independent space
2. Compare A and B
3. Optimize the alignment between A and B
(e.g. minimize R.M.S.d.)
4. Measure the statistical significance of
the alignment against some random set of
structure comparisons
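The quantity minimised in step 3 can be written down directly. The sketch below computes the R.M.S.d. for two equally long lists of paired atom coordinates. It assumes that the equivalent atoms have already been paired and that the structures are already superposed; real tools also optimise the superposition itself (for example with the Kabsch algorithm), which is left out here. The function name `rmsd` is chosen for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Two toy "structures": the second is the first shifted by 1 along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))
```

The shifted copy illustrates why superposition matters: the two toy structures are identical in shape, yet their R.M.S.d. is non-zero until one is moved onto the other.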
“..in some coordinate independent space…”
Make the problem easier by:
- comparing only distance matrices of
atoms
- comparing secondary
structure elements (SSEs)
- comparing cartoons
- comparing vectors of SSEs
- a combination of the methods mentioned
- ….
None of the methods guarantees
finding the closest
structure, and two methods can
disagree at all amino acid
positions.
Nevertheless they can still
provide a valuable insight into the
history of the protein and give
hints concerning the function.
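Why a distance matrix counts as a "coordinate independent" representation can be checked numerically: the distances between atoms of one molecule do not change when the whole molecule is rotated or shifted. The sketch below uses four backbone atoms from the ATOM records shown earlier. The helper names are invented for this example, and it only demonstrates the invariance; it is not the comparison algorithm of DALI or any other server.

```python
import math


def distance_matrix(coords):
    """All pairwise distances within one set of atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def matrix_difference(m1, m2):
    """Mean absolute difference between two equally sized matrices."""
    n = len(m1)
    return sum(abs(m1[i][j] - m2[i][j])
               for i in range(n) for j in range(n)) / (n * n)


def rotate_z_and_shift(coords, angle, shift):
    """Rotate about the z axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


backbone = [(3.853, 28.346, 32.161), (3.839, 29.688, 32.724),
            (3.096, 29.747, 34.047), (2.361, 28.823, 34.393)]
moved = rotate_z_and_shift(backbone, 1.0, (10.0, -5.0, 2.5))

# The coordinates differ, but the distance matrices agree.
print(matrix_difference(distance_matrix(backbone), distance_matrix(moved)))
```

Two proteins can therefore be compared through their distance matrices without first deciding how to place them in a common frame.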
Methods for fold comparison
Server   Location                                                    Method
CE       http://cl.sdsc.edu                                          Extension of optimal path
DALI     http://www2.ebi.ac.uk/dali                                  Distance-matrix alignment
DEJAVU   http://portray.bmc.uu.se/cgi-bin/dennis/dejavu.pl           SSE alignment with Cα-atom optimisation
LOCK     http://gene.stanford.edu/LOCK/                              Absolute orientation of corresponding points
MATRAS   http://bongo.lab.nig.ac.jp/~takawaba/Matras.html            Markov transition model of evolution
PRIDE    http://hydra.icgeb.trieste.it/pride/                        Cα-Cα atom distances
SSM      http://www.ebi.ac.uk/msd-srv/ssm/ssmstart.html              Graph matching algorithm
TOP      http://bioinfo1.mbfys.lu.se/TOP                             SSE alignment
TOPS     http://tops.ebi.ac.uk/tops/compare1.html                    TOPS-diagram alignment
TOPSCAN  http://www.rubic.rdg.ac.uk/~andrew/bioinf.org/topscan       Secondary topology-string alignment
VAST     http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html  Vector alignment
Protein structure classification
If you want to know which structures are
similar to a known structure, these
systems might help:
A) Manual - SCOP
B) Semi-automatic - CATH
C) Automatic - FSSP
CATH
http://www.biochem.ucl.ac.uk/bsm/cath
CATH Topology or fold group level
From C. Orengo talk at EMBO course,
Cambridge 2004
TIM barrel enzymes
– 18 different homologous families
>60 different E.C. numbers
Structure of TIM barrel:
EC Wheel of TIM barrels
Triose phosphate isomerase
From J. Thornton talk at EMBO
course, Cambridge 2004
CATH: nearly one third of the superfamilies belong to <10 fold groups
Fold groups shown: Arc repressor-like, Up-down, Rossmann fold, SH3-like, OB fold, Immunoglobulin, Alpha-beta plait, Jelly roll, TIM barrel
From C. Orengo talk at EMBO course, Cambridge 2004
TargetDB
http://targetdb.pdb.org/
contains 84063 sequences annotated like:
Hypothetical Protein Mth938 (PDB ID:1ihn)
- hypothetical protein Af0491 from A. fulgidus
- putative serine hydrolase from S. cerevisiae
- predicted glutamine amidotransferase from P. aeruginosa
(January 2005)
PDB contains about 500 structures with a similar degree of
confidence in functional assignment
Function from structure
Sequence scans:
- Sequence search vs PDB
- Sequence search vs Uniprot
- Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc)
- Superfamily HMM library
- Gene neighbours
Fold and structural motifs:
- SSM fold search
- Surface clefts
- Residue conservation
- DNA-binding HTH motifs
- Nest analysis
n-residue templates:
- Enzyme active sites
- Ligand binding sites
- DNA binding sites
- Reverse templates
Summary
Structural alignment can help with protein
annotations even when the sequence similarity is
not significant.
Sequence identity of two proteins with similar
structures can be lower than 10 % - number of
folds is limited.
Recent progress in protein structure
determination increases the usefulness of
structural alignment.
Structural alignment is a difficult problem that is
solved by heuristic methods.
These methods simplify the problem and sacrifice
the optimal result for speed.
Summary II
Different methods can provide completely
different alignments.
In our results, CE, Dali, Matras and VAST were the
best servers for finding structural relatives.
A few structural classification systems have been
developed (CATH, FSSP, SCOP); they provide a
hierarchical classification of protein structures
and make it possible to infer functional and evolutionary
relationships between proteins.
Folds are not distributed equally. The ten most
frequent folds represent almost one third of all
structures.