Transcript slides

Understanding Sequence,
Structure and Function
Relationships and the
Resulting Redundancy
PHAR 201/Bioinformatics I
Philip E. Bourne
UCSD
PHAR 201 Lecture 07, 2012
Agenda
• Understand the relationship between sequence,
structure and function. Consider specifically:
– sequence-structure
– structure-structure
– structure-function
• Take home message: a non-redundant set of
sequences is different than a non-redundant set
of structures is different than a non-redundant
set of functions
Why Bother?
• Biology:
– A full understanding of a molecular system
comes from careful examination of the
sequence-structure-function triad
– Each triad is then a component in a biological
process
• Method:
– Bioinformatics studies invariably start from a
non-redundant set of data to achieve
appropriate statistical significance
Background – RMSD Defined
[Figure: Protein A (Calpha atoms a1 ... aN) superposed on Protein B (Calpha atoms b1 ... bN); d_i is the distance between corresponding atoms a_i and b_i]
RMSD = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} |d_i|^2 }
RMSD represents the overall distance between two proteins, usually
averaged over their Calpha atoms, denoted here a and b.
Thus RMSD is the square root of the mean of the squared distances
between all corresponding Calpha atoms.
Rule of thumb:
– 1-2 Å RMSD: the proteins are close
– <6 Å RMSD: they are likely related
Note: Assumes you know the residue correspondences
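A minimal sketch of the RMSD definition above, assuming the residue correspondences are already known and the two structures are already superposed (no fitting is performed). The coordinates and the function name `rmsd` are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes point i in coords_a corresponds to point i in coords_b
    and that the structures are already superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        # squared distance |d_i|^2 between corresponding atoms
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # square root of the mean squared distance
    return math.sqrt(total / len(coords_a))

# Hypothetical Calpha coordinates in Å (illustration only)
protein_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
protein_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(protein_a, protein_b))
```

In practice the structures would first be superposed (for example by a least-squares fit), which this sketch leaves out.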
Some Useful Observations
• Below 30% protein sequence identity detection of a
homologous relationship is not guaranteed by sequence
alone
• Structure is much more conserved than sequence
• Distinguishing between divergent versus convergent
evolution is an issue
• Structure is limited relative to sequence, on the order of
1:100 – 1:10000 (depending on how you count)
• Structure follows a power law with respect to function –
each structural template has from 1 to n functions
Relationship Between
Sequence and Structure
The classic HSSP curve from Sander and Schneider (1991) Proteins 9:56-68
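As a rough sketch of how such a curve is used: compute the percent identity over the aligned (ungapped) positions, then compare it with a length-dependent cutoff. The constants in `hssp_threshold` are the commonly quoted form of the Sander and Schneider curve and are an assumption here; check them against the paper before relying on them. Both function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (% identity, number of aligned positions) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # keep only columns where neither sequence has a gap
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0, 0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs), len(pairs)

def hssp_threshold(length):
    """Length-dependent identity cutoff (percent).

    Assumed constants: 290.15 * L**-0.562 for 10 <= L <= 80, 24.8 above 80.
    Returns None for very short alignments, where no cutoff is assumed.
    """
    if length < 10:
        return None
    if length > 80:
        return 24.8
    return 290.15 * length ** -0.562

# Made-up aligned pair (illustration only)
a = "ACDEFGHIKLMNPQRSTVWY"
b = "ACDEFGHIKLMNPQRSTVWA"
identity, n_aligned = percent_identity(a, b)
cutoff = hssp_threshold(n_aligned)
print(identity, n_aligned, cutoff, identity > cutoff)
```

The point of the length dependence is that a short alignment needs a much higher identity than a long one before structural similarity can be inferred.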
This Analysis was Updated by
Rost in 1999
http://peds.oupjournals.org/cgi/content/full/12/2/85
Sequence vs Structure – Another
Perspective
Random 1000 structurally similar PDB polypeptide chains from
CE with z > 4.5 (% sequence identity vs alignment length)
[Figure: plot of % sequence identity against alignment length, with the Twilight Zone and Midnight Zone regions marked]
There Are No Absolute Rules - Similar Sequences – Different Structures
[Figure: 1PIV:1 (Viral Capsid Protein) and 1HMP:A (Glycosyltransferase)]
80 Residue Stretch (Yellow) with Over 40% Sequence Identity
Given This Complex Relationship
a Non-redundant Set of
Sequences Does not Imply a
Non-redundant Set of Structures
Structure vs Structure
Structure Is Highly Redundant
The Russian Doll Effect
Homology modeling is used here
Structure Alignments using CE with z>4.0
We will be revisiting this in the next
couple of lectures
• Specifically:
– How do we capture this redundancy?
– What systems are commonly used to express
this redundancy and what do they bring to our
understanding of biology?
• For now consider what this means using
the most popular structure classification
scheme - SCOP
Nature’s Reductionism
There are ~ 20^300 possible proteins
>>>> all the atoms in the Universe
17.4M protein sequences from
17994 species (RefSeq 10/24/12)
38,221 protein structures
yield 1195 domain folds (SCOP 1.75,
not changed in 3 years)
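A quick order-of-magnitude check of the first claim, assuming 20 amino acids and a chain length of 300 residues. The figure of about 10^80 atoms in the observable universe is a commonly quoted estimate and is an assumption here, not a number from the slide.

```python
import math

n_amino_acids = 20      # standard amino acid alphabet
chain_length = 300      # assumed typical protein length

# log10 of the number of possible sequences, 20^300
log10_sequences = chain_length * math.log10(n_amino_acids)

# Assumption: commonly quoted order-of-magnitude estimate, not from the slide
log10_atoms_in_universe = 80

print(f"20^300 is about 10^{log10_sequences:.0f}")
print(f"excess over atoms in the Universe: about 10^{log10_sequences - log10_atoms_in_universe:.0f}")
```

Against this space, 17.4M known sequences and 1195 folds are vanishingly small numbers, which is the reduction the slide is pointing at.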
The SCOP Hierarchy v1.75
Based on 38221 Structures
– Classes: 7
– Folds: 1195
– Superfamilies: 1962
– Families: 3902
– Domains: 110800
This is remarkable!
Explains the one fold
many functions
Specific Examples
From the SCOP
Hierarchy
Protein Domains
• Definition
– Compact, spatially distinct
– Fold in isolation
– Recurrence
Structure vs Function
Some Basic Rules Governing
Structure-Function Relationships …
• The golden rule is there are no golden
rules – George Bernard Shaw
• Above 40% sequence identity sequences
tend to have the same structure and
function – But there are exceptions
• Structure and function tend to diverge at
the same level of sequence identity
Structure vs Function
This is even more complicated than the
relationship between sequence and
structure and not as well understood
Complication Comes from One
Structure Multiple Functions
• We saw this from GO already
• Phosphoglucose isomerase acts as a
neuroleukin, cytokine and differentiation
mediator as a monomer in the extracellular
space; as a dimer inside the cell it is involved
in glucose metabolism
Consider an Example Relative to
SCOP
• Lysozyme and alpha-lactalbumin:
– Same class – alpha+beta
– Same fold – lysozyme-like
– Same superfamily – lysozyme-like
– Same family – C-type lysozyme
– Different function at 40% sequence identity
• Lysozyme – hydrolase EC 3.2.1.17
• Alpha-lactalbumin – Ca binding, lactose
biosynthesis
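The comparison above can be sketched as a walk down the SCOP levels until two entries diverge. The dictionaries below hold only the labels given on this slide; the `function` key and the helper name `deepest_shared_level` are hypothetical additions for illustration, not part of SCOP.

```python
# SCOP levels from most general to most specific
SCOP_LEVELS = ("class", "fold", "superfamily", "family")

def deepest_shared_level(a, b):
    """Return the most specific SCOP level at which two entries still agree."""
    shared = None
    for level in SCOP_LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared

lysozyme = {
    "class": "alpha+beta",
    "fold": "lysozyme-like",
    "superfamily": "lysozyme-like",
    "family": "C-type lysozyme",
    "function": "hydrolase EC 3.2.1.17",
}
alpha_lactalbumin = {
    "class": "alpha+beta",
    "fold": "lysozyme-like",
    "superfamily": "lysozyme-like",
    "family": "C-type lysozyme",
    "function": "Ca binding, lactose biosynthesis",
}

print(deepest_shared_level(lysozyme, alpha_lactalbumin))
print(lysozyme["function"] == alpha_lactalbumin["function"])
```

Agreement all the way down to family, with different functions, is exactly why a structural classification alone cannot stand in for a functional one.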
More Details…
Lysozyme is an O-glycosyl hydrolase, but α-lactalbumin
does not have this catalytic activity. Instead it regulates
the substrate specificity of galactosyl transferase through
its sugar binding site, which is common to both α-lactalbumin
and lysozyme. Both the sugar binding site and
catalytic residues have been retained by lysozyme during
evolution, but in α-lactalbumin, the catalytic residues have
changed and it is no longer an enzyme.
Why is It Not so Well Understood?
1. Function is often ill-defined (e.g., biochemical,
biological, phenotypic) and instances are
buried in the literature
2. The PDB is biased – it does not have a
balanced repertoire of functions and those
functions are ill-defined
3. There are a number of functional
classifications (e.g., EC, GO) that have differing
coverage and depth
Point 2 PDB Bias
PDB vs Human Genome
EC – Hydrolases – Begins to Illustrate the Bias in the PDB
[Figure: distribution of EC classes in the PDB compared with the Ensembl human genome annotation]
– 2.5 Transferring alkyl or aryl groups: over-represented in PDB
– 2.4 Glycosyltransferases: under-represented in PDB
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
Structure vs Function Follows a
Power Law Distribution
• Some folds are
promiscuous and
adopt many different
functions - superfolds
Qian J, Luscombe NM, Gerstein M. JMB 2001 313(4):673-81
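One rough way to check for power-law behaviour is to fit a straight line to the counts on log-log axes. The sketch below does this with ordinary least squares; it is not the method of the cited paper, the data are made up to follow an exact power law, and `fit_power_law` is a hypothetical helper. Log-log regression is a crude estimator on real, noisy data.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(c) + k * log(x); returns (c, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    k = sxy / sxx                         # slope on log-log axes = exponent
    c = math.exp(mean_y - k * mean_x)     # intercept gives the prefactor
    return c, k

# Made-up data (illustration only): number of folds having exactly n functions
functions_per_fold = [1, 2, 3, 4, 5, 6, 8, 10]
fold_counts = [400.0 * n ** -2.0 for n in functions_per_fold]

c, k = fit_power_law(functions_per_fold, fold_counts)
print(round(c, 3), round(k, 3))
```

A steeply negative exponent means most folds carry one or a few functions while a handful, the superfolds, carry many.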
Examples of Superfolds..
1TIM
Examples of Superfolds
3ADK
1FXI
Specific Examples of the
Relationship Between Structure
and Function
Same Structure and Function Low
Sequence Identity
The globin fold is resilient to amino acid changes. V. stercoraria (bacterial)
hemoglobin (left) and P. marinus (eukaryotic) hemoglobin (right) share just
8% sequence identity, but their overall fold and function are identical.
Same Structure Different Function - Alpha/beta
proteins characterized as different superfamilies
[Figure: structures 1ymv, 1fla and 1pdo]
Example – Same Structure Different
Function
– 1ymv: CheY, Signal Transduction
– 1fla: Flavodoxin, Electron Transport
– 1pdo: Mannose Transporter
Less than 15% sequence identity
Convergent Evolution
Subtilisin and chymotrypsin are both serine endopeptidases. They share no
sequence identity, and their folds are unrelated. However, they have an
identical, three-dimensionally conserved Ser-His-Asp catalytic triad, which
catalyses peptide bond hydrolysis. These two enzymes are a classic
example of convergent evolution.
Example: Same Fold but Not Function
[Figure: sequence alignment of Ilk (with predicted secondary structure) against the Src kinase structure 1fmk (with observed secondary structure); key residues are marked with * and the catalytic loop is labeled]
• “Integrin-linked kinase” (Ilk) is a novel protein kinase fold with strong sequence similarity to known structures (Hannigan et al. 1996 Nature 379, 91-96)
• Aligns to Src kinases with BLAST e-value of 10^-19 and 27% identity (alignment shown is to a known Src kinase structure)
• Several key residues are conserved, but residues important to catalysis, including catalytic Asp, are missing
• Recent experimental evidence suggests that Ilk lacks kinase activity (Lynch et al. 1999 Oncogene 18, 8024-8032)
Non-Redundant Sets: Sequences
• RefSeq (NCBI) – Annotated
• BLASTclust
http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html
• CD-HIT http://bioinformatics.org/cd-hit/ –
popular algorithm for fast clustering of sequences
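To make the idea of a non-redundant sequence set concrete, here is a minimal sketch of greedy incremental clustering, the general strategy behind tools of this kind: sort sequences longest first, then either assign each one to the first representative it resembles closely enough or make it a new representative. The similarity measure (`difflib.SequenceMatcher.ratio`) is a stand-in chosen to keep the example self-contained; it is not what CD-HIT or BLASTclust actually compute, and the sequences and threshold are made up.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; real tools use alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def greedy_cluster(sequences, threshold=0.9):
    """Return a list of (representative, members) pairs."""
    clusters = []
    # longest sequences are considered first and become representatives
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if similarity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # no existing representative is similar enough: start a new cluster
            clusters.append((seq, [seq]))
    return clusters

# Made-up sequences (illustration only)
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",   # one substitution relative to the first
    "GSHMLEDPVDAFKLNQW",
]
clusters = greedy_cluster(sequences)
non_redundant = [representative for representative, _ in clusters]
print(len(clusters), non_redundant)
```

The representatives form the non-redundant set; changing the threshold changes how many clusters, and therefore how many representatives, you get.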
Non-Redundant Sets:
Sequences with Structure
• PDBselect - http://bioinfo.tg.fh-giessen.de/pdbselect/
• Astral http://astral.berkeley.edu/
• Pisces
http://dunbrack.fccc.edu/Guoli/PISCES_OptionPage.php
• RCSB PDB queries
• RCSB Sequence Similarity
PDB Has 194042 Polypeptide
Chains
From http://www.pdb.org/pdb/statistics/clusterStatistics.do