Protein Sequence Analysis Structure & Function Prediction
Bioinformatics approaches for…
Teresa K Attwood
Faculty of Life Sciences & School of Computer Science
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/
….analysing GPCRs….
….which craft is best?
Overview
• What are GPCRs?
– why they’re interesting & important
– why bioinformatics approaches are important
• In silico function prediction
– a reality check
• Family-based methods for characterising GPCRs
• Understanding the tools
– problems with pair-wise & family-based approaches
– estimating (biological) significance
• Seeking deeper functional insights
• Conclusions
What are GPCRs?
G protein-coupled receptors
• A functionally diverse family of cell-surface 7TM proteins
• Functional diversity achieved via
– interaction with a variety of ligands
– stimulation of various intracellular pathways via coupling to
different G proteins
Why are GPCRs interesting?
Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance
of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp.60-71.
• They are ubiquitous
– >800 GPCR genes in the human genome, from 3 major
superfamilies
• rhodopsin-, secretin- & metabotropic glutamate receptor-like
• Share almost no sequence similarity
– but are united by common 7TM architecture
• Constitute a complex multi-gene family
– populated by >50 families & >350 subtypes
Isn’t just stamp collecting!
• GPCRs are of profound biomedical importance
– targets for >50% of prescription drugs
– yield sales >$16 billion/annum
• they’re big business!
• Given their importance, we need to
– characterise the ones we know about
– identify new ones
• & discover what they do!
– e.g., as potential new drug targets
Why studying GPCRs is difficult
• Only 2 crystal structures available
– bovine rhodopsin (2000) & human β2-adrenergic receptor (2007)
• Many GPCRs haven’t been characterised experimentally
– remain ‘orphans’, with unknown ligand specificity
• With >800 human GPCRs, this isn’t much to go on!
Why use bioinformatics approaches?
• Computational approaches are important
– can be used to help identify, characterise & model novel receptors
• usually by similarity & extrapolation of known characteristics
• Bioinformatics thus offers complementary tools for
elucidating the structures & functions of receptors
• But the task is non-trivial
– GPCRs exhibit rich relationships & complex molecular interactions
• present many challenges for in silico analysis
– in trying to derive meaningful functional insights, traditional methods are
likely to be limited
[Figure: GPCR signalling network. Ligand classes: peptides, proteins, amino acids, ions, biogenic amines, lipids, light, others. Labelled components: GPCR; G proteins (i, q, o; GDP/GTP); PI3K, Ras, Rap, Raf1, B-Raf, Shc, Sos, Grb2, Src, RTK, PYK2, Ras GRF, MEK, Ca2+, EPAC, PKC]
We’ve been using biology-unaware search
tools to analyse such complex systems
How far can we truly expect to
understand cellular function with
such naïve approaches…?
[Figure labels, continued: P, G proteins (s, i; GTP/GDP), cAMP, MAPK, PKA, PLC; regulation of gene expression; nucleus]
In silico function prediction
…a reality check
• What is the function of this structure?
• What is the function of this sequence?
• What is the function of this motif?
– the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
– knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
“A test case for structural genomics
Structure-based assignment of the biochemical function of
hypothetical protein mj0577” (Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown
What's in a sequence?
Methods for family analysis
Attwood, TK (2000). The quest to deduce protein function from sequence:
the role of pattern databases. Int.J. Biochem. Cell Biol., 32(2), 139–155.
• Single motif methods
– Fuzzy regex (eMOTIF)
– Exact regex (PROSITE)
• Full domain alignment methods
– Profiles (Profile Library)
– HMMs (Pfam)
• Multiple motif methods
– Identity matrices (PRINTS)
– Weight matrices (Blocks)
The challenge of family analysis
• highly divergent family with single function?
• superfamily with many diverse functional families?
– we must distinguish between these if functional analysis is done in silico
– a tough challenge!
In the beginning was PROSITE
TM domain
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
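A PROSITE pattern is essentially a regular expression in a domain-specific notation. The sketch below, which is not PROSITE's own scanning software, translates the full PS00237 pattern from the PROSITE entry into a Python regex and runs it against a made-up peptide. The function name, the peptide and the restriction to the syntax used in this one pattern are assumptions for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE PA-line pattern into a Python regular expression.

    Handles only the syntax used in PS00237: residue sets [..],
    exclusions {..}, the wildcard x, single residues and repeats (n) or (n,m).
    """
    regex_parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional repeat count
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core.lower() == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        # [..] sets and single residues are already valid regex
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        regex_parts.append(core)
    return "".join(regex_parts)

PS00237 = ("[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-"
           "[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].")
GPCR_REGEX = re.compile(prosite_to_regex(PS00237))

toy_peptide = "MKGSLAALAALSIDRYAAVKK"   # made-up peptide, not a real receptor
hit = GPCR_REGEX.search(toy_peptide)
print(hit.start(), hit.group())
```

Matching is all-or-nothing: changing the single conserved R in the toy peptide to K removes the hit entirely.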
Diagnostic limitations of PROSITE
ID   G_PROTEIN_RECEP_F1_1; PATTERN.
AC   PS00237;
DT   APR-1990 (CREATED); NOV-1997 (DATA UPDATE); SEP-2004 (INFO UPDATE).
DE   G-protein coupled receptors family 1 signature.
PA   [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-
PA   [GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].
NR   /RELEASE=44.6,159201;
NR   /TOTAL=1622(1621); /POSITIVE=1530(1529); /UNKNOWN=0(0);
NR   /FALSE_POS=92(92); /FALSE_NEG=261; /PARTIAL=61;
• This represents an apparent 22% error rate
– the actual rate is probably higher
• Thus, a match to a pattern is not necessarily true
– & a mis-match is not necessarily false!
• False-negatives are a fundamental limitation to this
type of pattern matching
– if you don't know what you're looking for, you'll never know
you missed it!
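The 22% figure can be reproduced from the NR lines of the PROSITE entry. The sketch below assumes "apparent error rate" means wrong calls (false positives plus false negatives) divided by the total number of matches. That definition is an inference from the numbers, not something the entry states. Precision and recall are added for comparison.

```python
# Counts copied from the NR lines of PS00237 (release 44.6)
total_matches = 1622     # /TOTAL
true_positives = 1530    # /POSITIVE
false_positives = 92     # /FALSE_POS
false_negatives = 261    # /FALSE_NEG

# Assumed definition: (FP + FN) relative to all reported matches
apparent_error_rate = (false_positives + false_negatives) / total_matches

# Standard diagnostic measures, for comparison
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"apparent error rate: {apparent_error_rate:.1%}")
print(f"precision: {precision:.1%}, recall: {recall:.1%}")
```

Only false negatives that someone has already annotated are counted, which is why the actual rate is probably higher.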
Where do motifs (fingerprints) fit in?
(fingerprints are hierarchical)
TM domain
loop region
TM domain
Rhodopsin-like superfamily, family
& subtype GPCRs in PRINTS
Attwood, TK (2001) A compendium of specific motifs for diagnosing GPCR subtypes.
TiPS, 22(4), 162-165.
Searching PRINTS - FingerPRINTScan
Scordis, P, Flower, DR & Attwood, TK (1999) FingerPRINTScan: intelligent
searching of the PRINTS motif database. Bioinformatics, 15, 523-524.
• GPCR fingerprints are embedded in PRINTS
– allows diagnosis of GPCR mosaics
Visualising fingerprints
Attwood, TK & Findlay, JBC (1993) Design of a discriminating fingerprint
for G-protein-coupled receptors. Protein Eng., 6(2), 167–176.
N
C
Diagnosing partial matches
• Missed by PROSITE
– wasn’t annotated as a FN
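A fingerprint is a set of motifs, so a sequence that matches only some of them can still be reported as a partial match. The toy sketch below builds unweighted residue-frequency matrices from seed motifs and scans two sequences. It is not the FingerPRINTScan algorithm: the motifs, sequences, scoring threshold and function names are all hypothetical, and motif order and spacing are ignored.

```python
from collections import Counter

def frequency_matrix(seed_motif):
    """Per-column residue counts from aligned, equal-length seed segments."""
    return [Counter(column) for column in zip(*seed_motif)]

def best_hit(sequence, matrix):
    """Slide the matrix along the sequence; return (best score, start index)."""
    width = len(matrix)
    scores = [
        (sum(col[res] for col, res in zip(matrix, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores) if scores else (0, -1)

def scan_fingerprint(sequence, fingerprint, min_fraction=0.6):
    """Return (motif name, start) for motifs scoring >= min_fraction of their maximum."""
    matched = []
    for name, seed in fingerprint:
        matrix = frequency_matrix(seed)
        top_score = sum(max(col.values()) for col in matrix)
        score, start = best_hit(sequence, matrix)
        if score >= min_fraction * top_score:
            matched.append((name, start))
    return matched

# Hypothetical three-motif fingerprint (toy seed alignments)
toy_fingerprint = [
    ("motif1", ["GNLLVI", "GNVLVI", "GNALVI"]),
    ("motif2", ["DRYLAI", "DRYVAV", "ERYLAI"]),
    ("motif3", ["NPIIYA", "NPLIYT", "NPVIYA"]),
]

full_match = scan_fingerprint("MMGNLLVISSSDRYLAIKKKNPIIYAFF", toy_fingerprint)
partial_match = scan_fingerprint("MMGNLLVISSSDRYLAIKKKQQQQQQFF", toy_fingerprint)
print(len(full_match), "of 3 motifs;", len(partial_match), "of 3 motifs")
```

The second sequence has lost its third motif but is still reported with two of three, whereas a single-pattern method would simply return no match.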
An integrated approach
Mulder, NJ, Apweiler, R, Attwood, TK, Bairoch, A et al. (2007) New developments in InterPro. NAR, 35, D224-8.
• To simplify sequence analysis,
the family dbs were integrated
within a unified annotation
resource – InterPro
– initial partners were PRINTS,
PROSITE, profiles & Pfam
• now many more partners
– linked to its satellite dbs
• but lags behind their coverage
– by Oct 2007, it had 14,768
entries & covered 76% of
UniProtKB
• major role in fly & human
genome annotation
InterPro – method comparison
Where has this got us?
Understanding the tools
…estimating significance
• How do we know what to believe?
• Let’s explore some of the difficulties that arise when
pair-wise search tools (BLAST & FastA) & family-based methods are used naïvely
– these examples caution us to think about what the results
actually mean in biological terms.....
Identifying sequence similarity
• GPCRs present many challenges for in silico
functional analysis
• Several signature-based methods now available
– with different areas of optimum application
• Yet naïve, pair-wise similarity searching has been the
mainstay of functional annotation efforts
– it allows us to identify/quantify relationships between
sequences
• But quantifying similarity between sequences is not
the same as identifying their functions
Problems with pairwise similarity tools
Gaulton, A & Attwood, TK (2003) Bioinformatics approaches for the classification of G protein-coupled receptors.
Current Opinion in Pharmacology, 3, 114-120.
• For identifying precise families to which receptors belong
& the ligands they bind, pair-wise tools are limited
– at what level of seq ID is ligand specificity conserved?
• some GPCRs with 25% ID share a common ligand;
• others, with greater levels, don’t…
• It may be impossible to tell from BLAST if an orphan
belongs to a known family (the top hit), or if it will bind a
novel ligand
– e.g., for the now de-orphaned UR2R, BLAST indicates most
similarity to the type 4 SSRs, yet it is known to bind a different
(related) ligand
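The "% ID" quoted for a pair of receptors depends on how it is counted. A minimal sketch, assuming the two sequences are already aligned with "-" as the gap character and that identity is counted over columns where neither sequence has a gap (other conventions exist). The function name and example strings are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over ungapped alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns)

# Toy aligned pair (made up): 5 identical residues over 6 ungapped columns
print(round(percent_identity("ACDE-FG", "ACDQKFG"), 1))
```

A single number like this says nothing about where the identities fall, which is why it cannot by itself indicate whether ligand specificity is conserved.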
When is a GPCR not an SSR?
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot
120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002

Db  AC      Description                                              Score  E-value
sp  Q9UKP6  Q9UKP6 Orphan receptor [Homo sapiens...                    782  0.0
sp  P31391  SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...  167  3e-41
sp  O43603  GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...  147  4e-35
sp  P30872  SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...  144  3e-34
sp  P32745  SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...  140  3e-33
sp  P35346  SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...  140  6e-33
sp  P30874  SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...        134  3e-31
sp  P30874  SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...  134  3e-31
sp  P48145  GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...  133  7e-31
sp  O60755  GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...  132  2e-30
sp  P41143  OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...  128  2e-29
sp  P35372  SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...        125  1e-28
sp  P35372  OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...  125  1e-28
When is a GPCR not an SSR?
…when it’s a UR2R
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot
120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002

Db  AC      Description                                              Score  E-value
sp  Q9UKP6  UR2R_HUMAN Urotensin II receptor (UR-II-R) [GPR14] [Ho...  782  0.0
sp  P31391  SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...  167  3e-41
sp  O43603  GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...  147  4e-35
sp  P30872  SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...  144  3e-34
sp  P32745  SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...  140  3e-33
sp  P35346  SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...  140  6e-33
sp  P30874  SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...        134  3e-31
sp  P30874  SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...  134  3e-31
sp  P48145  GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...  133  7e-31
sp  O60755  GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...  132  2e-30
sp  P41143  OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...  128  2e-29
sp  P35372  SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...        125  1e-28
sp  P35372  OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...  125  1e-28
[Figure: two plots, UR2R_HUMAN vs UROTENSIN2R and UR2R_HUMAN vs SOMATOSTANR; y-axis: %ID; x-axis: Residue Number (1 to 380)]
The trouble with top hits
• The most statistically significant hit is not always the most
biologically relevant
• Yet many rule-based ‘expert systems’ still rely on top
BLAST or FastA hits to make their diagnoses
• BLAST/FastA ‘see’ generic similarity & not the often-subtle
differences that constitute the functional determinants
between closely-related receptor families & subtypes
• Failure to appreciate this fundamental point has generated
numerous annotation errors in our databases
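The kind of rule that produces such errors is easy to write down. The sketch below is a deliberately naïve "expert system", a hypothetical function not taken from any real annotation pipeline. It transfers the description of the best non-self hit, using the first four rows of the BLAST table for Q9UKP6 shown earlier.

```python
# (accession, description, score, E-value) rows from the BLAST output above
hits = [
    ("Q9UKP6", "UR2R_HUMAN Urotensin II receptor", 782, 0.0),
    ("P31391", "SSR4_HUMAN Somatostatin receptor type 4", 167, 3e-41),
    ("O43603", "GALS_HUMAN Galanin receptor type 2", 147, 4e-35),
    ("P30872", "SSR1_HUMAN Somatostatin receptor type 1", 144, 3e-34),
]

def naive_top_hit_annotation(hits, query_accession):
    """Transfer the description of the most significant hit that is not the query."""
    candidates = [hit for hit in hits if hit[0] != query_accession]
    best = min(candidates, key=lambda hit: hit[3])   # lowest E-value wins
    return best[1]

# While Q9UKP6 was still an orphan, this rule would have called it an SSR
print(naive_top_hit_annotation(hits, "Q9UKP6"))
```

The rule returns the somatostatin receptor type 4 description, although the receptor is now known to bind urotensin II. The statistics are fine, but the biological conclusion is wrong.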
Misleading annotation via FastA
μ-opioid receptor
κ-opioid receptor
μ-opioid receptor true
Misleading results from BLAST
• As we’ve seen, it’s tempting to use top hits from BLAST
or FastA results to classify unknown proteins
– but this may lead us (& especially computer programs) to false
functional conclusions
• PSI-BLAST is more sensitive than BLAST, because it
creates a profile from hits above a given threshold
– but this too can cause problems
– let’s take a closer look
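Profile dilution can be illustrated with column frequencies alone. This toy sketch is not PSI-BLAST's actual PSSM construction, which uses sequence weighting and pseudocounts. All sequences and names are hypothetical. It shows how a conserved column loses its signal once unrelated hits above the threshold are folded into the profile.

```python
from collections import Counter

def column_frequencies(alignment):
    """Relative residue frequencies for each column of an ungapped alignment."""
    depth = len(alignment)
    return [
        {residue: count / depth for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

# Hypothetical family segment with an invariant R in the second column
family = ["DRYLA", "DRYVA", "ERYLA", "DRYLS"]
# Hypothetical unrelated segments wrongly included in a later iteration
unrelated = ["KPGSE", "QTWNE", "HPGKT", "KSWQE"]

before = column_frequencies(family)
after = column_frequencies(family + unrelated)
print("R frequency in column 2:", before[1]["R"], "->", after[1]["R"])
```

Once the conserved position no longer dominates its column, the next search round rewards sequences resembling the contaminants as much as true family members.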
So, is UL78 a GPCR?
& if so, what sort?
What PSI-BLAST said
(profile dilution in action)
What GeneQuiz said…
a thrombin receptor
What GeneQuiz said later…
Overview of results
pair-wise & family-based methods
What is UL78?
Tool: result (slide columns: No hit, Poor hit, Significant hit)
BLAST: GPCRs in list
PSI-BLAST: thrombin receptor; chemokine & opioid receptors
PROSITE profile: GPCR
Pfam:
PRINTS:
Blocks-PRINTS: GPCR
GeneQuiz: thrombin receptor; C5A receptor
Bioinformatics tools, alone, cannot tell us!
So, beware top hits
…but also beware bottom hits!
Let us now compare & contrast some InterPro results
with those of its source dbs…
Rhodopsin-like superfamily
GPCRs in InterPro 2005
IPR000276 GPCR_Rhodopsn: 7752 proteins
PS50262 G_PROTEIN_RECEP_F1_2: 7702 proteins
PF00001 7tm_1: 7064 proteins
PS00237 G_PROTEIN_RECEP_F1_1: 6527 proteins
PR00237 GPCRRHODOPSN: 5821 proteins (don’t include partials)
Rhodopsin-like superfamily
GPCRs in the source databases
Pfam: FP ?, FN ?, U ?, TP ?, 8776 matches, 7064
PROSITE (profile): FP 3, FN 3, U 12, TP 1837 matches, 7702
PROSITE (regex): FP 92, FN 261, U 0, TP 1530 matches, 6527
PRINTS: FP 0, FN ?, U 0, TP 1154 matches, 5821
>2165 updated
Rhodopsin-like superfamily
GPCRs in InterPro 2007
IPR000276 GPCR_Rhodopsn: 16,845 proteins
PS50262 G_PROTEIN_RECEP_F1_2: 16,714 proteins
PF00001 7tm_1: 15,712 proteins
PR00237 GPCRRHODOPSN: 13,405 proteins
PS00237 G_PROTEIN_RECEP_F1_1: 13,723 proteins
No human curator has time to validate all these matches…
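To see how far the member signatures diverge, the 2007 counts can be expressed as a fraction of the InterPro entry total. The sketch assumes that the IPR000276 count (16,845) is the appropriate denominator, i.e. that it represents the union of the member signatures' matches. The variable names are illustrative.

```python
# Protein counts from the InterPro 2007 slide
entry_total = 16845   # IPR000276 GPCR_Rhodopsn
signature_counts = {
    "PS50262 (PROSITE profile)": 16714,
    "PF00001 (Pfam)": 15712,
    "PS00237 (PROSITE regex)": 13723,
    "PR00237 (PRINTS)": 13405,
}

# Fraction of the entry's proteins matched by each signature
coverage = {name: count / entry_total
            for name, count in signature_counts.items()}

for name, fraction in sorted(coverage.items(), key=lambda item: -item[1]):
    missed = entry_total - signature_counts[name]
    print(f"{name}: {fraction:.1%} of entry, {missed} proteins not matched")
```

The counts alone cannot say which of the unmatched proteins are false negatives of one method and which are false positives of another. That is precisely the validation problem.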
14,615 rhodopsin-like superfamily
GPCRs in Pfam?
Pfam: match Q6NV75/24-297
PROSITE (profile): no match
PROSITE (regex): no match (false negative)
PRINTS: no match
ClustalW: sequences too divergent to be aligned

ID   Q6NV75 PRELIMINARY; PRT; 609 AA.
AC   Q6NV75;
DT   05-JUL-2004 (TrEMBLrel. 27, Created)
DT   05-JUL-2004 (TrEMBLrel. 27, Last sequence update)
DT   05-JUL-2004 (TrEMBLrel. 27, Last annotation update)
DE   G protein-coupled receptor 153.
GN   Name=GPR153;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G.,
RA   Jones S.J., Marra M.A.;
RT   "Generation and initial analysis of more than 15,000 full-length
RT   human and mouse cDNA sequences.";
RL   Proc. Natl. Acad. Sci. U.S.A. 99:16899-16903(2002).
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.;
RL   Submitted (MAR-2004) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; BC068275; AAH68275.1; -.
DR   GO; GO:0004872
DR   InterPro; IPR000276; GPCR_Rhodpsn.
DR   Pfam; PF00001; 7tm_1; 1.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
KW   Receptor
SQ   SEQUENCE 609 AA; 65341 MW; E525CC7F60D0891C CRC64;
     MSDERRLPGS AVGWLVCGGL SLLANAWGIL SVGAKQKKWK PLEFLLCTLA ATHMLNVAVP
     IATYSVVQLR RQRPDFEWNE GLCKVFVSTF YTLTLATCFS VTSLSYHRMW MVCWPVNYRL
     SNAKKQAVHT VMGIWMVSFI LSALPAVGWH DTSERFYTHG CRFIVAEIGL GFGVCFLLLV
     GGSVAMGVIC TAIALFQTLA VQVGRQADHR AFTVPTIVVE DAQGKRRSSI DGSEPAKTSL
     QTTGLVTTIV FIYDCLMGFP VLVVSFSSLR ADASAPWMAL CVLWCSVAQA LLLPVFLWAC
     DRYRADLKAV REKCMALMAN DEESDDETSL EGGISPDLVL ERSLDYGYGG DFVALDRMAK
     YEISALEGGL PQLYPLRPLQ EDKMQYLQVP PTRRFSHDDA DVWAAVPLPA FLPRWGSGED
     LAALAHLVLP AGPERRRASL LAFAEDAPPS RARRRSAESL LSLRPSALDS GPRGARDSPP
     GSPRRRPGPG PRSASASLLP DAFALTAFEC EPQALRRPPG PFPAAPAAPD GADPGEAPTP
     PSSAQRSPGP RPSAHSHAGS LRPGLSASWG EPGGLRAAGG GGSTSSFLSS PSESSGYATL
     HSDSLGSAS
//
GPCR?
Beware top & bottom hits
…but also beware simplistic analysis
tools coupled with wet experiments!
Let’s finally look at how hydropathy profiles can
compel biologists to make strange deductions…
- & still get their results published in Science!
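A hydropathy profile is just a sliding-window average. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cut-off of 1.6, the values commonly quoted for spotting candidate membrane-spanning segments. The function names and toy sequences are illustrative, and this is not the analysis performed in the paper alluded to here.

```python
# Kyte-Doolittle hydropathy values per residue
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of each window of the given length along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score >= threshold]

# Toy sequences: a hydrophobic stretch flanked by lysines, and lysines only
print(candidate_tm_windows("K" * 5 + "L" * 19 + "K" * 5))
print(candidate_tm_windows("K" * 30))
```

Hydrophobic stretches are compatible with many folds, so a few peaks in such a profile are, on their own, weak evidence that a protein is a 7TM receptor.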
Pfam: Lanthionine synthetase C-like protein
PROSITE (profile): no match
PROSITE (regex): no match
PRINTS: no match
ClustalW: sequences too divergent to be aligned

ID   Q9C929_ARATH Unreviewed; 401 AA.
AC   Q9C929;
DT   01-JUN-2001, integrated into UniProtKB/TrEMBL.
DT   01-JUN-2001, sequence version 1.
DT   24-JUL-2007, entry version 23.
DE   Putative G protein-coupled receptor; 80093-78432.
GN   Name=F14G24.19; OrderedLocusNames=At1g52920;
OS   Arabidopsis thaliana (Mouse-ear cress).
OC   Eukaryota; Viridiplantae; Streptophyta; ... Arabidopsis.
OX   NCBI_TaxID=3702;
RN   [1]
RP   NUCLEOTIDE SEQUENCE.
RA   Lin X., Kaul S., Town C.D., Benito M., Creasy T.H., Haas B.J., Wu D.,
RA   Maiti R., Ronning C.M., Koo H., Fujii C.Y., Utterback T.R.,
RA   Barnstead M.E., Bowman C.L., White O., Nierman W.C., Fraser C.M.;
RT   "Arabidopsis thaliana chromosome 1 BAC F14G24 genomic sequence.";
RL   Submitted (DEC-1999) to the EMBL/GenBank/DDBJ databases.
RN   [2]
RP   NUCLEOTIDE SEQUENCE.
RA   Town C.D., Kaul S.;
RL   Submitted (JAN-2001) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; AC019018; AAG52264.1; -; Genomic_DNA. [EMBL / GenBank / DDBJ]
DR   PIR; E96570; E96570.
DR   UniGene; At.66935; -.
DR   GenomeReviews; CT485782_GR; AT1G52920.
DR   KEGG; ath:At1g52920; -.
DR   TAIR; At1g52920; -.
DR   GO; GO:0004872; F:receptor activity; IEA:UniProtKB-KW.
DR   InterPro; IPR007822; LANC_like.
DR   InterPro; Graphical view of domain structure.
DR   Pfam; PF05147; LANC_like; 1.
KW   Receptor.
SQ   SEQUENCE 401 AA; 45284 MW; C9D3BF8CC8F0FE0B CRC64;
     MPEFVPEDLS GEEETVTECK DSLTKLLSLP YKSFSEKLHR YALSIKDKVV WETWERSGKR
     VRDYNLYTGV LGTAYLLFKS YQVTRNEDDL KLCLENVEAC DVASRDSERV TFICGYAGVC
     ALGAVAAKCL GDDQLYDRYL ARFRGIRLPS DLPYELLYGR AGYLWACLFL NKHIGQESIS
     SERMRSVVEE IFRAGRQLGN KGTCPLMYEW HGKRYWGAAH GLAGIMNVLM HTELEPDEIK
     DVKGTLSYMI QNRFPSGNYL SSEGSKSDRL VHWCHGAPGV ALTLVKAAQV YNTKEFVEAA
     MEAGEVVWSR GLLKRVGICH GISGNTYVFL SLYRLTRNPK YLYRAKAFAS FLLDKSEKLI
     SEGQMHGGDR PFSLFEGIGG MAYMLLDMND PTQALFPGYE L
//
GPCR?
[Figure: GPCR signalling network (repeated)]
Remember
Computers don’t do biology!
They do sums (quickly) & crude string matching
Seeking deeper functional insights
Attwood, TK, Croning, MD & Gaulton, A (2002) Deriving structural and functional insights from a ligand-based
hierarchical classification of G protein-coupled receptors. Protein Eng., 15, 7-12.
• S’family, family & subtype motifs have different locations
• If s’family motifs define the common scaffold, hypothesis:
– family motifs relate to ligand binding?
– subtype motifs relate to G protein coupling?
– powerful tools for subtyping & potentially de-orphaning GPCRs
Locations of ligand-binding residues & motif distribution
Locations of G protein-coupling residues & distribution of motifs
G protein coupling regions & # of
families mapping to each region
Subtype motifs & # of fingerprints
mapping to each region
Seeking deeper functional insights?
GPCR superfamily
Muscarinic receptors
Muscarinic receptor M5
• Clearly, many family- & subtype motifs are simply in the
‘wrong’ place for the initial hypothesis to be true
Refining the hypothesis
• Besides, it’s not that simple
– only part of the answer
• Need to consider that GPCRs don’t function in isolation
– their functions are modulated via interactions with other proteins
• Also, the phenomenon of dimerisation challenges the
view of the GPCR monomer as functional unit
– many GPCRs exist as homo- & heterodimers
• Such observations demand a more systematic analysis of
motifs & their likely functional roles
Oligomerisation & protein-protein interaction
residues/regions
A pilot study with adrenergic, bradykinin & dopamine receptors
residues involved in ligand binding
residues involved in G protein coupling
residues involved in protein-protein interaction
residues involved in oligomerisation
family-level motifs
subfamily-level motifs
Where next?
• Based on location, some family-level motifs couldn’t
be involved in ligand binding & some subtype-level
motifs couldn’t be involved in G protein coupling
– clearly, 3D location must be taken into account
• functional correlations would then be stronger
• The remaining motifs are likely to be involved in other
molecular interactions
– e.g., dimerisation, effector proteins….(early results promising)
• this will help us to build a knowledge-based system to help
suggest the likely functional roles for family- & subtype-level
motifs in future
Conclusions
• There are many barriers to success for the jobbing
bioinformatician, e.g.:
– not fully understanding the processes we’re trying to model
& predict (e.g., protein folding)
– the dynamic nature of biological data
– not having been rigorous in the way we define &/or describe
biology/biological processes in the literature
– the volume of data, data heterogeneity
– maintenance of data, propagation of errors…
• Possibly the largest hurdle is that computers are
number crunchers
– they don’t do biology, & trying to teach them is hard
– & the harder we try, the clearer it is how naïve we’ve been
Conclusions
• In silico functional annotation requires several dbs to be
searched & several tools to be used
– different methods provide different perspectives
– dbs aren’t complete & their contents don’t fully overlap
• The more dbs searched, the harder it is to interpret results
• The more computers are involved in automating annotation,
the greater the need for collaboration
– especially between s/w developers, annotators & ‘wet’
experimentalists
• The more data we have, the more rigorous we must be in
thinking/writing if we are to make sense of the complexities
Conclusions
Flower DR & Attwood, TK (2004) Integrative bioinformatics for functional genome annotation: trawling for G
protein-coupled receptors. Semin Cell Dev Biol., 15(6), 693-701.
• For GPCRs, there are many analysis tools available
– BLAST, FastA, family databases, modelling tools, etc.
• We must understand the limitations of the methods
– no method is infallible or able to replace the need for biological validation
– use all available resources & understand their problems – none is best!
• Used wisely, bioinformatics tools are useful
– BLAST/FastA offer broad brush strokes, motif-methods add fine detail
– together, they facilitate receptor characterisation & prediction of ligand
specificity, & allow identification of novel ligand-binding, G protein-coupling or other likely molecular interaction motifs
• We are a long way from having reliable tools for deducing
GPCR function & structure from sequence
– but with the right approach, there is hope