Domain-SLiM mining from High Throughput Protein Interaction Data

Hugo Willy
August 19, 2010
Introduction to SLiM
• SLiM stands for (protein) Short Linear Motif.
• As the name suggests, it is a short linear stretch
of a protein sequence that is recognized by
another protein for binding.
• SLiMs average 8-12 amino acids in length, though
some are as short as three amino acids.
• They are currently one of the special mechanisms
by which a protein recognizes its interaction partner.
Protein Interaction in general
• Some proteins only function as part of a
complex. They have to be in a certain
combination. These are called obligate
complexes.
• Their binding surfaces are usually large to
provide strong chemical interactions.
Protein Interaction in general (2)
• On the other hand, some complexes are formed
only "on demand". Once the task is done, they
dissociate. These are called transient
complexes.
• The interaction surface in this type of interaction
is generally smaller.
• SLiM-based interactions are among the transient
ones.
Picture of non-linear interface
(common case in obligate complexes)
The interaction region
is non-linear.
Picture of Linear interface
The bound protein chain
is linear on the
interface of the
partner.
Protein domains recognizing
SLiMs
• In reality, the task of recognizing a SLiM is often
performed by a specialized protein domain.
• One of the best-known examples is the
SH3 domain, which recognizes the P..P motif, where
P is proline and "." is any amino acid. WW domains
recognize the PP.Y motif.
• These SLiMs, along with their functions (or
domain associations), are listed in databases
such as Eukaryotic Linear Motif (ELM) [1] and
MiniMotif (MnM) [2].
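As a rough illustration of how motifs written this way are matched against a sequence, the sketch below treats P..P and PP.Y as regular expressions. The dictionary names, the function name and the toy sequence are made up for this example; real ELM entries are more specific patterns than these two.

```python
import re

# Illustrative patterns only: the two motif strings come from the slide;
# the dictionary keys are hypothetical labels chosen for this sketch.
MOTIFS = {
    "SH3_PxxP": "P..P",
    "WW_PPxY": "PP.Y",
}

def find_motifs(sequence, motifs=MOTIFS):
    """Return {motif name: [0-based start positions]}, overlapping hits included."""
    hits = {}
    for name, pattern in motifs.items():
        # A zero-width lookahead lets overlapping occurrences be reported.
        regex = re.compile("(?=(" + pattern + "))")
        hits[name] = [m.start() for m in regex.finditer(sequence)]
    return hits

# Hypothetical toy sequence, not a real protein.
toy = "MAPPLPPRPPAYGS"
print(find_motifs(toy))
```

Note that such short patterns match very often by chance, which is why the occurrence statistics discussed later matter.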
Methods of finding SLiMs in
proteins
• The SLiMs listed in the two databases are
mostly the results of experimental procedures such as
mutagenesis and phage display.
• These procedures are laborious and expensive.
Computational Method to detect
SLiMs
• From sequence-based data (the focus of this
talk)
• From structural data: earlier this year, we
published SLiMDiet [3], which is currently the
most comprehensive listing of SLiMs from the
PDB.
Sequence-based SLiM detection
• Protein sequence based
– Given a set of grouped sequences, find motifs that
occur in unrelated sequences.
– Examples: DILIMOT [4], SLiMDisc [5],
SLiMFinder [6]
• Protein interaction based
– Find correlated motifs that are over-represented
in interacting proteins.
– Examples: D-STAR [7], MotifCluster [8],
SLIDER [9]
Protein Sequence Based Methods
• These methods rely on motif occurrences in unrelated sequences.
• Protein domains may need to be removed from the
motif search space because of their sequence similarity.
• The grouping of the sequences can be manual,
by selecting known sequences
with a certain property. For example, proteins
that are exported from the cell can be
grouped to find the motif that is responsible for
the export mechanism.
• The grouping can also be automated, using protein
domain information or Gene Ontology (GO) annotation.
Protein Sequence Based Methods
(2)
• Once the grouping is done, the motif is mined
using a standard motif search program such as MEME or
TEIRESIAS.
• Because of MEME's speed and its rigid requirement on
motif length, TEIRESIAS is usually
the program of choice (it can start from a given motif
length and try to combine the motifs into
longer ones).
• TEIRESIAS uses L,W motifs: motifs with L fixed
positions within a window of length W.
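To make the L,W rule concrete, here is a small sketch of a density check, assuming the usual reading of the rule: any stretch of the pattern that starts and ends on a fixed residue and contains exactly L fixed residues may span at most W positions. This is only the rule itself, not the TEIRESIAS search; the function name and the "." wildcard convention are my own choices.

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Check the L,W density rule (assumed reading, see lead-in).

    Every run of L consecutive fixed residues must fit in a window of W
    positions, and the pattern must start and end on a fixed residue.
    """
    if not pattern or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    fixed = [i for i, c in enumerate(pattern) if c != wildcard]
    if len(fixed) < L:
        return False
    # Slide over consecutive groups of L fixed positions and measure the span.
    return all(
        fixed[i + L - 1] - fixed[i] + 1 <= W
        for i in range(len(fixed) - L + 1)
    )

print(is_lw_pattern("P..P", L=2, W=4))   # two fixed residues spanning 4 positions
print(is_lw_pattern("P...P", L=2, W=4))  # the same two residues spanning 5 positions
```

A larger W allows more wildcards between fixed residues, which enlarges the search space accordingly.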
Protein Sequence Based Methods
(3)
• The problem with this approach is that it relies too
much on the initial grouping.
• The motif must be truly over-represented in the group.
• All papers in this line of work compare their
performance on the ELM set (a dataset of
curated sequences which are known to contain
ELM motifs).
Protein Sequence Based Methods
(4)
• They also found some significant motifs in
groups of proteins known to interact with a certain
protein domain which is known to take part in such
SLiM interactions.
• DILIMOT was published in PLoS Biology, as the authors
managed some biological validation.
Interaction based methods
• To be precise, none of the interaction-based
methods to date was specifically
designed to find SLiMs.
• Most of them find "correlated motif pairs".
• These are pairs of motifs which occur
consistently more frequently in interacting
proteins than expected under some background
model.
• Examples: D-STAR, MotifCluster and SLIDER
Interaction based methods (2)
• These methods rely solely on the density of
interactions between the two sets of proteins that
contain each motif of the pair.
• D-STAR and SLIDER use chi-square
scoring, while MotifCluster uses hypergeometric
scoring.
• As I shall show later, they may not be suitable for
finding SLiMs: by design they find
interaction motifs, which may not be the binding
motifs themselves.
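The two kinds of score can be sketched generically as follows. This is not the exact formula of D-STAR, SLIDER or MotifCluster; it only assumes a simple setting where, out of N candidate protein pairs, K interact, n pairs are covered by the motif pair, and k of those covered pairs interact. All function and parameter names are chosen for this illustration.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k interacting pairs among the
    n pairs covered by the motif pair, when K of all N pairs interact."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def chi_square_2x2(k, n, K, N):
    """Chi-square statistic of the 2x2 table (covered or not) x (interacting or not)."""
    observed = [[k, n - k], [K - k, (N - n) - (K - k)]]
    row_totals = [n, N - n]
    col_totals = [K, N - K]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / N
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: 40 of 1000 pairs interact; the motif pair covers 20 pairs, 8 interacting.
print(hypergeom_tail(8, 20, 40, 1000))
print(chi_square_2x2(8, 20, 40, 1000))
```

Both scores only look at interaction density, which is exactly why a dense but non-binding motif pair can still score well.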
My current attempt - SLIMMER
• I learnt that most of the time SLiMs are bound
by a non-linear interface.
• Thus, it is not very realistic to expect both
sides of the interface to contain linear motifs.
• This was pointed out by one of D-STAR's
reviewers.
• So, I try to find correlated motifs where one of
them is a protein domain, which is by definition
non-linear (domains are distinct protein folds in 3D).
SLIMMER (2)
• I basically combine the good ideas from many
programs to accomplish this.
• The strength of correlated motifs is that they can
find motifs that seem insignificant (judged by
their sequence occurrence) by using the fact
that, when they do occur, they interact intensively
with the partner motif.
• Correlated motifs use over-representation of
interaction occurrences, as opposed to
sequence occurrences.
SLIMMER (3)
• However, the tricks of sequence-based methods
can also be applied.
• They require occurrences of the SLiM in non-homologous
sequences (which can be
considered independent occurrences).
• This non-homology is never considered in D-STAR, MotifCluster and SLIDER.
• We should consider only non-homologous
interactions when we count the occurrences of
the motif pair.
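One simple way to count only non-homologous interactions is sketched below. It assumes every protein has already been assigned to a homology cluster (for example by some sequence clustering step, which is not shown); interactions whose two partners fall into the same pair of clusters are then counted once. The function name and the cluster labels are hypothetical.

```python
def count_nonhomologous_interactions(interactions, family_of):
    """Count interactions after collapsing homologous proteins.

    interactions: iterable of (domain_protein, motif_protein) pairs
    family_of:    dict mapping each protein to its homology cluster id (assumed given)
    """
    seen = set()
    for domain_protein, motif_protein in interactions:
        # Two interactions between the same pair of clusters count as one.
        seen.add((family_of[domain_protein], family_of[motif_protein]))
    return len(seen)

# Hypothetical toy data: d1 and d2 are homologs, s1 and s2 are homologs.
families = {"d1": "F1", "d2": "F1", "s1": "G1", "s2": "G1", "s3": "G2"}
pairs = [("d1", "s1"), ("d1", "s2"), ("d2", "s3")]
print(count_nonhomologous_interactions(pairs, families))
```

Here three raw interactions shrink to two independent ones, so a motif pair supported only by a family of close homologs is no longer over-counted.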
SLIMMER (4)
• The SLiM itself must occur more often than
expected at random. MotifCluster
uses the binomial distribution to compute the
probability of seeing a motif M k times in the
sequence set (this is a threshold approach).
• I also tried another approach, where I combine
the binomial p-value of the motif occurrence with
the hypergeometric p-value of the interaction
occurrence.
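The two ingredients can be sketched as below. The binomial tail gives the chance of a motif appearing in at least k of n sequences when it appears in any one sequence with probability p. How the two p-values are combined is not specified on the slide; Fisher's method is used here purely as an assumption, and it treats the two p-values as independent. Function names are illustrative.

```python
from math import comb, log

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): motif seen in at least k of n sequences."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def fisher_combine(p1, p2):
    """Fisher's method for two independent p-values (an assumed choice).

    The statistic -2*ln(p1*p2) follows a chi-square distribution with 4
    degrees of freedom, whose tail has the closed form used below.
    """
    product = p1 * p2
    if product == 0.0:
        return 0.0
    return product * (1.0 - log(product))

# Hypothetical numbers: motif in 6 of 50 sequences with chance rate 0.02,
# combined with an interaction (hypergeometric) p-value of 0.001.
p_occurrence = binomial_tail(6, 50, 0.02)
print(p_occurrence, fisher_combine(p_occurrence, 0.001))
```

A threshold approach would instead require each p-value to pass its own cutoff separately.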
SLIMMER (5)
• Current results: SLIMMER performs better than all
available methods, and it is also fast.
• I am still implementing a better background
model to deal with low-complexity regions,
using a simple 3rd- or 4th-order Markov model.
• I am also in the middle of trying a motif model that
allows choices like [LIVM], [FWY] or [KRH].
• Currently the only program that allows these is
SLiMFinder, and it is very slow and inaccurate
for now.
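A minimal sketch of such a k-th order Markov background is given below, assuming add-one smoothing over the 20 standard amino acids. It is not the SLIMMER implementation; the function names and the smoothing choice are assumptions made for illustration. The idea is that a window inside a low-complexity region gets a high background probability, so a motif found there is less surprising.

```python
from collections import defaultdict
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_markov(sequences, order=3, pseudocount=1.0):
    """Count residue frequencies after each length-`order` context and
    return a function giving smoothed log P(residue | context)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, residue):
        ctx = counts.get(context, {})  # unseen contexts fall back to the pseudocounts
        total = sum(ctx.values()) + pseudocount * len(AMINO_ACIDS)
        return log((ctx.get(residue, 0.0) + pseudocount) / total)

    return log_prob

def window_log_prob(log_prob, seq, start, length, order=3):
    """Background log-probability of seq[start:start+length]; needs start >= order."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(start, start + length))

# Hypothetical toy training set with one low-complexity sequence.
lp = train_markov(["AAAAAAAA"], order=2)
print(window_log_prob(lp, "AAAAAAAA", 2, 3, order=2))
```

With real data the model would be trained on the whole proteome, and the order would be limited by how many contexts can be estimated reliably.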
References
[1] P Puntervoll et al. ELM server: A new resource for investigating short
functional sites in modular eukaryotic proteins. Nucleic Acids Res.,
31(13):3625–3630, 2003.
[2] S Rajasekaran et al. Minimotif miner 2nd release: a database and web
system for motif search. Nucleic Acids Res., 37(Database issue):D185–190,
2009.
[3] W Hugo et al. SLiM on Diet: finding short linear motifs on domain interaction
interfaces in Protein Data Bank. Bioinformatics, 26(8):1036–1042, 2010.
[4] V Neduva et al. Systematic discovery of new recognition peptides mediating
protein interaction networks. PLoS Biol., 3(12):e405, 2005.
[5] N E Davey et al. SLiMDisc: short, linear motif discovery, correcting for
common evolutionary descent. Nucleic Acids Res., 34(12):3546–3554, 2006.
[6] R J Edwards et al. SLiMFinder: a probabilistic method for identifying
overrepresented, convergently evolved, short linear motifs in proteins. PLoS
ONE, 2(10):e967, 2007.
References (2)
[7] S H Tan et al. A correlated motif approach for finding short linear motifs from
protein interaction networks. BMC Bioinformatics, 7:502, 2006.
[8] H C Leung et al. Clustering-based approach for predicting motif pairs from
protein interaction data. J Bioinform Comput Biol., 7(4):701–716, 2009.
[9] P Boyen et al. SLIDER: Mining correlated motifs in protein-protein interaction
networks. In Proceedings of the 2009 Ninth IEEE International Conference on
Data Mining, pages 716–721, 2009.