Protein – Protein Interactions

Lisa Chargualaf
Simon Kanaan
Keefe Roedersheimer
Others: Dr. Izaguirre, Dr. Chen, Dr. Wuchty,
ChengBang Huang
What are proteins?

Basis of most living functions
 Building blocks of life
– Substrates
– Products
– Enzymes

One cell contains thousands of different
proteins; the human body contains 50 to 100
thousand proteins!
Proteins

Composed of sequences of amino acids
– Variations of 20 primary/basic amino acids

Rules governing structure:
– AAs close in the folded structure may/may not be close
in primary structure
– Hydrophobic residues generally buried in core;
hydrophilic are usually exposed
– Protein chains generally do not form knots
– Related proteins generally have similar structures

Similar structures can exist without having similar sequences
What is a “protein–protein”, P-P,
interaction and why is it important?



Derived from the nuclear material within a cell, proteins
fold and interact in intricate arrangements that provide
functionality to the components of a cell, which in turn
work cooperatively to form whole body systems.
Protein-protein interactions serve as the chemical basis of
all living organisms.
Understanding protein interactions helps us understand the
protein network.
What causes P-P interactions?

Several hypotheses exist about the driving force behind proteins interacting with each other:
– Primary sequence dictates interaction between attached functional groups
– Protein domains drive proteins to fold and interact as they do
What are protein domains?

Significant portions of proteins
Composed of distinct peptides
The key to intricate arrangements
Domains and Proteins




A single protein molecule can possess multiple domains,
causing difficulty in discovering a simple formula that
dictates the manner by which protein-protein interactions
occur.
Yet, certain affinities exist between certain protein domains
and are frequently seen in living organisms.
This drives our research that seeks to extrapolate the
mechanism of protein-protein interactions to focus on
domain-domain interactions as a factor.
The model system used for these proceedings is the yeast
cell, with several of its proteins serving as the test cases.
This is done using a protein family data bank available
online.
Our “Formula” dictating which
P-P interactions occur

A data bank gives a list of protein
interactions.
 A protein interaction, (P1, P2), is explained
by a domain pair, (D1, D2), if P1 includes
one domain and P2 includes the other.
 Find the minimum number of domain pairs
that explains the databank. Equivalent to
Minimum Set Cover problem.
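
The "explained by" rule above can be written as a small predicate. This is only a minimal sketch: the function name `explains` and the use of Python sets to hold each protein's domains are assumptions made for illustration, not something taken from the original implementation.

```python
def explains(domain_pair, p1_domains, p2_domains):
    """Return True if the domain pair could explain an interaction (P1, P2).

    The pair (D1, D2) explains the interaction when one protein contains
    one domain and the other protein contains the other domain.
    """
    d1, d2 = domain_pair
    return (d1 in p1_domains and d2 in p2_domains) or (
        d2 in p1_domains and d1 in p2_domains
    )


# Hypothetical proteins: P1 has only D2, P2 has D2 and D3.
p1 = {"D2"}
p2 = {"D2", "D3"}
can_explain = explains(("D2", "D3"), p1, p2)
cannot_explain = explains(("D3", "D4"), p1, p2)
```
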
Minimum Set Cover Problem

Given a collection of sets, the problem of finding the
smallest sub-collection whose union is equal to the union
of all the sets.
NP-hard problem (its decision version is NP-complete).
Why the Minimum Set of
Domains?

Let's look at the following case:
– P1 contains domain D2
– P2 contains domains D2 and D3
– P3 contains domains D2 and D4
– P4 contains domains D2 and D5

And let's assume the protein interactions are:
– P1 - P1
– P1 - P2
– P1 - P3
– P1 - P4

The P-P interactions can be explained by:
– (D2 - D2)
– (D2 - D3)
– (D2 - D4)
– (D2 - D5)

Or by just:
– (D2 - D2)
Mapping to MSC

Let
– P1 - P1 = 0
– P1 - P2 = 1
– P1 - P3 = 2
– P1 - P4 = 3

Each pair's interactions
– D2-D2 = {0,1,2,3}
– D2-D3 = {1}
– D2-D4 = {2}
– D2-D5 = {3}

This maps to the integer MSC problem with a global set of {0,1,2,3}
and subsets of {{0,1,2,3},{1},{2},{3}}.
The solution is D2-D2; larger problems are more difficult.
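
The mapping can be reproduced with a short Python sketch. The helper name `build_set_cover_instance` and the dictionary layout are assumptions made for illustration. Domain pairs are treated as unordered, and P1 is taken to contain only D2, as listed in the case above, so the pair covering interaction 3 comes out as (D2, D5).

```python
from itertools import product


def build_set_cover_instance(protein_domains, interactions):
    """Map each candidate domain pair to the interaction indices it could explain."""
    subsets = {}
    for idx, (p1, p2) in enumerate(interactions):
        for d1, d2 in product(protein_domains[p1], protein_domains[p2]):
            pair = tuple(sorted((d1, d2)))  # (D3, D2) and (D2, D3) are the same pair
            subsets.setdefault(pair, set()).add(idx)
    return subsets


protein_domains = {
    "P1": {"D2"},
    "P2": {"D2", "D3"},
    "P3": {"D2", "D4"},
    "P4": {"D2", "D5"},
}
# The list index of each interaction is its integer label (0..3).
interactions = [("P1", "P1"), ("P1", "P2"), ("P1", "P3"), ("P1", "P4")]

subsets = build_set_cover_instance(protein_domains, interactions)
universe = set(range(len(interactions)))
```

Here `subsets[("D2", "D2")]` equals the whole `universe`, which is why the single pair D2-D2 is a minimum cover.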
Implementation/Algorithm




This base algorithm consists of functions that can record the protein
structure and interaction information and store them into different data
structures.
It also builds a domain-domain matrix.
This matrix holds information about interacting domains. Each entry
in the matrix represents the number of times domains Di and Dj were
observed as the possible cause in different protein-protein interactions.
Example:
– P1: {D1, D2, D3} and P2: {D1, D5} interact.
– The candidate domain pairs for this interaction are:
(D1, D1), (D1, D5), (D2, D1), (D2, D5), (D3, D1) and (D3, D5).
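
One possible way to build such a domain-domain matrix is sketched below. It is an assumption-laden illustration, not the original code: the name `build_domain_matrix` is hypothetical, the matrix is stored sparsely in a `Counter` keyed by unordered domain pairs, and each pair is counted at most once per interaction.

```python
from collections import Counter
from itertools import product


def build_domain_matrix(protein_domains, interactions):
    """Count, for every domain pair, the interactions it could be the cause of."""
    matrix = Counter()
    for p1, p2 in interactions:
        # Use a set so a pair is counted once per interaction.
        pairs = {
            tuple(sorted(pair))
            for pair in product(protein_domains[p1], protein_domains[p2])
        }
        for pair in pairs:
            matrix[pair] += 1
    return matrix


# The example from the slide: P1 and P2 interact.
protein_domains = {"P1": {"D1", "D2", "D3"}, "P2": {"D1", "D5"}}
matrix = build_domain_matrix(protein_domains, [("P1", "P2")])
```

For this single interaction the matrix holds the six candidate pairs listed above, each with a count of 1.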
Exact Problems


In the worst case there are (# of domains)^2 domain interactions,
corresponding to the subsets.
The large number of protein interactions corresponds to the global set.

MSC is NP-hard; the exact solution requires considering all
combinations of subsets.
Computationally expensive, impractical for more than ~10 domains.
There are thousands in a real problem.
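
To make the cost concrete, here is a minimal brute-force sketch of an exact solver. It is not the original implementation; the function name and the toy subsets labelled A to D are made up for illustration. With n subsets it may examine up to 2^n - 1 combinations, which is why the exact approach stops being practical so quickly.

```python
from itertools import combinations


def exact_min_set_cover(universe, subsets):
    """Return a smallest list of subset names whose union covers the universe.

    Tries all combinations of size 1, then 2, and so on, so the running
    time is exponential in the number of subsets.
    """
    names = list(subsets)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(subsets[name] for name in combo))
            if covered >= universe:
                return list(combo)
    return None  # the subsets cannot cover the universe


# Hypothetical toy instance: no single subset covers everything.
universe = {0, 1, 2, 3, 4}
subsets = {"A": {0, 1, 2}, "B": {2, 3}, "C": {3, 4}, "D": {0, 4}}
cover = exact_min_set_cover(universe, subsets)
```
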
Implementation/Algorithm

The algorithm approximates the minimum set of domain pairs.
The algorithm needs to choose d-d pairs in an educated, not a
randomized, fashion.
This can be done using weight functions: each domain pair is given
a weight, and the pair with the largest weight is chosen.
Different Functions

Different weight functions were considered.
 Decided on looking at two for now:
– MSC
– MSC by probability

Also looked at running MSC twice, additionally adding
pairs with a high probability of interacting.
MSC

Assumption:
– The most commonly observed interacting domain pair among
the protein interactions is probably the cause of those
protein interactions.

While there are P-P interactions to be explained
{
– Choose the most commonly observed interacting domain pair Di-Dj.
– Remove Di-Dj:
 Remove all P-P interactions it explains from the data being observed
 Undo those P-P interactions' effect on the matrix
}
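
The loop above might look like the following in Python. This is a sketch under stated assumptions rather than the original code: the name `greedy_msc` is hypothetical, domain pairs are treated as unordered, ties are broken alphabetically, and the counts are recomputed from the remaining interactions on each pass instead of being decremented in a stored matrix (the selection it produces is the same).

```python
from itertools import product


def greedy_msc(protein_domains, interactions):
    """Greedy approximation: repeatedly pick the domain pair that could
    explain the most still-unexplained protein interactions."""

    def candidate_pairs(p1, p2):
        return {
            tuple(sorted(pair))
            for pair in product(protein_domains[p1], protein_domains[p2])
        }

    unexplained = set(interactions)
    chosen = []
    while unexplained:
        counts = {}
        for p1, p2 in unexplained:
            for pair in candidate_pairs(p1, p2):
                counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break  # remaining interactions involve proteins with no domains
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        # Drop every interaction the chosen pair explains.
        unexplained = {
            (p1, p2)
            for p1, p2 in unexplained
            if best not in candidate_pairs(p1, p2)
        }
    return chosen


protein_domains = {
    "P1": {"D2"},
    "P2": {"D2", "D3"},
    "P3": {"D2", "D4"},
    "P4": {"D2", "D5"},
}
interactions = [("P1", "P1"), ("P1", "P2"), ("P1", "P3"), ("P1", "P4")]
chosen = greedy_msc(protein_domains, interactions)
```
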
MSC by Probability

Assumption:
– Incorporate the absence of P-P interactions.

– Initialize the matrix just like MSC.
 Go through every element in the matrix and divide that entry by
the number of proteins that contain the first domain times
the number of proteins that contain the second domain.
 Each element now represents the estimated probability that domains
i and j interact.
– The weight function then chooses the highest probability in the
matrix, finds which protein interactions this domain pair explains,
removes their influence from the data, and performs the same
tasks again.
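
A possible reading of this weight function is sketched below. Several details are assumptions, since the slide leaves them open: the function name is hypothetical, the per-domain protein counts are computed once and kept fixed, the pair counts are recomputed from the remaining interactions each round, and the ratio is used only as a ranking score.

```python
from collections import Counter
from itertools import product


def greedy_msc_by_probability(protein_domains, interactions):
    """Greedy selection where each pair's count is divided by
    (# proteins with first domain) * (# proteins with second domain)."""
    # Number of proteins that contain each domain.
    domain_freq = Counter(d for doms in protein_domains.values() for d in doms)

    def candidate_pairs(p1, p2):
        return {
            tuple(sorted(pair))
            for pair in product(protein_domains[p1], protein_domains[p2])
        }

    unexplained = set(interactions)
    chosen = []
    while unexplained:
        counts = Counter()
        for p1, p2 in unexplained:
            counts.update(candidate_pairs(p1, p2))
        if not counts:
            break  # nothing left that a domain pair could explain

        def weight(pair):
            return counts[pair] / (domain_freq[pair[0]] * domain_freq[pair[1]])

        best = max(sorted(counts), key=weight)  # sorted() fixes tie order
        chosen.append(best)
        unexplained = {
            (p1, p2)
            for p1, p2 in unexplained
            if best not in candidate_pairs(p1, p2)
        }
    return chosen


# Hypothetical data: D1 is common but rarely involved in an interaction.
protein_domains = {
    "P1": {"D1", "D2"},
    "P2": {"D1", "D3"},
    "P3": {"D1"},
    "P4": {"D1"},
}
chosen = greedy_msc_by_probability(protein_domains, [("P1", "P2")])
```

In this toy case D1 appears in four proteins but only one interaction is observed, so D1-D1 scores 1/16 while D2-D3 scores 1 and is selected. That is the sense in which the absence of interactions is taken into account.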
Prediction

Input: a set of proteins with known structure.
Input: the set of domain pairs obtained from the
algorithm being observed.
Go through each interacting domain pair (Di, Dj).
Every protein containing domain Di is considered to
interact with every protein containing Dj.
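
The prediction step can be sketched as follows. The function name `predict_interactions` is hypothetical, and the sketch assumes interactions are unordered and that a protein may interact with itself.

```python
def predict_interactions(protein_domains, domain_pairs):
    """Predict that two proteins interact if some chosen domain pair
    has one domain in each of them."""
    proteins = sorted(protein_domains)
    predicted = set()
    for i, p in enumerate(proteins):
        for q in proteins[i:]:  # starts at p itself, so self-interactions count
            dp, dq = protein_domains[p], protein_domains[q]
            if any(
                (d1 in dp and d2 in dq) or (d2 in dp and d1 in dq)
                for d1, d2 in domain_pairs
            ):
                predicted.add((p, q))
    return predicted


protein_domains = {
    "P1": {"D2"},
    "P2": {"D2", "D3"},
    "P3": {"D2", "D4"},
    "P4": {"D2", "D5"},
}
predicted = predict_interactions(protein_domains, [("D2", "D2")])
```

With D2-D2 as the only chosen pair, every protein that contains D2 is predicted to interact with every other one, so the prediction can be much larger than the set of interactions that was used to choose the pair.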
Testing

Running the MSC approximation vs. exact MSC
on very small sets to see how close the
approximation is to the exact solution.
Testing

Building training data of different sizes using the
swiss pfam A database, among others.
Running the approximation algorithms on
these sets.
Running AM on the same sets.
Attempting to use sets of similar size to MLE
for comparison's sake.
Testing

Compare calculated P-P interactions with
observed interactions (number of matches,
false positives, and false negatives).
Calculate fold, specificity, and sensitivity in
order to compare to previous research.
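
A minimal sketch of the comparison is given below. The slides do not define the measures, so the formulas here are assumptions: sensitivity is taken as matches divided by observed interactions, and specificity as matches divided by predicted interactions (a convention elsewhere called precision). "Fold" is left out because its definition is not given. The function name `evaluate` is hypothetical.

```python
def normalize(pairs):
    """Treat (P1, P2) and (P2, P1) as the same interaction."""
    return {tuple(sorted(pair)) for pair in pairs}


def evaluate(predicted, observed):
    """Count matches, false positives and false negatives, and derive
    sensitivity and specificity under the assumed definitions."""
    predicted, observed = normalize(predicted), normalize(observed)
    matches = predicted & observed
    false_pos = predicted - observed
    false_neg = observed - predicted
    return {
        "matches": len(matches),
        "false_positives": len(false_pos),
        "false_negatives": len(false_neg),
        # Assumed definition: fraction of observed interactions recovered.
        "sensitivity": len(matches) / len(observed) if observed else 0.0,
        # Assumed definition: fraction of predictions that were observed.
        "specificity": len(matches) / len(predicted) if predicted else 0.0,
    }


# Hypothetical predicted and observed interaction sets.
predicted = {("P1", "P2"), ("P1", "P3"), ("P2", "P3")}
observed = {("P2", "P1"), ("P3", "P4")}
stats = evaluate(predicted, observed)
```
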
Results
Future Work

Finish testing and comparing different
weight functions.
Gather statistics by running different
algorithms multiple times on data sets of
different sizes.
Test exact MSC vs. different weight
functions.