Presentation

Download Report

Transcript Presentation

Motif Discovery in Protein
Sequences using Messy
De Bruijn Graph
Mehmet Dalkilic and Rupali Patwardhan
Goal
The goal of this project is to develop an
algorithm that can take advantage of the
properties of De Bruijn graphs for
discovering motifs in protein sequences.
Outline of Presentation





Motivation and Background
Approach
Implementation
Applications
Future Work
Motivation



Most of the popular motif discovery
algorithms being used right now depend on
statistical significance to find the motif.
This project explores computational and
graph theoretic ways of doing the same
thing without using statistical significance.
Such an approach could drastically reduce
the time required to search for motifs.
What is a De Bruijn Graph?



De Bruijn Graph is a graph whose nodes are
sequences of symbols from some alphabet
and whose edges indicate the sequences
which might overlap.
The parameters are nodelength(n) and
overlap(k).
So if n=4 and k=3, an edge ACAT  CATS
represents the sequence 'ACATS'
Example



If we have a sequence ABCDEFG,
and we take nodelength=4 and
overlap=3,
we will can represent this same
sequence by the following De Bruijn
Graph
ABCDEFG
ABCD
Node Length = 4
Overlap = 3
BCDE
CDEF
DEFG
Applying this to Identify
Repeating Sub-sequences





If we have a bunch of sequences, we can go on
adding corresponding nodes and edges to our De
Bruijn graph.
If any sub-sequence is repeated, the corresponding
edge will already be present in that graph.
So we just increment the weight of that edge.
Eventually the edges corresponding to highly
repeated sequences will have higher weights.
Now we can find the motif by simply following the
graph along these edges with weights above a
specified threshold .
Example



Sequence 1:
PAKARCDEKD
Sequence 2:
ARCDEKHKH
Constructing the De Bruijn Graph for
these sequences …
 PAKARCDEKD
 ARCDEKHKH
PAKA
1
AKAR
1
KARC
1
ARCD
2
1
2
CDEK
DEKD
1
DEKH
1
1
EKHK
KHKH
RCDE
Making them Messy




In the context of protein sequences, some
amino acid residues can be substituted
without affecting the function of the protein.
So a sequence could be considered 'similar'
to an edge though its not exactly same.
Similarity is determined in the context of a
standard scoring matrix, such as
BLOSUM62.
In that case, we increment weights of all
edges that represent sequences that are
‘similar’ to the one in question.
Example




Consider the same 2 sequences as before,
but with K replaced by R in one of them.
PAKARCDERD
ARCDEKHKH
As per BLOSUM62, K and R have a positive
substitution score.
 PAKARCDERD
 ARCDEKHKH
PAKA
1
AKAR
1
KARC
1
ARCD
2
1
1.75
CDER
DERD
RCDE
1
CDEK
1
1
DEKH
1
EKHK
KHKH
Another Example
> Sequence 1
DMLKLCDKADDKMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDFKLCDKADDARTYVH
> Sequence 3
GTYYYCPGHKLCDEADDFFHVDDTE
> Sequence 4
LKLCDKANDYRPYYPITDPLMMNHI
> Sequence 5
GTYKPGHKLCDEADDFFHENDTEKYC
> Sequence 6
KLCDKADDYRPYYPITDPLGATAKHI
Another Example
> Sequence 1
DMLKLCDKADDKMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDFKLCDKADDARTYVH
> Sequence 3
GTYYYCPGHKLCDEADDFFHVDDTE
> Sequence 4
LKLCDKANDYRPYYPITDPLMMNHI
> Sequence 5
GTYKPGHKLCDEADDFFHENDTEKYC
> Sequence 6
KLCDKADDYRPYYPITDPLGATAKHI
Sample output …
http://biokdd.informatics.indiana.edu/r
patward/L519/project/ex1.html
http://biokdd.informatics.indiana.edu/rp
atward/L519/project/ttt.gif
Results




When 41 sequences belonging to
PS00021 family were given as input
The best motif output was YCRNPD
The Prosite Reg Ex for this family is
[FY]-C-R-N-P-[DNR].
http://biokdd.informatics.indiana.edu/r
patward/L519/project/PS00021_op.ht
ml
Possible Applications




To predict if a given protein sequence is
likely to belong to a particular protein family
or not.
To construct regular expressions for protein
families.
To fine-tune the results of clustering
algorithms, by helping to decide whether to
merge two clusters or not.
Do preprocessing to improve the
performance of other motif discovery
algorithms.
Limitation of this Approach



The motif should have at least 3 continuous
amino acid residues.
So the program runs into trouble if the motif
consists of alternate residues. For example,
something like AxAxCxDxAxGxC (x could be
any residue).
The problem is due to the need for
overlaps, which is inherent nature of De
Bruijn Graphs.
Future Work


We would like to integrate a machinelearning aspect to dynamically change
the node length and other parameters
to find the optimal motif.
We also want to try to extend this
approach to do clustering itself.
Link to the Implementation
http://biokdd.informatics.indiana.edu/rp
atward/L519/project.html
Acknowledgement

I would like to thank Dr. Mehmet
Dalkilic for his ideas and support.