Protein Sequences - Indiana University

Download Report

Transcript Protein Sequences - Indiana University

Motif Discovery in
Protein Sequences using
Messy de Bruijn Graph
Rupali Patwardhan
Advisors:
Dr. Mehmet Dalkilic
Dr. Haixu Tang
April 27, 2005
Rupali Patwardhan, Capstone Presentation
1
Outline of Presentation





Goal
Background and Motivation
Approach
Results
Future Work
April 27, 2005
Rupali Patwardhan, Capstone Presentation
2
Goal
To develop an algorithm that can
take advantage of the properties of
de Bruijn graph to discover motifs in
protein sequences
April 27, 2005
Rupali Patwardhan, Capstone Presentation
3
What is a motif ?

A repeating pattern
• VSKLIPKNRLMISTEWRSLGQQSPGWMHYMP
• VMLPKDIAKLVPKTHLMSTEWRNRLGVQQSQG
• SGVPRLLTASREWRNLGEPFIDQIHYSPRYAD
• YRHVMLPKAMSTEWRSLGLKNPETGTLRILQE
• GLGITQSLGWSREWRHTLGEPHILLFKREKDYQ
April 27, 2005
Rupali Patwardhan, Capstone Presentation
4
Why are motifs interesting ?



They represent regions that have been
conserved through evolution
So those regions are likely to be important
for the function of the protein (e.g. an
active site)
Motifs can be used to classify proteins into
families based on their functions, or predict
the function of a new protein
April 27, 2005
Rupali Patwardhan, Capstone Presentation
5
PS00059
Zinc-containing alcohol dehydrogenases signature
G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]
H is a zinc ligand
April 27, 2005
Rupali Patwardhan, Capstone
Presentation
6
Motif Discovery Algorithms


There are two main categories
Stochastic Algorithms
– Based on Statistical Significance
e.g. MEME, GIBBS

Combinatorial Algorithms
– Based on Enumeration
e.g. PRATT, SPLASH
April 27, 2005
Rupali Patwardhan, Capstone Presentation
7
Then why one more ?

Existing algorithms
– Are too slow or computationally
expensive for massive inputs (e.g. MEME)
– Do not handle gapped motifs effectively
– Need the length/number of the motifs to
be specified in advance
April 27, 2005
Rupali Patwardhan, Capstone Presentation
8
What is a de Bruijn Graph?
A graph whose nodes are
subsequences of same length (ltuples) and whose edges indicate the
subsequences of the two connected
nodes overlap
 E.g. An edge ACAT  CATS represents
the sequence “ACATS”

April 27, 2005
Rupali Patwardhan, Capstone Presentation
9
ABCDEFG
ABCD
April 27, 2005
BCDE
CDEF
Rupali Patwardhan, Capstone Presentation
DEFG
10
Applying this to Identify
Repeating Subsequences





If we have a set of sequences, we can go on adding
corresponding nodes and edges to our de Bruijn
graph.
If any sub-sequence is repeated, the corresponding
edge will already be present in that graph.
So we just increment the weight of that edge.
Eventually the edges corresponding to highly
repeated sequences will have higher weights.
Now we can find the motif by simply following the
graph along these edges with weights above a
specified threshold .
April 27, 2005
Rupali Patwardhan, Capstone Presentation
11
1. PAKARCDEKD
PAKA
1
AKAR
1
KARC
1
ARCD
1
1
DEKD
April 27, 2005
1
CDEK
Rupali Patwardhan, Capstone Presentation
RCDE
12
1. PAKARCDEKD
2. NARCDEKHKH
NARC
1
PAKA
1
AKAR
1
KARC
1
ARCD
2
1
2
CDEK
DEKD
RCDE
1
DEKH
April 27, 2005
1
1
EKHK
KHKH
Rupali Patwardhan, Capstone Presentation
13
1. PAKARCDEKD
2. NARCDEKHKH
NARC
1
PAKA
1
AKAR
1
KARC
1
ARCD
2
1
2
CDEK
DEKD
RCDE
1
DEKH
April 27, 2005
1
1
EKHK
KHKH
Rupali Patwardhan, Capstone Presentation
14
Making them Messy




In the context of protein sequences, some amino
acid residues can be substituted by some others
without affecting the function of the protein.
So a sequence could be considered 'similar' to an
edge even though its not identical.
Similarity between amino acid residues is
determined using standard scoring matrices, such
as BLOSUM62.
In that case, we increment weights of all edges that
represent sequences that are ‘similar’ to the one in
question.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
15
Example




Consider the same 2 sequences as before,
but with K replaced by R in one of them.
PAKARCDERD
NARCDEKHKH
As per BLOSUM62, K  R substitution has a
positive substitution score.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
16
 PAKARCDERD
 NARCDEKHKH
NARC
1
PAKA
1
AKAR
1
KARC
1
ARCD
2
1
1
CDER
DERD
RCDE
1
CDEK
April 27, 2005
1
1
DEKH
1
EKHK
Rupali Patwardhan, Capstone Presentation
KHKH
17
 PAKARCDERD
 NARCDEKHKH
NARC
1
PAKA
1
AKAR
1
KARC
1
ARCD
2
1
1.4
CDER
DERD
RCDE
1.4
CDEK
April 27, 2005
1
1
DEKH
1
EKHK
Rupali Patwardhan, Capstone Presentation
KHKH
18
Adjusting the weights to
account for messiness

Suppose edge A is under consideration, and
edges B and C originating from the same
node as A are similar to A.
WA’  WA + WB*s(A,B) + WC*s(A,C)
April 27, 2005
Rupali Patwardhan, Capstone Presentation
19
Limitation of this Approach




The motif should have at least a few
continuous amino acid residues
So the method may fail if the motif consists
of alternate residues
E.g. AxAxCxDxAxGxC (x could be any
residue) or AxCDxGxRGxC, since these
motifs would not lead to high-weight edges
in the de Bruijn graph
The problem is due to the need for
overlaps, which is inherent nature of de
Bruijn Graphs
April 27, 2005
Rupali Patwardhan, Capstone Presentation
20
Gapped Version



For each node, we also create nodes
obtained by applying a gap mask (or
“Dont care” mask) on that node
We currently restrict the maximal
number of “Dont cares” in a node to 2
There are 10 such masks
April 27, 2005
Rupali Patwardhan, Capstone Presentation
21
Gapped Version


Let ‘1’ represent a conserved amino
acid and ‘0’ represent a gap or “Don’t
care”
Then the 10 masks can be
represented as:
1111, 0111, 1110, 1011, 1101, 1100,
0011, 1001, 0110, 1010, 0101
April 27, 2005
Rupali Patwardhan, Capstone Presentation
22
Masking Example




If ANCD is the node that we are
applying the mask to
ANCD * 1001 = AxxD
ANCD * 1101 = ANxD
ANCD * 1011 = AxCD
April 27, 2005
Rupali Patwardhan, Capstone Presentation
23
1. ….ARCDM…
2. ….ANCDE…
3. ….ASCDT…
ARCD
ANCD
ASCD
April 27, 2005
1
1
1
RCDM
NCDE
SCDT
Rupali Patwardhan, Capstone Presentation
24
1. ….ARCDM…
2. ….ANCDE…
3. ….ASCDT…
AxCD
AxCD
AxCD
April 27, 2005
1
1
1
xCDx
xCDx
xCDx
Rupali Patwardhan, Capstone Presentation
25
1. ….ARCDM…
2. ….ANCDE…
3. ….ASCDT…
AxCD
AxCD
AxCD
April 27, 2005
xCDx
3
xCDx
xCDx
Rupali Patwardhan, Capstone Presentation
26
AxCDxxGH
ANCD
NCDE
CDEF
DEFG
EFGH
AxCD
NxDE
CxEF
DxFG
ExGH
ANxD
NCxE
CDxF
DExG
EFxH
ANxx
NCxx
CDxx
DExx
EFxx
xxCD
xxDE
xxEF
xxFG
xxGH
AxxD
NxxE
CxxF
DxxG
ExxH
xNCx
xCDx
xDEx
xEFx
xFGx
AxCx
NxDx
CxEx
DxFx
ExGx
xNxD
xCxE
xDxF
xExG
xFxH
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Implementation



The algorithm is implemented in Perl
Web Interface
http://biokdd.informatics.indiana.edu/r
patward/deBruijn/project.html
April 27, 2005
Rupali Patwardhan, Capstone Presentation
28
Issues in Testing Motif
Discovery Algorithms



No Benchmarking dataset
Difficult to compare different
algorithms since they have very
different kinds of parameters.
Some motifs are easier to find than
others.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
29
Test I


First 100 PROSITE patterns and their
corresponding protein families were
used as the test dataset to test the
accuracy of the output.
The output of the program was
compared to MEME and PRATT.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
30
Results I
Program
No. of Families for which motif
matched PROSITE pattern
MDMD
62
gMDMD
75
MEME
80
PRATT
61
For MEME and PRATT, the top 3 motifs were considered.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
31
Test II

We also tested families corresponding
to 162 PROSITE patterns that did not
have any continuous conserved amino
acid residues, but had at least one
occurrence of alternate conserved
amino acid residues.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
32
Results II
Rank of
motif
MEME
PRATT
MDMD
gMDMD
1
2
3
4
5
Total
50
26
10
6
10
102
77
13
4
6
5
105
57
22
17
14
5
115
68
26
12
12
10
128
April 27, 2005
Rupali Patwardhan, Capstone Presentation
33
MEME was run on IBM SP cluster on 8 processors in parallel
April 27, 2005
Rupali Patwardhan, Capstone Presentation
34
Future Work



Categorizing easy and difficult motifs.
Extending this approach to consensusbased multiple sequence alignment.
Predicting if a given protein sequence
is likely to belong to a particular family
or not.
April 27, 2005
Rupali Patwardhan, Capstone Presentation
35
Acknowledgements




Dr. Mehmet Dalkilic
Dr. Haixu Tang
Dr. Sun Kim
Bioinformatics Research Group
April 27, 2005
Rupali Patwardhan, Capstone Presentation
36