Applications Enablement Team Highlights
CyberBridges
Protein Pattern Discovery
Tom Milledge
Giri Narasimhan
Bioinformatics Research Group (BioRG)
School of Computing and Information Sciences, FIU
Protein Pattern Discovery: Introduction
Goals:
Implement unsupervised pattern discovery tools for
protein structure data by using the geometric hashing
technique
Create database of protein structure patterns
Create multiple 3-D structural alignments
Identify functional regions in proteins.
Molecular Biology Primer
DNA (Gene 1, Gene 2, Gene 3) → RNA → Protein
Proteins: Hemoglobin, Immunoglobulin, Keratin, Melanin, Insulin, etc.
Where does protein structure information come from?
Residue  ID   X        Y        Z
ALA      215  10.286   -9.534   3.009
VAL      217  10.591   -8.669   9.309
GLY      219  10.969   -6.429   15.074
GLY      221  16.183   -4.834   17.494
GLY      224  15.794   -7.389   13.12
LEU      231  17.491   -9.149   2.799
VAL      238  11.748   -5.098   3.394
GLU      242  7.443    -2.841   15.69
PDB (protein data bank): a repository of 3-D protein structures
Representing substructures as triangles
Largest common substructure (many linked triangles) in query and target proteins
One triangle (3 atoms):
Length1  Length2  Length3  ID1  ID2  ID3
9.5      7.05     7.01     217  231  238
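As a rough illustration of where one triangle record comes from, the sketch below computes the three side lengths for atoms 217, 231, and 238 from the PDB coordinates listed earlier in this transcript. The names `ATOMS` and `side_lengths` are hypothetical, and the pairing of sides to Length1, Length2, and Length3 is an assumption chosen because it reproduces the values on the slide.

```python
import math

# Coordinates copied from the PDB listing on the earlier slide
# (atom id -> (x, y, z), in Angstroms).
ATOMS = {
    217: (10.591, -8.669, 9.309),
    231: (17.491, -9.149, 2.799),
    238: (11.748, -5.098, 3.394),
}


def side_lengths(ids, atoms=ATOMS):
    """Return the three Euclidean side lengths of the triangle (a, b, c).

    Order of the result: |ab|, |bc|, |ac|. This ordering is an assumption.
    """
    a, b, c = (atoms[i] for i in ids)
    return (math.dist(a, b), math.dist(b, c), math.dist(a, c))


lengths = side_lengths((217, 231, 238))
# Rounded to two decimals this gives about 9.50, 7.05 and 7.01.
print([round(v, 2) for v in lengths])
```

Only distances between atoms are stored, so the record does not depend on how the protein is positioned or rotated in space.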
Basic steps for triangle-based geometric hashing
Preprocessing phase
– Extract triangle information from
target (model) proteins and store
them in a hash table
Searching phase
– For any given query protein, find
the matching triangles in the hash
table
Extension phase
– Find the largest matching
substructures
Preprocessing phase: Create the hash table
Read PDB data
Extract triangles
Generate a hash key (based on the three lengths and binsize parameters) and enter the record into a hash table
Triangle side lengths: 7.01, 7.06, 9.49
Hash key: 035035047
Hash table record:
Key: 035035047
Atom id: 217
Atom id: 231
Atom id: 238
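The slides do not give the binsize or the exact key layout, so the following is only a sketch under stated assumptions: a bin size of 0.2 Angstrom, side lengths sorted in ascending order, and each bin index zero-padded to three digits. Under those assumptions the example triangle yields the key shown on the slide. The function names are hypothetical.

```python
def hash_key(lengths, binsize=0.2, width=3):
    """Quantize three side lengths into a fixed-width string key.

    binsize=0.2 and width=3 are assumptions, not values stated in the slides.
    """
    bins = sorted(int(length / binsize) for length in lengths)
    return "".join(f"{b:0{width}d}" for b in bins)


def add_triangle(table, lengths, atom_ids):
    """Store the three atom ids under the key derived from the side lengths."""
    table.setdefault(hash_key(lengths), []).append(tuple(atom_ids))


hash_table = {}
add_triangle(hash_table, (7.01, 7.06, 9.49), (217, 231, 238))
print(hash_table)  # {'035035047': [(217, 231, 238)]}
```

Sorting the bins makes the key independent of the order in which the three atoms are visited, and the bin size controls how much geometric variation still counts as a match.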
Search phase: finding the matches
1. Decompose
At the beginning of the search, the query protein is decomposed into triangles, with the attribute information stored in a separate table. This data is accessed via the atom id foreign keys stored in the hash table.
Query Atom Attributes Array
Key: Atom id
Value: XYZ coord./attribs
Query Triangle Hash table
Key: triangle side lengths
Value: 3 Atom ids
2. Initial Search
The hash table containing all the target triangles is split across the cluster nodes. The query protein data is then copied to all nodes.
The initial search entails matching the query protein triangles with the database of (target) triangles. The results are added to a new record in the hits table. The results table includes the query atom IDs for the substructure building phase.
Target Triangle Hash table
Key: triangle side lengths
Value: 3 Atom ids
Target Atom Attributes Array
Key: Atom id
Value: XYZ coord./attribs
Hits table
Key: TSLs/AtIds
Value: 3 Atom ids
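The initial search can be pictured as a plain dictionary lookup. The sketch below runs on a single machine and ignores the split across cluster nodes. The key scheme (0.2 Angstrom bins, sorted, three digits each) is an assumption, and the query atom ids and all side lengths except the 217/231/238 triangle are made-up illustration values.

```python
from collections import defaultdict


def hash_key(lengths, binsize=0.2, width=3):
    """Assumed key scheme: sorted bin indices, zero-padded and concatenated."""
    bins = sorted(int(length / binsize) for length in lengths)
    return "".join(f"{b:0{width}d}" for b in bins)


def build_table(triangles):
    """Map key -> list of atom-id triples for (atom_ids, lengths) records."""
    table = defaultdict(list)
    for atom_ids, lengths in triangles:
        table[hash_key(lengths)].append(atom_ids)
    return table


def initial_search(query_triangles, target_table):
    """Return one hit (key, query_ids, target_ids) per matching triangle pair."""
    hits = []
    for query_ids, lengths in query_triangles:
        key = hash_key(lengths)
        for target_ids in target_table.get(key, []):
            hits.append((key, query_ids, target_ids))
    return hits


target_table = build_table([
    ((217, 231, 238), (7.01, 7.06, 9.49)),   # triangle from the slides
    ((215, 219, 221), (6.25, 8.45, 11.05)),  # hypothetical side lengths
])
query_triangles = [
    ((12, 30, 41), (9.45, 7.02, 7.10)),      # hypothetical query triangle
    ((5, 6, 7), (3.05, 4.05, 5.05)),         # hypothetical, no match expected
]
hits = initial_search(query_triangles, target_table)
print(hits)
```

Each hit keeps both the query and the target atom ids, which is what the later substructure building phase needs.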
Extension phase: building the substructures
A list of triangle hits
Every vertex of the tree
is a triangle
Build an adjacency structure
Use a graph searching algorithm to find larger substructures
Measure structural similarity (RMSD*) between every substructure in the query protein and every substructure in the model protein
Output common
substructure pairs
*RMSD: root mean square distance
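The slides do not say how adjacency is defined or which graph search is used. As a minimal sketch, assume two triangle hits are adjacent when they share an edge (two atoms) in both the query and the target, and use a breadth-first search to collect the largest connected group of hits. All atom ids below are invented for illustration.

```python
from collections import deque


def shares_edge(tri_a, tri_b):
    """True if two triangles have at least two atom ids in common."""
    return len(set(tri_a) & set(tri_b)) >= 2


def build_adjacency(hits):
    """hits: list of (query_ids, target_ids). Returns index -> neighbor list."""
    adj = {i: [] for i in range(len(hits))}
    for i in range(len(hits)):
        for j in range(i + 1, len(hits)):
            (q1, t1), (q2, t2) = hits[i], hits[j]
            if shares_edge(q1, q2) and shares_edge(t1, t2):
                adj[i].append(j)
                adj[j].append(i)
    return adj


def largest_component(adj):
    """Breadth-first search over all vertices; return the biggest component."""
    seen, best = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return sorted(best)


hits = [
    ((1, 2, 3), (11, 12, 13)),
    ((2, 3, 4), (12, 13, 14)),
    ((3, 4, 5), (13, 14, 15)),
    ((7, 8, 9), (21, 22, 23)),  # isolated hit, not part of the substructure
]
component = largest_component(build_adjacency(hits))
query_atoms = sorted({a for i in component for a in hits[i][0]})
print(component, query_atoms)
```

A real implementation would also score each candidate substructure pair by RMSD before reporting it, as the steps above describe.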
Case study: Dehydrogenase superfamily
1B3R: Hydrolase (Rat)
1CJC: Reductase (Cow)
1CF2: Dehydrogenase (Bacteria)
Dehydrogenases: Shared structural element
Reoccurring substructure in 1B3R, 1CJC, and 1CF2
Dehydrogenases: building the common substructure
Triangle from query protein (green) matches triangle from target protein (pink)
Other overlapping triangle matches are extended from the initial triangle to find the largest common substructure
RMSD is measured at each extension step to ensure validity of the larger match
RMSD (Root Mean Square distance) less than 1.0 Angstrom indicates a good match
RMSD: 0.32 Angstroms
RMSD: 0.66 Angstroms
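The check at each extension step might look like the sketch below. The slides do not say which RMSD variant was used. To stay dependency-free, this example swaps in a distance-based RMSD that compares pairwise intra-structure distances and needs no superposition. A coordinate RMSD after optimal superposition is the more common choice and could be substituted. The 1.0 Angstrom cutoff is taken from the slide; the function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(query_pts, target_pts):
    """Root mean square difference between corresponding pairwise distances."""
    pairs = list(combinations(range(len(query_pts)), 2))
    total = sum(
        (math.dist(query_pts[i], query_pts[j])
         - math.dist(target_pts[i], target_pts[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))


def try_extend(query_pts, target_pts, new_q, new_t, cutoff=1.0):
    """Add one matched atom pair only if the larger match stays under cutoff.

    Returns (query_pts, target_pts, accepted).
    """
    q, t = query_pts + [new_q], target_pts + [new_t]
    if distance_rmsd(q, t) < cutoff:
        return q, t, True
    return query_pts, target_pts, False


# Made-up coordinates: the target is the query shifted by (10, 10, 10).
query = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
target = [(10.0, 10.0, 10.0), (13.0, 10.0, 10.0), (10.0, 14.0, 10.0)]

query, target, ok_good = try_extend(query, target, (0.0, 0.0, 5.0), (10.0, 10.0, 15.0))
query, target, ok_bad = try_extend(query, target, (1.0, 1.0, 1.0), (10.0, 10.0, 40.0))
print(ok_good, ok_bad, len(query))
```

Rejecting an extension leaves the current match untouched, so a poor candidate atom pair never degrades a substructure that has already been accepted.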
Results: Zinc finger protein family
DNA-binding substructure: 10 positions, RMSD: 0.46 angstroms
Zinc-binding substructure: 4 positions, RMSD: 0.35 angstroms
Conclusions and Future Work
Geometric hashing of proteins shows promise as an
important technique with a very good fit to many parallel
architectures. Areas of future work include:
Molecular Docking: Identify potential drugs that are
least likely to cause side-effects.
Function prediction: Create a database of conserved
substructures that indicate a specific protein function.
Structure prediction: Use sequence patterns with structural templates to predict the structure of new sequences.