Fast Search Protein Structure Prediction Algorithm
for Almost Perfect Matches
By
Jayakumar Rudhrasenan
S3047315
Primary Supervisor: Prof. Heiko Schroder
Secondary Supervisor: Dr. Margaret Hamilton
Introduction
Bio-Informatics
What is Bio-Informatics?
Bio-Informatics is the science of developing computer databases and algorithms
to facilitate biological research, especially in the area of genomics.
Genomics is the study of genes and their functions.
Background - Protein Structure
How can we find the structure of a protein?
• X-ray Crystallography
• NMR Spectroscopy
[Figure: amino acid chain and protein structure, with the Phi and Psi angles labelled]
Where does Computer Science come into it?
Limitations of traditional lab-work
• Expensive
Finding the structure through these methods is costly.
• Time Consuming
It takes 6 to 12 months to determine the structure of a
single protein.
Reasons:
• Some proteins don’t crystallise
• Some don’t give good diffraction patterns
• All proteins are fragile and difficult to handle.
Methods Available
This problem is being tackled in many ways.
The methods fall into two broad groups:
• ab initio
• Homology modelling
What is homology modelling?

Homology modelling works on the principle that although each
protein adopts a unique structure, there are only ~2,000 common
folds among the various super families identified thus far.

If two protein sequences are aligned and their percentage similarity
is above the ‘twilight zone’ (about 20%), we can conclude that the
sequences are homologous, that is, they share a common ancestry.
Below this zone it is not possible to say whether the identical amino
acid residues are evolutionarily linked or have arisen by chance.
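
The twilight-zone test above can be sketched in a few lines. This is only an illustration, not the method used in this work: it assumes the two sequences are already aligned to equal length, it uses percent identity as a simple stand-in for "percentage similarity", and the names percent_identity, likely_homologous and TWILIGHT_ZONE are hypothetical.

```python
TWILIGHT_ZONE = 20.0  # percent; threshold quoted on the slide


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of aligned, non-gap positions holding the same amino acid."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def likely_homologous(aligned_a, aligned_b):
    """True when the identity lies above the twilight zone (assumed 20%)."""
    return percent_identity(aligned_a, aligned_b) > TWILIGHT_ZONE


print(percent_identity("arnd", "arne"))  # 3 of 4 positions agree
```

A result below the threshold does not show that two proteins are unrelated; it only means the alignment alone cannot decide.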
What is Protein Structure Prediction?
In its most general form, it is the prediction of the relative
position of each amino acid in the protein structure, using the
known structural details of other proteins.
Why predict protein structure?
• The sequence structure gap
– 750 000 known sequences, 17 000 known structures
• Structural knowledge brings understanding of function and mechanism
of action
• Can help in prediction of function
• Predicted structures can be used in structure based drug design
• It can help us understand the effects of mutations on structure or
function
• It is a very interesting scientific problem
– still unsolved in its most general form after more than 20 years of
effort
Protein Structure Prediction Algorithm
Protein sequence for which the structure is unknown, scanned with a window:
n f s b c a r . . . . .
Protein Database:
a r n d c q e g h i l k m n f s s d
e g h i l n f s e a r l k s p q g a
n h e . . . . . . . . . . .
Window size = 3. It can also be implemented with a window size of 5, 7 or 9. With a window size of 9, we look for
almost perfect matches, as we won't get a perfect match with the database we have.
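
The window search on this slide can be sketched as a brute-force scan. This is a minimal illustration that looks for exact window matches only; the function names and the database keys "db1" and "db2" are made up for the example, and the sequences are the fragments shown on the slide.

```python
def windows(sequence, size):
    """All contiguous sub sequences of the given size, in order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def find_matches(query, database, size=3):
    """Map each query window position to the (protein, position) pairs it matches."""
    hits = {}
    for q_pos, q_win in enumerate(windows(query, size)):
        for name, seq in database.items():
            for d_pos, d_win in enumerate(windows(seq, size)):
                if q_win == d_win:  # exact match only in this sketch
                    hits.setdefault(q_pos, []).append((name, d_pos))
    return hits


database = {
    "db1": "arndcqeghilkmnfssd",  # hypothetical names for the slide fragments
    "db2": "eghilnfsearlkspqga",
}
print(find_matches("nfsbcar", database))
```

With a window size of 3, only the first window of the unknown sequence, "n f s", occurs in both database fragments.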
Algorithm – continued..
[Figure: Phi graph and Psi graph, each plotting Number of Occurrences]
Limitations of this algorithm
Time Consuming
Time taken to predict the structure of a protein: 2 hr PC time
Time taken to predict the structure of 20,000 proteins: 2 x 20,000 = 40,000 hrs PC time
Why does it take time?
Each sub sequence of the unknown protein is compared
with all the sub sequences of the proteins in the database.
With a window size of 9, the number of sub sequences in the
database will be around 2 million.
So, there will be 2 million comparisons for each sub
sequence in the unknown protein.
“Unknown protein” here means a protein whose
sequence is known but whose structure is not.
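
The cost argument above can be worked through as simple arithmetic. The sketch below assumes one comparison per pair of windows; the protein length of 300 residues is a made-up value for illustration, and the function names are hypothetical.

```python
def window_count(length, size):
    """Number of sub sequences of the given size in a sequence of this length."""
    return max(length - size + 1, 0)


def total_comparisons(query_length, window_size, database_windows):
    """Every query window is compared with every database window."""
    return window_count(query_length, window_size) * database_windows


# 300 is an assumed protein length; 2 million is the figure from the slide.
print(total_comparisons(300, 9, 2_000_000))
```

For these assumed numbers there are 292 windows in the unknown protein, so the brute-force search makes 584 million window comparisons for a single protein.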
Fast Search Protein Structure Prediction Algorithm
for Almost Perfect Matches
• Arrange the sub sequences so that each one is at a Hamming
distance of one from its neighbour.
What is Hamming distance?
The number of disagreeing bits between two binary vectors.
It is used as a measure of dissimilarity.
E.g. 1000011
1000001
These two binary numbers differ by one bit.
A Hamming distance of one here means that each sub
sequence differs from the one next to it by just one
amino acid.
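
As a small illustration of the definition above, the Hamming distance of two equal-length strings can be computed by counting the positions where they disagree. The function name is chosen for this example; the same function works for bit strings and for amino acid sub sequences.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("1000011", "1000001"))  # the example pair from the slide
```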
Continued…
• Maintain a table which stores, for each row, the hop index: the
row number to jump to on a mismatch. For example:
Row number   Sub Sequence   Jump to row number
1023         111110000      1027
1024         111110001
1025         111110002
1026         111110003
1027         111110013      1031
1028         111110012
1029         111110011
1030         111110010
1031         111110020      1035
.
.
.
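
The ordering in the example rows resembles a reflected Gray code over a four-symbol alphabet, in which each row differs from the next in exactly one symbol. The sketch below generates such an ordering and a jump column. It is an interpretation of the example, not a confirmed description of the algorithm: the names reflected_gray and jump_rows are hypothetical, and the jump rule (skip to the first row whose leading symbols differ) is an assumption read off the jumps 1023 to 1027 and 1031 to 1035.

```python
def reflected_gray(alphabet, length):
    """List all strings of this length so that neighbours differ in one symbol."""
    if length == 0:
        return [""]
    shorter = reflected_gray(alphabet, length - 1)
    ordered = []
    for i, prefix in enumerate(shorter):
        # Sweep the last symbol forwards, then backwards, alternately, so the
        # step into the next prefix block changes only one symbol.
        symbols = alphabet if i % 2 == 0 else alphabet[::-1]
        ordered.extend(prefix + s for s in symbols)
    return ordered


def jump_rows(rows, block):
    """Assumed jump rule: index of the first row of the next block of rows."""
    return [(i // block + 1) * block for i in range(len(rows))]


rows = reflected_gray("0123", 2)
print(rows[:9])
print(jump_rows(rows, block=4)[:9])
```

The first nine rows end in the same two symbols as rows 1023 to 1031 of the example, and the jump from the first row skips four rows, as the jump from row 1023 to row 1027 does. A real table would use the 20 amino acid letters and a window of 9, and would hold only the sub sequences that occur in the database.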