Genetic threading (Power point)

Download Report

Transcript Genetic threading (Power point)

Genetic Threading
By J.Yadgari and A.Amir
Published: special issue on Bioinformatics in
Journal of Constraints, June 2001
Alexandre Tchourbanov
University of Nebraska at Lincoln
CSCE 421-821
December 4, 2001
Structure of the presentation
• Introduction to protein native structure
• Methods of finding a native structure
Physical
Computational
Common methods and principles
Protein threading method
Protein threading using genetic
approach
Problem of protein structure
prediction
• Proteins are key molecules in all life
processes
• The function of a protein directly related to
its three dimensional structure
• Knowing and understanding the structure of
proteins will have a tremendous impact on
understanding of biological processes,
medical discoveries, and biotechnological
inventions
Problem of protein structure
preduction
• Given a sequence of amino acids, predict
the unique 3D folding of molecule
minimizing its free energy
1
Lys
Gly
Leu
Primary
structure
2
3
Computational
Methods of prediction
Physical
methods of prediction
Practical
use of the
3D structural
knowledge
Protein structure
• A protein is built up from a chain of amino
acids linked by peptide bonds
• There are 20 amino acids that can be
divided into several classes based on size
and other chemical and physical properties
• Depending on type of a residue, protein
could be either hydrophilic (water loving) or
hydrophobic (water hating)
General structure of an amino acid
• Each amino acid consists of:
1. Common main chain part, containing the heavy
atoms N, C, O, C forming amide plane
2. Chain residue of size 0 – 10 additional atoms
Common
part

Chain
residue
Peptide bond
• Peptide bond connects carboxyl group of the
first amino acid with amino group of the
second acid
• Peptide bonds are planar and rigid


Sequence of amino acids
• Sequence of amino acids, connected by
peptide bonds, form protein
• There is no flexibility for rotation around
peptide bond
• There is more flexibility for protein to rotate
around N-C-bond (called the -angle) and
around C-C-bond (-angle)
• These angles are restricted to small regions
in natural proteins
Part of Protein (…|Phe|Asp|Ala|…)
Protein folding
• Using the freedom of rotations, the protein can
fold into a specific and unique three dimensional
structure (called conformation), forming a native
structure
Physical methods of determining
protein native structure
X-ray crystallography
Physical methods
NMR (Nuclear Magnetic Resonance)
•
•
•
•
X-ray crystallography requires significant amounts of
purified protein molecules (1014) to grow a crystal and
protein needs to crystallize
NMR method applicable to proteins of small and
average size, which do not crystallize
Both methods are expensive and give coherent results
on the same protein, proving to be correct
Structure of many important proteins is still unknown
Protein structure in X-ray
crystallography
• X-ray diffraction
pattern is recorded
and processed using
FFT to form electron
density map
• Regions of map with
the highest electron
density reveal the
location of atomic
nuclei
Family of structures in NMR
method
• Absorption of radio
frequency energy is
recorded as a 2D
spectrum
• Possible 3D
structures are
constructed by
computer according
to NMR signal
Computational methods to find a
protein structure
•
•
•
The unique 3D arrangement of protein
corresponds to lowest free energy
conformation
Most computational approaches for solving
the protein folding problem look for the
lowest free energy conformation
Two principal methods are currently in use for
computing the lowest energy conformation:
1. Molecular dynamics
2. Monte Carlo
Molecular dynamics
• Forces acting on each atom at a particular
state of the system are calculated using an
empirical force field
• Atoms allowed to move with accelerations
resulting from forces, changing conformation
• Once atom moved significantly, acting forces
are recalculated (every 10-15 sec)
• Even super computers can simulate only 10-9
sec of folding time, which is insufficient
Monte Carlo method
• Used with simplified model of protein (does
not consider structure of every amino acid)
• Procedure makes random move from
current conformation and evaluates
resulting energy changes
• If new conformation is better, it replaces old
one with newly generated, and process
repeats
• Method is not powerful enough to find an
optimal conformation even for simple cases
Protein threading
•
•
Many proteins in nature are homologous, having
different primary structure, but forming the same
conformation to carry out the same functionality
in a living matter and having the same
evolutionary origin
Most protein share the secondary structure
motifs:
1. Helices
2. Extended strands forming sheets
3. Specific turns
4. Random coils
Protein threading
• Threading means mapping a given sequence
to a given structure
• To assign a structure to a sequence one
would then need to thread the sequence
through all known conformations,
evaluating compatibility, and assign the
most compatible structure to the sequence
• Upon discovery of completely different
structure from any known, enter it into
database of structures
Protein threading
• Structure is presented by
the black trace
• Sequence (at the top) is
threaded through the
structure, encoding an
alignment (at the bottom)
• Zero means structure
deletion, values greater
that one mean sequence
deletion, while one is a fit
Protein threading
• The size of the search space to thread
sequence of length k into structure of size n
could be found as a selection with repetition
 n  k  1 (n  1  k )!


 k  (n  1)!k!
• Search space is huge and problem appears to
be NP-complete [Unger,R., Moult,J. (1993)]
Protein threading
m-1 core regions
m loops (non-core)
 m  1  k  c  (m  1  k  c)!


k  c  (n  1)!(k  c)!

• In order to reduce complexity of search
task, (m –1) core and m non-core regions
are introduced
• Usually -helices and -sheets are core
regions, connected by loops
• Total number of amino acids in core
regions is c
Protein threading
• Although suffering from some inherent
limitations (such as prediction of the right
structure with completely wrong
threading), method became a significant
tool in protein structure prediction
• Any threading procedure must contain two
major components:
1. An alignment algorithm to position a
sequence on a structure
2. Score function to evaluate the “energy”
of the sequence in given conformation
Protein threading possible
implementations
• Protein threading could be implemented
using:
1. Enumeration for small problems,
2. Dynamic programming to find core regions
to “freeze”,
3. Monte Carlo variants with Gibbs sampling
4. Branch and bound search
• Genetic programming with constraints
seems to be a decent alternative in
comparison with other methods
Protein threading using genetic
programming
• Genetic Algorithms are parallel
computational tools that are based on the
principle of diversity and selection
• Solutions are represented as strings, for
example 11111100111311
• Sum of all terms in the string needs to be
equal to the number of amino acids in the
sequence, as well as length of the string
equal to the length of the structure
Protein threading using genetic
programming
• These strings are maintained as a population
that undergoes evolutionary process via
generic operators such as:
– Replication (copying of the string to the next
generation)
– Mutation (changing bits in the string)
– Crossover (concatenating a prefix of one string
with suffix of another)
• Energy function is a good candidate to
evaluate fit of an offspring
Energy function
• Energy functions are subject to
minimizations
• Energy functions are calculated by
extracting from the structural database
frequencies of interactions between pairs of
residues as a function of amino acids types
and distance
• Tendency of certain hydrophilic residues to
be on the surface can be approximated by
energy term related to the position
Implementing mutation
• An example of mutation could be
transformation of 1111100111311 into
11111100211211, which is also a valid
encoding
• We need to have validity check every time
we do mutation and compensate for
problems
• Reverting of substrings is especially
interesting mutation, since it does not
violate a valid structure of the solution
Implementing crossovers
Parent 1
1
1
2
0
1
1
2
0
1
1
1
1
1
1
1
1
1
0
0
1
1
1
3
1
1
Parent 2
1
1
1
Offsprings
1
1
2
0
1
1
2
0
1
1
1
3
1
1
1
1
1
1
1
1
0
0
1
1
1
1
1
1
Following issues were addressed
• The linear trade-off between population size
and the number of generations
• Optimal level of mutation rate
• Locality of mutation operator
• Locality of the crossover operator
• Regular mutations versus reverse mutations
• Magnitude of the mutation operation
• Quality control of the crossover operation
Results
• For author’s examples, the optimal
performance is achieved with population
size of 300 solutions and duration of 1000
generations
• The optimal rate of mutations is 0.25 to 0.3
of the populations
The minimal energy of threading
runs
The average energy of the
population during threading
Structural
comparisons
Difference between
sequence deletions and
structure deletions plots
Structural alignment
Most similar
threading alignment
Least similar
threading alignment
Maximal mutation magnitude
Average score of
5 runs after 600
generations
Average score of
5 runs after 2000
generations
Summary
• The running time of a GA depends linearly on the
number of solutions in the population (i.e.
population size) and also depends linearly on the
number of generations the process is repeated
• Genetic algorithms method is a feasible and
efficient approach to threading
• It is especially encouraging that the threading
alignments are quite similar, quantitatively, to the
structural alignments
Summary
• Changing the locality of the mutation and
crossover operation does not show a consistent
change in the performance of the algorithm
• Mutations of high magnitude are
counterproductive, probably because changes
between the template and the assigned structure do
not tend to concentrate in single position
• Using crossover under strict quality control was
shown not to be effective, since genetic
mechanism has quality control itself
Summary
• The success of the reverse mutation is quite
surprising and should be further explored
Future work
• Threading algorithms should be tested on
their ability to assign a conformation for
new and unknown sequence
• Authors plan to implement the genetic
algorithm in a complete threading package,
with all the necessary components and to
test it in a realistic prediction setup.