Evolutionary Algorithms & Protein Folding
Download
Report
Transcript Evolutionary Algorithms & Protein Folding
Evolutionary Algorithms for
the Protein Folding Problem
Giuseppe Nicosia
Department of Mathematics and Computer Science
University of Catania
DMI - Università di Catania
16/07/2015
1
Talk Outline
1. An overview of Evolutionary Algorithms
2. The Protein Folding Problem
3. Genetic Algorithms for the ab initio prediction
DMI - Università di Catania
16/07/2015
2
An overview of Evolutionary
Algorithms
EAs are optimization methods based on
an evolutionary metaphor that showed
effective in solving difficult problems.
DMI - Università di Catania
16/07/2015
3
Computational Intelligence and EAs
“Evolution is the natural way to program”
Thomas Ray 16/07/2015
DMI - Università di Catania
4
Evolutionary Algorithms
1. Set of candidate solutions (individuals): Population.
2. Generating candidates by:
– Reproduction: Copying an individual.
– Crossover: 2 parents 2 children.
– Mutation: 1 parent 1 child.
3. Quality measure of individuals: Fitness function.
4. Survival-of-the-fittest principle.
DMI - Università di Catania
16/07/2015
5
Main components of EAs
1. Representation of individuals: Coding.
2. Evaluation method for individuals: Fitness.
3. Initialization procedure for the 1st generation.
4. Definition of variation operators (mutation and crossover).
5. Parent (mating) selection mechanism.
6. Survivor (environmental) selection mechanism.
7. Technical parameters (e.g. mutation rates, population size).
DMI - Università di Catania
16/07/2015
6
Mutation and Crossover
EAs manipulate partial
solutions in their search for
the overall optimal solution .
These partial solutions or
`building blocks' correspond to
sub-strings of a trial solution - in
our case local sub-structures
within the overall conformation.
DMI - Università di Catania
16/07/2015
7
`Optimal' Parameter Tuning:
• Experimental tests.
• Adaptation based on measured quality.
• Self-adaptation based on evolution !
DMI - Università di Catania
16/07/2015
8
Constraint handling strategies
[Michalewicz, Evolutionary Computation, 4(1), 1996]
1. Repair strategy: whenever an unfeasible
solution is produced "repair" it, i.e. find a
feasible solution "close“ to the unfeasible
one;
2. Penalize strategy: admit unfeasible
individuale in the population, but penalize
them adding a suitable term to the energy.
DMI - Università di Catania
16/07/2015
9
The evolution Loop
DMI - Università di Catania
16/07/2015
10
Algorithm Outline
procedure EA; {
t = 0;
initialize population P(t);
evaluate P(t);
until (done) {
t = t + 1;
parent_selection P(t);
recombine P(t);
mutate P(t);
evaluate P(t);
survive P(t);
}
}
DMI - Università di Catania
16/07/2015
11
Example
DMI - Università di Catania
16/07/2015
12
Evolutionary Programming
• The individuals are real-valued vectors,
procedure EP; {
ordered lists, graphs.
t = 0;
initialize population P(t);
evaluate P(t);
until (done) {
t = t + 1;
parent_selection P(t);
mutate P(t);
evaluate P(t);
survive P(t);
}
}
• All N individuals are selected to be parents,
and then are mutated, producing N children.
These children are evaluated and N
survivors are chosen from the 2N individuals,
using a probabilistic function based on
fitness (individuals with a greater fitness
have a higher chance of survival).
• Mutation is based on the representation
used, and is often adaptive. For example,
when using a real-valued vector, each
variable within an individual may have an
adaptive mutation rate that is normally
distributed.
• No Recombination.
DMI - Università di Catania
16/07/2015
13
Evolution Strategies
procedure ES; {
t = 0;
initialize population P(t);
•
ES typically use real-valued vector.
•
Individuals are selected uniformly
randomly to be parents.
•
Pairs of parents produces children via
recombination. The number of children
created is greater than N.
•
Survival is deterministic:
evaluate P(t);
until (done) {
t = t + 1;
1. ES allows the N best children to
survive, and replaces the parents
with these children.
parent_selection P(t);
recombine P(t)
2. ES allows the N best children and
parents to survive.
mutate P(t);
evaluate P(t);
•
Like EP, adapting mutation.
survive P(t);
•
Unlike EP, recombination does play an
important role in ES, especially in
adapting mutation.
16/07/2015
}
}
DMI - Università di Catania
14
Genetic Algorithms
procedure GA; {
• GAs traditionally use a more domain
independent representation, namely, bit-strings.
t = 0;
• Parents are selected according to a
probabilistic function based on relative fitness.
initialize population P(t);
evaluate P(t);
• N children are created via recombination from
the N parents.
until (done) {
t = t + 1;
parent_selection P(t);
recombine P(t)
mutate P(t);
evaluate P(t);
survive P(t);
}
}
• The N children are mutated and survive,
replacing the N parents in the population.
• Emphasis on mutation and crossover is
opposite to that in EP.
• Mutation flips bits with some small probability
(background operator).
• Recombination, on the other hand, is
emphasized as the primary search operator.
DMI - Università di Catania
16/07/2015
15
Genetic Programming 1/2
There are 5 major preparatory steps in using GP for a particular problem.
1) selection of the set of terminals (e.g., the actual variables of the
problem, zero-argument functions, and random constants, if any)
2) selection of the set of functions
3) identication of the evaluation function
4) selection of parameters of the system for controlling the run
5) selection of the termination condition.
Each tree (program) is composed of functions and terminals appropriate to
the particular problem domain; the set of all functions and terminals is
selected a priori in such a way that some of the composed trees yield a
solution.
The initial pop is composed of such trees. The evaluation function assigns a
fitness value which evaluates the performance of a tree. The evaluation is
based on a preselected set of test cases,a fitness cases; in general, the
fitness function returns the sum of distances between the correct and
16/07/2015
obtained results on all test DMI
cases.
16
- Università di Catania
Genetic Programming 2/2
procedure GP; {
t = 0;
initialize population P(t); /* randomly create an initial pop of individuals
computer program */
evaluate P(t); /* execute each program in the pop and assign it a fitness value */
until (done) {
t = t + 1;
parent_selection P(t); /* select one or two program(s) with a probability
based on fitness (with reselection allowed) */
create P(t); /* create new programs for the pop by applying the
following ops with specified probability */
reproduction; /* Copy the selected program to the new pop */
crossover; /* create new offspring programs for the new pop by
recombining randomly chosen parts from 2 selected prgs*/
mutation; /* Create one new offspring program for the new pop by randomly
mutating a randomly chosen part of one selected program. */
Architecture-altering ops; /* Choose an architecture-altering operation
from the available repertoire of such op. and create
one new offspring program for the new pop by
applying the chosen architecture-altering op. to the
one selected prg */
16/07/2015
}
}
DMI - Università di Catania
17
Scaling
Suppose one has two search spaces. The first is
described with a real-valued fitness function F. The
second search space is described by a fitness function
G that is equivalent to F p , where p is some constant.
The relative positions of peaks and valleys in the two
search spaces correspond exactly. Only the relative
heights differ (i.e., the vertical scale is different).
Should our EA search both spaces in the same
manner?
DMI - Università di Catania
16/07/2015
18
Ranking selection (ES, EP)
If we believe that the EA should search the two spaces in the
same manner, then selection should only be based on the relative
ordering of fitnesses, only the rank of individuals is of importance.
ES
•
Parent selection is performed uniformly randomly, with no regard
to fitness.
•
Survival simply saves the N best individuals, which is only based
on the relative ordering of fitnesses.
EP
•
All individuals are selected to be parents. Each parent is mutated
once, producing N children.
•
A probabilistic ranking mechanism chooses the N best
individuals for survival, from the union of the parents and
children.
16/07/2015
DMI - Università di Catania
19
Probabilistic selection mechanism GA
Many people, in the GA community, believe that F and G
should be searched differently. Fitness proportional
selection is the probabilistic selection mechanism of the
traditional GA.
•
Parent selection is performed based on how fit an
individual is with respect to the population average.
For example, an individual with fitness twice the
population average will tend to have twice as many
children as average individuals.
•
Survival, though, is not based on fitness, since the
parents are automatically replaced by the children.
DMI - Università di Catania
16/07/2015
20
Lacking the killer instinct
One problem with this latter approach is that, as
the search continues, more and more
individuals receive fitnesses with small
relative differences. This lessens the selection
pressure, slowing the progress of the search.
This effect, often referred to as "lacking the killer
instinct", can be compensated somewhat by
scaling mechanisms, that attempt to magnify
relative differences as the search progresses..
DMI - Università di Catania
16/07/2015
21
Mutation and Adaptation
• GAs typically use mutation as a simple background
operator, to ensure that a particular bit value is not
lost forever. Mutation in GAs typically flips bits with a
very low probability (e.g., 1 bit out of 1000).
• Mutation is far more important in ESs and EP.
Instead of a global mutation rate, mutation
probability distributions can be maintained for
every variable of every individual.
More importantly, ESs and EP encode the probability
distributions as extra information within each individual,
and allow this information to evolve as well (selfadaptation of mutation parameters, while the space is
being searched).
DMI - Università di Catania
16/07/2015
22
Recombination and Adaptation
• There are a number of recombination methods for ESs, all of
which assume that the individuals are composed of real-valued
variables. Either the values are exchanged or they are
averaged. The ES community has also considered multi-parent
versions of these operators.
• The GA community places primary emphasis on crossover.
– One-point recombination inserts a cut-point within the two
parents (e.g., between the 3rd and 4th variables, or bits). Then the
information before the cut-point is swapped between the two
parents.
– Multi-point recombination is a generalization of this idea,
introducing a higher number of cut-points. Information is then
swapped between pairs of cut-points.
– Uniform crossover, however, does not use cut-points, but simply
uses a global parameter to indicate the likelihood that each variable
should be exchanged between two parents.
DMI - Università di Catania
16/07/2015
23
Representation
• Traditionally, GAs use bit strings. In theory, this
representation makes the GA more problem independent.
We can also see this as a more genotypic level of
representation, since the individual is in some sense
encoded in the bit string. Recently, however, the GA
community has investigated more phenotypic
representations, including vectors of real values, ordered
lists, neural networks.
• The ES and EP communities focus on real-valued vector
representations, although the EP community has also used
ordered list and finite state automata representations.
• Very little has been done in the way of adaptive
representations.
DMI - Università di Catania
16/07/2015
24
Strength of the selection and
population’s carrying capacity
• Strong selection refers to a selection mechanism that
concentrates quickly on the best individuals, while weaker
selection mechanisms allow poor individuals to survive
(and produce children) for a longer period of time.
• Similarly, the population can be thought of as having a
certain carrying capacity, which refers to the amount of
information that the population can usefully maintain. A
small population has less carrying capacity, which is usually
adequate for simple problems. Larger populations, with
larger carrying capacities, are often better for more difficult
problems.
• Perhaps the evolutionary algorithm can adapt both selection
pressure and the population size dynamically, as it
16/07/2015
solves problems. DMI - Università di Catania
25
Accumulated payoff
• EP and ES usually have optimization for a goal. In other
words, they are typically most interested in finding the best
solution as quickly as possible.
• De Jong (1992) reminds us that GAs are not function
optimizers per se, although they can be used as such.
There is very little theory indicating how well GAs will
perform optimization tasks. Instead, theory concentrates
on what is referred to as accumulated payoff.
• The difference can be illustrated by considering financial
investment planning over a period of time (e.g., you play
the stock market). Instead of trying to find the best stock, you
are trying to maximize your returns as the various stocks are
sampled. Clearly the two goals are somewhat different, and
maximizing the return may or may not also be a good16/07/2015
heuristic for finding DMI
the- Università
best stock.
26
di Catania
Fitness correlation
• Fitness correlation, which appears to be a measure of EAHardness that places less emphasis on optimality
(Manderick et al., 1991). Fitness correlation measures the
correlation between the fitness of children and their
parents. Manderick et al. found a strong relationship
between GA performance and the strength of the
correlations.
• Another possibility is problem modality. Those problems
that have many suboptimal solutions will, in general, be
more difficult to search.
• Finally, this issue is also very related to a concern of de
Garis, which he refers to as evolvability. de Garis notes that
often his systems do not evolve at all, namely, that fitness
does not increase over time. The reasons for this are not
clear and remain an important research topic.
16/07/2015
DMI - Università di Catania
27
Distributed EAs
Because of the inherent natural parallelism
within an EA, much recent work has concentrated
on the implementation of EAs on parallel
machines. Typically either one processor holds
one individual (in SIMD machines), or a
subpopulation (in MIMD machines). Clearly, such
implementations hold promise of execution time
decreases.
More interestingly, are the evolutionary effects
that can be naturally illustrated with parallel
machines, namely, speciation, nicheing, and
punctuated equilibria (Belew and Booker,1991).
DMI - Università di Catania
16/07/2015
28
Resume
• Selection serves to focus search into areas of
high fitness.
• Other genetic operators (recombination and
mutation) perturb the individuals, providing
exploration in nearby areas.
• Recombination and mutation provide
different search biases, which may or may not
be appropriate for the task.
• The key to more robust EA systems probably
lies in the adaptive selection of such genetic
operators.
DMI - Università di Catania
16/07/2015
29
Hint: Gray Code
Decimal
Binary
Gray
0
0000
0000
Procedure binaryToGray
1
0001
0001
{
2
0010
0011
3
0011
0010
4
0100
0110
5
0101
0111
6
0110
0101
7
0111
0100
8
1000
1100
9
1001
1101
10
1010
1111
11
1011
1110
12
1100
1010
13
1101
1011
14
1110
1001
15
1111
1000
g1=b1;
for k=2 to m do
gk=bk-1 XOR bk;
}
Gray code is often used in GAs for mapping
between a decimal number and a bit string.
Mapping each digit of a decimal number to a
string of four bits corresponds to choosing 10
strings from 16 possibilities. In Gray code, two
neighbouring decimal digits are always
represented by adjacent bit strings that differ by
only one bit position.
DMI - Università di Catania
16/07/2015
30
Advantages of EAs
• Widely applicable, also in cases where no (good) problem
specic techniques are available:
– Multimodalities, discontinuities, constraints.
– Noisy objective functions.
– Multiple criteria decision making problems.
• No presumptions with respect to the problem space.
• Low development costs; i.e. costs to adapt to new problem
spaces.
• The solutions of EA's have straightforward
interpretations.
• They can be run interactively (online parameter adjustment).
DMI - Università di Catania
16/07/2015
31
Disadvantages of EAs
• No guarantee for finding optimal solutions
within a finite amount of time: True for all global
optimization methods.
• No complete theoretical basis (yet), but much
progress is being made.
• Parameter tuning is largely based on trial and
error (genetic algorithms); solution: Selfadaptation (evolution strategies).
• Often computationally expensive: Parallelism.
DMI - Università di Catania
16/07/2015
32
Applications of EAs
• Optimization and Problem Solving;
• NP-Complete Problem;
• Protein Folding;
• Financial Forecasting;
• Automated Synthesis of Analog Electrical Circuits;
• Evolutionary Robotics;
• Evolvable Hardware;
• Modelling.
DMI - Università di Catania
16/07/2015
33
Turing ‘s
vision
July 20, 2001
DMI - Università di Catania
16/07/2015
34
Turing’s three approaches for
creating intelligent computer program
One approach was logic-driven
while a second was knowledge-based.
The third approach that Turing specifically
identified in 1948 for achieving machine
intelligence is
“... the genetical or evolutionary search
by which a combination of genes is
looked for, the criterion being the
survival value”. A. M Turing A. M.,Intelligent
machines, Machine Intelligence, 1948.
DMI - Università di Catania
16/07/2015
35
In 1950 Turing described how evolution
and natural selection might be used to
create an intelligent program:
“... we cannot expect to find a good childmachine at first attempt. One must
experiment with teaching one such machine
and see how well it learns. One can then try
another and see if it is better or worse”.
Turing A. M.,Computing machinery and
intelligence,Mind,1950.
DMI - Università di Catania
16/07/2015
36
Turing’s third approach & EAs
There is an obvious connection between this process and
evolution, by identifications:
1. Structure of the child machine = Hereditary material;
2. Changes of the child machine = Mutations;
3. Judgment of the experimenter = Natural selection.
The above features of Turing's third approach to machine
intelligence are common to the various forms of evolutionary
computation developed over the past four decades.
DMI - Università di Catania
16/07/2015
37
Genes were easy:
The Protein Folding Problem
Genomics Transcriptomics Proteomics
DMI - Università di Catania
16/07/2015
38
Reductionistic and synthetic
approaches
DMI - Università di Catania
16/07/2015
39
Basic principles
DMI - Università di Catania
16/07/2015
40
Why are computer scientists
interested in biology ?
DMI - Università di Catania
16/07/2015
41
Scientific answer
• Biology is interesting as a domain for AI
research (i.e. drug design).
• Biology provides a rich set of metaphors
for thinking about intelligence: genetic
algorithms, neural networks and
Darwinian automata are but a few of the
computational approaches to behavior
based on biological ideas. There will, no
doubt, be many more (Artificial Immune
System).
DMI - Università di Catania
16/07/2015
42
Pragmatic answer
• “Gene sequencing’s Industrial Revolution” [IEEE
Spectrum, November 2000].
• IBM predicts the IT market for biology will grow
from $3.5 billion to more $9 billion by 2003. The
volume of life science data doubles every six
months. [IEEE Spectrum, January 2001]
• “Golden rice to Bioinformatics” [Scientific
American 2001].
• Biotechnology, BioXML, BioPerl, BioJava, Bioinspired models, Biological data analisys … Bio-all.
DMI - Università di Catania
16/07/2015
43
The building blocks:
the 20 natural Amino acids
1. Ala
A
Alanine
H
11. Leu
L
Leucine
H
2. Arg
R
Arginine
C+
12. Lys
K
Lysine
C+
3. Asn
N
Asparagine
P
13. Met
M
Methionine
H
4. Asp
D
Aspartic acid
C-
14. Phe
F
Phenylalanine H
5. Cys
C
Cysteine
P
15. Pro
P
Proline
H
6. Gln
Q
Glutamine
P
16. Ser
S
Serine
P
7. Glu
E
Glutamic acid
C-
17. Thr
T
Threonine
P
8. Gly
G
Glycine
P
18. Trp
W
Tryptophan
H
9. His
H
Histidine
P, C+
19. Tyr
Y
Tyrosine
P
10. Ile
I
Isoleucine
H
20. Val
V
Valine
H
3-letter code, single code, name residued, charge,Polar, Hydrophobic.
DMI - Università di Catania
16/07/2015
44
Proteins are necklaces of
amino acids
The protein is a linear polymer of the 20 different
kinds of amino acids, which are linked by peptide
bonds. Protein sequence length: 20 – 4500 aa.
DMI - Università di Catania
16/07/2015
45
Hydrophobic & hydrophilic residues
• Hydrophobic residues tend to come together to form
compact core that exclude water. Because the
environment inside cells is aqueous (primarily water),
these hydrophobic residues will tend to be on the inside
of a protein, rather than on its surface.
• Hydrophobicity is one of the key factors that determines
how the chain of amino acids will fold up into an active
protein (Hydrophilic: attracted to water, Hydrophobic:
repelled by water).
• The polarity of a molecule refers to the degree that its
electrons are distributed asymmetrically. A non-polar
molecule has a relatively even distribution of charge.
DMI - Università di Catania
16/07/2015
46
Primary structure
•The scaffold is always the same.
•The side-chain R determines the amino acid type.
DMI - Università di Catania
16/07/2015
47
Grand Challenge Problems in
Bioinformatics [T.Lengauer, Informatics – 10
Years Back, 10 Years Ahed, LNCS 2000]
1. Finding Genes in Genomic Sequences
2. Protein Folding and Protein Structure
Prediction
3. Estimating the Free Energy of
Biomolecules and their complexes
4. Simulating a Cell
DMI - Università di Catania
16/07/2015
48
The famed Protein Folding problem asks
how the amino-acid
sequence of a protein
adopts its native
three-dimensional
structure under
natural conditions
(e.g. in aqueous
solution, with neutral
pH at room
temperature).
DMI - Università di Catania
16/07/2015
49
Sequence
Structure
Function
While the nature of the fold is determined
by the sequence, it is encoded in a very
complicated manner. Thus, protein folding
can be seen as a connection between the
genome (sequence) and what the proteins
actually do (their function).
DMI - Università di Catania
16/07/2015
50
Protein Folding process
•Denaturated = Unfolded
(Disruption of equilibrium
and of the H-bonds)
No biological activity
• Native = Folded =
Unique compact structure
Biologically active
Below folding transition
temperature, the protein
seems to exist under a
unique conformation.
Folding time: 10-2 to 1 s
Folding the smallest protein: a single alpha helix.
DMI - Università di Catania
16/07/2015
51
Protein Folding Principles
• Finding a low-energy shape: Proteins tend to
twist into shapes that achieve a “low energy”
state in which amino acids fit comfortably
together.
• Attraction between neighbors: aa will most
strongly attract or repel those closest to
themselves. Aa can also interact with each other
through “H-bonds”, weak interactions that when
multiplied throughout the chain, can hold a
protein in a regular shape.
DMI - Università di Catania
16/07/2015
52
The Ab initio protein structure
prediction problem
Protein folding has to be distinguished
from protein structure prediction (PSP).
In the PSP we are not interested in the
folding process but just in the final
structure attained.
DMI - Università di Catania
16/07/2015
53
Why do Proteins “Fold“ ?
In order to carry out their function (e.g. as enzymes
or Ab), they must take on a particular shape, also
known as a Fold. Thus, proteins are truly amazing
machines: before they do their work, they assemble
themselves! This self-assembly is called Folding.
What happens if protein don’t fold correctly ?
Diseases such as Alzheimer's disease, cystic
fibrosis, Mad Cow disease, an inherited form of
emphysema, and even many cancers are believed
to result from protein misfolding. When proteins
misfold, then can clump together (aggregate).
DMI - Università di Catania
16/07/2015
54
Computational Tools in Protein
Folding
• Molecular Dynamics (MD);
• Monte Carlo (MC);
• Simulated Annealing (SA);
Search
methods
• Evolutionary Algorithms (EA);
• Convex Global Underestimator (CGU)
[K.A.Dill & H.S.Chan,Nature Struct Biol, 4:10-19,1997];
• Performance Guaranteed Approximation
Algorithms [W.E. Hart & S. Istrail,RECOMB, ACM
SIGACT 1997];
DMI - Università di Catania
Compute
native
conformation
16/07/2015
55
Genetic Algorithms for the ab
initio prediction
Using biologically inspired ideas to
compute about biological problems.
DMI - Università di Catania
16/07/2015
56
Why Genetic Algorithms for
Protein Structure Prediction?
1. PSP is analytically difficult to solve.
2. The number of conformations of a protein with
N aa grows exponentially as N where is the
average number of conformations per residue
(typically 10).
3. The PSP problem is NP-hard.
4. The energy landscape of proteins must be
`rugged'.
DMI - Università di Catania
16/07/2015
57
The HP protein folding model
[K. A.Dill, Biochemistry, 24:1501, 1985].
• HP models abstract the hydrophobic interaction process
in protein folding by reducing a protein to a
heteropolymer that represents a predetermined pattern
of hydrophobicity in the protein.
• Nonpolar amino acids are classifìed as hydrophobic (H)
and polar amino acids are classified as hydrophilic (P).
A sequence is
s{H, P}+
• The HP modei restricts the space of conformations to,
self-avoiding paths on a lattice (square, cubic or
triangular) in which vertices axe labeled by the amino
acids.
DMI - Università di Catania
16/07/2015
58
The energy potential in the HP model
• The energy potential reflects the fact that
hydrophobic amino acids have a propensity to
form a hydrophobic core. To capture this feature
of protein structures, the HP model adds a value
for every pair of hydrophobics that form a
topological contact.
• A topological contact is formed by a pair of amino
acids that are adjacent on the lattice and not
consecutive in the sequence. The value of is
typically taken to be -1.
DMI - Università di Catania
16/07/2015
59
HP sequences embedded
Figure shows sequence embedded in the square
lattice, with hydrophobic-hydrophobic contacts (HH
contacts) highlighted with dotted lines. The
conformation has an energy of –4.
DMI - Università di Catania
16/07/2015
60
Why HP model for Protein
structure prediction ?
• S. Istrail: “Best Science for Protein Folding”,
[Lipari School on Computational Biology 1999].
• HP model is powerful enough to capture a variety of
properties of actual proteins [Dill et al., Protein
Science, 4:561, 1995].
• The PSP problem for the HP model has been
shown to be NP-complete on the square lattice
[Crescenzi et al., J. Comp. Bio. 5(3), 1998] and
cubic lattice [Berger et al., J.Comp. Bio., 5(1),
1998].
DMI - Università di Catania
16/07/2015
61
Internal coordinates with
Absolute direction
Individuals are coded with a sequence in
{U,D,L,R,F,B}n-1 : which correspond to
up, down, left, right, forward and
backward moves in a cubic for a length n
protein.
DMI - Università di Catania
16/07/2015
62
Representation and Encoding
Pseq {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}+
PHP{H, P}+
Pconf {U,D,L,R,F,B}n-1
Pconf-abs=RULLURURULU
DMI - Università di Catania
16/07/2015
63
Bond directions describing
lattice conformations
Direction
r
Up
rz rz +1
Left
rx rx -1
Front
ry ry +1
Back
ry ry -1
Right
rx rx +1
Down
rz rz –1
A bond direction
corresponds to a change,
r , in one of the Cartesian
coordinates of the
successive monomer,
keeping all other
coordinates the same as
the previous monomer.
DMI - Università di Catania
16/07/2015
64
The energy function
where,
• strength of HH
attraction (usually
taken as 1).
• 2 energetic
penalty parameter for
sites containing two
monomers.
• 3 energetic
penalty parameter for
sites containing three
or more monomers.
DMI - Università di Catania
16/07/2015
65
Procedure GA (Protein Sequence)
{ t := 0;
initialize population P(t); /* random pop. of conformations */
evaluate P(t);
/* compute energy function */
while not terminate do { /* terminate when free energy reaches
equilibrium point*/
parent_selection P(t);
crossover P(t);
/* 1-point-Crossover acts stochastically with
fixed probability Pcros*/
mutate P(t); /* randomly change the value of bond directions along
the string with fixed probability Pmut */
evaluate P(t);
survive P(t); } /* All individuals are replaced except for 10% of the
current best conformation - elitism strategy */
Output (Protein structure);
}
DMI - Università di Catania
16/07/2015
66
Selection
Selection is linearly proportional to
fitness so that the probability, Pi , of
selecting the i-th conformation, with a
fitness value Fi, to propagate to the next
time step is given by:
DMI - Università di Catania
16/07/2015
67
Fitness function
Probabilities must be positive so a linear
mapping with a cut-off value is used to
convert the energy (E) minimization
problem to a fitness (F) maximization:
DMI - Università di Catania
16/07/2015
68
Population Free Energy
• Since GAs deal with an ensemble of solutions, a
quantity analogous to the statistical mechanical free
energy is used. The population free energy, F, is
calculated from its partition function, Z:
• where the sum is over the total number of
conformers in a population and Ei is the energy of
the i-th conformer. Hence,
F = -ln(Z)
DMI - Università di Catania
16/07/2015
69
Evolution of energy distributions
The GA dynamics
converge the
population of
conformations to an
equilibrium
distribution. This is
characterised by F as
shown in next slide.
Fig. 1 Evolution of a population: plotting the energy distributions at various
time steps. t is the percentage of the total run-time elapsed. The population
converges within 50% of the run time.
DMI - Università di Catania
16/07/2015
70
Mean, Minimum & Free Energy of
an Evolving Population
The mean energy
fluctuates around
the equilibrium
making it difficult
to use as a stop
criterion
The free energy, which characterises the energy distribution of a
population, reaches a convergence, or equilibrium point.
DMI - Università di Catania
16/07/2015
71
Compact HP conformation found
by GA
This method determines
the global minima of HPsequences by
constructing
conformations with a
core of H (hydrophobic)
residues that also
minimize the surface
area of the conformation.
Example of a compact HP conformation, with a hydrophobic core,
found by the GA (number of nearest neighbours = 39).
Dark=H=hydrophobic, Light=P=polar.
16/07/2015
DMI - Università di Catania
72
Sequence
Lowest Energy
# Energy eval
643d1
-27
433.533
643d.2
-30
167.017
643d.3
-38
172.192
643d.4
-34
107.143
643d.5
-36
154.168
643d.6
-31
454.727
643d.7
-25
320.396
643d.8
-34
315.036
643d.9
-33
151.705
643d.10
-26
191.019
Results
Studies were
carried out
using HPsequences
taken from
Unger,Moult, J.
Molecular
Biology,
231,1993.
643d.1 wwbbbbbwwwbbwwwwwbbwwwbwwwwwwbwb
wwwbwwbwwbwwwwwbwwwwbbwbbwwbwwbw
DMI - Università di Catania
16/07/2015
73
Energy Landascapes are Funnels
DMI - Università di Catania
16/07/2015
74
Conformational search strategies
DMI - Università di Catania
16/07/2015
75
Conclusions
• The evolutionary algorithms (GA) approach to the
protein structure prediction problem offers a
promising potential method of solution.
• GAs are fast and efficient at searching the rugged
conformational landscapes presented by protein
molecules.
Working in Progress
• Work is under way to design a EA to optimize real
protein structures (torsion angle space, lattice
models).
DMI - Università di Catania
16/07/2015
76
References
• D. E. Clark (ed.), Evolutionary Algorithms
in Molecular Design, Wiley-VCH,
Weinheim, 2000.
• M. Kanehisa, Post-Genome Informatics,
Oxford University Press, 2000. (An overview of
the types of data and databases used in bioinformatics).
DMI - Università di Catania
16/07/2015
77
Final Remark
“The impact of Bioinformatics research
critically depends on an accurate
understanding of the biological process under
investigation. It is essential to ask the right
questions, and often modeling takes priority
over optimization. Therefore, we need people
that understand and love both computer
science and biology to bring the field
forward.”
Thomas Lengauer
Institute of Computer Science - University of Bonn
DMI - Università di Catania
16/07/2015
78