Sequence Optimization For Synthetic Genes

Sequence Optimization For Synthetic Genes
Using Genetic Algorithms
David Sigfredo Angulo1
Rob Vogelbacher1, Benjamin R. Capraro2, Tobin Sosnick2,
Shohei Koide2
1 School of Computer Science, Telecommunications and Information Systems, DePaul University
2 Department of Biochemistry and Molecular Biology, The University of Chicago
Introduction
• Genetic Algorithms:
– Using ideas based on the biology of genes
– Create software that uses such stochastic means to search through large search spaces
– Resulting algorithm has nothing to do with
genes
• Designing Genes
– This search space is huge
– REALLY NOVEL IDEA:
• Use Genetic Algorithms based on genes
to design genes!!
Outline
• Short biology Tutorial
• DNA Sequence Generation
– Why is the problem difficult?
• IBG Gene Designer
– Genetic Algorithm (GA) solution
– Heuristics and Fitness Evaluation
First
• Before the problem can be described
– Must give some background biochemistry principles
• Tutorial outline
– DNA
– Codons
– Protein
• Synthetic genes
– What are they and what are they used for?
– Restriction Enzymes
– Expressing Proteins using Vectors
Transcription/Translation
[Figure: Central Dogma of Molecular Biology. DNA is transcribed into RNA by RNA polymerase; RNA is translated into protein by ribosomes.]
DNA
• Deoxyribonucleic acid
• Strand backbone is made
of sugar & phosphate
molecules
• Strands connected by
nitrogen containing
nucleotide bases
• Two strands join making a
double helix
• Each strand is made of
nucleotides joined
together
[Figure: levels of DNA packing in a chromosome]
• Short region of DNA double helix: 2 nm
• "Beads on a string" form of chromatin: 11 nm
• 30 nm chromatin fiber of packed nucleosomes: 30 nm
• Section of chromosome in an extended form: 300 nm
• Condensed section of chromosome: 700 nm
• Entire mitotic chromosome: 1100 nm
DNA
Four Nucleotides:
AGTC
DNA: Base Pairing
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
DNA Sequence Generation:
Codon to Amino Acid Translation
http://campus.queens.edu/faculty/jannr/Genetics/images/codon.jpg
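The codon chart can be written as a lookup table. The following is a minimal sketch, assuming the standard genetic code written with DNA letters (T in place of U). The names `CODON_TABLE` and `translate` are illustrative and are not taken from IBG GeneDesigner.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA string codon by codon (stops shown as '*')."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


print(translate("ATGGAATTC"))
```

The table maps all 64 codons, so several codons map to the same amino acid letter.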
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
Proteins: AA Chains
Proteins
• Amino acid chains fold into complex 3D structures
• Functional properties depend on
3D structure
• Usefulness depends on
functional properties
– E.g. designing drugs
Designed/Expressed Proteins Extremely Useful
• Designed Proteins
– Can be used to study protein structure
– Can be used to study effects of other proteins
• Can be designed to “knock out” other proteins
• Can be designed to “block” the action of other proteins
• Expressed proteins
– Expressed in cow’s milk or
chicken eggs
– Can manufacture drugs on large
scales in this way
• E.g. insulin
Synthetic Genes
• DNA sequences
– “backtranslated” from a novel Protein or Amino Acid sequence
[Figure: DNA is transcribed into RNA by RNA polymerase; RNA is translated into protein by ribosomes.]
• We’ll put the DNA for our designed protein into an organism (a vector)
• Then that vector will make (express) our protein
• But, how do we get the DNA into an organism???
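Back-translation, as mentioned on this slide, can be sketched as the reverse lookup of the codon table. This is a minimal illustration that assumes the standard genetic code. The helper names `SYNONYMS` and `backtranslate` are hypothetical, and picking the first listed codon is an arbitrary placeholder for a real codon choice.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Reverse map: amino acid letter -> list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate an in-frame DNA string codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def backtranslate(protein):
    """Return one of the many DNA sequences that encode the protein.

    The first synonymous codon is used for each residue; this choice is
    arbitrary and ignores codon preference and restriction sites.
    """
    return "".join(SYNONYMS[aa][0] for aa in protein)


dna = backtranslate("MEF")
print(dna, translate(dna))
```

Any other choice from `SYNONYMS` gives a different DNA sequence that still translates to the same protein, which is what makes the design space so large.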
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
Restriction Enzyme Digests
• Watson – Crick 1953
• Took 20 years to be able to do anything with DNA
• H. Smith (and others) made a discovery that allowed manipulation and
deciphering of DNA
• Discovery was that bacteria produced enzymes that introduce breaks in
double stranded DNA molecules whenever they encountered a specific string
of nucleotides
• These enzymes are called Restriction Enzymes
• Restriction Enzymes can be used as precise scissors
– They let biologists cut (and paste) portions of DNA
EcoRI
• EcoRI was one of the first Restriction Enzymes discovered
– "Eco" because it was isolated from E. coli (Escherichia coli)
– "R" for the strain it came from, RY13
– "I" because it was the first Restriction Enzyme identified in that strain
– Now over 300 Restriction Enzymes known
• EcoRI cleaves (restricts, digests)
DNA
– Between the G and A
nucleotides
– Only when it encounters them
in the string 5'-GAATTC-3'
– This is called the
restriction site
5'-GAATTC-3'
3'-CTTAAG-5'
Cleaved by EcoRI
5'-G     AATTC-3'
3'-CTTAA     G-5'
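The EcoRI cut can be illustrated on a single strand with a short sketch. It assumes a linear sequence and models only the top-strand cut between G and A. The function names `reverse_complement` and `digest` and the `cut_offset` parameter are illustrative choices.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def digest(seq, site="GAATTC", cut_offset=1):
    """Split a linear top strand at every occurrence of the restriction site.

    cut_offset is the number of site bases left on the upstream fragment;
    1 places the cut between G and A, as EcoRI does.
    """
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


# The site reads the same on both strands, so both strands are cut.
print(reverse_complement("GAATTC"))
print(digest("AAGAATTCGG"))
```

Because the site is its own reverse complement, the same rule applied to the bottom strand leaves the single-stranded AATT overhangs shown in the diagram.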
Sticky Ends
• Many restriction enzymes cut in such a way that some single-stranded DNA is left at both ends
• These nucleotide sequences
– Are complementary to each other
– Are 5'-AATT-3' in the case of EcoRI
– Can base pair with other nucleotides in a sequence
– Thus, are called "sticky ends"
– Can temporarily hold two DNA strands together
– The enzyme ligase will permanently join those strands
– This is called ligation
5'-GAATTC-3'
3'-CTTAAG-5'
Cleaved by EcoRI
5'-G     AATTC-3'
3'-CTTAA     G-5'
Short Biology Tutorial
• Tutorial outline
– DNA
– Codons
– Protein
– Restriction Enzymes
– Expressing Proteins using Vectors
Gene Synthesis:
On the Lab Bench
• Initial Sequence Construction
– Oligonucleotides (short strands of DNA) are defined with complementary
overlapping sites
• The “sticky ends”
– Assembly PCR
• Oligonucleotides and polymerase are mixed and placed in a
thermocycler
• Creates contiguous DNA sequence from component oligos
Gene Synthesis:
On the Lab Bench (cont)
• After PCR, generated DNA sequence cut with restriction enzymes
• Expression host's plasmid cut with restriction enzymes
• Synthetic gene inserted into plasmid and plasmid repaired
• Expression Vectors
– Host organisms used to express the synthetic genes (make the protein)
– Typically E. coli
• Possibly chickens or cows
• Expression vector can now express protein coded for by synthetic gene
– A bit more complicated than described above!!!
DNA Sequence Generation:
Gene Insertion
Outline
• Short biology Tutorial
• DNA Sequence Generation
– Why is the problem difficult?
• IBG Gene Designer
– Genetic Algorithm (GA) solution
– Heuristics and Fitness Evaluation
DNA Sequence Generation:
The Computational Problem
• Why is the problem difficult?
– Conflicting goals
• Avoid restriction sites
• Maximizing Codon Preference
• Thus, cannot use deterministic algorithm
– Degeneracy (redundancy) of the DNA code – 64 codons, 20 (21) amino
acids (see next slide)
• Several synonymous codons are translated into the same amino acid
• Synonymous codons per AA vary from one to six (average is about three codons per AA)
• Huge number of possible DNA Sequences
– Average 2^N for a protein of N amino acids
– Codon Preference
• Varying levels of tRNA assembly components in organisms
• Codon usage for a particular AA greatly influences protein expression
– (continued)
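The size of the search space can be computed exactly for a given protein by multiplying the number of synonymous codons at each position. This is a sketch assuming the standard genetic code; the names `DEGENERACY` and `count_encodings` are illustrative.

```python
from collections import Counter
from math import prod

# Amino acid letters of the standard genetic code, one per codon ("*" = stop).
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

# Number of synonymous codons for each of the 20 amino acids.
DEGENERACY = Counter(aa for aa in AMINO if aa != "*")


def count_encodings(protein):
    """Number of distinct DNA sequences that translate to the protein."""
    return prod(DEGENERACY[aa] for aa in protein)


print(count_encodings("MEF"))
print(count_encodings("MEFLKRS" * 8))
```

The count grows exponentially with protein length, so exhaustive enumeration quickly becomes impractical even for short domains.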
DNA Sequence Generation:
Codon to Amino Acid Translation
http://campus.queens.edu/faculty/jannr/Genetics/images/codon.jpg
DNA Sequence Generation:
The Computational Problem (cont)
• Why is the problem difficult?
– (continued)
– Restriction Enzymes
• The vector will contain many restriction enzymes
– If these cut up our DNA, we won’t express our proteins
– We must design the DNA string using synonymous codons so that there are no
restriction sites
• Helpful to include some other restriction sites
– We must design the DNA string using synonymous codons so that these are
included
– (continued)
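Removing an unwanted restriction site with synonymous codons can be sketched as a small search over single-codon substitutions. This is an illustrative greedy approach under the standard genetic code, not the method used by IBG GeneDesigner. The function name `remove_site` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def remove_site(codons, site):
    """Try to eliminate a restriction site by swapping one synonymous codon.

    Returns a new codon list encoding the same protein, or None if no
    single substitution removes every occurrence of the site.
    """
    if site not in "".join(codons):
        return list(codons)
    for i, codon in enumerate(codons):
        for alt in SYNONYMS[CODON_TABLE[codon]]:
            if alt == codon:
                continue
            trial = codons[:i] + [alt] + codons[i + 1:]
            if site not in "".join(trial):
                return trial
    return None


# M-E-F encoded so that it contains the EcoRI site GAATTC.
print(remove_site(["ATG", "GAA", "TTC"], "GAATTC"))
```

A single greedy swap ignores the other goals (codon preference, sites to include), which is one reason a search over whole sequences is attractive.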
DNA Sequence Generation:
The Computational Problem (cont)
• Why is the problem difficult?
– (continued)
– mRNA Secondary Structure
• In prokaryotes, mRNA can fold into
complex shapes
• This inhibits protein creation
– Oligonucleotide generation
• Want a specific melting temperature so
that the complex folding doesn’t take
place
• The “sticky ends” must have the same
melting temperature so that they will
bind together.
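Matching melting temperatures across overlaps can be checked with a simple estimate. The sketch below uses the Wallace rule (2 °C per A or T, 4 °C per G or C), a rough approximation meant for short oligonucleotides. The source does not say which formula IBG GeneDesigner uses, so this choice is an assumption, as are the names `wallace_tm` and `tm_spread`.

```python
def wallace_tm(oligo):
    """Estimate melting temperature in degrees C with the Wallace rule."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def tm_spread(overlaps):
    """Difference between the highest and lowest estimated Tm."""
    temps = [wallace_tm(o) for o in overlaps]
    return max(temps) - min(temps)


print(wallace_tm("GAATTC"))
print(tm_spread(["GAATTC", "GGATCC"]))
```

A smaller spread means the overlapping regions should anneal under similar conditions, which is the property the slide asks for.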
Outline
• Short biology Tutorial
• DNA Sequence Generation
– Why is the problem difficult?
• IBG Gene Designer
– Genetic Algorithm (GA) solution
– Heuristics and Fitness Evaluation
IBG GeneDesigner:
Our Solution
• IBG GeneDesigner
IBG GeneDesigner:
Genetic Algorithm
• Uses a Genetic Algorithm for sequence optimization
– Tournament selection model
– Uniform and single-point crossover (behind the scenes – not user selectable
at present.)
– Mutation causes codon “wobbling”
– Sequence “fitness” determined by heuristic evaluation
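The ingredients listed on this slide can be combined into a small genetic algorithm over codon lists. This is a minimal sketch under stated assumptions, not the GeneDesigner implementation: the fitness function is a toy (closeness to a target G/C fraction), the parameter defaults are arbitrary, and all names are illustrative. The protein must have at least two residues for single-point crossover.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def toy_fitness(codons, target_gc=0.5):
    """Placeholder heuristic: higher when G/C fraction is near the target."""
    seq = "".join(codons)
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return -abs(gc - target_gc)


def tournament(population, fitness, rng, k=3):
    """Pick k random individuals and keep the fittest."""
    return max(rng.sample(population, k), key=fitness)


def uniform_crossover(a, b, rng):
    """Take each codon from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def single_point_crossover(a, b, rng):
    """Take a prefix of codons from one parent and the rest from the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(codons, rng, rate=0.05):
    """Swap codons for synonymous ones, so the protein never changes."""
    return [rng.choice(SYNONYMS[CODON_TABLE[c]]) if rng.random() < rate else c
            for c in codons]


def evolve(protein, fitness=toy_fitness, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(SYNONYMS[aa]) for aa in protein]
                  for _ in range(pop_size)]
    for _ in range(generations):
        next_gen = []
        for _ in range(pop_size):
            a = tournament(population, fitness, rng)
            b = tournament(population, fitness, rng)
            cross = uniform_crossover if rng.random() < 0.5 else single_point_crossover
            next_gen.append(mutate(cross(a, b, rng), rng))
        population = next_gen
    return max(population, key=fitness)


best = evolve("MEFLKRS")
print("".join(best), toy_fitness(best))
```

Crossover happens only at codon boundaries and mutation only swaps synonymous codons, so every individual in every generation encodes the same protein.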
IBG GeneDesigner:
Fitness Evaluation
• GeneDesigner heuristics
– Manipulation of nucleotide percentages/ratios to reduce mRNA secondary
structure formation
– Inclusion and Exclusion of restriction sites
• Restriction sites requested for inclusion should only occur once
– Matching of codon preference
– Oligonucleotide generation
• Fitness determined by melting points, start and end nucleotide
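One way to combine these heuristics is a weighted sum of penalties and rewards. The sketch below is an assumption about how such a score could be assembled, not the GeneDesigner scoring scheme. The weights, the `codon_weights` table (a caller-supplied map from codon to preference score), and the function names are all hypothetical.

```python
def count_sites(seq, site):
    """Count occurrences of a site, including overlapping ones."""
    count, pos = 0, seq.find(site)
    while pos != -1:
        count += 1
        pos = seq.find(site, pos + 1)
    return count


def fitness(seq, exclude=(), include=(), target_gc=0.5,
            codon_weights=None, weights=(1.0, 1.0, 1.0, 1.0)):
    """Score an in-frame coding sequence; higher is better."""
    w_gc, w_exclude, w_include, w_codon = weights
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    # Nucleotide composition: penalize distance from the target G/C fraction.
    score = -w_gc * abs(gc - target_gc)
    # Excluded restriction sites: one penalty point per occurrence.
    score -= w_exclude * sum(count_sites(seq, s) for s in exclude)
    # Included restriction sites should occur exactly once.
    score -= w_include * sum(abs(count_sites(seq, s) - 1) for s in include)
    # Codon preference: mean of caller-supplied per-codon scores.
    if codon_weights:
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        score += w_codon * sum(codon_weights.get(c, 0.0) for c in codons) / len(codons)
    return score


print(fitness("GAATTC", exclude=["GAATTC"]))
print(fitness("GAGTTC", exclude=["GAATTC"]))
```

Both sequences encode the same two residues, but the second scores higher because it no longer contains the excluded site.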
IBG GeneDesigner:
Future Work
• Algorithm parameters
– Systematically manipulate GA parameters to identify default values for
sequence optimization
• Population size
• Number of generations
• Mutation rate
• Convergence criteria
– Modify heuristic weighting scheme
• Selection models
– Experiment with alternative selection models (Roulette wheel, elitism, limit
population replacement)
IBG GeneDesigner:
Future Work
• Move algorithm to ECJ architecture
– Use the Strength-Pareto multi-objective optimization algorithm
• Create web-based version of application
• Explore island model effects on optimization
Results
• IBG GeneDesigner was used to generate a nucleotide sequence for the SH3
domain of α-spectrin1.
• The codon optimization option was set for expression in E. coli with a 40%
G/C bias
• We also used the application to generate four assembly PCR template
oligonucleotide sequences to produce the protein coding sequence flanked by
desired restriction enzyme recognition sites.
• The calculated Tm values of the three overlapping regions were within 1.6 °C of one another
– Promoting similar annealing behavior between strands.
– Success of the reaction was confirmed by DNA sequencing of a pUC19
expression vector containing the PCR product cloned between restriction
sites included in the gene design.
• Summary: Protein Made!!!
Input: Protein Sequence, Vector, Restriction Enzymes
Input: Flanking Sequences
Input: Algorithm Parameters and Fitness Scores
Output: Generation of Oligonucleotides
Acknowledgements
• Graduate student who did much of the coding
– Rob Vogelbacher
• University of Chicago undergraduate who used it to build a protein
– Benjamin R. Capraro
• His advisor
– Tobin Sosnick
• Our collaborator at University of Chicago
– Shohei Koide