Biological Coding Theory Error-Control Code Models for Prokaryotic

Coding Theory and Protein Synthesis
Avogadro-Scale Engineering: Form and Function
November 18-19, 2003
Elebeoba E. May
Computational Biology Department
Sandia National Laboratories
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company,
for the United States Department of Energy under contract DE-AC04-94AL85000.
Agenda
It is the glory of God to conceal a matter; to search out a matter is the glory of kings.
Proverbs 25:2 (NIV)
• Error Control at Diverse Molecular Scales
• Coding Theory Models of Protein Synthesis
– Gatlin
– Yockey
– May et al.
• Applications of Coding Theory to
– Genetic Classification
– Molecular Computation
– Construction and Control in Protein Synthesis
Nucleotides: Did nature select a parity check code?
D. A. Mac Dónaill: “Numerical interpretation of nucleotides depicted as positions on a B^4 hypercube: (a) even-parity nucleotides; (b) odd-parity nucleotides. The natural alphabet is structured as an error-checking code.”
*D. A. Mac Dónaill, “A parity code interpretation of nucleotide alphabet composition,” Chem. Comm. (2002) 2062-2063 and http://www.tcd.ie/Chemistry/People/macdonaill/
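A minimal sketch of why an even-parity subset of the B^4 hypercube acts as an error-checking code. It works on generic 4-bit words only; the assignment of specific nucleotides to hypercube vertices is not reproduced here.

```python
from itertools import product

def parity(word):
    """Return 0 for an even number of 1 bits, 1 for an odd number."""
    return sum(word) % 2

def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def flip(word, i):
    """Flip bit i of a word (a single-bit error)."""
    return tuple(bit ^ 1 if j == i else bit for j, bit in enumerate(word))

# All 16 vertices of the 4-dimensional binary hypercube B^4.
vertices = list(product((0, 1), repeat=4))
even = [v for v in vertices if parity(v) == 0]
odd = [v for v in vertices if parity(v) == 1]

# Minimum Hamming distance between distinct even-parity words.
d_min = min(hamming(a, b) for a in even for b in even if a != b)

# A single-bit error always moves an even-parity word into the odd-parity
# set, so a parity check detects it.
all_detected = all(parity(flip(w, i)) == 1 for w in even for i in range(4))

print(len(even), len(odd), d_min, all_detected)
```

The even-parity words are at least two bit flips apart, which is what lets a parity check detect (but not correct) any single-bit error.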
Protein: Degeneracy of the genetic code
http://www.people.virginia.edu/~rjh9u/code.html
B. Hayes: “how quickly a biochemical puzzle … was reduced to an abstract problem in symbol manipulation.” B. Hayes, “The Invention of the Genetic Code,” American Scientist, 1998.
(Theoretical proposals from physicist George Gamow and coding theorist Solomon W. Golomb; experimental evidence from Marshall W. Nirenberg and J. Heinrich Matthaei, NIH.)
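To make the degeneracy concrete, the sketch below builds the standard genetic code table (codons written with T, enumerated in TCAG order) and counts how many codons map to each amino acid. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' is a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 codons to its amino acid (or stop).
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degeneracy: number of codons that encode each symbol.
degeneracy = Counter(codon_table.values())

for aa, count in sorted(degeneracy.items(), key=lambda item: -item[1]):
    print(aa, count)
```

Sixty-four codons map onto 20 amino acids plus stop, so most amino acids have several synonymous codons; only methionine and tryptophan have exactly one.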
Protein: Information theory and binding sites
T. D. Schneider : “Strong minor groove base conservation in sequence logos implies
DNA distortion or base flipping during replication and transcription initiation,” Nucleic
Acids Research, 2001, Vol. 29, No. 23 4881-4891
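Sequence logos plot the information content at each position of a set of aligned binding sites. A minimal sketch, assuming the usual per-position measure R(l) = 2 - H(l) bits for a four-letter alphabet and omitting the small-sample correction; the aligned sites are hypothetical toy data, not taken from the cited paper.

```python
import math
from collections import Counter

def information_per_position(sites):
    """Return R(l) = 2 - H(l) in bits for each column of equal-length aligned sites."""
    length = len(sites[0])
    info = []
    for l in range(length):
        counts = Counter(site[l] for site in sites)
        total = sum(counts.values())
        # Shannon entropy of the base distribution in this column.
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        info.append(2.0 - entropy)
    return info

# Hypothetical aligned sites: three conserved columns, one fully variable column.
toy_sites = ["AGGA", "AGGT", "AGGC", "AGGG"]
info = information_per_position(toy_sites)
print(info)
```

A fully conserved column carries 2 bits and a uniformly random column carries 0 bits, which is why conserved positions stand tall in a logo.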
Genome: Increased length, increased fidelity
Mutation Rates
• RNA viruses: 1 - 0.1
• DNA microbes: 1/300
• Higher eukaryotes: 1/300 EfGn
Comparison of microbial genome base mutation rate to genome size: exhibits power-law behavior; inverse relation between genome size and base mutation rate.
G. Battail: “… increasing the codeword length results in a decreasing probability of error…”
Comparison of higher eukaryotic genome base mutation rate to genome size: inverse relation between genome size and base mutation rate.
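A short worked example of the inverse relation. It assumes the 1/300 figure for DNA microbes is a roughly constant rate per genome per replication, so that the per-base rate is that constant divided by genome size; the genome sizes are hypothetical. Under that assumption the log-log slope is -1, a power law.

```python
import math

PER_GENOME_RATE = 1 / 300  # DNA-microbe figure from the slide (units assumed)

def per_base_rate(genome_size):
    """Per-base mutation rate implied by a constant per-genome rate."""
    return PER_GENOME_RATE / genome_size

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Hypothetical genome sizes in bases, for illustration only.
genome_sizes = [5e3, 5e4, 5e5, 5e6]
rates = [per_base_rate(g) for g in genome_sizes]
slope = loglog_slope(genome_sizes, rates)
print(round(slope, 6))
```

Measured data would scatter around such a line; the point is only that a fixed per-genome error budget forces longer genomes to have proportionally higher per-base fidelity.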
Evidence: Is there evidence of error control in the protein synthesis process?
Computational experiments by Liebovitch et al. (1996) and Rosen and Moore (2003) did not find evidence for linear block codes.
These approaches were not comprehensive: they did not consider convolutional coding or noise.
May et al. looked for an optimal generator for translation initiation sites.
It is highly probable that the encoding model does not conform to known error-control codes.
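For readers unfamiliar with the term, a binary linear block code is a set of codewords that contains the all-zero word and is closed under bitwise XOR. The sketch below tests that property on two small hypothetical word sets; it illustrates the definition only and is not the procedure used in the cited experiments.

```python
from itertools import combinations

def xor(a, b):
    """Bitwise XOR of two equal-length binary words."""
    return tuple(x ^ y for x, y in zip(a, b))

def is_linear(code):
    """True if the word set contains the zero word and is closed under XOR."""
    words = set(code)
    zero = tuple(0 for _ in next(iter(words)))
    if zero not in words:
        return False
    return all(xor(a, b) in words for a, b in combinations(words, 2))

# Hypothetical examples: the (3,1) repetition code is linear ...
repetition = {(0, 0, 0), (1, 1, 1)}
# ... while this set is not, because 110 XOR 011 = 101 is missing.
not_linear = {(0, 0, 0), (1, 1, 0), (0, 1, 1)}

print(is_linear(repetition), is_linear(not_linear))
```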
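Central Dogma of Genetics = Genetic Information Transmission
[Figure: central dogma diagram annotated as a communication system between A and B, with stages Encode, Channel, and Decode, and the label "(eukaryotes)". Source: http://www-stat.stanford.edu/~susan/courses/s166/central.gif]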
Coding Theory Models of Protein Synthesis
Gatlin, L. L., Information Theory and the Living System, 1972.
Yockey, Hubert, Information Theory and Molecular Biology, 1992.
Coding Theory View of Protein Synthesis, May et al., JFI 2004
[Diagram labels: Genetic Information, Genetic Encoder, Genetic Channel, Errors, Genetic Decoder; mRNA drawn with AUG, UAA, and the 3’ end marked]
Principal Hypothesis: If mRNA is viewed as a noisy encoded signal, it is feasible to use principles of error control coding theory to interpret the genetic translation initiation mechanism.
Engineering Communication System
[Diagram: A → Error Control Encoder → Channel → Decoder → B]
• k-bit information into the encoder: 1-0-0-1
• n-bit information out of the encoder: 111-000-000-111
• Channel output (noise + n-bit information): 111-001-000-110 (Errors!)
• Decoder output (~ k-bit information): 1-0-0-1
Engineering Communication System
[Diagram: A → Error Control Encoder → Channel → Decoder (????) → B]
• k-bit information into the encoder: 1-0-0-1
• n-bit information out of the encoder: 111-000-000-111
• Channel output (noise + n-bit information): 111-001-000-110 (Errors!)
• Decoder output (~ k-bit information): 1-0-0-1
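The bit patterns in the diagram are consistent with a rate-1/3 repetition code decoded by majority vote. A minimal sketch under that assumption (the slide does not name the code); the function names are illustrative.

```python
def encode(bits, n=3):
    """Repeat each information bit n times (k bits in, n*k bits out)."""
    return [b for b in bits for _ in range(n)]

def decode(received, n=3):
    """Majority-vote each block of n received bits back to one bit."""
    blocks = [received[i:i + n] for i in range(0, len(received), n)]
    return [1 if sum(block) > n // 2 else 0 for block in blocks]

message = [1, 0, 0, 1]                       # k-bit information: 1-0-0-1
sent = encode(message)                       # 111-000-000-111
received = [1, 1, 1,  0, 0, 1,  0, 0, 0,  1, 1, 0]  # channel output from the slide
decoded = decode(received)

print(sent)
print(decoded)
```

Each three-bit block tolerates one flipped bit, so the two channel errors in the slide's example, which fall in different blocks, are both corrected.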
Biological Coding Theory
• David Loewenstern, et al.
– Compression for DNA sequence classification
• Leonard Adleman, et al.; Lila Kari, et al.
– Molecular computation
– Encoding for DNA computing
– Error-control coding
• Thomas Schneider, et al.
– Biological information theory
– Error-control via sphere packing
[Diagram labels: Storage, Transmission]
Error-Control Coding Based Methods
• Efficient Coding for the Desoxyribonucleic Channel (S. W. Golomb 1962)
– Applied biorthogonal codes to the genetic coding problem (the codon to amino acid mapping challenge)
• Andrzej K. Konopka (1984)
• Gerard Battail
• Table-Based Convolutional Code for E. coli Promoter (P. Bermel)
– Based on the informational content of the E. coli promoter, approximates the coding rate for the promoter region as 1/9.
– Developed a possible 1/5 binary code for the E. coli promoter region.
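To illustrate what a rate-1/n binary convolutional code is, the sketch below implements a generic feed-forward encoder. It uses the textbook rate-1/2 generators (111, 101) as an assumption for illustration; it is not the table-based promoter code described above, which would need five output bits per input bit.

```python
def conv_encode(bits, generators=((1, 1, 1), (1, 0, 1))):
    """Encode bits with a feed-forward binary convolutional code.

    Each generator is a tap pattern over the current bit and the previous
    bits held in memory; one output bit is produced per generator.
    """
    memory = len(generators[0]) - 1
    state = [0] * memory            # previous input bits, most recent first
    out = []
    for u in bits:
        window = [u] + state        # current bit followed by memory contents
        for g in generators:
            out.append(sum(gi & wi for gi, wi in zip(g, window)) % 2)
        state = window[:-1]         # shift the register by one position
    return out

message = [1, 0, 1, 1]
codeword = conv_encode(message)
print(codeword)
print(len(message), "/", len(codeword))
```

Unlike a block code, each output bit depends on the current input and on earlier inputs, so the encoding of a base would depend on its sequence context.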
Coding Theory in RBS Classification
[Figure: mean minimum Hamming distance by position, with regions labeled DB, NRD, SD, and AUG]
Horizontal axis: position relative to the first base of the initiation codon.
Vertical axis: mean of the aligned minimum Hamming distance values by position, for the 3 sequence groups (Hamming distance = number of positions where two vectors differ).
May et al., BioSystems 2004
Coding Theory in RBS Classification
[Figure: sequence window b-15, b-14, …, b-11, …, b-1, A U G, with per-position average distances Davg-15, Davg-14, …, Davg-11. Bar chart (vertical axis 0 to 80) of correct vs. incorrect classification for PDF, PDF (p=0.5), CDF, and CDF (p=0.5); bar values shown: 73.81, 62.105, 59.065, 50, 50, 40.935, 37.895, 26.19]
Coding Theory and Molecular Computation
Leonard M. Adleman, et al.; Lila Kari, et al.
• Molecular computation
• Encoding for DNA computing
• Error-control coding
[Figure: strands v1 and v2 joined by ligase. Source: http://www.scs.uiuc.edu/~scott/index_files/ligation.gif]
M. Stojanovic and D. Stefanovic, “A deoxyribozyme-based molecular automaton,” Nature Biotech. 2003
• Can achieve computational robustness using coding theory
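One common way coding theory is used for robustness in DNA computing is to choose codewords that are far apart in Hamming distance. The sketch below builds such a word set greedily; it is an illustrative assumption rather than a method from the cited work, and it ignores practical constraints such as reverse complements and GC content.

```python
from itertools import product

def hamming(a, b):
    """Number of positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_code(length, d_min, alphabet="ACGT"):
    """Greedily collect words whose pairwise Hamming distance is >= d_min."""
    code = []
    for word in product(alphabet, repeat=length):
        if all(hamming(word, chosen) >= d_min for chosen in code):
            code.append(word)
    return ["".join(word) for word in code]

# Length-4 DNA words, any two of which differ in at least 3 positions.
words = greedy_code(4, 3)
print(len(words), words[:4])
```

With a minimum distance of 3, a single base substitution in any word still leaves it closer to its original than to any other word in the set.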
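Construction and control: Quantify and Optimize Protein Translation
[Figure: translation of messenger RNA (mRNA), drawn 5’ to 3’, showing the 30S subunit, 50S subunit, initiation factors, and the polypeptide/protein product. The mRNA consists of a leader* region, a start codon (AUG, GUG, or UUG), the coding region, and a stop codon (UAA, UAG, or UGA).]
*Ribosome binding site contained in leader region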
• Phases of translation: initiation, elongation, termination
• Initiation is the most time-consuming phase and affects the overall gene expression level
• A qualitative outline of the initiation process exists: 1) the 30S subunit and initiation factors (IFs) bind to the mRNA and fMet-tRNA; 2) the ternary complex binds the 50S subunit; 3) the IFs are released prior to elongation.
mRNA is the only variable aspect of translation initiation.
Information encoded in the mRNA determines specificity and efficiency.
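The mRNA layout above (leader, start codon, coding region, stop codon) can be sketched as a naive scan for the first start codon followed by an in-frame stop codon. The sequence is a hypothetical toy example, and a real ribosome selects the start site using information in the leader rather than by scanning alone.

```python
START_CODONS = {"AUG", "GUG", "UUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_coding_region(mrna):
    """Return (start, stop) indices of the first start codon that has an
    in-frame stop codon, or None if no such pair exists."""
    for i in range(len(mrna) - 2):
        if mrna[i:i + 3] not in START_CODONS:
            continue
        # Step codon by codon from the start until a stop codon appears.
        for j in range(i + 3, len(mrna) - 2, 3):
            if mrna[j:j + 3] in STOP_CODONS:
                return i, j
    return None

# Hypothetical mRNA: a 12-base leader, then AUG GCU AAA UAA.
toy_mrna = "AGGAGGACAGCU" + "AUGGCUAAAUAA"
region = find_coding_region(toy_mrna)
start, stop = region
leader = toy_mrna[:start]
coding = toy_mrna[start:stop]
print(region, leader, coding)
```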
Construction and control: Quantify and Optimize Protein Translation
[Figure: mRNA leader region (UTR), drawn 5’ to 3’, showing the non-random domain, ribosome binding site, start codon (AUG, GUG, or UUG), and downstream box. Sequence shown: 3’..AUUCCUCCACUAG….5’. Label: Modify E.coli Intergenic]
Acknowledgments
• Collaborators
– NCSU: Mladen Vouk, Donald Bitzer, Winser Alexander, and Ann Stomp
– SNL: Anna Johnston, William Hart, Jean-Paul Watson, Richard Pryor
• NIEHS: John Drake (Mutagenesis data)
• Support:
– SNL Tier 1 Seniors Council LDRD/DOE
– NSF, Ford Foundation