Computational Problems in Molecular Biology
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline
From DNA to gene
Protein sequence and structure
Gene expression
Protein interaction and pathway
Provide a roadmap for the entire course
Biology from the system level (computational perspective)
About Life
Life is wonderful: amazing mechanisms
Life is not perfect: errors and diseases
Life is a result of evolution
Cells
Basic unit of life
Prokaryotes/eukaryotes
Different types of cell:
Skin, brain, red/white blood
Different biological function
Cells produced by cells
Cell division (mitosis)
2 daughter cells
DNA
Double Helix (Watson & Crick)
Nitrogenous Base Pairs
Adenine Thymine [A,T]
Cytosine Guanine [C,G]
Weak bonds (can be broken)
Form long chains
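The pairing rule above (A with T, C with G) means that one strand of the helix determines the other. A minimal Python sketch of that idea, assuming an uppercase string over A, C, G, T; the names `PAIR` and `reverse_complement` are illustrative and not part of the lecture:

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    # The two strands run antiparallel, so we walk the input backwards.
    return "".join(PAIR[base] for base in reversed(strand))


print(reverse_complement("AGCCAC"))  # GTGGCT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.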
Genome
Each cell contains a full genome (DNA)
The size varies:
Small for viruses and prokaryotes (10 kbp-20 Mbp)
Medium for lower eukaryotes
Yeast, unicellular eukaryote 13 Mbp
Worm (Caenorhabditis elegans) 100 Mbp
Fly, invertebrate (Drosophila melanogaster) 170 Mbp
Larger for higher eukaryotes
Mouse and man 3000 Mbp
Very variable for plants (many are polyploid)
Mouse ear cress (Arabidopsis thaliana) 120 Mbp
Lilies 60,000 Mbp
Differences in DNA
~2%
~4%
~0.2%
Genes
Chunks of DNA sequence that encode functional biomolecules (protein, RNA)
About 2% of the human DNA sequence codes for genes
32,000 human genes, 100,000 genes in tulips
Gene Structure
General structure of a eukaryotic gene
Unlike eukaryotic genes, a prokaryotic gene typically
consists of only one contiguous coding region
Informational Classes in
Genomic DNA
Transcribed sequences (exons and introns)
Messenger sequences (mRNA, exons only)
Coding sequences (CDS, part of the exons only)
Heads and tails: untranslated parts (UTR)
Regulatory sequences
... and all the rest
Identify them: gene-finding
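As a rough illustration of gene-finding in the simplest (prokaryote-like) case of one contiguous coding region, the sketch below scans the forward strand for open reading frames. It assumes the standard start codon ATG and stop codons TAA, TAG, TGA; the function name `find_orfs` and the `min_codons` threshold are hypothetical choices, and real gene finders use far richer signals (both strands, regulatory sequences, splice sites, statistical models).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon, so dna[start:end]
    is the full ORF. `min_codons` counts codons before the stop.
    """
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ATG
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```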
Genetic Code
A=Ala=Alanine
C=Cys=Cysteine
D=Asp=Aspartic acid
E=Glu=Glutamic acid
F=Phe=Phenylalanine
G=Gly=Glycine
H=His=Histidine
I=Ile=Isoleucine
K=Lys=Lysine
L=Leu=Leucine
M=Met=Methionine
N=Asn=Asparagine
P=Pro=Proline
Q=Gln=Glutamine
R=Arg=Arginine
S=Ser=Serine
T=Thr=Threonine
V=Val=Valine
W=Trp=Tryptophan
Y=Tyr=Tyrosine
Protein Synthesis
AGCCACTTAGACAAACTA (DNA)
Transcribed to:
AGCCACUUAGACAAACUA (mRNA)
Translated to:
SHLDKL (Protein)
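The two steps above can be sketched in Python. This assumes the DNA shown is the coding strand (so transcription is just T to U), uses the standard genetic code, and reads codons from the first base; the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (first base varies slowest). '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("AGCCACTTAGACAAACTA")
print(mrna)             # AGCCACUUAGACAAACUA
print(translate(mrna))  # SHLDKL
```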
About Protein
10s – 1000s amino acids (average 300)
Lysozyme sequence (129 amino acids):
KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS
TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS
DGNGMNAWVA WRNRCKGTDV QAWIRGCRL
Protein backbones:
Side chain
Evolution of Genes:
Mutation
Genes alter (slightly) during reproduction
Caused by copying errors, radiation, or toxic chemicals
3 possibilities: deletion, insertion, substitution (alteration)
Deletion: ACGTTGACTC -> ACGTGACTC
Insertion: ACGTTGACTC -> AGCGTTGACTC
Substitution: ACGTTGACTC -> ACGATGACTC
Mutations are mostly deleterious
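The three mutation types can be written as simple string operations. The sketch below reproduces the slide's examples; the function names and the 0-based positions are illustrative choices, not notation from the lecture.

```python
def delete(seq: str, pos: int) -> str:
    """Remove the base at 0-based position `pos`."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert `base` so that it ends up at 0-based position `pos`."""
    return seq[:pos] + base + seq[pos:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]


original = "ACGTTGACTC"
print(delete(original, 3))           # ACGTGACTC
print(insert(original, 1, "G"))      # AGCGTTGACTC
print(substitute(original, 3, "A"))  # ACGATGACTC
```

Note that a deletion or insertion changes the sequence length, and so shifts the reading frame of every codon downstream, while a substitution changes at most one codon.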
Evolution and Homology
Ancestor
Orthologs
(similar function)
Gene duplication
Paralogs
(related functions)
Y
X
Twilight zone:
undetectable homology
(<20% sequence identity)
Recombination
75%X 25%Y Mixed Homology
Sequence Comparison
Pairwise sequence comparison
Multiple alignment
SAANLEYLKNVLLQFIFLKPG--SERERLLPVINTMLQLSPEEKGKLAAV O15045
NEKNMEYLKNVFVQFLKPESVP-AERDQLVIVLQRVLHLSPKEVEILKAA P34562
KNEKIAYIKNVLLGFLEHKE----QRNQLLPVISMLLQLDSTDEKRLVMS Q06704
REINFEYLKHVVLKFMSCRES---EAFHLIKAVSVLLNFSQEEENMLKET Q92805
MLIDKEYTRNILFQFLEQRD----RRPEIVNLLSILLDLSEEQKQKLLSV O42657
EPTEFEYLRKVMFEYMMGR-----ETKTMAKVITTVLKFPDDQAQKILER O70365
DPAEAEYLRNVLYRYMTNRESLGKESVTLARVIGTVARFDESQMKNVISS Q21071
STSEIDYLRNIFTQFLHSMGSPNAASKAILKAMGSVLKVPMAEMKIIDKK Q18013
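Given two rows of an alignment like the one above, sequence identity can be estimated by counting matching columns. This is a minimal sketch under one stated assumption: columns where either row has a gap ('-') are skipped, and identity is taken over the remaining columns. Other conventions for the denominator exist, so the resulting percentage is only comparable to thresholds (such as the twilight zone) computed the same way. The name `percent_identity` is illustrative.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue                 # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# First two rows of the alignment shown above (O15045 and P34562).
o15045 = "SAANLEYLKNVLLQFIFLKPG--SERERLLPVINTMLQLSPEEKGKLAAV"
p34562 = "NEKNMEYLKNVFVQFLKPESVP-AERDQLVIVLQRVLHLSPKEVEILKAA"
print(f"{percent_identity(o15045, p34562):.1f}% identity")
```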
Phylogenetic Trees
Understand evolution
Protein Structure
Lysozyme structure:
ball & stick
strand
surface
Structure Features
of Folded Proteins
Compact
Secondary structures:
loop
α-helix
β-sheet
Protein cores mostly consist of α-helices and β-sheets
Protein Structure Comparison
Structure is better conserved than sequence
Structure can accommodate a wide range of mutations.
Physical forces favor certain structures.
The number of folds is limited.
Currently known: ~700
Estimated total: 1,000 to 10,000
TIM barrel
Protein Folding Problem
A protein folds into a unique 3D structure under physiological conditions
Lysozyme sequence:
KVFGRCELAA AMKRHGLDNY
RGYSLGNWVC AAKFESNFNT
QATNRNTDGS TDYGILQINS
RWWCNDGRTP GSRNLCNIPC
SALLSSDITA SVNCAKKIVS
DGNGMNAWVA WRNRCKGTDV
QAWIRGCRL
Structure-Function Relationship
Some level of function can be found without a structure, but a structure is key to understanding the detailed mechanism.
A predicted structure is a powerful tool for function inference.
Trp repressor as a function switch
Structure-Based Drug Design
Structure-based rational drug design is still a major method for drug discovery.
HIV protease inhibitor
Gene Expression
All cells carry the same DNA, but only a few percent of genes are expressed in common across cell types (housekeeping genes).
A few examples:
(1) Specialized cells: hemoglobin is over-represented in blood cells.
(2) Different stages of the life cycle: hemoglobins before and after birth, caterpillar and butterfly.
(3) Different environments: microbes in nutrient-poor or nutrient-rich environments.
(4) Special treatment: response to wound.
Eukaryote Gene Expression Control
[Figure: control points between DNA and protein]
Nucleus:
DNA -> primary RNA transcript (transcriptional control)
Primary RNA transcript -> mRNA (RNA processing control)
mRNA crosses the nuclear membrane (RNA transport control)
Cytosol:
mRNA -> protein (translation control)
mRNA -> inactive mRNA (mRNA degradation control)
Protein -> inactive protein (protein activity control)
Methods: Mass-spec, Microarray
Gene Regulation
promoter
operator
DNA sequence
Start of transcription
Microarray Experiments
Regulation/function/pathway/cellular state/phenotype
Disease: diagnosis/gene identification/sub-typing
Microarray chip
Microarray data
Genetic vs. Physical Interaction
Gene/protein interaction
Complex system
Regulatory network
Physical interaction
Genetic interaction
Transcription factor
Expressed gene
Biological Pathway
Studying Pathways through a Systems Biology Approach
RGYSLGNWVC
AAKFESNFNT
QATNRNTDGS
TDYGILQINS
RWWCNDGRTP
GSRNLCNIPC
sequence
gene regulation
structure
pathway
(cross-talk)
function
protein interaction
Discussion
Possible impacts of biotechnology on our lives
Assignments
Required reading:
* Chapter 13 in “Pavel Pevzner: Computational
Molecular Biology - An Algorithmic Approach.
MIT Press, 2000.”
* Larry Hunter: "Molecular Biology for Computer Scientists"
Optional reading:
http://www.ncbi.nih.gov/About/primer/bioinformatics.html
http://www.bentham.org/cpps1-1/Dong%20Xu/xu_cpps.htm