Computational Problems in Molecular Biology
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline

From DNA to gene

Protein sequence and structure

Gene expression

Protein interaction and pathway

Provide a roadmap for the entire course

Biology from the system level (computational perspective)
About Life
 Life is wonderful: amazing mechanisms
 Life is not perfect: errors and diseases
 Life is a result of evolution
Cells

Basic unit of life

Prokaryotes/eukaryotes

Different types of cell:
 Skin, brain, red/white blood
 Different biological function

Cells produced by cells
 Cell division (mitosis)
 2 daughter cells
DNA

Double Helix (Watson & Crick)

Nitrogenous Base Pairs
 Adenine pairs with Thymine [A,T]
 Cytosine pairs with Guanine [C,G]
 Weak bonds (can be broken)
 Form long chains
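
The pairing rule above (A with T, C with G) fixes one strand of the helix once the other is known. A minimal sketch, assuming strands are written as plain strings over A, C, G, T; the function names are illustrative only:

```python
# Pairing rule from the slide: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("AGCCAC"))          # TCGGTG
    print(reverse_complement("AGCCAC"))  # GTGGCT
```

The reversal reflects that the two strands of the helix run in opposite directions.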
Genome


Each cell contains a full genome (DNA)
The size varies:
 Small for viruses and prokaryotes (10 kbp to 20 Mbp)
 Medium for lower eukaryotes
   Yeast, unicellular eukaryote: 13 Mbp
   Worm (Caenorhabditis elegans): 100 Mbp
   Fly, invertebrate (Drosophila melanogaster): 170 Mbp
 Larger for higher eukaryotes
   Mouse and man: 3000 Mbp
 Very variable for plants (many are polyploid)
   Mouse ear cress (Arabidopsis thaliana): 120 Mbp
   Lilies: 60,000 Mbp
Differences in DNA
~2%
~4%
~0.2%
Genes
 Chunks of DNA sequence that encode functional biomolecules (protein, RNA)
 2% of the human DNA sequence codes for genes
 32,000 human genes; 100,000 genes in tulips
Gene Structure

General structure of a eukaryotic gene

Unlike eukaryotic genes, a prokaryotic gene typically
consists of only one contiguous coding region
Informational Classes in Genomic DNA

Transcribed sequences (exons and introns)

Messenger sequences (mRNA, exons only)

Coding sequences (CDS, part of the exons only)

Heads and tails: untranslated parts (UTR)

Regulatory sequences

... and all the rest
 Identify them: gene-finding
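
The relationship between the classes above (transcript, mRNA, CDS, UTR) can be sketched in a few lines. The toy transcript, the exon coordinates, and the CDS boundaries below are hypothetical values chosen for illustration; real gene-finding has to infer them from the sequence.

```python
def splice(transcript: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) into a mature mRNA."""
    return "".join(transcript[start:end] for start, end in exons)


def coding_sequence(mrna: str, cds_start: int, cds_end: int) -> str:
    """Cut the CDS out of the mRNA; what remains on each side is UTR."""
    return mrna[cds_start:cds_end]


# Hypothetical transcript: two exons (uppercase) around one intron (lowercase).
transcript = "GGATGGCC" + "gtaagtttcag" + "TTCTAAGG"
exons = [(0, 8), (19, 27)]

mrna = splice(transcript, exons)      # exons only
cds = coding_sequence(mrna, 2, 14)    # part of the exons only

if __name__ == "__main__":
    print(mrna)  # GGATGGCCTTCTAAGG
    print(cds)   # ATGGCCTTCTAA
```

In this toy case the two bases before and after the CDS play the role of the untranslated heads and tails.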
Genetic Code
A=Ala=Alanine
C=Cys=Cysteine
D=Asp=Aspartic acid
E=Glu=Glutamic acid
F=Phe=Phenylalanine
G=Gly=Glycine
H=His=Histidine
I=Ile=Isoleucine
K=Lys=Lysine
L=Leu=Leucine
M=Met=Methionine
N=Asn=Asparagine
P=Pro=Proline
Q=Gln=Glutamine
R=Arg=Arginine
S=Ser=Serine
T=Thr=Threonine
V=Val=Valine
W=Trp=Tryptophan
Y=Tyr=Tyrosine
Protein Synthesis

AGCCACTTAGACAAACTA (DNA)
 Transcribed to:

AGCCACUUAGACAAACUA (mRNA)
 Translated to:

SHLDKL (Protein)
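
The two steps on this slide can be reproduced with a short sketch. It assumes the standard genetic code and, as on the slide, treats the DNA string as the coding strand, so transcription is just T replaced by U. The table layout (bases ordered T, C, A, G) is a common compact encoding, not something taken from the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from position 0 and stop at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("AGCCACTTAGACAAACTA")
    print(mrna)             # AGCCACUUAGACAAACUA
    print(translate(mrna))  # SHLDKL
```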
About Protein
10s – 1000s amino acids (average 300)
Lysozyme sequence (129 amino acids):
KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS
TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS
DGNGMNAWVA WRNRCKGTDV QAWIRGCRL
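
As a quick check on the sequence above, a small sketch that strips the spacing and tallies the one-letter codes; the variable names are illustrative:

```python
from collections import Counter

# Lysozyme sequence as printed on the slide (spaces are only for readability).
LYSOZYME = (
    "KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS "
    "TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS "
    "DGNGMNAWVA WRNRCKGTDV QAWIRGCRL"
).replace(" ", "")

composition = Counter(LYSOZYME)

if __name__ == "__main__":
    print(len(LYSOZYME))     # 129
    print(composition["C"])  # 8 cysteines
    for residue, count in composition.most_common(5):
        print(residue, count)
```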
Protein backbones:
Side chain
Evolution of Genes: Mutation

Genes alter (slightly) during reproduction
 Caused by errors, radiation, or toxicity
 3 possibilities: deletion, insertion, substitution (alteration)

Deletion: ACGTTGACTC → ACGTGACTC

Insertion: ACGTTGACTC → AGCGTTGACTC

Substitution: ACGTTGACTC → ACGATGACTC

Mutations are mostly deleterious
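
The three mutation types can be written as simple string operations. This is a sketch under the assumption that positions are 0-based; the positions used below were chosen to reproduce the slide's examples.

```python
def delete(seq: str, pos: int, length: int = 1) -> str:
    """Remove `length` bases starting at 0-based position `pos`."""
    return seq[:pos] + seq[pos + length:]


def insert(seq: str, pos: int, bases: str) -> str:
    """Insert `bases` before 0-based position `pos`."""
    return seq[:pos] + bases + seq[pos:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position `pos`."""
    return seq[:pos] + base + seq[pos + 1:]


if __name__ == "__main__":
    original = "ACGTTGACTC"
    print(delete(original, 3))           # ACGTGACTC
    print(insert(original, 1, "G"))      # AGCGTTGACTC
    print(substitute(original, 3, "A"))  # ACGATGACTC
```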
Evolution and Homology
Ancestor
Orthologs
(similar function)
Gene duplication
Paralogs
(related functions)
Y
X
Twilight zone:
undetectable homology
(<20% sequence identity)
Recombination
75%X 25%Y Mixed Homology
Sequence Comparison
o Pairwise sequence comparison
o Multiple alignment
SAANLEYLKNVLLQFIFLKPG--SERERLLPVINTMLQLSPEEKGKLAAV O15045
NEKNMEYLKNVFVQFLKPESVP-AERDQLVIVLQRVLHLSPKEVEILKAA P34562
KNEKIAYIKNVLLGFLEHKE----QRNQLLPVISMLLQLDSTDEKRLVMS Q06704
REINFEYLKHVVLKFMSCRES---EAFHLIKAVSVLLNFSQEEENMLKET Q92805
MLIDKEYTRNILFQFLEQRD----RRPEIVNLLSILLDLSEEQKQKLLSV O42657
EPTEFEYLRKVMFEYMMGR-----ETKTMAKVITTVLKFPDDQAQKILER O70365
DPAEAEYLRNVLYRYMTNRESLGKESVTLARVIGTVARFDESQMKNVISS Q21071
STSEIDYLRNIFTQFLHSMGSPNAASKAILKAMGSVLKVPMAEMKIIDKK Q18013
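
One common way to quantify how similar two rows of such an alignment are is percent sequence identity, the measure behind the "twilight zone" threshold mentioned earlier. The sketch below assumes one particular convention (columns where either row has a gap are skipped); other conventions divide by the alignment length or by the shorter sequence.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # First two rows of the alignment above (O15045 and P34562).
    o15045 = "SAANLEYLKNVLLQFIFLKPG--SERERLLPVINTMLQLSPEEKGKLAAV"
    p34562 = "NEKNMEYLKNVFVQFLKPESVP-AERDQLVIVLQRVLHLSPKEVEILKAA"
    print(round(percent_identity(o15045, p34562), 1))
```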
Phylogenetic Trees
Understand evolution
Protein Structure
Lysozyme structure:
ball & stick
strand
surface
Structure Features of Folded Proteins


Compact
Secondary structures:
loop
α-helix
β-sheet
Protein cores mostly consist of α-helices and β-sheets
Protein Structure Comparison
Structure is better conserved than sequence
Structure can tolerate a wide range of mutations.
Physical forces favor certain structures.
The number of folds is limited.
Currently known: ~700
Estimated total: 1,000 to 10,000
TIM barrel
Protein Folding Problem
A protein folds into a unique 3D structure under
physiological conditions
Lysozyme sequence:
KVFGRCELAA AMKRHGLDNY
RGYSLGNWVC AAKFESNFNT
QATNRNTDGS TDYGILQINS
RWWCNDGRTP GSRNLCNIPC
SALLSSDITA SVNCAKKIVS
DGNGMNAWVA WRNRCKGTDV
QAWIRGCRL
Structure-Function Relationship
Some level of function can be determined without a
structure, but a structure is key to understanding
the detailed mechanism.
A predicted structure is a powerful tool for
function inference.
Trp repressor as a function switch
Structure-Based Drug Design
Structure-based
rational drug
design is still a
major method for
drug discovery.
HIV protease inhibitor
Gene Expression
All cells carry the same DNA, but only a few percent of genes are
expressed in common across cell types (house-keeping genes).
A few examples:
(1) Specialized cell: over-represented hemoglobin in blood cells.
(2) Different stages of life cycle: hemoglobins before and after
birth, caterpillar and butterfly.
(3) Different environments: microbes in nutrient-poor or nutrient-rich
environments.
(4) Special treatment: response to wound.
Eukaryote Gene Expression Control
[Figure: control points in eukaryotic gene expression]
Nucleus: DNA → primary RNA transcript (transcriptional control) → mRNA (RNA processing control)
Nuclear membrane: mRNA passes to the cytosol (RNA transport control)
Cytosol: mRNA → protein (translation control); mRNA → inactive mRNA (mRNA degradation control); protein → inactive protein (protein activity control)
Methods: Mass-spec, Microarray
Gene Regulation
promoter
operator
DNA sequence
Start of transcription
Microarray Experiments
Regulation/function/pathway/cellular state/phenotype
Disease: diagnosis/gene identification/sub-typing
Microarray chip
Microarray data
Genetic vs. Physical Interaction
Gene/protein interaction
Complex system
Regulatory network
Physical interaction
Genetic interaction
Transcription
factor
Expressed
gene
Biological Pathway
Studying Pathways through Systems Biology Approach
[Figure labels: sequence (RGYSLGNWVC AAKFESNFNT QATNRNTDGS TDYGILQINS RWWCNDGRTP GSRNLCNIPC), structure, function, gene regulation, protein interaction, pathway (cross-talk)]
Discussion
 Possible impacts of biotechnology on our lives
Assignments

Required reading:
* Chapter 13 in Pavel Pevzner, Computational Molecular Biology: An
Algorithmic Approach. MIT Press, 2000.
* Larry Hunter, Molecular Biology for Computer Scientists

Optional reading:
http://www.ncbi.nih.gov/About/primer/bioinformatics.html
http://www.bentham.org/cpps1-1/Dong%20Xu/xu_cpps.htm