
Computational Biology, Part 1
Introduction
Robert F. Murphy
Copyright © 1996, 2000, 2001.
All rights reserved.
Course Introduction
What these courses are about
 What I expect
 What you can expect

What these courses are about
overview of ways in which computers are
used to solve problems in biology
 supervised learning of illustrative or
frequently-used programs
 (03-510) supervised learning of
programming techniques and algorithms
selected from these uses

I expect
students will have basic knowledge of biology and
chemistry (at the level of Modern Biology/Chemistry) and
willingness to learn more
students will have basic familiarity with use of computers
(e.g., at the level of Computing Skills Workshop) and
eagerness to gain new skills
(03-510) students have some programming experience and
willingness to work to improve
heterogeneous class - I plan to include refreshers on each
new topic
students will ask questions in class and via email
You can expect
Three major course sections
 Sequence Analysis (13 classes)
 Biological Modeling (11 classes)
 Biological Imaging (4 classes)
Class sessions: lectures/demonstrations/exercises/quizzes
Homework assignments
 4 homework assignments for 03-311 (80% of grade)
 8 homework assignments for 03-310 (70% of grade)
 10 homework assignments for 03-510 (70% of grade)
Test March 1 (20% for 03-311, 10% for others)
Final (20% of grade for 03-310, 03-510)
Communication on class matters via email list
Textbooks for first half of course

For 03-310/311 students
 “Required textbook” is Baxevanis & Ouellette
For 03-510 students
 “Recommended” textbook is Durbin et al.
Additional suggested book
 Computational Molecular Biology, Peter Clote & Rolf Backofen (ISBN 0-471-87252-0)
 Chap. 1 is an excellent introduction to Molec. Biol. for non-Biology majors
Specific sources for CMU
computational biology classes

Web page (http://www.bio.cmu.edu/Courses/03310 or 03311 or 03510)
 Lecture Notes (as PowerPoint files)
 Homework Assignments (as Word files)
 Additional materials as needed
FTP server (www.bio.cmu.edu)
 Files needed for homework assignments
CompBiol project volume on AFS
 /afs/andrew.cmu.edu/usr/murphy/CompBiol
Additional classes for 03-510
We will have one additional class meeting
per week for 03-510 for the first half of the
semester only
 Purpose is to cover some more advanced
material and programming assignments

Other relevant courses

Second half mini-course “47-863: Topics in Operations Research: Computational Biology” will be taught by Dr. R. Ravi
 Tuesday-Thursday 1:30-2:50 starting 3/13
 Recommended for 03-510 students
Fall 2001 course on advanced topics in computational molecular biology will be taught by Dr. Dannie Durand
 Prerequisite: 03-310/311/510
Information flow
A major task in computational molecular
biology is to “decipher” information
contained in biological sequences
 Since the nucleotide sequence of a genome
contains all information necessary to
produce a functional organism, we should in
theory be able to duplicate this decoding
using computers

Review of basic biochemistry
Central Dogma: DNA makes RNA makes
protein
 Sequence determines structure determines
function

Structure
macromolecular structure divided into
 primary structure (1D sequence)
 secondary structure (local 2D & 3D)
 tertiary structure (global 3D)
DNA composed of four nucleotides or "bases": A,C,G,T
RNA composed of four also: A,C,G,U (T transcribed as U)
proteins are composed of amino acids
DNA properties - base
composition

Some properties of long, naturally occurring DNA molecules can be predicted accurately given only the base composition, usually expressed as either
 %GC (the percent of all base pairs that are G:C), or
 GC (the mole fraction of all bases that are either G or C)
 %GC = 100*GC
DNA properties - melting
temperature and buoyant density

Two such properties are
 Tm, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded
 Tm (°C) = 69.3 + 41 GC (for 0.15 M NaCl)
 ρ0, the buoyant density, defined as the density of a solution in which a DNA molecule will feel no net force when centrifuged (the density at the point in a density gradient at which the DNA stops moving, or “bands”)
 ρ0 (g cm^-3) = 1.660 + 0.098 GC (for CsCl)
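Both formulas are linear in GC, so they are easy to evaluate once the mole fraction is known. The sketch below uses the coefficients from the slide; the function names are illustrative, and the input is assumed to be the mole fraction (0 to 1), not %GC.

```python
def melting_temperature_c(gc: float) -> float:
    """Tm in degrees C for 0.15 M NaCl, given GC as a mole fraction."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a mole fraction between 0 and 1")
    return 69.3 + 41.0 * gc


def buoyant_density(gc: float) -> float:
    """Buoyant density in g/cm^3 in CsCl, given GC as a mole fraction."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a mole fraction between 0 and 1")
    return 1.660 + 0.098 * gc
```

For DNA with GC = 0.5, these give a Tm of 89.8 °C and a buoyant density of 1.709 g/cm^3.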
DNA structure - restriction maps
Restriction enzymes cut DNA at specific
sequences.
 A restriction map is a graphical description
of the order and lengths of fragments that
would be produced by the digestion of a
DNA molecule with one or more restriction
enzymes
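The data behind a restriction map can be sketched as a list of cut positions and the fragment lengths between them. The example below is an illustration under stated assumptions: the helper names are hypothetical, the sequence is a toy string, and the enzyme is EcoRI (recognition site GAATTC, cut after the first G). Sites that span the origin of a circular sequence are not detected by this simple search.

```python
def find_sites(seq: str, site: str) -> list[int]:
    """0-based start positions of every occurrence of the recognition site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq: str, site: str, cut_offset: int, circular: bool = False) -> list[int]:
    """Fragment lengths, in map order, after cutting at every site.

    cut_offset is the number of bases of the site that lie before the cut.
    """
    n = len(seq)
    cuts = sorted(p + cut_offset for p in find_sites(seq, site))
    if not cuts:
        return [n]
    if circular:
        # Each fragment runs from one cut to the next, wrapping past the origin.
        # A single cut linearizes the circle into one full-length fragment.
        return [(cuts[(i + 1) % len(cuts)] - cuts[i]) % n or n
                for i in range(len(cuts))]
    bounds = [0] + cuts + [n]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


toy = "AAGAATTCTTTTGAATTCCC"               # toy sequence with two EcoRI sites
linear_fragments = digest(toy, "GAATTC", 1)
circular_fragments = digest(toy, "GAATTC", 1, circular=True)
```

A linear molecule with k cuts yields k + 1 fragments, while a circular plasmid yields k, which is why the two results differ for the same sequence.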

Restriction map of a circular plasmid with one enzyme
[Figure: circular map of plasmid pGEM4 with each AccII site marked]
Restriction map of all enzymes that cut only once
[Figure: circular map of plasmid pGEM4 with the single-cut enzymes marked. Labels recovered from the figure:]
SspBI, BsrGI, Bsp1407I
AcsI, ApoI, EcoRI, Ecl136II, EcoICRI, SacI, SstI, Acc65I, Asp718I, AvaI
NheI, NaeI, NgoMI, NgoAIV
SgrAI
Eco47III, Aor51HI
DsaI, BsmFI
EcoNI
AflIII
AlwNI
AatII
SspI
XmnI, Asp700I
ScaI, Eco255I
XorII, PvuI, BspCI
AhdI, AspEI, Eam1105I, EclHKI
BpmI, GsuI, BglI
AviII, FspI
Transcription
transcription is accomplished by RNA polymerase
RNA polymerase binds to promoters
promoters have distinct regions "-35" and "-10"
efficiency of transcription controlled by binding
and progression rates
transcription start and stop affected by tertiary
structure
regulatory sequences can be positive or negative
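As a rough illustration of how the "-35" and "-10" regions can be searched for, the sketch below scans a sequence for two hexamers separated by a short spacer. Everything specific here is an assumption not stated on the slide: the hexamers are the commonly cited E. coli sigma-70 consensus sequences (TTGACA and TATAAT), the spacer is taken to be 16 to 18 bases, and the function names are hypothetical. Real promoter prediction uses weighted scoring, not simple mismatch counts.

```python
MINUS_35 = "TTGACA"  # assumed -35 consensus (E. coli sigma-70), not from the slide
MINUS_10 = "TATAAT"  # assumed -10 consensus (E. coli sigma-70), not from the slide


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq: str, max_mismatches: int = 1,
                   spacers: tuple[int, ...] = (16, 17, 18)) -> list[tuple[int, int]]:
    """Return (start of -35 box, start of -10 box) for each candidate promoter."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        if mismatches(seq[i:i + 6], MINUS_35) > max_mismatches:
            continue
        for gap in spacers:
            j = i + 6 + gap
            box = seq[j:j + 6]
            if len(box) == 6 and mismatches(box, MINUS_10) <= max_mismatches:
                hits.append((i, j))
    return hits


toy_promoter = "GG" + "TTGACA" + "ACGTACGTACGTACGTA" + "TATAAT" + "GG"
exact_hits = find_promoters(toy_promoter, max_mismatches=0)
```

Allowing more mismatches finds weaker candidate sites at the cost of more false positives.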
RNA processing
eukaryotic genes are interrupted by introns
 these are "spliced" out to yield mRNA
 splicing done by spliceosome
 splicing sites are quite degenerate but not all
are used

Translation
conversion from RNA to protein is by
codon: 3 bases = 1 amino acid
 translation done by ribosome
 translation efficiency controlled by mRNA
copy number (turnover) and ribosome
binding efficiency
 translation affected by mRNA tertiary
structure
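The codon rule (3 bases = 1 amino acid) can be sketched as a table lookup. The example below assumes the standard genetic code and builds the 64-entry table from the conventional TCAG ordering; `translate` and its `to_stop` parameter are illustrative names, and unrecognized codons are reported as "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str, to_stop: bool = True) -> str:
    """Translate an RNA or DNA coding sequence, reading from its first base."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks an unknown codon
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("AUGGCCUAA")` returns `"MA"`: AUG codes for methionine, GCC for alanine, and UAA is a stop codon.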

Protein localization
leader sequences can specify cellular
location (e.g., insert across membranes)
 leader sequences usually removed by
proteolytic cleavage

Posttranslational processing
peptides fold after translation - may be
assisted or unassisted
 processing enzymes recognize specific sites
(amino acid sequences)
 protein signals can involve secondary and
tertiary structure, not just primary structure

Goals of Sequence Analysis
Assigned Reading:
 Baxevanis & Ouellette, Chapter 10

Goals of Sequence Analysis


Management of sequence information
Assembly of sequence fragments into complete
units (proteins, genes, chromosomes)
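The core step in assembling fragments is finding where the end of one fragment overlaps the start of another. The sketch below shows that step for two fragments using exact matching; the function names and the `min_overlap` default are assumptions for illustration, and real assemblers must also tolerate sequencing errors and handle many fragments and both strands.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left: str, right: str, min_overlap: int = 3):
    """Join two fragments at their overlap, or return None if none is found."""
    k = overlap_length(left, right, min_overlap)
    return left + right[k:] if k else None
```

For example, `merge("ACGTAC", "TACGGA")` returns `"ACGTACGGA"`, because the fragments share the three-base overlap TAC.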
Goals of Sequence Analysis

Confirmation and prediction of restriction enzyme sites (for nucleic acids)
 can aid sequence determination in areas of uncertainty by permitting testing of specific bases
 can permit selection of appropriate enzymes for sequence checking
 can permit selection of appropriate enzymes for subcloning or generation of probes
Goals of Sequence Analysis


Finding open reading frames (ORFs) for cDNAs or genomic DNA from organisms without introns
Finding protein coding regions in DNAs using codon usage tables
 not all ORFs are made into proteins
 redundancy in genetic code is not fully reflected in the tRNAs made by a particular organism (codon preference)
 can use to identify "real" coding regions (pseudo-genes "drift" in their codon usage)
 can use expressed sequence tags (ESTs)
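A simple ORF search can be sketched as a scan of each reading frame for a start codon followed by an in-frame stop codon. The example below makes several assumptions for illustration: ATG is the only start codon, the standard stop codons are used, only the given strand is searched (a full search would also scan the reverse complement), and `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ATG...stop ORF on the given strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts coding codons, excluding the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs
```

As the slide notes, not every ORF found this way is made into protein, so hits from a scan like this are candidates to be checked against codon usage or EST evidence.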
Goals of Sequence Analysis

Finding and using consensus sequences
Examples
 promoters
 transcription initiation sites
 transcription termination sites
 polyadenylation sites
 ribosome binding sites
 protein features
use sets of sequences identified (by other means) as related
use sets of sequences identified by sequence comparison
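Given a set of related sequences that are already aligned, a consensus can be sketched as the most common symbol in each column. In the example below, the function name, the majority `threshold`, and the use of "N" for columns without a clear majority are all assumptions for illustration; the input sequences are toy data.

```python
from collections import Counter


def consensus(aligned: list[str], threshold: float = 0.5) -> str:
    """Column-by-column consensus of equal-length, pre-aligned sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must all have the same length")
    result = []
    for column in zip(*(s.upper() for s in aligned)):
        symbol, count = Counter(column).most_common(1)[0]
        # Report the symbol only if it is a strict majority in this column.
        result.append(symbol if count / len(aligned) > threshold else "N")
    return "".join(result)


toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
toy_consensus = consensus(toy_sites)
```

Here `toy_consensus` is `"TATAAT"`, since every column has one symbol present in at least three of the four sequences.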
Goals of Sequence Analysis

Comparison and alignment of sequences
 compare sequence to database - goal: find related sequences (SIMILARITY)
 compare sequence to sequence - goal: find matching domains (ALIGNMENT)
 compare database to database - goal: estimate genetic distance (EVOLUTION)
 either: determine consensus sequences
 comparisons can be pairwise or multiple-strand
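One standard way to compare a sequence to another sequence is global alignment by dynamic programming (the Needleman-Wunsch approach). The slide does not name a method, so this is an illustrative choice; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions. The sketch returns only the optimal score, keeping one row of the table at a time.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,           # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]
```

For example, `global_alignment_score("ACGT", "AGT")` is 2: three matches (+3) and one gap (-1).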
Goals of Sequence Analysis

Translation to protein sequence and prediction of protein properties - use measured propensities of particular amino acids or amino acid stretches
 Predict molecular weight
 Predict isoelectric point (pI)
 Predict extinction coefficient
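Two of these predictions reduce to sums over the amino acid composition. The sketch below is an illustration under stated assumptions: the residue masses are commonly tabulated average values (not taken from the slides), and the extinction coefficient uses a widely used empirical approximation at 280 nm (5500 per Trp, 1490 per Tyr, 125 per disulfide-bonded cystine, in M^-1 cm^-1). The function names and the `cystines` parameter are hypothetical.

```python
# Average residue masses in daltons (amino acid minus water); assumed values.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def molecular_weight(protein: str) -> float:
    """Approximate average molecular weight of an unmodified peptide chain."""
    protein = protein.upper()
    # Sum of residue masses plus one water for the free N- and C-termini.
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS


def extinction_coefficient_280(protein: str, cystines: int = 0) -> int:
    """Approximate molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    protein = protein.upper()
    return 5500 * protein.count("W") + 1490 * protein.count("Y") + 125 * cystines
```

For the dipeptide `"GG"`, `molecular_weight` gives about 132.12 Da. Predicting pI additionally needs a set of side-chain pKa values, which is omitted here.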

Prediction of secondary and tertiary structure
 RNA - use base pairing energies
 protein - use propensities
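Energy-based RNA folding is involved, but the underlying dynamic programming idea can be sketched with a simpler stand-in: maximizing the number of base pairs (the Nussinov approach). This is a named simplification, not the energy method the slide refers to. The function name, the allowed pairs (Watson-Crick plus G:U wobble), and the minimum hairpin loop of 3 unpaired bases are assumptions for illustration.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (pair counting, not energies)."""
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] is the most pairs that can form within rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                      # case: base i unpaired
            for k in range(i + min_loop + 1, j + 1):    # case: base i pairs with k
                if (rna[i], rna[k]) in ALLOWED_PAIRS:
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]
```

For the toy hairpin `"GGGAAAUCC"`, the result is 3: the three G bases pair with U, C and C around the AAA loop. Replacing the pair count with stacking and loop energies leads to the energy-minimization methods used in practice.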