Transcript Slides

Spring 2006 – http://www.stanford.edu/class/cs273/
CS273
Algorithms for Structure and Motion in Biology
Instructors:
Serafim Batzoglou and Jean-Claude Latombe
Teaching Assistant: Sam Gross
| serafim | latombe | ssgross | @ cs.stanford.edu
Need a Scribe!!
Range of Bio-CS Interaction
Enormous range over space and time
Scales: Body system, Tissue/Organs, Cells, Molecules, Gene
Applications:
• Sequence alignment
• Robotic surgery
• Soft-tissue simulation and surgical training
• Simulation of cell interaction
• Molecular structures, similarities and motions (CS273)
Focus on Proteins
• Proteins are the workhorses of all living organisms
• They perform many vital functions, e.g.:
  - Catalysis of reactions
  - Transport of molecules
  - Building blocks of muscles
  - Storage of energy
  - Transmission of signals
  - Defense against intruders
Proteins are also of great interest from a computational viewpoint
• They are large molecules (a few hundred to several thousand atoms)
• They are made of building blocks (amino acids) drawn from a small "library" of 20 amino acids
• They have an unusual kinematic structure: a long serial linkage (backbone) with short side-chains
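To make the "long serial linkage" view concrete, here is a minimal sketch of how one joint angle of such a chain could be computed: the torsion (dihedral) angle defined by four consecutive backbone atoms. The function names and the test coordinates are illustrative assumptions, not material from the slides.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond.

    p0..p3 are four consecutive atoms along the chain (hypothetical input).
    """
    b0 = sub(p1, p0)          # bond p0 -> p1
    b1 = sub(p2, p1)          # central bond p1 -> p2 (rotation axis)
    b2 = sub(p3, p2)          # bond p2 -> p3
    n1 = cross(b0, b1)        # normal of the plane (p0, p1, p2)
    n2 = cross(b1, b2)        # normal of the plane (p1, p2, p3)
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))
```

In a kinematic model of this kind, bond lengths and bond angles are often treated as fixed, so a short list of torsion angles per residue describes the backbone shape with far fewer parameters than the Cartesian coordinates of every atom.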
Proteins are associated with many challenging problems
• Predict folded structures and motion pathways
• Understand why some proteins misfold or partially fold, causing diseases such as cystic fibrosis, Parkinson's disease, and Creutzfeldt-Jakob disease (related to mad cow disease)
• Find structural similarities among proteins and classify proteins
• Find functional structural motifs in proteins
• Predict how proteins bind to other proteins and smaller molecules
• Design new drugs
• Engineer and design proteins and protein-like structures (polymers)
Central Dogma
of Molecular Biology
[Figure: central dogma diagram; step labels: transcription, translation]
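The two steps named on this slide can be sketched in a few lines of code. This is a minimal illustration, assuming the input is the DNA coding strand read in frame from its first base; the function names are hypothetical, and the table is the standard genetic code written in the usual TCAG ordering.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna_coding_strand):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> one-letter protein sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

protein = translate(transcribe("ATGGCCAAGTAA"))  # Met-Ala-Lys, then stop
```

The sketch ignores everything a cell actually does around these steps (promoters, splicing, start-codon search); it only shows the mapping from a nucleotide string to a residue string.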
Protein Sequence
[Figure: polypeptide backbone; label: (residue i-1)]
• Long sequence of amino acids (dozens to thousands), also called residues
• Dictionary of 20 amino acids (several billion years old)
Protein Sequence
[Figure: polypeptide backbone; label: Peptide bond (partial double bond character)]
Central Dogma
of Molecular Biology
Physiological conditions:
aqueous solution, 37°C, pH 7,
atmospheric pressure
Levels of Protein Structures
Quaternary: hemoglobin (4 polypeptide chains)
Mostly α-helices
Mostly β-sheets
Mixed
[Figure: folding from the unfolded (denatured) state to the folded (native) state through intermediate states, along many pathways]
How (we think) a protein folds ...
ΔG = ΔH - TΔS
http://www-shakh.harvard.edu/ProFold2.html
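As a worked example of the free-energy relation on this slide (change in Gibbs free energy = change in enthalpy minus temperature times change in entropy), the sketch below evaluates it for folding at 37°C. The enthalpy and entropy values are made-up illustrative numbers, not measurements for any real protein, and the function names are assumptions.

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Free-energy change: enthalpy change minus temperature times entropy change.

    Units are assumed to be kJ/mol for delta_h, K for temperature,
    and kJ/(mol*K) for delta_s.
    """
    return delta_h - temperature * delta_s

def melting_temperature(delta_h, delta_s):
    """Temperature at which the free-energy change is zero (same units)."""
    return delta_h / delta_s

# Hypothetical folding values: folding releases heat (negative enthalpy change)
# but orders the chain (negative entropy change).
DELTA_H = -200.0      # kJ/mol, illustrative only
DELTA_S = -0.6        # kJ/(mol*K), illustrative only
BODY_TEMP = 310.15    # K, i.e. 37 degrees Celsius

delta_g = gibbs_free_energy(DELTA_H, BODY_TEMP, DELTA_S)   # about -13.9 kJ/mol
t_melt = melting_temperature(DELTA_H, DELTA_S)             # about 333.3 K
```

With these numbers the free-energy change is negative at body temperature, so the folded state is favored, but only by a small margin compared with the two large opposing terms. Above the computed melting temperature the sign flips and the unfolded state is favored.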
Motion of Proteins in Folded State
HIV-1 protease
Structural variability of the overall ensemble of native ubiquitin structures [Shehu, Kavraki, Clementi, 2005]
Flexible loop (Loop 7, amylosucrase)
Central Dogma
of Molecular Biology
Binding
Inhibitor binding to HIV protease
Ligand-protein binding
Protein-protein binding
Binding of pyruvate to LDH (lactate dehydrogenase)
(reduction of pyruvate to lactate)
[Figure: lactate dehydrogenase environment around bound pyruvate. Labels: Loop, GLN-101, ARG-106, Pyruvate, ASP-195, NADH (nicotinamide adenine dinucleotide, coenzyme), HIS-193, THR-245, ASP-166, ARG-169]
What is CS273 about?
• Algorithms and computational schemes for molecular biology problems
• Molecular biology seen by computer scientists
The Shock of Two Cultures
• y = f(x)
• Biologists like experiments, specifics, and classifications.
  They would rather know many (x_i, y_i) pairs, i.e., facts, and classify them than know f.
• Computer scientists like simulation, abstractions, and general algorithms.
  They want to know f, the explanation of the facts, and efficient ways to compute it, but rarely care about any particular (x_i, y_i).
• One challenge of Computational Biology is to fuse these two cultures
Two Views of a BioComputation Class
• Where IT resources for biology are available and how to use them
• How to design efficient data structures and algorithms for biology
Main Ideas Behind CS273
1. The information is in the sequence
• Sequence → Structure (shape) → Function
• Sequence similarity → Structural/functional similarity
• Sequences are related by evolution
2. Biomolecules move and bind to achieve their functions
• Deformation → folded structures of proteins
• Motion + deformation → multi-molecule complexes
• One cannot just "jump" from sequence to function
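The idea that sequence similarity hints at structural or functional similarity presupposes a way to score how similar two sequences are. A minimal sketch of one classic approach, global alignment by dynamic programming (Needleman-Wunsch), is given below. The match, mismatch, and gap scores are arbitrary illustrative assumptions; practical tools use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Only the score is computed, keeping one row of the dynamic-programming
    table at a time. The scoring parameters are illustrative defaults.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # a[i-1] against a gap
                curr[j - 1] + gap,           # b[j-1] against a gap
            )
        prev = curr
    return prev[-1]

identical = global_alignment_score("ACDE", "ACDE")
one_deletion = global_alignment_score("ACDE", "ACE")
```

The running time is proportional to the product of the two sequence lengths, which is one reason faster heuristics matter when searching large sequence databases.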
[Diagram: Sequence, Structure, Function. Labels: protein folding, ligand-protein binding, sequence similarity, structure similarity]
• CS273 is about algorithms for sequence, structure and motion
  - Finding sequence and shape similarities
  - Relating structure to function
  - Extracting structure from experimental data
  - Computing and analyzing motion pathways
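For the shape-similarity item, one simple measure that needs no superposition step is the distance RMSD (dRMSD): compare the matrices of intra-molecular distances of two conformations with the same atoms in the same order. The sketch below is an illustration under that assumption; the function name and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations

def drmsd(coords_a, coords_b):
    """Root-mean-square difference between the pairwise-distance matrices
    of two conformations given as equal-length lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b):
        raise ValueError("conformations must have the same number of atoms")
    pairs = list(combinations(range(len(coords_a)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        dist_a = math.dist(coords_a[i], coords_a[j])
        dist_b = math.dist(coords_b[i], coords_b[j])
        total += (dist_a - dist_b) ** 2
    return math.sqrt(total / len(pairs))

# Toy conformations (illustrative only).
original = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
shifted = [(5, 5, 5), (6, 5, 5), (6, 6, 5)]     # same shape, translated
stretched = [(0, 0, 0), (2, 0, 0), (2, 1, 0)]   # first bond lengthened
```

Because only internal distances are used, the measure is unchanged by translating or rotating either conformation. It also cannot tell a shape from its mirror image, which is one reason coordinate RMSD after optimal superposition is often used instead.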
Vision Underlying CS273
• Goal of computational biology:
  Low-cost, high-bandwidth in-silico biology
• Requirements:
  Reliable models and efficient algorithms
• Algorithmic efficiency by exploiting properties of molecules and processes:
  - Proteins are long kinematic chains
  - Atoms cannot bunch up together
  - Forces have relatively short ranges
• Computational Biology is more than applying computers to biological problems or mimicking nature (e.g., performing molecular dynamics (MD) simulation)
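Two of the properties listed above (atoms cannot bunch up, forces have short ranges) are what make grid-based neighbor search effective. The sketch below shows the general idea, assuming atoms are given as (x, y, z) tuples and interactions are cut off at a fixed distance. It is an illustration of the technique, not the specific method taught in the course, and all names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def neighbor_pairs(points, cutoff):
    """Return sorted index pairs (i, j), i < j, whose distance is below cutoff.

    Points are hashed into cubic cells of side `cutoff`, so each point only
    needs to be compared with points in its own and the 26 adjacent cells.
    """
    grid = defaultdict(list)
    for index, point in enumerate(points):
        cell = tuple(math.floor(c / cutoff) for c in point)
        grid[cell].append(index)

    pairs = set()
    for cell, members in grid.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbor_cell = tuple(c + o for c, o in zip(cell, offset))
            # .get avoids inserting empty cells while iterating over the grid
            for i in members:
                for j in grid.get(neighbor_cell, ()):
                    if i < j and math.dist(points[i], points[j]) < cutoff:
                        pairs.add((i, j))
    return sorted(pairs)

# Toy input (illustrative only): two well-separated pairs of atoms.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0), (5.5, 5.0, 5.0)]
close = neighbor_pairs(atoms, cutoff=1.5)
```

If atoms cannot overlap, each cell holds at most a bounded number of them, so the total work grows roughly linearly with the number of atoms instead of quadratically as in an all-pairs comparison.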
Tentative Schedule
 #   Date       Topic
 1   April 5    Introduction
 2   April 10   Protein geometric and kinematic models
 3   April 12   Conformational space
 4   April 17   Inverse kinematics and applications
 5   April 19   Sequence similarity
 6   April 24   Sequence similarity
 7   April 26   Sequence similarity
 8   May 1      Structure comparison
 9   May 3      Structure comparison
10   May 8      Protein phylogeny, clustering, and classification
11   May 10     Protein phylogeny, clustering, and classification
12   May 15     Energy maintenance
13   May 17     Energy maintenance
14   May 22     Structure prediction
15   May 24     Roadmap methods
16   May 31     Structure prediction
17   June 5     Structure prediction
18   June 7     TBA
19   June 12    Project presentations (2 hours)
Instructors and TAs
• Instructors:
  – Serafim Batzoglou
  – Jean-Claude Latombe
• TA:
  – Sam Gross
• Emails: | serafim | latombe | ssgross | @ cs.stanford.edu
• Class website: http://cs273.stanford.edu
Expected Work
• Regular attendance at lectures and active participation
• Class scribing (assignments will depend on the number of students)
• Exciting programming project:
  http://www.stanford.edu/class/cs273/project/project.html
  - Structure prediction
  - Clustering and distance metrics
  - Protein design
  - Something else
Questions?