Transcript CG1-intro
Algorithms in
Computational Biology
Fall 2005/6
Lecturer: Benny Chor (benny AT cs.tau.ac.il)
Lectures: Thursdays 18:00-21:00, Schreiber 007
.
Course Information
Requirements & Grades:
20-25% homework, in five-to-six assignments,
containing both “dry” and “wet” problems.
Submission - two weeks from posting.
Homework submission is obligatory.
You are strongly encouraged to solve the
assignments independently (or at least give it a
serious try).
75-80% exam. Must pass beyond 55 for the
homework’s grade to count
2
Bibliography
Biological
Sequence Analysis, R.Durbin et al. ,
Cambridge University Press, 1998
Introduction to Molecular Biology, J. Setubal and
J. Meidanis, PWS publishing Company, 1997
Algorithms on Strings, Trees, and Sequences: Computer
Science and Computational Biology, D. Gusfield,
Cambridge University Press, 1997.
Post-genome
Informatics, M. Kanehisa , Oxford
University Press, 2000.
More refs on course page.
3
Course Prerequisites
Computer Science and Probability Background
Computational Models
Algorithms (“efficiency of computation”)
Probability (any course)
Some Biology Background
Formally: None, to allow CS students to take this course.
Recommended: Some molecular biology course, and/or a serious
desire to complement your knowledge in Biology by reading the
appropriate material.
Studying the algorithms in this course while acquiring enough
biology background is far more rewarding than ignoring the
biological context.
4
Let Us Start from the Beginning
אשית ,בָּ ָּרא אֱלהִׁ ים ,אֵ ת הַ שָּ מַ יִׁ ם ,וְּ אֵ ת הָּ אָּ ֶרץ .
בְּ ֵר ִׁ
וְּ הָּ אָּ ֶרץ ,הָּ יְּ תָּ ה תהּו וָּבהּו ,וְּ חשֶ ְך ,עַל-פְּ נֵי ְּתהֹום;
וְּ רּוחַ אֱלהִׁ יםְּ ,מ ַרחֶ פֶת עַל-פְּ נֵי הַ מָּ יִׁ ם.
וַּיאמֶ ר אֱלהִׁ ים ,יְּ הִׁ י אֹור; וַיְּ הִׁ י-אֹור.
5
And if you prefer the scientific version
13.7 billion years ago:
the big bang.
5 billion years ago:
our sun.
4.5 billion years ago:
planet earth.
3.8 billion years ago:
early forms of life.
3.8 bya – present:
evolution.
Akilia island, Greenland
6
Evolution
A process where small
random changes
rccumulate within
species over time
till new ones form.
The Tree of Life:
A classical, basic science
problem, since Darwin’s 1859
“Origin of Species”.
7
Phylogeny Reconstruction
Goal: Given a set of species, reconstruct
the tree which best explains their
evolutionary history.
8
Phylogenetic Trees Based on What?
1.
2.
3.
Morphology
Single genes
Whole genomes
9
What is Computational Biology?
Computational biology is the application of computational tools
and techniques to molecular biology (primarily). It enables new
ways of study in life sciences, allowing analytic and predictive
methodologies that support and enhance laboratory work. It is a
multidisciplinary area of study that combines Biology, Computer
Science, and Statistics.
Computational biology is also called Bioinformatics, although
many practitioners define Bioinformatics somewhat narrower by
restricting to the application of specialized software for deducing
meaningful biological information.
10
Accepted Model of Species Evolution
Initial period: primordial
soup, where “you are what
you eat”. (recombination
events, horizontal transfers.)
Later period: Formation of
distinct taxa. Speciation
events induce a tree-like
evolution.
11
Species Evolution (2)
The affinities of
all the beings of
the same class
have sometimes
been represented
by a great tree...
Charles Darwin, 1859
12
Why Bio-informatics ?
An explosive growth in the amount of biological information
necessitates the use of computers for cataloging, retrieval
and analyzing mega-data (> 3 billion bps, > 30,000 genes).
• The human genome project.
• Improved technologies, e.g.
automated sequencing.
• GenBank is now approximately
doubling every year !!!
13
New Biotechnologies & Data
• Micro arrays - gene expression.
• 2D gels – protein expression.
• Multi-level maps - genetic,
physical: sequence, annotation.
• Networks of protein-protein
interactions.
• Cross-species relationships • Homologous genes.
• Chromosome organization.
http://www.the-scientist.com/yr2002/apr/research020415.html
14
BioInformatics Tools are Crucial !
• New biotechnology tools generate
explosive growth in the amount of
biological data.
• Impossible to analyze the data manually.
• Novel mathematical, statistical,
algorithmic and computational tools
are necessary !
15
Areas of Interest (very partial list)
•
•
•
•
•
•
Building evolutionary trees from molecular (and other) data
Efficiently reconstructing the genome sequence from subparts (mapping, assembly, etc.)
Understanding the structure of genomes (Genes, SNP, SSR)
Understanding function of genes in the cell cycle and disease
Deciphering structure and function of proteins
Diagnosing cancer based on DNA microarrays (“chips”)
_____________________
SNP: Single Nucleotide Polymorphism
SSR: Simple Sequence Repeat
Much of this class has been edited from Nir Friedman’s lecture which is available at
www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally
Benny Chor. Additional slides from Zohar Yakhini and Metsada Pasmanik.
16
Growth of DNA Sequence Data: GenBank
42 million sequences
(“words”)
45 billion base pairs
(“letters”)
17
PDB
Content
Growth
The Protein Data Bank
02
(Experimentally
determined)
http://www.rcsb.org/pdb/
18
Four Aspects
Biological
What is the task?
Algorithmic
How to perform the task at hand efficiently?
Learning
How to adapt/estimate/learn parameters and
models describing the task from examples
Statistics
How to differentiate true phenomena from
artifacts
20
Example: Sequence Comparison
Biological
Evolution preserves sequences, thus similar genes might
have similar function
Algorithmic
Consider all ways to “align” one sequence against
another
Learning
How do we define “similar” sequences? Use examples to
define similarity
Statistical
When we compare to ~106 sequences, what is a random
match and what is true one
21
Topics I
Dealing with DNA/Protein sequences:
Finding
similar sequences
Models of sequences: Hidden Markov Models
Genome projects and how sequences are found
Transcription regulation
Protein Families
Gene finding
23
Topics II
High throughput biotechnologies –
potentials and computational challenges
DNA microarrays
applications to diagnostics
applications to understanding gene networks
25
Topics III (Structural BioInfo Course)
Protein World:
How proteins fold - secondary & tertiary structure
How to predict protein folds from sequences data
How to predict protein function from its structure
How to analyze proteins changes from raw
experimental measurements (MassSpec)
26
Algorithmics
Will introduce algorithmic techniques that are
useful in computational genomics (and elsewhere):
Dynamic programing, dynamic programing, dynamic..
Suffix trees and arrays
Probabilistic models: PSSM (Position Specific
Scoring Matrices), HMM (Hidden Markov Models)
Learning and classification, SVM (Support Vector
Machines)
Heuristics for solving hard optimization problems
(Many problems in comp. genomics are NP-hard)
27
DNA – The Basic Genetic Material
• Each human cell contains
23 pairs of chromosomes.
• Chromosomes can be
distinguished by size and
by unique banding
patterns.
• Male – XY, Female – XX.
• This set is from a male.
28
Chromosomes
SKY –
Differentially dye-staining chromosomes
29
Watson and Crick
… On Feb. 28,
1953, Francis
Crick walked into
the Eagle pub in
Cambridge,
England, and, as
James Watson
later recalled,
announced that
"we had found the
secret of life."
"The structure was too pretty not
to be true."
-- JAMES D. WATSON, "The Double
Helix"
30
DNA the Code
for Life
(1953)
1920-1958
Died from ovarian cancer
http://www.nobel.se/medicine/laureates/1962/index.html
31
Source: Alberts et al
The Double Helix
33
The Central Dogma of Molecular Biology
Replication
protein
mRNA
DNA
Transcription
A
C
UA A G
CA
G
G U
U
CA
C
Translation
A
Phenotype
34
Watson-Crick Complementarity
Conclusion: DNA strands are complementary (1953).
Base ratios
% of each base
DNA source
Human
Sheep
Turtle
Sea urchin
Wheat
E. coli
Purines/
Pyrimidines
Pyrimidines
Purines
36
Genome Sizes
E.Coli
(bacteria)
Yeast (simple fungi)
Smallest human chromosome
Entire human genome
4.6 x 106 bases
15 x 106 bases
50 x 106 bases
3 x 109 bases
37
Mendel and the Discovery of Genes
• Gregor Mendel, 1822-1884,
studied inheritance patterns
of common pea traits.
• He concluded that traits
were passed through
generations in basic units
which were later called
genes.
38
Genetic Information
Genome – the collection of
genetic information.
Chromosomes – storage
units of genes.
Gene – basic unit of genetic
information. They determine
the inherited characters.
39
What is a Gene ?
Transcribed region
Un-coded
region
promotor
exon
exon
intron
Start codon
Un-coded
region
exon
intron
Terminal codon
DNA contains various recognition sites:
• Promoter signals.
• Transcription start signals.
• Start codon.
• Exon, intron boundaries.
• Transcription termination signal.
40
Control of the Human b-Globin Gene
41
Alternative Splicing
42
42
Genes: How Many?
The DNA strings include:
Coding regions (“genes”)
E. coli has ~4,000 genes
Yeast has ~6,000 genes
C. Elegans has ~13,000 genes
Humans have ~32,000 genes
Control regions
These typically are adjacent to the genes
They determine when a gene should be “expressed”
So called “Junk” DNA (unknown function - ~90% of the DNA
in human’s chromosomes)
43
Gene Finding
• Only 4% of the human genome encodes for functional genes.
• Genes are found along large non-coding DNA regions.
• Repeats, pseudo-genes, introns, contamination of vectors,
are confusing.
44
45
Gene Finding
Existing programs for locating genes within
genomic sequences utilize a number of
statistical signals and employ statistical
models such as hidden Markov models (HMMs).
The problem is not solved
yet, esp. for the newly
discovered “RNA genes”.
46
48
Diversity of Tissues in Stomach
How is this variety encoded and expressed ?
49
Central Dogma
שעתוק
Transcription
Gene
תרגום
Translation
mRNA
Protein
cells express different subset of the genes
In different tissues and under different conditions
50
Transcription
sequences can be transcribed to RNA
Source: Mathews & van Holde
Coding
RNA
nucleotides:
Similar to DNA, slightly different backbone
Uracil (U) instead of Thymine (T)
51
Transcription: RNA Editing
1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists
Exons hold information, they are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules
diffuse through the nucleus membrane to the outer cell plasma.
52
RNA roles
Messenger RNA (mRNA)
Encodes protein sequences. Each three nucleotide acids
translate to an amino acid (the protein building block).
Transfer RNA (tRNA)
Decodes the mRNA molecules to amino-acids. It connects
to the mRNA with one side and holds the appropriate
amino acid on its other side.
Ribosomal RNA (rRNA)
Part of the ribosome, a machine for translating mRNA to
proteins. It catalyzes (like enzymes) the reaction that
attaches the hanging amino acid from the tRNA to the
amino acid chain being created.
...
53
New Roles of RNA
Cellular Regulation
COVER: Researchers are discovering that small RNA molecules play a
surprising variety of key roles in cells. They can inhibit translation of
messenger RNA into protein, cause degradation of other messenger
RNAs, and even initiate complete silencing of gene expression from the
genome.
http://www.sciencemag.org/content/vol298/issue5602/cover.shtml
http://www.nature.com/nature/journal/v408/n6808/fig_tab/408037a0_F1.html
54
Translation in Eukaryotes
http://www1.imim.es/courses/Lisboa01/slide1.6_translation.html
Animation: http://cbms.st-and.ac.uk/academics/ryan/Teaching/medsci/Medsci6.htm
55
Translation
Translation
is mediated by the ribosome
Ribosome is a complex of protein & rRNA
molecules
The ribosome attaches to the mRNA at a translation
initiation site
Then ribosome moves along the mRNA sequence
and in the process constructs a sequence of amino
acids (polypeptide) which is released and folds into
a protein.
56
Genetic Code
There are 20 amino acids from which proteins are build.
57
Protein Structure
Proteins
are polypeptides of 70-3000
amino-acids
This
structure is
(mostly) determined
by the sequence of
amino-acids that
make up the protein
58
Protein Structure
59
60
The Central Paradigm of Bio-informatics
Genetic
information
Molecular
structure
Biochemical
function
Symptoms
61
Similarity Search in Databanks
Find similar sequences
to a working draft.
As databanks grow,
homologies get harder,
and quality is reduced.
Alignment Tools:
BLAST & FASTA
(time saving
heuristicsapproximations).
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query: 17
Sbjct: 1
Query: 77
Sbjct: 60
Pairwise
alignment:
aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
62
Multiple Sequence Alignment
Multiple alignment: Basis for phylogenetic tree
construction. Useful to find protein families
and functional domains.
63
Evolution
Evolution - a process
in which small changes
occur within species
over time.
These changes are mainly monitored today using
molecular sequences (DNA/proteins).
The Tree of Life:
A classical, basic science
problem, since Darwin’s 1859
“Origin of Species”.
64
Evolution
Related
organisms have similar DNA
Similarity in sequences of proteins
Similarity in organization of genes along the
chromosomes
Evolution plays a major role in biology
Many mechanisms are shared across a wide
range of organisms
During the course of evolution existing
components are adapted for new functions
65
Source: Alberts et al
The Tree of Life
67
Phylogeny Reconstruction
Goal: Given a set of species, reconstruct the tree
which best explains their evolutionary history.
68
Trees are Based on What ?
Darwin (Origin of Species, 1859) and his contemporaries
based their work on morphological and physiological
properties (e.g. cold/warm blood, existance of scales,
number of teeth, existance of wings, etc., etc.).
Paleontological data is still
in use when constructing
trees for certain
extinct species (e.g.
dinosaures, mammoths,
moas, unicorns, etc…)
Today most phylogenetic trees are based on
molecular sequence data (DNA or proteins).
70
Phylogenetic Trees Based on What?
1.
2.
3.
Morphology
Single genes
Whole genomes
71
Evolution
www.tomchalk.com/evolution.gif
72
Example for Phylogenetic Analysis
Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken
from four species.
Question: Which evolutionary tree best explains these sequences ?
One Answer (the parsimony principle): Pick a tree that has a
minimum total number of substitutions of symbols between species
and their originator in the evolutionary tree (Also called
phylogenetic tree).
AAA
AAA
1
AAG
2
GGA
AAA
AAA
1
AGA
Total #substitutions = 4
73
DNA Microarrays (Chips)
74
A Modern Use of
WC Complimentarity
A
C
binds to
binds to
T
G
AATGCTTAGTC
TTACGAATCAG
AATGCGTAGTC
TTACGAATCAG
Perfect match
One base mismatch
75
Array Based Hybridization Assays
(DNA Chips)
Unknown sequence (target)
Many copies.
Array of probes
76
Array Based Hyb
Assays
• Target hybs to WC complimentary probes only
• Therefore – the fluorescence pattern is indicative of the
target sequence.
77
Microarrays (“DNA Chips”)
Leading edge, future technologies (since 1988):
In a single experiment, measure expression level of
thousands of genes.
•
Find informative genes that may
have predictive power for
medical diagnosis.
•
Potential for personalized
medicine, e.g. kits for identifying
cancer types and prescribe “personal” treatment.
78
DNA Chips - Structure
• Each chip has n “pixels” on it.
Every pixel contains copies of
a probe from a single gene.
• Do m experiments:
Cells in each experiment
are taken from different conditions:
(different phase of cell cycle, different
patient, different type of tissue etc.).
• Purpose:
Measure mRNA expression
levels (Color coded) of all
n genes in one experiment.
79
Gene Expression
Matrix
•
Rows correspond to genes.
(Typically n between 500 and 15,000).
•
Columns correspond to experiments.
(Typically m between 10 and 200).
•
Entryi, j = expression level
of gene i, in experiment j.
80
Algorithmic Challange
Analyse the vast amount of data in gene expression
matrices.
Discover meaningful biological structures and
functions.
And now, time for a break
81
Are We Close to Being Done ?
“Now this is not the end.
It is not even
the beginning of the end.
But it is perhaps,
the end of the beginning”.
Winston Churchill, 1942 (3 years into WW2)
http://www.globecartoon.com/neweconomy/10.html
82