Transcript Slide 1

Bios 560R, Lecture 2
Genome sequencing (very brief)
Overview of HGP
Shotgun sequencing
Next generation sequencing
Sequence Alignment by Dynamic
Programming
Overview
The difficulty
The scoring system
Dynamic programming for pairwise alignment
Human genome project (HGP) – some milestones
Science 9 February 2001: Vol. 291. no. 5507, p. 1195
1977.
Allan Maxam and Walter Gilbert at Harvard University and Frederick
Sanger at the U.K. Medical Research Council (MRC) independently develop
methods for sequencing DNA.
1988.
NIH establishes the Office of Human Genome Research.
1995.
Patrick Brown of Stanford and colleagues publish first paper using a
printed glass microarray of complementary DNA (cDNA) probes
Researchers at Whitehead and Généthon (led by Lander and Thomas
Hudson at Whitehead) publish a physical map of the human genome containing
15,000 markers.
1996.
NIH funds six groups to attempt large-scale sequencing of the human
genome.
Affymetrix makes DNA chips commercially available.
An international consortium publicly releases the complete genome
sequence of the yeast
1998.
a new company named Celera.
NIH announces a new project to find SNPs
2001.
The HGP consortium publishes its working draft in Nature (15 February),
Celera publishes its draft in Science (16 February).
2006.
Sequence of all chromosomes finalized.
Genome sequencing strategies
Shotgun sequencing.
Shotgun: fires a number of small spherical pellets.
http://www.shotguninsight.com/expressFibreVsPlastic/proFibre5_25yds80percent.jpg
The genome DNA (~ 3 billion base pairs, in 22 autosomes and 2 sex
chromosomes) is too long for sequencing directly.
The DNA is randomly broken is to small pieces.
The pieces are sequenced.
The short sequences are assembled into long sequences based on the
overlapping of fragments.
Genome
Reads
Contigs
(islands)
Genome sequencing strategies
The Lander-Waterman model. Genomics. 1988 Apr;2(3):231-9.
Important result about coverage (direct quotes from the paper):
Genome sequencing strategies
Proc Natl Acad Sci U S A. 2002 March 19; 99(6): 3712–3716.
HGP
“additional
preliminary work”
Celera
“a greater risk of longrange misassembly”
Whole genome using next generation sequencing ?
Next generation
sequencing generates
much shorter reads at
higher throughput.
nature biotechnology 26(10):1135.
Sanger method
Next Generation
Whole genome using next generation sequencing ?
Application of next-generation sequencing (NGS) in various ways:
Nature Reviews Drug Discovery 12, 358–369 (2013)
Between protein and gene sequences
- The gene sequence (DNA/RNA) are made of 4 characters
(nucleotides)
- The protein sequence
is made of 20 characters
(amino acids)
- The DNA codons:
- When discussing
alignment, we consider
both DNA sequence
and protein sequence.
Sequence alignment
• Alignment answers the question of whether two or more
sequences are (evolutionarily) related.
• Example: A new gene is found in human. We wish to study its
properties. To get a hint, we try to find its corresponding part in
mouse. Among the tens of thousands of genes in mouse, which is
the one that’s most related to this human gene? (BLAST.)
• Sequence segments more important to biological functions are
more conserved.
Align proteins sharing a function to find peptide sequences more
important to the function.
Align DNA sequences upstream to (functionally or evolutionarily)
related genes to find segments that bind transcription factors.
Sequence alignment
Kapitonov VV, Jurka J PLoS Biology Vol. 3, No. 6, e181
Sequence alignment
http://pevsnerlab.kennedykrieger.org/wiley/bestfit2.gif
Sequence alignment
• Alignment serves as the basis for determining
evolutionary relationships.
• Example:
http://www.computational-genomics.net/case_studies/sabertooth_demo_37.png
Sequence alignment
In the design and/or data interpretation of
sequence-based high-throughput technologies,
sequence comparison is necessary:
-To find microarray probes that do not have similar
sequence with other genes.
-To match sequences in high-throughput
sequencing data to genome.
-To find motifs based on ChIP-chip/ChIP-seq data.
……
Difficulty in alignment
The huge search space (the number
of possible alignments) !
Sequence 1: GSAQVK
Sequence 2: GNPKVK
GSAQVK----GSAQVK--------GNPKVK
----GNPKVK
GSAQ--VK
-G-NPKVK
-----GSAQVK
GNPKVK-----
………………… How to pick the best??
The scoring system
• A quasi-statistical model
log-likelihood ratio:
  Px i yi 
 Px i yi
 P ( x, y | M ) 
   log 
  log 
log 
 P ( x, y | R ) 
  Px i  Py i 
 Px i Py i
x: sequence segment 1,
y: sequence segment 2
i: aligned position
Pxiyi: Prob(xi and yi are in aligned position)
Pxi: Prob(xi occurs)
Pyi: prob(yi occurs)
M: this aligned model
R: x, y are independent sequences



The scoring system
Note: this kind of “model” is for
illustration purposes only!
There are many possible models:
M1, M2, ……, Mn
According to the Bayes Rule,
Px, y | M i P( M i )
PM i | x, y  
 Px, y | M i 
 P x, y | M j P(M j )


This is MLE also since we are using an uninformative prior.
The scoring system
• Given the above, the best
alignment model is the one that
maximizes
 Px i yi
 log Px Py
 i i



Notice that there are a limited number of possible xi, yi
pairs. For amino acids, there are 20 choose 2 pairs, plus
any amino acid against Gap (all equal values : the “gap
penalty”).
How do we get these values ????
The scoring system
• These values are obtained from
some high-confidence alignments:
GAPVKFC
 Pxi yi 


log
GAPKKFC
 Px Py 
 i i
The log likelihood ratio of the pair occurring as an aligned
pair as opposed to an unaligned pair.
It can be obtained in multiple ways. Generally, such a
matrix describes the rate that one character (amino acid
or nucleotide) is replaced by another one during
evolution.
The scoring system
PAM (Point Accepted Mutation) matrix is calculated
from closely related sequences. Finding aligned
positions is simple as there are only a few positions
where the two sequences are different.
The number following “PAM” denotes evolutionary
distance.
PAM1 matrix considers only direct mutations.
Taking this matrix to N-th power with proper
normalization yields PAM-N matrix.
The scoring system
http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformatics_mscIT/slides/slide
s3/img033.gif
The scoring system
BLOSUM (BLOcks of Amino Acid SUbstitution Matrix).
It comes from blocks of aligned sequences with less
similarity than those used by PAM.
The log-odds ratio are calculated directly from the observed
alignment. There is no extrapolation as PAM.
http://en.wikipedia.org/wiki/Image:BLOSUM62.gif
Dynamic programming
• Now, for every pair of amino acids, we
have a score that can be considered
the log likelihood ratio. So what? Do
we still need to perform endless
computations to compare all models?
• The answer is of course NO.
• Different models share sub-models. We
should not re-calculate the score
GSAQVK
GNPKVK
G-SA-QVK
GN-PK-VK
GSA-QVK
G-NPKVK
• For non-optimal models, we do not need
to know their exact score.
Dynamic programming
“Two sledgehammers of the algorithms craft, dynamic
programming and linear programming”
Dynamic programming:
Breaking the overall optimization problem into overlapping
smaller problems;
Solve each sub-problem once, and reuse the results, thus
reducing the computing cost (dramatically);
Often working backward.
Dynamic programming
A simple example:
Find the shortest path from S to E in the directed acyclic
graph below.
Take node D as an example. The way to get to D is
through B or C. So,
Algorithms, Dasgupta et al.
Dynamic programming
Linearization:
Algorithm:
Algorithms, Dasgupta et al.
Dynamic programming
Compare:
(1) Exhaustive approach
SABE: 1+6+2
SCABE: 2+4+6+2
SCDE: 2+3+1
SCABDE: 2+4+6+1+1
11 additions.
Complexity grows exponentially with
the size of graph
(2) DP approach
Dist(A)=min(1, 2+4)=1
Dist(C)=2
Dist(B)=min(dist(A)+6)=1+6=7
Dist(D)=min(dist(B)+1, dist(C)+3)=min(7+1, 2+3)=5
Dist(E)=min(dist(B)+2, dist(D)+1)=min(7+2, 5+1)=6
6 additions.
Complexity grows linearly with the
size of graph
Dynamic programming
Another example: A game of picking up matches.
There are 30 matches on the table. You start by picking up
1~3 matches, then your opponent picks up 1~3 matches. This
goes on until the last match is picked up. The person picking
up the last is the loser.
The first step: this is what you want to leave to your opponent
29x
1, 5, 9, 13, 17, 21, 25, 29
Two steps back: this is what you want to leave to your opponent
One step back: this is what you want to leave to your opponent
Last step: you want to leave the last match to your opponent
Operations Research: mathematical programming, Winston WL
Dynamic programming
Two critical components of dynamic
programming sequence alignment:
(1)The assumption that the score of an
aligned pair of amino acids is
independent of other parts of the
alignment.
(2)Due to the additive nature of the
likelihood, any sub-alignment of
the optimal overall alignment must
also be optimal. Thus the search
for the global optimum becomes a
sequential search for local optima.
Dynamic programming
Any sub-model of the best model must be
optimal.
Example. if this is the best model,
GS-AQVK
G-NPKVK
Then any part of it, for example the middle
part
-AQ
NPK
Must be better than any other possible submodel. Otherwise, say if it is not as good as
A-Q
NPK
Then,
GSA-QVK
G-NPKVK
Is better than the “best” model.
Dynamic programming
Dynamic programming is used when the overall solution
can be achieved by recursively solving sub-problems.
It saves computing time by re-using results of submodels.
GSAQVK
GNPKVK
G-SA-QVK
GN-PK-VK
GSA-QVK
G-NPKVK
In contrast, the “brute-force” method that simply lists all
possible alignments and re-calculate the scores for every
model solves the same sub-problem over and over again.
Needleman-Wunsch algorithm for global alignment
Create a matrix of dimension (m+1)x(n+1), where m and n
are the lengths of the two sequences.
Let M(0,0)=0
Iteratively calculate other values in the matrix, while
recording which way (align, gap-x, gap-y) led to the
maximum for each cell.
Mi,j = MAXIMUM[
Mi-1, j-1 + Si,j (match/mismatch in the diagonal),
Mi,j-1 + w (gap in sequence #1),
Mi-1,j + w (gap in sequence #2)
]
A
W
G
H
E
P
A
W
H
E
A
Gap penalty: -8
E
Needleman-Wunsch algorithm
http://www.matfys.kvl.dk/databehandling/f2004/eksempler/word/needlemanwunsch.png
Smith-Waterman algorithm
The previous algorithm looks for the best OVERALL
alignment between two sequences. A more common
task is to find best SUB-SEQUENCE alignments
between two sequences.
Mi,j = MAXIMUM[
0,
Mi-1, j-1 + Si,j (match/mismatch in the diagonal),
Mi,j-1 + w (gap in sequence #1),
Mi-1,j + w (gap in sequence #2)
]
Smith-Waterman algorithm
http://www.bi.a.u-tokyo.ac.jp/~shimizu/bioinfo/s-w.gif
Comments
In biological sequences, a gap is often longer than 1. For
example, the insertion of a new functioning domain. A long
gap may be penalized too much by the previous methods.
- assign two different penalties:
gap initiation penalty.
gap extension penalty.
Now at every position, 3 values, instead of 1, are kept.
Significance of an alignment?
- Bayesian approach model comparison.
- significance derived from extreme value distribution.
Difficulty in higher-dimensions
When we have two sequences to align, we can use dynamic
programming with o(n^2) calculations.
Can we use the same idea when we have more sequences?
It is hard. For example, when we have 3 sequences, instead
of operating in a square, we will be operating in a cube.
The number of calculations explodes with more sequences.
http://www.bioinf2.leed
s.ac.uk/b/MA_sect.gif