Transcript Lecture7

Dynamic Programming
(cont’d)
CS 466
Saurabh Sinha
Affine Gap Penalties
• In nature, a series of k indels often come as a single
event rather than a series of k single nucleotide
events:
ATA__GGC
ATGATCGC
This is more
likely.
ATA_G_GC
ATGATCGC
Normal scoring would
give the same score This is less
for both alignments
likely.
Accounting for Gaps
• Gaps- contiguous sequence of spaces in one of the rows
• Score for a gap of length x is:
-(ρ + σx)
where ρ >0 is the penalty for introducing a gap:
gap opening penalty
ρ will be large relative to σ:
gap extension penalty
because you do not want to add too much of a penalty for
extending the gap.
Affine gap penalty in DP
• When computing si,j, need to look at si,j-1,
si,j-2, si,j-3,…. and si-1,j, si-2,j, …
• Each cell needs O(n) time for update
• O(n2) cells
• Therefore, O(n3) algorithm
• We can still do this in O(n2) time
Affine Gap Penalty
Recurrences
si,j =
max
s i-1,j - σ
s i-1,j –(ρ+σ)
Continue Gap in w (deletion)
Start Gap in w (deletion): from middle
si,j =
max
s i,j-1 - σ
s i,j-1 –(ρ+σ)
Continue Gap in v (insertion)
si,j =
max
si-1,j-1 + δ (vi, wj) Match or Mismatch
End deletion: from top
s i,j
End insertion: from bottom
s i,j
Start Gap in v (insertion):from middle
Reading assignment
Section 6.10 (J & P)
Multiple Alignment
Gene Prediction
Gene Prediction: Computational Challenge
• Gene: A sequence of nucleotides coding
for protein
• Gene Prediction Problem: Determine the
beginning and end positions of genes in a
genome
SOURCE:
http://www.bioscience.org/atlases/genecode/genecode.htm
Codons
• In 1961 Sydney Brenner and Francis Crick
discovered frameshift mutations
• Systematically deleted nucleotides from
DNA
– Single and double deletions dramatically
altered protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides,
each codon, codes for exactly one
amino acid in a protein
Great Discovery Provoking Wrong Assumption
• In 1964, Charles Yanofsky and Sydney Brenner
proved collinearity in the order of codons with
respect to amino acids in proteins
• As a result, it was incorrectly assumed that the
triplets encoding for amino acid sequences form
contiguous strips of information.
Exons and Introns
• In eukaryotes, the gene is a combination of
coding segments (exons) that are interrupted by
non-coding segments (introns)
• This makes computational gene prediction in
eukaryotes even more difficult
• Prokaryotes don’t have introns - Genes in
prokaryotes are continuous
Splicing
exon1
intron1
exon2
intron2
exon3
transcription
splicing
exon = coding
intron = non-coding
translation
Batzoglou
Gene prediction
• More difficult in eukaryotes than in
prokaryotes (due to introns).
• In human genome, ~3% of DNA
sequence is genes
• Lot of “junk” DNA between genes, and
even inside genes (between exons).
• Gene prediction must deal with this.
Gene prediction: broadly
speaking
• Statistical approaches:
look for features than appear frequently
in genes and infrequently elsewhere
• Similarity based approaches:
a newly sequenced gene may be similar
to a known gene.
– even this is not so simple. The exon
structures may be different between
otherwise similar genes
Statistical approaches
Splicing Signals
Exons are interspersed with introns and
typically flanked by GT and AG
Splice site detection
Donor site
5’
3’
Position
%
A
C
G
T
-8 … -2 -1
26
26
25
23
…
…
…
…
0
1
2
… 17
60 9 0 1 54 … 21
15 5 0 1 2 … 27
12 78 99 0 41 … 27
13 8 1 98 3 … 25
From lectures by Serafim Batzoglou (Stanford)
Consensus splice sites
Donor: 7.9 bits
Acceptor: 9.4 bits
Splicing and gene prediction
• Using splice sites (profiles) to predict
genes ?
• Limited scope, too many false
predictions
Open Reading Frames (ORFs)
• Let us consider gene prediction in prokaryotes (no introns)
• Detect potential coding regions by looking at ORFs
– A region of length n is comprised of (n/3) codons
– Stop codons break genome into segments between
consecutive Stop codons
– The subsegments of these that start from the Start codon
(ATG) are ORFs
ATG
TGA
Genomic Sequence
Open reading frame
ORFs
• 6 reading frames in any given sequence
– 6 ways to map the DNA sequence to codon
sequence (+1,+2,+3,-1,-2,-3)
– 3 on either strand
• Look at all 6 reading frames for ORFs
Long vs.Short ORFs
• Long open reading frames may be a gene
– At random, we should expect one stop codon
every (64/3) ~= 21 codons
– However, genes are usually much longer than
this
• A basic approach is to scan for ORFs whose length
exceeds certain threshold
– This is naïve because some genes (e.g. some
neural and immune system genes) are relatively
short
Codon usage
• In a given sequence (e.g., an ORF), compute
frequency distribution of codons (64 element
array): codon usage array
• Codon usage array for coding sequences is
different from that for non-coding sequences
• If the codon usage array for an ORF is much
more similar to that of coding sequences than
to that of non-coding sequences, the ORF
could be a gene
Codon usage
• Codons coding for “Arg” in human:
– CGU: 37%, CGC: 38%, CGA: 7%, CGG:
10%, AGA: 5%, AGG: 3%
– In a coding sequence, codon CGC is 12
times more likely than codon AGG
– An ORF preferring CGC over AGG is likely
to be a gene
Codon Usage in Human Genome
Codon usage
• One way to test if an ORF is a gene is to
compute
– Pr(ORF sequence under a coding sequence
model)
– Pr(ORF sequence under a non-coding model)
– Ratio of the two.
• These methods work best in prokaryotes
• The exon-intron trouble is not handled yet
• Hidden Markov models that use codon usage
ideas and splice site ideas, all in one
– We’ll see more of this in second half of course
Promoter Structure in Prokaryotes
(E.Coli)
Transcription starts
at offset 0.
• Pribnow Box (-10)
• Gilbert Box (-30)
• Ribosomal
Binding Site (+10)
Ribosomal Binding Site
Statistical approaches:
summary
• Splicing sites
• Codon usage
• Promoter motifs, such as -10 element,
-30 element
• Ribosome binding site
Similarity based approaches
• Some genomes may be very well-studied,
with many genes having been
experimentally verified.
• Closely-related organisms may have
similar genes
• Unknown genes in one species may be
compared to genes in some closelyrelated species
The basic approach
• Given a protein sequence, and a genomic
sequence, find a set of substrings of the
genomic sequence whose concatenation best
fits the protein sequence
• First cut: Find fragments in the genomic
sequence that match portions of the protein
sequence (local alignment)
• Then find the “optimal” subset of nonoverlapping fragments
Exon chaining
• Each of the fragments of the genomic
sequence that somewhat match the protein
(locally) is a putative exon
• The “goodness” of the match is the “weight”
assigned to this putative exon
• Thus, we have a set of weighted intervals
(l,r,w): for a fragment from l to r, with weight w
representing how well it matches (a portion
of) the protein
Exon Chaining Problem
• Input: A set of weighted intervals (l,r,w)
• Output: A maximum weight chain of
non-overlapping intervals from this set
Exon Chaining Problem: Graph Representation
edge from every li to ri
edge between every two successive vertices
21
• This problem can be solved with dynamic
programming in O(n) time.
Assumptions
• No two intervals have a common
boundary point. So the (li,ri) define 2n
distinct points, if there are n intervals
Exon Chaining Algorithm
ExonChaining (G, n) //Graph, number of intervals
for i ← to 2n
si ← 0
for i ← 1 to 2n
if vertex vi in G corresponds to right end of the
interval I
j ← index of vertex for left end of the interval I
w ← weight of the interval I
si ← max {sj + w, si-1}
else
si ← si-1
return s2n
Not very helpful
• A chain is a set of non-overlapping
exons in order (left to right)
• But the matching protein portions may
not be in the same order !
Spliced Alignment
• Begins by selecting either all putative exons
between potential acceptor and donor sites or by
finding all substrings similar to the target protein
(as in the Exon Chaining Problem).
• This set is further filtered in a such a way that
attempt to retain all true exons, with some false
ones.
• Then find the chain of exons such that the
sequence similarity to the target protein
sequence is maximized
Spliced Alignment Problem: Formulation
• Input: Genomic sequences G, target
sequence T, and a set of candidate
exons (blocks) B.
• Output: A chain of exons Γ such that
the global alignment score between Γ*
and T is maximized
Γ* - concatenation of all exons from chain Γ
The DAG
•
•
•
•
Vertices: One vertex for each block in B
Directed edge connecting non-overlapping blocks
Label of vertex = string of block it represents
A path through the DAG spells out the string
obtained by concatenating that particular chain of
blocks
• Weight of a path is the score of the optimal
alignment between the string it spells out and the
target sequence
Dynamic programming
• Genomic sequence G = g1g2…gn
• Target sequence T = t1t2…tm
• As usual, we want to find the optimal
alignment score of the i-prefix of G and
the j-prefix of T
• Problem is, there are many i-prefixes
possible (since multiple blocks may
include position i)
Idea
• Find the optimal alignment score of the
i-prefix of G and the j-prefix of T
assuming that this alignment uses a
particular block B at position i
• S(i, j, B)
• For every block B that includes i
Recurrence
If i is not the starting vertex of block B:
• S(i, j, B) =
max { S(i – 1, j, B) – indel penalty
S(i, j – 1, B) – indel penalty
S(i – 1, j – 1, B) + δ(gi, tj) }
If i is the starting vertex of block B:
• S(i, j, B) =
max { S(i, j – 1, B) – indel penalty
maxall blocks B’ preceding block B S(end(B’), j, B’) – indel penalty
maxall blocks B’ preceding block B S(end(B’), j – 1, B’) + δ(gi, tj)
}