Transcript Alignment

Chap. 3: Sequence Alignment





Pairwise sequence alignment is the most
fundamental operation of bioinformatics
It is used to decide if two proteins (or genes) are related
structurally or functionally
It is used to identify domains or motifs that are shared
among proteins
It is the basis of BLAST searching (next)
It is used in the analysis of genomes

Example
 Mitochondrial cytochrome b – transport electrons
 From NCBI (National Center for Biotechnology
Information) protein web page
https://www.ncbi.nlm.nih.gov/,
search for cytb and




Loxodonta africana (African elephant)
Elephas maximus (Indian elephant)
Mammuthus primigenius (Siberian wooly Mammoth)
Which modern elephant is closer to a mammoth ?
 Use clustal Omega to do the alignment
 http://www.clustal.org/
>gi|56578537|gb|AAW01445.1| cytochrome b [Loxodonta africana]
MTHIRKSYPLLKIINKSFIDLPTPSNISAWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
WGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFALHFILPFTMTALAGVHLTFLHETGSNNPLG
LTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
LRSVPNKLGGVLALFLSILILGLMPLLHTSKYRSMMLRPLSQVLFWTLTMDLLMLTWIGSQPVEYPYTII
GQMASILYFSIILAFLPIAGMIENYLIK
>gi|81230414|ref|YP_398766.1| cytochrome b [Mammuthus primigenius]
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
WGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFALHFILPFTMIALAGVHLTFLHETGSNNPLG
LTSDSDKIPFHPYYTIKDFLGLLILILFLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
LRSVPNKLGGVLALLLSILILGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIII
GQMASILYFSIILAFLPIAGMIENYLIK
>gi|2924631|dbj|BAA25017.1| cytochrome b (mitochondrion) [Elephas maximus]
MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
WGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLG
LTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
LRSVPNKLGGVLALFLSILILGLMPFLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIII
GQMASILYFSIILAFLPIAGMIENYLIK
>AKC90946.1 cytochrome b (mitochondrion) [Homo sapiens]
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHIT
RDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILLLATMATAFMGYVLPWGQMSF
WGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALAALHLLFLHETGSNNPLG
ITSHSDKITFHPYYTIKDALGLLLFLLSLMTLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTI
LRSVPNKLGGVLALLLSILILAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTII
GQVASVLYFTTILILMPTISLIENKMLKWA
Similarity and Homology

Similarity



Observation or measurement of resemblance, independent of the
source of the resemblance
Can be observed now but involves no historical hypothesis
Homology



Specifies that sequences and the organisms descended from a
common ancestor
Implies that similarities are shared ancestral characteristics
Cannot make the assertion of homology from historical evidence,
and thus is an inference from observations of similarity
Homology


Similarity attributed to descent from a common
ancestor
Two types of homology

Orthologs


Homologous sequences in different species that arose from a
common ancestral gene during speciation; may or may not
be responsible for a similar function.
Paralogs

Homologous sequences within a single species that arose by
gene duplication.
Orthologs:
members of a
gene (protein)
family in various
organisms.
This tree shows
globin orthologs.
Paralogs: members of a gene
(protein) family within a
species. This tree shows
human globin paralogs.
Orthologs and paralogs are often viewed in a single tree
Globin phylogeny by Dayhoff (1972)
Globin phylogeny by Dayhoff in evolutionary time (1972)
Globin

Globins carry oxygens and are first proteins to be sequenced
 Hemoglobins – in red blood cell
 Myoglobin – in muscle cells of mammals
 Leghemoglobin – in legumes (beans, etc.)
Globin
(a)
(b)
(c)
(d)
Myoglobin
Tetrameric hemoglobin
Beta globin subunit
Myoglobin & beta globin
Globin family

Myoglobin (NP_005359) & Beta globin (NP_000509)
>gi|4885477|ref|NP_005359.1| myoglobin [Homo sapiens]
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVL
TALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFR
KDMASNYKELGFQG
>gi|4504349|ref|NP_000509.1| hemoglobin subunit beta [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

20% similarity, but structurally similar and homologous

More challenging to identify homology
Direct Alignment

Given two sequences


+1 if letters in the same positions match
-1, otherwise
RNDKPFSTARN
RNQKPKWWTA
+ + - + +- - - - - -

Extremely simple, but what if there is a gap?

Gap when a base is inserted or deleted (indel)

Maybe more significant mutation – give more negative score as a
penalty
Visual Alignment -- Dotplot

A seq. in x axis and the
other in y axis
Dot on a crosspoint if
identical in both sequences
view

Special Dotplot
Periodic
Palindrome
Sequence Alignment

Direct alignment

An alignment with gaps
gctga-a--cg
- ct-ataatc-

gctgaacg
ctat aatc
gct- -gaa -cg
- ctat-aa tc -
What is the criteria for a good alignment ?


Use score to check for optimality
May not produce a unique optimal alignment
Calculation of an alignment score
General approach to pairwise alignment






Given two sequences
Select an algorithm that generates a score
Allow gaps (insertions, deletions)
Score reflects degree of similarity
Alignments can be global or local
Estimate probability that the alignment occurred by
chance
Pairwise alignment: protein sequences
can be more informative than DNA





protein is more informative (20 vs 4 characters); many
amino acids share related biophysical properties
codons are degenerate: changes in the third position
often do not alter the amino acid that is specified
protein sequences offer a longer “look-back” time
DNA sequences can be translated into protein, and then
used in pairwise alignments
Many times, DNA alignments are appropriate when

to confirm the identity of a cDNA
 to study noncoding regions of DNA
 to study DNA polymorphisms
 example: Neanderthal vs modern human DNA
Scoring Matrix

Dotplot



A numerical method



useful in identifying biological significance and interesting regions
Do not provide a measure of statistical similarity
provide position-by-position overlap
Also provide the nature and characteristics of residues being
aligned
Scoring matrices

Empirical weighting schemes
Scoring Matrix

Three biological factors in constructing a scoring matrix




Conservation
 Account for conservation between proteins, but provide a way
to assess conservation substitutions
 Score represents what residues are capable of substitution for
other residues while not adversely affecting the function of the
native protein (determined by charge, size, hydrophobicity, etc.)
Frequency
 Reflect how often residues occur among proteins
 Rare residues are given more weight
Evolution
 By design, implicitly represent evolutionary patterns
Review

http://books.google.com/books?hl=en&lr=&id=9p3E2sS1aJUC&oi=fnd&pg=PA73&ots=eJ0lzjEg_b&sig=
Fl2kBl5QBq7VIoy-eDgDqXhaZ14#v=onepage&q&f=false
Scoring Matrix

Log-Odds Score

qij : prob. of how often i and j are seen aligned
pi: prob. of observing AA i among all proteins

sij = log(qij/ pipj)


score



Represent the ratio of observed versus random frequency of
replacement of i by j
Positive score – two residues are replaced more often than by
chance
Negative – less likely to substitute than by chance
Scoring Matrix

Nucleotides

AAs

More complicated in 20x20
Other Scores

Gap penalty

Gap initiation and extension
aaagaaa
aaa–aaa

aaagggaaa
aaa- - - aaa
ClustalW recommends use of identity matrix
 For DNA sequences


1 for a match, 0 for a mismatch, gap penalty of 10 for initiation
and 0.1 for extension per residue
For AA sequences

BLOSUM62 matrix for substitution, gap penalty of 11 for initiation
and 1 for extension per residue
Pairwise Alignment: Global and Local

Given a scoring scheme, find alignments maximizing the
score


Global
 Entire sequence of protein or DNA sequence
 Needleman and Wunsch (dynamic programming)
Local
 Focus on regions of greatest similarity
 Smith and Waterman
 In general, preferable to Global Alignment
 Because only portions of proteins align
Global and
Local in Dotplot
Dynamic Programming


Guaranteed to yield an optimal global alignment
Drawback – many alignments may give the same optimal score and
none of them may correspond to biologically correct alignment


W.Fitch and T.Smith found 17 alignments of alpha- and beta-chains of
chicken haemoglobin, one of which is correct based on structures
Drawback – complexity O(nm) for sequences of length n and m
Dynamic Programming

Rock removal game






Two piles of rocks, each with 10 rocks
A and B alternatively remove one rock from a single pile or one rock
each from both piles
Player who remove the last rock(s) wins the game
Use reduction strategy starting with smaller problems
Consider 2+2 problem
 A removes one rock each, B removes one rock each
 A removes one rock, B takes one rock from the same pile
 B wins
3+3 problem ?
Rock Removal with 10+10




↑ A takes one from pile X
← A takes one from pile Y
A takes one from each pile
* A will lose
Manhattan Tourist Problem

Visit as many tourist sites in
a Manhattan grid
 Move to the east or
south only
 Start at upper left
corner
 End at # 15, lower right
corner
Problem Statement


Given a weighted grid G with two vertices
(nodes) for a source and a sink
Find the longest path in a weighted grid


Weight: # of attraction sites on an
edge (link)
Each vertex (node) can be identified
by (i,j)
 Source at (0,0)
 Sink at (n, m)
3
1
2
0
3
4
2
2
6
0
4
4
4
5
7
4
3
4
2
3
5
3
2
0
Solution


Define si,j: the longest path from
source to vertex (i,j) (0 ≤ i < n, 0 ≤ j <
m)
Solve for smaller problems first

Solving for s0,j and si,0 is easy
(0,0)
0
3
1
1 3
4
5 0
4
9 3
2
3
0
5
4
2
2
6
4
4
5
7
4
2
3
5
3
9
2
0
Solution (2)
(0,0)

Iteratively solve for neighboring nodes 0 3
 si,1
1
 si,2, etc.
(1,0) 3
(0,1)
3 2 5 4 9
0
2
4
4
1
4 2
4
6
5
2
(2,0) 0
7
3
5
10
4
4
5
2
(3,0) 3
0
9
14 3

si,j = max[si-1,j + weight on edge between (i-1,j) and (i,j),
si,j-1 + weight on edge between (i,j-1) and (i,j)]
Algorithm

Algorithm
 Given Weast(i,j) and Wsouth(i,j),
s0,0 = 0
for i =1 to n
si,0 = si-1,0 + Wsouth(i,0)
for j =1 to n
s0,j = s0,j-1 + Weast(0,j)
for i =1 to n
for j = 1 to m
si,j = max[si-1,j + Wsouth(i,j),
si,j-1 + Weast(i,j)]
return sn,m
General Graph Problem

Not regular with two inputs (indegree) and two outputs (outdegree) at a
node
Directed Acyclic Graph


DAG: Directed Acyclic Graph
 G = (V, E)
Longest Path Problem
 sv = max(su + weight from u to v) over all u which are
Predecessor(v)
 Predecessor relationship has to be established ahead of the time
u1
5
u2
u3
7
v
3
5
Graph Problem applied to Alignment

Measure of similarity
 Hamming distance: equal-length sequences
 Levenshtein or edit distance, 1966
 unequal-length sequence
 Min. # of ‘edit operations’ (insertion, deletion,
alteration of a single character in either sequence)
required to change one string into the other
 e.g.
ag–tcc
cgctca

Levenshtein distance = 3
Edit Distance and Alignment

Two strings, v and w
 Gaps are allowed in string, except that two gaps are not allowed at
the same char positions
A T - G T T A T A T CG T - A - G


Each char in a string is represented by positions in the original
string without gaps
 v: (1 2 2 3 4 5 6 7 7)
 w: (1 2 3 4 5 5 6 6 7)
For both strings,
 (00) (11) (22) (23) (34) (45) (55) (66) (76) (77)
 Represents a path in a grid
Edit Distance



Vertex (i,j) corresponds to (ij) for
(vi, wj)
G = (V, E)
Longest Path Problem
 sv = max(su + weight from u
to v) over all u,
Predecessor(v)
 Predecessor relationship
has to be established ahead
of the time
Global Alignment




A string has a sequence of characters drawn from an alphabet A of size
k
Scoring matrix, δ, of (k+1)x(k+1)
Problem Statement
 Given two strings, v and w, and a scoring matrix δ,
 Find the longest (max. score) path
Dynamic programming kernel
 Recurrence relationship
si, j
si-1, j + δ(vi, -)
= max [ si, j-1 + δ(-, wj)
si-1, j-1 + δ(vi, wj)
]
Global Alignment

Example of scoring matrix
 Match: +1; mismatch: -μ; indels: -σ
si-1, j

si-1, j
= max [ si, j-1
si-1, j-1
si-1, j-1
-σ
-σ
]
+ 1, if vi=wj
- μ, otherwise
Indels are frequent, and gap penalties proportional to indel sizes are
considered to be severe
 Affine gap penalties soften the penalty rate
 Can be linear, -(a + bx) for the indel length of x
Needleman-Wunsch, 1970

Setting up a matrix

Setting up a matrix

Scoring the matrix

Identifying the optimal alignment
Local Alignment


Global sequence alignment is useful for alignment of
sequences from the same protein family, for example
Substrings from two sequences may be highly conserved in
biological applications


Temple Smith and Michael Waterman, 1981
Biologically irrelevant diagonal matches are likely to have a higher
score
Local Alignment Problem


Given two strings v and w, and a scoring matrix δ
Find substrings of v and w whose global alignment is
maximal among all substrings of v and w


Seemingly harder, because the global alignment is to find the
longest path from (0,0) to (n,m), whereas the local alignment is to
find the longest path among all paths between two arbitrary points,
(i,j) to (i’, j’)
Add edges of weight 0 from (0,0) to
every other vertex (vertex (0,0) is a
predecessor of every vertex
Local Alignment Solution

Recurrence kernel becomes
si, j


si-1, j + δ(vi, -)
= max [ si, j-1 + δ(-, wj) ]
si-1, j-1 + δ(vi, wj)
Select the largest si, j
Other non-maximal local alignments may have biological
significance
 Select k best nonoverlapping local alignments