Sequence Alignment

Download Report

Transcript Sequence Alignment

Some Independent Study on
Sequence Alignment
— Lan Lin
prepared for theory group meeting on
July 16, 2003
Biological Background (1)

Genetic information is stored in DNA
– used to make identical copies
– transferred from DNA to RNA to protein

DNA is a linear polymer of 4 nucleotides
– (A, T, G, C)

RNA is a similar polymer
– (A, U, G, C)

Both can pair one with another — “double helix”
– pairing being sequence specific (G-C, A-T/U)
– templating resulting in DNA replication and RNA copy of a DNA
sequence
Biological Background (2)

Proteins are variable length linear, mixed polymers of 20
different amino acids
– peptides and polypeptides for amino acid polymers
– functional property largely determined by the amino acid sequence

RNA
protein by translation of a code consisting of 3
nucleotides into 1 amino acid
– one amino acid encoded by 1 ~ 6 different triplet codes
– 3 stop codons specifying “end of peptide sequence”
– 3 reading frames for a DNA sequence, 6 for one with its (inferred)
complementary strand
Sequence Analysis (1)

Some difficulties
– Where the code for a protein starts and stops?
– DNA frequently scattered in separate “exons”, not continuous
– RNAs up- and down-stream of the coding region, non-coding regions can
be quite large; not all RNAs encode proteins

Inferring structure and function from a protein sequence is even harder!
– 3 levels of protein structure
 primary structure — sequence of amino acids in the protein
 secondary structure — polypeptide chains folding into regular structures (i.e.,
alpha helix or beta sheet)
 tertiary structure — 3D structure of protein determining biological function
– homology-based approach used to determine the tertiary structure by
primary sequence analysis of related proteins
Sequence Analysis (2)

What can be done?
– Identification of protein primary sequence from DNA
sequence
– searching for DBs for similar sequences

DNA sequences: DDBJ, EMBL, GenBank
– for rapid search for a query sequence: BLAST and FASTA

protein sequences: SwissProt, PIR
– calculation of sequence alignments for evolutionary
inferences and to aid in structural and functional
analysis
Pairwise Sequence Alignment

Two quantitative measures
– similarity (the larger the better)
– distance (the smaller the better)

Edit operations by introducing a gap character “-”
– match, replacement, insertion/deletion (“indel”)

The unit cost model
– The cost of an alignment of two sequences s and t is the sum of the
cost of all the edit operations that lead from s to t.
– An optimal alignment is one with the minimum cost.
– The edit distance of s and t is the cost of an optimal alignment of s
and t under a cost function w denoted by dw(s, t).
Pairwise Alignment via
Dynamic Programming (1)


Recursion step
dw(0:s:i, 0:t:j ) = min {dw(0:s:(i-1), 0:t:(j-1) ) + w(si, tj),
dw(0:s:(i-1), 0:t:j ) + w(si,- ),
dw(0:s:i, 0:t:(j-1) ) + w(-, tj)}
for i, j  1
Base
dw(0:s:0, 0:t:0 ) = 0
dw(0:s:i, 0:t:0 ) = dw(0:s:(i-1), 0:t:0 ) + w(si,- ) for i = 1, …, m
dw(0:s:0, 0:t:j ) = dw(0:s:0, 0:t:(j-1) ) + w(-, tj) for j = 1, …, n
Pairwise Alignment via
Dynamic Programming (2)




The edit distances of all prefixes define an (m+1) (n+1) distance
marix D = (di, j) with di, j = dw(0:s:i, 0:t:j).
Pattern of dependencies between matrix elements
di-1, j-1
di-1, j
di, j-1
di, j
The bottom right corner contains the desired result:
dmn = dw(0:s:m, 0:t:n) = dw(s, t).
A path through the distance matrix indicating how to align
– A diagonal line means replacement/match
– A vertical line means deletion
– A horizontal line means insertion

The most common order of calculation is line by line (each line from
left to right), or column by column (each column from top to bottom).
On Scoring Functions

Different words all attributing a numeric value to a pair of
sequences
– “distance” values are never negative; should be minimized
– “cost” implies positive values, with the overall cost to be
minimized
– “weights” and “scores” can be positive or negative; the optimal
alignments maximize scores
– “similarity” implies large values are good; should be maximized

If relating sequences of different length, length-relative
scores make sense.
Realistic Gap Models

No-gap alignment
– using matches/replacements only in some regions (i.e.,
sites of protein-protein interaction)
– DP algorithm geared to do this by setting costs for indel
to infinity (or something close to it)

Block-indel
– charging a certain set-up cost for introducing the gap,
whereas extending the gap is less expensive
– DP algorithm adapted without much effect on its
efficiency
Variations of Pairwise Alignment (1)

Local alignment (approximate pattern matching)
– where s is relatively short with respect to t and we seek that
subunit of t which s aligns best with:
Given 0:s:m and 0:t:n, find i:t:j such that dw(s, i:t:j ) is minimal among all
choices of 0 i j n.

Local alignment recursion
– no cost for deletion of a prefix 0:t:i,
– no cost for deletion of a suffix j:t:n,
– dmn gives the cost of the optimal local alignment, i:t:j is found by:
j = min {k|dm,k = dm, n}
i is the point where the optimal path leading to dm, j starts from the 1st
row
Variations of Pairwise Alignment (2)

Local similarity
– asking for those subunits of s and t that exhibit most similarity
– using a similarity rather than a distance measure
w(a, b) > 0, if a, b are similar,
w(a, b) < 0, if a, b are not similar
w(a, -) < 0, and w(-, b) < 0, in particular
– score 0 as a cut-off value between subsequences with/without
similarity


long stretches of dissimilarity shown as regions of zeroes in the
matrix
stretches of local similarity rising as islands of positive values
Heuristic Methods

Edit distance calculation complexity
– for input sequences 0:s:m,and 0:t:n, DP calculates m n matrix
entries; time complexity is O(m n)
– to only get the edit distance, only one column (or one row) of the
matrix needs to be stored; space complexity is O(m) or O(n)
– to retrace optimal path, the whole matrix needs to be stored; space
complexity is also O(m n)

Heuristic methods approximate optimal alignment in a
time complexity close to O(m+n)
– trading speed for precision
Multiple Alignment



Helpful for protein structure prediction and evolutionary
history inference
A multiple alignment of k sequences is a rectangular array
of k rows which resemble the corresponding sequences
when ignoring the gap character, with each column
containing at lease one character different from “-”.
Two ways to formulate a cost/weight function
– colomuns-first
– pairs-first

based on SP-cost (“sum-of-pairs”)
An optimal multiple alignment is one with minimum
overall cost, or maximal overall similarity.
MSA by Standard DP and Heuristics

DP matrix
DP hyperlattice
– taking time in O(2ki=1,…,k
 |si|) and space in O(i=1,…,k
 |si|)
– NP-hard with regard to the number of sequences with the SP
measure

Alignment along a phylogenetic tree
– tree generation through all optimal pairwise alignments
– most similar pairs aligned first before aligning alignments
– not necessarily optimal due to error accumulation
– “sequences”
“sum-of-pairs”
“profiles”
“scoring along a tree”
More Interesting Topics

Phylogenetic tree
 Genetic algorithms and protein folding
 RNA secondary structure prediction
 Protein structures
 Finding instances of known/unknown sites
 etc, …
References

Online Lectures on Bioinformatics
http://lectures.molgen.mpg.de/online_lectures.html

Biocomputing Hypertext Coursebook
http://www.techfak.unibielefeld.de/bcd/Curric/welcome.html

Lecture Notes on Biological Sequence
Analysis
http://www.cs.uml.edu/bioinformatics/resources/Lectu
res/tompa00lecture.pdf