Lecture 5. Sequence Analysis


Sequence Analysis
CSC 487/687 Introduction to computing
for Bioinformatics
Aligning Sequences

Sequences



Representing proteins or nucleic acid (DNA/RNA)
molecules
Order of amino acids (for proteins – nucleotides for
DNA/RNA) along one chain
Sequence alignment


The identification of residue-residue correspondences
Any assignment of correspondences that preserves the
order of residues within the sequences
Evolutionary Basis of Sequence
Alignment



Identity: Quantity that describes how much
two sequences are alike in the strictest terms.
Similarity: Quantity that relates how much
two amino acid sequences are alike.
Homology: a conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
Evolutionary Basis of Sequence
Alignment

Homologous sequences


Related by evolution (common ancestors)
Alignment of homologous sequences


Identifying relationship between the sequence
elements
Match up characters coming from same
characters in ancestor
Alignment and Evolution

Assume we know evolutionary history
relating q and d:

The true alignment can be found using h as
a template:
h : GLVS T
q’: GLISVT
d’: GIV--T
Alignment


Evolution
Given an alignment, several different evolutionary
histories may be (equally) plausible
Example:

Alignment:
q’: GLISVT
d’: G-I-VT

One possible history:
H*: GLIVT
Branch to q: S is inserted, giving q: GLISVT
Branch to d: L is deleted, giving d: GIVT
Global and Local Alignment

Global

Assumes that the complete sequences are
the result of evolution from the same
ancestor sequence
[Figure: sequences S1 and S2 derived from a common Ancestor]

Local

Aligns segments of the sequences so that the
segments are evolutionarily related
[Figure: segments of S1 and S2 derived from a common Ancestor]
Pairwise sequence alignments vs. multiple
sequence alignments

Pairwise sequence alignment: an alignment of
two sequences

Multiple sequence alignment: a mutual
alignment of more than two sequences
The dotplot

Captures not only the overall similarity of two
sequences, but also the complete set and
relative quality of different possible
alignments



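A path through the dotplot moves in one of three directions:
Diagonal: two residues are aligned with each other,
with no gap
Horizontal: a gap is introduced in the sequence
indexing the rows
Vertical: a gap is introduced in the sequence
indexing the columns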
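
As a rough illustration of how a dotplot is built, the following Python sketch marks every cell in which the row character and the column character are identical. The function name `dotplot` and the input words INVEST and INTEREST are assumptions made for this example; practical dotplots usually add a sliding window or a similarity threshold, which is left out here.

```python
def dotplot(row_seq, col_seq):
    """Return one string per row; '*' marks a cell whose two characters match."""
    return [
        "".join("*" if r == c else "." for c in col_seq)
        for r in row_seq
    ]


# Rows are indexed by the first sequence, columns by the second.
rows = dotplot("INVEST", "INTEREST")
print("  INTEREST")
for label, line in zip("INVEST", rows):
    print(label, line)
```

Runs of '*' along a diagonal correspond to stretches that can be aligned without gaps.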
Dotplots and alignments




A path through the dotplot can be read as an edit script.
Each move performs one operation: a
substitution, an insertion or a deletion.
When the end of the path is reached, the
net effect is to change one sequence into the
other.
Several different sequences of edit
operations may convert one string to the
other in the same number of steps.
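
To make the idea of an edit script concrete, here is a minimal Python sketch that reads the columns of an alignment from left to right and lists the operation each column implies. The function name `edit_script`, the operation labels and the use of '-' for a blank are assumptions made for this illustration.

```python
def edit_script(row_a, row_b, blank="-"):
    """List the operations that turn the first aligned row into the second."""
    ops = []
    for x, y in zip(row_a, row_b):
        if x == blank:
            ops.append(("insert", y))          # character present only in row_b
        elif y == blank:
            ops.append(("delete", x))          # character present only in row_a
        elif x != y:
            ops.append(("substitute", x, y))   # mismatching pair
        else:
            ops.append(("match", x))           # identical pair, no change needed
    return ops


for op in edit_script("IN--VEST", "INTEREST"):
    print(op)
```

Applying the non-match operations in order converts INVEST into INTEREST.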
Dotplots and alignments



Although a sequence of edit operations
derived from an optimal alignment may
correspond to an actual evolutionary pathway,
it is impossible to prove that it does.
The larger the edit distance, the larger the
number of reasonable evolutionary pathways
between two sequences.
Dotplots and alignments

Dotplots between pairs of proteins with
increasingly distant relationships.

Dotplot comparisons of the sulphydryl
proteinase papain from papaya with four
homologues: the close relative, kiwi fruit
actinidin, and the more distant relatives, human
procathepsin L, human cathepsin B, and
a homologue from Staphylococcus aureus.
Example
Measures of sequence similarity

Hamming distance ― the number of positions
with mismatching characters.

Edit distance ― the minimum number of “edit
operations” required to change one string into
the other.
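
Both measures can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are made up for the example, Hamming distance is only defined here for strings of equal length, and the edit distance uses unit cost for every substitution, insertion and deletion.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of unit-cost substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                      # deleting the first i characters of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,            # delete x
                curr[j - 1] + 1,        # insert y
                prev[j - 1] + (x != y)  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(hamming_distance("agtc", "cgta"))
print(edit_distance("INVEST", "INTEREST"))
```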
What is an Alignment?

A global alignment of two sequences A and B
contains all characters of A and B in the same
order





one symbol from A can be aligned with one symbol from B
a symbol can be aligned with a blank, written as ‘-’
two blanks cannot be aligned
Every symbol from A and from B must be aligned
Example:
A: INVEST, B: INTEREST

IN--VEST
INTEREST

INV--EST
INTEREST

IN-V--EST
IN-TEREST
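
The rules for a global alignment can be checked mechanically. The following Python sketch is one possible check, with the function name `is_global_alignment` and the blank symbol '-' chosen for this illustration: both rows must have the same length, no column may hold two blanks, and removing the blanks must give back A and B in their original order.

```python
def is_global_alignment(row_a, row_b, a, b, blank="-"):
    """Check that (row_a, row_b) is a valid global alignment of a and b."""
    if len(row_a) != len(row_b):
        return False                       # columns must pair up one to one
    if any(x == blank and y == blank for x, y in zip(row_a, row_b)):
        return False                       # two blanks cannot be aligned
    # Every symbol of a and b must appear, in the same order.
    return row_a.replace(blank, "") == a and row_b.replace(blank, "") == b


print(is_global_alignment("IN--VEST", "INTEREST", "INVEST", "INTEREST"))
print(is_global_alignment("INV--EST", "INTEREST", "INVEST", "INTEREST"))
```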
Computing Alignments


There exist a large number of alignments
for a pair of sequences
In order to use a computer to do the
alignment process in a meaningful way, we
need


Scoring scheme – mathematical way to
calculate goodness of candidate alignments
Search method – algorithm able to identify
high scoring alignments
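
To get a feeling for how large the number of alignments is, the following Python sketch counts them. It assumes the definition of a global alignment used in these slides (each column is a pair of symbols, or one symbol and a blank) and counts every distinct sequence of columns separately; the function name `count_alignments` is made up for the example. The last column of any alignment is one of three kinds, which gives the recurrence used in the code.

```python
def count_alignments(m, n):
    """Count global alignments of sequences with lengths m and n."""
    # With an empty sequence there is exactly one alignment: all blanks.
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (
                table[i - 1][j]        # last column: symbol of A over a blank
                + table[i][j - 1]      # last column: blank over symbol of B
                + table[i - 1][j - 1]  # last column: symbol of A over symbol of B
            )
    return table[m][n]


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length, which is why a search method is needed instead of scoring every alignment.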
Choosing Scoring Scheme

Scoring scheme should be

Simple – to allow for efficient calculation and
search for the best alignment
Biologically meaningful – gives high scores to
biologically good alignments
Simple Scoring Scheme

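Assign a score to each column in the alignment
Columns are of the following sorts:

Two characters a and b aligned: score R(a, b)
A character aligned with a blank: score -g, where g is the gap penalty

Alignment score: sum of scores over all columns

R: matrix giving the score for all possible character
pairs (e.g., all pairs of amino acid symbols)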

Alignment Score – Example


R is the identity matrix: identical characters
score 1, unequal characters score 0; g = 1

ALIGN1:
 V  -  E  I  T  G  E  I  S  T
 P  R  E  -  T  E  R  I  -  T
 0 -1  1 -1  1  0  0  1 -1  1
Score: 1

ALIGN2:
 V  E  I  T  G  E  I  S  T
 P  R  E  T  -  E  R  I  T
 0  0  0  1 -1  1  0  0  1
Score: 2
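
The column-by-column scoring of this example can be reproduced with a short Python sketch. It assumes the identity matrix for R (1 for identical characters, 0 otherwise) and a gap penalty g = 1, as in the example; the function name `alignment_score` is hypothetical.

```python
def alignment_score(row_a, row_b, gap_penalty=1, blank="-"):
    """Sum the column scores of an alignment under an identity matrix."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == blank or y == blank:
            score -= gap_penalty   # character aligned with a blank
        elif x == y:
            score += 1             # identical characters
        # unequal characters score 0
    return score


print(alignment_score("V-EITGEIST", "PRE-TERI-T"))  # ALIGN1
print(alignment_score("VEITGEIST", "PRET-ERIT"))    # ALIGN2
```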
Finding the Minimum Scoring Alignment

Large number of possible alignments –
cannot generate all and score them to find
the best

Task – align
A=a1a2...am and B=b1b2...bn
Independence Between
Sub-alignments

Observations:


The score of the alignment up to and including
character i from A and character j from B is
independent of how the rest of the sequences
are aligned
The best solution to (i,j) can be “locked”, its
score recorded in Di,j


Dm,n is the score of the best global alignment
This makes the problem amenable to dynamic programming
Dynamic programming algorithm

Individual edit operations include:

Substitution of bj for ai: represented (ai, bj)
Deletion of ai from sequence A: represented (ai, -)
Deletion of bj from sequence B: represented (-, bj)
Dynamic programming algorithm

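A cost function d is defined on edit operations

d(ai, bj) = cost of a mutation in an alignment in
which position i of sequence A corresponds to
position j of sequence B
d(ai, -) or d(-, bj) = cost of a deletion or insertion

The minimum weighted distance between
sequences A and B is defined as

$$D(A,B) = \min_{A \to B} \sum d(x, y)$$

where the sum runs over the edit operations (x, y) of one
edit script and the minimum is taken over all edit scripts
that convert A into B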
Three Alternative Alignment Ends

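The alignment between a1a2...ai and b1b2...bj
ends in one of three ways:

ai aligned with a blank: a1..i-1 is aligned with b1..j, followed by the column (ai, -)
ai aligned with bj: a1..i-1 is aligned with b1..j-1, followed by the column (ai, bj)
bj aligned with a blank: a1..i is aligned with b1..j-1, followed by the column (-, bj)

To calculate Di,j we pick the one that
gives the lowest cost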
Recurrence Relation
Assume that Di-1,j, Di-1,j-1 and Di,j-1 have already been calculated. Then

$$D_{i,j} = \min \begin{cases} D_{i-1,j} + d(a_i, -) \\ D_{i-1,j-1} + d(a_i, b_j) \\ D_{i,j-1} + d(-, b_j) \end{cases}$$

The three terms correspond to an alignment ending in the
column (ai, -), (ai, bj) or (-, bj), respectively.
Basis of Recursion

Aligning a string of length i (resp. j) to the empty string
can be done by aligning it to i (resp. j) blanks:

$$D_{0,0} = 0, \qquad D_{0,j} = \sum_{k=1}^{j} d(-, b_k), \qquad D_{i,0} = \sum_{k=1}^{i} d(a_k, -)$$
Calculating Score of Best
Alignment Using Matrix
[Figure: the H matrix, with the cost of the best alignment marked]
Time Complexity

Sequences of lengths n and m: O(nm)

Two sequences of length l: O(l^2)