Transcript Slide 1

OUTLINE





Sequence Alignment
Scoring Alignments and Substitution Matrices
Inserting Gaps
Dynamic Programming
Database Searches
Sequence Alignment

Comparing sequences for
–
–



Similarity
Homology
Prediction of function of genes and proteins
Construction of phylogeny
Finding motifs
Sequence Alignment - HOMOLOGY

Orthologues : any gene pairwise relation
where the ancestor node is a speciation
event. Often have similar function

Paralogues : any gene pairwise relation
where the ancestor node is a duplication
event. Paralogs tend to have different
functions
Sequence Alignment - HOMOLOGY
Sequence Alignment - HOMOLOGY
Sequence Alignment - PHYLOGENY
Sequence Alignment – PROTEIN
FUNCTIONS
Sequence Alignment – FINDING
MOTIFS
Scoring Alignments and Substitution
Matrices


The quality of an alignment is measured by
giving it a quantitative score
The simplest way of quatifying similarity
between two sequences is percentage
identity.
–
Simply measured by counting the number of
identical bases or amino acids matched
between the aligned sequences.
Scoring Alignments and Substitution
Matrices

The dot-plot
gives a visual
assesment of
similarity based
on identity.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Scoring Alignments and Substitution
Matrices


Percentage identity is a relatively crude
measure and does bot give a complete
picture of the degree of similarity of two
sequences.
Scoring identical matches 1 and mismatches
as 0 ignores the fact that the type of amino
acids involved is highly significant.
Scoring Alignments and Substitution
Matrices

Genuine matches may not be identical:
Seq1: T H I S I S A S E Q U E N C E
Seq1: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine : both polar
Scoring Alignments and Substitution
Matrices

Scoring pairs of amino acids:
–
–
with similar properties  higher scores
With different properties  lower scores
Scoring Alignments and Substitution
Matrices

To assign scores for alignmens use
SUBSTITUTION MATRICES
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Scoring Alignments and Substitution
Matrices

Different types of substitution matrices are
being used based on:
–
–
The number of mutations required for
convertion of one amino acid to the other
Similarities in physicochemical properties.
Scoring Alignments and Substitution
Matrices

PAM substitution matrices:
–
–
Use closely related protein sequences to
derive substitution frequencies
Accepted Point Mutations per 100 residues

250 PAM  250 mutation on 100 residues
Scoring Alignments and Substitution
Matrices

BLOSUM substitution matrices:
–
–
–
BLOcks of Amino Acid SUbstitution Matrix
Use mutation data from highly conserved
local regions
BLOSUM 62  62% identity
Scoring Alignments and Substitution
Matrices

Which matrix to use ?
–
–
–
Depends on the problem properties,
Distantly related sequences : PAM 250 –
BLOSUM 50
Closely related sequences: PAM 120,
BLOSUM 80
Scoring Alignments and Substitution
Matrices

Which matrix to use ?
–
–
Some special purpose matrices (SLIM and
PHAT are designed for membrane proteins)
The length of the sequende is important


Short sequences  PAM 40 or BLOSUM 80
Long sequences  PAM 250 or BLOSUM 50
Scoring Alignments and Substitution
Matrices

BLOSUM – 62
and
PAM 120
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Inserting Gaps



Gap insertion requires a scoring penalty (gap
penalty).
To achieve correct matches gaps are
required
Alignment programs use gap penalties to
limit the introduction of gaps in the
alignments
Inserting Gaps

Insertions tend to be several residues long
rather than just a single residue long
–
–
–
–
Fewer insertions and deletions occur in sequences
of structural importance
Smaller penalty on lengthening an existing gap
(gap extension penalty) than introducing a new
gap
Gap penaly is high  the number of gaps will be
decreased
Gap penalty is low  more and large gaps will be
inserted.
Inserting Gaps

Choosing gap penalties:
–
–
Linear
Affine


Gap open penalty
Gap extension penlty
Dynamic Programming

Global and Local alignments

Pairwise and Multiple alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming


For a pair of sequences there is a large
number of possible alignments.
2 sequences of length 1000 have
appriximately 10600 different alignments.
Dynamic Programming

Dynamic Programming:
–
–
–
–
Problem can be divided into many smaller parts.
Optimal alignment will not contain parts that are
not themselves optimal.
Start from sufficiently short sub-sequences.
Alignement is additive:
Dynamic Programming



Needleman and Wunsch were the first to
propose this method.
Find optimal global alignments.
Align sequences:
–
–
Seq1: x (x1x2x3…xm)
Seq1: y (y1y2y3…yn)
Dynamic Programming



s(a,b) = score of aligning a and b
F(i,j) = optimal similarity of X(1:i) and Y(1:j)
Recurrence relation:
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
– Assume linear gap penalty
Dynamic Programming

Sequences:
–

X = THISLINE, Y = ISALIGNED
Gaps:
–
Linear gap penalty (E=8)
g (ngap )  ngap E

Scoring matrix
(BLOSUM-62) 
Dynamic Programming

Matrix S of optimal scores of sub-sequence
alignments.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
S(I, T) = -1,
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
1. I -- H
2. I -- gap
3. gap -- H
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
S(I, H) = -3,
S(I, gap) = -8,
S(gap, H) = -8
Recurrence relation:
F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
S(I, H) = -3,
S(I, gap) = -8,
S(gap, H) = -8
Recurrence relation:
F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
–Linear
gap penalty (E=4)
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
Example:
Dynamic Programming
Example (cont.):
Dynamic Programming
Example (cont.):
Dynamic Programming

Semi – global alignment:
–
–
When we treat terminal gaps differently than
internal gaps
How to modify dynamic programming to be able
to make semi – global alignment ?
Dynamic Programming

Local alignment:
–
–
If we compare a sequence to whole genome
Find sub-strings whose optimal global
alignment value is maximum
Dynamic Programming
Dynamic Programming


What is the difference between global and
local alignment ?
Can we define the recuernce relation of local
alignment similar to global alignment ?
Dynamic Programming
Recurrence relation of GLOBAL ALIGNMENT:
(Needleman & Wunsch)
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming
Recurrence relation of LOCAL ALIGNMENT:
(Smith-Waterman)
– F(i,0) = 0
– F(0,j) = 0
– F(i,j) = max [ 0,
F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming
Dynamic Programming
Database Searches



FASTA and BLAST
Use some heuristics
Dynamic Programming Complexity
–
–
Time O(n*m)
Space O(n*m)
Database Searches FASTA




Good local alignment should have some
exact match subsequence.
Find all k-tuples. (k=1-2 for proteins, 3-6 for
DNA sequences)
Protein k – tuples  nc, sp, … (k = 2)
Nucleotide k – tuples  TAAA, CTCC,…(k = 4)
Database Searches FASTA

If k = 3 for nucleotide sequences.
–
–
There will be 64 possible k – tuples
Assign a number e( ):

e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3

Each 3 – tuples are represented as  xi xi+1xi+2
Assign a number to each 3 – tuple

–
–
Ci = e(xi)42 + e(xi+1)41 + e(xi+2)40
For example: AAA 

AAA 042 + 041 + 040 = 0

CAA 142 + 041 + 040 = 16
Database Searches FASTA



Find each occurance of k – tuples in the
sequences.
Chaining  Look – Up Tables
Consider TAAAACTCTAAC (if k = 3):
3 - tuples
Position
AAA (0)
2, 3
AAC (1)
4, 10
AAG (2)
0
AAT (3)
0
…
…
Database Searches FASTA

Find hot – spots on same diagonal
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Database Searches FASTA

Consider the highes scoring alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Database Searches FASTA

Join the best ones. You can use DP
Database Searches BLAST



Use short words to search the database
sequence.
Searches for k – mers that will score above a
threshold (T) value when aligned with query k mer (Remember FASTA looks for k – tuples
which are identical).
Use a scheme based on finite state automata
(Remember FASTA use hashing and chaining
fot rapid identification of k - tuples)
Database Searches BLAST

From Query Sequence, create query words
(for protein sequences word size is 3)
Database Searches BLAST

Blast uses a list of high scoring words created
from words similar to query words. Considers
the words with a score bigger than a threshold
value.
Database Searches BLAST


Scan each database sequence for an exact
match to the list of words.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
Database Searches BLAST


Keep only the extended matches that have a
score at least S.
Determine statistical significance of each
remaining match.
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
[Loxodonta cyclotis]
[Elephas maximus maximus]
[Elephas maximus indicus]
Database Searches HISTORY




1970: NW
1980: SW
1985: FASTA
1989: BLAST
References



M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland
Science
Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical
guide to the analysis of genes and proteins”, 2001, Wiley.
NCBI HelpDesk - Field Guide Course Materials