Transcript Slide 1
OUTLINE
Sequence Alignment
Scoring Alignments and Substitution Matrices
Inserting Gaps
Dynamic Programming
Database Searches
Sequence Alignment
Comparing sequences for
–
–
Similarity
Homology
Prediction of function of genes and proteins
Construction of phylogeny
Finding motifs
Sequence Alignment - HOMOLOGY
Orthologues : any gene pairwise relation
where the ancestor node is a speciation
event. Often have similar function
Paralogues : any gene pairwise relation
where the ancestor node is a duplication
event. Paralogs tend to have different
functions
Sequence Alignment - HOMOLOGY
Sequence Alignment - HOMOLOGY
Sequence Alignment - PHYLOGENY
Sequence Alignment – PROTEIN
FUNCTIONS
Sequence Alignment – FINDING
MOTIFS
Scoring Alignments and Substitution
Matrices
The quality of an alignment is measured by
giving it a quantitative score
The simplest way of quatifying similarity
between two sequences is percentage
identity.
–
Simply measured by counting the number of
identical bases or amino acids matched
between the aligned sequences.
Scoring Alignments and Substitution
Matrices
The dot-plot
gives a visual
assesment of
similarity based
on identity.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Scoring Alignments and Substitution
Matrices
Percentage identity is a relatively crude
measure and does bot give a complete
picture of the degree of similarity of two
sequences.
Scoring identical matches 1 and mismatches
as 0 ignores the fact that the type of amino
acids involved is highly significant.
Scoring Alignments and Substitution
Matrices
Genuine matches may not be identical:
Seq1: T H I S I S A S E Q U E N C E
Seq1: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine : both polar
Scoring Alignments and Substitution
Matrices
Scoring pairs of amino acids:
–
–
with similar properties higher scores
With different properties lower scores
Scoring Alignments and Substitution
Matrices
To assign scores for alignmens use
SUBSTITUTION MATRICES
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Scoring Alignments and Substitution
Matrices
Different types of substitution matrices are
being used based on:
–
–
The number of mutations required for
convertion of one amino acid to the other
Similarities in physicochemical properties.
Scoring Alignments and Substitution
Matrices
PAM substitution matrices:
–
–
Use closely related protein sequences to
derive substitution frequencies
Accepted Point Mutations per 100 residues
250 PAM 250 mutation on 100 residues
Scoring Alignments and Substitution
Matrices
BLOSUM substitution matrices:
–
–
–
BLOcks of Amino Acid SUbstitution Matrix
Use mutation data from highly conserved
local regions
BLOSUM 62 62% identity
Scoring Alignments and Substitution
Matrices
Which matrix to use ?
–
–
–
Depends on the problem properties,
Distantly related sequences : PAM 250 –
BLOSUM 50
Closely related sequences: PAM 120,
BLOSUM 80
Scoring Alignments and Substitution
Matrices
Which matrix to use ?
–
–
Some special purpose matrices (SLIM and
PHAT are designed for membrane proteins)
The length of the sequende is important
Short sequences PAM 40 or BLOSUM 80
Long sequences PAM 250 or BLOSUM 50
Scoring Alignments and Substitution
Matrices
BLOSUM – 62
and
PAM 120
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Inserting Gaps
Gap insertion requires a scoring penalty (gap
penalty).
To achieve correct matches gaps are
required
Alignment programs use gap penalties to
limit the introduction of gaps in the
alignments
Inserting Gaps
Insertions tend to be several residues long
rather than just a single residue long
–
–
–
–
Fewer insertions and deletions occur in sequences
of structural importance
Smaller penalty on lengthening an existing gap
(gap extension penalty) than introducing a new
gap
Gap penaly is high the number of gaps will be
decreased
Gap penalty is low more and large gaps will be
inserted.
Inserting Gaps
Choosing gap penalties:
–
–
Linear
Affine
Gap open penalty
Gap extension penlty
Dynamic Programming
Global and Local alignments
Pairwise and Multiple alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
For a pair of sequences there is a large
number of possible alignments.
2 sequences of length 1000 have
appriximately 10600 different alignments.
Dynamic Programming
Dynamic Programming:
–
–
–
–
Problem can be divided into many smaller parts.
Optimal alignment will not contain parts that are
not themselves optimal.
Start from sufficiently short sub-sequences.
Alignement is additive:
Dynamic Programming
Needleman and Wunsch were the first to
propose this method.
Find optimal global alignments.
Align sequences:
–
–
Seq1: x (x1x2x3…xm)
Seq1: y (y1y2y3…yn)
Dynamic Programming
s(a,b) = score of aligning a and b
F(i,j) = optimal similarity of X(1:i) and Y(1:j)
Recurrence relation:
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
– Assume linear gap penalty
Dynamic Programming
Sequences:
–
X = THISLINE, Y = ISALIGNED
Gaps:
–
Linear gap penalty (E=8)
g (ngap ) ngap E
Scoring matrix
(BLOSUM-62)
Dynamic Programming
Matrix S of optimal scores of sub-sequence
alignments.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
S(I, T) = -1,
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
1. I -- H
2. I -- gap
3. gap -- H
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
S(I, H) = -3,
S(I, gap) = -8,
S(gap, H) = -8
Recurrence relation:
F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
S(I, H) = -3,
S(I, gap) = -8,
S(gap, H) = -8
Recurrence relation:
F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
–Linear
gap penalty (E=4)
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Dynamic Programming
Example:
Dynamic Programming
Example (cont.):
Dynamic Programming
Example (cont.):
Dynamic Programming
Semi – global alignment:
–
–
When we treat terminal gaps differently than
internal gaps
How to modify dynamic programming to be able
to make semi – global alignment ?
Dynamic Programming
Local alignment:
–
–
If we compare a sequence to whole genome
Find sub-strings whose optimal global
alignment value is maximum
Dynamic Programming
Dynamic Programming
What is the difference between global and
local alignment ?
Can we define the recuernce relation of local
alignment similar to global alignment ?
Dynamic Programming
Recurrence relation of GLOBAL ALIGNMENT:
(Needleman & Wunsch)
– F(i,0) = Σ s(X(k), gap), 0 <= k <= i
– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming
Recurrence relation of LOCAL ALIGNMENT:
(Smith-Waterman)
– F(i,0) = 0
– F(0,j) = 0
– F(i,j) = max [ 0,
F(i,j-1) + s(gap,Y(j),
F(i-1,j) + s(X(i),gap),
F(i-1, j-1) + s(X(i), Y(j)]
Dynamic Programming
Dynamic Programming
Database Searches
FASTA and BLAST
Use some heuristics
Dynamic Programming Complexity
–
–
Time O(n*m)
Space O(n*m)
Database Searches FASTA
Good local alignment should have some
exact match subsequence.
Find all k-tuples. (k=1-2 for proteins, 3-6 for
DNA sequences)
Protein k – tuples nc, sp, … (k = 2)
Nucleotide k – tuples TAAA, CTCC,…(k = 4)
Database Searches FASTA
If k = 3 for nucleotide sequences.
–
–
There will be 64 possible k – tuples
Assign a number e( ):
e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3
Each 3 – tuples are represented as xi xi+1xi+2
Assign a number to each 3 – tuple
–
–
Ci = e(xi)42 + e(xi+1)41 + e(xi+2)40
For example: AAA
AAA 042 + 041 + 040 = 0
CAA 142 + 041 + 040 = 16
Database Searches FASTA
Find each occurance of k – tuples in the
sequences.
Chaining Look – Up Tables
Consider TAAAACTCTAAC (if k = 3):
3 - tuples
Position
AAA (0)
2, 3
AAC (1)
4, 10
AAG (2)
0
AAT (3)
0
…
…
Database Searches FASTA
Find hot – spots on same diagonal
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Database Searches FASTA
Consider the highes scoring alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
Database Searches FASTA
Join the best ones. You can use DP
Database Searches BLAST
Use short words to search the database
sequence.
Searches for k – mers that will score above a
threshold (T) value when aligned with query k mer (Remember FASTA looks for k – tuples
which are identical).
Use a scheme based on finite state automata
(Remember FASTA use hashing and chaining
fot rapid identification of k - tuples)
Database Searches BLAST
From Query Sequence, create query words
(for protein sequences word size is 3)
Database Searches BLAST
Blast uses a list of high scoring words created
from words similar to query words. Considers
the words with a score bigger than a threshold
value.
Database Searches BLAST
Scan each database sequence for an exact
match to the list of words.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
Database Searches BLAST
Keep only the extended matches that have a
score at least S.
Determine statistical significance of each
remaining match.
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
Database Searches BLAST
[Loxodonta cyclotis]
[Elephas maximus maximus]
[Elephas maximus indicus]
Database Searches HISTORY
1970: NW
1980: SW
1985: FASTA
1989: BLAST
References
M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland
Science
Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical
guide to the analysis of genes and proteins”, 2001, Wiley.
NCBI HelpDesk - Field Guide Course Materials