Sequence Alignment III
Download
Report
Transcript Sequence Alignment III
Sequence Alignment III
CIS 667 February 10, 2004
Extensions to the Basic
Algorithm
• We have seen that the basic dynamic
programming algorithm can be used to
solve
Global alignment
Semi-global alignment
Local alignment
• We can extend the algorithm to more
accurately reflect the cost of gap penalties
General Gap Penalty Functions
• Gaps are caused by mutations
It is more likely to have a single large gap
than several smaller ones
Gap penalties should reflect that
Let w(k) denote a gap penalty function (the
cost of a gap of k spaces)
We have been using w(k) = bk - a linear
function
General Gap Penalty Functions
• We can modify the basic algorithm to
compute a score with a general gap
penalty w (i.e. any function)
• The modified algorithm is slower, however
- O(n3)
The new algorithm scoring scheme is no
longer additive
The space required is also larger
Affine Gap Penalty Functions
• Can we do better than O(n3) and still have
a reasonable function?
Yes. We need to have w(k) kw(1)
An affine function - w(k) = h + gk with w(0) =
0 and h, g > 0 works
Think of h as the cost of opening a gap and g
as the cost of extending a gap
We can develop an algorithm with time
complexity O(n2)
Gap Penalties - Overview
Imagine we want to align:
Bad alignment:
CAGT
CCAAGGTTCAGT
C-A-G-T----CCAAGGTTCAGT
-16
Better alignment: --------CAGT -16
CCAAGGTTCAGT
-12
-9
Gap cost with linear gap penalty (-2)
Gap cost with affine gap penalty (h = -2, k = -1)
Multiple Sequence Alignment
• Once a protein sequence is newly
determined, an important goal is to assign
possible functions to it
First search for similar sequences in the DNA
and protein sequence databases
If more than one similar sequence is found,
the next step is to multiply align all of the
sequences
Multiple Sequence Alignment
• Multiple alignments are key starting point
for
Protein secondary structure prediction
Residue accessibility
Function
• Also provide the basis for the most
sensitive sequence searching algorithms
Multiple Sequence Alignment
• A multiple sequence alignment is simply
an alignment that contains more than two
sequences
MPQILLL
MLR-LLMK-ILLL
MPPVLIL
Multiple Sequence Alignment
• We must decide how to score a multiple
alignment
• One possibility is the sum-of-pairs function
Simply add up the pairwise scores of all pairs
in a column to get the score of the column
Note that in multiple sequence alignment we may
have two spaces in a column - the score of (-,-)
then is usually set to 0
Multiple Sequence Alignment
• A straightforward dynamic programming
approach to multiple sequence alignment
results in an exponential algorithm
Heuristics can be used to reduce the
complexity in most cases
Multiple Sequence Alignment
• Automatic alignment programs such as
CLUSTAL W can be used to produce
multiple alignments
• The PSI-BLAST program uses multiple
sequence alignments to make more
sensitive searches of protein sequence
databases than is possible with a single
sequence
PAM Matrices
• When comparing protein sequences, we
need a more complex scoring scheme
A mismatch with two amino acids with similar
biochemical properties should score higher
than one with two dissimilar ones
Evolution is more likely to result in a similar amino
acid (e.g. same size, both hydrophobic, etc.)
replacing another
PAM Matrices
• PAM - Point Accepted Mutations or
Percent of Accepted Mutations
1-PAM matrix reflects an amount of evolution
producing on average one mutation per
hundred amino acids
250-PAM matrix is suitable for comparing
sequences that are 250 units of evolution
apart
Works well for long, weakly similar sequences
Small values good for short, similar sequences
PAM-250 Matrix
BLOSUM Matrices
• Another widely used set of matrices is
BLOSUM - Blocks Substitution Matrix
BLOSUM is often better for highly divergent
sequences
PAM better for more highly similar sequences
BLAST
• BLAST - Basic Local Alignment Search
Tool is a family of sequence similarity tools
Can be used to search sequence databases
worldwide
Can be run locally, or via web-based interface
on a server
Given a query sequence, BLAST returns all
matches above a user-defined threshold
BLAST
• BLAST uses a heuristic technique
Compile list of high-scoring words (use PAM
matrix to score words w characters long)
Search for matches in the database (use a
hash table to speed up search) - call a match
a seed
Extend the seeds in both directions until the
score of the extension falls below a limit