Sequence Alignment III

Download Report

Transcript Sequence Alignment III

Sequence Alignment III
CIS 667 February 10, 2004
Extensions to the Basic
Algorithm
• We have seen that the basic dynamic
programming algorithm can be used to
solve
 Global alignment
 Semi-global alignment
 Local alignment
• We can extend the algorithm to more
accurately reflect the cost of gap penalties
General Gap Penalty Functions
• Gaps are caused by mutations
 It is more likely to have a single large gap
than several smaller ones
 Gap penalties should reflect that
 Let w(k) denote a gap penalty function (the
cost of a gap of k spaces)
 We have been using w(k) = bk - a linear
function
General Gap Penalty Functions
• We can modify the basic algorithm to
compute a score with a general gap
penalty w (i.e. any function)
• The modified algorithm is slower, however
- O(n3)
 The new algorithm scoring scheme is no
longer additive
 The space required is also larger
Affine Gap Penalty Functions
• Can we do better than O(n3) and still have
a reasonable function?
 Yes. We need to have w(k)  kw(1)
 An affine function - w(k) = h + gk with w(0) =
0 and h, g > 0 works
 Think of h as the cost of opening a gap and g
as the cost of extending a gap
 We can develop an algorithm with time
complexity O(n2)
Gap Penalties - Overview
Imagine we want to align:
Bad alignment:
CAGT
CCAAGGTTCAGT
C-A-G-T----CCAAGGTTCAGT
-16
Better alignment: --------CAGT -16
CCAAGGTTCAGT
-12
-9
Gap cost with linear gap penalty (-2)
Gap cost with affine gap penalty (h = -2, k = -1)
Multiple Sequence Alignment
• Once a protein sequence is newly
determined, an important goal is to assign
possible functions to it
 First search for similar sequences in the DNA
and protein sequence databases
 If more than one similar sequence is found,
the next step is to multiply align all of the
sequences
Multiple Sequence Alignment
• Multiple alignments are key starting point
for
 Protein secondary structure prediction
 Residue accessibility
 Function
• Also provide the basis for the most
sensitive sequence searching algorithms
Multiple Sequence Alignment
• A multiple sequence alignment is simply
an alignment that contains more than two
sequences
MPQILLL
MLR-LLMK-ILLL
MPPVLIL
Multiple Sequence Alignment
• We must decide how to score a multiple
alignment
• One possibility is the sum-of-pairs function
 Simply add up the pairwise scores of all pairs
in a column to get the score of the column
 Note that in multiple sequence alignment we may
have two spaces in a column - the score of (-,-)
then is usually set to 0
Multiple Sequence Alignment
• A straightforward dynamic programming
approach to multiple sequence alignment
results in an exponential algorithm
 Heuristics can be used to reduce the
complexity in most cases
Multiple Sequence Alignment
• Automatic alignment programs such as
CLUSTAL W can be used to produce
multiple alignments
• The PSI-BLAST program uses multiple
sequence alignments to make more
sensitive searches of protein sequence
databases than is possible with a single
sequence
PAM Matrices
• When comparing protein sequences, we
need a more complex scoring scheme
 A mismatch with two amino acids with similar
biochemical properties should score higher
than one with two dissimilar ones
 Evolution is more likely to result in a similar amino
acid (e.g. same size, both hydrophobic, etc.)
replacing another
PAM Matrices
• PAM - Point Accepted Mutations or
Percent of Accepted Mutations
 1-PAM matrix reflects an amount of evolution
producing on average one mutation per
hundred amino acids
 250-PAM matrix is suitable for comparing
sequences that are 250 units of evolution
apart
 Works well for long, weakly similar sequences
 Small values good for short, similar sequences
PAM-250 Matrix
BLOSUM Matrices
• Another widely used set of matrices is
BLOSUM - Blocks Substitution Matrix
 BLOSUM is often better for highly divergent
sequences
 PAM better for more highly similar sequences
BLAST
• BLAST - Basic Local Alignment Search
Tool is a family of sequence similarity tools
 Can be used to search sequence databases
worldwide
 Can be run locally, or via web-based interface
on a server
 Given a query sequence, BLAST returns all
matches above a user-defined threshold
BLAST
• BLAST uses a heuristic technique
 Compile list of high-scoring words (use PAM
matrix to score words w characters long)
 Search for matches in the database (use a
hash table to speed up search) - call a match
a seed
 Extend the seeds in both directions until the
score of the extension falls below a limit