download report

Transcript LecturesPart08

Computational Biology, Part 8
Sequence Comparison with
Dynamic Programming
Robert F. Murphy
Copyright  1996, 1999-2001.
All rights reserved.
Dynamic programming
algorithms for sequence
Similar in concept to dot matrix
 Introduced for biological sequences by
 S.
B. Needleman & C. D. Wunsch. A general
method applicable to the search for similarities
in the amino acid sequence of two proteins. J.
Mol. Biol. 48:443-453 (1970)
Steps of basic dynamic
programming method
1. Initialize matrix to match scores (0 or 1)
 2. Do summation operation
 Finds
the maximum number of matches that can
be obtained starting at any position and
proceeding "forward"
3. Traceback to find maximum match
Summation operation
1. Start in lower right corner
2. Move up one position and left one position
3. Find largest value in either (a) row segment
starting one below current position and
extending to the right or (b) column
segment starting one to the right of current
position and extending down
Summation operation (cont.)
4. Add this value to the value in the current
5. Repeat steps 3 and 4 for all cells to the left
in current row and all cells above in current
6. If we are not in the top left corner, go to
step 2
Illustration of Simple Dynamic
Programming Method
(Demonstration A7)
Use of dynamic programming to
evaluate homology between pairs
of sequences
If we just want to know maximum match
possible between two sequences, don't need
to do traceback but can just look at the
highest value in the first row or column
("match score"). This represents the best
possible alignment score.
Extensions to basic dynamic
programming method
use similarity function in initialization step
 allow gaps
 use
gap penalties
Illustration with a Similarity
(Demonstration A8)
Gap penalty alternatives
constant gap penalty for gap > 1
 gap penalty proportional to gap size
 one
penalty for starting a gap (gap opening
 different (lower) penalty for adding to a gap
(gap extension penalty)
Gap penalty alternatives (cont.)
gap penalty proportional to gap size and
 for
nucleic acids, can be used to mimic
thermodynamics of helix formation
 two kinds of gap opening penalties
 one
for gap closed by AT, different for GC
 different
gap extension penalty
End gaps
Some programs treat end gaps as normal
gaps and apply penalties, other programs do
not apply penalties for end gaps
End gaps (cont.)
 Can
determine which a program does by adding
extra (unmatched) bases to the end of one
sequence and seeing if match score changes
 Penalties for end gaps appropriate for aligned
sequences where ends "should match"
 Penalties for end gaps inappropriate when
surrounding sequences are expected to be
different (e.g., conserved exon surrounded by
varying introns
Global vs. local similarity
Should result of alignment include all amino
acids or proteins or just those that "match"?
 If
yes, a global alignment is desired
 In
a global alignment, presence of mismatched
elements is neutral - doesn't affect overall match
Global vs. local similarity
Should result of alignment include all amino
acids or proteins or just those that "match"?
 If
no, a local alignment is desired
 Local
alignments accomplished by including
negative scores for "mismatched" positions, thus
scores get worse as we move away from region of
 Instead of starting traceback with highest value in
first row or column, start with highest value in entire
matrix, stop when score hits zero
Reading for next class
B & O, Chapter 7 just pp. 155-170
 Additional optional reading: Sequence
Analysis Primer, pp. 138-141 "Hashing and
Neighborhood Algorithms" and "Statistical
Approaches - BLAST” (on web site as
Reading 1)
 Altschul et al. paper (on web site as Reading