May 21, 2002 2

Download Report

Transcript May 21, 2002 2

Developing Sequence
Alignment Algorithms in C++
Dr. Nancy Warter-Perez
May 21, 2002
Outline






Hand out project
Group assignments
References for sequence alignment
algorithms
Board example of Needleman-Wunch
Discussion of LCS Algorithm and how it can
be extended for global alignment (SmithWaterman)
Extensions: local alignment and gap penalties
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
2
Project Group Members

Group 1:


Group 2:


Michael, Hardik, Daisy
Group 4:


Thi, Edain
Group 3:


Bonnie, Eduardo, Sara
Dennis, Ivonne, Patrick
Group 5:

Chuck, Ronny
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
3
Project References






http://www.sbc.su.se/~arne/kurser/swell/pairwise_alignme
nts.html
http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dyna
mic.html
Lectures: Database search (4/16) and Rationale for DB
Searching (5/16)
Computational Molecular Biology – An Algorithmic
Approach, Pavel Pevzner
Introduction to Computational Biology – Maps, sequences,
and genomes, Michael Waterman
Algorithms on Strings, Trees, and Sequences – Computer
Science and Computational Biology, Dan Gusfield
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
4
Classic Papers



Needleman, S.B. and Wunsch, C.D. A General Method
Applicable to the Search for Similarities in Amino Acid Sequence
of Two Proteins. J. Mol. Biol., 48, pp. 443-453,
1970.(http://poweredge.stanford.edu/BioinformaticsArchive/Clas
sicArticlesArchive/needlemanandwunsch1970.pdf)
Smith, T.F. and Waterman, M.S. Identification of Common
Molecular Subsequences. J. Mol. Biol., 147, pp. 195-197,
1981.(http://poweredge.stanford.edu/BioinformaticsArchive/Clas
sicArticlesArchive/smithandwaterman1981.pdf)
Smith, T.F. The History of the Genetic Sequence Databases.
Genomics, 6, pp. 701-707, 1990.
(http://poweredge.stanford.edu/BioinformaticsArchive/ClassicArt
iclesArchive/smith1990.pdf)
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
5
Longest Common
Subsequence (LCS) Problem


Can have insertion and deletions but no
substitutions
Ex: V: ATCTGAT
W: TGCATA
LCS: TCTA
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
6
LCS Problem (cont.)

Similarity score
si-1,j
si,j = max {
si,j-1
si-1,j-1 + 1, if vi = wj
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
7
Indels – insertions and
deletions (e.g., gaps)

alignment is V and W






Alignment A is a 2xl matrix (l >= n,m)
First row of A contains characters of V
interspersed with l-n spaces
Second row of A contains characters of W
interspersed with l-m spaces
Space in first row = insertion
 (UP)
Space in second row = deletion  (LEFT)
Match (no mismatch in LCS)
(DIAG)
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
8
LCS(V,W) Algorithm
for i = 1 to n
si,0 = 0
for j = 1 to n
s0,j = 0
for i = 1 to n
for j = 1 to m
if vi = wj
si,j = si-1,j-1 + 1; bi,j = DIAG
else if si-1,j >= si,j-1
si,j = si-1,j; bi,j = UP
else
si,j = si,j-1; bi,j = LEFT
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
9
Print-LCS(b,V,i,j)
if i = 0 or j = 0
return
if bi,j = DIAG
PRINT-LCS(b, V, i-1, j-1)
print vi
else if bi,j = UP
PRINT-LCS(b, V, i-1, j)
else
PRINT-LCS(b, V, I, j-1)
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
10
Extend LCS to Global
Alignment
si,j
= max {
si-1,j + (vi, -)
si,j-1 + (-, wj)
si-1,j-1 + (vi, wj)
(vi, -) = (-, wj) = - = extend gap penalty
(vi, wj) = score for match or mismatch – can be
fixed, from PAM or BLOSUM
 Modify LCS and PRINT-LCS algorithms to support
global alignment (On board discussion)
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
11
Extend to Local Alignment
si,j
= max {
0
(no negative scores)
si-1,j + (vi, -)
si,j-1 + (-, wj)
si-1,j-1 + (vi, wj)
(vi, -) = (-, wj) = - = extend gap penalty
(vi, wj) = score for match or mismatch – can
be fixed, from PAM or BLOSUM
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
12
Discussion on adding
affine gap penalties

Affine gap penalty


Score for a gap of length x
-( + x)
Where



 > 0 is the insert gap penalty
 > 0 is the extend gap penalty
On board example from
http://www.sbc.su.se/~arne/kurser/swell/pairwise_ali
gnments.html
May 21, 2002
Developing Sequence Alignment
Algorithms in C++
13
Alignment with Gap Penalties
Can apply to global or local (w/ zero) algorithms
si,j
= max {
si,j
= max {
si-1,j - 
si-1,j - ( + )
si1,j-1 - 
si,j-1 - ( + )
si,j
= max {
May 21, 2002
si-1,j-1 + (vi, wj)
si,j
si,j
Developing Sequence Alignment
Algorithms in C++
14
Implementing Global
Alignment Program in C++

Keeping it simple (e.g., without classes or
structures)



Score matrix
Traceback matrix
Simple algorithm:





May 21, 2002
Read in two sequences
Compute score and traceback matrices (modified LCS)
Print alignment score = score[n][m]
Print each aligned sequence (modified PRINT-LCS) using
traceback
For debugging – can also print the score and traceback
matrices
Developing Sequence Alignment
Algorithms in C++
15