03_seq_alignemnt_DP

Download Report

Transcript 03_seq_alignemnt_DP

Pairwise Sequence
Alignment
Pairwise Sequence Alignment
• The most important class of bioinformatics tools – pairwise
alignment of DNA and protein seqs.
alignment 1
alignment 2
Seq. 1 ACGCTGA
ACGCTGA
Seq. 2 A - - CTGT
ACTGT - Seeks alignments  high seq. identity, few mismatchs and gaps
Assumption – the observed identity in seqs. to be aligned is the result
of either random or of a shared evolutionary origin
Identity ≠ similarity
Sequence identity = Homology (a risky assumption)
Sequence identity ≠ Homology
Pairwise Sequence Alignment
Same true alignment arise through different evolutionary events
Scoring scheme: substitution  -1, indel  -5, match  3
indel
Score
9
5
4
4
Figure A Common evolutionary events and their effects on alignment
Pairwise Sequence Alignment
Find the optimal score  the best guess for the true alignment
Find the optimal pairwise alignment of two seqs.  inserted
gaps into one or both of them  maximize the total alignment
score
Dynamic programming (DP) – Needleman and Wunsch (1970),
Smith and Waterman (1980), this algorithm guarantees that
we find all optimal alignments of two seqs. of lengths m and n
BLAST is based on DP with improvement on speed
Prof. Waterman
http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html
Pairwise Sequence Alignment
The score for alignment of i residues of sequence 1 against
j residues of sequence 2 is given by
S (i  1, j  1)  c(i, j )

S (i, j )  max  S (i  1, j )  c(i,)
 S (i, j  1)  c(, j )

where
c(i,j) = the score for alignment of residues i and j and takes the value 3 for a match or
-1 for a mismatch,
c(-,j) = the penalty for aligning a residue with a gap, which takes the value of -5
Pairwise Sequence Alignment
• The entry for S(1,1) is the maximum of the
following three events:
• S(0,0) + c(A,A) = 0 + 3 = 3
[c(A,A) = c(1,1)]
• S(0,1) + c(A, -) = -5 + -5 = -10 [c(A, -) = c(1, -)]
• S(1,0) + c(-, A) = -5 + -5 = -10 [c(- ,A) = c(-, 1)]
• Similarly, one finds S(2,1) as the maximum of
three values: (-5)-1=-6; 3-5=-2; and (-10)-5=-15
 the best is entry is the addition of the C indel
to the A-A match, for a score of -2 (see next
page).
Pairwise Sequence Alignment
The alignment matrix of sequences 1 and 2
A
G
T
C
G
C
A

 35
 30
 25
 20
 15
 10
5
0

 27  19  9  3
1
 22  14  4
2
2
 17  9  1  1
7
 12  4
0
4
3
7
1
5
4
1
2
6
1
4 9
3
 2  7  12  17
 5  10  15  20  25
A
C
T
G
T
S(2,1)
= max {S(1,0) + c(2,1),
S(1,1) + c(2,-), S(2,0) + c(-,1)}
= max { S(1,0) + c(C,A),
S(1,1) + c(C,-), S(2,0) + c(-,A) }
= max { -5-1, 3-5, -10-5 }
= -2
Pairwise Sequence Alignment
Traceback  determine the actual alignment
From the top right hand corner  the (7,5) cell
A
G
T
C
G
C
A

 35
 30
 25
 20
 15
 10
5
0

 27  19  9  3
1
 22  14  4
2
2
 17  9  1  1
7
 12  4
0
4
3
7
1
5
4
1
2
6
1
4 9
3
 2  7  12  17
 5  10  15  20  25
A
C
T
G
T
For example the 1 in the
(7,5) cell could only be
reached by the addition of
the mismatch A-T
ACGCTGA
A - - CTGT
or
ACGCTGA
AC - - TGT
4 matches
1 mismatch
2 indels
Ambiguity – has to do with
which C in seq. 1 aligns
with the C in seq. 2
Pairwise Sequence Alignment
Parameters settings - Gap penalties
• Default settings are the easiest to use but they are not necessarily yield
the correct alignment
• constant penalty  independent of the length of gap, A
• proportional penalty  penalty is proportional to the length L of the gap,
BL (that is what we used in the this lecture)
• affine gap penalty  gap-opening penalty + gap-extension penalty =
A+BL
• There is no rule for predicting the penalty that best suits the alignment
• Optimal penalties vary from seq. to seq.  it is a matter of trial and error
• Usually A > B, because of opening a gap (usually A/B ~ 10)
• Hint: (1) compare distantly related seqs. high A and very low B often
give the best results  penalized more on their existence than on their
length, (2) compare closely related seqs., penalize both of extension and
extension
Exercise - Computing an optimal
sequence alignment
Two score schemes
(1) Gap penalty = -5, mismatch = -1, match =3
(2) Gap penalty = -1, mismatch = -1, match =3
(1) First alignment score = 5*3 + 2*(-1) =13
Second/Third alignment score = 6*3 + 2*(-5) = 8
(2) First alignment score = 5*3 + 2*(-1) =13
Second/Third alignment score = 6*3 + 2*(-1) = 16
A more serious problem – identify the wrong alignment
Exercise
Computing an optimal sequence alignment
Gap penalty = -5
T
A
T
G
C
G
A

 35
13
 30
 25
 20
 15
 10
5
3
0
 5  10  15  20  25  30  35

A
C
G
G
T
A
T
Gap penalty = -1
T
A
T
G
C
G
A

7
16
6
5
4
3
2
1 3
0 1  2  3  4  5  6  7

A C G G T
A T
BLAST (Scoring matrices)
• Dynamic Programming do not provide the user with a measure of
statistical similarity when regions of local similarity when regions of
local similarity are found
• Take into account not just the position-position overlap between two
seqs. but the characteristics of the a.a being aligned  define
scoring matrices
• Protein scoring matrices take three major biological factors into
account:
• Conservation – the numbers within the scoring matrix provide a
way of representing what a.a. are capable of substituting for other
a.a. (characteristics such as charge, size, hydrophobicity)
• Frequency – a.a cannot freely substitute for one another, the
matrices need to reflect how often particular a.a occur among the
entire proteins.
• Evolution – scoring matrices implicitly represent evolutionary
patterns, and matrices can be adjusted to favor the detection of
closely related or more distantly related proteins.
BLAST (Scoring matrices)
Scoring matrices and the Log Odds Ratio
Si , j  log[
qi , j
pi p j
]
where pi[pj] = probability with which a.a i [j] occurs
among all proteins
qi,j = how often the two a.a i and j are seen to align
with one another in MSA of protein families or in
seqs. that are known to have a biological
relationship.
BLAST (PAM matrices)
Amino acid substitution matrix (PAM and BLOSUM)
• Leave most adjustable parameters to the default value except the
scoring matrix
• Box 2.1  a simple scheme for scoring seq. matches and mismatches
(all mismatches received the same penalty)
• Scoring matrix allows some mismatches to be penalized less then
others
• Leucine-isoleucine mismatch < leucine-tryptophan mismatch
• PAM (Point Accepted Mutations) scoring matrices – derived from
closely related species (evolutionary point of view, avoid the
complications of unobserved multiple substitutions at a single position)
• PAM derived from the likelihood of amino acids substitution during the
evolutionary process
• PAM matrices with a smaller number represent shorter evolutionary
distance
• PAM1 – one a.a change per 100 a.a, or roughly 1% divergence
PAM
Asp  Glu
0.95%
BLAST (BLOSUM matrices)
BLOSUM (BLOck SUM) – there are evidence it outperform PAM
• Block  proteins in the same family can be aligned without
introducing a gap (not the individual seqs.)
• So any given protein can contain one or more blocks, corresponding
to each of its functional or structural motif
• With these protein blocks, it is possible to look for substitution
patterns only in the most conserved regions of a protein  block
substitution matrices are generated
• BLOSUM scoring matrix – based on data from distantly related
seqs. (default BLOSUM62 for general use)
• The most commonly used matrices are PAM120, PAM250,
BLOSUM50 and BLOSUM 62
• BLOSUM matrices with a smaller number represent a longer
evolutionary distance
BLAST (BLOSUM matrices)
The BLOSUM62
substitution matrix
Values below zero
indicate amino acid
changes that are
more likely to have
a functional effect
than values of zero
and above.
BLAST (relating PAM to BLOSUM)
PAM250 equivalent to BLOSUM45
PAM160 equivalent to BLOSUM62
PAM120 equivalent to BLOSUM80
Selecting an appropriate scoring matrix
Matrix
Best use
Similarity(%)
BLOSUM90
Short alignments that are
highly similar
70-90
BLOSUM80
Detecting members of a
protein family
50-60
BLOSUM62
Most effective in finding all
potential similarities
30-40
BLOSUM30
Longer alignment of more
divergent seqs.
<30
BLAST (Sensitivity and Specificity)
BLAST (Sensitivity and Specificity)
BLAST (Sensitivity and Specificity)
BLAST (Sensitivity and Specificity)