A Risk Minimization Framework for Information Retrieval

Download Report

Transcript A Risk Minimization Framework for Information Retrieval

Pairwise Sequence Alignment (II)
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 27, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm
Review: Dynamic Programming for LCS
-Edit graph representation
of alignment
-Path = alignment
-Incrementally fill in the
table
-Backtrack to find the best
alignment
The LCS Recurrence Revisited
• The formula can be rewritten by adding zero to
the edges that come from an indel, since the
penalty of indels are 0:
si-1, j-1+1 if vi = wj
si,j =
max
Matching score
si-1, j + 0
si, j-1 + 0
Insertion/deletion score
How do we improve scoring?
How do we improve the
scoring of alignments?
Can we still find an alignment
efficiently?
Outline
• Improve Scoring
– Scoring Matrix
– Affine Gap Penalty
• Variants of Alignment
– Global vs. Local alignment
• Assessing Score Significance
Scoring Matrices
To generalize scoring, consider a (4+1) x(4+1) scoring
matrix δ.
In the case of an amino acid sequence alignment, the
scoring matrix would be a (20+1)x(20+1) size. The
addition of 1 is to include the score with comparison
of a gap character “-”.
This will simplify the scoring algorithm as follows:
si-1,j-1 + δ (vi, wj)
si,j =
max
s i-1,j + δ (vi, -)
s i,j-1 + δ (-, wj)
The same dynamic programming algorithm would still work!
The Global Alignment Problem
Find the best alignment between two strings under a given
scoring matrix
Input : Strings v & w and a scoring matrix δ
Output : Alignment of maximum score
Algorithm: Dynamic programming
si,j =
max
si-1,j-1 + δ (vi, wj)
s i-1,j + δ (vi, -)
s i,j-1 + δ (-, wj)
The only question left is how to define the scoring matrix…
Measuring Similarity
• Measuring the extent of similarity between two
sequences
– Based on percent sequence identity
– Based on conservation
Percent Sequence Identity
• The extent to which two nucleotide or amino
acid sequences are invariant
AC C TG A G – AG
AC G TG – G C AG
mismatch
indel
70% identical
Simple Scoring
• When mismatches are penalized by some
constant –μ, indels are penalized by some
other constant –σ, and matches are rewarded
with +1, the resulting score is:
#matches – μ(#mismatches) – σ (#indels)
Making a Better Scoring Matrix
• Scoring matrices are created based on
biological evidence.
• Alignments can be thought of as two
sequences that differ due to mutations in the
sequence.
• Some of these mutations have little effect on
the organism’s function, therefore some
penalties, δ(vi , wj), will be less harsh than
others.
Scoring Matrix: Example
A
R
N
K
A
5
-2
-1
-1
R
-
7
-1
3
N
-
-
7
0
K
-
-
-
6
• Notice that although
R and K are different
amino acids, they
have a positive score.
• Why? They are both
positively charged
amino acids will not
greatly change
function of protein.
Scoring matrices
• Amino acid substitution matrices
– PAM
– BLOSUM
• DNA substitution matrices
– DNA: less conserved than protein
sequences
– Less effective to compare coding regions at
nucleotide level
– Simple scoring is often used
PAM
•
•
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM1 = 1% average change of all amino acid
positions
– After 100 PAMs of evolution, not every residue will
have changed
• some residues may have mutated several times
• some residues may have returned to their
original state
• some residues may have not changed at all
PAMX
• PAMx = PAM1x
Think of PAM1 as 1-step transitions
and PAM250 as 250-step transitions
– PAM250 = PAM1250
• PAM250 is a widely used scoring matrix:
Ala
Arg
Asn
Asp
Cys
Gln
...
Trp
Tyr
Val
A
R
N
D
C
Q
Ala
A
13
3
4
5
2
3
Arg
R
6
17
4
4
1
5
Asn
N
9
4
6
8
1
5
Asp
D
9
3
7
11
1
6
Cys
C
5
2
2
1
52
1
Gln
Q
8
5
5
7
1
10
Glu
E
9
3
6
10
1
7
Gly
G
12
2
4
5
2
3
His
H
6
6
6
6
2
7
Ile
I
8
3
3
3
2
2
W
Y
V
0
1
7
2
1
4
0
2
4
0
1
4
0
3
4
0
1
4
0
1
4
0
1
4
1
3
5
0
2
4
Leu
L
6
2
2
2
1
3
Lys ...
K ...
7 ...
9
5
5
1
5
1
2
15
0
1
10
BLOSUM
• Blocks Substitution Matrix
• Scores derived from observations of the
frequencies of substitutions in blocks of
local alignments in related proteins
• Matrix name indicates evolutionary distance
– BLOSUMx was created using sequences
sharing no more than x% identity
– E.g., BLOSUM62 <-> 62% identity
The Blosum50 Scoring Matrix
Probability of seeing x aligned with y
Val(x,y)=log(p(x,y)/p(x)p(y))
Probability of seeing x (or y) alone
Deficiency in Scoring of Indels
• A fixed penalty σ is given to every indel:
– -σ when there is 1 indel, -2σ for 2
consecutive indels, -3σ for 3 consecutive
indels, etc.
Can be too severe penalty for a series of 100
consecutive indels
Deficiency in Scoring of Indels (cont.)
• In nature, many times indels come as a unit,
not just at 1 nucleotide at a time.
In nature, this
is more likely.
Normal scoring would
give the same score
for both alignments
Accounting for Gaps
• Gaps- contiguous sequence of spaces in one
of the rows
• Score for a gap of length x is: -(ρ + σx), where
ρ >0 is the penalty for introducing a gap. ρ will
be large relative to σ because you do not want
to add too much of a penalty for extending the
gap.
Affine Gap Penalties
• Gap penalties:
– -ρ-σ when there is 1 indels, -ρ-2σ when
there are 2 indels, -ρ-3σ when there are 3
indels, etc.
– -ρ- x * σ (-gap opening - x gap extensions)
• Somehow reduced penalties (as compared to
naïve scoring) are given to runs of horizontal
and vertical edges
Affine Gap Penalty Recurrences
si,j =
max
si,j =
max
si,j =
max
s i-1,j - σ
s i-1,j –(ρ+σ)
Continue Gap in w (deletion)
Start Gap in w (deletion)
s i,j-1 - σ
Continue Gap in v (insertion)
s i,j-1 –(ρ+σ)
Start Gap in v (insertion)
si-1,j-1 + δ (vi, wj)
Match or Mismatch
s i,j
End deletion
s i,j
End insertion
Once again, the same dynamic programming algorithm would work!
Local vs. Global Alignment
• The Global Alignment Problem tries to find the
longest path between vertices (0,0) and (n,m)
in the edit graph.
• The Local Alignment Problem tries to find the
longest path among paths between arbitrary
vertices (i,j) and (i’, j’) in the edit graph.
Local vs. Global Alignment (cont’d)
• Global Alignment
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | |||
|| | | | | ||||
|
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
• Local Alignment—better alignment to find
conserved segment
tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local Alignments: Why?
• Two genes in different species may be similar
over short conserved regions and dissimilar
over remaining regions.
• Example:
– Homeobox genes have a short region called
the homeodomain that is highly conserved
between species.
– A global alignment would not find the
homeodomain because it would try to align
the ENTIRE sequence
The Local Alignment Problem
• Goal: Find the best local alignment between
two strings
• Input : Strings v, w and scoring matrix δ
• Output : Alignment of substrings of v & w
whose alignment score is maximum among all
possible alignment of all possible substrings
Local Alignment in Edit Graph
Local alignment
Global alignment
Compute a “mini”
Global Alignment to
get Local
The Problem with this Problem
• Problem of this, long run time O(n4):
- There are ~n2 pairs of vertices (i,j)
- For each pair of vertices computing an
alignment takes O(n2) time.
• Solution: Dynamic programming again!
• Question: How do we recursively compute the
best score of any local (as opposed to global)
alignment for each cell in the edit graph?
The Local Alignment Recurrence
• The largest value of si,j over the whole edit
graph is the score of the best local alignment.
• The recurrence is shown below:
0
si,j = max
si-1,j-1 + δ (vi, wj)
s i-1,j + δ (vi, -)
s i,j-1 + δ (-, wj)
Notice there is only
this change from the
original recurrence of
a Global Alignment
This is the well-known Waterman-Smith local alignment algoirthm
Assessing Score Signficance
• In general, larger s  more significant. The
question is how large should s be?
• Factors to be considered:
– Sequence length: longer sequences are expected
to give higher scores
– # sequences in the database: the score of the best
alignment is expected to be higher for a larger DB
– Evolution time: longer evolution causes more
mismatches, making a lower score more significant
• The Challenge is how to quantify all these…
Two Basic Approaches
• The classical approach: Extreme value
distribution (EVD)
– Assume a null (random) model for scores MR
– P(Score > s|MR, a(x, y))=? (a(x,y)=alignment of x, y)
• The Bayesian approach: Model comparison
– Assume two models for a(x,y): random R; aligned: M
– P(M|a(x,y))/P(R|a(x,y))=?
log
p(M | a( x, y))
p(a( x, y) | M )
p( M )
 log
 log
p( R | a( x, y))
p(a( x, y) | M R )
p( M R )
Log-odds score of the alignment
prior
EVD of the Best Score
in Ungapped Local Alignment
•
The number of unrelated local matches with score higher than
S is approximately Poisson distributed, with mean
Parameters
E(S )  Kmne
•
 S
e E ( S ) E ( S )n
p ( N  n) 
n!
Sequence lengths
The probability that there is a match of score greater than S is
p( x  S )  1  p( N  0)  1  e E ( S )
•
•
K and  can be fit using randomly generated data
This gives a way to test statistical significance p(x>21)= 0.01
vs. p(x>21)=0.3
Bayesian Model Comparison
Assumptions:
log
• M is a model for related sequences
• R is a model for unrelated sequences (random)
• Ungapped alignment n=m
• Alignment of each pair is independent
p( M | a ( x, y ))
p ( a ( x, y ) | M )
p( M )
 log
 log
p( R | a( x, y ))
p ( a ( x, y ) | R )
p( R)
n
p ( xi , yi | M )
p( M )
  log
 log
p ( xi , yi | R)
p( R)
i 1
n
p ( xi , yi | M )
p( M )
  log
 log
p ( xi | R) p( yi | R)
p( R)
i 1
BLOSUM
Scoring
n
  log s ( xi , yi )  log
i 1
p( M )
p( R)
log
p( M | a( x, y ))
 0?
p( R | a( x, y ))
n
 log s( x , y )   log
i 1
i
Score
S(x,y)
This partially addresses Q1: how to design the scoring function?
i
p( M )
p( R)
Prior
(Subjective!)
Pairwise Alignment Summary
(Modeling evolution)
Q1: How should we define s?
Q2: How should we define A?
(Application-specific)
X=x1,…,xn
Model: scoring function s: A
X=x1,…,xn
Possible alignments of X and Y: A ={a1,…,ak}
…
Find the best alignment(s)
Y=y1,…,ym
a*  arg max a s(a( X , Y ))
Q3: How can we find a* quickly?
(Dynamic programming)
Q1 & Q4 are related!
S(a*)= 21
Y=y1,…,ym
Q4: Is the alignment biologically
Meaningful or just the best
alignment of two unrelated
sequences?
(Models for scores)
What You Should Know
• Alignment Scoring Methods (Matrix & Gap)
• Global vs. Local alignments
• How the dynamic programming algorithm
solves both local and global alignments with a
number of scoring strategies
• Basic idea in assessing score significance