Transcript bioinfo4

Alignment Class III
We continue where we stopped last
week: FASTA – BLAST
FASTA-Stages
1. Find k-tups in the two sequences (k=1,2 for
proteins, 4-6 for DNA sequences)
2. Score and select top 10 scoring “local
diagonals”
a.
b.
c.
For proteins, each k-tup found is scored using
the PAM250 matrix
For DNA, the number of k-tups found
Penalize intervening gaps
Finding k-tups
position 1 2 3 4 5 6 7 8 9 10 11
protein 1 n c s p t a . . . . .
protein 2 . . . . . a c s p r k
position in
offset
amino acid
protein A protein B
pos A - posB
----------------------------------------------------a
6
6
0
c
2
7
-5
k
11
n
1
p
4
9
-5
r
10
s
3
8
-5
t
5
----------------------------------------------------Note the common offset for the 3 amino acids c,s and p
A possible alignment is thus quickly found protein 1 n c s p t a
| | |
protein 2 a c s p r k
FASTA, K-tups with common
offset
FASTA-Stages
3. Rescan top 10 regions, score with PAM250
(proteins) or DNA scoring matrix. Trim off the
ends of the regions to achieve highest scores.
4. Try to join regions with gapped alignments. Join
if similarity score is one standard deviation above
average expected score
5. After finding the best initial region, FASTA
performs a global alignment of a 32 residue wide
region centered on the best initial region, and uses
the score as the optimized score.
BLAST
 Basic Local Alignment Search Tool

Altschul et al. 1990,1994,1997
 Heuristic method for local alignment
 Designed specifically for database searches
 Idea: Good alignments contain short lengths
of exact matches
Blast Application
 Blast is a family of programs: BlastN, BlastP, BlastX,
tBlastN, tBlastX Query:
DNA
Protein





Database:
DNA
Protein
BlastN - nt versus nt database
BlastP - protein versus protein database
BlastX - translated nt versus protein database
tBlastN - protein versus translated nt database
tBlastX - translated nt versus translated nt database
Mathematical Basis of BLAST
 Model matches as a sequence of coin tosses
 Let p be the probability of a “head”

For a “fair” coin, p = 0.5
 (Erdös-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
R = log1/p (n).
 Example: Suppose n = 20 for a “fair” coin
R=log2(20)=4.32
 Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
Mathematical Basis of BLAST
 To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.
AATCAT
HTHHHT
ATTCAG
 For DNA, the probability of a “head” is 1/4
 Same logic applies to amino acids
Mathematical Basis of BLAST
 So, for one particular alignment, the Erdös-Rényi
property can be applied
 What about for all possible alignments?

Consider that sequences are being shifted back and forth,
dot matrix plot
 The expected length of the longest match is
R=log1/p(mn)
where m and n are the lengths of the two sequences.
Steps of BLAST
1. Filter out low-complexity regions


K  1 / L log N  L! /  ni !
i


where L is length, N is alphabet size, ni is the
number of letter i appearing in sequence.
Example: AAAT
K=1/4 log4(24/(3!*1!*0!*0!))=0.25
Steps of BLAST
2. Query words of length 3 (for proteins) or 11 (for
DNA) are created from query sequence using a
sliding window
MEFPGLGSLGTSEPLPQFVDPALVSS
MEF
EFP
FPG
PGL
GLG
Steps of BLAST
3. Using BLOSUM62 (for proteins) or scores of
+5/-4 (DNA, PAM40), score all possible words of
length 3 or 11 respectively against a query word.
4. Select a neighborhood word score threshold (T)
so that only most significant sequences are kept.
Approximately 50 hits per query word.
5. Repeat 3 and 4 for each query word in step 2.
Total number of high scoring words is
approximately 50 * sequence length.
Steps of BLAST
6. Organize the high-scoring words into a search
tree
M
E
E
F
G
P
7. Scan each database sequence for match to highscoring words. Each match is a seed for an
ungapped alignment.
Steps of BLAST
8. (Original BLAST) extend matching words to the
left and right using ungapped alignments.
Extension continues as long as score increases or
stays same. This is a HSP (high scoring pair).
(BLAST2) Matches along the same diagonal
within a distance A of each other are joined and
then the longer sequence extended as before.
Steps of BLAST
9. Using a cutoff score S, keep only the
extended matches that have a score at least
S.
10. Determine statistical significance of each
remaining match (from last time).
11. Try to extend the HSPs if possible.
12. Show Smith-Waterman local alignments.
Information theory
Shanon Entropy and information
Entropy
 X: discrete Random Variable (RV), p(X)
 Entropy (or self-information)
H(p)  H(X)    p(x)log2p(x)
xX
 Entropy measures the amount of information in a
RV
Entropy (cont)
H(X)    p(x)log2p(x)
xX
1
  p(x)log2
p(x)
xX

1 

 E log2

p(x)


H(X)  0
H(X)  0  p(X)  1
i.e when the value of X
is determinate, hence
providing no new
information
Joint Entropy
 The joint entropy of 2 RV X,Y is the amount
of the information needed on average to
specify both their values
H(X, Y)    p(x, y)log p(X, Y) 
xX yY
Conditional Entropy
 The conditional entropy of a RV Y given another X,
expresses how much extra information one still
needs to supply on average to communicate Y given
that the other party knows X
H(Y | X)   p(x)H(Y | X  x)
xX
   p(x)  p(y | x)log p(y | x) 
xX
yY
    p(x, y)log p(y | x)    Elog p(Y | X) 
xX yY
Chain Rule
H(X, Y)  H(X)  H(Y | X)
H(X1,..., Xn )  H(X1 )  H(X2 | X1 )  ....  H(Xn | X1,...Xn1 )
Mutual Information
H(X, Y)  H(X)  H(Y | X)  H(Y)  H(X | Y)
H(X) - H(X | Y)  H(Y) - H(Y | X)  I(X, Y)
 I(X,Y) is the mutual information between X and Y.
It is the reduction of uncertainty of one RV due to
knowing about the other, or the amount of
information one RV contains about the other
Mutual Information (cont)
I(X, Y)  H(X) - H(X | Y)  H(Y) - H(Y | X)
 I is 0 only when X,Y are independent:
H(X|Y)=H(X)
 H(X)=H(X)-H(X|X)=I(X,X) Entropy is the
self-information
Kullback-Leibler Divergence
 Relative entropy or KL (Kullback-Leibler)
divergence
p(x)
D(p|| q)   p(x)log
q(x)
xX

p(X) 

 Ep log
q(X) 

Scoring matrices
Identity
PAM
BLOSUM
Scoring Matrices Types
• Identity matrix – exact matches receive one score and
non-exat matches a different score (say 1 and 0, or 6 and
–1 for local alignment.).
• Mutation data matrix – a scoring matrix compiled based
on observation of protein point mutation (PAM,
BLOSUM).
• Physical properties matrix – amino acids with with
similar properties (e.G. hydrophobicity ) receive high
score.
• Genetic code matrix – amino acids are scored based on
similarities in the coding triple (codons).
Substitution Matrix
 Amino acids substitute easily for another due to similar
physicochemical properties

Isoleucine for Valine (both small, hydrophobic)
Serine for Threonine (both polar)

Such changes – “conservative”

 Thus, need a way to increase sensitivity of the alignment
algorithm

Solution – substitution matrix
 Therefore, we need a range of values that depend on the
nature of sequences being compared
 Identical amino acids > Conservative substitutions >
Nonconservative substitutions
Choice of scoring matrix is
dictated by the alignment goals
• Two proteins are homologous if (and only if) they are
evolutionarily related (have a common ancestor)
• Homologous proteins are likely to have related functions
(and have the same fold)
• Scoring matrices must in some way model our
understanding of protein evolution.
• Based on the result of the search we have to be able to
decide if the discovered sequence similarity could happen by
chance or is a signature of likely homology.