Transcript 3 -2 -1 -2
Chap. 5 Substitution Matrix
DNA Bases
R
Y
A
G
T
C
W
M
A
G
T
C
K
A
G
T
C
B
A
G
T
C
s
V
A
G
T
C
D
A
G
T
C
H
N
A
G
T
C
A
G
T
C
Models of nucleotide substitution
Purine =>
A
transition
G
transversion
Pyrimidine =>
transversion
T
C
transition
Jukes-Cantor (JC)
Kimura 2P
Kimura
Point Mutation (Substitution)
Point mutation – simplest form of mutation and occurs all over DNA
sequences
Transition – mutation within purine (A,G) or pyrimidine (C,T/U)
Transversion – mutation between nt groups
Effects depend on where mutations occur
Non-coding region – no effect on proteins, and neutral
But may have significant effects if occurring in control region
Coding region
Synonymous substitution when a mutation does not change AA
Non-synonymous
AA is replaced by another
stop codon is introduced
Other Mutations
Indel mutation
Small indels of a single base of a few bases are frequent
Particularly frequent with repeated sequences
GCGC…: insertion of extra GC or deletion cause slight slippage
CAG repeated region in huntingtin protein can expand, causing
Huntington’s disease
Indels can cause frame shift, if indels are not multiples of three
Gene inversion
Whole genes are copied to offspring in reverse direction
Translocation
Whole genes can be deleted from one genome and inserted into
another
Protein Substitution Rates
Example
Six letters: I, K, L, Q, T, V
Seven sequences
A:
B:
C:
D:
E:
F:
G:
TL
TL
TL
I I
T I
TL
TL
KK VQ KT
KK VQ KT
KK I Q KQ
TK LQ KQ
TK LQ KQ
TK I Q KQ
TQ I Q KQ
Protein Substitution Rates
Determine AAij
Count AA j being substituted by i
I
K
L
Q
T
V
I K
- - 2 - 1
1 1
1 -
L
2
-
Q T
1
1 1
1
1 -
V
1
-
A:
B:
C:
D:
E:
F:
G:
TL
TL
TL
I I
T I
TL
TL
KK VQ KT
KK VQ KT
KK I Q KQ
TK LQ KQ
TK LQ KQ
TK I Q KQ
TQ I Q KQ
Protein Substitution Rates
Form an evolutionary tree
A:
B:
C:
D:
E:
F:
G:
TL
TL
TL
I I
T I
TL
TL
KK VQ KT
KK VQ KT
KK I Q KQ
TK LQ KQ
TK LQ KQ
TK I Q KQ
TQ I Q KQ
AA Substitutions
Margaret Dayhoff, 1978
– Studied 71 groups of related proteins
– Tabulated 1,572 changes in Aas
– Accepted Point Mutations (PAM)
• Based on empirically observed AA substituions
• In phylogenetic analysis from inferred common
ancestors, not in AA comparisons
Dayhoff counting
Most freq. subs.: Glu to Asp (both acidic)
Sub. Freq to Score Matrix
1. AA mutation prob.
– Mij : Prob. of original AA j mutating to AA i in one PAM
distance
• PAM distance: unit of evolutionary divergence in which 1% of
AA's have changed between two protein sequences
– Mij = Aij /Ni (Ni count of amino acid i) -- normalized by
the prob. of AA i occurring
– Pij(t) : prob. that a site has AA i at time t when it had j at
time 0
• Pij(dt) = Mij
Mutation Prob. Matrix
•
•
Each entry is scaled by 105
Two most freq. substitutions are highlighted
Sub. Frequency to Score Matrix
2. Mutation Prob. Matrix to Log-Odds Scoring
qij : Prob. of aligning j to i
pi: prob. of observing AA i by chance
Odds – related to probability
odds= prob/(1-prob); prob = odds/(1+odds)
prob = 0: odds = 0
prob = 0.5: odds = 1
prob =0.75: odds = 3 (75:25)
Log-odd: sij = 10*log(qij/ pi)
e.g., sED = 10*log(0.00398/0.062)
Can add log-odd scores
PAM matrix
assumptions
Most important assumption – each AA replacement is independent of
previous mutations at the same position
Matrix can be extrapolated into predicted substitution frequencies
at longer evolutionary distances
PAM1 -- # of changes in 100 mutations
PAM2 -- # of changes in 200 mutations, …
PAM1 multiplied by itself 100 times can represent what one would
expect if there were 100 AA changes per 100 residues – PAM100
All sites are equally mutable independent of neighboring residues
No consideration of conserved blocks or motifs
Forces responsible for sequence evolution over shorter time span are
identical to those for longer time spans
1992 – Jones, Taylor, Thornton (JTT)
59,190 substitutions in all sequences in Swiss-Prot
PAM1 matrix
Mutation Prob. Matrix has Pij(t)
PAM1 matrix for related proteins with 1% mutation
= 99% identical between two sequences
For distantly related proteins, other PAM matrices are used by
successively multiplying PAM1
PAM
% identity
0
100
30
75
80
60
110
50
200
25
250
20
Pairwise Alignment vs. PAM
Distance
• Two sequences 100 AAs
– After 80 PAM distances
(80 mutations), 50 AAs
are different
– After 250 hits, 20 AAs
remain the same
PAM matrices
Closely related: Human vs. Chimpanzee (100% AA identical)
Distantly related: HBA vs. HBB (43% AA identical)
BLOSUM
S.Henikoff and J.G. Henikoff
Devised to perform best in identifying distant relationships
Based on BLOCKS database of aligned protein sequences
BLOcks Substitution Matrix
(# of observed pairs of AA at any position)/(# of pairs expected
from the overall AA frequencies) is computed from regions of
closely related proteins alignable without gaps
To avoid overweighting closely related sequences, groups of
proteins with sequence identities higher than a threshold are
replaced by either a single representative or a weighted average
BLOSUM 62
BLOSUM Threshold set at 62
Protein sequences sharing less than 62% identity
Default BLAST
BLOSUM62
Most popular
diagonal
Score for exact match
W-W: score 11:
because alignment
of W between two
sequences is rare
Off-diagonal
W (tryptophan) – Y (tyrosine): score 2
Positive score – occur more often than by chance, but
replacement is not as good as if W is preserved (2 < 11) or if
Y is preserved (2 < 7)
W – V (Valine): score -3
PAM vs. BLOSUM
Chap. 4: Multiple Sequence Alignment
Pairwise Alignment
Dynamic Programming
Multi-sequence Alignment
FASTA (Fast Alignment)
BLAST (Basic Local Alignment Search Tool)
FASTA
Pearson and Lipman, 1988, Fast Alignment
http://www.ebi.ac.uk/Tools/sss/fasta/
Steps
Perform exact match of a subsequences in the query sequence of at least
length ktup to subsequences of database sequences
ktup: default – 2 AA
Search diagonal regions in the alignment matrix that contain as many of
subsequence matches with small distances between subsequences
Then, see if initial regions can be joined by allowing gaps
Time saved
By performing dynamic programming on initially filtered sequences which
are already similar
Only considers pathways through the alignment matrix that remain within a
band centered around the highest-scoring initial regions
Blast
Exact methods are good and can pick up very distant
relationships
Approximate methods can detect only close
relationships well
Maybe OK when the probe sequence is fairly similar to one or
more sequences in the databank
Take a small k residues in the probe sequence, find all
instances of the k-tuple in the database
For the selected candidate sequences, approximate optimal
alignment is performed
Particularly useful in multi-sequence alignments
Blast (Basic Local Alignment Search Tool)
Program Input
Database
1
blastn
blastp
blastx
tblastn
tblastx
DNA
1
protein
6
DNA
6
protein
36
DNA
DNA
protein
protein
DNA
DNA
DNA can encode six Proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
Blast
Detect
the best region of local alignment between a query
and the target
and if there are other plausible alignments
Computational efficiency comes from “seeding” the
search with a small subset of substrings in the query
Substrings from two sequences may be highly
conserved in biological applications
Temple Smith and Michael Waterman, 1981
Biologically relevant diagonal matches are likely to have a
higher score
Word Length and Threshold
Select word length, w (similar to ktup, default is 3 AA),
and a threshold T
1.
Given word length of w, scan database for words of length w that score
higher than a threshold T
Example: for a human RBP query
…FSGTWYA… (query word in red)
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
According to Blosum62
neighborhood
word hits
> threshold
(T=11)
neighborhood
word hits
< below threshold
GTW
GSW
ATW
NTW
GTY
GNW
GAW
6,5,11
6,1,11
0,5,11
0,5,11
6,5,2
10
9
22
18
16
16
13
Effect of Threshold
You can modify the threshold parameter.
The default value for blastp is 11.
To change it, enter “-f 16” or “-f 5” in the
advanced options of BLAST+.
(To find BLAST+ go to BLAST help download.)
2. Scan the database for entries matching compiled list
3. In each direction, extension terminates when the score falls more than
a certain distance below the best score reached so far
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
extend
Hit!
extend
Identify all exact matches of
k-tuple words (no gaps, no
mismatches)
Extend exact matches in
both directions (no gaps, no
mismatches)
Put extended matches
together with mismatches
and gaps, only in limited
regions containing
preliminary matches
Blast, 1997
• Refined to require two independent hits
• The hits must occur in close proximity to each other.
• With this modification, only one seventh as many
extensions occurred
• greatly speeding the time required for a search
Comparison of Search
Methods
•
FASTA vs. BLAST
– FASTA is more sensitive for DNA-DNA searches, especially for highly diverged
sequences.
– BLAST is better at finding short regions of high similarity, while FAST is better at
finding long regions of lower similarity BLAST will miss similar sequences if they
do not have a single identical word
•
Protein similarity search can find more distant similarities
– DNA has four letters and thus the prob. of chancy matches is much greater
– Protein databanks are much smaller, and searches can be more sensitive
– William Pearson (FASTA author):
• “The number one thing that you should learn is that in general, you should
try not do DNA sequence comparison.”
– Protein-protein search from BLASTX produces more sensitive results than DNADNA search
Effect of word size
For blastn, the word size is typically 7, 11, or 15 (EXACT
match). Changing word size is like changing threshold of
proteins.
w=15 gives fewer matches and is faster than w=11 or w=7.
For megablast, the word size is 28 and can be adjusted
to 64. What will this do? Megablast is VERY fast for
finding closely related DNA sequences!
Word-matching problem
Consider two sequences of lengths N and M
A simple local alignment algorithm looks for the longest exactly matching
word
Score for alignment = the length of the longest match (l)
e.g., l=6
GGATATCCAGCGCTCCT
ATCCGATATCTTGG
Let n denote the length of matching words starting at two random sites (in
red, for example, n = 3 for ‘TCC’)
GGATATCCAGCGCTCCT
ATCCGATATCTTGG
Then, l=max(n)
F(l) has EVD distribution
Word-matching problem -theory
Prob[two bases at randomly selected sites are equal] = a
N: length of matching bases starting at these random sites
P[n ≥ l] = al
Define = -ln(a) = ln(1/a) => P[n ≥ l] = e -l
Compute E[l]
N ways of choosing the first starting sites, and M ways for the 2nd
If NM selected sites are independent, E[l] = NM e -l
However, words starting at different sites overlap, and are not independent
Thus, E[l] = kNM e -l with k < 1
Since P[n < l] = 1- e -l, from the exponential distribution,
P[n = l] = e -l
Prob. of longest matching words of length l:
F(l) = e-(l-u)exp(e-(l –u) ), u = ln(kMN)/
EVD (Gumbel) Distribution
• Thus, F(mmax) does NOT have the same distribution as Gaussian
–
–
–
–
F(mmax) is of an extreme value distribution (EVD) or Gumbel distribution
F(mmax) = *e- (mmax-u)exp(e- (mmax - u))
Single peak skewed to one side
EVD arises whenever dealing with the maximum value taken from a large
number of independent alternatives
• Thus, it is likely to be considerably higher than the value of m obtained
from just two typical sequences
e.g., # of sequences, S=2000; length
of seq., N=200, only two bases C and
G with equal prob.
m has a binomial distribution
mmax is EVD distributed
EVD distribution
•
•
A r.v. x with distribution P(x), and a large sample of S
Find F(xmax)
x
max
– Prob[x <xmax] = - P(x) dx
– F(xmax) = Prob[choose one with xmax and the rest < xmax ]
= S*P(xmax)*{Prob[x <xmax]}S-1
– When P(x) = e-x, Prob[x <xmax] = 1 – e- xmax,
• F(mmax) = S e- xmax (1-e- xmax ) S-1
= Se- xmaxe- Sxmax
((1-a)n exp(-na))
• Set u = ln(S)/ , S = exp( u)
• F(mmax) = e-(xmax-u)exp(e-(xmax –u))
• Single peak at xmax = u
• Width of the peak is controlled by
Significance of matches
• Consider the significance of a match
– The observed value of the top-hit score for the query is mobs
– Prob. of obtaining a value mmax ≥ mobs by chance is given by the area under
the tail of the distribution p(mobs) = 1 – exp(-exp(- (mobs - u)))
– Small p implies that it is less likely the match is to arise by chance (greater
the significance)
– In the same example, mobs = 130, p=3.3% (just big enough to be significant)
• Significance increases as S increases as F(mmax) shifts to the right
Alignment stats (reality)
• BLAST, etc. works by looking for high-scoring local alignments
– When gaps are not allowed, pairwise local alignment scores are shown to
be EVD distributed (Karlin and Altschul, 1990)
– With gaps, scores are believed to be also EVD distributed
– But EVD parameters and u are not known, and has to be computed
empirically
• Once the scoring system and EVD parameters for a given search
algorithm are known, one can estimate the significance of a match
– E: expected number of sequences with a score ≥ observed score S
– E(S) = kMNexp(- S) (N: length of query sequence; M: total length of all the
sequences in the database)
Blast Result: E value
• The expect value E is the number of alignments with scores greater
than or equal to score S that are expected to occur by chance in a
database search.
• An E value is related to a probability value p.
• The key equation describing an E value is:
– E = Kmn e-S
E = Kmn e-S
• This equation is derived from a description of the extreme value
distribution
• S = the score
• E = the expect value = the number of high-scoring segment pairs
(HSPs) expected to occur with a score of at least S
• m, n = the length of two sequences
• , K = Karlin Altschul statistics
Properties
• The value of E decreases exponentially with increasing S (higher S
values correspond to better alignments). Very high scores
correspond to very low E values.
• The E value for aligning a pair of random sequences must be
negative! Otherwise, long random alignments would acquire great
scores
• Parameter K describes the search space (database).
• For E=1, one match with a similar score is expected to occur by
chance. For a very much larger or smaller database, you would
expect E to vary accordingly
From Raw scores to Bit
scores
•
There are two kinds of scores:
– raw scores (calculated from a substitution matrix) and
– bit scores (normalized scores)
• Bit scores are comparable between different searches because they
are normalized to account for the use of different scoring matrices
and different database sizes
• S’ = bit score = (S - lnK) / ln2
• The E value corresponding to a given bit score is:
• E = mn 2 -S’
• Bit scores allow you to compare results between different database
searches, even using different scoring matrices.
E and p
• The expect value E is the number of alignments with scores
greater than or equal to score S that are expected to occur
by chance in a database search.
• A p value is a different way of representing the significance
of an alignment.
• p = 1 - e -E
• Default value of E: 10
• p>0.05 is considered to be significant
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to interpret
than corresponding p values.
E
10
5
2
1
0.1
0.05
0.001
0.0001
p
0.99995460
0.99326205
0.86466472
0.63212056
0.09516258 (about 0.1)
0.04877058 (about 0.05)
0.00099950 (about 0.001)
0.0001000
Two problems standard BLAST cannot solve
1. Use human beta globin as a query against human RefSeq
proteins, and blastp does not “find” human myoglobin. This is
because the two proteins are too distantly related. PSI-BLAST
at NCBI as well as hidden Markov models easily solve this
problem.
2. How can we search using 10,000 base pairs as a query, or
even millions of base pairs? Many BLAST-like tools for
genomic DNA are available such as PatternHunter,
Megablast, BLAT, and BLASTZ.
Page 141
Position specific iterated BLAST:
PSI-BLAST
The purpose of PSI-BLAST is to look deeper into
the database for matches to your query protein
sequence by employing a scoring matrix that is
customized to your query.
Page 146
PSI-BLAST
PSI (Position Specific Iterated) BLAST – Altschul et al., 1997
Use the original BLAST algorithm and retrieve database sequences with
significant matches (p < 0.01)
Multiple alignment is performed
Place all locally aligned sections of the database sequences below the
query sequence
When a gap is inserted in the query seq., corresponding residue in
database seq. is removed, so that all seq.’s are of the same length =>
for speed and simplicity
Multiple sequences are then used as input to the 2nd run
Use PSSM (Position Specific Scoring Matrix) for scoring
Score is dependent on the frequencies of the residues n the column of
the alignment (V is more likely to be aligned with the column with
many V’s or other hydrophobic)
Repeat the process until no more new sequences are added
Search results
• During iteration in PSI Blast,
– New distantly related sequences can be found, which are not detected in
straightforward Blast
– Due to the extra info in the aligned group of sequences not in any one sequence
– On the other hand, by adding sequences, the range of sequence becomes too
broad
• Alignment may end up having little relationship to the original query
• Ranking of hits gives some info about the degree of relatedness
– But, the top hits are not necessarily the most meaningful in terms of evolution
– Several top hits to human genes were from bacteria, which led to claims of
horizontal gene transfer (dis-proved by phylogenetic methods)
– Frequently top hits are not the closest relatives
PSI-BLAST is performed in five steps
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment then
creates a “profile” or specialized position-specific
scoring matrix (PSSM)
Page 146
Inspect the blastp output to identify empirical “rules” regarding amino acids
tolerated at each position
R,I,K
C
D,E,T
K,R,T
N,L,Y,G
1 M
2 K
3 W
4 V
5 W
6 A
7 L
8 L
9 L
10 L
11 A
12 A
13 W
14 A
15 A
16 A
...
37 S
38 G
39 T
40 W
41 Y
42 A
A
-1
-1
-3
0
-3
5
-2
-1
-1
-2
5
5
-2
3
2
4
2
0
0
-3
-2
4
R N D C Q E G H I L K M F P
-2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3
1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1
-3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4
20 amino acids
-3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3
-3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4
-2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1
-2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3
-3 -3 -4 -1 -3 -3 -4 -3 2 2 -3 1 3 -3
-3 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3
-2 -4 -4 -1 -2 -3 -4 -3 2 4 -3 2 0 -3
-2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1
-2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1
all the amino acids
-3 -4 -4 -2 -2 -3 -4 -3 1 4 -3 2 1 -3
from
position
-2
-1 -2
-1 -1 -21 to
4 -2 -2 -2 -1 -2 -3 -1
-1 0 -1 -2 2 0 2 -1 -3 -3 0 -2 -3 -1
the end of your PSI-2 -1 -2 -1 -1 -1 3 -2 -2 -2 -1 -1 -3 -1
BLAST query
-1 0 -1 -1 0 0
protein
-3 -1 -2 -3 -2 -2
-1 0 -1 -1
-3 -4 -5 -3
-2 -2 -3 -3
-2 -2 -2 -1
-1
-2
-2
-1
-1
-3
-2
-1
0
6
-2
-3
-3
0
-1
-2
-2
-3
2
-2
-2
-4
-1
-3
-2
-2
-3
-4
-1
-2
-1
-2
0
-2
-1
-3
-2
-1
-2
-3
-1
-2
-1
-1
-3
-4
-2
1
3
-3
S
-2
0
-3
-2
-3
1
-3
-2
-3
-3
1
1
-3
1
3
1
T
-1
-1
-3
0
-3
0
-1
-1
-1
-1
0
0
-2
-1
0
0
W
-2
-3
12
-3
12
-3
-2
-2
-2
-2
-3
-3
7
-3
-3
-3
Y
-1
-2
2
-1
2
-2
-1
0
-1
-1
-2
-2
0
-3
-2
-2
V
1
-3
-3
4
-3
0
1
3
2
1
0
0
0
-1
-2
-1
-1 4 1 -3
-2 0 -2 -3
-1 1 5 -3
-4 -3 -3 12
-3 -2 -2 2
-1 1 0 -3
-2
-3
-2
2
7
-2
-2
-4
0
-3
-1
0
1 M
2 K
3 W
4 V
5 W
6 A
7 L
8 L
9 L
10 L
11 A
12 A
13 W
14 A
15 A
16 A
...
37 S
38 G
39 T
40 W
41 Y
42 A
A
-1
-1
-3
0
-3
5
-2
-1
-1
-2
5
5
-2
3
2
4
R
-2
1
-3
-3
-3
-2
-2
-3
-3
-2
-2
-2
-3
-2
-1
-2
N
-2
0
-4
-3
-4
-2
-4
-3
-4
-4
-2
-2
-4
-1
0
-1
D
-3
1
-5
-4
-5
-2
-4
-4
-4
-4
-2
-2
-4
-2
-1
-2
C Q E G H I L K M
-2 -1 -2 -3 -2 1 2 -2 6
-4 2 4 -2 0 -3 -3 3 -2
-3 -2 -3 -3 -3 -3 -2 -3 -2
-1 -3 -3 -4 -4 3 1 -3 1
-3 -2 -3 -3 -3 -3 -2 -3 -2
-1 -1 -1 0 -2 -2 -2 -1 -1
-1 -2 -3 -4 -3 2 4 -3 2
-1 -3 -3 -4 -3 2 2 -3 1
-1
-2 -3
-4 a
-3given
2 4 -3 2
note
that
-1 -2 -3 -4 -3 2 4 -3 2
amino
as-1
-1
-1 -1 acid
0 -2 (such
-2 -2 -1
-1
-1 -1 0in
-2your
-2 -2 -1 -1
alanine)
-2 -2 -3 -4 -3 1 4 -3 2
query
-1
-1 -2 protein
4 -2 -2 can
-2 -1 -2
-2
2 0 different
2 -1 -3 -3 0 -2
receive
-1 -1 -1 3 -2 -2 -2 -1 -1
F
0
-4
1
-1
1
-3
0
3
0
0
-3
-3
1
-3
-3
-3
P
-3
-1
-4
-3
-4
-1
-3
-3
-3
-3
-1
-1
-3
-1
-1
-1
W
-2
-3
12
-3
12
-3
-2
-2
-2
-2
-3
-3
7
-3
-3
-3
Y
-1
-2
2
-1
2
-2
-1
0
-1
-1
-2
-2
0
-3
-2
-2
V
1
-3
-3
4
-3
0
1
3
2
1
0
0
0
-1
-2
-1
2
0
0
-3
-2
4
-1
-3
-1
-3
-2
-2
0
-1
0
-4
-2
-2
-1
-2
-1
-5
-3 -3 -2 -2 -3 2 -2 -1 -2 -1
-2 -1 -1 -1 0 -2 -2 -2 -1 -1
-3
-4
-2
1
3
-3
-1 4 1 -3
-2 0 -2 -3
-1 1 5 -3
-4 -3 -3 12
-3 -2 -2 2
-1 1 0 -3
-2
-3
-2
2
7
-2
-2
-4
0
-3
-1
0
scores for matching
-1
0 0 0 -1 -2 -3 0 -2
alanine—depending
-3 -2 -2 6 -2 -4 -4 -2 -3
on-1the
in-1 -1
-1
-1 position
-2 -2 -1 -1
-3
-2protein
-3 -3 -3 -3 -2 -3 -2
the
S
-2
0
-3
-2
-3
1
-3
-2
-3
-3
1
1
-3
1
3
1
T
-1
-1
-3
0
-3
0
-1
-1
-1
-1
0
0
-2
-1
0
0
1 M
2 K
3 W
4 V
5 W
6 A
7 L
8 L
9 L
10 L
11 A
12 A
13 W
14 A
15 A
16 A
...
37 S
38 G
39 T
40 W
41 Y
42 A
A
-1
-1
-3
0
-3
5
-2
-1
-1
-2
5
5
-2
3
2
4
R
-2
1
-3
-3
-3
-2
-2
-3
-3
-2
-2
-2
-3
-2
-1
-2
N
-2
0
-4
-3
-4
-2
-4
-3
-4
-4
-2
-2
-4
-1
0
-1
D
-3
1
-5
-4
-5
-2
-4
-4
-4
-4
-2
-2
-4
-2
-1
-2
2
0
0
-3
-2
4
-1
-3
-1
-3
-2
-2
0
-1
0
-4
-2
-2
-1
-2
-1
-5
-3
-2
C Q E G H I L K M
-2 -1 -2 -3 -2 1 2 -2 6
-4 2 4 -2 0 -3 -3 3 -2
-3 -2 -3 -3 -3 -3 -2 -3 -2
-1 -3 -3 -4 -4 3 1 -3 1
-3 -2 -3 -3 -3 -3 -2 -3 -2
-1 -1 -1 0 -2 -2 -2 -1 -1
-1 -2 -3 -4 -3 2 4 -3 2
-1 -3 -3 -4 -3 2 2 -3 1
-1 note
-2 -3 that
-4 -3a given
2 4 -3 2
-1 -2 -3 -4 -3 2 4 -3 2
-1 amino
-1 -1 0acid
-2 -2(such
-2 -1 as
-1
-1 tryptophan)
-1 -1 0 -2 -2in
-2your
-1 -1
-2 -2 -3 -4 -3 1 4 -3 2
can
-1 query
-1 -2 protein
4 -2 -2 -2
-1 -2
-2 receive
2 0 2 different
-1 -3 -3 0 -2
-1 -1 -1 3 -2 -2 -2 -1 -1
F
0
-4
1
-1
1
-3
0
3
0
0
-3
-3
1
-3
-3
-3
scores for matching
-1 tryptophan—
0 0 0 -1 -2 -3 0 -2 -3
-3 -2 -2 6 -2 -4 -4 -2 -3 -4
on-1the
-1 depending
-1 -1 -2 -2 -1
-1 -1 -2
-3 position
-2 -3 -3 -3
-2 -3 -2 1
in-3
the
-3 -2 -2 -3 2 -2 -1 -2 -1 3
-1 protein
-1 -1 0 -2 -2 -2 -1 -1 -3
P
-3
-1
-4
-3
-4
-1
-3
-3
-3
-3
-1
-1
-3
-1
-1
-1
S
-2
0
-3
-2
-3
1
-3
-2
-3
-3
1
1
-3
1
3
1
T
-1
-1
-3
0
-3
0
-1
-1
-1
-1
0
0
-2
-1
0
0
W
-2
-3
12
-3
12
-3
-2
-2
-2
-2
-3
-3
7
-3
-3
-3
Y
-1
-2
2
-1
2
-2
-1
0
-1
-1
-2
-2
0
-3
-2
-2
V
1
-3
-3
4
-3
0
1
3
2
1
0
0
0
-1
-2
-1
-1 4 1 -3
-2 0 -2 -3
-1 1 5 -3
-4 -3 -3 12
-3 -2 -2 2
-1 1 0 -3
-2
-3
-2
2
7
-2
-2
-4
0
-3
-1
0
PSI-BLAST is performed in five steps
1. Select a query and search it against a protein database
2. PSI-BLAST constructs a multiple sequence alignment then creates a
“profile” or specialized position-specific scoring matrix (PSSM)
3. The PSSM is used as a query against the database
4. PSI-BLAST estimates statistical significance (E values)
1. Repeat steps [3] and [4] iteratively, typically 5 times.
At each new search, a new profile is used as the query
Page 146
Multiple Sequence Alignment Example: k=3
With three sequences v, w, and
u and 3D δ (score of a column
with x,y, and z)
In global alignment
si, j
si-1, j + δ(vi, -)
= max [ si, j-1 + δ(-, wj) ]
si-1, j-1 + δ(vi, wj)
In multiple sequences
si, j,k
si-1, j,k + δ(vi, -, -)
si, j-1,k + δ(-, wj , -)
si, j,k-1 + δ(-, -, uk)
= max [ si-1, j-1,k + δ(vi, wj, -) ]
si-1, j,k-1 + δ(vi, -, uk)
si, j-1,k-1 + δ(vi, wj, -)
si-1, j-1,k-1 + δ(vi, wj, uk)
Multiple Sequence Alignment
Time complexity is O((2n)k)
Heuristics 1
Compute all (k2) optimal pairwise alignments, and combines them
Does not work all the time
Multiple Sequence Alignment
Heuristics 2: Greedy progressive multiple alignment
Select two string with greatest similarities
Merge the two into a new string
Works well for very close sequences
Maybe dependent upon two seed sequences
Clustal uses this approach
For next week
Read
www.cs.uml.edu/~kim/580/review_Annotation.pdf
www.cs.uml.edu/~Kim/580/review_polII_11_Kadonaga.pdf