Slide - Biomedical Informatics

Download Report

Transcript Slide - Biomedical Informatics

Lecture 3: Scoring Matrices and
Multiple Sequence Alignment
BMI/IBGP 730
Kun Huang
Department of Biomedical Informatics
Ohio State University
Review of sequence alignment
Scoring matrices
Multiple sequence alignment
Tools and examples
Dynamic programming
•
The name comes from an operations research task, and
has nothing to do with writing programs. Programming –
use tabular structure for computing.
3
18
20
5
A
2
11
25
17
B
Dynamic programming matrix
•
•
Each cell has the score for the
best aligned sequence prefix up
to that position.
Example:
ATGCT vs. ACCGCT
Match: +2, mismatch: 0,
gap: -1
Gap
A
T
G
C
T
Gap
0
-1
-1
-1
-1
-1
A
-1
2
0
0
0
0
C
-1
0
0
0
2
0
C
-1
0
0
0
2
0
G
-1
0
0
2
0
0
C
-1
0
0
0
2
0
T
-1
0
2
0
0
2
Matching matrix, NOT the
dynamical programming matrix!
Dynamic programming matrix
0
0
A
A
AT
AC
AT
AC
A_ T
AC_
AT
A_
AT _
A_C
AT
AC
A_
AT
1
2
3
4
5
6
Gap
A
T
G
C
T
1
Gap
0
-1
-2
-3
-4
-5
2
A
-1
2 (2)
1(0)
0
0
0
3
C
-2
1 (0)
2 (0)
1
2
1
4
C
-3
0
1 (0)
2
3
2
5
G
-4
0
0
3 (2)
2
3
6
C
-5
0
0
2
5 (2)
4
7
T
-6
0
2
1
4
7 (2)
Optimal alignment by traceback
• We “traceback” a path that gets us the highest score. If
we don't have “end gap” penalties, then take
any path from the
0
1
2
3
4
5
6
last row or column
Gap
A
T
G
C
T
0
to the first.
0
-1
-2
-3
-4
-5
1 Gap
• Otherwise we need
A
-1
2 (2)
1(0)
0
-1
-2
2
to include the top
and bottom corners
C
-2
1 (0) 2 (0)
1
2
1
3
AT - GCT
ACCGCT
A - TGCT
ACCGCT
4
C
-3
0
1 (0)
2
3
2
5
G
-4
-1
0
3 (2)
2
3
6
C
-5
-2
-1
2
5 (2)
4
7
T
-6
-3
0
1
4
7 (2)
Dynamic programming
•
Global alignment – an alignment of two or more sequences
that matches as many characters as possible in all of the
sequences. Needleman-Wunch algorithm
•
Local alignment – an alignment that includes only the best
matching, highest-scoring regions in two or more
sequences. Smith-Waterman algorithm
•
Difference – all the scores are kept in the dynamical
programming matrix for global alignment; only the positive
scores are kept in the dynamical programming matrix for
local alignment, the negative ones are converted to zero.
BLAST – Algorithm Intuition
The BLAST algorithm.The BLAST algorithm is a heuristic search method that seeks
words of length W (default = 3 in blastp) that score at least T when aligned with the
query and scored with a substitution matrix. Words in the database that score T or
greater are extended in both directions in an attempt to find a locally optimal ungapped
alignment or HSP (high scoring pair) with a score of at least S or an E value lower than
the specified threshold. HSPs that meet these criteria will be reported by BLAST,
provided they do not exceed the cutoff value specified for number of descriptions and/or
alignments to report.
BLAST – Algorithm Intuition
Databases are pre-indexed
by the words.
Without gaps:
Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W.,
Lipman, D. J., J. Mol. Biol.
(1990) 215:403-410
With gaps:
Altschul, S. F., Madden, T.
L., Schaffer, A. A., Zhang,
J., Zhang, Z., Miller, W.,
Lipman, D. J., Nucleic Acids
Research (1997)
25(17):3389-3402
http://www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure10.pdf
An alignment score
•
•
An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted for
each gap.
Gap penalties are usually "affine" meaning that the
penalty for a gap computed as a linear function of its
length x:
gap penalty = ax + b.
Match Gap start + continuation
a b c - - d
penalty
score
a c c e f d
9 2 7
6 => 24 - (10 + 2) = 12
Alignment
Score
How we pick substitution scores?
• Problem: how to score each residue pair in two aligned
sequences?
– intuitively the score should reflect the likelihood that two
sequences are aligned because they related as
opposed to a random alignment
• We assume the letter a occurs independently with
frequency qa
– random match scenario: the likelihood that the letters a
and b occur by chance is qaqb
– true match scenario: the aligned pair a,b occurs with
joint probability pab
Substitution scores
• The likelihood that the residue pair (a, b)
occurs as an aligned pair, as opposed to
an unaligned pair, is pab/qaqb
• If we use the logarithm of this ratio (logodds ratio), the score for the entire
alignment is additive
p

P( x, y | Match)

P( x, y | Random )  q  q
xi yi

i
xi
i
i
yi
i
pxi yi
qxi q yi
 pab 

s(a, b)  log 
 qa qb 
S   s( xi , yi )
i
Substitution matrices
• As we wanted, the total score is the sum of individual
scores s(a, b) for each aligned pair
• The s(a, b) can be arranged in a matrix called
substitution or score matrix
– for proteins, this is a 20 x 20 matrix
• A number of such matrices have been proposed, each
computed according to different strategies
Alignment as An Optimization Problem
Optimization criteria / cost function
•
•
What sort of alignment should be considered
Scoring system
• Additive model
• Based on probability compared with random
sequence (PAM, BLOSUM)
• Assumption of independence
• More complicated cases
Gap penalty – linear (s = -gd)
affine (s = -d – (g-1)e)
Clusters of amino acids
Hydrophobic
Polar
A
V
G
P
A
I
L
Q
T
S
N
D
E
Acid
Y
F
C
M
W
H
Aromatic
K
R
Basic
Match scores
• Match scores are often
•
calculated on the basis
of the frequency of
particular mutations in
very similar sequences.
We can then transform
substitution frequencies
into log odds scores,
and then normalize.
Example of construction of a
phylogenetic tree
Organism A
Organism B
Organism C
A W T V A A A V R T S I
A Y T V A A A V R T S I
A W T V A A A V L T S I
A
B
C
W to Y
L to R
• Assumptions:
– high similarity => each position should have only
changed once during evolution
– number of changes proportional to evolutionary
distance
The PAM (point accepted mutation)
approach
• Amino acid substitutions were estimated using 1572 changes
in 71 groups of protein sequences with at least 85% similarity
– reasons for requiring such high degree of similarity:
• to exclude possibility that observed substitutions reflect
two successive changes instead of just one
• to make possible building phylogenetic trees for each
group and reconstruct probable history of amino acid
changes
• The number of changes of every amino acid a into every b
was the base for computing the scores
– two-step procedure:
• # of changes -> relative mutability (relative number of
changes)
• relative mutability -> scores
Computing relative mutability in PAM
• Problem: how to account for differences across groups in
sequence composition, mutation rate?
– a normalization factor called exposure to mutation for
each amino acid was computed for each group
– the # of changes per amino acid (in each group) was
divided by this factor
– the relative frequencies thus obtained were then
summed for all groups (sum called relative mutability)
• Exposure to mutation of a defined as Pa * K
– Pa : frequency of occurrences of a in group
– K : total # of all changes in group per 100 sites
Deriving scores in PAM
• Relative mutability used to compute PAM1
– ma = fa/(100*pa*K)
– Mab = (fab/fa) * ma
– PPhe<->Tyr = 260/1572 * relative mutability of Phe *
fraction of Phe<->Tyr out of 260 Phe changes
– Repeat for all other 18 PPhe<->whatever
– The resulting 20 transition probabilities were
normalized so that their sum was 1
– To obtain the scores, each Pab probability was
• divided by Pb (likelihood, odds ratio)
• the log taken, multiplied by 10 and rounded
PAM matrices and evolutionary distance
• All these computations were based on an amount of
evolution corresponding to 1 mutation in 100 amino
acids on average
– called a 1 PAM evolutionary distance
• Assuming a Markov model for amino acid substitutions
the matrices for an evolutionary distances of N can be
computed as PAM1N
– Markov model assumes that every change is
independent of the previous changes at that site
– example: PAM250 is computed as PAM1250
Alignment as An Optimization Problem
Optimization criteria / cost function
•
•
What sort of alignment should be considered
Scoring system
• Additive model
• Based on probability compared with random
sequence (PAM, BLOSUM)
• Assumption of independence
• More complicated cases
Gap penalty – linear (s = -gd)
affine (s = -d – (g-1)e)
The PAM250 scoring matrix
BLOSUM
• The BLOSUM matrix approach is based on the study of
~2000 conserved amino acid patterns
– protein were grouped in 500+ families based on similarity
of biochemical function
– the patterns were found by looking for motifs of type d1 aa1
d2 aa2 d3 aa3, where di are stretches of up to 24 amino
acids located in all sequences
– These initial patterns were organized in larger ungapped
regions (blocks) of size between 3 and 60 amino acids
– The multiple alignment of these blocks is the basis for
determining the probabilities of amino acid changes
– To better balance the count, sequences with more than
K% similarity were grouped together in one sequence
before counting the amino acid substitutions in the multiple
alignment
• The version of the matrix thus derived is called
BLOSUMK
BLOSUM62
•
•
BLOSUM62 is widely used
for protein sequence
alignments
Differences with PAM:
– no assumptions on
evolutionary model
– based on scoring at
conserved positions in
blocks, rather than in
whole sequences
– based on analysis of
larger number of families
Which scoring matrix to use?
• For protein match scores, two main options
–
–
PAM based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives.
BLOSUM based on local alignments in much more
diverse sequences
• Picking the right distance is important, and
may be hard to do. BLOSUM seems to
work better for more evolutionarily distant
sequences. BLOSUM62 is a good default.
Picking gap penalties
• Many different possible forms:
– Most common is affine
(gap open + gap continue penalities)
– More complex penalties have been proposed.
• Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences the
gap penalty
• Most alignment programs suggest appropriate
penalties for each match score option.
The significance of an alignment
• Significance testing is the branch of statistics that is
concerned with assesing the probability that a
particular result could have occurred by chance.
• How do we calculate the probability that an alignment
occurred by chance?
– Either with a model of evolution, or
– Empirically, by scrambling our sequences and
calculating scores on many randomized sequences.
Multiple Sequence Alignment
• Often applied to proteins
• Proteins that are similar in sequence are
often similar in structure and function
• Sequence changes more rapidly in
evolution than does structure and function.
Overview of Methods
• Dynamic programming – too computationally expensive
to do a complete search; uses heuristics
• Progressive – starts with pair-wise alignment of most
similar sequences; adds to that
• Iterative – make an initial alignment of groups of
sequences, adds to these (e.g. genetic algorithms)
• Locally conserved patterns
• Statistical and probabilistic methods
Dynamic Programming
• Computational complexity – even worse than for pairwise alignment because we’re finding all the paths
through an n-dimensional hyperspace (We can picture
this in 2 or 3 dimensions.)
• Can align about 7 relatively short (200-300) protein
sequences in a reasonable amount of time; not much
beyond that
A Heuristic for Reducing the Search
Space in Dynamic Programming
• Let’s picture this in 3 dimensions. It generalizes to n.
• Consider the pair-wise alignments of each pair of
sequences.
• Create a phylogenetic tree from these scores.
• Consider a multiple sequence alignment built from the
phylogenetic tree.
• These alignments circumscribe a space in which to
search for a good (but not necessarily optimal) alignment
of all n sequences.
Phylogenetic Tree as a Guilding Tree
• Dynamic programming uses a phylogenetic tree
to build a “first-cut” MSA.
• The tree shows how protein could have evolved
from shared origins over evolutionary time.
Dynamic Programming -- MSA
• Create a phylogenetic tree based on pair-wise
alignments (Pairs of sequences that have the best
scores are paired first in the tree.)
• Do a “first-cut” msa by incrementally doing pair-wise
alignments in the order of “alikeness” of sequences as
indicated by the tree. Most alike sequences aligned first.
• Use the pair-wise alignments and the “first-cut” msa to
circumscribe a space within which to do a full msa that
searches through this solution space.
• The score for a given alignment of all the sequences is
the sum of the scores for each pair, where each of the
pair-wise scores is multiplied by a weight є indicating
how far the pair-wise score differs from the first-cut msa
alignment score.
Heuristic Dynamic Programming Method
for MSA
• Does not guarantee an optimal alignment of all
the sequences in the group.
• Does get an optimal alignment within the space
chosen.
Progressive Methods
• Similar to dynamic programming method in that
it uses the first step (i.e., it creates a
phylogenetic tree, aligns the most-alike pair, and
incrementally adds sequences to the alignment
in order of “alikeness” as indicated by the tree.)
• Differs from dynamic programming method for
MSA in that it doesn’t refine the “first-cut” MSA
by doing a full search through the reduced
search space. (This is the computationally
expensive part of DP MSA in that, even though
we’ve cut down the search space, it’s still big
when we have many sequences to align.)
Progressive Method
• Generally proceeds as follows:
– Choose a starting pair of sequences and align them
– Align each next sequence to those already aligned,
one at a time
• Heuristic method – doesn’t guarantee an optimal
alignment
• Details vary in implementation:
– How to choose the first sequence to align?
– Align all subsequence sequences cumulatively or in
subfamilies?
– How to score?
ClustalW
• Based on phylogenetic analysis
• A phylogenetic tree is created using a pairwise distance
matrix and nearest-neighbor algorithm
• The most closely-related pairs of sequences are aligned
using dynamic programming
• Each of the alignments is analyzed and a profile of it is
created
• Alignment profiles are aligned progressively for a total
alignment
• W in ClustalW refers to a weighting of scores depending
on how far a sequence is from the root on the
phylogenetic tree (See p. 154 of Bioinformatics by
Mount.)
Problems with Progressive Method
• Highly sensitive to the choice of initial pair
to align. If they aren’t very similar, it
throws everything off.
• It’s not trivial to come up with a suitable
scoring matrix or gap penalties.
ClustalW
http://www.ebi.ac.uk/clustalw/
http://align.genome.jp
ClustalW
>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVW
VHTELGYFSGEAVPS NGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRV
TDVIITHAHADRIGGMKTLKERGIKAHSTALT AELAKKNGYEEPLGDLQSVTNLK
FGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNV ADAYV
NEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK
>gi|2984094
MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAY
FYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFR
EVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQ
VGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWL
VCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVE
CVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE
>gi|115023|sp|P10425|
MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVW
VHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTD
VIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTK
VETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIE
NMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
>gi|115030|sp|P25910|
MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGM
IVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQR
KGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIV
VWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGH
GDYGGTELIEHTKQIVNQYIESTSKP
>gi|282554|pir||S25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWV
DKLAAGPGRTVVNTH FHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWP
RVDWGEIELRPPNVTFRDRLTLHVGERQVE LICVGPAHTDHDVVVWLPEERVLFAGD
VVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGP EVIDANRDYLRWV
QRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAM
EI FAELVAYNGGQLPTCLA