Multiple Sequence Alignments

Profiles and Progressive
Alignment
Profiles for families of sequences
can be built from MSAs
[Figure: a small example MSA over the alphabet A, C, G, T (with gaps) beside the profile built from it, a table giving the percentage of A, C, G, T, and gap in each alignment column.]
Note: While profiles can be used for any kind of
sequence data, we’ll focus on protein sequences
Profiles
• Profile: A table that lists the frequency of each
amino acid at each position of a protein sequence.
• Frequencies are calculated from an MSA containing a
domain of interest
• Allows us to identify consensus sequence
• Derived scoring scheme allows us to align a new
sequence to the profile
– Profile can be used in database searches
– Find new sequences that match the profile
• Profiles are also used to compute multiple alignments
heuristically
– Progressive alignment
Profiles: Position-Specific
Scoring Matrix (PSSM)
• To compare a sequence to a profile, need to
assign a score for each amino acid
• The score of the profile for amino acid a at
position p is
$$M(p,a) = \sum_{b=1}^{20} f(p,b)\, s(a,b)$$
where
– f(p,b) = frequency of amino acid b in position p
– s(a,b) is the score of (a,b) (from, e.g., BLOSUM
or PAM)
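The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the Gribskov implementation: the three-letter alphabet, the toy score function `s`, and the MSA `toy_msa` are all hypothetical, and gap rows are assumed to count in the denominator of each column frequency.

```python
from collections import Counter

ALPHABET = "KLM"  # toy alphabet; a real profile would use all 20 amino acids


def s(a, b):
    """Toy substitution score (hypothetical values standing in for BLOSUM/PAM)."""
    return 2.0 if a == b else -1.0


def column_frequencies(msa):
    """f[p][b]: fraction of rows showing residue b in column p.

    Assumption: rows with a gap in the column still count in the denominator.
    """
    n_rows = len(msa)
    freqs = []
    for column in zip(*msa):
        counts = Counter(column)
        freqs.append({b: counts[b] / n_rows for b in ALPHABET})
    return freqs


def pssm(msa):
    """M[p][a] = sum over residues b of f(p, b) * s(a, b)."""
    return [
        {a: sum(f[b] * s(a, b) for b in ALPHABET) for a in ALPHABET}
        for f in column_frequencies(msa)
    ]


# Hypothetical four-row MSA; column 0 is 75% K and 25% M.
toy_msa = ["KK-LM", "KMLLL", "MKLL-", "KKLML"]
M = pssm(toy_msa)
print(M[0])
```

With these toy scores, column 0 gives M(0, K) = 0.75 × 2 + 0.25 × (−1) = 1.25, so K scores highest at that position.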
Profiles: PSSM
Insertion/deletion penalty
Gribskov et al. PNAS. 84 (13): 4355 (1987)
Profiles: Consensus Sequence
• A consensus residue C(p) is generated at each
position of the profile to aid the display of
alignments of target sequences with the
profile.
• The consensus residue c is the amino acid at p
that has the highest score M(p,c).
– c is the amino acid most mutationally similar to all
the aligned residues of the probe sequences at p,
rather than the most common one
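A small sketch of why the consensus residue can differ from the most common residue. The column frequencies and the score table are invented for illustration (they are not BLOSUM or PAM values), and candidates are restricted to the residues seen in the column, which is a simplification.

```python
# Hypothetical column frequencies f(p, b) for a single profile position.
freqs = {"D": 0.40, "L": 0.35, "I": 0.25}

# Toy symmetric scores (illustrative values only): L and I are treated as
# similar to each other, and both as dissimilar to D.
TOY_SCORES = {
    ("D", "D"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("L", "I"): 2, ("D", "L"): -4, ("D", "I"): -4,
}


def s(a, b):
    """Look up the toy score for the pair (a, b) in either order."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))


def profile_score(a, freqs):
    """M(p, a) = sum over b of f(p, b) * s(a, b) for one position."""
    return sum(f * s(a, b) for b, f in freqs.items())


def consensus_residue(freqs):
    """Residue with the highest profile score at this position."""
    return max(freqs, key=lambda a: profile_score(a, freqs))


most_common = max(freqs, key=freqs.get)
print(most_common, consensus_residue(freqs))  # prints: D L
```

Here D is the most frequent residue, but L scores highest because it is similar to both L and I, which together make up 60% of the column.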
Aligning a sequence to a profile
[Figure: a four-row MSA over the residues K, L, and M, and the five-column frequency profile (rows K, L, M, gap) computed from it.]
New sequence:
K K L L M
Align with profile:
K K L - L M
1 - 2 3 4 5
[Figure: the MSA after the new sequence has been added.]
Scoring a sequence-to-profile
alignment
• Score each column separately according
to PSSM
• Each character in the profile column contributes to
the score, weighted by its frequency
[Figure: the five-column profile table (rows K, L, M, gap) above the aligned new sequence.]
Sequence: K K L - L M
Profile:  1 - 2 3 4 5
Column 1 score:
0.75 s(K,K) + 0.25 s(K,M)
Profile-to-sequence alignments
• Optimum alignment can be found by
dynamic programming
– Extension of Needleman-Wunsch
• Spaces are only added to the MSA, never
removed
– Once a gap, always a gap
• Can align profiles to profiles
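The dynamic program can be sketched as Needleman-Wunsch with the residue-to-residue score replaced by a residue-to-column score. This is a sketch under stated assumptions: a linear gap penalty (`gap`), a toy score function `s`, and a hypothetical three-column profile. None of these come from the slides.

```python
def s(a, b):
    """Toy substitution score (hypothetical; stands in for BLOSUM/PAM)."""
    return 2.0 if a == b else -1.0


def column_score(freqs, a):
    """Score of residue a against one profile column: sum_b f(b) * s(a, b)."""
    return sum(f * s(a, b) for b, f in freqs.items())


def align_to_profile(profile, seq, gap=-2.0):
    """Globally align seq to profile; return (score, [(column, residue), ...])."""
    n, m = len(profile), len(seq)
    # F[i][j] = best score for the first i columns against the first j residues.
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(profile[i - 1], seq[j - 1]),
                F[i - 1][j] + gap,  # gap in the new sequence
                F[i][j - 1] + gap,  # new gap column inserted into the MSA
            )
    # Traceback from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == (
            F[i - 1][j - 1] + column_score(profile[i - 1], seq[j - 1])
        ):
            pairs.append((i, seq[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or F[i][j] == F[i - 1][j] + gap):
            pairs.append((i, "-"))
            i -= 1
        else:
            pairs.append(("-", seq[j - 1]))
            j -= 1
    return F[n][m], pairs[::-1]


# Hypothetical profile: one dict of residue frequencies per column.
toy_profile = [{"K": 0.75, "M": 0.25}, {"L": 1.0}, {"M": 0.5, "L": 0.5}]
score, pairs = align_to_profile(toy_profile, "KM")
print(score, pairs)
```

A pair of the form `("-", residue)` corresponds to a new all-gap column being opened in the existing MSA. Existing gaps are never removed, in line with "once a gap, always a gap".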
Evolutionary Profiles
• Profiles just seen are called average profiles
• Generally perform well, but disregard some of
the biology
– How did each position evolve?
– Amount of conservation varies from position to
position
– Type of conservation varies from position to
position
• Alternative: Evolutionary profiles
– Gribskov, M. and Veretnik, S., Methods in
Enzymology 266, 198-212, 1996
Evolutionary Profiles
• Idea: Fit a different model at each position
• For each position i :
– For each possible ancestor b for position i
• Try various evolutionary distances x (assume PAM
model), and choose the one that minimizes cross
entropy
$$H = -\sum_{a=1}^{20} f_a \ln p_a$$
where
– f_a = observed frequency of a
– p_a = predicted frequency of a assuming b is the ancestor
and x is the distance
• This generates 20 distributions for position i
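The fitting loop for one position can be sketched as follows. Note the swap: instead of a PAM model, this sketch uses a toy equal-rates model in which the ancestor survives with probability exp(-x) and is otherwise replaced by a uniformly random residue. The model, the distance grid, and the observed frequencies are all assumptions made for illustration.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def predicted_distribution(ancestor, x):
    """Toy equal-rates model (an assumption standing in for PAM).

    The ancestor is kept with probability exp(-x); otherwise it is replaced
    by a residue drawn uniformly from the alphabet.
    """
    stay = math.exp(-x)
    return {
        a: stay * (a == ancestor) + (1.0 - stay) / len(ALPHABET)
        for a in ALPHABET
    }


def cross_entropy(observed, predicted):
    """H = -sum_a f_a * ln(p_a), skipping residues that were not observed."""
    return -sum(f * math.log(predicted[a]) for a, f in observed.items() if f > 0)


def best_fit(observed, ancestor, distances):
    """Return (minimum cross entropy, distance x) for one candidate ancestor."""
    return min(
        (cross_entropy(observed, predicted_distribution(ancestor, x)), x)
        for x in distances
    )


def fit_all_ancestors(observed, distances):
    """One best-fitting distribution per candidate ancestor: 20 in total."""
    return {b: best_fit(observed, b, distances) for b in ALPHABET}


# Hypothetical observed frequencies at one position, and a grid of x > 0.
observed = {"K": 0.75, "M": 0.25}
distances = [0.05 * k for k in range(1, 61)]
fits = fit_all_ancestors(observed, distances)
best_ancestor = min(fits, key=lambda b: fits[b][0])
print(best_ancestor, fits[best_ancestor])
```

The distances must be strictly positive here. At x = 0 the toy model assigns zero probability to every residue other than the ancestor, and the logarithm is undefined.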
Evolutionary Profiles
• For each position i
– Compute “mixture coefficient,” W_ai,
measuring the likelihood that ancestral residue a
generated the observed distribution (see text)
– Profile is given by
$$\mathrm{Profile}_{ij} = \log \frac{\sum_a W_{ai}\, p_{aij}}{p_{\mathrm{random},j}}$$
where
• p_aij = frequency of residue j in the ancestral
residue distribution a at position i
• p_random,j = frequency of residue j in the database
Progressive multiple alignment
• Feng & Doolittle 1987, Higgins and
Sharp 1988
• Idea: Sequences to be aligned are
phylogenetically related
– these relationships are used to guide the
alignment
• Popular implementations: CLUSTALW,
PILEUP, T-Coffee
CLUSTALW
1. Perform pair-wise alignments between all
pairs of sequences (n x (n-1)/2 possibilities)
2. Generate distance matrix.
• Distance between a pair = number of mismatched
positions in the alignment divided by the total number of
aligned (non-gap) positions
3. Generate a Neighbor-Joining ‘guide tree’
from distance table
4. Use guide tree to progressively align
sequences in pairs from tips to root of tree.
• Actually, align profiles
• “Once a gap, always a gap”
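Step 2 can be sketched as follows, reading the distance as mismatches divided by the number of positions where both sequences have a residue. The pairwise alignments are toy inputs assumed to come from step 1, and the names `pairwise_distance` and `distance_matrix` are illustrative, not ClustalW functions.

```python
from itertools import combinations


def pairwise_distance(row_a, row_b):
    """Mismatches divided by aligned (non-gap) positions of two gapped rows."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not compared:
        # Assumption: pairs with no aligned residues are maximally distant.
        return 1.0
    mismatches = sum(a != b for a, b in compared)
    return mismatches / len(compared)


def distance_matrix(names, alignments):
    """Build a symmetric distance table.

    alignments maps (name_i, name_j) to the two gapped rows produced by the
    pairwise alignment of that pair.
    """
    dist = {name: {name: 0.0} for name in names}
    for i, j in combinations(names, 2):
        d = pairwise_distance(*alignments[(i, j)])
        dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical pairwise alignments for three sequences.
names = ["s1", "s2", "s3"]
alignments = {
    ("s1", "s2"): ("KKLLM", "KKL-M"),
    ("s1", "s3"): ("KKLLM", "KMLLL"),
    ("s2", "s3"): ("KKLM", "KMLL"),
}
dist = distance_matrix(names, alignments)
print(dist["s1"]["s2"], dist["s1"]["s3"], dist["s2"]["s3"])
```

A tree-building method such as Neighbor-Joining would take this table as input for step 3.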
CLUSTALW
CLUSTALW Tree
Tree calculated from an alignment of more than 1100 ring finger
domains, using ClustalW 1.83.
CLUSTALW heuristics
1. Individual weights are assigned to each sequence in a
partial alignment in order to downweight similar
sequences and up-weight highly divergent ones.
2. Varying substitution matrices at different alignment
stages according to sequence divergence.
3. Gaps
• Positions in early alignments where gaps have been opened
receive locally reduced gap penalties
• Residue-specific gap penalties and locally reduced gap
penalties in hydrophilic regions encourage new gaps in
potential loop regions rather than regular secondary
structure.
Progressive Alignment: Discussion
• Strengths:
– Speed
– Progression biologically sensible (aligns
using a tree)
• Weaknesses:
– No objective function.
– No way of quantifying whether or not the
alignment is good
Problems with CLUSTALW
• Local minimum problem:
– Alignment depends on sequence addition order.
– With each alignment some proportion of residues
are misaligned
• Worse for divergent sequences
– Errors get “locked in” and propagate as sequences
are added
– Can result in arbitrary and incorrect alignments
• Clustal uses global alignment … may not be
accurate for all parts of the sequence
– T-Coffee considers local similarity as well as global
Iterative alignment
• To avoid local minima, realign subgroups of
sequences and then incorporate them into a
growing multiple sequence alignment
– Improves overall alignment score.
– May involve rebuilding the guide tree
– May be randomized
• Programs:
– MultAlin
– PRRP
– DIALIGN
Phylogenetic Alignment
Given a tree for a set of species S, find
ancestral species such that total distance is
minimized.
[Figure: example tree labeled with the sequences GTGG, CTGG, CTGG, GTGG (upper nodes) and CCGG, CTAA, GTAA, CTTC (bottom row).]
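For equal-length, ungapped sequences and Hamming distance, the minimum total distance on a fixed tree can be computed column by column with Fitch's small-parsimony algorithm. The sketch below is a simplification of the general problem, which also has to handle insertions and deletions. It assumes the four bottom-row sequences are the leaves, and the topology `tree` is hypothetical since the figure's branching is not recoverable from the text.

```python
def fitch_cost(tree, column):
    """Return (candidate ancestral states, number of changes) for one column.

    tree is either a leaf name or a (left, right) tuple; column maps each
    leaf name to its character at the current position.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_cost = fitch_cost(tree[0], column)
    right_states, right_cost = fitch_cost(tree[1], column)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1


def total_distance(tree, leaves):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(leaves.values())))
    return sum(
        fitch_cost(tree, {name: seq[pos] for name, seq in leaves.items()})[1]
        for pos in range(length)
    )


leaves = {"s1": "CCGG", "s2": "CTAA", "s3": "GTAA", "s4": "CTTC"}
tree = (("s1", "s2"), ("s3", "s4"))  # hypothetical topology
print(total_distance(tree, leaves))  # prints: 6
```

Under this topology the four columns need 1, 1, 2, and 2 changes respectively, which gives the total of 6.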