MultipleSequenceAlignment
Multiple Sequence
Alignment
CSC391/691 Bioinformatics
Spring 2004
Fetrow/Burg/Miller
(Slides by J. Burg)
Why do we care about sequence
alignment?
It can tell us something about the evolution of organisms.
We can see which regions of a gene (or its derived
protein) are susceptible to mutation and which can have
one residue replaced by another without changing
function.
Homologous genes (genes that share an evolutionary
origin) have similar sequences.
Orthologs are evolutionarily related genes that were separated
by speciation, now appear in different species, and usually
keep a similar function.
Paralogs are evolutionarily related genes that arose by
duplication within a genome and often no longer have the
same function.
You can uncover either orthologs or paralogs through
sequence alignment.
Multiple Sequence Alignment
Often applied to proteins
Proteins that are similar in sequence are
often similar in structure and function
Sequence changes more rapidly in
evolution than do structure and function.
Overview of Methods
Dynamic programming – too computationally
expensive to do a complete search; uses
heuristics
Progressive – starts with pair-wise alignment of
most similar sequences; adds to that
Iterative – makes an initial alignment of groups of
sequences, then repeatedly refines it (e.g. genetic
algorithms)
Locally conserved patterns
Statistical and probabilistic methods
Dynamic Programming
Computational complexity – even worse
than for pair-wise alignment because
we’re finding all the paths through an n-dimensional hyperspace (We can picture
this in 2 or 3 dimensions.)
Can align about 7 relatively short (200-300 residue) protein sequences in a reasonable
amount of time; not much beyond that
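A rough way to see the growth in cost is to count the cells of the full n-dimensional dynamic programming lattice. The sketch below is only an illustration: the function names are hypothetical, and the sequence length of 250 is an assumed value inside the 200-300 range mentioned above.

```python
from math import prod


def dp_lattice_size(lengths):
    """Number of cells in the full n-dimensional DP lattice.

    Each sequence of length L contributes L + 1 positions along its axis.
    """
    return prod(length + 1 for length in lengths)


def predecessors_per_cell(n):
    """Upper bound on the neighbouring cells feeding one interior cell.

    Every non-empty subset of the n sequences can advance by one residue,
    which gives 2**n - 1 possible moves into a cell.
    """
    return 2 ** n - 1


# Illustrative only: n sequences of an assumed length of 250 residues.
for n in (2, 3, 7):
    print(n, dp_lattice_size([250] * n), predecessors_per_cell(n))
```

The cell count is multiplied by roughly the sequence length for every sequence added, which is why an exhaustive search stops being practical after a handful of sequences.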
A Heuristic for Reducing the
Search Space in
Dynamic Programming
Let’s picture this in 3 dimensions (pp. 146-157 in
book). It generalizes to n.
Consider the pair-wise alignments of each pair
of sequences.
Create a phylogenetic tree from these scores.
Consider a multiple sequence alignment built
from the phylogenetic tree.
These alignments circumscribe a space in which
to search for a good (but not necessarily
optimal) alignment of all n sequences.
Phylogenetic Tree
Dynamic programming uses a
phylogenetic tree to build a “first-cut” msa
The tree shows how the proteins could have
evolved from shared origins over
evolutionary time.
See page 143 in Bioinformatics by Mount.
Chapter 6 goes into detail on this.
Dynamic Programming -- MSA
Create a phylogenetic tree based on pair-wise
alignments (Pairs of sequences that have the best
scores are paired first in the tree.)
Do a “first-cut” msa by incrementally doing pair-wise
alignments in the order of “alikeness” of sequences as
indicated by the tree. Most alike sequences aligned first.
Use the pair-wise alignments and the “first-cut” msa to
circumscribe a space within which to do a full msa that
searches through this solution space.
The score for a given alignment of all the sequences is
the sum of the scores for each pair, where each of the
pair-wise scores is multiplied by a weight ε indicating
how far the pair-wise score differs from the first-cut msa
alignment score.
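The sum-of-pairs idea described above can be sketched as follows. This is a minimal illustration, not the scoring used by any particular program: the match, mismatch, and gap values are assumptions, and the optional `weights` dictionary is a hypothetical stand-in for the per-pair weights.

```python
from itertools import combinations

# Illustrative scoring values (assumptions, not taken from the slides).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(row_a, row_b):
    """Score two rows of an alignment, column by column."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both rows is ignored
        if a == "-" or b == "-":
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score


def sum_of_pairs(msa, weights=None):
    """Sum the (optionally weighted) pair-wise scores over all row pairs."""
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        weight = 1.0 if weights is None else weights[(i, j)]
        total += weight * pair_score(msa[i], msa[j])
    return total


msa = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(msa))
print(sum_of_pairs(msa, {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}))
```

A real implementation would normally replace the three constants with a substitution matrix and separate gap-opening and gap-extension penalties.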
Heuristic Dynamic Programming
Method for MSA
Does not guarantee an optimal alignment
of all the sequences in the group.
Does get an optimal alignment within the
space chosen.
Progressive Methods
Similar to dynamic programming method in that
it uses the first step (i.e., it creates a
phylogenetic tree, aligns the most-alike pair, and
incrementally adds sequences to the alignment
in order of “alikeness” as indicated by the tree.)
Differs from dynamic programming method for
MSA in that it doesn’t refine the “first-cut” MSA
by doing a full search through the reduced
search space. (This is the computationally
expensive part of DP MSA in that, even though
we’ve cut down the search space, it’s still big
when we have many sequences to align.)
Progressive Method
Generally proceeds as follows:
Choose a starting pair of sequences and align them
Align each next sequence to those already aligned,
one at a time
Heuristic method – doesn’t guarantee an optimal
alignment
Details vary in implementation:
How to choose the first sequence to align?
Align all subsequent sequences cumulatively or in
subfamilies?
How to score?
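One possible answer to the ordering question is sketched below: start from the most similar pair, then repeatedly add the sequence that is, on average, most similar to those already chosen. This is only an assumption about how the order might be decided. `difflib.SequenceMatcher` is used as a crude stand-in for a real pair-wise alignment score, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity in [0, 1]; a real tool would use alignment scores."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def progressive_order(seqs):
    """Return sequence names in the order they would be added to the alignment."""
    names = list(seqs)
    sim = {
        frozenset((x, y)): similarity(seqs[x], seqs[y])
        for x, y in combinations(names, 2)
    }
    # Start with the most similar pair.
    first_pair = max(sim, key=sim.get)
    order = sorted(first_pair)
    remaining = [name for name in names if name not in first_pair]
    # Add the sequence closest, on average, to those already in the alignment.
    while remaining:
        best = max(
            remaining,
            key=lambda name: sum(sim[frozenset((name, m))] for m in order) / len(order),
        )
        order.append(best)
        remaining.remove(best)
    return order


print(progressive_order({"x": "AAAAAAAA", "y": "AAAAAAAT", "z": "CCCCGGGG"}))
```

The sketch decides only the order of addition; the alignment step itself (sequence to sequence, or sequence to the growing alignment) is left out.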
ClustalW
Based on phylogenetic analysis
A phylogenetic tree is created using a pairwise distance
matrix and the neighbor-joining algorithm
The most closely-related pairs of sequences are aligned
using dynamic programming
Each of the alignments is analyzed and a profile of it is
created
Alignment profiles are aligned progressively for a total
alignment
W in ClustalW refers to a weighting of scores depending
on how far a sequence is from the root on the
phylogenetic tree (See p. 154 of Bioinformatics by
Mount.)
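The profile step can be pictured as a table of residue frequencies per alignment column. The sketch below is a simplified illustration and not ClustalW's own profile format: it ignores sequence weighting, and `build_profile` is a hypothetical name.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dictionary per alignment column.

    Assumes every row has the same length; gaps ('-') are counted as a symbol.
    """
    n_rows = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: count / n_rows for res, count in counts.items()})
    return profile


alignment = ["ACG-", "ACGT", "AGGT", "A-GT"]
for position, frequencies in enumerate(build_profile(alignment)):
    print(position, frequencies)
```

Two such profiles can then be aligned to each other by scoring column against column instead of residue against residue.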
Problems with Progressive Method
Highly sensitive to the choice of initial pair
to align. If they aren’t very similar, it
throws everything off.
It’s not trivial to come up with a suitable
scoring matrix or gap penalties.
Iterative Methods for Multiple
Sequence Alignment
Get an alignment.
Refine it.
Repeat until one msa doesn’t change
significantly from the next.
An example is the genetic algorithm approach.
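The get, refine, repeat loop can be sketched generically. Here `refine` and `score` are hypothetical placeholders for an alignment-specific refinement step and scoring function, and the tolerance and round limit are assumed values.

```python
def iterate_until_stable(initial, refine, score, tolerance=1e-9, max_rounds=100):
    """Refine a solution until its score stops improving by more than tolerance."""
    current = initial
    current_score = score(current)
    for _ in range(max_rounds):
        candidate = refine(current)
        candidate_score = score(candidate)
        if candidate_score - current_score <= tolerance:
            break  # no significant change from one round to the next
        current, current_score = candidate, candidate_score
    return current, current_score


# Toy stand-in: "refining" a number towards 5, scored by closeness to 5.
best, best_score = iterate_until_stable(
    0,
    refine=lambda x: x + 1 if x < 5 else x,
    score=lambda x: -abs(5 - x),
)
print(best, best_score)
```

For an msa, `initial` would be a first alignment, `refine` a step such as realigning one sequence or subgroup, and `score` a measure such as the sum of pairs.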
Genetic Algorithms
A general problem solving method
modeled on evolutionary change.
Create a set of candidate solutions to your
problem, and cause these solutions to
evolve and become more and more fit
over repeated generations.
Use survival of the fittest, mutation, and
crossover to guide evolution.
Evolutionary Change in
Genetic Algorithms
survival of the fittest – the best solutions
survive and reproduce to the next
generation
mutation – some solutions mutate in
random ways (but they must always
remain viable solutions)
crossover – solutions “exchange parts”
Laying Out the Problem
What would a candidate solution look like
in a multiple sequence alignment
program? (an msa of ~20 proteins)
How many candidate solutions should
there be? (~100)
Evolving to a Next Generation
Which candidate solutions should survive
to the next generation?
First, take the top half based on best sum-of-pairs
scores
Then randomly select the second half, giving
each msa a chance of being selected in
proportion to how good its score is
How would mutation work?
Can’t change a sequence in the msa.
Otherwise you would be creating a solution
that isn’t really a solution.
You can only insert or rearrange gaps.
How would crossover work?
See page 160 in Bioinformatics by Mount.
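The selection and mutation steps above can be sketched as follows; crossover is left to the reference. This is a minimal illustration under stated assumptions: the second half is drawn with replacement from the whole population, scores are shifted so that every selection weight is positive, and mutation moves a single gap within one row. All function names are hypothetical.

```python
import random


def select_next_generation(population, scores, rng=random):
    """Keep the top half by score, then fill the rest by weighted random choice."""
    ranked = sorted(zip(scores, range(len(population))), reverse=True)
    half = len(population) // 2
    survivors = [population[i] for _, i in ranked[:half]]
    # Shift scores so all weights are positive (an assumption of this sketch).
    lowest = min(scores)
    weights = [s - lowest + 1 for s in scores]
    survivors += rng.choices(population, weights=weights, k=len(population) - half)
    return survivors


def mutate(msa, rng=random):
    """Move one gap within one row; residues and their order never change."""
    rows = list(msa)
    candidates = [i for i, row in enumerate(rows) if "-" in row]
    if not candidates:
        return rows  # nothing to rearrange
    i = rng.choice(candidates)
    row = list(rows[i])
    gap_positions = [p for p, ch in enumerate(row) if ch == "-"]
    row.pop(rng.choice(gap_positions))            # remove one existing gap
    row.insert(rng.randrange(len(row) + 1), "-")  # reinsert it somewhere else
    rows[i] = "".join(row)
    return rows


rng = random.Random(0)
print(select_next_generation(["a", "b", "c", "d"], [4, 1, 3, 2], rng))
print(mutate(["AC-GT", "ACGGT", "A-CGT"], rng))
```

Because mutation only relocates gaps, every mutated msa still contains the original sequences and so remains a viable candidate solution.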
Profiles and Motifs
A sequence motif is a relatively short pattern that
appears consistently within a family of proteins.
(Motifs can also appear in families of DNA or
RNA molecules.)
Frequently, motif-based analysis is used to
detect patterns of amino acids in proteins that
correspond to structural or functional features.
Motifs are generated during multiple sequence
alignment. They can be displayed as patterns of
amino acids, as sequence logos, or as profile
scoring matrices.
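As an illustration of displaying a motif as a pattern of amino acids, the sketch below turns an ungapped block of aligned sequences into a PROSITE-like string. The notation is simplified, the `max_alternatives` cutoff is an assumed parameter, and `motif_pattern` is a hypothetical name.

```python
def motif_pattern(aligned_block, max_alternatives=2):
    """Summarize each column of an ungapped aligned block as a pattern element.

    One residue   -> that letter
    A few         -> [XY]
    Many residues -> x (any residue)
    """
    parts = []
    for column in zip(*aligned_block):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        elif len(residues) <= max_alternatives:
            parts.append("[" + "".join(residues) + "]")
        else:
            parts.append("x")
    return "-".join(parts)


block = ["CAHC", "CGHC", "CAHC", "CTKC"]
print(motif_pattern(block))  # C-x-[HK]-C
```

A profile scoring matrix keeps more information than such a pattern, since it records how often each residue occurs at each position instead of only which residues occur.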