Multiple Alignment and Phylogenetic Trees


Multiple Alignment and
Phylogenetic Trees
Csc 487/687 Computing for
Bioinformatics
Multiple Sequence Alignment
• One amino acid sequence plays coy; a
pair of homologous sequences whisper;
many aligned sequences shout out loud.
• Very informative
Definition
• A global alignment of a set of sequences is
obtained by
– inserting into each sequence gap characters
• so that
– the resulting sequences are of the same
length
• and so that
– no “column” has only gap characters
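The three conditions in the definition can be checked mechanically. The following is a minimal sketch, assuming an alignment is stored as a list of equal-length strings with "-" as the gap character; the function name and the optional `originals` argument are illustrative and not part of the slides.

```python
GAP = "-"  # assumed gap character


def is_valid_alignment(rows, originals=None):
    """Check the conditions of a global multiple alignment.

    rows      -- gapped sequences, one string per sequence
    originals -- optional ungapped input sequences, in the same order
    """
    if not rows:
        return False
    width = len(rows[0])
    # The resulting sequences must all have the same length.
    if any(len(r) != width for r in rows):
        return False
    # No column may consist only of gap characters.
    for column in zip(*rows):
        if all(c == GAP for c in column):
            return False
    # Gaps were only inserted: removing them must give back the inputs.
    if originals is not None:
        if [r.replace(GAP, "") for r in rows] != list(originals):
            return False
    return True
```

For instance, `is_valid_alignment(["AC-GT", "A-CGT", "ACCG-"])` is accepted, while `["A-C", "A-G"]` is rejected because its middle column contains only gaps.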
Example: Chromo domains aligned
Use of alignments
• High sequence similarity usually means significant
structural and/or functional similarity. The reverse does
not need to be true
• Homologous proteins (sharing a common ancestor) can differ
significantly over large parts of their sequences, yet still
retain common 2D patterns, 3D patterns, or a common
active site or binding site.
• Comparing several sequences in a family can reveal
what is common to the family. A feature shared by many
sequences can be significant when all of them are considered
together, even if it is not significant when only two are compared.
• Multiple alignment can be used to derive evolutionary
history.
Use of alignments
• Predict features of aligned objects
– conserved positions
• structurally/functionally important
[Figure: Conserved positions]
– patterns of hydrophobicity/hydrophilicity
• secondary structure elements
[Figure: Helix pattern]
– “gappy” regions
• loops/variable regions
[Figure: three candidate regions, each labelled “Loop?”]
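Conserved positions and "gappy" regions can be read off an alignment column by column. The sketch below is one simple way to do this, assuming "-" marks a gap; the function name and the idea of flagging columns by fixed thresholds are illustrative assumptions, not a method given in the slides.

```python
from collections import Counter


def column_summary(rows, gap="-"):
    """Return (most_common_residue, conservation, gap_fraction) per column.

    conservation is the fraction of all sequences carrying the most
    common residue; gap_fraction is the fraction of sequences with a gap.
    """
    n = len(rows)
    summary = []
    for column in zip(*rows):
        residues = [c for c in column if c != gap]
        gap_fraction = (n - len(residues)) / n
        if residues:
            residue, count = Counter(residues).most_common(1)[0]
            conservation = count / n
        else:
            residue, conservation = gap, 0.0
        summary.append((residue, conservation, gap_fraction))
    return summary
```

Columns with conservation 1.0 are candidates for structurally or functionally important positions, and runs of columns with a high gap fraction are candidates for loops or variable regions. Any cut-off (for example, a gap fraction of 0.5) is a judgment call.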
Use of Alignments
- make patterns/profiles
• Can make a profile or a pattern that can
be used to match against a sequence
database and identify new family members
• Profiles/patterns can be used to predict
family membership of new sequences
• Databases of profiles/patterns
– PROSITE
– PFAM
– PRINTS
– ...
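To make the idea of a profile concrete, here is a minimal sketch of a position-specific frequency profile built from an ungapped block of an alignment, with a log-odds score for a candidate segment. The pseudocount, the uniform background distribution, and both function names are assumptions chosen for illustration; the databases listed above use more elaborate models.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(rows, pseudocount=1.0):
    """Per-column residue probabilities (gaps ignored), with pseudocounts."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def log_odds_score(profile, segment):
    """Score an ungapped segment that has the same length as the profile."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    return sum(math.log(col[a] / background) for col, a in zip(profile, segment))
```

A segment resembling the aligned family scores higher than an unrelated one, so sliding the profile along a database sequence and keeping high-scoring windows is one way to look for new family members.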
Prosite: Motifs for classification
[Figure: a protein sequence is tested against Prosite pattern 1, pattern 2, ..., pattern n, which correspond to Family 1, Family 2, ..., Family n]
Pattern
Regular expression
Profile
Pattern from alignment
[FYL]-x-[LIVMC]-[KR]-W-x-[GDNR]-[FYWLE]-x(5,6)-[ST]-W-[ES]-[PSTDN]-x(3)-[LIVMC]
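A pattern written in this notation maps almost directly onto a regular expression. The sketch below translates the commonly used parts of the PROSITE syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repetition, < and > for the sequence ends) into a Python regex. The function name is hypothetical and the rarer syntax elements are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    # {AG} means "any residue except A or G"; do this before repetitions
    # so that the braces produced below are not misread.
    regex = re.sub(r"\{([A-Z]+)\}", r"[^\1]", regex)
    # x(5,6) -> x{5,6} and x(3) -> x{3}
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)
    # x matches any residue; amino acid codes are upper case, so a
    # lower-case x is unambiguous.
    regex = regex.replace("x", ".")
    # < and > anchor the pattern to the N- and C-terminus.
    return regex.replace("<", "^").replace(">", "$")


pattern = ("[FYL]-x-[LIVMC]-[KR]-W-x-[GDNR]-[FYWLE]-x(5,6)-"
           "[ST]-W-[ES]-[PSTDN]-x(3)-[LIVMC]")
compiled = re.compile(prosite_to_regex(pattern))
# Made-up test sequence constructed to satisfy the pattern.
hit = compiled.search("MMFALKWAGFAAAAASWEPAAALKK")
```

Here `hit` is a match object covering the embedded motif; `compiled.finditer` could be used in the same way to scan every sequence in a database.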
Alignment problem
Given a set of sequences, produce a
multiple alignment which corresponds as
well as possible to the biological
relationships between the corresponding
bio-molecules
For homologous proteins
• Two residues should be aligned (on top of
each other)
– if they are homologous (evolved from the
same residue in a common ancestor protein)
– if they are structurally equivalent
Automatic approach
• Need a way of scoring alignments
– fitness function which for an alignment
quantifies its “goodness”
• Need an algorithm for finding alignments
with good scores
• Not all methods provide a scoring function
for the final alignment!
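As an illustration of what a fitness function can look like, the sketch below computes a sum-of-pairs score: every column is scored by summing a pairwise score over all pairs of rows. Sum-of-pairs is one common choice and is an assumption here, as are the match, mismatch, and gap values; the slides do not commit to a particular scoring function.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (illustrative values)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows, **scores):
    """Sum pair_score over all row pairs in every column."""
    return sum(
        pair_score(a, b, **scores)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )
```

In practice the match/mismatch values would come from a substitution matrix rather than two constants; the structure of the computation stays the same.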
Analysis of fitness function
• One can test whether the alignments
optimal under a given fitness function
correspond well to the biological
relationships between the sequences
• This is possible, for example, when the structures of
(some of) the proteins are known.
Align by use of dynamic programming
• Dynamic programming finds best alignment of k sequences with
given scoring scheme
• For two sequences there are three different column types
• For three sequences there are seven different column types
x means an amino acid, - a blank
Sequence1  x - x x - - x
Sequence2  x x - x - x -
Sequence3  x x x - x - -
• Time complexity of O(n^k) (sequence lengths = n)
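The counts of three and seven column types follow from a simple rule: each of the k sequences contributes either a residue or a gap, and the all-gap column is forbidden, which leaves 2^k - 1 types. A short sketch (the function name is illustrative) enumerates them:

```python
from itertools import product


def column_types(k):
    """All columns for k sequences: 'x' is a residue, '-' a blank.

    The column made only of blanks is excluded, as in the definition
    of a multiple alignment.
    """
    return [t for t in product("x-", repeat=k) if "x" in t]


for k in (2, 3, 4):
    print(k, len(column_types(k)))
```

A dynamic programming table over k sequences of length n has (n + 1)^k cells, and each cell considers up to 2^k - 1 predecessor cells, one per column type, which is why the exact method becomes impractical beyond a handful of sequences.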
Use of dynamic programming
• Dynamic programming finds the best alignment of k sequences
given a scoring scheme
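For k = 2 the idea can be shown in a few lines. The sketch below is a standard global pairwise alignment by dynamic programming (Needleman-Wunsch style) with a linear gap penalty; the scoring values and the function name are assumptions for illustration, and the slides may present the algorithm differently.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for an optimal global alignment."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The three column types for two sequences: x/x, x/-, -/x
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the last cell to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a.append(s[i - 1]); b.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-")
            i -= 1
        else:
            a.append("-"); b.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))
```

Extending this to k sequences replaces the two-dimensional table with a k-dimensional one and the three cases in the `max` with all 2^k - 1 column types.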
Algorithm for dynamic programming