MSA - Liisa Holm's Bioinformatics Group

Multiple sequence alignment (MSA)
(in Finnish: Usean sekvenssin rinnastus)
Petri Törönen
Help contributed by: Liisa Holm & Ari Löytynoja
What is MSA?
• MSA is an alignment generated from three
or more sequences.
• MSA is usually a more global alignment,
i.e., the aim is to align homologous residues
(nucleotides or amino acids) in columns
across the length of the whole sequences.
GA--GTACA
CAC-GTATA
CACGGTATG-CGGTCTA
What is MSA?
Picture shows protein multiple sequence alignment
http://en.wikipedia.org/wiki/Multiple_sequence_alignment
Why MSA
• “MSA emphasises signal observed in the pairwise alignment” (Liisa Holm)
• Improved alignments!!
• Alignment of more distant sequences with the help of intermediate sequences
• Highlight the conserved regions in
sequences
http://ekhidna.biocenter.helsinki.fi/users/petri/public/opetus_jutut/Bioinf_Per_Lects/urease_output.txt
Why MSA
MSA is input to many analysis tasks:
• Detection of active sites
• Generation of sequence profiles
• Detection of protein domains and motifs
• Phylogenetics
…
Remember
• First step of MSA:
• Good selection of sequences for the analysis
• Sequences need to be functionally/evolutionarily related
• Sometimes it is good to have some variation in the sequences (depends on the analysis task)
• Otherwise: rubbish in → rubbish out
MSA methods
• Finding the optimal multiple sequence alignment is a computationally hard task
• The “correct” answer would always come from extending the dynamic programming algorithm to multiple sequences
• In practice, dynamic programming cannot be applied to MSA problems: its time and memory requirements grow exponentially with the number of sequences
• We need approximate solutions (heuristics)
http://en.wikipedia.org/wiki/Multiple_sequence_alignment#Dynamic_programming_and_computational_complexity
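To make the scaling problem concrete, here is a minimal Python sketch that counts the cells of the N-dimensional dynamic programming lattice. It assumes all N sequences share the same length L, which real data rarely do, and the function name `dp_lattice_size` is made up for this illustration.

```python
def dp_lattice_size(n_seqs: int, length: int) -> tuple[int, int]:
    """Return (cells, predecessors_per_cell) for an N-dimensional DP lattice.

    Assumption: every sequence has the same length.
    """
    cells = (length + 1) ** n_seqs   # one axis of length L+1 per sequence
    predecessors = 2 ** n_seqs - 1   # each sequence either advances or takes a gap
    return cells, predecessors


for n in (2, 3, 5, 10):
    cells, preds = dp_lattice_size(n, 100)
    print(f"{n:2d} sequences of length 100: {cells:.3e} cells, {preds} predecessors each")
```

Two sequences of length 100 need about ten thousand cells, while ten sequences already need more than 10^20, which is why heuristics are used instead.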
MSA methods: heuristics
• Progressive Alignment (not much used)
• Iterative Alignment (most popular)
• Hidden Markov Models
• Pattern Based methods
Progressive alignment
• Divide the unsolvable task into subtasks that can be solved
• First align the most similar pairs of sequence sets
– A sequence set can have one or many sequences
– At first, each set contains only a single sequence
• Move progressively to bigger sets and to more difficult pairs of sets
• Always align only two sets at a time
Progressive alignment
• Produce pairwise alignments between all
the sequences you want to align with MSA.
– Dynamic programming, k-tuple (ktup) methods, ...
• Produce a “guide tree” on the basis of the
pairwise distances calculated from pairwise
alignments
– UPGMA, neighbor joining
• Produce an MSA using the “guide tree”.
– Sequences are aligned in the same order as the
guide tree instructs.
[Figure: progressive alignment workflow. Start from a set of sequences; run all-against-all pairwise alignment (demonstrated for the first sequence); get pairwise similarities from the alignments; create a cluster tree from the similarities; join the sequences in the order obtained from the cluster tree.]
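The all-against-all step can be sketched in Python as follows. This is a simplified illustration, not what Clustal or Pileup actually do: it assumes the sequences are already the same length and gap-free, so it skips the pairwise alignment itself and only counts differing positions. The helper names are made up; the sequences are the ones used on the "Constructing MSA" slide.

```python
from itertools import combinations


def count_differences(a: str, b: str) -> int:
    """Number of positions where two equal-length, pre-aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def all_against_all(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Compare every unordered pair of sequences once: n*(n-1)/2 comparisons."""
    return {
        (name_a, name_b): count_differences(seqs[name_a], seqs[name_b])
        for name_a, name_b in combinations(seqs, 2)
    }


seqs = {
    "human": "ACGTACGTCC",
    "chimp": "ACCTACGTCC",
    "gorilla": "ACCACCGTCC",
    "orangutan": "ACCCCCCTCC",
}
for pair, diff in all_against_all(seqs).items():
    print(pair, diff)
```

The resulting pairwise table is the input from which a guide tree is built.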
Guide tree construction: UPGMA
• Unweighted Pair Group Method with
Arithmetic mean
• One of the fastest tree construction methods
An example: pairwise alignments
[Figure: pairwise distances based on pairwise alignments, shown as the number of nucleotide differences, as absolute distances (used in Pileup/Clustal), and as JC-distances.]
UPGMA based on JC-distances*
[Figure: tree after the first join, annotated with 0.107 / 2.]
* JC-distances = Jukes-Cantor distances. The observed distances, D, are corrected for multiple substitutions via the correction function -(3/4) * ln(1 - (4/3) * D)
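The correction function can be written as a short Python sketch. The function name `jukes_cantor` is an assumption for illustration; the input is taken to be the observed proportion of differing sites, and the correction is only defined below 0.75.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)


# One difference in ten aligned sites, as between the human and chimp
# example sequences on the "Constructing MSA" slide.
print(round(jukes_cantor(1 / 10), 3))  # 0.107
```

The corrected value is slightly larger than the observed 0.1, and the gap widens as sequences diverge: an observed 0.3 becomes about 0.383.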
UPGMA, distance updates
d((human, chimp), gorilla) = [d(human, gorilla) + d(chimp, gorilla)] / 2 = [0.383 + 0.232] / 2 = 0.3075
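The join-and-update cycle can be sketched in Python as below. This is a minimal illustration rather than any particular program's implementation: the function name `upgma` and the data layout are assumptions, ties between equally close pairs are broken arbitrarily, and the example uses only the three distances quoted on the slide. The update is a size-weighted average, which reduces to the plain average of two distances when both joined clusters hold a single sequence.

```python
from itertools import combinations


def upgma(leaf_dist):
    """Cluster leaves with UPGMA.

    leaf_dist maps (leaf_a, leaf_b) tuples to distances and must cover every pair.
    Returns a list of merges (cluster_a, cluster_b, node_height), where
    clusters are tuples of leaf names.
    """
    # Distances between current clusters, keyed by an unordered pair of clusters.
    dist = {frozenset([(a,), (b,)]): d for (a, b), d in leaf_dist.items()}
    clusters = {c for pair in dist for c in pair}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        pair = min((frozenset(p) for p in combinations(clusters, 2)),
                   key=dist.__getitem__)
        a, b = sorted(pair)
        merged = a + b
        merges.append((a, b, dist[pair] / 2))  # node height is half the distance
        clusters -= {a, b}
        # Size-weighted average equals the arithmetic mean over all leaf pairs.
        for c in clusters:
            dist[frozenset([merged, c])] = (
                len(a) * dist[frozenset([a, c])] + len(b) * dist[frozenset([b, c])]
            ) / len(merged)
        clusters.add(merged)
    return merges


example = {
    ("human", "chimp"): 0.107,
    ("human", "gorilla"): 0.383,
    ("chimp", "gorilla"): 0.232,
}
for a, b, height in upgma(example):
    print(a, b, round(height, 4))
```

With these three distances, human and chimp are joined first at height 0.107 / 2, and gorilla then joins at half of the updated distance 0.3075.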
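UPGMA
[Figure: UPGMA tree built step by step. The recoverable annotations follow; U is the node joining (human & chimp) with (gorilla & orangutan).]
• d((human & chimp), U) = 0.3923 / 2 = 0.1962
• d((gorilla & orangutan), U) = 0.3923 / 2 = 0.1962
• Branch from U to the (human, chimp) node: 0.1962 - 0.0537 = 0.1426
• Branch from U to the (gorilla, orangutan) node: 0.1962 - 0.116 = 0.080
• Branch above U: 0.3541 - 0.1426 - 0.0537, or 0.3541 - 0.080 - 0.116
• Final join: 0.7083 / 2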
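Constructing MSA
[Figure: the alignment is built in three stages, following the guide tree.]
Stage 1 (two separate pairs):
human     ACGTACGTCC
chimp     ACCTACGTCC
gorilla   ACCACCGTCC
orangutan ACCCCCCTCC
Stage 2 (the two pairs joined):
human     ACGTACGTCC
chimp     ACCTACGTCC
gorilla   ACCACCGTCC
orangutan ACCCCCCTCC
Stage 3 (macaque added):
human     ACGTACGTCC
chimp     ACCTACGTCC
gorilla   ACCACCGTCC
orangutan ACCCCCCTCC
macaque   CCCCCCCCCC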
Alignment score
Columns:
1234
ACGT
ACGA
AGGA
match=1, mismatch=0
• 1: A-A + A-A + A-A = 1+1+1 = 3
• 2: C-C + C-G + C-G = 1+0+0 = 1
• 3: G-G + G-G + G-G = 1+1+1 = 3
• 4: T-A + T-A + A-A = 0+0+1 = 1
• S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8
• The higher the score, the better the alignment
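The column-by-column scoring above is commonly called a sum-of-pairs score. A minimal Python sketch, assuming a gap-free alignment as in the slide's example (the function name and the default match/mismatch values are chosen to mirror the slide, and gap penalties are deliberately left out):

```python
from itertools import combinations


def sum_of_pairs(alignment: list[str], match: int = 1, mismatch: int = 0) -> int:
    """Sum match/mismatch scores over all pairs of residues in each column.

    Gaps are not handled here; the slide's example has none.
    """
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += match if x == y else mismatch
    return total


print(sum_of_pairs(["ACGT", "ACGA", "AGGA"]))  # 8, as on the slide
```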
Progressive alignment - pros and cons
• Pros:
– Fast
• Cons:
– Once gaps are opened they can never be closed
– Errors in the alignment of the first few
sequences can have catastrophic effects on
the whole alignment
– Not much used (to my knowledge)
Iterative alignment
• Create a progressive alignment
• After obtaining the alignment calculate a
quality score
• REPEAT THE FOLLOWING STEPS:
– Redo the cluster tree
– Realign the sequences using the new cluster
tree
– Calculate a quality score
• The loop above can be stopped when a maximum number of iterations is reached or when the quality score no longer improves
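The control flow of this loop can be sketched in Python as follows. The sketch only shows the stopping logic: `refine` and `score` are placeholder callables supplied by the caller, with `refine` standing in for "redo the cluster tree and realign". No real MSA program's API is implied.

```python
from typing import Callable, TypeVar

Alignment = TypeVar("Alignment")


def iterate_alignment(
    initial: Alignment,
    refine: Callable[[Alignment], Alignment],
    score: Callable[[Alignment], float],
    max_iterations: int = 10,
) -> Alignment:
    """Repeat refine() until the score stops improving or the iteration cap is hit."""
    best, best_score = initial, score(initial)
    for _ in range(max_iterations):
        candidate = refine(best)
        candidate_score = score(candidate)
        if candidate_score <= best_score:  # no improvement: stop
            break
        best, best_score = candidate, candidate_score
    return best


# Toy stand-ins: the "alignment" is just a number that can improve up to 3.
print(iterate_alignment(0, refine=lambda x: min(x + 1, 3), score=lambda x: x))  # 3
```

Keeping only candidates that improve the score means the result is never worse than the initial progressive alignment, at the cost of extra running time.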
Iterative alignment
• Allows errors to be corrected, which was not possible in progressive alignment
• Very popular among the MSA methods
• Increases the running time of the method
Iterative alignment
[Figure: diagram of a typical iterative MSA program workflow, with the iteration loop marked.]
Figure from Do & Katoh 2008 http://ai.stanford.edu/~chuongdo/papers/alignment_review.pdf
What MSA program(s) to use?
• Depends on the application
– Phylogenetic studies
– Structure based studies
• Depends on the size of the data
– Some programs cannot handle large datasets
• Remember to evaluate the alignment by eye
What MSA program(s) to use?
• Collection of MSA programs at EBI
• http://www.ebi.ac.uk/Tools/msa/
Summary of MSA
• MSA is relevant for many analysis tasks
– Improved signal from the alignment
• Solving MSA requires heuristics
• Selection of MSA methods depends on the
application
• Results should be evaluated by eye
– And the errors should be corrected with MSA
editors
Manual editing of MSAs?
• Let’s say that you performed an MSA with a computer. However, biologically, it has some faults and needs manual editing
• Editors: Jalview and Seaview
http://www.csc.fi/english/research/sciences/bioscience/programs/index_html
• Input data can be in any of the most common MSA formats (Mase, Phylip, Clustal, MSF, Fasta, NEXUS, PIR and BLC)