Heuristic pairwise sequence alignment

Download Report

Transcript Heuristic pairwise sequence alignment

Multiple sequence alignment (msa)
Lecture 8 CS566
1
Motivation
• “Two swallows do not make a summer”
• Discover conserved regions
– Predict important regions of the protein
– Discover domains
• Search for additional members of a protein
family (profile-based searching)
• Build phylogenetic trees
Lecture 8 CS566
2
Topics
• Scoring schemes
– Pairwise
– N-way
• Optimal
– Multidimensional dynamic programming
• Heuristic algorithms
– Progressive
– Iterative
Lecture 8 CS566
3
Scoring schemes
• Alignment score = l Cl
• Column Score Cl
– Ideally
• Based on n-way joint probability (n-generalized AAS)
– Sum of Pairs
•
•
•
•
i<j sij Based on amino acid substitution matrices
Gap-gap = 0; Gap-char = -g
Commonest scheme used
Fallacious:
– Assumes only 2-way and not n-way joint probabilities
– Score not proportional to number of sequences in alignment
– N-way sums
• Need to know central point of reference (ancestral sequence)
Lecture 8 CS566
4
Multidimensional Dynamic
Programming
• Line up n sequences in a grid having n dimensions
• Score each cell as the maximum of
– Lining up all corresponding characters AND
– All possible combinations of gaps and characters
•
•
•
•
•
Note choice made
Reconstruct alignment by traceback
Global or Local dynamic programming?
Space complexity?
Time complexity?
Lecture 8 CS566
5
MSA – Efficient Multidimensional
Dynamic Programming
• Carillo-Lipman MSA algorithm
– Uses pair-wise dynamic programming to identify
sub-matrix regions of near-optimality
– n-dimensional dynamic programming carried out
within space of intersection of near-optimal
regions
– Still limited to only a few sequences
– Is this an optimal algorithm or not?
Lecture 8 CS566
6
Progressive alignment
• New concepts
– Consider aligning alignments to
alignments/sequences en bloc
– Hierarchical/Sequential order of alignment
(“Once a cobbler, always a cobbler”)
• Heuristic
• Fast
Lecture 8 CS566
7
Progressive alignment - Clustal
•
•
•
•
Compute all pairwise alignments
Convert alignment scores into distances
Build guide tree (phylogenetic tree)
Align sequences in order suggested by ‘guide
tree’
• Position specific scoring system used
– Gap costs depend on position
• Composition based scoring system used
– Percentage similarity dictates choice of scoring matrix
– Weighting based on composition bias
• Only ‘cross-terms’ (profile-profile) used in
scoring
Lecture 8 CS566
8
Progressive alignment - Clustal
• ClustalV (Now history!)
• ClustalW (Takes weighting into account for
composition bias)
• ClustalX (Graphical interface)
Lecture 8 CS566
9
Iterative refinement-1
• “Once a cobbler, now a king!”
• Iterative algorithm:
– Compute all pairwise similarities
– Start with best pair
– Add ‘most-similar’ sequence to profile
successively till none left
– Remove and re-align each sequence till
convergence
Lecture 8 CS566
10
Iterative refinement-2
• Genetic programming-based msa
–
–
–
–
Create initial random alignment
Score alignment
Retain better scoring half of alignment
Mutate remaining half of alignment with ideas from
genetic recombination
• Random gap insertion
• En bloc shifts
• Probabilistic order of alignment
– Score resulting alignment
– Iterate till convergence
Lecture 8 CS566
11