Transcript Slides
BNFO 240
Usman Roshan
Last time
• Traceback for alignment
• How to select the gap penalties?
• Benchmark alignments
– Structural superimposition
– BAliBASE
Database searching
• Suppose we have a set of 1,000,000
sequences
• You have a query sequence q and want
to find the m closest ones in the
database---that means 1,000,000
pairwise alignments!
• How to speed up pairwise alignments?
FASTA
• FASTA was the first
software for quick
searching of a
database
• Introduced the idea
of searching for kmers
• Can be done quickly
by preprocessing
database
FASTA: combine high scoring hits into
diagonal runs
BLAST
Key idea: search for k-mers (short matchig substrings)
quickly by preprocessing the database.
BLAST
This key idea can also be used for speeding up pairwise
alignments when doing multiple sequence alignments
Biologically realistic scoring matrices
• PAM and BLOSUM are most popular
• PAM was developed by Margaret
Dayhoff and co-workers in 1978 by
examining 1572 mutations between 71
families of closely related proteins
• BLOSUM is more recent and computed
from blocks of sequences with sufficient
similarity
PAM
• We need to compute the probability transition
matrix M which defines the probability of
amino acid i converting to j
• Examine a set of closely related sequences
which are easy to align---for PAM 1572
mutations between 71 families
• Compute probabilities of change and
background probabilities by simple counting
PAM
• In this model the unit of evolution is the amount
of evolution that will change 1 in 100 amino
acids on the average
The scoring matrix Sab is the ratio of Mab to pb
PAM Mij matrix (x10000)
Multiple sequence alignment
• “Two sequences whisper, multiple
sequences shout out loud”---Arthur Lesk
• Computationally very hard---NP-hard
Formally…
Multiple sequence alignment
Unaligned sequences
Aligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACTT
_G_ _ GCTT_
TAGGCCTT_
TAGCCCTTA
A_ _CACTTC
A_ _C_ CTT_
Conserved regions help us
to identify functionality
Sum of pairs score
Sum of pairs score
• What is the sum of
pairs score of this
alignment?
Profiles
• Before we see how to construct multiple
alignments, how do we align two
alignments?
• Idea: summarize an alignment using its
profile and align the two profiles
Profile alignment
Iterative alignment
(heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to
closest one t in S
• Remove t from S and compute profile of
alignment
• While sequences remaining in S
– Do |S| pairwise alignments and align to closest
one t
– Remove t from S
Iterative alignment
• Once alignment is computed randomly
divide it into two parts
• Compute profile of each sub-alignment
and realign the profiles
• If sum-of-pairs of the new alignment is
better than the previous then keep,
otherwise continue with a different
division until specified iteration limit
Progressive alignment
• Idea: perform profile alignments in the
order dictated by a tree
• Given a guide-tree do a post-order
search and align sequences in that
order
• Widely used heuristic