No Slide Title

Download Report

Transcript No Slide Title

Multiple sequence alignment
Arthur W. Chou
Fall, 2005
Multiple sequence alignment: definition
Given:
• Set of sequences
• Similarity score matrix
• Gap penalties
Find:
Alignment of sequences such that optimal score
is achieved.
Result: a collection of three or more protein or nucleic
acid sequences that are partially or completely aligned,
such that homologous residues are aligned in columns
across the length of the sequences.
Why do we care about protein MA?
1. Useful way to summarize the sequences of related
proteins.
What do globin sequences look like?
4mbn
1myt
2hhb
2mhb
1pbx
2hhb
2mhb
2lhb
1mba
1sdh
1lh1
1hlb
1ith
1ecd
2hbg
.
.
A
A
A
B
B
.
.
A
.
.
A
.
.
----------VLSEGEWQLVLHVWAKVE--ADVAGH
--------------ADFDAVLKCWGPVE--ADYTTM
----------VLSPADKTNVKAAWGKVG--AHAGEY
----------VLSAADKTNVKAAWSKVG--GHAGEY
----------SLSDKDKAAVRALWSKIG--KSADAI
---------VHLTPEEKSAVTALWGKV----NVDEV
---------VQLSGEEKAAVLALWDKV----NEEEV
-PIVDTGSVAPLSAAEKTKIRSAWAPVY--STYETS
----------SLSAAEADLAGKSWAPVFA--NKNAN
--PSVYDAAAQLTADVKKDLRDSWKVIGS--DKKGN
---------GALTESQAALVKSSWEEFN--ANIPKH
GGTLAIQAQGDLTLAQKKIVRKTWHQLMRN--KTSF
----------GLTAAQIKAIQDHWFLNI-KGCLQAA
-----------LSADQISTVQASFDKVK------GD
----------GLSAAQRQVIAATWKDIAGADNGAGV
Why do we care about protein MA?
2. Useful way to find important functional amino acids
by assessing conservation over many sequences.
What is conserved?
DRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKG
PKFAGI-AQADIAGNAAISAHGATVLKKLGELLKAKG
PHF-DLSH-----GSAQVKGHGKKVADALTNAVAHVD
PHF-DLSH-----GSAQVKAHGKKVGDALTLAVGHLD
SHWPDVTP-----GSPHIKAHGKKVMGGIALAVSKID
ESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLD
DSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLD
PKFKGLTTADELKKSADVRWHAERIINAVDDAVASMD
ADFKGKSVAD-IKASPKLRDVSSRIFTRLNEFVNNAA
KRLGNVS---QGMANDKLRGHSITLMYALQNFIDQLD
SFLKGT--SEVPQNNPELQAHAGKVFKLVYEAAIQLE
PQMAGM-SASQLRSSRQMQAHAIRVSSIMSEYVEELD
HKFS-SVPLYGLRSNPAYKAQTLTVINYLDKVVDALG
TQFAG-KDLESIKGTAPFETHANRIVGFFSKIIGELP
GFSGA--------SDPGVAALGAKVLAQIGVAVSHLG
Why do we care about protein MA?
3. Establish evolutionary relationships between sequences.
What was sequence of events leading to current species?
4mbn
1myt
2hhb
2mhb
1pbx
2hhb
2mhb
2lhb
1mba
1sdh
1lh1
1hlb
1ith
1ecd
2hbg
.EAIIHVLHSRHPGDFGADAQGAMNKA
.EVLVKVMHEKAGLD--AGGQTALRNV
AHCLLVTLAAHLPAEFTPAVHASLDKF
AHCLLSTLAVHLPNDFTPAVHASLDKF
AHCILVVISTMFPKEFTPEAHVSLDKF
BNVLVCVLAHHFGKEFTPPVQAAYQKV
BNVLVVVLARHFGKDFTPELQASYQKV
.AVIADTVAAG---------DAGFEKL
.SMFPGFVASVAA--PPAGADAAWTKL
AGPIKKVLASK---NFGDKYANAWAKL
.EAILKTIKEVVGAKWSEELNSAWTIA
.MEALQAELGSD---FNEKTRDAWAKA
AKLVGGVFQEE--FSADPTTVAAWGDA
.AGFVSYMKAHT--DF-AGAEAAWGAT
.ASLLSAMEHRIGGKMNAAAKDAWAAA
Why do we care about protein MA?
4. More precisely understand how to model 3D structures.
What other amino acids are acceptable in this structure?
4mbn
1myt
2hhb
2mhb
1pbx
2hhb
2mhb
2lhb
1mba
1sdh
1lh1
1hlb
1ith
1ecd
2hbg
.EAIIHVLHSRHPGDFGADAQGAMNKA
.EVLVKVMHEKAGLD--AGGQTALRNV
AHCLLVTLAAHLPAEFTPAVHASLDKF
AHCLLSTLAVHLPNDFTPAVHASLDKF
AHCILVVISTMFPKEFTPEAHVSLDKF
BNVLVCVLAHHFGKEFTPPVQAAYQKV
BNVLVVVLARHFGKDFTPELQASYQKV
.AVIADTVAAG---------DAGFEKL
.SMFPGFVASVAA--PPAGADAAWTKL
AGPIKKVLASK---NFGDKYANAWAKL
.EAILKTIKEVVGAKWSEELNSAWTIA
.MEALQAELGSD---FNEKTRDAWAKA
AKLVGGVFQEE--FSADPTTVAAWGDA
.AGFVSYMKAHT--DF-AGAEAAWGAT
.ASLLSAMEHRIGGKMNAAAKDAWAAA
What is the protein MA Gold Standard?
Structural Alignment
If sequences can be aligned, the alignment should
reflect structural similarities.
Thus, the alignment should lead to “match” of
common structural and functional elements.
Aligning non-coding DNA sequences
• Conserved signals in DNA for control of
expression
• Can infer evolutionary relationships
• Can identify Important functional regions
• A much harder problem!
Methods for Multiple Alignment
1. Exhaustive search: extension of DP to multiple
dimensions. E.g. MSA algorithm
2. Progressive alignment: compute tree of sequences,
based on hierarchical clustering, and then merge closest
first, greedily. E.g. ClustalW
3. Anchor on locally conserved blocks: find highly
conserved regions and then grow alignment around these
regions. E.g. BLAST
4. Iterative search: based on genetic algorithm search
5. Probabilistic/statistical: E.g. Gibbs Sampling, HMM
How to score a Multiple Alignment?
Sum of Pairs = SP
Compute the pairwise score of all pairs of characters and
then sum them up, for each aligned column of the
sequences, :
SP-score ( I , - , I, V ) = s(I, -) + s(I, I) + s(I, V) +
s(-, l) + s(-, V) + s(I, V)
Note that s( - , - ) = 0
Gap penalty: can be constant or linear
MSA algorithm uses constant
Multidimensional Dynamic Programming
Why not just use same technique as for
pairwise alignment?
Instead of 2-dimensional matrix, use N-dimensional;
N = the number of sequences.
Complexity increases with the number of
sequences, so only N < 10 and lengths ~ 200 can be
accommodated.
Dynamic Programming with scores and penalties

from ‘i-th’ pos. in A and ‘j-th’ pos. in B, ‘k-th’ pos. in C
onward
SP-score (A[i] , B[j], c[k]) + S[i+1, j+1, k+1]
S[i , j, k] = max
best score
from
i, j, k onward
max { S[i+x, j, k] – w( x ); }
max { S[i, j+y, k] – w( y ); }
max { S[i, j, k+z] – w( z ); }
max { S[ i+x, j+y, k ] – w( x ) – w( y ); }
.............
MSA Algorithm Based on dynamic programming
concept, using some bounds :
1. Compute optimal pairwise alignments to get an
upper bound on any pair of alignments. MSA can’t do
any better than sum of optimal pairwise alignments.
2. Create heuristic multiple alignment in ad hoc
fashion to create a lower bound on MA score (using a
guide tree).
3. Search N-dimensional scoring matrix for the best
score including i-th element of sequence 1, j-th of
sequence 2, k-th of sequence 3, …, etc.
AGT
A-T
-GT
A-T
-GT
Problem of Sequence Weights
The available sequences are not randomly sampled,
but reflect biases in how we collect sequences.
If weight everything equally, then closely related
sequences will be allowed to dominate the multiple
alignment. As a result, conclusions about
1) conservation
2) evolutionary distance
3) reliability of predictions
will be wrong.
Sequence Weighting Example
CYEGNGHF
CYEGNGDF
CYHGNGDS
CYHGNGQS
CFNGNGHS
Human-1
Human-2
Mouse
Rat
Fruitfly
Solutions: don’t weight the two humans equally with
the others. Use a measure of similarity to down-weight
their influence on the multiple alignment.
Feng-Doolittle Progressive MSA
1. Do global pairwise alignments
(Needleman and Wunsch) for every pair of
sequences
2. Create a guide tree based on them
(e.g., neighbor joining)
3. Progressively align the sequences with
weights from the guide tree
Progressive MSA stage 1 of 3:
generate global pairwise alignments
five distantly
related lipocalins
best score
Number of pairwise alignments needed
For N sequences, (N-1)(N)/2
For 5 sequences, (4)(5)/2 = 10
~ N2 / 2
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• Use some clustering algorithm to construct the
guide tree (UPGMA)
• A tree shows the distance between objects
• A guide tree is not a phylogenetic tree
Progressive MSA stage 2 of 3:
generate a guide tree calculated from
the distance matrix
4
1
2
3
5
Feng-Doolittle stage 3: progressive alignment
•
Make successive alignment based on the order in the
guide tree
•
Start with the two most closely related sequences
•
Then add the next closest sequence (or cluster)
•
Continue until all sequences are added
•
Rule: “once a gap, always a gap.”
Progressive MSA stage 3 of 3:
progressively align the sequences
Why “once a gap, always a gap”?
•
Where gaps are added is a critical question
•
Gaps are often added to the first two (closest)
sequences
•
To change the initial gap choices later on would be
to give more weight to distantly related sequences
•
To maintain the initial gap choices is to trust
that those gaps are most believable
Problem with Progressive algorithms
1. Dependence of the ultimate MSA on the
initial pairwise sequence alignment with
the highest score
2. Errors in initial alignments are propagated
3. Gaps can proliferate, if not careful
4. Gaps can be amino-acid specific, so that
you penalize introduction of gaps into
segments that are less likely to have gaps
(e.g. hydrophobic core)
Multiple sequence alignment to profile HMMs
• Hidden Markov models (HMMs) are “states”
that describe the probability of having a
particular amino acid residue at arranged
In a column of a multiple sequence alignment
• HMMs are probabilistic models
• Like a hammer is more refined than a blast,
an HMM gives more sensitive alignments
traditional techniques such as
progressive alignments