Multiple Sequence
Alignment
Biology 224
Instructor: Tom Peavy
October 18 & 20, 2010
<Images adapted from Bioinformatics
and Functional Genomics by Jonathan Pevsner>
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid)
sequences that are partially or completely aligned
• Homologous residues are aligned in columns
across the length of the sequences
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense
Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• protein sequences evolve...
• ...the corresponding three-dimensional structures
of proteins also evolve
• may be impossible to identify amino acid residues
that align properly (structurally) throughout a multiple
sequence alignment
• for two proteins sharing 30% amino acid identity,
about 50% of the individual amino acids
are superposable in the two structures
Multiple sequence alignment: features
• some aligned residues, such as cysteines that form
disulfide bridges, may be highly conserved
• there may be conserved motifs such as a
transmembrane domain
• there may be conserved secondary structure features
• there may be regions with consistent patterns of
insertions or deletions (indels)
Multiple sequence alignment: methods
There are two main ways to make
a multiple sequence alignment:
(1) Progressive alignment (Feng & Doolittle).
(e.g. ClustalW)
(2) Iterative approaches.
Use Clustal W to do a progressive MSA
http://www2.ebi.ac.uk/clustalw/
Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments
(Needleman and Wunsch)
[2] Create a guide tree
[3] Progressively align the sequences
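Stage [1] relies on global pairwise alignment. As a minimal sketch of Needleman-Wunsch scoring, the function below fills the dynamic programming table row by row. The match, mismatch, and gap values are illustrative assumptions, not ClustalW's settings (ClustalW scores proteins with substitution matrices and separate gap-opening and gap-extension penalties).

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are hypothetical defaults chosen for illustration.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


identical = needleman_wunsch_score("GTWYA", "GTWYA")
one_mismatch = needleman_wunsch_score("GTWYA", "GLWYA")
```

Only the score is returned here; recovering the alignment itself would require keeping the full table and tracing back through it.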
Progressive MSA stage 1 of 3:
generate global pairwise alignments
(five closely related lipocalins)
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99 (best score)
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
Number of pairwise alignments needed
For N sequences, (N-1)(N)/2
For 5 sequences, (4)(5)/2 = 10
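The count above can be checked by enumerating the pairs directly. This is a small sketch; the function name `pairwise_count` is a hypothetical helper, not part of any alignment package.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Enumerate every pair for five sequences, numbered 1..5 as in the slides
pairs = list(combinations(range(1, 6), 2))
```

The enumerated pairs run from (1, 2) to (4, 5), the same order in which the pairwise alignments are reported.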
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Distance methods are used (e.g. neighbor joining)
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
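To illustrate the first bullet, the sketch below converts the ten pairwise scores from stage 1 into distances and picks the closest pair. Two assumptions are made here: the scores are read as percent similarity, listed in the same order as the sequence pairs, and the conversion distance = 1 - score/100 is one simple choice, not necessarily the exact formula ClustalW applies.

```python
# Pairwise scores from stage 1, keyed by (sequence i, sequence j)
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84,
    (4, 5): 96,
}


def to_distance(score_percent):
    """Assumed conversion: higher similarity means smaller distance."""
    return 1.0 - score_percent / 100.0


distances = {pair: to_distance(s) for pair, s in scores.items()}

# The pair with the smallest distance is joined first in the guide tree
closest_pair = min(distances, key=distances.get)
```

With these numbers the closest pair is sequences 2 and 3, which matches the pair with the best pairwise score.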
Progressive MSA stage 2 of 3:
generate guide tree
((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542):0.01900,Pig RBP:0.01924,Bovine RBP:0.01902);
Tree leaves (five closely related lipocalins):
1 (human RBP)
2 (murine RBP)
3 (rat RBP)
4 (porcine RBP)
5 (bovine RBP)
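The parenthesized tree description is written in Newick format: each leaf is given as name:branch length, and nested parentheses group the sequences that are joined first. As a rough sketch, a regular expression can pull out the leaf names and their branch lengths. This is not a full Newick parser; it assumes leaf names contain only letters and spaces, as they do here.

```python
import re

guide_tree = (
    "((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542)"
    ":0.01900,Pig RBP:0.01924,Bovine RBP:0.01902);"
)

# A leaf is a name followed by ':' and a number. Internal branch lengths
# follow ')' instead of a name, so this pattern skips them.
leaf_pattern = re.compile(r"([A-Za-z][A-Za-z ]*):([0-9.]+)")

leaf_lengths = {name: float(length)
                for name, length in leaf_pattern.findall(guide_tree)}

# Leaf with the shortest terminal branch
shortest_leaf = min(leaf_lengths, key=leaf_lengths.get)
```

The innermost parentheses hold mouse and rat RBP, the two sequences that are aligned first in stage 3.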
Feng-Doolittle stage 3: progressive alignment
• Make an MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap”
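The rule can be pictured as follows: when a new sequence is added and needs extra gap columns, those columns are inserted into every row already in the alignment, and gaps placed earlier are never removed. The sketch below shows only that bookkeeping step. The helper name `add_gap_columns` and the toy rows are hypothetical, and deciding where the new gaps go (the actual alignment step) is not shown.

```python
def add_gap_columns(alignment, positions):
    """Insert a gap column at each given index in every aligned row.

    Existing gaps are left untouched ("once a gap, always a gap").
    Indices refer to columns of the alignment before insertion.
    """
    rows = [list(row) for row in alignment]
    # Work right to left so earlier indices remain valid
    for pos in sorted(positions, reverse=True):
        for row in rows:
            row.insert(pos, "-")
    return ["".join(row) for row in rows]


# Toy alignment of the two closest sequences; it already contains one gap
existing = ["GTW-YA", "GLWFYA"]

# Suppose adding a third sequence requires a new gap column at index 1
updated = add_gap_columns(existing, [1])
```

After the update both rows gain the new column, and the gap that was already present in the first row is still there.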
Clustal W alignment of 5 closely related lipocalins
CLUSTAL W (1.82) multiple sequence alignment
gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
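In the Clustal output, "*" marks a column in which every sequence has the same residue. The sketch below recomputes only those marks for the first 50 columns of the alignment. It is a simplification: the ":" and "." symbols, which Clustal uses for conservative substitutions, are not reproduced, and any column containing a gap is left blank.

```python
# First 50 alignment columns of the five lipocalins, in the order listed
block = [
    "MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP",
    "-" * 18 + "ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP",
    "MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP",
    "MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP",
    "MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP",
]


def conserved_columns(rows):
    """Return a string with '*' under fully conserved, gap-free columns."""
    marks = []
    for column in zip(*rows):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


marks = conserved_columns(block)
```

The first 18 columns receive no mark because one sequence has gaps there, which is why the Clustal consensus line for this block starts partway along the row.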
Why “once a gap, always a gap”?
• There are many possible ways to make an MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be
to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust
that those gaps are most believable
Multiple sequence alignment to profile HMMs
• Hidden Markov models (HMMs) are “states”
that describe the probability of having a
particular amino acid residue
in a column of a multiple sequence alignment
• HMMs are probabilistic models
• Like a hammer is more refined than a blast,
an HMM gives more sensitive alignments
than traditional techniques such as
progressive alignments
An HMM is constructed from a MSA
Example: five lipocalins
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)

Prob.   1     2     3     4     5
p(G)    1.0
p(T)          0.4
p(L)          0.2
p(R)          0.2
p(E)          0.2               0.4
p(W)                1.0
p(Y)                      0.8
p(F)                      0.2
p(A)                            0.4
p(S)                            0.2

HMM states by position:
1: G:1.0
2: T:0.4, L:0.2, R:0.2, E:0.2
3: W:1.0
4: Y:0.8, F:0.2
5: E:0.4, A:0.4, S:0.2

P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
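The probabilities in this example can be recomputed from the five aligned sequences. The sketch below counts residues per column and multiplies the per-position probabilities for GEWYE. It covers only the residue frequencies; a full profile HMM also has insert and delete states, transition probabilities, and usually pseudocounts, none of which are modeled here. The function names are hypothetical.

```python
import math
from collections import Counter

sequences = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(seqs):
    """For each alignment column, map residue -> observed frequency."""
    n = len(seqs)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*seqs)]


def sequence_probability(seq, probs):
    """Multiply the per-column probabilities of each residue in seq."""
    p = 1.0
    for residue, column in zip(seq, probs):
        p *= column.get(residue, 0.0)  # unseen residue: probability 0
    return p


probs = column_probabilities(sequences)
p_gewye = sequence_probability("GEWYE", probs)

# Sum of natural logs of the per-position probabilities
log_score = math.log(p_gewye)
```

Note that a residue never seen in a column gets probability 0 in this sketch, which makes the whole product 0 and leaves its logarithm undefined. That is one reason practical models add pseudocounts.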
Databases of multiple sequence alignments
BLOCKS (HMM)
CDD (HMM)
DOMO (Gapped MSA)
INTERPRO
iProClass
MetaFAM
Pfam (profile HMM library)
PRINTS
PRODOM (PSI-BLAST)
PROSITE
SMART
CDD uses RPS-BLAST (reverse position-specific BLAST)
Query = your favorite protein
Database = set of many PSSMs
CDD is related to PSI-BLAST, but distinct
CDD searches against profiles generated
from pre-selected alignments
Purpose: to find conserved domains
in the query sequence
You can access CDD via DART at NCBI
Multiple sequence alignment algorithms
Local
  Progressive: PIMA
  Iterative: DIALIGN
Global
  Progressive: CLUSTAL, PileUp
  other: SAGA
Multiple sequence alignment programs
AMAS
CINEMA
ClustalW
ClustalX
DIALIGN
HMMT
Match-Box
MultAlin
MSA
Musca
PileUp
SAGA
T-COFFEE
Clustal X
GCG
PileUp
Boxshade Alignment (“Pretty Shading”)
Boxshade server: http://www.ch.embnet.org/software/BOX_form.html
Assessment of alternative
multiple sequence alignment algorithms
[1] As percent identity among proteins drops,
performance (accuracy) declines also. This is
especially severe for proteins < 25% identity.
Proteins <25% identity: 65% of residues align well
Proteins <40% identity: 80% of residues align well
[2] “Orphan” sequences are highly divergent members of a family.
Surprisingly, orphans do not disrupt alignments. Also surprisingly,
global alignment algorithms outperform local.