Multiple sequence alignment
Multiple sequence alignment: outline
[1] Introduction to MSA
Exact methods
Progressive (ClustalW)
Iterative (MUSCLE)
Consistency (ProbCons)
Structure-based (Expresso)
Conclusions: benchmarking studies
[3] Hidden Markov models (HMMs), Pfam and CDD
[4] MEGA to make a multiple sequence alignment
[5] Multiple alignment of genomic DNA
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid)
sequences that are partially or completely aligned
• homologous residues are aligned in columns
across the length of the sequences
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense
Page 320
[Figure: the same sequences aligned by ClustalW, Praline, MUSCLE, ProbCons, and T-Coffee]
Note how the region of a conserved histidine (▼) varies
depending on which algorithm is used
Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• protein sequences evolve...
• ...the corresponding three-dimensional structures
of proteins also evolve
• may be impossible to identify amino acid residues
that align properly (structurally) throughout a multiple
sequence alignment
• for two proteins sharing 30% amino acid identity,
about 50% of the individual amino acids
are superposable in the two structures
Page 320
Multiple sequence alignment: features
• some aligned residues, such as cysteines that form
disulfide bridges, may be highly conserved
• there may be conserved motifs such as a
transmembrane domain
• there may be conserved secondary structure features
• there may be regions with consistent patterns of
insertions or deletions (indels)
Page 320
Multiple sequence alignment: uses
• MSA is more sensitive than pairwise alignment
to detect homologs
• BLAST output can take the form of a MSA,
and can reveal conserved residues or motifs
• Population data can be analyzed in a MSA (PopSet)
• A single query can be searched against
a database of MSAs (e.g. PFAM)
• Regulatory regions of genes may have consensus
sequences identifiable by MSA
Page 321
Multiple sequence alignment: outline
[1] Introduction to MSA
Exact methods
Progressive (ClustalW)
Iterative (MUSCLE)
Consistency (ProbCons)
Structure-based (Expresso)
Conclusions: benchmarking studies
[3] Hidden Markov models (HMMs), Pfam and CDD
[4] MEGA to make a multiple sequence alignment
[5] Multiple alignment of genomic DNA
[6] Introduction to molecular evolution and phylogeny
Multiple sequence alignment: methods
Exact methods: dynamic programming
Instead of the 2-D dynamic programming matrix in the
Needleman-Wunsch technique, think about a 3-D,
4-D or higher order matrix.
Exact methods give optimal alignments but are not
feasible in time or space for more than ~10 sequences.
Still an extremely active field.
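To see why the higher-order matrix becomes infeasible, the size of the full dynamic programming matrix can be counted directly. This is a rough sketch only: the function name is illustrative, and the sequence length of 150 residues is an assumed, globin-like value rather than a figure from the slides.

```python
def exact_msa_cost(seq_lengths):
    """Return (cells, moves_per_cell) for a full N-dimensional DP matrix."""
    cells = 1
    for length in seq_lengths:
        cells *= length + 1  # one extra index per dimension for the empty prefix
    # Each cell looks back at every non-empty subset of sequences advancing by one.
    moves_per_cell = 2 ** len(seq_lengths) - 1
    return cells, moves_per_cell


for n in range(2, 9):
    cells, moves = exact_msa_cost([150] * n)  # 150 residues is an assumption
    print(f"{n} sequences: {cells:.3e} cells x {moves} predecessor moves")
```

Both the number of cells and the number of predecessors per cell grow exponentially with the number of sequences.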
Multiple sequence alignment: methods
Progressive methods: use a guide tree (a little like a
phylogenetic tree but NOT a phylogenetic tree) to
determine how to combine pairwise alignments one by one
to create a multiple alignment.
Making multiple alignments using trees was a very
popular subject in the 1980s. Fitch and Yasunobu (1974)
may have first proposed the idea, and Hogeweg and
Hesper (1984) and many others worked on the topic before
Feng and Doolittle (1987), who made one important
contribution that got their names attached to this
alignment method.
Examples: CLUSTALW, MUSCLE
Multiple sequence alignment: methods
Example of MSA using ClustalW: two data sets
Five distantly related lipocalins (human to E. coli)
Five closely related RBPs
When you do this, obtain the sequences of
interest in the FASTA format!
(You can save them in a Word document)
Page 321
Multidimensional Dynamic Programming
• Example: in 3D (three sequences):
• 7 neighbors/cell
F(i,j,k)
= max{ F(i-1,j-1,k-1)+S(xi, xj, xk),
       F(i-1,j-1,k  )+S(xi, xj, - ),
       F(i-1,j  ,k-1)+S(xi, - , xk),
       F(i-1,j  ,k  )+S(xi, - , - ),
       F(i  ,j-1,k-1)+S(- , xj, xk),
       F(i  ,j-1,k  )+S(- , xj, - ),
       F(i  ,j  ,k-1)+S(- , - , xk) }
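A minimal sketch of the three-sequence recurrence above. It assumes a sum-of-pairs column score S built from hypothetical match/mismatch/gap values (1, -1, -2); the function names are illustrative and do not come from the slides. It returns only the optimal score, not the alignment itself.

```python
from itertools import product

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols within a column (assumed values)."""
    if a == GAP and b == GAP:
        return 0  # two gaps facing each other are not penalized here
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def column_score(a, b, c):
    """Sum-of-pairs score S(a, b, c) for one alignment column."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)


def align3_score(x, y, z):
    """Optimal sum-of-pairs score for three sequences using a 3-D matrix."""
    n, m, p = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    F[0][0][0] = 0
    # The 7 neighbors: each sequence either consumes a residue (1) or gets a gap (0).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[i - 1] if di else GAP
            b = y[j - 1] if dj else GAP
            c = z[k - 1] if dk else GAP
            candidate = F[pi][pj][pk] + column_score(a, b, c)
            if candidate > F[i][j][k]:
                F[i][j][k] = candidate
    return F[n][m][p]


print(align3_score("AC", "AC", "AC"))
```

The triple loop over all cells is what makes this approach impractical beyond a handful of short sequences.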
HOW CAN I ALIGN MANY SEQUENCES
2 Globins => 1 minute
3 Globins => 2 hours
4 Globins => 10 days
5 Globins => 3 years
6 Globins => 300 years
7 Globins => 30,000 years
8 Globins => 3 million years
• The choice of an objective function
A biological problem that lies in the definition of
correctness
– Sum of pairs, entropy score, consistency based, …
• The optimization of that function
– Exact algorithms (dynamic programming)
– Progressive alignment (ClustalW)
– Iterative approaches (simulated annealing, genetic algorithms, …)
Alignment Costs
                    Traditional (SP)   Tree-Alignment   Star-Alignment
Input seq           A, A, A, C, C      A, A, A, C, C    A, A, A, C, C
Reconstructed seq   --                 A, A, C          A
Mismatches          6                  1                2
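The sum-of-pairs and star costs for the column A, A, A, C, C can be checked with a short sketch. The function names are illustrative; the tree-alignment cost is left out because it depends on the tree topology, which the extracted slide text does not give.

```python
from itertools import combinations


def sp_mismatches(column):
    """Sum-of-pairs cost: count every pair of differing symbols in the column."""
    return sum(1 for a, b in combinations(column, 2) if a != b)


def star_mismatches(column):
    """Star cost: mismatches against the best single reconstructed symbol."""
    return min(sum(1 for s in column if s != center) for center in set(column))


column = ["A", "A", "A", "C", "C"]
print(sp_mismatches(column))    # 6
print(star_mismatches(column))  # 2
```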
The Progressive Multiple Alignment Algorithm (Clustal W)
Making An Alignment
Any exact method would be TOO SLOW.
We will use a heuristic algorithm.
Progressive alignment is the most popular approach:
-ClustalW
-Greedy heuristic (no guarantee of an optimal alignment).
-Fast
Progressive Alignment
Feng and Doolittle, 1987; Taylor 1989
Clustering
Progressive Alignment
Dynamic Programming Using A Substitution Matrix
Progressive Alignment
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix.
•Gap penalties (gap opening Gop, gap extension Gep).
•Sequence Weight.
•Tree making Algorithm.
Example: Progressive alignment
[Figure: pairwise alignments of all pairs (1+2, 1+3, 1+4, 2+3, 2+4, 3+4) -> guide tree for sequences 1-4 -> MSA built by adding sequences in guide tree order]
Progressive alignment (cont.)
[Figure: sequences 1-5 -> distance matrix -> guide tree]
Distance Matrix: displays distances of all sequence pairs (D = 1 - S).
Guide Tree: built with UPGMA (unweighted pair group method using arithmetic averages)
or the Neighbour-Joining method.
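UPGMA Clustering (Guide Tree)
Step 1: sequences 1 and 2 are closest (d = 2) and are merged into u
d_ij   1     2     3     4     5
1      0     2     6     9     7
2            0     5     7     7
3                  0     5     4
4                        0     3
5                              0
Step 2: sequences 4 and 5 are closest (d = 3) and are merged into v
d_ij   u     3     4     5
u      0     5.5   8     7
3            0     5     4
4                  0     3
5                        0
Step 3: 3 and v are closest (d = 4.5) and are merged into w
d_ij   u     3     v
u      0     5.5   7.5
3            0     4.5
v                  0
Step 4: u and w are joined
d_ij   u     w
u      0     6.83
w            0
Resulting guide tree: ((1,2),(3,(4,5)))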
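The clustering steps above can be sketched in a few lines. This is an illustrative implementation, not ClustalW's own code; the distance values are those read from the slide's first matrix, which is an assumption where the extracted text is ambiguous, and the function name `upgma` is hypothetical.

```python
def upgma(labels, d):
    """UPGMA clustering; d maps frozenset({a, b}) to the distance between leaves a and b.

    Returns the tree as a Newick-style string plus the list of merges.
    """
    clusters = {name: [name] for name in labels}  # cluster label -> member leaves
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                # Average distance over all leaf pairs between the two clusters.
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                dist = sum(d[frozenset(pq)] for pq in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, dist))
    (newick,) = clusters
    return newick + ";", merges


raw = {("1", "2"): 2, ("1", "3"): 6, ("1", "4"): 9, ("1", "5"): 7,
       ("2", "3"): 5, ("2", "4"): 7, ("2", "5"): 7,
       ("3", "4"): 5, ("3", "5"): 4, ("4", "5"): 3}
d = {frozenset(k): v for k, v in raw.items()}
newick, merges = upgma(["1", "2", "3", "4", "5"], d)
print(newick)
for a, b, dist in merges:
    print(f"merge {a} + {b} at distance {dist:.2f}")
```

Averaging over all leaf pairs (rather than averaging the two previous cluster distances) is what makes the method "unweighted": every original sequence contributes equally.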
Progressive alignment (cont.)
[Figure: the guide tree for sequences 1-5 directs a series of alignments of alignments]
Columns, once aligned, are never changed; only new gaps are inserted.
The result depends strongly on the pairwise alignments and on the initial starting sequences.
There is no guarantee that the globally optimal solution will be found.
For sequences with less than 25-30% identity, this approach becomes much
less reliable.
Progressive Alignment
When Doesn’t It Work
Input sequences
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

Pairwise alignments made first
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---

SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

CLUSTALW (Score=20, Gop=-1, Gep=0, M=1)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

CORRECT (Score=24)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST ---- CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Iterative alignment
[Figure: sequences A-E -> pairwise distance table -> guide tree -> MSA]
Pairwise distance table
     A    B    C    D
B    11
C    3    1
D    2    2    10
E    1    1    1    1
Iterate until the MSA doesn’t change (convergence)
The input for ClustalW: a group of sequences
(DNA or protein) in the FASTA format
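For reference, a FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such text into a dictionary; the function name and the two example records are hypothetical and only illustrate the format.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # identifier = first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 hypothetical record
MEWVWALVLL
AALGSAQA
>seq2
MKWVWALLLL
"""
print(parse_fasta(example))
```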
Get sequences from Entrez Protein (or HomoloGene)
You can display sequences from Entrez Protein
in the FASTA format
When you get a DNA sequence from Entrez
Nucleotide, you can click CDS to select only the
coding sequence.
This is very useful for phylogeny studies.
HomoloGene: an NCBI resource to obtain
multiple related sequences
[1] Enter a query at NCBI such as globin
[2] Click on HomoloGene (left side)
[3] Choose a HomoloGene family, and
view in the FASTA format
Use ClustalW to do a progressive MSA
http://www2.ebi.ac.uk/clustalw/
Fig. 10.1
Page 321
Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments
(Needleman and Wunsch’s dynamic programming
algorithm)
[2] Create a guide tree
[3] Progressively align the sequences
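Stage [1] relies on global pairwise alignment. The following is a minimal sketch of the Needleman-Wunsch score computation with a simple linear gap penalty; the match, mismatch, and gap values are assumptions for illustration, whereas real tools use substitution matrices and separate gap opening and extension penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b, keeping only two rows of the matrix."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: b aligned to gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # first column: a aligned to gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
```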
Page 321
Progressive MSA stage 1 of 3:
generate global pairwise alignments
five distantly
related lipocalins
best score
Fig. 10.2
Page 323
Progressive MSA stage 1 of 3:
generate global pairwise alignments
five closely related lipocalins
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
best score: 99
Fig. 10.4
Page 325
Number of pairwise alignments needed
For n sequences, (n-1)(n) / 2
For 5 sequences, (4)(5) / 2 = 10
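The count can be confirmed by enumerating the pairs, in the same (1:2), (1:3), ... order that the ClustalW log uses. The function name is illustrative.

```python
from itertools import combinations


def pairwise_jobs(n):
    """List every unordered pair of sequence numbers 1..n."""
    return list(combinations(range(1, n + 1), 2))


jobs = pairwise_jobs(5)
print(len(jobs))  # 10
print(jobs)
```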
Page 322
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Use UPGMA (defined in the phylogeny lecture)
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
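The first bullet can be sketched with the ten pairwise scores from the ClustalW log for the five closely related sequences. Two assumptions are made here: the scores are percent identities, and they are listed in the pair order (1:2), (1:3), ..., (4:5). The conversion D = 1 - S follows the relation given earlier in the slides.

```python
from itertools import combinations

# Scores as listed in the log, assumed to follow the pair order (1:2) ... (4:5).
scores = [84, 84, 91, 92, 99, 86, 85, 85, 84, 96]
pairs = list(combinations(range(1, 6), 2))


def to_distance(percent_identity):
    """Convert a percent-identity similarity into a distance D = 1 - S."""
    return 1.0 - percent_identity / 100.0


distances = {pair: to_distance(score) for pair, score in zip(pairs, scores)}
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 2))
```

The closest pair found this way is the one the guide tree joins first.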
Page 323
Progressive MSA stage 2 of 3:
generate a guide tree calculated from
the distance matrix
Fig. 10.2
Page 323
Progressive MSA stage 2 of 3:
generate guide tree
(
(
gi|5803139|ref|NP_006735.1|:0.04284,
(
gi|6174963|sp|Q00724|RETB_MOUS:0.00075,
gi|132407|sp|P04916|RETB_RAT:0.00423)
:0.10542)
:0.01900,
gi|89271|pir||A39486:0.01924,
gi|132403|sp|P18902|RETB_BOVIN:0.01902);
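The tree above is written in the nested-parentheses (Newick-style) syntax that ClustalW emits. As a rough sketch, the leaf names and their branch lengths can be pulled out with a regular expression; this is not a full tree parser, and the function name is illustrative.

```python
import re

GUIDE_TREE = """(
(
gi|5803139|ref|NP_006735.1|:0.04284,
(
gi|6174963|sp|Q00724|RETB_MOUS:0.00075,
gi|132407|sp|P04916|RETB_RAT:0.00423)
:0.10542)
:0.01900,
gi|89271|pir||A39486:0.01924,
gi|132403|sp|P18902|RETB_BOVIN:0.01902);"""


def leaf_branch_lengths(newick):
    """Map each leaf name to the length of the branch leading to it."""
    compact = "".join(newick.split())  # drop line breaks and spaces
    # A leaf is a run of non-structural characters directly followed by ":length".
    # Internal branch lengths follow ")" and therefore never match.
    found = re.findall(r"([^(),:;]+):([0-9.]+)", compact)
    return {name: float(length) for name, length in found}


lengths = leaf_branch_lengths(GUIDE_TREE)
for name, length in sorted(lengths.items(), key=lambda item: item[1]):
    print(f"{length:.5f}  {name}")
```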
five closely
related lipocalins
Fig. 10.4
Page 325
Feng-Doolittle stage 3: progressive alignment
• Make a MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”
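The gap rule can be illustrated with a small sketch: when a new sequence needs extra columns, all-gap columns are opened in every row of the existing alignment, and the gaps already present are left untouched. The helper names are hypothetical, and the gapped form of the new sequence is supplied by hand here rather than computed by an alignment step.

```python
def open_gap_columns(msa, at, count):
    """Insert `count` all-gap columns before column `at` in every row."""
    return [row[:at] + "-" * count + row[at:] for row in msa]


def add_sequence(msa, gapped_seq):
    """Append an already-gapped sequence; all rows must have the same width."""
    if any(len(row) != len(gapped_seq) for row in msa):
        raise ValueError("all rows must have the same number of columns")
    return msa + [gapped_seq]


# Existing alignment of the two closest sequences (the gap in FA-T is kept).
msa = ["THEFA-TCAT", "THEFASTCAT"]
msa = open_gap_columns(msa, at=3, count=4)   # make room for VERY
msa = add_sequence(msa, "THEVERYFASTCAT")
print("\n".join(msa))
```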
Page 324
Progressive MSA stage 3 of 3:
progressively align the sequences
following the branch order of the tree
Fig. 10.3
Page 324
Progressive MSA stage 3 of 3:
CLUSTALX output
Note that you can download CLUSTALX locally, rather than
using a web-based program!
Clustal W alignment of 5 closely related lipocalins
CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
* asterisks indicate identity in a column
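A short sketch of how the asterisk line can be derived: a column gets "*" only when every sequence carries the same residue. ClustalW additionally marks conservative substitutions with ":" and "."; those are omitted here. The rows are the first 20 columns of the second block of the alignment above, and the function name is illustrative.

```python
def identity_line(rows):
    """Return '*' for fully identical, non-gap columns and ' ' otherwise."""
    marks = []
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


rows = [
    "EGLFLQDNIVAEFSVDENGH",
    "EGLFLQDNIVAEFSVDENGH",
    "EGLFLQDNIVAEFSVDETGQ",
    "EGLFLQDNIIAEFSVDEKGH",
    "EGLFLQDNIIAEFSVDEKGH",
]
print(rows[0])
print(identity_line(rows))
```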
Fig. 10.5
Page 326
Progressive MSA stage 3 of 3:
progressively align the sequences
following the branch order of the tree:
Order matters
Input order:
THE LAST FAT CAT
THE FAST CAT
THE VERY FAST CAT
THE FAT CAT
Step 1:
THE LAST FAT CAT
THE FAST CAT ---
Step 2:
THE LAST FA-T CAT
THE FAST CA-T ---
THE VERY FAST CAT
Step 3:
THE LAST FA-T CAT
THE FAST CA-T ---
THE VERY FAST CAT
THE ---- FA-T CAT
Adapted from C. Notredame, Pharmacogenomics 2002
Progressive MSA stage 3 of 3:
progressively align the sequences
following the branch order of the tree:
Order matters
Input order:
THE FAT CAT
THE FAST CAT
THE VERY FAST CAT
THE LAST FAT CAT
Step 1:
THE FA-T CAT
THE FAST CAT
Step 2:
THE ---- FA-T CAT
THE ---- FAST CAT
THE VERY FAST CAT
Step 3:
THE ---- FA-T CAT
THE ---- FAST CAT
THE VERY FAST CAT
THE LAST FA-T CAT
Adapted from C. Notredame, Pharmacogenomics 2002
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest)
sequences
• To change the initial gap choices later on would be
to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust
that those gaps are most believable
Page 324
Additional features of ClustalW improve
its ability to generate accurate MSAs
• Individual weights are assigned to sequences;
very closely related sequences are given less weight,
while distantly related sequences are given more weight
• Scoring matrices are varied dependent on the presence
of conserved or divergent sequences, e.g.:
PAM20    80-100% id
PAM60    60-80% id
PAM120   40-60% id
PAM350   0-40% id
• Residue-specific gap penalties are applied
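The matrix choice listed above amounts to a lookup on percent identity. The sketch below is illustrative only: the function name is hypothetical, and because the slide does not say which matrix applies exactly at 40, 60, or 80% identity, boundary values are assigned to the higher-identity range as an assumption.

```python
def choose_pam(percent_identity):
    """Pick a PAM matrix name from the percent identity of the sequences."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"


for identity in (95, 70, 50, 25):
    print(identity, choose_pam(identity))
```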
MEGA version 4:
Molecular Evolutionary Genetics Analysis
Download from www.megasoftware.net
Two ways to create a multiple sequence alignment
1. Open the Alignment Explorer, paste in a FASTA MSA
2. Select a DNA query, do a BLAST search
Once your sequences are in MEGA, you can run ClustalW
then make trees and do phylogenetic analyses
[1] Open the Alignment Explorer
[2] Select “Create a new alignment”
[3] Click yes (for DNA) or no (for protein)
[4] Find, select, and copy a multiple sequence alignment
(e.g. from Pfam; choose FASTA with dashes for gaps)
[5] Paste it into MEGA
[6] If needed, run ClustalW to align the sequences
[7] Save (Ctrl+S) as .mas, then exit and save as .meg