Bioinformatics in Drug Design
Download
Report
Transcript Bioinformatics in Drug Design
C
E
N
T
R
E
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Introduction to bioinformatics
2008
Lecture 7
Substitution Matrices
And
Multiple Sequence Alignment (I)
[1] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Sequence Analysis
Finding relationships between genes and gene products of different species,
including those at large evolutionary distances
Many evolutionary biologists do not like
talking about prokaryotes and eukaryotes
anymore
[2] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Archaea
Domain Archaea is mostly composed of cells that live in
extreme environments. While they are able to live
elsewhere, they are usually not found there because
outside of extreme environments they are competitively
excluded by other organisms.
Species of the domain Archaea are
•not inhibited by antibiotics,
•lack peptidoglycan in their cell wall (unlike bacteria,
which have this sugar/polypeptide compound),
•and can have branched carbon chains in their
membrane lipids of the phospholipid bilayer.
[3] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Archaea (Cnt.)
• It is believed that Archaea are very similar to prokaryotes (e.g.
bacteria) that inhabited the earth billions of years ago. It is also
believed that eukaryotes evolved from Archaea, because they
share many mRNA sequences, have similar RNA polymerases,
and have introns.
• Therefore, it is generally assumed that the domains Archaea and
Bacteria branched from each other very early in history, after
which membrane infolding* produced eukaryotic cells in the
archaean branch approximately 1.7 billion years ago.
There are three main groups of Archaea:
1. extreme halophiles (salt),
2. methanogens (methane producing anaerobes),
3. and hyperthermophiles (e.g. living at temperatures >100º C!).
*Membrane infolding is believed to have led to the nucleus of eukaryotic cells, which is a
membrane-enveloped cell organelle that holds the cellular DNA. Prokaryotic cells are more
C E N T R E F O R I N T E G R A T I V E
primitive and do not have a nucleus.
[4] Substitution matrices – Sequence analysis 2006
B I O I N F O RM A T I C S V U
Example of nucleotide sequence database entry
for Genbank
LOCUS
DEFINITION
ACCESSION
KEYWORDS
SOURCE
ORGANISM
REFER ENCE
AUTHORS
TITLE
JOURNAL
MEDLINE
COMMENT
FEATURES
source
mRNA
gene
CDS
BASE COUNT
ORIGIN
DRODPPC
4001 bp
INV
15-MAR-1990
D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
M30116
.
D.melanogaster, cDNA to mRNA.
Drosophila melanogaster
Eurkaryote; mitochondrial eukaryotes; Metazoa; Arthropoda;
Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
Ephydroidea; Drosophilidae; Drosophilia.
1 (bases 1 to 4001)
Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
A transcript from a Drosophila pattern gene predicts a protein
homologous to the transforming growth factor-beta family
Nature 325, 81-84 (1987)
87090408
The initiation codon could be at either 1188-1190 or 1587-1589
Location/Qualifiers
1..4001
/organism=“Drosophila melanogaster”
/db_xref=“taxon:7227”
<1..3918
/gene=“dpp”
/note=“decapentaplegic protein mRNA”
/db_xref=“FlyBase:FBgn0000490”
1..4001
/note=“decapentaplegic”
/gene=“dpp”
/allele=“”
/db_xref=“FlyBase:FBgn0000490”
1188..2954
/gene=“dpp”
/note=“decapentaplegic protein (1188 could be 1587)”
/codon_start=1
/db_xref=“FlyBase:FBgn0000490”
/db_xref=“PID:g157292”
/translation=“MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
……………………
LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
NDQSTBVVLKNYQEMTBBGCGCR”
1170 a
1078 c
956 g
797 t
1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
………………………….
3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
[5] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Example of protein sequence database entry for
SWISS-PROT (now UNIPROT)
ID
AC
DT
DT
DT
DE
GN
OS
OC
RN
RP
RM
RA
RL
RN
RP
RM
RA
RL
CC
CC
CC
CC
CC
DR
DR
DR
DR
DR
KW
FT
FT
FT
FT
FT
FT
FT
FT
FT
FT
FT
SQ
DECA_DROME
STANDARD;
PRT;
588AA.
P07713;
01-APR-1988 (REL. 07, CREATED)
01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
DPP.
DROSOPHILA MELANOGASTER (FRUIT FLY).
EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
[1]
SEQUENCE FROM N.A.
87090408
PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
NATURE 325:81-84 (1987)
[2]
CHARACTERIZATION, AND SEQUENCE OF 457-476.
90258853
PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
MOL. CELL. BIOL. 10:2669-2677(1990).
-!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
EMBRYONIC DOORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
-!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
-!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
EMBL; M30116; DMDPPC.
PIR; A26158; A26158.
HSSP; P08112; 1TFG.
FLYBASE; FBGN0000490; DPP.
PROSITE; PS00250; TGF_BETA.
GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
SIGNAL
1
?
POTENTIAL.
PROPEP
?
456
CHAIN
457
588
DECAPENTAPLEGIC PROTEIN.
DISULFID
487
553
BY SIMILARITY.
DISULFID
516
585
BY SIMILARITY.
DISULFID
520
587
BY SIMILARITY.
DISULFID
552
552
INTERCHAIN (BY SIMILARITY).
CARBOHYD
120
120
POTENTIAL.
CARBOHYD
342
342
POTENTIAL.
CARBOHYD
377
377
POTENTIAL.
CARBOHYD
529
529
POTENTIAL.
SEQUENCE 588 AA; 65850MW; 1768420 CN;
MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
[6] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Definition of substitution
matrix
• Two-dimensional matrix with score
values describing the probability of one
amino acid or nucleotide being replaced
by another during sequence evolution.
[7] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Scoring matrices for
nucleotide sequences
• Can be simple:
• e.g. positive value
for match and zero
for mismatch.
• frequencies of
mutation are equal
for all bases.
[8] Substitution matrices – Sequence analysis 2006
• Can be more
complicated:
• taking into account
transitions and
transversions
(e.g. Kimura model)
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Scoring matrices for
nucleotide sequences
• Simple model
A
C
T
G
A
1
0
0
0
C
0
1
0
0
T
0
0
1
0
G
0
0
0
1
• Kimura
purines
[9] Substitution matrices – Sequence analysis 2006
pyrimidines
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
What is better to align?
DNA or protein sequences?
1.
Many mutations within DNA are synonymous
divergence overestimation
[10] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
2.
Evolutionary relationships can be more accurately expressed
using a 20×20 amino acid exchange table
3.
DNA sequences contain non-coding regions, which should
be avoided in homology searches.
4.
Still an issue when translating
into (six) protein sequences
through a codon table.
5.
Searching at protein level: frameshifts can occur, leading to
stretches of incorrect amino acids and possibly elongation.
However, frameshifts
normally result
in stretches of highly
unlikely amino acids.
[11] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
So?
Rule of thumb:
if ORF exists,
then align at protein level
[12] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Scoring matrices for
amino acid sequences
• Are complicated, scoring has to reflect:
• Physio-chemical properties of aa’s
• Likelihood of residues being substituted among truly
homologous sequences
• Certain aa with similar properties can be more
easily substituted: preserve structure/function
• “Disruptive” substitution is less likely to be selected
in evolution (e.g. rendering non-functional proteins)
[13] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Scoring matrices for
amino acid sequences
Main
chain
[14] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Example: Cysteines are very common
in metal binding motifs
cysteine
Zn
histidine
[15] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Now let’s think about alignments
• Lets consider a simple alignment: ungapped global alignment of two
(protein) sequences, x and y, of length n.
• In scoring this alignment, we would like to assess whether these two
sequences have a common ancestor, or whether they are aligned by
chance.
P r(x, y | M )
P r(x, y | R)
sequences have common ancestor
sequences are aligned by chance
• We therefore want our amino acid substitution table (matrix) to score
an alignment by estimating this ratio (= improvement over random).
• In brief, each substitution score is the log-odds probability that
amino acid a could change (mutate) into amino acid b through
evolution, based on the constraints of our evolutionary model.
[16] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Target and background probabilities
• Background probability
If qa is the frequency of amino acid a in one sequence and qb is the
frequency of amino acid b in another sequence, then the probability
of the alignment being random is given by:
Pr( x, y | R) q xi q yi
i
i
A A R S
V V K S
• Target probability
If pab is now the probability that amino acids a and b have derived
from a common ancestor, then the probability that the alignment is
due to common ancestry is is given by:
Pr( x, y | M ) p xi yi
i
[17] Substitution matrices – Sequence analysis 2006
A A R S
V V K S
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Source of target and background
probabilities: high confidence alignments
• Target frequencies
• The “evolutionary true” alignments allow us to get biologically
permissible amino acid mutations and derive the frequencies of
observed pairs.
These are the TARGET frequencies (20x20 combinations).
• Background frequencies
• The BACKGROUND frequencies are simply the frequency at which
each amino acid type is observed in these “trusted” data sets (20
values).
[18] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Log-odds
• Substitution matrices apply logarithmic conversions
to describe the probability of amino acid substitutions
• The converted values are the so-called
log-odds scores
• So they are simply the logarithmic ratios of the
observed mutation frequency divided by the
probability of substitution expected by random
chance (target – background)
[19] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Formulas
• Odds-ratio of two probabilities
pxi yi
pxi yi
Pr(x, y | M )
i
i
Pr(x, y | R) qxi q yi qxi q yi
i
i
i
• Log-odds probability of an alignment being
random is therefore given by
pxi yi
Pr(x, y | M )
log
log
qx q y
Pr(x, y | R)
i i
log x log x
i i
[20] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Logarithmic functions
Logarithms to various bases:
red is to base e, green is to
base 10, and purple is to base
1.7. Each tick on the axes is
one unit. Logarithms of all
bases pass through the point
(1, 0), because any number
raised to the power 0 is 1, and
through the points (b, 1) for
base b, because any number
raised to the power 1 is itself.
The curves approach the y axis
but do not reach it, due to the
singularity of a logarithm at x
= 0.
http://en.wikipedia.org/wiki/Logarithm
[21] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
So… for a given substitution matrix:
• a positive score
means that the frequency of amino acid substitutions
found in the high confidence alignments is greater than
would have occurred by random chance
• a zero score
… that the freq. is equal to that expected by chance
• a negative score
… that the freq. is less than that expected by chance
[22] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Alignment score
• The alignment score S is given by the sum of all
amino acid pair substitution scores:
Pr(x, y | M )
S sxi , yi log
Pr(x, y | R)
i
• Where the substitution score for any amino acid pair
[a,b] is given by:
pab
sa, b log
qa qb
[23] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Alignment score
• The total score of an alignment:
EAAS
VF-T
• would be:
S s( E,V ) s( A, F ) (1) s(S , T )
[24] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Empirical matrices
• Are based on surveys of actual amino
acid substitutions among related proteins
• Most widely used: PAM and BLOSUM
[25] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The PAM series
• The first systematic method to derive amino acid
substitution matrices was done by Margaret Dayhoff et al.
(1978) Atlas of Protein Structure.
• These widely used substitution matrices are frequently
called Dayhoff, MDM (Mutation Data Matrix), or PAM
(Point Accepted Mutation) matrices.
• Key idea: trusted alignments of closely related
sequences provide information about biologically
permissible mutations.
[26] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The PAM design
• Step 1. Dayhoff used 71 protein families, made hypothetical
phylogenetic trees and recorded the number of observed
substitutions (along each branch of the tree) in a 20x20 target matrix.
[27] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
• Step 2. The target matrix was then converted to frequencies by
dividing each cell (a,b) by the sum of all other substitutions of a.
Aab
Pr(b | a)
Aac
c
• Step 3. The target matrix was normalized so that the expected
number of substitutions covered 1% of the protein (PAM-1).
Pr(b | a, t 1)
• Step 4. Determine the final substitution matrix.
pab
P(b | a, t )
s(a, b | t ) log
log
qa qb
qb
[28] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
PAM units
• One PAM unit is defined as 1% of the amino acids
positions that have been changed
• E.g. to construct the PAM1 substitution table, a group of
closely related sequences with mutation frequencies
corresponding to one PAM unit is chosen. One PAM
corresponds to about 1 million years of evolutionary time.
[29] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
But there is a whole series of
matrices: PAM10 … PAM250
• These matrices are extrapolated from PAM1 matrix
(by matrix multiplication)
A
R
N
A
2
R
-2
6
0
0
2
0 -1
2
N
D
C
Q
E
G
D
0
C
Q
I
L
K
M
P
S
1
2 -5
4
3 -5
2
W
Y
V
A
0
1 -2
3
5
-2
6
0
0
2
0 -1
2
G
6
5
2
3
M
-1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F
-4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
0 -5
1
0 -2
0 -1 -1
X
6
0 -2 -3
5
9
P
1
0 -1 -1 -3
0 -2 -3 -1 -2 -5
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
3
1 -1
0
0 -2 -1
0
0 -1
0 -1 -2
0
1
0 -2
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
6
3
W
-6
0 -6 -2 -5
17
Y
V
-3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3
0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0
0
-6
10
-2
4
N
2
R
E
1 -3 -1
1 -3
-2 -3 -3 -4 -6 -2 -3 -4 -2
-1
R
A
Q
4
0
2
-1 -2 -2 -2 -2 -2 -2 -3 -2
L
T
T
C
K
1
F
N
1
2
H
D
1
1 -3
-1
I
G
4
0 -1
H
E
4
-2 -4 -4 -5
D
0
C
Q
I
L
K
M
P
S
1
2 -5
4
3 -5
2
W
Y
V
A
0
1 -2
3
5
-2
6
0
0
2
0 -1
2
G
6
5
2
3
M
-1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F
-4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
0 -5
1
0 -2
0 -1 -1
X
6
0 -2 -3
5
9
P
1
0 -1 -1 -3
0 -2 -3 -1 -2 -5
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
3
1 -1
0
0 -2 -1
0
0 -1
0 -1 -2
0
1
0 -2
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
6
3
W
-6
0 -6 -2 -5
17
Y
V
-3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3
0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0
0
-6
10
-2
4
N
2
R
E
1 -3 -1
1 -3
-2 -3 -3 -4 -6 -2 -3 -4 -2
-1
R
A
Q
4
0
2
-1 -2 -2 -2 -2 -2 -2 -3 -2
L
T
T
C
K
1
F
N
1
2
H
D
1
1 -3
-1
I
G
4
0 -1
H
E
4
-2 -4 -4 -5
D
0
C
Q
I
L
K
M
P
S
1
2 -5
4
3 -5
2
W
Y
V
A
0
1 -2
3
5
-2
6
0
0
2
0 -1
2
G
6
5
2
3
M
-1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F
-4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
0 -5
1
0 -2
0 -1 -1
6
0 -2 -3
9
P
1
0 -1 -1 -3
0 -2 -3 -1 -2 -5
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
3
1 -1
0
0 -2 -1
0
0 -1
0 -1 -2
0
1
0 -2
=
5
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
6
3
W
-6
0 -6 -2 -5
17
Y
V
-3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3
0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0
0
-6
10
-2
4
N
2
R
E
1 -3 -1
1 -3
-2 -3 -3 -4 -6 -2 -3 -4 -2
-1
R
A
Q
4
0
2
-1 -2 -2 -2 -2 -2 -2 -3 -2
L
T
T
C
K
1
F
N
1
2
H
D
1
1 -3
-1
I
G
4
0 -1
H
E
4
-2 -4 -4 -5
D
0
C
Q
1
1
2 -5
4
1
3 -5
2
1 -3
-1
I
2
G
H
I
L
K
M
P
S
1 -3 -1
0
1 -3
1 -2
3
W
Y
V
5
6
5
-2 -3 -3 -4 -6 -2 -3 -4 -2
2
-1
3
M
-1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F
-4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
0 -5
1
0 -2
0 -1 -1
6
0 -2 -3
5
9
P
1
0 -1 -1 -3
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
3
1 -1
0
0 -2 -1
0
0 -1
0 -1 -2
0
1
T
T
4
0
2
-1 -2 -2 -2 -2 -2 -2 -3 -2
L
K
1
F
4
0 -1
H
E
4
-2 -4 -4 -5
0 -2 -3 -1 -2 -5
0 -2
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
6
3
W
-6
0 -6 -2 -5
17
Y
V
-3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3
0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0
0
-6
10
-2
4
Multiply Matrices N times to make PAM ‘N’; then take the Log
• So: a PAM is a relative measure of evolutionary distance
• 1 PAM = 1 accepted mutation per 100 amino acids
• 250 PAM = 250 mutations per 100 amino acids, so 2.5
accepted mutations per amino acid
[30] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
PAM numbers vs. observed am.ac.
mutational rates
PAM
Number
Observed
Mutation Rate (%)
Sequence
Identity (%)
0
0
100
1
1
99
30
25
75
80
50
50
110
40
60
200
75
25
250
80
20
Note Think about intermediate “substitution” steps …
[31] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The PAM250 matrix
A
2
R -2
6
N
0
0
2
D
0 -1
2
4
C -2 -4 -4 -5 12
Q
0
1
1
2 -5
4
E
0 -1
1
3 -5
2
4
G
1 -3
0
1 -3 -1
0
2
1 -3
1 -2
H -1
2
3
5
6
I -1 -2 -2 -2 -2 -2 -2 -3 -2
5
L -2 -3 -3 -4 -6 -2 -3 -4 -2
2
1
0 -5
1
0 -2
W- R exchange is too large
(due to paucity of data)
6
K -1
3
M -1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F -4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
5
9
P
1
0 -1 -1 -3
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
2
T
1 -1
0
0 -2 -1
0
0 -1
0 -1 -3
0
1
W -6
0 -1 -1
0 -2 -3
0 -2 -3 -1 -2 -5
0 -2
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
Y -3 -4 -2 -4
0 -4 -4 -5
0 -1 -1 -4 -2
0 -6 -2 -5 17
7 -5 -3 -3
B
0 -1
2
3 -4
1
2
0
1 -2 -3
1 -2 -5 -1
0
0 -5 -3 -2
2
Z
0
0
1
3 -5
3
3 -1
2 -2 -3
0 -2 -5
0 -1 -6 -4 -2
2
A
R
N
D
Q
E
H
I
L
K
2 -1 -1 -1
0 10
0 -2 -2 -2 -2 -2 -2 -1 -2
G
2 -2
3
V
C
4
6
M
F
0
P
[32] Substitution matrices – Sequence analysis 2006
S
0 -6 -2
T
W
4
Y
V
3
B
Z
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
PAM model
• The scores derived through the PAM model are an accurate
description of the information content (or the relative entropy)
of an alignment (Altschul, 1991).
• PAM1 corresponds to about 1 million years of evolution.
• PAM120 has the largest information content of the
PAM matrix series: “best” for general alignment.
• PAM250 is the traditionally most popular matrix:
“best” for detecting distant sequence similarity.
[33] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Summary Dayhoff’s PAM-matrices
• Derived from global alignments of closely related sequences.
• Matrices for greater evolutionary distances are extrapolated
from those for smaller ones.
• The number with the matrix (PAM40, PAM100) refers to the
evolutionary distance; greater numbers are greater distances.
• Attempts to extend Dayhoff's methodology or re-apply her analysis
using databases with more examples:
•
Jones, Thornton and coworkers used the same methodology as
Dayhoff but with modern databases (CABIOS 8:275)
•
Gonnett and coworkers (Science 256:1443) used a slightly different
(but theoretically equivalent) methodology
[34] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The BLOSUM series
• BLOSUM stands for: BLOcks SUbstitution Matrices
• Created by Steve Henikoff and Jorja Henikoff (PNAS 89:10915).
• Derived from local, un-gapped alignments of distantly related
sequences.
• All matrices are directly calculated; no extrapolations are used.
• Again: compare observed freqs of each pair to expected freqs
Then: Log-odds matrix.
[35] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The Blocks database
• The Blocks Database contains multiple alignments of conserved
regions in protein families.
• Blocks are multiply aligned un-gapped segments corresponding to the
most highly conserved regions of proteins.
• The blocks for the BLOCKS database are made automatically by
looking for the most highly conserved regions in groups of proteins
represented in the PROSITE database. These blocks are then
calibrated against the SWISS-PROT database to obtain a measure of
the random distribution of matches. It is these calibrated blocks that
make up the BLOCKS database.
• The database can be searched to classify protein and nucleotide
sequences.
[36] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The Blocks database
Gapless
alignment
blocks
[37] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The BLOSUM series
• BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.
• The number after the matrix (BLOSUM62) refers to the minimum
percent identity of the blocks (in the BLOCKS database) used to
construct the matrix (all blocks have >=62% sequence identity);
• No extrapolations are made in going to higher evolutionary distances
• High number - closely related sequences
Low number - distant sequences
• BLOSUM62 is the most popular: best for general alignment.
[38] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The log-odds matrix for BLOSUM62
[39] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
PAM versus BLOSUM
• Based on an explicit
evolutionary model
• Based on empirical
frequencies
• Derived from small, closely
related proteins with ~15%
divergence
• Uses much larger, more
diverse set of protein
sequences (30-90% ID)
• Higher PAM numbers to
detect more remote
sequence similarities
• Lower BLOSUM numbers to
detect more remote
sequence similarities
• Errors in PAM 1 are scaled
250X in PAM 250
• Errors in BLOSUM arise
from errors in alignment
[40] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Comparing exchange matrices
• To compare amino acid exchange matrices, the "Entropy" value can
be used. This is a relative entropy value (H) which describes the
amount of information available per aligned residue pair.
H sij log2 (sij / pi p j )
[41] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Evolution and Matrix “landscape”
• Recent evolution
identity matrix
[42] Substitution matrices – Sequence analysis 2006
• Ancient evolution
convergence to
random model
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
A note on reliability
• All these matrices are designed using
standard evolutionary models.
• Circular problem
alignment
matrix
• It is important to understand that evolution is
not the same for all proteins, not even for the
same regions of proteins.
[43] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
…
• No single matrix performs best on all sequences.
Some are better for sequences with few gaps,
and others are better for sequences with fewer
identical amino acids.
• Therefore, when aligning sequences, applying a
general model to all cases is not ideal.
Rather, re-adjustment can be used to make the
general model better fit the given data.
[44] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Pair-wise alignment quality
versus sequence identity
• Vogt et al., JMB 249, 816-831,1995
Pairwise alignments
were made of
sequence pairs for
which the
‘true’alignment was
known from 3Dstructural information,
so the correctness of
the alignments could
be checked
Twilight
zoneanalysis 2006
[45] Substitution matrices
– Sequence
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Take-home messages - 1
• If ORF exists, then align at protein level.
• Amino acid substitution matrices reflect the log-odds ratio
between the evolutionary and random model and can
therefore help in determining homology via the alignment
score.
• The evolutionary and random models depend on
generalized data sets used to derive them. This not an
ideal solution.
[46] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Take-home messages - 2
• Apart from the PAM and BLOSUM series, a great number
of further matrices have been developed.
• Matrices have been made based on DNA, protein
structure, information content, etc.
• For local alignment, BLOSUM62 is often superior; for
distant (global) alignments, BLOSUM50, GONNET, or
(still) PAM250 work well.
• Remember that gap penalties are always a problem:
unlike the matrices themselves, there is no formal way to
calculate their values -- you can follow recommended
settings, but these are based on trial and error and not on
a formal framework.
[47] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
C
E
N
T
R
E
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Introduction to
bioinformatics 2008
Multiple Sequence Alignment
(I)
[48] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Biological definitions for
related sequences
Homologues are similar sequences in two different
organisms that have been derived from a common ancestor
sequence. Homologues can be described as either
orthologues or paralogues.
Orthologues are similar sequences in two different
organisms that have arisen due to a speciation event.
Orthologs typically retain identical or similar functionality
throughout evolution.
Paralogues are similar sequences within a single organism
that have arisen due to a gene duplication event.
Xenologues are similar sequences that do not share the
same evolutionary origin, but rather have arisen out of
horizontal transfer events through symbiosis, viruses, etc.
Vertical transfer is caused by (normal)
heredity
[49] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
So this means …
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
[50] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Information content of a
multiple alignment
• Sequences can be conserved across species and
perform similar or identical functions
– hold information about which regions have high
mutation rates over evolutionary time and which are
evolutionarily conserved
– identification of regions or domains that are critical to
functionality
• Sequences can be mutated or rearranged to
perform an altered function
– which changes in the sequences have caused a change in
the functionality
[51] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Multiple alignment idea
• Take three or more related sequences
and align them so that the greatest
number of similar characters are aligned
in the same column of the alignment.
Ideally, the sequences are orthologous,
but often include paralogues.
[52] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Scoring a multiple alignment
• You can score a multiple alignment by
taking all the pairs of aligned sequences
and add up the pairwise scores:
Sa,b =
s (a , b ) -
l
i
j
k
Nk gp(k )
•This is referred to as the Sumof-Pairs score
[53] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Multiple sequence
alignment
Why?
• It is the most important means to assess
relatedness of a set of sequences
• Gain information about the
structure/function of a query sequence
(conservation patterns)
• Construct a phylogenetic tree
• Putting together a set of sequenced
fragments (Fragment assembly)
• Many bioinformatics methods depend
on it (e.g. secondary/tertiary
structure prediction)
[54] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Information content of a
multiple alignment
[55] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
What to ask yourself
• How do we get a multiple alignment?
(three or more sequences)
• What is our aim?
–Do we go for max accuracy?
–Least computational time?
–Or the best compromise?
• What do we want to achieve each
time?
[56] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Multiple alignment methods
Multi-dimensional dynamic programming
> extension of pairwise sequence alignment.
Progressive alignment
> incorporates phylogenetic information to guide the
alignment process
Iterative alignment
> correct for problems with progressive alignment by
repeatedly realigning subgroups of sequence
[57] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Exhaustive & Heuristic
algorithms
• Exhaustive approaches
– Examine all possible aligned positions simultaneously
– Look for the optimal solution by (multi-dimensional) DP
– Very (very) slow
• Heuristic approaches
• Strategy to find a near-optimal solution
(by using rules of thumb)
• Shortcuts are taken by reducing the search space according to
certain criteria
• Much faster
[58] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Simultaneous multiple
alignment
Multi-dimensional dynamic programming
Combinatorial explosion
DP using two sequences of length n
n2 comparisons
Number of comparisons increases
exponentially
i.e. nN where n is the length of the
sequences, and N is the number of
sequences
Impractical even for small numbers of short
sequences
[59] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Sequence-sequence
alignment by Dynamic
Programming sequence
sequence
[60] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Multi-dimensional dynamic
programming (Murata et al., 1985)
Sequence 2
Sequence 1
[61] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The MSA approach
Lipman et al. 1989
• Key idea: restrict the computational
costs by determining a minimal region
within the n-dimensional matrix that
contains the optimal path
[62] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The MSA method in detail
Let’s consider 3 sequences
Calculate all pair-wise alignment
scores by Dynamic programming
3.
Use the scores to predict a tree
4.
Produce a heuristic multiple align.
based on the tree (quick & dirty)
5.
Calculate maximum cost for each
sequence pair from multiple
alignment (upper bound) &
determine paths with < costs.
6.
Determine spatial positions that
must be calculated to obtain the
optimal alignment (intersecting
areas or ‘hypersausage’ around
matrix diagonal)
7.
Perform multi-dimensional DP
Note Redundancy caused by highly
correlated sequences is avoided
1.
2.
1.
2.
[63] Substitution matrices – Sequence analysis 2006
3.1
3
4.2
5.
1
.
.
3
1
2
1
3
1
3
2
2
.
.
.
1
3
6.
.
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
2
3
The DCA (Divide-and-Conquer) approach
Stoye et al. 1997
•
Each sequence is cut in two behind a
suitable cut position somewhere close
to its midpoint.
•
This way, the problem of aligning one
family of (long) sequences is divided
into the two problems of aligning two
families of (shorter) sequences.
•
This procedure is re-iterated until the
sequences are sufficiently short.
•
Optimal alignment by MSA.
•
Finally, the resulting short alignments
are concatenated.
[64] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
So in effect …
[65] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Multiple alignment methods
Multi-dimensional dynamic programming
> extension of pairwise sequence alignment.
Progressive alignment
> incorporates phylogenetic information to guide the
alignment process
Iterative alignment
> correct for problems with progressive alignment by
repeatedly realigning subgroups of sequence
[66] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
The progressive alignment
method
Underlying idea: usually we are interested in aligning
families of sequences that are evolutionary related.
Principle: construct an approximate phylogenetic tree
for the sequences to be aligned and than to build up the
alignment by progressively adding sequences in the
order specified by the tree.
But before going into details, some notices of multiple
alignment profiles …
[67] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Making a guide tree
1
2
1
Score 1-2
Score 1-3
3
4
5
Score 4-5
Similarity criterion
Scores
5×
5
Pairwise
alignments
(allagainstall)
Similarity
matrix
Guide tree
[68] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Progressive multiple alignment
1
2
1
Score 1-2
Score 1-3
3
4
5
Score 4-5
Scores
5×
5
Scores to distances
Guide
tree
[69] Substitution matrices – Sequence analysis 2006
Similarity
matrix
Iteration possibilities
Multiple alignment
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
General progressive multiple
alignment technique (follow generated tree)Align
these
two
d
1
3
1
3
2
5
These two
are aligned
1
3
2
5
root
1
3
2
5
[70] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
PRALINE progressive strategy
d
1
3
1
3
2
1
3
2
5
4
1
3
2
5
4
N T R E F O R I N(sequence-sequence,
T E G R A T I V E
At each step, Praline checks which of the pair-wiseC Ealignments
B I O I N F O RM A T I C S V U
sequence-profile,
profile-profile)
[71] Substitution
matrices – Sequence
analysis 2006has the highest score – this one gets selected
Progressive alignment strategy
All individual pairwise
alignment and construction
of distance matrix
A
B
C
D
E
A
A
—
B
C
D
B
11
—
C
20
30
D
27
36
9
—
E
30
33
20
27
E
—
—
Calculating a guide tree;
C & D the closest pair;
A & B the next closest pair
Figure adapted from Xiong, J.
“Essential
Bioinformatics”
[72] Substitution
matrices
– Sequence analysis 2006
A
C
B
D
C
D
A
B
E
Aligning C/D and A/B
separately Cusing
dynamic
E N T R E F O R I N T E G R A T
programming B I O I N F O R M A T
I V E
I C S V U
But how can we align blocks of
sequences ?
A
B
C
D
E
C
D
?
A
B
• The dynamic programming algorithm performs well for
pairwise alignment (two axes).
• So we should try to treat the blocks as a “single”
sequence …
[73] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
How to represent a block of
sequences ?
• Historically: consensus sequence
single sequence that best represents the
amino acids observed at each alignment
position.
• Modern methods: alignment profile
representation that retains the information
about frequencies of amino acids
observed at each alignment position.
[74] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Consensus sequence
Sequence 1
F A T N M G T S D P P T H T R L R K L V S Q
Sequence 2
F V T N M N N S D G P T H T K L R K L V S T
Consensus
F * T N M * * S D * P T H T * L R K L V S *
• Problem: loss of information
• For larger blocks of sequences it
“punishes” more distant members
[75] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Alignment profiles
• Advantage: full representation of the
sequence alignment (more information
retained)
• Not only used in alignment methods, but
also in sequence-database searching (to
detect distant homologues)
• Also called PSSM (Position-specific scoring
matrix)
[76] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Multiple alignment profiles
Core region
Gapped region
Core region
frequencies
i
A
C
D
W
Y
-
fA..
fA..
fA..
fC..
fC..
fC..
fD..
fD..
fD..
fW..
fW..
fW..
fY..
fY..
fY..
Gapo, gapx Gapo, gapx Gapo, gapx
Position-dependent gap penalties
[77] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Profile building
Example: each aa is represented as a frequency and gap
penalties as weights.
i
A 0.5
C 0
D 0
W 0
Y 0.5
0.3
0.1
0
0.3
0.3
0
0.5
0.2
0.1
0.2
Gap
penalties 1.0
0.5
1.0
Position dependent gap penalties
C E N T R E F O R I N T E G R A T I V E
[78] Substitution matrices – Sequence analysis 2006
B I O I N F O RM A T I C S V U
Profile-sequence alignment
sequence
ACD……VWY
[79] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Sequence to profile alignment
A
A
V
V
L
0.4 A
0.2 L
0.4 V
Score of amino acid L in a sequence that is aligned
against this profile position:
Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 *
s(L, V)
[80] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Profile-profile alignment
profile
A
C
D
.
.
Y
profile
ACD……VWY
[81] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Profile to profile alignment
0.4 A
0.75 G
0.2 L
0.25 S
0.4 V
Match score of these two alignment columns using the a.a frequencies
at the corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) +
+ 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is value in amino acid exchange matrix (e.g. PAM250, Blosum62)
for amino acid pair (x,y)
[82] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
So, for scoring profiles …
Think of sequence-sequence alignment.
Same principles but more information for each position.
Reminder:
The sequence pair alignment score S comes from the
sum of the positional scores M(aai,aaj) (i.e. the
substitution matrix values at each alignment position
minus penalties if applicable)
Profile alignment scores are exactly the same, but the
positional scores are more complex
[83] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
General function for profile-profile
scoring
Profile 1
A
C
D
.
.
Y
Profile 2
A
C
D
.
.
Y
At each position (column) we have different residue
frequencies for each amino acid (rows)
SO:
Instead of saying S=M(aa1, aa2) (one residue pair)
For comparing two profile positions we take:
20 20
S faai faaj M(aai , aaj )
i
j
[84] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U
Progressive alignment strategy
1.
2.
3.
4.
Perform pair-wise alignments of all of the sequences (all against
all);
Use the alignment scores to make a similarity (or distance) matrix
Use that matrix to produce a guide tree;
Align the sequences successively, guided by the order and
relationships indicated by the tree.
Methods:
Biopat
(Hogeweg and Hesper 1984 -- first integrated method
ever)
MULTAL
(Taylor 1987)
DIALIGN
(1&2, Morgenstern 1996)
PRRP
(Gotoh 1996)
ClustalW
(Thompson et al 1994)
PRALINE
(Heringa 1999)
T
Coffee (Notredame 2000)
POA
(Lee 2002)
MUSCLE
(Edgar 2004)
PROBSCONS
(Do, 2005)
[85] Substitution matrices – Sequence analysis 2006
C E N T R E F O R I N T E G R A T I V E
B I O I N F O RM A T I C S V U