Outline: pairwise alignment

Download Report

Transcript Outline: pairwise alignment

Introduction to Bioinformatics
Pairwise sequence alignment
Xiaolong Wang
Ocean University of China
Outline: pairwise alignment
1. Overview and examples
2. Definitions: homologs, paralogs, orthologs
3. Alignment algorithms:
I.
Global alignment: Needleman-Wunsch
II. Local alignment: Smith-Waterman
4. Assigning scores to aligned amino acids:
Dayhoff’s PAM matrices
5. Statistical significance of pairwise alignments
beta globin (NP_000509)
alpha globin (NP_000508 OR NP_000549)
# Program: needle
Global alignment
alpha globin
beta globin
#Program: water
Local alignment
alpha globin
beta globin
#Program: water Local alignment
alpha globin
beta globin
alpha globin
beta globin
# Program: needle Global alignment
Standard Smith-Waterman (SW)
search gives only the best alignment.
Looking for suboptimal
alignments is especially important
when comparing multimodule
proteins. The LAlign program was
used to find the three best local
alignments of the human coagulation
factor IX and factor XII proteins.
The second and third alignments
represent functional modules that
would have been missed by a
standard SW search, which would
have reported only the first (optimal)
alignment.
Outline: pairwise alignment
1. Overview and examples
2. Definitions: homologs, paralogs, orthologs
3. Alignment algorithms:
I.
Global alignment: Needleman-Wunsch
II. Local alignment: Smith-Waterman
4. Assigning scores to aligned amino acids:
Dayhoff’s PAM matrices
5. Statistical significance of pairwise alignments
Pairwise sequence alignment is the most
fundamental operation of bioinformatics
Sequence alignment, including pairwise and multiple
sequence alignment, are performed tens of thousands of
times per day by molecular biologists around the world:
• It is the basis of BLAST database searching
• to assemble sequences, contigs and genomes
• to construct phylogenic tree
• to decide if two proteins (or genes) are related structurally
or functionally
• to identify domains or motifs that are shared between
genes and proteins
• in the analysis of the structure and function of genes and
genomes
globins: a- b-
myoglobin
Early
example
of
sequence
alignment:
globins (1961)
H.C. Watson and J.C.
Kendrew,
1961.
The
amino-acid sequence of
sperm whale myoglobin.
Comparison Between the
Amino-Acid Sequences
of
Sperm
Whale
Myoglobin and of Human
Hemoglobin.”
Nature
190:670-672,
Why should we study the alignment algorithms?
The important things are always simple,
and simple things are always difficult.
• Researchers should have a basic understanding
of how the programs work so that appropriate
tools are chosen, and parameters can be
intelligently selected.
• In addition, they should be aware of potential
artifacts and know how to avoid them.
• Above all, it is important to apply the same
powers of observation and critical evaluation that
are used with any experimental method.
Online tools for pairwise sequence alignment
1. BLAST 2 Sequence
>SWISS-PROT P07146
Mouse trypsin
IVGGYNCEENSVPYQVSL
>SWISS-PROT P00765
Crayfish trypsin
IVGGTDAVLGEFPYQLSFQ
Pairwise alignment: BLAST 2 sequences
• Go to http://www.ncbi.nlm.nih.gov/BLAST
• Choose Align (BLAST 2 sequences)
• In the program,
[1] choose blastp or blastn
[2] paste in your accession numbers
(or use FASTA format)
[3] select optional parameters
--3 BLOSUM and 3 PAM matrices
--gap creation and extension penalties
--filtering
--word size
[4] click “align”
2. Lalign - pairwise sequence alignment
3. Emboss pairwise sequence alignment
4. SIM4 – cDNA/gDNA sequence alignment
Example
Spliced alignment. The sim4
program was used to align a
novel human mRNA (RefSeq
NM 015372) to the genomic
sequence of a cosmid from
chromosome 22 (EMBL
Z82248). Three exons were
found (the third one has been
truncated for brevity). The
‘‘>>>’’ symbols indicate splice
sites found at the exon/intron
boundaries.
Pairwise alignment: protein sequences
can be more informative than DNA
• Protein is more informative (20 vs. 4 symbols);
many amino acids share related biophysical properties
• Codons are degenerate: changes in the third position
often do not alter the amino acid that is specified
• Protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein,
and then used in pairwise alignments
An Example
3 Letter
Abbreviation
1 Letter
Abbreviation
Alanine
Ala
A, B
Arginine
Arg
R
AGA AGG CGA CGC CGG CGU
Aspartic Acid
Asp
D
GAC GAU
Asparagine
Asn
N
AAC AAU
Cysteine
Cys
C
UGC UGU
Glutamic Acid
Glu
E
GAA GAG
Glutamine
Gln
Q, o
CAA CAG
Glycine
Gly
G
GGA GGC GGG GGU
Histidine
His
H
CAC CAU
Isoleucine
Ile
I, j
AUA AUC AUU
Leucine
Leu
L
UUA UUG CUA CUC CUG CUU
Lysine
Lys
K
AAA AAG
Methionine
Met
M
AUG
Phenylalanine
Phe
F
UUC UUU
Proline
Pro
P
CCA CCC CCG CCU
Serine
Ser
S
AGC AGU UCA UCC UCG UCU
Threonine
Thr
T
ACA ACC ACG ACU
Tryptophan
Trp
W, z
UGG
Tyrosine
Tyr
Y, x
UAC UAU
Valine
Val
V, u
GUA GUC GUG GUU
Amino Acid
Codons
GCA GCC GCG GCU
Start
AUG
Stop
UAA UAG UGA
Pairwise alignment: protein sequences
can be more informative than DNA
• DNA can be translated into six potential proteins
1
2
3
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
4
5
6
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
• Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--Simple Sequence Repeats (SSR)
--Single nucleotide polymorphism (SNP)
--Indels
--when only mtDNA sequences are available.
--highly degraded samples
e.g.: --Neanderthal vs. modern human mtDNA
--forensic samples
--processed food samples
酶切位点分析及理论模拟图
restriction site
donkey
horse(mule)
pig
g/antc
HinfI
44 161
44 278
44 161
gg/cc
HaeIII
73 232
73 232 337
73
ag/ct
Alu I
0
303
189
114
/gatc
MboI
0
0
0
245
t/cga
TaqI
0
0
0
141
t/taa
MseI
149
0
0
67
149
1031
1031
900
900
800
800
700
700
600
600
500
500
0
73 226
400
400
359
300
200
198
100
50
117
marker
cow(cattle)
44
donkey
234
81
44
horse
Hif I
200
198
100
50
117
44
cattle
300
pig
marker
286
159
127
73
donkey
159
105
73
horse
Hae III
153
133
73
73
cattle
pig
Mitochondrial DNA sequence establishes the
authenticity of the remains of Tsar Nicholas Ⅱ
俄国末代沙皇尼古拉二世及其妻子,四个女儿和一个儿子(1910年)。
1. Ivanov PL, et al (1996). “Mitochondrial DNA sequence heteroplasmy in the
Grand Duke of Russia Georgij Romanov establishes the authenticity of the
remains of Tsar Nicholas II”. Nat. Genet. 1996, 12 (4): 417–20.
doi:10.1038/ng0496-417.
2. Coble MD, et al (2009). "Mystery solved: the identification of the two missing
Romanov children using DNA analysis". PLoS ONE 2009 , 4 (3): e4838.
doi:10.1371/journal.pone.0004838.
 Serre D, et al. (2004) No Evidence of Neandertal mtDNA Contribution to Early
Modern Humans. PLoS Biol 2(3): e57. doi:10.1371/journal.pbio.0020057
 Serre D, et al. (2004) No Evidence of Neandertal mtDNA Contribution to Early
Modern Humans. PLoS Biol 2(3): e57. doi:10.1371/journal.pbio.0020057
early modern human
• (A) Under the assumption of a constant effective population size of
10,000 for modern humans, contemporary mtDNAs trace back to
approximately five mtDNA lineages 25,000 years ago. The modern
human fossils represent five additional samples from around the time
of putative admixture (stars). The contemporary and early modern
human (EMH) samples reject a Neandertal contribution of 25% or
more to modern humans about 30,000 years ago (p ≤ 0.05).
Outline: pairwise alignment
1. Overview and examples
2. Definitions: homologs, paralogs, orthologs
3. Alignment algorithms:
I.
Global alignment: Needleman-Wunsch
II. Local alignment: Smith-Waterman
4. Assigning scores to aligned amino acids:
Dayhoff’s PAM matrices
5. Statistical significance of pairwise alignments
Definitions
Pairwise alignment
The process of lining up two sequences to achieve
maximal levels of identity (and conservation, in the
case of amino acid sequences) for the purpose of
assessing the degree of similarity and the possibility
of homology (belongs to a family or a superfamily).
Since biological relatedness is implied at the
superfamily level, we defined a homology family as
a collection of genes or proteins which have descended
from a common ancestor.
Definitions
Homology: Similarity attributed to descent
from a common ancestor.
Similarity: The extent to which (DNA or
amino acid) sequences are related. It is
based upon identity plus conservation.
Identity: The extent to which two sequences
are invariant (identical).
Conservation: Changes at a specific position
of an amino acid or DNA sequence that
preserve the physico-chemical properties of
the original residue.
retinol-binding protein 4
(NP_006735)
b-lactoglobulin
(P02754)
Pairwise alignment of retinol-binding protein 4
and b-lactoglobulin
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| |
.
|. . . | : .||||.:|
:
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 BLG
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | |
|
|
:: | .| . || |:
||
|.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 BLG
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
|| ||.
|
:.|||| | .
.|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 BLG
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. |
|
| :
||
.
| || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 BLG
Pairwise alignment of retinol-binding protein (RBP)
and b-lactoglobulin (BLG)
RBP
BLG
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG
||
||| |
.
|. . . | : .||||.:|
:
1 MK---CLLLALALTCGAQALIVT--QTMKGLDIQKVAGTWYSLAMAASD**
*** *
*
*
**** *
Identity
(bar)
Identity
Somewhat
similar
(one dot)
Very
similar
(two dots)
Conservation
Similarity  Homology?
Gaps
• Positions at which a letter is paired with a null
are called gaps.
• Gap scores are typically negative (gap penalty).
• Since a single mutational event may cause the
insertion or deletion of more than one residue, the
presence of a gap is ascribed more significance
than the length of the gap.
• In BLAST, it is rarely necessary to change gap
penalty values from the default (-11).
Pairwise alignment of retinol-binding protein
and b-lactoglobulin
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| |
.
|. . . | : .||||.:|
:
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | |
|
|
:: | .| . || |:
||
|.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
|| ||.
|
:.|||| | .
.|
94 IPAVFKIDALNENKVL--------VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. |
|
| :
||
.
| || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI------- 178 lactoglobulin
Internal
gap
Terminal
gap
Definitions: two types of homology
Orthologs
Homologous sequences in different species that
arose from a common ancestral gene during
speciation; may or may not be responsible for a
similar function.
Paralogs
Homologous sequences within a single species
that arose by gene duplication.
speciation
speciation
common carp
Orthologs:
zebrafish
rainbow trout
members of a
gene (protein)
family in various
organisms.
teleost
This tree shows
RBP orthologs.
African
clawed
frog
chicken
human
mouse
rat
horse
pig cow rabbit
10 changes
Pairwise alignment of retinol-binding protein
from human (top) and rainbow trout (O. mykiss)
1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
::
|| || ||
.||.||. .| :|||:.|:.| |||.|||||
1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47
.
.
.
.
.
49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
|||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || |
48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97
.
.
.
.
.
99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| |
98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147
.
.
.
.
.
149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
|||:||| | || || |||| :..|:|
.|| : | |:|:
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192
Multiple sequence alignment of
glyceraldehyde 3-phosphate dehydrogenases
fly
human
plant
bacterium
yeast
archaeon
GAKKVIISAP
GAKRVIISAP
GAKKVIISAP
GAKKVVMTGP
GAKKVVITAP
GADKVLISAP
SAD.APM..F
SAD.APM..F
SAD.APM..F
SKDNTPM..F
SS.TAPM..F
PKGDEPVKQL
VCGVNLDAYK
VMGVNHEKYD
VVGVNEHTYQ
VKGANFDKY.
VMGVNEEKYT
VYGVNHDEYD
PDMKVVSNAS
NSLKIISNAS
PNMDIVSNAS
AGQDIVSNAS
SDLKIVSNAS
GE.DVVSNAS
CTTNCLAPLA
CTTNCLAPLA
CTTNCLAPLA
CTTNCLAPLA
CTTNCLAPLA
CTTNSITPVA
fly
human
plant
bacterium
yeast
archaeon
KVINDNFEIV
KVIHDNFGIV
KVVHEEFGIL
KVINDNFGII
KVINDAFGIE
KVLDEEFGIN
EGLMTTVHAT
EGLMTTVHAI
EGLMTTVHAT
EGLMTTVHAT
EGLMTTVHSL
AGQLTTVHAY
TATQKTVDGP
TATQKTVDGP
TATQKTVDGP
TATQKTVDGP
TATQKTVDGP
TGSQNLMDGP
SGKLWRDGRG
SGKLWRDGRG
SMKDWRGGRG
SHKDWRGGRG
SHKDWRGGRT
NGKP.RRRRA
AAQNIIPAST
ALQNIIPAST
ASQNIIPSST
ASQNIIPSST
ASGNIIPSST
AAENIIPTST
fly
human
plant
bacterium
yeast
archaeon
GAAKAVGKVI
GAAKAVGKVI
GAAKAVGKVL
GAAKAVGKVL
GAAKAVGKVL
GAAQAATEVL
PALNGKLTGM
PELNGKLTGM
PELNGKLTGM
PELNGKLTGM
PELQGKLTGM
PELEGKLDGM
AFRVPTPNVS
AFRVPTANVS
AFRVPTSNVS
AFRVPTPNVS
AFRVPTVDVS
AIRVPVPNGS
VVDLTVRLGK
VVDLTCRLEK
VVDLTCRLEK
VVDLTVRLEK
VVDLTVKLNK
ITEFVVDLDD
GASYDEIKAK
PAKYDDIKKV
GASYEDVKAA
AATYEQIKAA
ETTYDEIKKV
DVTESDVNAA
apolipoprotein D
Paralogs:
retinol-binding
protein 4
Complement
component 8
Alpha-1
Microglobulin
/bikunin
prostaglandin
D2 synthase
progestagenassociated
endometrial
protein
Odorant-binding
protein 2A
neutrophil
gelatinaseassociated
lipocalin
Lipocalin 1
10 changes
members of a
gene (protein)
family within a
species
Pairwise alignment of retinol-binding protein 4
and b-lactoglobulin
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| |
.
|. . . | : .||||.:|
:
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 BLG
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | |
|
|
:: | .| . || |:
||
|.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 BLG
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
|| ||.
|
:.|||| | .
.|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 BLG
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. |
|
| :
||
.
| || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 BLG
Multiple sequence alignment of
human lipocalin paralogs
~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...
lipocalin 1
odorant-binding protein 2a
progestagen-assoc. endo.
apolipoprotein D
retinol-binding protein
neutrophil gelatinase-ass.
prostaglandin D2 synthase
alpha-1-microglobulin
complement component 8
Multiple sequence alignment allows us
to look back billions of years ago (BYA)
Origin of
life
BYA 4
Earliest
fossils
Origin of Eukaryote/
eukaryotes archaea
3
2
Fungi/animal
Plant/animal
1
insects
0
Background on sequence alignment
There are two main approaches to sequence
Alignment:
1. Global alignment (Needleman & Wunsch 1970)
using dynamic programming to find optimal
alignments between two sequences. (Although the
alignments are optimal, the search is not
exhaustive.)
Gaps are permitted in the alignments, and the total
lengths of both sequences are aligned (hence
“global”).
Global Alignment
A: acagtcgacgtgtcgatgtcgattagtcgatcgct
B: acagtcgacgtagtcgatcgcttgtcgatgtcgat
A: acagtcgacg------------tgtcgatgtcgattagtcgatcgct
||||||||||
|||||||||||||
B: acagtcgacgtagtcgatcgcttgtcgatgtcgat------------
2. Local alignment (Smith & Waterman, 1980).
The alignment may contain just a portion of either
sequence, and is appropriate for finding matched
domains between sequences.
S-W is guaranteed to find optimal alignments, but
it is computationally expensive (requires O (n2)
time).
BLAST and FASTA are heuristic approximations to
local alignment. Each requires only O (n2) /k time;
they examine only part of the search space.
Local Alignment
A: acagtcgacgtgtcgatgtcgattagtcgatcgct
B: acagtcgacgtagtcgatcgcttgtcgatgtcgat
A: acagtcgacgtgtcgatgtcgattagtcgatcgct
||||||||||
B: acagtcgacgtagtcgatcgcttgtcgatgtcgat
A:
acagtcgacgtgtcgatgtcgattagtcgatcgct
|||||||||||||
B: acagtcgacgtagtcgatcgcttgtcgatgtcgat
A: acagtcgacgtgtcgatgtcgattagtcgatcgct
||||||||||||
B:
acagtcgacgtagtcgatcgcttgtcgatgtcgat
Local alignment
Domain A
Domain B
Domain A
Domain B
实例:人类组织血纤蛋白溶酶原活化因子(PLAT)和凝血因子XII
(F12) 的模块结构的示意图。标记为Catalytic的模块在若干种凝
血蛋白中是常见的,F1和F2是较为常见的重复模块,首先在血
纤蛋白中被发现。E模块同表皮生长因子极为类似。通常称为
Kringle domain 的模块被标记为K。
http://www.uniprot.org/uniprot/P00748
http://www.uniprot.org/uniprot/P00750
Lalign Input
BLAST 2 Seq Output
Needle Input
Lalign Global alignment
Needle Global alignment
Lalign Local alignment
Waterman Local alignment
对人类凝血因子XII和组织血纤蛋白溶酶原活化因子的氨基酸序列进行打
点比较。DOTTER程序
General approach to pairwise alignment
① Choose two sequences
② Select an algorithm that generates a score
③ Allow gaps (insertions, deletions)
④ Scoring the alignment to reflect degree of
similarity
⑤ Estimate statistic significance of the
alignment (probability that the alignment
occurred by chance)
An alignment scoring system is required
to evaluate how good an alignment is
• Positive and negative values assigned
respectively for matches (identities and
conservations) and mismatches
• Gap creation and extension penalties
• Positive score for identities
• Some partial positive score for conservative
substitutions
• Global versus local alignment
• Use of a substitution matrix
Calculation of an alignment score
Global alignment with the algorithm
of Needleman and Wunsch (1970)
• Two sequences can be compared in a matrix along x- and
y-axes.
• If they are identical, a path along a diagonal can be drawn
• Find the optimal subpaths, and add them up to achieve
the best score. This involves
--adding gaps when needed
--allowing for conservative substitutions
--choosing a scoring system (simple or complicated)
• N-W is guaranteed to find optimal alignment(s)
Three steps to global alignment
with the Needleman-Wunsch algorithm
[1] Set up a matrix
[2] Score the matrix
[3] Identify the optimal alignment(s)
Four possible outcomes in aligning two sequences
1
2
[1] identical
(stay along a diagonal)
[2] mismatch
(stay along a diagonal)
[3] gap in sequence 1
(move vertically!)
[4] gap in sequence 2
(move horizontally!)
Start Needleman-Wunsch with an identity matrix
sequence 1
sequence 2
sequence 1 ABCNJRQCLCRPM
sequence 2 AJCJNRCKCRBP
Page 66
Start Needleman-Wunsch with an identity matrix
Fill in the matrix starting from the bottom right
i=2, j=3:
si,j = max
si-1,j-1 + si, j
si-1, j + si, j - Gap penalty (-1)
si, j-1 + si, j - Gap penalty (-1)
i=3, j=3:
si,j = max
si-1,j-1 + si, j
si-1, j + si, j - Gap penalty (-1)
si, j-1 + si, j - Gap penalty (-1)
i=3, j=4:
si,j = max
si-1,j-1 + si, j
si-1, j + si, j - Gap penalty (-1)
si, j-1 + si, j - Gap penalty (-1)
i=7, j=8:
si,j = max
si-1,j-1 + si, j
si-1, j + si, j - Gap penalty (-1)
si, j-1 + si, j - Gap penalty (-1)
After you’ve filled in the matrix, find the optimal
path(s) by a “Traceback” procedure.
sequence 1 ABCNJ-RQCLCR-PM
sequence 2 AJC-JNR-CKCRBPsequence 1 ABC-NJRQCLCR-PM
sequence 2 AJCJN-R-CKCRBP-
Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments,
although the algorithm does not search all
possible alignments (solution space).
It is an example of a dynamic programming
algorithm: an optimal path (alignment) is identified
by incrementally extending optimal subpaths.
Thus, a series of decisions is made at each step
of the alignment to find the pair of residues with
the best score.
全局比对算法
Smith-Waterman 算法
(1) 计算出两个序列的相似分值,存于一
个矩阵中。(edit matrix、DP矩阵)
(2) 根据edit matrix得到序列比对的最
优分值,通过回溯(Traceback)的
方法可获得序列的最优比对序列 。
(1) 计算edit matrix:
for i = 0 to n do
for j = 0 to m do
Calculate V (i, j) using V (i - 1, j - 1),
V (i, j - 1), V (i - 1, j)
end
(2) 回溯的算法:
for ( i = | S |,j = | T |; i > 0 && j > 0;) {
if (V[i, j ] = V[ i –1, j –1] + σ ( S[ i ], T[ j ]))
{ i – –; j – –; }
else if (V[i, j ] = V[ i –1, j] + σ ( S[ i ], – ))
{ i – –; insert( ‘–’, T, j ); }
else if (V[i, j ] = V[ i, j–1]
+ σ (–, T[ i ] ))
{ j – –; insert( ‘–’, S, i ); }
}
Outline: pairwise alignment
1. Overview and examples
2. Definitions: homologs, paralogs, orthologs
3. Alignment algorithms:
I.
Global alignment: Needleman-Wunsch
II. Local alignment: Smith-Waterman
4. Assigning scores to aligned amino acids:
Dayhoff’s PAM matrices
5. Statistical significance of pairwise alignments
Global alignment versus local alignment
Global alignment (Needleman-Wunsch) extends from
one end of each sequence to the other.
Local alignment finds optimally matching regions
within two sequences (“subsequences”).
Local alignment is almost always used for database
searches such as BLAST. It is useful to find domains
(or limited regions of homology) within sequences.
Smith and Waterman (1981) solved the problem of
optimal local sequence alignment. Other methods
(BLAST, FASTA) are faster but less thorough.
How the Smith-Waterman algorithm works
Set up a matrix between two sequences
(size m+1, n+1)
The score in each cell is the maximum of four values:
[1] s(i-1, j-1) + the new score at [i, j] (a match or mismatch)
[2] s(i,j-1) – gap penalty
[3] s(i-1,j) – gap penalty
[4] zero
No values in the scoring matrix can be negative
S>0
Possible Lancelet (amphioxus) Growth hormone
Danio GH1
1 --------------------------------------------------
0
BF_76953
1 MVSANRMIFGWKLLYLPVFCLLLAGFVVTVKGTPWLQVFATVKGTGQTAV
50
Danio GH1
1 --------------------------MARALVLLQLVV--VSLLVNQGKA
:.|..:||:... |.|:.| |:.
51 DAWNDNTGENVLHNKSPLVEQWGSLGIKRVKILLEASTGDVELIFN-GEN
22
72
BF_76953
23 SENQRLFNNAVIRVQHLHQLAAKMINDFEEGLMPEERRQLSKIFPLSFCN
::....|:.|.:.....:.:.|:..|.|.........|
100 TDKHNWFSLARLVSSPWNDVHAEPNNFFSIAGHDIHLR------------
137
Danio GH1
73 SDSIETPTGKDETQKSSMLKLLRISFRLIESWEFPSQTLSSTISNSLTIG
122
BF_76953
138 --------------------------------------------------
137
Danio GH1
123 NPNQITEKLADLKMGISVLIKGCLDGQPNMDDNDSLPLPFEDFYLTVGET
172
BF_76953
138 --------------------------------------------------
137
Danio GH1
173 SLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
210
BF_76953
138 --------------------------------------
137
BF_76953
Danio GH1
99
Danio GH1
BF_72547
Danio GH1
BF_72547
Danio GH1
BF_72547
Danio GH1
BF_72547
Danio GH1
1 ---MARALVLLQLVV---------VSLLVNQGKASENQRLFNNA-VIRVQ
.|:.:|||...:
::|..:.|.........:|| .:.:|
1 MAITAQMIVLLVFCIWVDHTAAQRITLPYHPGSCCFGGESRSNADTVGLQ
37
50
38 HLHQLAAKMINDFEEGLMPEERRQLSKIFPLSFCNSDSIETPTGKDETQK
:.:.|.|......|.||..
|...:..:..||...:....
51 -------RRVADLEARQAQMEARQAQ----LEVSSRQAENTPPCPEPEIP
87
88 SSMLKLLRISFR----LIESWEFPSQTLSSTISNSLTIGNPNQITEKLAD
.....:.:|..:
:...|...|...:.....||:.| :.|:. |
90 CEWKLVFKIQAKAGPDIYPLWSSASVPANQPDDPSLSTG--HYISR---D
133
134 LKMGISVLIK----GCLDGQPNMDDNDSLPLPFEDFYLTVGETSLRESF:.|..|:.||
......|||:..| |.|.
|..|.::::
135 VGMWESLDIKKVKVSLYTFSPNMETRD---LIFN------GTGSNKDNWF
178
89
134
175
BF_72547
179 ---RLLACFKKDMHKVET-YLRVANCRRSLDSNCTL-------------||::....|:....| |..:|...|...|...|
176 SKDRLISSPWTDLKTEPTNYFSIAGYERRDSSESYLFRRFHINRSYAGCA
210
225
Danio GH1
211 --------------------------------------------------
210
BF_72547
226 ADLGWLVVLDGNSNVCPWERGTGNPRIMFSKLPTNVNFQQDPSNVGTADV
275
Danio GH1
211 ----------
210
BF_72547
276 MAIYIKTCDD
285
Practice
Choose two of your own protein sequences:
1. Try using needle to implement a Needleman-Wunsch
alignment algorithm to find the optimum global alignment
(including gaps):
2. Try using Water to implement a Smith-Waterman algorithm
to find the optimum local alignment (including gaps):
3. Try using Lalign to implement a Smith-Waterman algorithm
to find all possible local alignment (including gaps):
http://www.ebi.ac.uk/Tools/psa/
Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Alignment algorithms:
• Needleman-Wunsch,
• Smith-Waterman
• Assigning scores to aligned amino acids:
Dayhoff’s PAM matrices
• Statistical significance of pairwise alignments
Calculation of an alignment score
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
PAM250 scoring matrix
Blosum62 scoring matrix
Gonnet et al. (1992) produced a new scoring matrix known as the Gonnet92 matrix that
was intended to be an updated version of the Dayhoff PAM250 scoring (see also
Benner et al. 1994). However, since widely divergent sequences were used to produce
this matrix, it is more comparable to the BLOSUM62 scoring matrix.
Why using scoring matrices?
It is assumed that the sequences being
compared have a common evolutionary
ancestral sequence. The best guess at the
actual path of evolution is the path that requires
the fewest evolutionary events.
All substitutions are not equally likely,
Insertions and deletions are less likely than
substitutions. and they should be weighted to
account for this. Thus matrices are used to
estimate how well two residues of given types
would match if they were aligned in a sequence
alignment.
Why using scoring matrices?
A substitution is more likely to occur between
amino acids with similar biochemical properties.
For example, the hydrophobic amino acids
Isoleucine (I) and valine (V) get a positive score
on matrices adding weight to the likeliness that
one will substitute for another.
While the hydrophobic amino acid isoleucine
(I) has a negative score with the hydrophilic
amino acid cystine (C) as the likeliness of this
substitution occurring in the protein is far less.
Importance of scoring matrices
• Scoring matrices appear in all analysis
involving sequence comparison.
• The choice of matrix can strongly influence
the outcome of the analysis.
• Scoring matrices implicitly represent a
particular theory of evolution.
• Understanding theories underlying a given
scoring matrix can aid in making proper
choice.
Guidelines for using Scoring matrices
Protein
Length
>300
85-300
50-85
>300
85-300
35-85
<=35
<=10
Matrix
Open Gap
Extend Gap
BLOSUM50
BLOSUM62
BLOSUM80
PAM250
PAM120
MDM40
MDM20
MDM10
-10
-7
-16
-10
-16
-12
-22
-23
-2
-1
-4
-2
-4
-2
-4
-4
How did the scoring matrix came from?
There have been extensive studies looking at the
frequencies in which amino acids substituted for each
other during evolution. The studies involved carefully
aligning all of the proteins in several families of proteins
and then constructing phylogenetic trees for each family.
Each phylogenetic tree can then be examined for the
substitutions found on each branch.
This can then be used to produce tables (scoring
matrices) of the relative frequencies with which amino
acids replace each other over a short evolutionary period.
Thus a substitution matrix describes the likelihood that
two residue types would mutate to each other in
evolutionary time.
Comparing two proteins with a PAM1 matrix
gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix
is not forgiving of mismatches, and penalizes them
severely. Using this matrix you can find almost no match.
hsrbp, 136 CRLLNLDGTC
btlact,
3 CLLLALALTC
* ** * **
A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
rbp4 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
*
**** *
* *
*
** *
rbp4 86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
**
* ** **
Differences between PAM and BLOSSUM
1.PAM matrices are based on an explicit evolutionary model
(that is, replacements are counted on the branches of a
phylogenetic tree), whereas the Blosum matrices are
based on an implicit rather than explicit model of evolution.
The sequence variability in the alignments used to count
replacements.
2.The PAM matrices are based on mutations observed
throughout a global alignment, this includes both highly
conserved and highly mutable regions. The Blosum
matrices are based only on highly conserved regions in
series of alignments forbidden to contain gaps.
3.The method used to count the replacements is different,
unlike the PAM matrix, the Blosum procedure uses groups
of sequences within which not all mutations are counted
the same.
BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution matrix.
BLOSUM62 is a matrix calculated from comparisons of
sequences with no less than 62% divergence.
BLOSUM Matrices
All BLOSUM matrices are based on observed alignments;
they are not extrapolated from comparisons of closely
related proteins.
The BLOCKS database contains thousands of groups of
multiple sequence alignments.
BLOSUM62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of moderately distant
proteins, it performs well in detecting closer relationships.
A search for distant relatives may be more sensitive with a
different matrix.
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -1 1 1 -2 -1 -3 -2 5
-1 -2 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2
2
7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I
L K M F P S T W Y
V
Blosum62 scoring matrix
Rat RBP
versus
Mouse RBP
Rat RBP
versus
bacterial
lipocalin
Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Alignment algorithms:
• Needleman-Wunsch,
• Smith-Waterman
• Assigning scores to aligned amino acids:
Dayhoff’s PAM matrices
• Statistical significance of pairwise alignments
Statistical significance of pairwise alignment
• A basic question is how to determine whether a
particular alignment score is likely to have
occurred by chance.
• According to the null hypothesis, two aligned
sequences are not homologous (evolutionarily
unrelated).
• Can we reject the null hypothesis at a particular
significance level alpha (percent of confidence)?
Low-complexity regions (LCRs)
Both proteins and nucleic acids contain
regions of biased composition, which can lead
to confusing database search results. These
low-complexity regions (LCRs) include:
1.the obvious homopolymeric runs
2.short-period repeats
3.more subtle cases where one or a few
different residues may be overrepresented.
Low-complexity regions (LCRs)
Alignment of LCR-containing sequences is
problematic:
• They do not fit the model of residue-by-residue
sequence conservation.
• The functionally relevant attributes may be only the
periodicity or composition and not any specific
sequence.
• The statistical significance of matches can be
overestimated, as the methods for assessing the
significance are based on certain notions of
randomness which LCRs do not obey.
Consequently, many false positives may be
observed in the output of a database search with an
LCR-containing query sequence (Altschul et al.,
1994).
SEG shows the low-complexity sequences in lower-case letters on the
left and high-complexity in upper-case on the right.
human achaetescute protein (SWISS-PROT P50553)
Using the -x switch, SEG will mask low-complexity sequences
BLAST can perform the masking automatically.
Identifying low-complexity regions .
homologous
sequences
non-homologous
sequences
Sequences
reported
as related
True positives
False positives
Sequences
reported
as unrelated
False negatives
True negatives
Sensitivity:
ability to find
true positives
Specificity:
ability to minimize
false positives