
Sequence Analysis
Topics to be covered
• What is sequence analysis?
• Why do we need it?
• How do we do it?
• What tools are available?
Sequence Databases
• GenBank at the National Center for Biotechnology
Information, National Library of Medicine, Bethesda, MD
accessible from:
http://www.ncbi.nlm.nih.gov/Entrez
• European Molecular Biology Laboratory (EMBL) Outstation
at Hinxton, England
http://www.ebi.ac.uk/embl/index.html
• DNA DataBank of Japan (DDBJ) at Mishima, Japan
http://www.ddbj.nig.ac.jp/
• Protein Information Resource (PIR) database at the National
Biomedical Research Foundation in Washington, DC
http://www-nbrf.georgetown.edu/pirwww/
• The SwissProt protein sequence database at ISREC, Swiss Institute
for Experimental Cancer Research in Epalinges/Lausanne
http://www.expasy.ch/cgibin/sprot-search-de
Servers for Sequence Alignment
• Bayes Block Aligner
http://www.wadsworth.org/res&res/bioinfo
• BCM Search Launcher: Pair-wise sequence alignment
http://searchlauncher.bcm.tmc.edu/seqsearch/alignment.html
• SIM—Local similarity program for finding alternative
alignments
http://www.expasy.ch/tools/sim.html
• Global alignment programs (GAP, NAP)
http://genome.cs.mtu.edu/align/align.html
• FASTA program suite
http://fasta.bioch.virginia.edu/fasta/fasta_list.html
• BLAST 2 sequence alignment (BLASTN, BLASTP)
http://www.ncbi.nlm.nih.gov/gorf/bl2.html
• Likelihood-weighted sequence alignment (lwa)
http://stateslab.bioinformatics.med.umich.edu/service/lwa.html
ALIGNMENT
How do we tell whether
two macromolecules are
similar? Why?
Sequence Comparison
• Similarity between and among sequences
is at the heart of sequence comparison
methods.
• Q: What causes sequences to be similar?
• A: Evolutionary OR developmental causes.
Similarity, Homology, Orthology, Paralogy
• Similarity cannot be equated with Homology,
Orthology, or Paralogy.
• Different organisms may have similar
characteristics for different causes/reasons:
evolutionary OR developmental.
• We need to concern ourselves with one set only,
the set whose similarities are due to evolutionary
causes/events; in other words, we take into
account ONLY those characteristics that are
similar among organisms due to evolution
(common ancestry).
Similarity, Homology, Orthology, Paralogy
Homology
• Homology: designates a relationship of
common descent between any entities,
without further specification of the
evolutionary scenario.
• Accordingly, the entities related by
homology, in particular, genes, are called
homologs.
• Sequences are said to be homologous
if they are related by divergence from
a common ancestor
• Homology is not a measure of
similarity, but an absolute statement
that sequences have a divergent rather
than a convergent relationship.
Orthology
• Orthologs are homologous genes in
different species that diverged through
speciation; they typically retain
analogous functions.
Paralogy
• Paralogs are homologous genes that are the
result of a gene duplication.
• Paralogs are therefore homologs, but not
orthologs.
Orthology vs. Paralogy
A species phylogeny inferred from a mix of orthologs and
paralogs is likely to be incorrect.
The figure above shows how gene duplication gave
rise to two paralogous branches, α and β; the genes from
different species within each branch are orthologs of each other.
ALIGNMENT
• One-to-One
• One-to-Database
• Many-to-Many
Origins of Sequence Similarity
• Homology
– common evolutionary descent
• Similarity in function
– convergence
• Chance
Identity
GAACAAT
||||||| 7/7 OR 100%
GAACAAT
MISMATCH
GAACAAT
||| ||| 6/7 OR 86%
GAATAAT
Mismatches
GAACAAT
||| ||| 6/7 OR 86%
GAATAAT
GAACAAT
||| ||| 6/7 OR 86%
GAAGAAT
Terminal Mismatch
     GAACAATttttt
     ||| |||      6/7 OR 86%
aaaccGAATAAT
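The identity figures on these slides can be checked with a few lines of Python. This is a minimal sketch, assuming the two sequences are already aligned and of equal length; the function name `percent_identity` is illustrative and not taken from any tool named in the slides.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that hold the same character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# 7 of 7 positions match.
print(round(percent_identity("GAACAAT", "GAACAAT")))  # 100
# 6 of 7 positions match; 6/7 rounds to 86%.
print(round(percent_identity("GAACAAT", "GAATAAT")))  # 86
```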
Types of Sequence Analysis
• Knowledge-based single/multiple sequence
analysis for sequence characteristics
• Pair-wise sequence alignment
• Multiple sequence alignment
  ~ Sequence motif discovery in multiple
  sequence alignment
  ~ Phylogenetic analysis
Sequence Alignment
• Procedure for comparing two
(pair-wise alignment) or more
(multiple sequence alignment)
sequences by searching for a
series of individual characters or
character patterns that are in the
same order in the sequences.
Sequence Alignment
I Fundamental to inferring homology
(common ancestry) and function.
II If two sequences are in alignment, i.e. part or
all of the pattern of nucleotides or
amino acids matches, then they are similar and
may be homologous.
III If the sequence of a protein or other molecule
‘significantly’ matches the sequence of a
protein with a known structure and function,
then the molecules may share structure and
function.
Pair-wise Sequence Alignment
• One pair of elements at a time
• Challenge – Find optimum alignment of 2
sequences with some degree of similarity
• Optimality is based on SCORE
• Score reflects the no. of paired characters in
the 2 sequences and the no. and length of gaps
introduced to adjust the sequences so that max
no. of characters are in alignment
Pair-wise Sequence Alignment
Example 1
Consider the ideal case of 2 identical
nucleotide sequences
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTCAGTGCTAGA
Score – 1 point per pair of aligned characters
SCORE = 20 points
Pair-wise Sequence Alignment
Example 2
Consider the case when several of the characters
are not aligned
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTGCTAGA
SCORE = 11 points??
Note that the last 6 characters are identical!
Pair-wise Sequence Alignment
Example 2 contd.
ATTCGGCATTCAGTGCTAGA
ATTCGGCATT----GCTAGA
SCORE = 16 points??
Any cost/penalty of inserting gaps?
Penalty = -0.5 / gap
Final Score = 16 – 0.5*4 = 14
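The arithmetic of Example 2 can be reproduced with a small scoring function. This is a sketch under the slide's assumptions (1 point per matching pair, 0 for a mismatch, -0.5 per gap position); the function name `score_alignment` is hypothetical.

```python
def score_alignment(top: str, bottom: str,
                    match: float = 1.0, gap: float = -0.5) -> float:
    """Score two aligned strings column by column ('-' marks a gap)."""
    total = 0.0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap      # each gap position costs the same
        elif a == b:
            total += match    # mismatches score 0 in this scheme
    return total


seq1 = "ATTCGGCATTCAGTGCTAGA"
# Without gaps only the first 16 columns can be compared.
print(score_alignment(seq1, "ATTCGGCATTGCTAGA"))      # 11.0
# With a 4-position gap the trailing GCTAGA lines up again.
print(score_alignment(seq1, "ATTCGGCATT----GCTAGA"))  # 14.0
```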
Pair-wise Sequence Alignment
Example 3
Areas of similarity and dissimilarity not obvious
as opposed to the previous example where there
were two blocks of identical characters
ATTCGGCATTCAGAGCTAGA
ATTCGACATTGCTAGTGGTA
SCORE = 12 points
Pair-wise Sequence Alignment
Example 3 contd…
Areas of similarity and dissimilarity not obvious
ATTCGGCATTCAGAGCTAGa
ATTCGACATT----GCTAGt
SCORE = 14 – 0.5*4 = 12
SCORE = 14 (1 for exact match)
– 0.5*4 ( - 0.5 for gap)
– 0.5*1 ( - 0.5 for inexact match)
= 14 – 2 – 0.5 =11.5
Pair-wise Sequence Alignment
Example 3 contd.
ATTCGGCATTCAGAGCTAGa
ATTCGACATT----GCTAGt
SCORE = 14 – 0.5*4 – 0.5*1 (inexact matches) = 11.5
Penalty(gap) = Cost(opening) + Cost(per extension) * Length(gap)
In Example 2 and above, gap penalty = -0.5 + (-0.5*4) = -2.5
The score in the above case becomes 14 – 2.5 – 0.5 = 11
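The opening-plus-extension formula can be written out directly. The sketch below assumes the slide's costs (opening -0.5, extension -0.5 per gap position) and the Example 3 counts of 14 exact matches, one 4-position gap and one inexact match; `affine_gap_penalty` is an illustrative name.

```python
def affine_gap_penalty(length: int, opening: float = -0.5,
                       extension: float = -0.5) -> float:
    """Penalty(gap) = Cost(opening) + Cost(per extension) * Length(gap)."""
    if length <= 0:
        return 0.0
    return opening + extension * length


gap_penalty = affine_gap_penalty(4)
print(gap_penalty)  # -2.5

# Example 3: 14 exact matches, one gap of length 4, one inexact match (-0.5).
score = 14 * 1.0 + gap_penalty + 1 * (-0.5)
print(score)  # 11.0
```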
Two types of Sequence Alignment
• Global Alignment: An attempt is made to
align the entire sequence using as many
characters as possible, till the ends of both the
sequences.
• Local Alignment: Stretches of sequence with
the highest density of matches are aligned
generating one or more islands of matches.
This is particularly relevant for multi-domain
proteins that share a conserved domain.
Local vs. Global Alignment
Global alignment
• An attempt to line up two sequences,
matching as many characters as possible
• Considers all characters in a sequence
• Bases alignment on the total score, even at
the expense of stretches that share
obvious similarity
• Used for determining whether two protein
sequences are in the same family
Local vs. Global Alignment
Local Alignment
• More meaningful: points out
conserved regions between two
sequences
• Aligns two partially overlapping
sequences
• Aligns two sequences where one is a
subsequence of another
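The difference between the two alignment types shows up clearly in a dynamic programming sketch. The code below computes only the optimal score, in the style of Needleman-Wunsch (global) and Smith-Waterman (local); the scoring values (match 1, mismatch -1, gap -1) and the function name `alignment_score` are assumptions chosen for illustration, not parameters from the slides.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -1, local: bool = False) -> int:
    """Optimal alignment score by dynamic programming.

    local=False: global alignment, both sequences scored end to end.
    local=True:  local alignment, best-scoring island of similarity.
    """
    # First row: leading gaps cost nothing in a local alignment.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# Global: 6 matches and one gap.
print(alignment_score("GATTACA", "GATACA"))                        # 5
# Local: only the shared GCTAGA island contributes.
print(alignment_score("TTTGCTAGATTT", "CCCGCTAGACCC", local=True))  # 6
```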
Methods
• Dot Matrix/Dot Plot
• Bayesian Method
• Dynamic Programming
• Smith-Waterman Algorithm
• Hidden Markov Models
• Genetic Algorithms
• Neural Networks
• Word-based techniques
  • FASTA
  • BLAST
Dot Matrix Method
• Visual inspection of linear sequences with 100 or more
characters is impractical
• Dot Matrix method is visually more informative
• Makes similarities in patterns more obvious to visual
inspection
• A dot is placed at the intersection of matching character pairs
• Dotter is a graphical dot plot program for detailed comparison
of two sequences.
Reference
Erik L.L. Sonnhammer and Richard Durbin,
"A dot-matrix program with dynamic threshold control
suited for genomic DNA and protein sequence analysis",
Gene 167:GC1-10 (1995)
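A basic dot matrix is easy to sketch in Python. This is not how Dotter works internally; it is a minimal illustration that places a dot wherever two characters match, and the names `dot_matrix` and `render` are hypothetical.

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[list[bool]]:
    """matrix[i][j] is True when seq_y[i] matches seq_x[j]."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(matrix: list[list[bool]], seq_x: str, seq_y: str) -> str:
    """Text rendering: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq_x)]
    for label, row in zip(seq_y, matrix):
        lines.append(label + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


word = "SEQUENCE"
print(render(dot_matrix(word, word), word, word))
```

Comparing a sequence with itself gives an unbroken main diagonal; the repeated letter E also produces off-diagonal dots, which is the kind of background that filtering addresses.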
Dot Matrix Method
[Figure: dot plot of the sequence SEQUENCEANALYSISPRIMER against itself; dots mark matching character pairs]
Dot Matrix Method
The biggest asset of dot matrix analysis is it allows you
to visualize the entire comparison at once, not
concentrating on any one ‘optimal’ region.
Since your own mind and eyes are still better than
computers at discerning complex visual patterns,
especially when more than one pattern is being
considered, you can see all these ‘less than best’
comparisons as well as the main one and then you can
‘zoom-in’ on those regions of interest using more
detailed procedures.
Dot Matrix Method
‘mutated’ inter-sequence comparison
Dot Matrix Method: Detection of Repeats
[Figure: dot plot illustrating the detection of repeats; repeated segments appear as additional diagonals parallel to the main diagonal]
Dot Matrix Method: detecting complicated
mutations
[Figure: dot plot of SEQUENCEANALYSISPRIMER against a rearranged version of the sequence]
Dot Matrix Method
Complicated MUTATIONS
Again, notice the diagonals. However,
they have now been displaced off
the center diagonal of the plot. This is
an example of a ‘transposition’. Dot
matrix analysis is one of the few
sensible ways to locate such
transpositions in sequences. The
‘deletion’ of ‘PRIMER’ is shown by the
lack of a corresponding diagonal.
Noise in Dot plots and filtering
[Figure: unfiltered dot plot of two short text sequences, showing scattered noise dots away from the diagonals]
Dot Matrix Method: filtering noise using the
sliding window approach
Reconsider the same plot. Notice the extraneous dots that neither indicate
runs of identity between the two sequences nor inverted repeats. These
merely contribute ‘noise’ to the plot and are due to the ‘random’
occurrence of the letters in the sequences, the composition of the
sequences themselves.
How can we ‘clean up’ the plots so that this noise does not detract from our
interpretations? Consider the implementation of a filtered windowing
approach; a dot will only be placed if some ‘stringency’ is met within a
user-defined window.
That is, a dot is placed at the middle of a window only if a defined
criterion is met within that window. The window is then shifted one
position and the entire process is repeated. This removes much of the
unwanted noise from the plot.
Sliding Window Algorithm
Window Size = 3
Stringency = 3
Sequence 1: TACGGTATG
Sequence 2: ACAGTATC
[Figure: a 3-character window sliding along the two sequences; a dot is placed only where all 3 characters in the window match]
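The sliding window filter can be sketched as follows, using the slide's two sequences with window size 3 and stringency 3. The function name `windowed_dots` and the choice to report dots as (row, column) index pairs at the window center are assumptions made for this illustration.

```python
def windowed_dots(seq_x: str, seq_y: str,
                  window: int = 3, stringency: int = 3) -> list[tuple[int, int]]:
    """Return (row, column) positions of dots that survive filtering.

    A window of the given size is slid along every diagonal; a dot is
    placed at the window's middle only if at least `stringency` of its
    character pairs match.
    """
    half = window // 2
    dots = []
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i + half, j + half))
    return dots


seq_x = "TACGGTATG"   # horizontal axis
seq_y = "ACAGTATC"    # vertical axis
# Only the shared GTA and TAT triplets pass the filter.
print(windowed_dots(seq_x, seq_y))  # [(4, 5), (5, 6)]
```

The two surviving dots lie on one diagonal and correspond to the shared stretch GTAT; the isolated single-letter matches are filtered out.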

Dotplot
(Window = 130 / Stringency = 9)
[Figure: dot plot comparing two hemoglobin chains, one on each axis]
Dotplot
(Window = 18 / Stringency = 10)
[Figure: dot plot comparing two hemoglobin chains, one on each axis]
Parameters of Sequence Alignment
Scoring Systems:
• Each symbol pairing is assigned a numerical
value, based on a symbol comparison table.
Gap Penalties:
• Opening: The cost for opening a gap
• Extension: The cost for elongating a gap
Protein Scoring Systems
The concept of amino acid substitution
matrices: a 20x20 matrix containing values
proportional to the probability that amino acid i mutates
into amino acid j, for all pairs of amino acids.
– Pioneered by Margaret Dayhoff (1968), who first
described a method to derive scoring matrices from
amino acid replacements observed in aligned families
of present-day sequences.
– In 1978 her group studied 1572 mutations in 71
families of closely-related protein sequences. Based
on these they built matrices known as PAM (percent
accepted mutation) matrices.
PAM 250 (log odds form or mutation data matrix or MDM)
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
Each matrix value is calculated from an odds score, which is the probability
that the amino acid pair will be found in alignments of homologous proteins
divided by the probability that the pair will be found in alignments of
unrelated proteins by random chance.
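The odds score described above can be turned into an integer matrix entry by taking a logarithm. The sketch below assumes a base-10 logarithm scaled by 10 and rounded, which is how Dayhoff-style log-odds values are commonly described; the probabilities passed in are hypothetical numbers chosen only to show the sign behaviour.

```python
import math


def log_odds_score(p_homologous: float, p_random: float,
                   scale: float = 10.0) -> int:
    """Rounded log-odds score for one amino acid pair.

    p_homologous: probability of the pair in alignments of homologous proteins
    p_random:     probability of the pair arising by chance
    The scale factor of 10 is an assumption for this sketch.
    """
    return round(scale * math.log10(p_homologous / p_random))


# Hypothetical probabilities, for illustration only.
print(log_odds_score(0.02, 0.01))   # pair seen twice as often as chance: 3
print(log_odds_score(0.01, 0.01))   # pair seen exactly as often as chance: 0
print(log_odds_score(0.005, 0.01))  # pair seen half as often as chance: -3
```

Positive entries mark substitutions seen more often among related proteins than expected by chance; negative entries mark substitutions that are under-represented.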
Using the matrix…
Alignment
Sequence A:            Tyr  Cys  Asp  Ala
Sequence B:            Phe  Met  Glu  Gly
PAM 250 matrix value:    7   -5    3    1
Total score for alignment of sequence A
with sequence B =?
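The total is the sum of the matrix values for each aligned pair. A minimal sketch, using only the four PAM 250 values quoted on this slide; the dictionary `PAM250_EXCERPT` and the helper names are illustrative.

```python
# The four PAM 250 values quoted on the slide (symmetric, so order is free).
PAM250_EXCERPT = {
    ("Tyr", "Phe"): 7,
    ("Cys", "Met"): -5,
    ("Asp", "Glu"): 3,
    ("Ala", "Gly"): 1,
}


def pair_score(a: str, b: str) -> int:
    """Look up a pair in either order; raise KeyError if it is not listed."""
    if (a, b) in PAM250_EXCERPT:
        return PAM250_EXCERPT[(a, b)]
    return PAM250_EXCERPT[(b, a)]


def alignment_total(seq_a: list[str], seq_b: list[str]) -> int:
    """Sum the substitution scores over an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


sequence_a = ["Tyr", "Cys", "Asp", "Ala"]
sequence_b = ["Phe", "Met", "Glu", "Gly"]
print(alignment_total(sequence_a, sequence_b))  # 7 - 5 + 3 + 1 = 6
```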
Using the matrix (example 2)
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix
     C   S   T   P   A   G   N   D
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
.
.
T:G = -2
T:T = 5
Score = 48
• First, pairs of aligned amino acids in verified
alignments are used to build a count matrix.
• Then the count matrix is used to estimate a mutation
matrix at 1 PAM (evolutionary unit).
• From the mutation matrix, a Dayhoff scoring matrix is
constructed.
• This Dayhoff matrix, along with a model of indel events,
is then used to score new alignments.
• These alignments can then be used in an iterative
process to construct new count matrices.
Extrapolating the PAM 1 matrix to obtain
other matrices of the PAM family
• Percent Accepted Mutation. A unit
introduced by Dayhoff et al. to quantify the
amount of evolutionary change in a protein
sequence. One PAM unit is the amount of
evolution which will change, on average, 1%
of amino acids in a protein sequence.
PAM family
• PAM matrices are based on global alignments of
closely related proteins.
• The PAM1 is the matrix calculated from
comparisons of sequences with no more than 1%
divergence.
• Other PAM matrices can be extrapolated from
PAM1. Each PAM matrix gives the substitutions
expected for a given period of evolutionary time.
• It is rare that a PAM matrix would be used for an
evolutionary distance any greater than 256
PAMS.
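Extrapolation works by repeated matrix multiplication: the mutation probability matrix for n PAMs is the PAM1 probability matrix multiplied by itself n times. The sketch below uses a toy two-letter alphabet with hypothetical probabilities (not Dayhoff's data) purely to show the mechanics; a real PAM matrix is 20x20.

```python
def mat_mul(x: list[list[float]], y: list[list[float]]) -> list[list[float]]:
    """Plain matrix product of two square matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m: list[list[float]], power: int) -> list[list[float]]:
    """Multiply m by itself `power` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy PAM1: each letter has a 1% chance of changing per PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

# After 250 PAMs the chance of seeing the original letter has dropped
# from 0.99 to just above 0.5, yet each row still sums to 1.
print(toy_pam250[0])
```

Note that 250 PAMs does not mean 250% of positions differ: repeated and reverse substitutions at the same site keep the sequences recognizably related.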
BLOSUM (Block Substitution
Matrices)
• BLOSUM matrices are based on local alignments.
• BLOSUM 62 is a matrix calculated from
comparisons of sequences sharing no more than 62%
identity (more similar sequences are clustered
and weighted as one).
• Unlike PAM matrices, all BLOSUM matrices are
based on observed alignments; they are not
extrapolated from comparisons of closely related
sequences.
• BLOSUM 62 is the default matrix in BLAST. Though
BLOSUM 62 is tailored for comparisons of
moderately distant proteins, it performs well in
detecting close relationships. A search for distant
relatives may be more sensitive with a different
matrix.
Blosum 62 substitution matrix
S Henikoff, and JG Henikoff
Amino Acid Substitution Matrices from Protein Blocks
PNAS 89: 10915-10919, 1992.
The relationship between PAM and
BLOSUM matrices
Less divergent  ->  More divergent
BLOSUM 80      BLOSUM 62      BLOSUM 45
PAM 1          PAM 120        PAM 250
Nucleic Acid Scoring Systems
- very simple
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Unitary matrix
    A  G  C  T
A   1  0  0  0
G   0  1  0  0
C   0  0  1  0
T   0  0  0  1
Match: 1
Mismatch: 0
Score = 5
Scoring Insertions and Deletions and
Gap Penalties
Without a gap:
 ATGTAATGCA
TATGTGGAATGA
With a gap:
 ATGT--AATGCA
TATGTGGAATGA
      insertion / deletion
The creation of a gap is penalized with a negative score value.
Why Gap Penalties?
Match = 5
Mismatch = -4
Gaps not permitted
Score: 10
1 GTGATAGACACAGACCGGTGGCATTGTGG 29
  |||   |  | |||    |   || || |
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
Gaps allowed but not penalized
Score: 88
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
  ||| || | | | |||  ||  |  |  || || |
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
Why Gap Penalties?
• The optimal alignment of two similar sequences is usually
that which
• maximizes the number of matches and
• minimizes the number of gaps.
• There is a tradeoff between these two
- adding gaps reduces mismatches
• Permitting the insertion of arbitrarily many gaps can lead to
high scoring alignments of non-homologous sequences.
• Penalizing gaps forces alignments to have relatively few
gaps.
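The effect of free gaps can be checked numerically on the two alignments from the "Why Gap Penalties?" slide. The sketch assumes match = 5 and mismatch = -4 as stated there, with '.' marking a gap; the gap cost of -8 in the last call is a hypothetical value chosen only to show how a penalty changes the ranking.

```python
def score(top: str, bottom: str, match: int = 5,
          mismatch: int = -4, gap: int = 0) -> int:
    """Column-by-column score; '.' marks a gap position."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "." or b == ".":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


ungapped = ("GTGATAGACACAGACCGGTGGCATTGTGG",
            "GTGTCGGGAAGAGATAACTCCGATGGTTG")
gapped = ("GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG",
          "GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG")

print(score(*ungapped))        # 14 matches, 15 mismatches: 10
print(score(*gapped))          # 20 matches, 3 mismatches, free gaps: 88
print(score(*gapped, gap=-8))  # same alignment with 12 penalized gaps: -8
```

With free gaps the heavily gapped alignment looks far better than the ungapped one; once each gap position costs something, it no longer does.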
Gap Penalties
• How to balance gaps with mismatches?
• Gaps must get a steep penalty, or else you’ll end up
with nonsense alignments.
• In real sequences, multi-base (or amino acid) gaps
are quite common
  • genetic insertion/deletion events
• “Affine” gap penalties give a big penalty for each
new gap, but a much smaller “gap extension” penalty.
Scoring Insertions and Deletions
match = 1
mismatch = 0
Scoring system 1
 ATGTTATAC
TATGTGCGTATA
Total Score: 4
Scoring system 2
 ATGT---TATAC
TATGTGCGTATA
      insertion / deletion
Gap parameters:
d = 3 (gap opening)
e = 0.1 (gap extension)
g = 3 (gap length)
penalty(g) = -d - (g - 1)e = -3 - (3 - 1)(0.1) = -3.2
Total Score: 8 - 3.2 = 4.8
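Scoring system 2 can be sketched in code with the slide's parameters (match = 1, mismatch = 0, d = 3, e = 0.1). The first gap position costs d and each further position in the same run costs e, which gives -d - (g - 1)e for a gap of length g. The function name `affine_score` is hypothetical, and only the overlapping columns of the alignment are passed in.

```python
def affine_score(top: str, bottom: str, match: float = 1.0,
                 mismatch: float = 0.0, d: float = 3.0, e: float = 0.1) -> float:
    """Score an alignment with an affine gap penalty ('-' marks a gap).

    Simplification for this sketch: adjacent gap columns count as one run,
    regardless of which sequence carries the gap.
    """
    total = 0.0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total -= e if in_gap else d   # open once, then extend
            in_gap = True
        else:
            total += match if a == b else mismatch
            in_gap = False
    return total


# Overlapping columns of the gapped alignment: 8 matches, one gap of length 3.
result = affine_score("ATGT---TATA", "ATGTGCGTATA")
print(round(result, 1))  # 8 - 3.2 = 4.8
```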
Building the PAM mutation data matrix
Studied 1572 mutations in 71
families of closely-related
protein sequences.
Similar sequences were organized into a
phylogenetic tree in each family.
In each of the aligned sequence
families, the number of
changes of each amino acid
to every other amino acid
was counted.
Example
Out of 1572 changes, there were 260
changes between F and Y.
Multiply
1