Computational Biology, Part 7
Similarity Functions and
Sequence Comparison with Dot
Matrices
Robert F. Murphy
Copyright 1996, 1999-2001.
All rights reserved.
Similarity Functions
Used to facilitate comparison of two sequence elements
logical valued (true or false, 1 or 0)
  test whether first argument matches (or could match) second argument
numerical valued
  test degree to which first argument matches second
Logical valued similarity
functions
Let Search(I)=‘A’ and Sequence(J)=‘R’
A Function to Test for Exact Match
  MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R
A Function to Test for Possibility of a Match using IUB codes for Incompletely Specified Bases
  MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G
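The two logical-valued functions above can be sketched in Python. This is a minimal illustration, not the course's own code: the names match_exact, match_wild and IUB_CODES are assumptions, and the table lists the standard IUB ambiguity codes for DNA.

```python
# IUB codes mapped to the set of concrete bases each one can stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_exact(a, b):
    """True only if the two symbols are identical."""
    return a == b

def match_wild(a, b):
    """True if the two symbols share at least one possible base."""
    return bool(set(IUB_CODES[a]) & set(IUB_CODES[b]))

print(match_exact("A", "R"))  # A is not R
print(match_wild("A", "R"))   # R can be either A or G
```

Treating a "possible match" as a non-empty set intersection is one reasonable reading of the slide; a stricter version could require that the first symbol's set be contained in the second's.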
Numerical valued similarity
functions
return value could be probability (for DNA)
  Let Search(I) = 'A' and Sequence(J) = 'R'
  SimilarNuc(Search(I),Sequence(J)) could return 0.5 since chances are 1 out of 2 that a purine is adenine
return value could be similarity (for protein)
  Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine)
  SimilarProt(Seq1(I),Seq2(J)) could return 0.8 since lysine is similar to arginine
usually use integer values for efficiency
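A numerical-valued function for DNA might look like the sketch below. It assumes every base allowed by an ambiguity code is equally likely, which the slides do not state; the name similar_nuc and the table are illustrative only.

```python
# IUB codes mapped to the concrete bases each can represent.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def similar_nuc(a, b):
    """Probability that a base drawn from code a equals a base drawn from code b.

    Assumes (hypothetically) that each base permitted by a code is equally likely.
    """
    set_a, set_b = set(IUB_CODES[a]), set(IUB_CODES[b])
    shared = len(set_a & set_b)
    return shared / (len(set_a) * len(set_b))

print(similar_nuc("A", "R"))  # 1 chance in 2 that a purine is adenine
```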
Scoring (similarity) matrices
For each pair of characters in alphabet,
value is proportional to degree of similarity
(or other scoring criterion) between them
For proteins, most frequently used is
Mutation Data Matrix from Dayhoff, 1978
(MDM78)
Dayhoff PAM250 similarity
matrix (partial)
     A    B    C    D    E    F    G    H
A    2    0   -2    0    0   -4    1   -1
B    0    0   -4    3    2   -5    0    1
C   -2   -4   12   -5   -5   -4   -3   -3
D    0    3   -5    4    3   -6    1    1
E    0    2   -5    3    4   -5    0    1
F   -4   -5   -4   -6   -5    9   -5   -2
G    1    0   -3    1    0   -5    5   -2
H   -1    1   -3    1    1   -2   -2    6
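As an illustration of how such a matrix is used, the partial table above can be stored as a lookup keyed by a pair of letters. The values are copied from the slide; the names PAM250_PARTIAL, pair_score and aligned_score are assumptions made for this sketch.

```python
LETTERS = "ABCDEFGH"

# Rows and columns follow the order of LETTERS; values are from the partial table.
ROWS = [
    [ 2,  0, -2,  0,  0, -4,  1, -1],
    [ 0,  0, -4,  3,  2, -5,  0,  1],
    [-2, -4, 12, -5, -5, -4, -3, -3],
    [ 0,  3, -5,  4,  3, -6,  1,  1],
    [ 0,  2, -5,  3,  4, -5,  0,  1],
    [-4, -5, -4, -6, -5,  9, -5, -2],
    [ 1,  0, -3,  1,  0, -5,  5, -2],
    [-1,  1, -3,  1,  1, -2, -2,  6],
]

PAM250_PARTIAL = {
    (row_letter, col_letter): ROWS[i][j]
    for i, row_letter in enumerate(LETTERS)
    for j, col_letter in enumerate(LETTERS)
}

def pair_score(a, b):
    """Similarity score for one pair of residues."""
    return PAM250_PARTIAL[(a, b)]

def aligned_score(seq1, seq2):
    """Sum of pair scores for two equal-length, already aligned sequences."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

print(pair_score("C", "C"), pair_score("D", "E"))
```

Integer scores keep the lookup and the summation cheap, which fits the earlier remark about using integer values for efficiency.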
Origin of PAM 250 matrix
Take aligned set of closely related proteins
For each position in the set, find the most common
amino acid observed there
Calculate the frequency with which each other
amino acid is observed at that position
Combine frequencies from all positions to give
table showing frequencies for each amino acid
changing to each other amino acid
Take logarithm and normalize for frequency of
each amino acid
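The steps above can be mimicked on toy data. The sketch below is a deliberately simplified log-odds calculation that follows the slide's outline only; it is not Dayhoff's actual procedure, and the function name, the toy alignment and the choice of base-10 logarithm are all assumptions.

```python
import math
from collections import Counter

def substitution_log_odds(alignment):
    """Toy log-odds scores from a list of equal-length aligned sequences."""
    pair_counts = Counter()   # (most common residue, observed residue) -> count
    background = Counter()    # overall residue counts
    for column in zip(*alignment):
        counts = Counter(column)
        background.update(counts)
        consensus, _ = counts.most_common(1)[0]   # most common residue here
        for residue, n in counts.items():
            pair_counts[(consensus, residue)] += n
    total = sum(background.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = (background[a] / total) * (background[b] / total)
        # Take the logarithm after normalizing for residue frequencies.
        scores[(a, b)] = math.log10(observed / expected)
    return scores

toy = ["AGK", "AGK", "AGR", "GGK"]   # hypothetical aligned sequences
for pair, value in sorted(substitution_log_odds(toy).items()):
    print(pair, round(value, 2))
```

With this toy input, conserved pairs such as A with A score above zero and the A to G change scores below zero, which is the qualitative behavior expected of a similarity matrix.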
Sequence comparison with dot
matrices
Goal: Graphically display regions of
similarity between two sequences (e.g.,
domains in common between two proteins
of suspected similar function)
Basic Method: For two sequences of
lengths M and N, lay out an M by N grid
(matrix) with one sequence across the top
and one sequence down the left side. For
each position in the grid, compare the
sequence elements at the top (column) and
to the left (row). If and only if they are the
same, place a dot at that position.
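The basic method translates almost directly into code. In this sketch the function names dot_matrix and render are hypothetical; one sequence runs across the top (columns) and the other down the left side (rows).

```python
def dot_matrix(top, left):
    """Grid with one row per element of left and one column per element of top.

    A cell is True if and only if the two sequence elements are the same.
    """
    return [[row_char == col_char for col_char in top] for row_char in left]

def render(top, left):
    """Text picture of the grid: '*' marks a dot, '.' marks no match."""
    lines = ["  " + top]
    for row_char, row in zip(left, dot_matrix(top, left)):
        lines.append(row_char + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

print(render("GATTACA", "GATC"))
```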
Sequence comparison with dot
matrices - References
W.M. Fitch. An improved method of testing
for evolutionary homology. J. Mol. Biol.
16:9-16 (1966)
W.M. Fitch. Locating gaps in amino acid
sequences to optimize the homology
between two proteins. Biochem. Genet.
3:99-108 (1969)
A.J. Gibbs & G.A. McIntyre. The diagram,
a method for comparing sequences. Its use
with amino acid and nucleotide sequences.
Eur. J. Biochem. 16:1-11 (1970)
A.D. McLachlan. Test for comparing related
amino acid sequences: cytochrome c and
cytochrome c551. J. Mol. Biol. 61:409-424
(1971)
J. Pustell & F.C. Kafatos. A high speed, high
capacity homology matrix: zooming
through SV40 and polyoma. Nucleic Acids
Res. 10:4765-4782 (1982)
J. Pustell & F.C. Kafatos. A convenient and
adaptable package of computer programs
for DNA and protein sequence management,
analysis and homology determination.
Nucleic Acids Res. 12:643-655 (1984)
Examples for protein sequences
(Demonstration A5, Sequence 1 vs. 2)
(Demonstration A5, Sequence 2 vs. 3)
Interpretation of dot matrices
Regions of similarity appear as diagonal
runs of dots
Reverse diagonals (perpendicular to
diagonal) indicate inversions
Reverse diagonals crossing diagonals (Xs)
indicate palindromes
(Demonstration A5, Sequence 4 vs. 4)
Can link or "join" separate diagonals to form alignment with "gaps"
  Each a.a. or base can only be used once
  Can't trace vertically or horizontally
  Can't double back
  A gap is introduced by each vertical or horizontal skip
Uses for dot matrices
Can use dot matrices to align two proteins
or two nucleic acid sequences
Can use to find amino acid repeats within a
protein by comparing a protein sequence to
itself
  Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
(Demonstration A5, Sequence 5 vs. 6)
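When a sequence is compared to itself, the main diagonal is trivially full, so repeats are found by looking for runs on the other diagonals. The following sketch assumes exact matches and a hypothetical minimum run length; repeat_diagonals is not a name from the course materials.

```python
def repeat_diagonals(seq, min_run=3):
    """Find diagonal runs off the main diagonal in a self-comparison.

    Returns (offset, start, length) tuples: seq[start:start+length] is
    repeated offset positions further along the sequence.
    """
    hits = []
    n = len(seq)
    for offset in range(1, n):
        run = 0
        # The last iteration (i == n - offset) only flushes a pending run.
        for i in range(n - offset + 1):
            if i < n - offset and seq[i] == seq[i + offset]:
                run += 1
            else:
                if run >= min_run:
                    hits.append((offset, i - run, run))
                run = 0
    return hits

print(repeat_diagonals("ABCDXABCD"))
```

Only offsets above the main diagonal are scanned because a self-comparison matrix is symmetric.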
Can use to find self base-pairing of an RNA
(e.g., tRNA) by comparing a sequence to
itself complemented and reversed
Excellent approach for finding sequence
transpositions
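The self base-pairing idea can be sketched as a comparison of an RNA with its own reverse complement. This example assumes Watson-Crick pairs only (no G-U wobble pairs), and the names reverse_complement and pairing_matrix are illustrative.

```python
# Watson-Crick complements for RNA (assumption: wobble pairs are ignored).
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Complement every base, then reverse the sequence."""
    return rna.translate(COMPLEMENT)[::-1]

def pairing_matrix(rna):
    """Dot matrix of an RNA against its reverse complement.

    Cell [i][j] is True when base i can pair with base len(rna) - 1 - j,
    so a diagonal run of True cells marks a potential stem.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rc] for a in rna]

hairpin = "GGGAAAACCC"   # hypothetical stem-loop: GGG pairs with CCC
grid = pairing_matrix(hairpin)
print([grid[i][i] for i in range(len(hairpin))])
```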
Filtering to remove “noise”
A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A matching another A)
Solution: use a window and a threshold
  compare character by character within a window (have to choose window size)
  require a certain fraction of matches within the window in order to display it with a “dot”
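The window-and-threshold filter can be sketched as follows. The default window size and match count here are arbitrary placeholders, not values from the lecture, and windowed_dot_matrix is a hypothetical name.

```python
def windowed_dot_matrix(top, left, window=5, min_matches=4):
    """Dot matrix filtered with a window and a threshold.

    Cell [i][j] is True when at least min_matches of the window
    positions left[i+k] and top[j+k] (k = 0 .. window-1) are identical.
    """
    rows = len(left) - window + 1
    cols = len(top) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid

grid = windowed_dot_matrix("ACGTACGT", "ACGAACGT", window=4, min_matches=3)
print(grid[0])
```

Expressing the threshold as a count rather than a fraction avoids floating-point comparisons; a fraction f corresponds to a count of roughly f times the window size.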
Example spreadsheet with
window
(Demonstration A6)
How do we choose a window
size?
Window size changes with goal of analysis
  size of average exon
  size of average protein structural element
  size of gene promoter
  size of enzyme active site
How do we choose a threshold
value?
Threshold based on statistics
using shuffled actual sequence
  find average (m) and s.d. (σ) of match scores of shuffled sequence
  convert original (unshuffled) scores (x) to Z scores
  • Z = (x - m)/σ
  use threshold Z of 3 to 6
using analysis of other sets of sequences
  provides “objective” standard of significance
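The shuffling approach can be sketched with the Python standard library. This is an illustration under stated assumptions: only the second sequence is shuffled, the number of shuffles and the seed are arbitrary, and the function names are hypothetical.

```python
import random
import statistics

def window_scores(a, b, window):
    """Match counts for every pair of window start positions in a and b."""
    return [
        sum(a[i + k] == b[j + k] for k in range(window))
        for i in range(len(a) - window + 1)
        for j in range(len(b) - window + 1)
    ]

def shuffled_mean_sd(a, b, window, n_shuffles=20, seed=0):
    """Mean (m) and s.d. of window scores of a against shuffled copies of b."""
    rng = random.Random(seed)
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)   # same composition, randomized order
        scores.extend(window_scores(a, "".join(letters), window))
    return statistics.mean(scores), statistics.pstdev(scores)

def z_score(x, m, sd):
    """Z = (x - m) / sd for an original (unshuffled) score x."""
    return (x - m) / sd

a = "ACGTTGCAACGTAGCTAGCT"   # hypothetical sequences
b = "TTGACGTAGCATCGATCGGA"
m, sd = shuffled_mean_sd(a, b, window=5)
significant = [x for x in window_scores(a, b, 5) if z_score(x, m, sd) >= 3]
print(len(significant))
```

Shuffling preserves base composition, so the resulting m and s.d. describe the scores expected from composition alone; a degenerate input where every shuffled score is identical would give an s.d. of zero and needs separate handling.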
Displaying matrices by Pustell
method with MacVector
Goal: Determine differences in
arrangements of elements of pBluescript
family of vectors
Starting point: Use sequences of three of the
members of the family: open the first three
files in the Common Vectors: Bluescript
folder.
Dot matrices with MacVector
From the Analyze menu select Pustell DNA matrix. A dialog appears.
Select SYNBL2KSM and SYNBL2SKM. Use defaults for all else.
23 regions of homology (“diagonals”) are obtained. Request “Matrix map” only (“Aligned sequences” is not needed).
Note the inversion near nucleotide 700 (the direction of the polylinker is reversed between the two vectors).
To examine the effect of the threshold, decrease “min. % score” from 65 to 55.
Now we get many (223) diagonals. Note the presence of many short regions of at least 55% homology.
Now increase the threshold to 90%.
Now just 3 diagonals are found. Note the absence of short homologous regions (“noise”).
Now compare SYNBL2KSP to SYNBL2SKM.
22 diagonals are found using default settings. Note the second large inversion at one end of the sequences.
More dot matrices with
MacVector - DNA homology
Goal: Duplicate Figure 6 of Chapter 3 of
Sequence Analysis Primer
Get Accession numbers J02289 (Polyoma)
and J02400 (SV40) from Entrez
Do Pustell DNA Matrix analysis using
parameters similar to those used in text
(window size = 41, %identity = 51)
More dot matrices with
MacVector - protein homology
Goal: Reproduce Figure 15 from Chapter 3
of Sequence Analysis Primer
Get Accession numbers P17678 (Chicken)
and X17254 (human) erythroid transcription
factors using Entrez
Do Pustell Protein Matrix Analysis
Reading for next class
B & O, Chapter 7 just pp. 145-155
Additional optional reading: Sequence
Analysis Primer, pp. 124-134 “Dynamic
Programming Methods” (on web site as
Reading 1)
(03-510) Durbin et al, Sections 2.1 - 2.4
Everybody: Look over paper by Needleman
and Wunsch on web site (Reading 2)
Summary, Part 7
Similarity functions or similarity matrices
describe (quantitatively) the degree of
similarity between two sequence elements
(bases or amino acids)
The Dayhoff MDM78 matrix is a similarity
matrix commonly used to estimate the
degree to which a change from one amino
acid to another can be “tolerated” in a
protein
Dot matrices graphically present regions of
identity or similarity between two sequences
The use of windows and thresholds can
reduce “noise” in dot matrices
Inversions, duplications and palindromes
have unique “signatures” in dot matrices