Transcript Similarity

Last lecture summary
Flavors of sequence alignment
pair-wise alignment × multiple sequence alignment
Flavors of sequence alignment
global alignment × local alignment
global
local
align entire sequence
stretches of sequence with
the highest density of
matches are aligned,
generating islands of
matches or subalignments in
the aligned sequences
Evolution of sequences
• The sequences are the products of molecular evolution.
• When sequences share a common ancestor, they tend to
exhibit similarity in their sequences, structures and
biological functions.
DNA1
DNA2
Protein1
Protein2
Sequence
similarity
Similar 3D structure
Similar function
Similar sequences produce similar proteins
However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5) PMID: 11178260
Homology
• homology, orthology, paralogy
• orthologs – from different spcies, posses same function
• paralogs – different function in the same organism
• How it happens?
• orthology – speciation
• paralogy – gene duplication
• gene duplication – unequal cross-over, chromosome replication,
retrotrasposition
• The degree of sequence conservation in the alignment
reveals evolutionary relatedness of different sequences
• The variation between sequences reflects the changes
that have occurred during evolution in the form of
substitutions and/or indels.
Scoring systems
• DNA and protein sequences can be aligned so that the
number of identically matching pairs is maximized.
A T T G - - - T
A – - G A C A T
• Counting the number of matches gives us a score (3 in
this case). Higher score means better alignment.
• This procedure can be formalized using substitution
matrix.
A
Identity
matrix
T
C
A
1
T
0
1
C
0
0
1
G
0
0
0
G
1
Scoring DNA sequence alignment
• Match score:
• Mismatch score:
• Gap penalty:
+1
+0
–1
•
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (– 1)
Score = +11
Scoring DNA sequence alignment (2)
• Match/mismatch score:
• Origination/length penalty:
+1/+0
–2/–1
•
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Origination: 2 × (–2)
• Length: 7 × (–1)
Score = +7
Substitution matrices
• Should reflect:
• Physicochemical properties of amino acids.
• Different frequencies of individual amino acids occuring in proteins.
• Interchangeability of the genetic code.
• PAM
• Manual alignments of 71 groups of very similar (at least 85%
identity) protein sequences. 1572 substitutions were found.
• These mutations do not significantly alter the protein function.
Hence they are called accepted mutations.
• Two sequences are 1 PAM apart if they have 99% identical
residues.
• PAM1 matrix is the result of computing the probability of one
substitution per 100 amino acids.
• Higher PAM matrices
Zvelebil, Baum, Understanding bioinformatics.
PAM 120
Positive score – frequency of
substitutions is greater than would
have occurred by random chance.
Zero score – frequency is equal to
that expected by chance.
small, polar
Negative score – frequency is less
than would have occurred by random
chance.
small, nonpolar
polar or acidic
basic
large, hydrophobic
aromatic
Selzer, Applied bioinformatics.
How to calculate score?
2
substitution matrix
New stuff
Similarity vs. identity
• Similarity refers to the percentage of aligned residues
that can be more readily substituted for each other.
• have similar physicochemical characteristics and
• the selective pressure results in some mutations being accepted
and others being eliminated
S = [(Ls × 2)/(La + Lb)] × 100
number of aligned residues
with similar characteristics
total lengths of
each sequence
Homology vs. similarity
• Two sequences are homologous when they are
descended from a common ancestor sequence.
• Similarity can be quantified: “two sequences share 40%
similarity”.
• But NOT “two sequences share 40% homology”. Just “two
sequences are homologous”
• Qualitative statement
• And it is a conclusion about a common ancestral relationship drawn
from sequence similarity comparison
Gaps
• How will I score this alignment?
V D S - C Y
V E S L C Y
• The gaps can’t be inserted freely.
• Indels are relatively slow evolutionary processes.
• And alignments with large gaps do not make biological sense.
• Each gap is penalized – gap penalty
• Gap penalty is user adjustable parameter.
• Let’s use gap penalty equaling to -11.
V D S V E S L
4 2 4 -11
C Y
C Y
9 7
S = 4 + 2 + 4 – 11 + 9 + 7=15
Gap penalty
• Affine gap penalty
• different for opening and extending
• constant for extending
• Gap penalty high – fewer gaps will be inserted
• If you’re searching for sequences that are a strict match for your
query sequence, the gap penalty should be set high.
• This will retrieve regions with very closely related sequences.
• Gap penalty low – more and larger gaps will be inserted
• If you are searching for similarity between distantly related
sequences, the gap penalty should be set low.
(A) High gap penalty. Gaps has been inserted only at the beginning and end.
Percentage identity = 10%
(B) Low gap penalty. More gaps. Percentage identity = 18%
Zvelebil, Baum, Understanding bioinformatics.
BLOSUM matrices I
• BLOck SUbstitution Matrix by Henikoff and Henikoff,
1992.
• They used the BLOCKS database containing multiple
alignments of ungapped segments (blocks).
• These alignments correspond to the most highly
conserved regions of proteins.
• Blocks are ungapped sequence motifs. Sequence motif is a
conserved stretch of amino acids confering a specific function to a
protein.
• Any given protein can contain one or more blocks corresponding to
its structural/functional motifs.
Blocks
..
.
BLOSUM matrices II
• Thus the Hanikoffs focused on substitution patterns only
in the most conserved regions of a protein. These regions
are (presumably) least prone to change.
• The substitution patterns of 2000 blocks (block is the
whole alignment, not individual sequences within it)
representing more than 500 groups were examined, and
BLOSUM matrices were generated.
• Sequences sharing no more than 62% identity were used
to calculate BLOSUM62 matrix.
Short and clear explanation of BLOSUM62 derivation: Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol.
2004 22(8):1035-6. PMID: 15286655.
BLOSUM matrices III
• BLOSUM matrices are based on entirely different type of
sequence analysis (local ungapped alignment vs. global
gapped alignment in PAM) and on a much larger data
than PAM.
• All BLOSUM matrices are based on observed alignments.
They are not based on extrapolations like PAM !!!
• BLOSUM numbering system goes in reversing order as
the PAM numbering system.
• The lower the BLOSUM number, the more divergent sequence they
represent.
PAM vs. BLOSUM I
• However, you may ask a question why a particular matrix
•
•
•
•
•
should be used?
Dayhoff et al. (1978) defined terms protein families and
superfamilies.
A protein family is formed by sequences 85% (or greater)
identical to each other.
A protein superfamily is defined as sequences related
from 30% or greater.
Superfamily may clearly contain many families.
These terms are widely used in contemporary literature,
however with different meanings (we’ll come to that later).
Guidance in the choice of scoting matrix: Wheeler D. Selecting the right protein-scoring matrix. Curr Protoc Bioinformatics. 2002;Chapter 3:Unit
3.5. www.nshtvn.org/ebook/molbio/Current%20Protocols/CPB/bi0305.pdf
PAM vs. BLOSUM II – PAM
• At the time of deriving PAM matrices, most known
•
•
•
•
•
proteins were small, globular and hydrophilic. If resercher
believes his protein contain substantial hydrophobic
regions, PAM matrices are not that useful.
Most widely used is PAM250.
It is capable of detecting similarities in the 30% range (i.e.
superfamilies).
Another point of view – PAM250 provides the best lookback in evolutionary time.
PAM250 is most effective if the goal is to know the widest
possible range of proteins similar to the given protein.
It is the best to use when the protein is unknown or may
be a fragment of a larger protein.
PAM vs. BLOSUM III – PAM
• Assume a protein is a known member of the serine
•
•
•
•
protease family.
Using the protein as a query against protein databases
with PAM 250 will detect virtually all serine proteases, but
also considerable amount of irrelevant hots.
In this case, the PAM160 matrix should be used. It detects
similarities in the 50% to 60% range (Altschul, 1991).
And to find only those proteins most similar (70% - 90%)
to the query protein, use PAM40.
Let’s summarize:
• Locate all potential similarities – PAM250
• Determine if the protein belongs to protein family – PAM160
• Determine the most similar proteins – PAM40
PAM vs. BLOSUM IIII – BLOSUM
• Most widely used is BLOSUM62.
• BLOSUM62 appears to be superior to PAM250 in
detecting distant relationships even if the PAM method is
updated with current data sets.
• BLOSUM62 is capable of accurately detecting similarities
down to the 30% range (superfamilies).
• Determine if the protein belongs to protein family –
BLOSUM80 (detects identities at the 50% level)
• Determine the most similar proteins – BLOSUM90
Selecting an Appropriate Matrix
Matrix
Best use
Similarity (%)
Pam40
Short highly similar alignments
70-90
PAM160
Detecting members of a protein family
50-60
PAM250
Longer alingments of more divergent sequences
~30
BLOSUM90
Short highly similar alignments
70-90
BLOSUM80
Detecting members of a protein family
50-60
BLOSUM62
Most effective in finding all potential similarities
30-40
BLOSUM30
Longer alingments of more divergent sequences
<30
Similarity column gives range of similarities that the matrix is able to best detect.
PAM vs. BLOSUM IIIII – battle
• Careful information theory analysis showed that the
following matrices are equivalent:
• PAM250 is equivalent to BLOSUM45
• PAM160 is equivalent to BLOSUM62
• PAM120 is equivalent to BLOSUM80
• Relative to the PAM160 matrix, BLOSUM62 is less
tolerant to substitutions involving hydrophilic amino acids,
while BLOSUM62 is more tolerant to substitutions
involving hydrophobic amino acids.
• Although both PAM250 and BLOSUM62 detect similarities
at the 30% level, since BLOSUM uses much wider range
of proteins, PAM250 is actually equivalent to BLOSUM45
when considering all proteins, not just those that are
hydrophilic.
Scoring DNA Alignment
• The concept of similarity has little relevance here.
• Though transitions (R → R or Y → Y) occur more often
than transversions (R → Y or Y → R), this is usually not
helpful for sequence alignment.
• Instead, concept of identity is used.
• Frequencies of mutations are equal for all bases:
• match score +5
• mismatch score -4
• gap penalty (usually a parameter)
• opening -10
• extending -2
Pairwise alignment algorithms
• Dynamic programming
• Slow, but formally optimizing
• Heuristic methods
• Efficient, but not as thorough
• Word (also k-tuples) methods
• Used in database searches
• Dot plot (dot matrix)
• Graphical way of comparing two sequences
Dynamic programming (DP)
• General class of algorithms typically applied to
optimization problems.
• Recursive approach.
• Original problem is broken into smaller subproblems
and then solved.
• Pieces of larger problem have a sequential
dependency.
• 4th piece can be solved using solution of the 3rd
piece, the 3rd piece can be solved by using solution of
the 2nd piece and so on…
New best alignment = previous best + local best
Best previous alignment
Sequence A
...
...
...
...
Sequence B
If you already have the optimal solution to:
X…Y
A…B
then you know the next pair of characters will either be:
X…YZ
A…BC
or
X…YA…BC
or
X…YZ
A…B-
You can extend the match by determining which of these
has the highest score.
DP algorithms
• Global alignment - Needlman-Wunsch
• Local alignment - Smith-Waterman
• Guaranteed to provide the optimal alignment.
• Disadvantages:
• Slow due to the very large number of computational steps: O(n 2).
• Computer memory requirement also increases with the square of
the sequence lengths.
• Therefore, it is difficult to use the method for very long
sequences.
• Many alignments may give the same optimal score. And none of
these correspond to the biologically correct alignment.