Lecture 5. Sequence Analysis, Part 2


Sequence Alignment
Csc 487/687 Computing for bioinformatics
Refining the Scoring Scheme
- Scoring Matrix
• Measures the relative probability of any particular substitution.
• The relative frequencies of such changes are used to form a scoring matrix for substitutions.
• A likely change scores higher than a rare one.
Scoring matrix for nucleic acid sequences
• A simple scheme for substitutions: +1 for a match, -1 for a mismatch.
• A more complicated scheme is based on the higher frequency of transition mutations than transversion mutations:
  - transitions: a ↔ g and t ↔ c
  - transversions: (a or g) ↔ (t or c)
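The two schemes above can be sketched in a few lines of Python. This is only an illustration: the function names and the penalty values for transitions (-1) and transversions (-2) are assumptions chosen for the example, not values given in the lecture.

```python
# Purines and pyrimidines; a substitution within one class is a transition,
# a substitution between the classes is a transversion.
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides.

    The default penalties are hypothetical; they only encode the idea that
    the more frequent transitions are penalised less than transversions.
    """
    x, y = x.lower(), y.lower()
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def score_ungapped(seq1, seq2, **penalties):
    """Sum the pair scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b, **penalties) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(score_ungapped("acgt", "gcga"))
    # The simple +1/-1 scheme is the special case with equal penalties.
    print(score_ungapped("acgt", "gcga", transition=-1, transversion=-1))
```

Setting both penalties to the same value recovers the simple match/mismatch scheme, so the same function covers both bullets.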
Refining the Scoring Scheme
- Scoring Matrix
• The scheme should return high values for alignments of homologous proteins.
• It should give higher scores to alignments of amino acids that are often seen in corresponding positions in homologous proteins.
Scoring Matrices
• Importance of scoring matrices
• Scoring matrices appear in all analyses involving sequence comparisons.
• The choice of matrix can strongly influence the outcome of the analysis.
• Scoring matrices implicitly represent a particular theory of relationships.
• Understanding the theory underlying a given scoring matrix can aid in making the proper choice.
Identity Matrix
A  1
C  0  1
I  0  0  1
L  0  0  0  1
   A  C  I  L
Simplest type of scoring matrix
Similarity
It is easy to score if an amino acid is identical to another (the
score is 1 if identical and 0 if not). However, it is not easy to
give a score for amino acids that are somewhat similar.
[Figure: chemical structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
Scoring matrices
• Give a score for each pair of amino acids
• Should reflect
  - the degree of "biological relatedness"
  - the "probability" that two amino acids occurring in different sequences have a common ancestor
• Should be symmetric
• Substitution matrices
  - give the probability that an amino acid a is changed to amino acid b (in a certain evolutionary time)
  - are generally not symmetric
Scoring matrices
• Identity matrix (scoring 0/1)
• Use of the distances in the genetic code
• Use of amino acid similarities based on physico-chemical properties
• Scoring matrices based on experimental data (PAM, BLOSUM)
DAYHOFF’s PAM-MATRICES
• Based on experimental data
• t: evolutionary time interval
• Sequences from 34 superfamilies were used
1. Divide the sequences into groups (71) of homologous sequences, and make a multiple alignment for each of them.
2. Construct evolutionary trees for each group, and estimate the mutations that have occurred.
3. Define an evolutionary model to explain the evolution.
4. Construct substitution matrices: for each amino acid pair (a,b), an estimate of the probability that an amino acid a has mutated to an amino acid b in time interval t.
5. Construct scoring matrices from the substitution matrices.
Note that a and b are variables that mean any amino acid.
Example
The model of the evolution
• The probability of a mutation in a position is independent of
  - the position and the neighbouring residues
  - previous mutations in the position
• A biological (evolutionary) clock is assumed (meaning a constant rate of mutations)
• This means that evolutionary time can be measured in number of mutations (here substitutions)
• The measure is PAM (Point Accepted Mutations)
• 1 PAM is one accepted mutation per 100 residues
The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
A 1-PAM unit corresponds to 1 mutation between two aligned sequences, each 100 amino acids long.
Example 1:
..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
  |||||||||||||| |||||||||||||||||||||||||||||||||||
..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
length = 100, 1 mismatch, PAM distance = 1
A k-PAM unit is equivalent to k 1-PAM units (or $M^k$, the k-th power of the 1-PAM substitution matrix).
Substitution matrix $M^1$
Calculate $M^z$ by matrix multiplication, shown here for z = 2
z = 2 means two mutations per 100 residues
A residue a can be changed to residue b after 2 PAM in one of the following ways:
1. a is mutated to b in the first PAM and unchanged in the next, with probability $M_{ab} M_{bb}$
2. a is unchanged in the first PAM and changed to b in the next, with probability $M_{aa} M_{ab}$
3. a is mutated to an amino acid x in the first PAM, and then to b in the next, with probability $M_{ax} M_{xb}$, x being any amino acid other than a and b
These three cases are mutually exclusive, hence
$$M^2_{ab} = M_{ab} M_{bb} + M_{aa} M_{ab} + \sum_{x \notin \{a,b\}} M_{ax} M_{xb} = \sum_{x} M_{ax} M_{xb}$$
where the last sum runs over all amino acids x.
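The two-step calculation can be checked numerically with a small sketch. The 3-letter alphabet and the entries of `M1` below are made-up toy values (a real PAM matrix is 20 x 20), and the helper names are assumptions for this example. The code compares the sum of the three cases, for a different from b, with the corresponding entry of the matrix product.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][x] * B[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, z):
    """Return M multiplied by itself z times (z >= 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(z):
        result = mat_mul(result, M)
    return result


def two_step_by_cases(M, a, b):
    """Probability of a -> b after two steps, summed over the three cases (a != b)."""
    via_other = sum(M[a][x] * M[x][b] for x in range(len(M)) if x not in (a, b))
    return M[a][b] * M[b][b] + M[a][a] * M[a][b] + via_other


# Hypothetical 1-step substitution matrix; row a holds the probabilities
# that residue a is replaced by each residue b, so every row sums to 1.
M1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.03, 0.96],
]

if __name__ == "__main__":
    M2 = mat_pow(M1, 2)
    print(M2[0][1], two_step_by_cases(M1, 0, 1))
```

The same `mat_pow` call with a larger z gives the matrices for longer evolutionary distances under this model.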
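Final Scoring Matrix is the Log-Odds Scoring Matrix
$$S(a,b) = 10 \log_{10}\left(\frac{M_{ab}}{P_b}\right)$$
where a is the original amino acid, b is the replacement amino acid, $M_{ab}$ is the corresponding entry of the mutational probability matrix, and $P_b$ is the frequency of amino acid b.
Applying this to $M^{250}$ gives the PAM-250 scoring matrix.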
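As a rough sketch, the log-odds formula translates directly into code. The function names, the two-letter alphabet, and all probabilities and frequencies below are hypothetical values chosen only to show the calculation.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(M_ab / P_b) for a single pair of residues."""
    return 10.0 * math.log10(m_ab / p_b)


def log_odds_matrix(M, P):
    """Apply the formula to every entry of a substitution matrix.

    M maps each original residue a to a dict {b: M_ab};
    P maps each residue b to its background frequency P_b.
    """
    return {a: {b: log_odds_score(m_ab, P[b]) for b, m_ab in row.items()}
            for a, row in M.items()}


# Toy values, not taken from any published matrix.
M_toy = {
    "A": {"A": 0.6, "B": 0.4},
    "B": {"A": 0.2, "B": 0.8},
}
P_toy = {"A": 0.3, "B": 0.7}

if __name__ == "__main__":
    for a, row in log_odds_matrix(M_toy, P_toy).items():
        print(a, {b: round(s, 2) for b, s in row.items()})
```

A positive score means the replacement is seen more often than expected from the frequency of b alone; a negative score means it is seen less often.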
BLOSUM (Henikoff & Henikoff)
• Performs best in identifying distant relationships
• Makes use of the much larger amount of data that has become available since Dayhoff’s work
• Based on the BLOCKS database of aligned protein sequences
BLOSUM (Henikoff & Henikoff)
• Make multiple alignments and discover blocks not containing gaps (over 2,000 blocks were used)
...KIFIMK.......GDEVK...
...NLFKTR       GDSKK...
   KIFKTK       GDPKA
   KLFESR       GDAER
   KIFKGR       GDAAK
• For each column in each block, they counted the number of occurrences of each pair of amino acids (210 different pairs, 20*21/2)
• A block of length w from an alignment of n sequences has wn(n-1)/2 occurrences of amino acid pairs
• Let $h_{ab}$ be the number of occurrences of the pair (a,b) in all blocks ($h_{ab} = h_{ba}$)
• Let T be the total number of pairs
• $f_{ab} = h_{ab}/T$
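The pair counting can be illustrated on the first block of the example above. This is a minimal sketch with assumed function names. It covers only the counting of $h_{ab}$, T and $f_{ab}$ for a single block, and leaves out other parts of the published BLOSUM procedure, such as clustering of similar sequences.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered amino acid pairs column by column in an ungapped block."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    h = Counter()
    for col in range(width):
        column = [seq[col] for seq in block]
        for x, y in combinations(column, 2):
            # Sort so that (a,b) and (b,a) are stored under the same key.
            h[tuple(sorted((x, y)))] += 1
    return h


def pair_frequencies(h):
    """f_ab = h_ab / T, with T the total number of counted pairs."""
    total = sum(h.values())
    return {pair: count / total for pair, count in h.items()}


# First block from the slide: n = 5 sequences, width w = 6.
block = ["KIFIMK", "NLFKTR", "KIFKTK", "KLFESR", "KIFKGR"]

if __name__ == "__main__":
    h = count_pairs(block)
    T = sum(h.values())  # equals w*n*(n-1)/2 = 6*5*4/2
    f = pair_frequencies(h)
    print(T, h[("F", "F")], f[("F", "F")])
```

For several blocks, the counters from each block can simply be added before the frequencies are computed.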
Gap weighting
• CLUSTAL-W
• For aligning DNA sequences
  - Use of identity matrix for substitution
  - Gap penalties: 10 for gap initiation and 0.1 for gap extension by one residue
• For aligning protein sequences
  - BLOSUM62 matrix
  - Gap penalties: 11 for gap initiation and 1 for gap extension by one residue
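A gap penalty with separate initiation and extension costs can be sketched as below. The function name is hypothetical, and the convention is an assumption: here every gapped residue, including the first, is charged the extension cost on top of a single initiation cost. Alignment programs differ on this detail, so the numbers are only indicative.

```python
def affine_gap_penalty(length, initiation, extension):
    """Penalty for one gap of `length` residues.

    Assumed convention: initiation + length * extension.
    Some programs charge the extension only from the second residue on.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return initiation + length * extension


if __name__ == "__main__":
    # Parameter values quoted on the slide.
    print(affine_gap_penalty(5, initiation=10, extension=0.1))  # DNA setting
    print(affine_gap_penalty(5, initiation=11, extension=1))    # protein setting
```

With these settings one long gap is cheaper than several short gaps of the same total length, because the initiation cost is paid only once.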