tutorial4_scoringMatices
Tutorial 4
Comparing Protein Sequences
Intro to Bioinformatics
1
Amino acids were not born equal
2
Comparing Protein Sequences
Substitution Matrices
PAM - Point Accepted Mutations
BLOSUM - Blocks Substitution Matrix
Advanced comparison tools
Psi-BLAST
Phi-BLAST
3
Substitution Matrix
Scoring matrix S
20x20 for protein alignment (one row and one column per amino acid)
Si,j represents the gain/penalty due to substituting AAj by AAi (i: row, j: column)
Based on the likelihood that this substitution is found in nature
Computed differently in PAM and BLOSUM
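To make the role of S concrete, here is a minimal sketch of scoring an ungapped alignment by summing Si,j over aligned positions. The handful of scores listed are entries as commonly tabulated for BLOSUM62; the dictionary layout and the function names `substitution_score` and `score_ungapped` are assumptions made for illustration, and a real matrix holds all 20x20 entries.

```python
# A few substitution scores as commonly tabulated for BLOSUM62.
# Only the pairs needed for this example are listed (illustrative subset).
SCORES = {
    ("L", "L"): 4,
    ("L", "I"): 2,
    ("K", "K"): 5,
    ("K", "R"): 2,
    ("D", "D"): 6,
    ("D", "E"): 2,
    ("L", "D"): -4,
    ("W", "W"): 11,
}


def substitution_score(a, b):
    """Look up the score for aligning residue a with residue b.

    The matrix is symmetric, so both key orders are tried.
    """
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def score_ungapped(seq1, seq2):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(substitution_score(a, b) for a, b in zip(seq1, seq2))


# Conservative substitutions (L/I, K/R) still earn positive scores.
print(score_ungapped("LKDW", "IRDW"))  # 2 + 2 + 6 + 11 = 21
# A dissimilar substitution (L/D) is penalized.
print(score_ungapped("LKDW", "DKDW"))  # -4 + 5 + 6 + 11 = 18
```

The rare residue W contributes the largest identity score, which illustrates that not all matches are rewarded equally.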
4
Computing the probability of mutation (Mi,j)
PAM - Point Accepted Mutations
Based on closely related proteins (X% divergence)
Matrices for comparing divergent proteins are computed by extrapolation
BLOSUM - Blocks Substitution Matrix
Based on conserved blocks, with sequences clustered by similarity (at least X% identical)
Matrices for divergent proteins are derived by choosing an appropriate X%
5
PAM-1
Captures mutation rates between closely related proteins
1% divergence
Mi,j = #(A→B) / #A (observed A-to-B substitutions divided by occurrences of A)
Problematic when comparing distantly related proteins
The 1% divergence does not capture rarer, sporadic mutations
PAM250 is theoretical (based on extrapolation)
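The extrapolation step can be sketched as repeated matrix multiplication: if a PAM1 matrix holds mutation probabilities for one unit of divergence, PAM250 is that matrix raised to the 250th power. The example uses a hypothetical 3-letter alphabet with made-up probabilities (not real Dayhoff data), and the row convention (row = original residue, column = replacement) is an assumption of this sketch.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical PAM1-like matrix for a toy alphabet "ABC".
# Each row sums to 1 and each letter is unchanged with probability 0.99,
# i.e. 1% expected divergence.
ALPHABET = "ABC"
PAM1_TOY = [
    [0.990, 0.008, 0.002],
    [0.006, 0.990, 0.004],
    [0.003, 0.007, 0.990],
]

PAM250_TOY = mat_pow(PAM1_TOY, 250)

for letter, row in zip(ALPHABET, PAM250_TOY):
    print(letter, [round(p, 3) for p in row])
```

No sequence pair at that distance was ever observed in this calculation, which is why the slide calls PAM250 theoretical.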
6
PAM-1
7
BLOSUM62
Captures mutation rates between divergent proteins
Why is BLOSUM62 called BLOSUM62?
Because, within each block, sequences that share at least 62% identity with ANY other member of their cluster are grouped, averaged, and counted as a single sequence.
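The clustering step can be sketched as follows. This is a minimal illustration, not the original BLOSUM pipeline: the block is a hypothetical set of equal-length, ungapped rows, and the function names `percent_identity`, `cluster_block`, and `sequence_weights` are invented for this example. A sequence joins a cluster when it is at least 62% identical to any member, and each cluster then counts as one sequence in total.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length block rows."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, threshold=62.0):
    """Group rows so that each row is >= threshold identical to ANY member
    of its cluster (single linkage)."""
    clusters = []
    for seq in block:
        merged = [seq]
        remaining = []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= threshold
                   for member in cluster):
                merged.extend(cluster)  # seq links to this cluster
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


def sequence_weights(clusters):
    """Each cluster counts as one sequence: members share a weight of 1."""
    return {seq: 1.0 / len(cluster) for cluster in clusters for seq in cluster}


# Hypothetical block (made-up rows, 8 columns).
block = ["ACDEFGHI", "ACDEFGKK", "ACWWFGKK", "MNPQRSTV"]
clusters = cluster_block(block)
print(clusters)
print(sequence_weights(clusters))
```

In this toy block the third row is only 50% identical to the first but 75% identical to the second, so all three end up in one cluster. That is the effect of the "ANY other member" rule.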
8
BLOSUM62
The idea of BLOSUM matrices is to get a better measure of the differences between two proteins, specifically for more distantly related proteins.
Similar amino acids receive high scores
9
PAM & BLOSUM
PAM: Based on global alignments of closely related proteins.
BLOSUM: Based on local alignments.
PAM: PAM1 is calculated from comparisons of sequences with no more than 1% divergence.
BLOSUM: BLOSUM62 is calculated from blocks in which sequences with at least 62% identity are clustered.
PAM: Other PAM matrices are extrapolated from PAM1.
BLOSUM: All BLOSUM matrices are based on observed alignments. They are not extrapolated from comparisons of closely related proteins.
10
Use Recommendations
PAM100 ~ BLOSUM90 (Closely Related)
PAM120 ~ BLOSUM80
PAM160 ~ BLOSUM60
PAM200 ~ BLOSUM52
PAM250 ~ BLOSUM45 (Highly Divergent)
Query length   Matrix     Gap costs
<35            PAM30      9,1
35-50          PAM70      10,1
50-85          BLOSUM80   10,1
>85            BLOSUM62   11,1
11
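The query-length table above can be written as a small lookup. This is a sketch under two assumptions: the gap costs are read as (gap opening, gap extension), and the overlapping boundaries at 50 and 85 are assigned to the shorter-query row. The function name `recommend_matrix` is hypothetical.

```python
def recommend_matrix(query_length):
    """Return (matrix, gap_open, gap_extend) for a protein query length.

    Boundary values 50 and 85 appear in two rows of the table; here they
    are assigned to the shorter-query row (an assumption of this sketch).
    """
    if query_length < 35:
        return ("PAM30", 9, 1)
    if query_length <= 50:
        return ("PAM70", 10, 1)
    if query_length <= 85:
        return ("BLOSUM80", 10, 1)
    return ("BLOSUM62", 11, 1)


for length in (20, 40, 70, 300):
    print(length, recommend_matrix(length))
```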
Example
Query: >ADRM1_HUMAN
(Proteasomal ubiquitin receptor)
Database: nr, restricted to human sequences
BLAST program: BLASTP
Matrices: PAM30, BLOSUM45
12
What difference do we observe?
• With BLOSUM45 we find both closely related and divergent sequences.
• With PAM30 we find only closely related sequences.
[Figure: result panels labeled PAM 30 and BLOSUM45]
13
With BLOSUM45 we can discover interesting relations between proteins
[Figure: result panels labeled PAM 30 and BLOSUM45]
Mucin-13: a glycosylated membrane protein that protects the cell by binding to pathogens
14
Using different scoring matrices can produce slightly different alignments:
With PAM 30
With BLOSUM45
15
The same sequences can be aligned in many ways, especially when using a matrix for highly divergent sequences (BLOSUM45).
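The existence of several equally good alignments can be illustrated with a small dynamic-programming sketch that counts how many distinct tracebacks reach the optimal global score. The scoring scheme (match +1, mismatch -1, linear gap -1) is a toy assumption and not BLOSUM45, and the function name `count_optimal_alignments` is hypothetical.

```python
def count_optimal_alignments(s, t, match=1, mismatch=-1, gap=-1):
    """Return (best global score, number of optimal tracebacks)."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    count[0][0] = 1
    for i in range(1, rows):  # first column: s aligned against gaps only
        score[i][0] = i * gap
        count[i][0] = 1
    for j in range(1, cols):  # first row: t aligned against gaps only
        score[0][j] = j * gap
        count[0][j] = 1
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + sub, count[i - 1][j - 1]),  # align pair
                (score[i - 1][j] + gap, count[i - 1][j]),          # gap in t
                (score[i][j - 1] + gap, count[i][j - 1]),          # gap in s
            ]
            best = max(value for value, _ in options)
            score[i][j] = best
            # Ties between moves mean several alignments are equally good.
            count[i][j] = sum(c for value, c in options if value == best)
    return score[-1][-1], count[-1][-1]


# "A" against "AA": the single A can pair with either A of the longer
# sequence, giving two alignments with the same score.
print(count_optimal_alignments("A", "AA"))  # (0, 2)
```

Ties like this tend to become more common when many different substitutions receive similar scores, which is one way to read the slide's remark about matrices for highly divergent sequences.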
16
PSI-BLAST
Position Specific Iterative BLAST
We will analyze the following uncharacterized archaeal protein:
>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVI
DEREIKKRDIFSLLLGVAGLNKSVEEFENELKNKLTEEAKNK
MENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIM
GSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS
17
18
E-value threshold for the initial BLAST search (default: 10)
E-value threshold for inclusion in PSI-BLAST iterations (default: 0.005)
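The two thresholds play different roles: the first decides which hits are reported at all, the second decides which hits are used to build the position-specific scoring matrix for the next iteration. A minimal sketch, assuming hits are (name, E-value) pairs and that a hit passes when its E-value is at or below the threshold; the hit names and E-values are made up, and `split_hits` is a hypothetical helper rather than part of any BLAST tool.

```python
REPORT_THRESHOLD = 10.0      # default for the initial BLAST search
INCLUSION_THRESHOLD = 0.005  # default for inclusion in PSI-BLAST iterations


def split_hits(hits, report=REPORT_THRESHOLD, inclusion=INCLUSION_THRESHOLD):
    """Separate reported hits from hits included in the next iteration."""
    reported = [(name, e) for name, e in hits if e <= report]
    included = [(name, e) for name, e in reported if e <= inclusion]
    return reported, included


# Hypothetical hits with made-up E-values.
hits = [("hit_A", 1e-40), ("hit_B", 0.001), ("hit_C", 0.3), ("hit_D", 25.0)]
reported, included = split_hits(hits)
print([name for name, _ in reported])  # ['hit_A', 'hit_B', 'hit_C']
print([name for name, _ in included])  # ['hit_A', 'hit_B']
```

Lowering the inclusion threshold makes each iteration more conservative. Raising it pulls in more distant candidates at the risk of including unrelated sequences.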
19
The query itself
Orthologous sequences in two other archaeal species
Other homologous sequences
20
21
22
Is MJ0577 a filament protein?
Is MJ0577 a cationic amino acid transporter?
Is MJ0577 a universal stress protein?