tutorial4_scoringMatices

Tutorial 4
Comparing Protein Sequences
Intro to Bioinformatics
Amino acids were not born equal
Comparing Protein Sequences
• Substitution matrices
  - PAM: Point Accepted Mutations
  - BLOSUM: Blocks Substitution Matrix
• Advanced comparison tools
  - PSI-BLAST
  - PHI-BLAST
Substitution Matrix
• Scoring matrix S
• 20x20 for protein alignment (one row and one column per amino acid)
• Si,j is the gain or penalty for substituting amino acid j (AAj) by amino acid i (AAi), where i is the row and j is the column
• Based on how likely this substitution is to be found in nature
• Computed differently in PAM and BLOSUM
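A minimal sketch of how a scoring matrix is used: the score of an ungapped alignment is the sum of the matrix entries for each aligned pair. The names `TOY_SCORES`, `pair_score` and `alignment_score` are hypothetical, and the scores cover only four amino acids for illustration; a real analysis would load a full 20x20 PAM or BLOSUM matrix.

```python
# Toy substitution scores for a few amino-acid pairs (illustration only;
# a real search would use a complete 20x20 PAM or BLOSUM matrix).
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("W", "W"): 11,
    ("L", "I"): 2,    # chemically similar pair: small positive score
    ("L", "D"): -4,   # dissimilar pair: penalty
    ("I", "D"): -3, ("W", "L"): -2, ("W", "I"): -3, ("W", "D"): -4,
}


def pair_score(a, b, scores=TOY_SCORES):
    """Look up the score for aligning a with b; the matrix is symmetric."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def alignment_score(seq1, seq2, scores=TOY_SCORES):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b, scores) for a, b in zip(seq1, seq2))


# Same sequences except for one I aligned with L.
print(alignment_score("LIDW", "LLDW"))
```

Gap penalties are left out here on purpose; they are handled separately from the substitution matrix.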
Computing the probability of mutation (Mi,j)
• PAM (Point Accepted Mutations)
  - Based on closely related proteins (X% divergence)
  - Matrices for comparing more divergent proteins are computed from these
• BLOSUM (Blocks Substitution Matrix)
  - Based on conserved blocks bounded in similarity (at least X% identical)
  - Matrices for divergent proteins are derived by choosing an appropriate X%
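PAM-1
• Captures mutation rates between closely related proteins
• 1% divergence
• Mi,j = #(A → B) / #A: the number of observed substitutions of amino acid A by amino acid B, divided by the number of occurrences of A
• Problematic when comparing distant proteins
  - The 1% divergence does not capture more sporadic mutations
• PAM250 is theoretical (extrapolation based)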
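The extrapolation idea can be sketched as follows: estimate one-step mutation probabilities from substitution counts, then multiply the matrix by itself n times to model n PAM units. This is a simplified illustration on a hypothetical 3-letter alphabet with made-up counts; the function names are assumptions, and the toy matrix is not normalized to exactly 1% change as a real PAM1 is.

```python
def mutation_matrix(sub_counts, occurrences):
    """M[a][b] = (# a->b substitutions) / (# occurrences of a).

    The diagonal entry is whatever probability is left, so each row sums to 1.
    """
    letters = sorted(occurrences)
    matrix = {}
    for a in letters:
        row = {b: sub_counts.get((a, b), 0) / occurrences[a]
               for b in letters if b != a}
        row[a] = 1.0 - sum(row.values())
        matrix[a] = row
    return matrix


def mat_mul(x, y):
    """Multiply two square matrices stored as dict-of-dicts."""
    letters = sorted(x)
    return {a: {b: sum(x[a][k] * y[k][b] for k in letters) for b in letters}
            for a in letters}


def extrapolate(m1, n):
    """Raise the one-step matrix to the n-th power (PAM-n from PAM-1)."""
    result = m1
    for _ in range(n - 1):
        result = mat_mul(result, m1)
    return result


# Hypothetical counts, chosen only to make the example run.
counts = {("A", "S"): 6, ("S", "A"): 6, ("A", "W"): 1,
          ("W", "A"): 1, ("S", "W"): 1, ("W", "S"): 1}
occ = {"A": 1000, "S": 800, "W": 200}

pam1 = mutation_matrix(counts, occ)
pam250 = extrapolate(pam1, 250)
# The probability that A stays A drops as the evolutionary distance grows.
print(pam1["A"]["A"], pam250["A"]["A"])
```

Because every higher matrix is a power of the same one-step matrix, any bias in the closely related training data is carried into the extrapolated matrices.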
BLOSUM62
• Captures mutation rates between divergent proteins
• Why is BLOSUM62 called BLOSUM62? Within each block, sequences that share at least 62% identity with ANY other member of that block are clustered, averaged and counted as a single sequence.
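The clustering step described above can be sketched like this, assuming the block is given as equal-length, gap-free rows. The function names and the example block are hypothetical; the point is only that a sequence joins a cluster when it reaches the identity threshold with any one member, and each resulting cluster then counts as one sequence.

```python
def percent_identity(s1, s2):
    """Percentage of identical positions between two equal-length block rows."""
    matches = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * matches / len(s1)


def cluster_block(block, threshold=62.0):
    """Group block rows so that each row is at least `threshold` percent
    identical to ANY other member of its cluster (single linkage)."""
    clusters = []
    for seq in block:
        merged, rest = [seq], []
        for cluster in clusters:
            if any(percent_identity(seq, m) >= threshold for m in cluster):
                merged.extend(cluster)   # seq links this cluster to the others
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


# Hypothetical block of four aligned segments, 10 residues each.
block = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDEFGHWWW", "WWWWWWWWWW"]
clusters = cluster_block(block)
for cluster in clusters:
    # Each cluster contributes the weight of one sequence in total.
    print(cluster, "weight per sequence:", 1 / len(cluster))
```

Lowering the threshold (for example to 45) merges more sequences into each cluster, which leaves only the more divergent comparisons to contribute to the matrix.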
The idea of BLOSUM matrices is to get a better measure of the differences between two proteins, specifically for more distantly related proteins.
Similar amino acids get a high score.
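Why similar amino acids score high can be illustrated with a log-odds calculation: a pair that is observed in alignments more often than chance predicts gets a positive score, and a pair observed less often gets a negative one. This is a sketch with made-up frequencies; `log_odds_score` is a hypothetical helper, and the default scale of 2 corresponds to scores expressed in half-bit units.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected)."""
    return round(scale * math.log2(observed_freq / expected_freq))


# Made-up frequencies for illustration.
print(log_odds_score(0.02, 0.01))    # pair seen twice as often as expected
print(log_odds_score(0.0025, 0.01))  # pair seen four times less often
```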
PAM & BLOSUM

PAM
• Based on global alignments of closely related proteins.
• PAM1 is calculated from comparisons of sequences with no more than 1% divergence.
• Other PAM matrices are extrapolated from PAM1.

BLOSUM
• Based on local alignments.
• BLOSUM62 is calculated from blocks in which sequences sharing at least 62% identity are clustered and counted as one.
• All BLOSUM matrices are based on observed alignments. They are not extrapolated from comparisons of closely related proteins.
Use Recommendations

PAM100 ~ BLOSUM90 (closely related)
PAM120 ~ BLOSUM80
PAM160 ~ BLOSUM60
PAM200 ~ BLOSUM52
PAM250 ~ BLOSUM45 (highly divergent)

Query length   Matrix     Gap costs
<35            PAM30      9,1
35-50          PAM70      10,1
50-85          BLOSUM80   10,1
>85            BLOSUM62   11,1
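The query-length table can be expressed as a small lookup. `recommend_scoring` is a hypothetical helper, and because the table's ranges share their endpoints (50 and 85), the code assumes each boundary value belongs to the shorter range.

```python
def recommend_scoring(query_length):
    """Return (matrix name, gap costs) for a protein query of this length.

    Assumption: boundary lengths 50 and 85 fall into the shorter range.
    """
    if query_length < 35:
        return "PAM30", (9, 1)
    if query_length <= 50:
        return "PAM70", (10, 1)
    if query_length <= 85:
        return "BLOSUM80", (10, 1)
    return "BLOSUM62", (11, 1)


for length in (20, 40, 70, 300):
    print(length, recommend_scoring(length))
```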
Example
• Query: >ADRM1_HUMAN (Proteasomal ubiquitin receptor)
• Database: nr, limited to human sequences
• BLAST program: BLASTP
• Matrices: PAM30, BLOSUM45
What difference do we observe?
• With BLOSUM45 we found related and divergent sequences.
• With PAM30 we found only related sequences.
[Figure: BLASTP hit lists obtained with PAM30 and with BLOSUM45]
With BLOSUM45 we can discover interesting relations between proteins.
[Figure: BLASTP hit lists obtained with PAM30 and with BLOSUM45]
Mucin-13: a glycosylated membrane protein that protects the cell by binding to pathogens.
Using different scoring matrices can produce slightly different alignments.
[Figure: the alignment with PAM30 and with BLOSUM45]
The same alignment can be resolved in several ways, especially when using a matrix for highly divergent sequences (BLOSUM45).
PSI-BLAST
Position-Specific Iterated BLAST
We will analyze the following uncharacterized archaeal protein:
>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVI
DEREIKKRDIFSLLLGVAGLNKSVEEFENELKNKLTEEAKNK
MENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIM
GSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS
• Threshold for the initial BLAST search (default: 10)
• Threshold for inclusion in PSI-BLAST iterations (default: 0.005)
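The role of the two thresholds can be sketched as two filters over the same hit list, assuming both are E-value cutoffs: the first decides which hits are reported, the second decides which hits feed the position-specific profile for the next iteration. The hit names and E-values below are invented for illustration, and the function names are hypothetical rather than part of any BLAST API.

```python
def reported_hits(hits, report_threshold=10.0):
    """Hits shown in the results: E-value below the search threshold."""
    return [name for name, evalue in hits if evalue < report_threshold]


def profile_hits(hits, inclusion_threshold=0.005):
    """Hits used to build the profile for the next PSI-BLAST iteration."""
    return [name for name, evalue in hits if evalue < inclusion_threshold]


# Invented (name, E-value) pairs.
hits = [("hit_A", 1e-40), ("hit_B", 0.003), ("hit_C", 0.5), ("hit_D", 25.0)]

print(reported_hits(hits))
print(profile_hits(hits))
```

A stricter inclusion threshold keeps doubtful hits out of the profile, which limits the risk of the search drifting toward unrelated proteins over successive iterations.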
[Figure: PSI-BLAST hit list, annotated with the following groups]
• The query itself
• Orthologous sequences in two other archaeal species
• Other homologous sequences
Is MJ0577 a filament protein?
Is MJ0577 a cationic amino acid transporter?
Is MJ0577 a universal stress protein?