Protein Similarity (sequence)


Sequence Similarity
Andrew Torda, winter semester 2006 / 2007, 00.904 Angewandte …
• What is the easiest information to find about a protein ?
• sequence
• history - amino acid sequencing
• today - initial DNA / mRNA sequencing
• consequence
• lots of sequences
• want to find similar proteins
• not too much overlap with Dr Willhoeft's lectures
• similarity of sequences
21/07/2015 [ 1 ]
Similarity of sequences
• Problem
ACDEACDE..
ADDEAQDE..
• how similar ?
ACDQRSTSRQDCAEACDE..
ADDQRSTSRQDCAEAQDE..
• size counts - with the same number of differences, the longer pair is more similar - why ?
• probabilistically - more chances to mutate
• a measure of (dis)similarity – evolutionary distance
Too Simple Estimate
• Say difference / distance is time t
• Rate of mutation λ
• Few mutations
• A→C but not A→C→A
(OK ?) if P(mutation) = 10⁻²
• sequence length nres
• number mutations nmut
• nmut = t λ nres, so
$$t = \frac{n_{mut}}{\lambda\, n_{res}} \propto \frac{n_{mut}}{n_{res}}$$
• can we do better ?
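A minimal sketch of the simple estimate, applied to the two example pairs from the previous slide. The function name `p_distance` is an illustrative choice, and the sequences are assumed to be already aligned with no gaps.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions that differ, i.e. nmut / nres."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n_mut = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return n_mut / len(seq_a)


# Both pairs differ at two positions, but the longer pair has more residues.
short_pair = p_distance("ACDEACDE", "ADDEAQDE")
long_pair = p_distance("ACDQRSTSRQDCAEACDE", "ADDQRSTSRQDCAEAQDE")
print(f"short pair: {short_pair:.3f}, long pair: {long_pair:.3f}")
```

The same two mutations give a smaller distance for the longer pair, which is the "size counts" point.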
Jukes – Cantor distance
Simplification
• work with 4 base types (like DNA)
Rules and nomenclature
• probability of a specific mutation A→C or G→C
• in time Δt is α
• set α = λ/4
• probability that a site is type A at time t is pAt
• probability of seeing type A at time 1 is pA1
• initial probability at time 0 is pA0 = 1
Jukes – Cantor distance
• probability of change in Δt = 3α
• probability of no change pA1= 1 − 3α
• probability of A→ ? →A in Δt
• α(1 − pAt )
(Fear not: a slower, detailed explanation follows in the exercise session (Übung) of 10 Nov 2006)
• what is the probability of seeing type A at a time t+1 ?
• (no change) + ( A→ ? →A )
• pAt+1= pAt (1 − 3α) + α(1 − pAt )
• what change has occurred in time Δt ?
ΔpAt / Δt = pAt+1 − pAt
= pAt (1 − 3α) + α(1 − pAt ) − pAt
= −4α pAt + α
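The update rule above can be iterated numerically to see where it leads. This is only a sketch: the function name `iterate_p_same` and the value α = 0.01 are illustrative assumptions, not values from the lecture.

```python
def iterate_p_same(alpha: float, steps: int, p0: float = 1.0) -> list:
    """Apply pAt+1 = pAt (1 - 3*alpha) + alpha (1 - pAt) repeatedly.

    Returns the list [pA0, pA1, ..., pA_steps].
    """
    p = p0
    history = [p]
    for _ in range(steps):
        p = p * (1 - 3 * alpha) + alpha * (1 - p)
        history.append(p)
    return history


alpha = 0.01  # illustrative per-step probability of one specific mutation
trajectory = iterate_p_same(alpha, 500)
print(trajectory[1], trajectory[100], trajectory[500])
```

After one step the value is 1 − 3α, and after many steps it settles at 1/4: with four base types and equal rates, all memory of the starting base is eventually lost.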
Jukes – Cantor distance
• ΔpAt / Δt = −4α pAt + α
• we like continuous forms
$$\frac{dp_{At}}{dt} = -4\alpha\, p_{At} + \alpha$$
• what we want is a measure in terms of time t
• like any differential equation
$$\frac{dt}{dp_{At}} = \frac{1}{-4\alpha\, p_{At} + \alpha}$$
$$t = \int \frac{1}{-4\alpha\, p_{At} + \alpha}\, dp_{At}$$
• exercise (Übung): derivation of Jukes-Cantor rates…
Jukes – Cantor distance
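• from
$$t = \int \frac{1}{-4\alpha\, p_{At} + \alpha}\, dp_{At}$$
• we get
$$p_{\text{no change}} = \frac{1}{4} + \frac{3}{4}\, e^{-4\alpha t}$$
$$p_{\text{change}} = \frac{3}{4} - \frac{3}{4}\, e^{-4\alpha t}$$
• but this is for one site
• important: what fraction of sites has changed ?
• estimate time
$$t \propto -\ln\left(1 - \frac{4}{3}\, p_{\text{change}}\right)$$
$$t \propto -\ln\left(1 - \frac{4}{3}\, \frac{n_{mut}}{n_{res}}\right)$$
• with pchange estimated by nmut / nres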
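A sketch of the Jukes-Cantor estimate as a function. It returns the time in units of 1/(4α), that is, the quantity −ln(1 − (4/3) nmut/nres), since the slides only care about relative distances. The name `jukes_cantor_time` and the `n_types` parameter (4 for DNA, 20 for amino acids under the same equal-rates assumption) are illustrative additions.

```python
import math


def jukes_cantor_time(n_mut: int, n_res: int, n_types: int = 4) -> float:
    """Scaled time -ln(1 - p_change / p_max), with p_change = n_mut / n_res.

    p_max = (n_types - 1) / n_types is the fraction of changed sites
    expected after infinite time (3/4 for four base types).
    """
    if n_res <= 0 or not 0 <= n_mut <= n_res:
        raise ValueError("need 0 <= n_mut <= n_res and n_res > 0")
    p_change = n_mut / n_res
    p_max = (n_types - 1) / n_types
    if p_change >= p_max:
        return math.inf  # sequences look no more similar than random ones
    return -math.log(1.0 - p_change / p_max)


for n_mut in (1, 10, 30, 60):
    simple = n_mut / 100
    corrected = jukes_cantor_time(n_mut, 100)
    print(f"nmut={n_mut:3d}  nmut/nres={simple:.2f}  scaled time={corrected:.3f}")
```

For few mutations the result grows almost linearly with nmut/nres; as the fraction of changed sites approaches 3/4 the estimate rises steeply, reflecting the multiple and back mutations that the simple count misses.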
Simplifications made
• We have only worried about relative distances
• no attempt to speak of years
• What is time ?
• generations
• years
• 4 bases for DNA (easy to change to 20 amino acids)
Comments on
• base composition equal at t=0
• a residue can mutate to any other
• gaps / alignment quality
• uniform mutation rates
• some details on these issues…
Base Composition
Not a problem
• think back to slide on integration - constant c
• solved by assuming pA0=1 but could be any value
Different kinds of mutations
• We assumed
• pXY= α for all XY types
• Wrong:
• DNA: A→G not as bad as A→C or A→T
• proteins: some changes easy (D → E) some hard (D → W)
Different kinds of mutations
• can be fixed with more parameters
• simple case DNA
• rate α for purine →purine, β for purine → pyrimidine
• protein:
• 19 different probabilities (for each amino acid type)
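The two-rate DNA case can be sketched as follows. The slide does not name a model; the distance formula used here is the standard Kimura two-parameter correction, swapped in as an assumption, with pyrimidine → pyrimidine changes assumed to share the rate α. The function names are illustrative and the sequences are assumed to be aligned and to contain only A, C, G, T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a: str, seq_b: str):
    """Return (P, Q): fractions of sites differing within a class
    (purine <-> purine or pyrimidine <-> pyrimidine) and between classes."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    within = between = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            within += 1
        else:
            between += 1
    return within / len(seq_a), between / len(seq_a)


def kimura_distance(p_within: float, q_between: float) -> float:
    """Kimura two-parameter distance (an assumed choice of model)."""
    arg1 = 1 - 2 * p_within - q_between
    arg2 = 1 - 2 * q_between
    if arg1 <= 0 or arg2 <= 0:
        return math.inf  # too diverged for the correction to be defined
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)


P, Q = transition_transversion_fractions("AACCGGTT", "GACTGCTA")
print(P, Q, kimura_distance(P, Q))
```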
Gaps
• so far ignored
• more generally
• we have assumed proteins / DNA can be aligned
Gaps and Alignments
• gaps ignored
• more generally - assumption that sequences can be aligned
ACDQRSTSRQDCAEACDE..
ADDQRSTSRQDCAEAQDE..
• but what about
ACDQRATSRQDQRSTSRQ..
ADDQRSTSRQDCAEAQDE..
• or
ACDQRATSRQDQRSTSRQ..
ADDQRSTSRQDCAEAQDE..
• the more distant the sequences, the less reliable the alignment
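A small sketch of why the alignment matters, using the second pair of sequences above. It compares them without gaps at two different relative shifts; the function name `ungapped_matches` and the choice of shifts are illustrative, and real alignment programs also allow gaps inside the sequences.

```python
def ungapped_matches(seq_a: str, seq_b: str, shift: int):
    """Compare seq_a[i + shift] with seq_b[i] over the overlapping region.

    Returns (number of identical positions, length of the overlap).
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    overlap = min(len(seq_a) - shift, len(seq_b))
    if overlap <= 0:
        return 0, 0
    matches = sum(1 for i in range(overlap) if seq_a[i + shift] == seq_b[i])
    return matches, overlap


seq_a = "ACDQRATSRQDQRSTSRQ"
seq_b = "ADDQRSTSRQDCAEAQDE"
for shift in (0, 8):
    matches, overlap = ungapped_matches(seq_a, seq_b, shift)
    print(f"shift {shift}: {matches} identical out of {overlap} aligned positions")
```

With no shift, 9 of 18 positions agree; shifting by 8 gives 8 of 10. Both are plausible alignments, and the number of mutations counted depends on which one is chosen.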
Uniform mutation rates
• Between organisms
• fruit flies have short generations
• bacteria have very short generations
• within one class of organisms rates vary (DNA repair)
• Neglect of
• duplication, transposition, major re-arrangements
• Different proteins mutate at different rates
• essential – DNA copying
• less essential
• copied proteins (haemoglobins)
• Functional changes
• similar proteins in different organisms – different functions
• Within one protein
• some sites conserved, some mutate fast
• Complete neglect of natural selection
Similarity of sequences so far
• For closely related sequences, not many back mutations
• even simple mutation count (nmut/nres) OK
• Better to allow for back mutations
• Jukes-Cantor (and related) models
• can include some statistical properties (base composition)
• can be easily improved to account for other properties
(different types of mutation occur with different frequencies)
• hard to calibrate in real years, but may not matter
• will be less reliable for less related species / proteins
Statistical approach to similarity
• Completely different philosophy
• Are proteins A and B related ?
• how is A related to all proteins (100 000's) ?
• how strong is the AB relation compared to A-everything ?
• What we need
• BLAST / fasta (more in Dr Willhoeft's lectures)
• idea of distributions
• measure of significance
Significance
e-value (expectation value)
• I have a bucket with 10 numbered balls (1 .. 10)
• I pull a ball from the bucket (and replace it afterwards)
• how often will I guess the correct number ?
• e-value = 0.1
• you guess the number and are correct 0.25 of the time
• much more than expected
• what is the probability (p-value) of seeing this by chance ?
• example distribution.. binomial
Binomial example
• we have 100 attempts (n=100)
• probability p=0.1 of success on any attempt
• what is the probability that we are always wrong ?
• P(0) = 0.9 × 0.9 × 0.9 … = 2.7 × 10⁻⁵
• probability that we make one correct guess
• P(1) = 0.1 × 0.9 × 0.9 … +
0.9 × 0.1 × 0.9 … +
0.9 × 0.9 × 0.1 … + … = 3.0 × 10⁻⁴
• P(25) = 9.0 × 10⁻⁶ (my original question)
• P(10) = 0.13 (what you would guess)
Binomial example
• this formula not for exams
• formally
$$P(x) = \binom{n}{x}\, p^{x} (1 - p)^{n - x}$$
x number of successes
n number of trials
p probability per trial
[Figure: P(x) plotted against the number of successes x (0 to 100); vertical axis from 0 to 0.15]
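The numbers in the binomial example can be reproduced with a few lines. This is a sketch using only the standard library; `binomial_pmf` and `binomial_tail` are illustrative names, and the tail sum is added here as one common way to turn "25 or more correct guesses" into a p-value.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_tail(x: int, n: int, p: float) -> float:
    """Probability of x or more successes in n trials."""
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))


n, p = 100, 0.1
for x in (0, 1, 10, 25):
    print(f"P({x}) = {binomial_pmf(x, n, p):.2e}")
print(f"P(25 or more) = {binomial_tail(25, n, p):.2e}")
```

Getting exactly 25 right is very unlikely by chance, and so is getting 25 or more right, which is the sense in which 0.25 is "much more than expected".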
Distributions and sequences
• If I align two proteins, sometimes they will be similar (by chance)
• Take a protein and align to a large database
• there will be a distribution of scores
[Figure: histogram of alignment scores from a database search. Score bins run from 20 to >120; each row shows the number of library sequences in that bin as a bar of "=" characters (one "=" represents 22 library sequences) with "*" markers, and an inset for the tail uses one "=" per library sequence. Annotations on the figure: "very few are radically different", "many sequences match a bit", "these ones are probably related".]
Distributions and sequences
• Can we put numbers on this ?
• model for the distribution
• "extreme value distribution"
• Probability of score S ≥ x
$$P(S \ge x) = 1 - \exp\left(-k\, M N\, e^{-\lambda x}\right)$$
• M and N reflect the sequence lengths
[Figure: the score histogram from the previous slide, repeated]
• one method
• estimate λ and k for each sequence
• alternative
• use a recipe and precalculate λ and k
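A sketch of the extreme value formula as code. The values of k, λ, M and N below are made up purely for illustration; in practice they come from one of the two methods just listed. The function name `evd_pvalue` is also an assumption.

```python
import math


def evd_pvalue(x: float, k: float, lam: float, m: int, n: int) -> float:
    """P(S >= x) = 1 - exp(-k * m * n * exp(-lam * x))."""
    expected_hits = k * m * n * math.exp(-lam * x)
    # expm1 keeps precision when expected_hits is tiny
    return -math.expm1(-expected_hits)


# Hypothetical parameters, chosen only to show the shape of the curve.
k, lam, m, n = 0.1, 0.3, 300, 300
for score in (20, 40, 60, 80, 100):
    print(f"score {score:3d}: P(S >= x) = {evd_pvalue(score, k, lam, m, n):.3g}")
```

Low scores are almost certain to occur by chance, while the probability falls off exponentially for high scores. When the quantity inside the outer exponential is small, the probability is nearly equal to it.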
Distributions and sequences
• derivation later (much later) .. trust for now
• important
• from a database search
• measure says: did this alignment occur by chance ?
• very important for remote sequences
• before (evolutionary distance)
• we have two sequences which we believe are similar
• what is the evolutionary distance ?
• implies that I know the proteins are related
• now (distribution based)
• I claim sequences are related
• what is the probability that I am correct ?
Statistical versus evolutionary
• Is one better ? More correct ? More appropriate ?
• For related sequences…
• evolutionary model has a more rigorous basis
• For distant sequences
• statistical method
• What would you use for a phylogeny ? (Dr Willhoeft's lectures)
• have you heard of phylogenies yet ?
An example phylogeny
• metabolic enzyme from a set of parasites
[Figure: phylogenetic tree, from Kühnl… Liebau, FEBS Journal 272 (2005) 1465–1477]
Statistical versus evolutionary
• When would you use a statistical method ?
• function or structure prediction
• I have a sequence. If I knock out the gene, the organism dies
• it is not obviously similar to anything