CS273_SequenceSimilarity1

Download Report

Transcript CS273_SequenceSimilarity1

Sequence Similarity
Why sequence similarity
 structural similarity
>25% sequence identity  similar structure
 evolutionary relationship
all proteins come from < 2000 (super)families
 related functional role
similar structure  similar function
functional modules are often preserved
Muscle cells and contraction
Actin and myosin during muscle movement
Actin structure
Actin sequence
• Actin is ancient and abundant
 Most abundant protein in cells
 1-2 actin genes in bacteria, yeasts, amoebas
 Humans: 6 actin genes
• -actin in muscles; -actin, -actin in non-muscle cells
• ~4 amino acids different between each version
MUSCLE ACTIN Amino Acid Sequence
1
61
121
181
241
301
361
EEEQTALVCD
GILTLKYPIE
MFETFNVPAM
RDLTDYLMKI
PDGQVITIGN
TTMYPGIADR
DESGPSIVHR
NGSGLVKAGF
HGIITNWDDM
YVAIQAVLSL
LTERGYSFVT
ERFRGPETMF
MQKEITALAP
KCF
AGDDAPRAVF
EKIWHHTFYN
YASGRTTGIV
TAEREIVRDI
QPSFIGMESS
STMKIKIIAP
PSIVRPRHQG
ELRVAPEEHP
LDSGDGVSHN
KEKLCYVALD
GVHETTYNSI
PERKYSVWIG
VMVGMGQKDS
VLLTEAPLNP
VPIYEGYALP
FEQEMATAAS
MKCDIDIRKD
GSILASLSTF
YVGDEAQSKR
KANREKMTQI
HAIMRLDLAG
SSSLEKSYEL
LYANNVLSGG
QQMWITKQEY
A related protein in bacteria
Relation between sequence and structure
A multiple alignment of actins
Gene expression
DNA
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
Biomolecules as Strings
• Macromolecules are the chemical building blocks
of cells
 Proteins
• 20 amino acids
 Nucleic acids
• 4 nucleotides {A, C, G, ,T}
 Polysaccharides
The information is in the sequence
• Sequence  Structure  Function
• Sequence similarity
 Structural and/or Functional similarity
• Nucleic acids and proteins are related by molecular
evolution
 Orthologs: two proteins in animals X and Y that evolved from one
protein in immediate ancestor animal Z
 Paralogs: two proteins that evolved from one protein through
duplication in some ancestor
 Homologs: orthologs or paralogs that exhibit sequence similarity
Protein Phylogenies
• Proteins evolve by both duplication and species
divergence
duplication
orthologs
paralogs
Evolution
Evolution at the DNA level
Deletion
Mutation
…ACGGTGCAGTTACCA…
…AC - - - - CAGTCACCA…
REARRANGEMENTS
Inversion
Translocation
Duplication
SEQUENCE EDITS
Evolutionary Rates
next generation
OK
OK
OK
Changes in non-functional sites are OK, so will be propagated
X
X
Still OK?
Most changes in functional sites are deleterious and will be rejected
Sequence conservation implies function
Proteins between humans and rodents are
on average 85% identical
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings
x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, M in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap
in the other sequence
What is a good alignment?
Alignment:
The “best” way to match the letters of one sequence with
those of the other
How do we define “best”?
Alignment:
A hypothesis that the two sequences come from a
common ancestor through sequence edits
Parsimonious explanation:
Find the minimum number of edits that transform one
sequence into the other
Scoring Function
• Sequence edits:
AGGCCTC
 Mutations
AGGACTC
 Insertions
AGGGCCTC
 Deletions
AGG–CTC
Scoring Function:
Match:
+m
Mismatch:
–s
Gap:
–d
Score F = (# matches)  m – (# mismatches)  s – (#gaps)  d
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible
alignments:
O( 2M+N)
Alignment is additive
Observation:
The score of aligning
x1……xM
y1……yN
is additive
Say that
aligns to
x1…xi
y1…yj
xi+1…xM
yj+1…yN
The two scores add up:
F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Key property: optimal solution to the entire problem is composed of
optimal solutions to subproblems
– Dynamic Programming
Dynamic Programming
Construct a DP matrix F: MxN:
Suppose we wish to align
x1……xM
y1……yN
Let
F(i, j) = optimal score of aligning
x1……xi
y1……yj
Dynamic Programming (cont’d)
Notice three possible cases:
1.
2.
3.
xi aligns to yj
x1……xi-1 xi
y1……yj-1 yj
F(i, j) = F(i-1, j-1) +
m, if xi = yj
-s, if not
xi aligns to a gap
x1……xi-1 xi
y1……yj -
F(i, j) = F(i-1, j) – d
yj aligns to a gap
x1……xi y1……yj-1 yj
F(i, j) = F(i, j-1) – d
Dynamic Programming (cont’d)
• How do we know which case is correct?
Inductive assumption:
F(i, j – 1), F(i – 1, j), F(i – 1, j – 1)
are optimal
Then,
F(i, j) = max
Where
F(i – 1, j – 1) + s(xi, yj)
F(i – 1, j) – d
F(i, j – 1) – d
s(xi, yj) = m, if xi = yj;
-s, if not
Example
x = AGTA
y = ATA
F(i,j)
m= 1
s = -1
d = -1
i=0
0
j=0
1
2
3
4
A
G
T
A
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4, 3) = 2
AGTA
A - TA
The Needleman-Wunsch Algorithm
1.
2.
Initialization.
a.
b.
c.
F(0, 0)
F(0, j)
F(i, 0)
= 0
=-jd
=-id
Main Iteration. Filling-in partial alignments
a.
For each
For each
i = 1……M
j = 1……N
F(i, j)
Ptr(i,j)
3.
= max
=
F(i-1,j-1) + s(xi, yj)
F(i-1, j) – d
F(i, j-1) – d
DIAG,
LEFT,
UP,
Termination. F(M, N) is the optimal score, and
from Ptr(M, N) can trace back optimal alignment
if [case 1]
if [case 2]
if [case 3]
[case 1]
[case 2]
[case 3]
Performance
• Time:
O(NM)
• Space:
O(NM)
• Possible to reduce space to O(N+M) using
Hirschberg’s divide & conquer algorithm
Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences!
How can we quantify the differences
in rates by which one amino acid
replaces another across related proteins?
Substitution Matrices
BLOSUM matrices:
1.
2.
3.
Start from BLOCKS database (curated, gap-free alignments)
Cluster sequences according to > X% identity
Calculate Aab: # of aligned a-b in distinct clusters, correcting by
1/mn, where m, n are the two cluster sizes
4. Estimate
P(a) = (b Aab)/(c≤d Acd);
P(a, b) = Aab/(c≤d Acd)
Gaps are not inserted uniformly
A state model for alignment
M
(+1,+1)
Alignments correspond
1-to-1 with sequences
of states M, I, J
I
(+1, 0)
J
(0, +1)
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
Let’s score the transitions
s(xi, yj)
M
(+1,+1)
s(xi, yj)
Alignments correspond
1-to-1 with sequences
of states M, I, J
-d
-e
I
(+1, 0)
s(xi, yj)
-d
J
(0, +1)
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
-e
A probabilistic model for alignment
Assign probabilities to
every transition (arrow),
and emission (pair of
letters or gaps)
• Probabilities of mutation
reflect amino acid similarities
• Different probabilities for
opening and extending gap
M
(+1,+1)
I
(+1, 0)
J
(0, +1)
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
A Pair HMM for alignments
log(1 – 2)
M
log(1 – )
P(xi, yj)
log 
log 
I
P(xi)
log Prob(xi, yj)
log(1 – )
log 
log 
J
P(yj)
Highest scoring path corresponds to the most likely alignment!
How do we find the highest scoring path?
• Compute the following matrices (DP)
 M(i, j):
 I(i, j):
 J(i, j):
most likely alignment of x1…xi with y1…yj ending in state M
most likely alignment of x1…xi with y1…yj ending in state I
most likely alignment of x1…xi with y1…yj ending in state J
log(1 – 2)
M(i, j) = log( Prob(xi, yj) ) +
max{ M(i-1, j-1) + log(1-2),
I(i-1, j) + log(1-),
J(i, j-1) + log(1-) }
log(1 – )
log 
I(i, j) = max{ M(i-1, j) + log ,
I(i-1, j) + log  }
M
P(xi, yj)
log(1 – )
log 
I
P(xi)
log Prob(xi, yj)
log 
J
P(yj)
log 
The Viterbi algorithm for alignment
• For each i = 1, …, M
 For each j = 1, …, N
M(i, j) = log( Prob(xi, yj) ) +
max {
M(i-1, j-1) + log(1-2),
I(i-1, j) + log(1-),
J(i, j-1) + log(1-)
}
I(i, j) = max { M(i-1, j) + log ,
I(i-1, j) + log 
}
J(i, j) = max {
M(i-1, j) + log ,
I(i-1, j) + log 
}
When matrices are filled, we can trace back from (M, N) the likeliest alignment
One way to view the state paths – State M
y1
x1
……
xm
……
yn
State I
y1
x1
……
xm
……
yn
State J
y1
x1
……
xm
……
yn
Putting it all together
States I(i, j) are
connected with states
J and M (i-1, j)
x1
States J(i, j) are
connected with states
I and M (i-1, j)
……
States M(i, j) are
connected with states
J and I (i-1, j-1)
xm
y1
……
yn
Putting it all together
States I(i, j) are
connected with states
J and M (i-1, j)
x1
States J(i, j) are
connected with states
I and M (i-1, j)
……
States M(i, j) are
connected with states
J and I (i-1, j-1)
Optimal solution is
the best scoring path
from top-left to
bottom-right corner
xm
This gives the likeliest
alignment according
to our HMM
y1
……
yn
Yet another way to represent this model
Ix
BEGIN
Iy
Ix
Iy
Mx1
END
Mxm
Sequence X
We are aligning, or threading, sequence Y through sequence X
Every time yj lands in state xi, we get substitution score s(xi, yj)
Every time yj is gapped, or some xi is skipped, we pay gap penalty