Transcript Slides

Sequence Similarity
xi
yj
MATCH
INSERT
X
xi
―
PROBCONS: Probabilistic
Consistency-based Multiple
Alignment of Proteins
INSERT
Y
―
yj
A pair-HMM model of pairwise
alignment
xi
yj
x ABRACA-DABRA
y AB-ACARDI---
MATCH
INSERT
X
xi
―
INSERT
Y
―
yj
 Parameterizes a probability distribution, P(A), over all possible
alignments of all possible pairs of sequences
 Transition probabilities ~ gap penalties
 Emission probabilities ~ substitution matrix (from BLOSUM)
Computing Pairwise Alignments
P(α)
αviterbi
P(α | x, y)
• The Viterbi algorithm
 conditional distribution P(α | x, y) reflects model’s uncertainty over the
“correct” alignment of x and y
 identifies highest probability alignment, αviterbi, in O(L2) time
Caveat: the most likely alignment is not the most accurate
 Alternative: find the alignment of maximum expected accuracy
The Lazy-Teacher Analogy
4. F
4. F
4. T
4. F
4. F
B
A-
A
A-
A
4. F
4. F
4. F
4. F
4. T
B-
B+
B+
B-
C
• 10 students take a 10-question true-false quiz
• How do you make the answer key?
 Approach #1: Use the answer sheet of the best student!
 Approach #2: Weighted majority vote!
Viterbi vs. Maximum Expected Accuracy
(MEA)
Viterbi
4. T
A
Maximum Expected Accuracy
4. F
4. F
4. T
4. F
4. F
B
A-
A
A-
A
4. F
4. F
4. F
4. F
4. T
B-
C
B-
B+
B+
• picks single alignment with
highest chance of being
completely correct
• picks alignment with highest
expected number of correct
predictions
• mathematically, finds the
alignment α that maximizes
Eα*[1{α = α*}]
• mathematically, finds the
alignment α that maximizes
Eα*[accuracy(α, α*)]
Computing MEA alignments
• Define
accuracy (α, α*) =
# of correct predicted matches
length of shorter sequence
Eα*(accuracy(α, α*) | x, y) ~ Eα*(∑(xi, yj) in α1((xi, yj) in α*) | x,y)
= ∑α’P(α’ | x, y) ∑(xi, yj) in α 1((xi, yj) in α’)
= ∑(xi, yj) in α ∑α’P(α’ | x, y) 1((xi, yj) in α’)
= ∑(xi, yj) in α P(xi, yj in α’ | x, y)
• Define M[i, j] = posterior probability that xi is aligned to yj
Computing MEA alignments
• Define
# of correct predicted matches
accuracy (α, α*) = length of shorter sequence
• Then, MEA alignment is highest summing path through
the matrix
M[i, j] = P(xi is aligned to yj | x, y)
• M[i, j] = posterior probability that xi is aligned to yj
 Can compute with forward, backward dynamic programming in
O(L2) time
Computing MEA alignments
• Define
accuracy (α, α*) =
# of correct predicted matches
length of shorter sequence
• Then, MEA alignment is highest summing path through
the matrix
M[i, j] = P(xi is aligned to yj | x, y)
• M[I, j] = posterior probability that xi is aligned to yj
 Can compute with forward, backward dynamic programming in
O(L2) time
The consistency signal
zk
z
xi
x
y
yj
yj’
To estimate P(xi  yj | x, y, z)
Method 1:
triplet-HMM
2
XY
1  1   2
1
2
1
2
XZ
1
XYZ
Y
1  1   2
2
1
2
2
1 2  1
1  1   2
1
1
2
1 2  1
2
1
1 2  1
Z
Running time: O(N3L3)
N: # sequences
L: sequence lengths
1
X
P(xi ~ yj | x, y, z)
= ∑k P(xi~yj~zk | x, y, z)
Parameters trained with
unsupervised EM
2
1
1
2
1
YZ
2
2
1
Probabilistic consistency
• Compute P(xi is aligned to yj | x, y)
P(xi is aligned to yj | x, y, z)
• 2 approaches:
 1) Exact – triplet HMM, O(L3) time
 2) Approximate – use independence assumptions
∑k P(xi ~ zk and zk ~ yj | x, y, z) =
∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)  (assume indep.)
∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)
Probabilistic consistency
• Compute P(xi is aligned to yj | x, y, z)
To compute P(xi ~ yj | x, y, z) ~ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)
Notice that for any given i, most entries k and j will be close to 0
-- sparse matrices
Pxy|z  PxzPzy
Finally, let
Pxy|S  1/|S| ∑z in S PxzPzy
Multiple sequence alignment
• A straightforward generalization
 sum-of-pairs
 tree-based progressive alignment
 iterative refinement
ABRACA-DABRA
AB-ACARDI--ABRA---DABI-
ABRACA-DABRA
AB-ACARDI---
ABRACADABRA
ABRA--DABI-
AB-ACARDI--ABRA---DABI-
Multiple sequence alignment
• A straightforward generalization
 sum-of-pairs
 tree-based progressive alignment
 iterative refinement
ABRACA-DABRA
AB-ACARDI--ABRACA-DABRA
ABRA---DABIABRACADABRA
AB-ACARDI--ABRA--DABIABRA---DABI-
ABRACA-DABRA
ABRACA-DABRA
AB-ACARDI--AB-ACARDI--ABRA---DABIABRACA-DABRA
ABRACADABRA
AB-ACARDI--ABRA--DABIABACARDI
ABRACADABRA
ABACARDI
ABRACA-DABRA
AB-ACARD--IABRA---DABIAB-ACARDI--ABRA---DABIABRADABI
Summary of PROBCONS Algorithm
Given K sequences to be aligned,
(1)
Compute M[i, j] for all pairs of sequences, x and y
(2)
Use probabilistic consistency to reestimate M[i, j]
(3)
Build a tree of the sequences by connecting closest first
• “Closest” defined according to expected accuracy
• EA(x, y) = E(accuracy) of MEA alignment of x and y
(4)
Perform progressive alignment along the tree
• Score of a column: sum-of-pairs M[i, j]
(5)
Apply iterative refinement
Training/testing methodology
BAliBASE
PREFAB
SABmark
•
3 reference benchmark sets
•
PROBCONS parameters trained via unsupervised EM on unaligned
sequences from BAliBASE.
•
Quality score:
# of correct predicted matches
Q(α, α*) =
total # of true matches
Evaluation of Algorithm Components
Quality
(74)
Time
(sec)
Viterbi
0.375
0.72
MEA
0.403
1.6
PC (O(L3))
0.431
584.2
PC x 1 (O(L2))
0.422
1.7
PC x 2 (O(L2))
0.427
1.9
Progressive PC x 2 (O(L2))
0.432
1.9
Progressive PC x 2 (O(L2)) + IR
0.435
3.3
Algorithm
all-pairs
pairwise
multiple
Performance of different alignment tools
Algorithm
BAliBASE
(237)
PREFAB
(1932)
SABmark
(698)
Q
t
Q
t
Q
t
Align-m
0.804
19:25
-
-
0.352
56:44
DIALIGN
0.832
2:53
0.572
12:25:00
0.410
8:28
CLUSTALW
0.861
1:07
0.589
2:57:00
0.439
2:16
MAFFT
0.882
1:18
0.648
2:36:00
0.442
7:33
T-Coffee
0.883
21:31
0.636
144:51:00
0.456
59:10
MUSCLE
0.896
1:05
0.648
3:11:00
0.464
20:42
PROBCONS
0.910
5:32
0.668
19:41:00
0.505
17:20
Resources for alignment
Protein Multiple Aligners
http://www.ebi.ac.uk/clustalw/
CLUSTALW
– most widely used
(1994)
http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
MUSCLE
– most scalable
(2004)
http://probcons.stanford.edu/
PROBCONS
– most accurate
(2004)
Some more protein multiple aligners:
MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT,
DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX,
PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP,
etc…….
ProbCons: Chuong (Tom) Do
Profile hidden Markov models for
sequence famillies
PFAM
Protein FAMilies database of alignments
• Profile HMMs describe each family
• For each family in Pfam you can:





Look at multiple alignments
View protein domain architectures
Examine species distribution
Follow links to other databases
View known protein structures
PFAM
Pfam-A – curated multiple alignments
 Grows slowly; quality controlled by experts
Pfam-B – automatic clustering (ProDom derived)
 New sequences instantly incorporated; unchecked
• Search by: Sequence, keyword, domain, taxonomy
• Browsing by family or genome
• Evolutionary tree
• Source of seed alignments:
 Pfam-B families
 Published articles
 ‘Domain hunting' studies
Profile HMMs
D1
BEGIN
I0
D2
I1
M1
Dm-1
Dm
Im-1
M2
Im
END
Mm
Protein Family F
• Each M state has a position-specific pre-computed substitution table
• Each I state has position-specific gap penalties (and in principle can
have its own emission distributions)
• Each D state also has position-specific gap penalties
 In principle, D-D transitions can also be customized per position
Profile HMMs
D1
BEGIN
I0
D2
Dm-1
I1
M1
Dm
Im-1
M2
Im
END
Mm
Protein Family F
transition between match states –
transitions between match and insert states –
transition within insert state –
transition between match and delete states –
αD(i)M(i+1)
 transition within delete state –
 emission of amino acid b at a state S –




αM(i)M(i+1)
αM(i)I(i), αI(i)M(i+1)
αI(i)I(i)
αM(i)D(i+1),
αD(i)D(i+1)
εS(b)
Profile HMMs
D1
BEGIN
I0
Dm-1
D2
I1
M1
Dm
Im-1
M2
Im
END
Mm
Protein Family F
 transition probabilities ~ frequency of a transition in alignment
 emission probabilities ~ frequency of an emission in alignment
 pseudocounts are usually introduced
a kl 
Akl

l'
Akl '
ek (a ) 
E k (a )
E
a'
'
'
(
a
)
k
Alignment of a protein to a profile HMM
To align sequence x1…xn to a profile HMM:
We will find the most likely alignment with the Viterbi DP algorithm
• Define
 VjM(i):
score of best alignment of x1…xi to the HMM ending in xi being
emitted from Mj
 VjI(i):
score of best alignment of x1…xi to the HMM ending in xi being
emitted from Ij
 VjD(i):
score of best alignment of x1…xi to the HMM ending in Dj (xi is
the last character emitted before Dj)
• Denote by qa the frequency of amino acid a in a ‘random’ protein
Alignment of a protein to a profile HMM
• VjM(i) = log (εM(j)(xi) / qxi) +
• VjI(i) = log (εI(j)(xi) / qxi) +
• VjD(i) =
max
Vj-1M(i – 1) + log αM(j-1)M(j)
Vj-1I(i – 1) + log αI(j-1)M(j)
Vj-1D(i – 1) + log αD(j-1)M(j)
max
VjM(i – 1) + log αM(j)I(j)
VjI(i – 1) + log αI(j)I(j)
VjD(i – 1) + log αD(j)I(j)
max
Vj-1M(i) + log αM(j-1)D(j)
Vj-1I(i) + log αI(j-1)D(j)
Vj-1D(i) + log αD(j-1)D(j)
Weight of each sequence
i
h
gf
e
d
c
2
b
3
1
•
a
One simple weighting scheme is to find how much edge length each leaf
contributes
 Example: edge 1 belongs to a
 Example: edge 3 belongs both to a, and to b: e3e1/(e1+e2) goes to a
Δwi = ecurrent wi / (leaves k below e
current
wk)
How to build a profile HMM
Resources on the web
• HMMer – a free profile HMM software
 http://hmmer.wustl.edu/
• SAM – another free profile HMM software
 http://www.cse.ucsc.edu/research/compbio/sam.html
• PFAM – database of alignments and HMMs for protein families and
domains
 http://www.sanger.ac.uk/Software/Pfam/
• SCOP – a structural classification of proteins
 http://scop.berkeley.edu/data/scop.b.html