Transcript gsridhara_1

Protein homology detection
by HMM–HMM comparison
Johannes Söding
A topic in Sequence analysis
Presented by:
Giriprasad Sridhara
[email protected]
CISC 841 Spring 2006
APR 20 2006
Organization of presentation
•
•
•
•
Introduction
Theory
Results
Conclusion
Introduction
• Paper Details:
– Bioinformatics journal
– Vol. 21 no. 7 2005, pages 951–960
• Author Details
– Dr. Johannes Söding
• Department of Protein Evolution,
MaxPlanckInstitute for Developmental Biology,
Spemannstrasse 35, D72076
Tübingen, Germany
Introduction
• Tool Details:
– A tool HHPred has been developed.
– Described in Nucleic Acid Research, 2005,
Vol 33
– A web server is available at
http://www.protevo.eb.tuebingen.mpg.de/toolk
it/index.php?view=hhpred
Introduction
• Central theme in bioinformatics:
– Homology and sequence alignment
• Issues:
– Problem of finding a close homolog with known
function or structure which would allow to make
inferences about the protein under observation.
– New and highly sensitive methods could detect and
align remotely homologous sequences that provide
information about the protein’s function, structure or
evolution.
Introduction
• Methods (Tools) of homology detection:
(In increasing order of sensitivity)
– Sequence - Sequence
• BLAST
• FASTA
– Profile - Sequence
• PSIBLAST
– More sensitive since it uses a sequence profile
– Profile – Profile
• COMPASS
• PROF_SIM
– Profile - HMM
• HMMER
– HMM-HMM
• HHPred
Introduction
• Sequence profiles
– Built from a multiple alignment of homologous sequences
– Contains more information about the sequence family than a
single sequence.
– Helps to distinguish between
• conserved and non-conserved positions
• Conserved are important for defining members of the family
• Non-conserved are variable among the members of the family.
– Describe exactly
• what variation in amino acids is possible at each position
• Done by recording the probability for the occurrence of each amino
acid along the multiple alignment.
Introduction
• Profile Hidden Markov Models (Profile HMMs)
– Similar to simple sequence profiles
– have amino acid frequencies as in the columns of a MSA
– Also have position specific probabilities for inserts and deletions
along the alignment
– logarithms of these probabilities =position specific gap penalties
– Perform better than sequence profiles in the detection of
homologs and in the quality of alignments
– Why higher sensitivity?
• Position specific gap penalties penalize chance hits much more
than true positives
– which tend to have insertions or deletions at the same positions as the
sequences from which the HMM was built.
Introduction
• Pictorial representation of profile HMM
With M, I and D states.
Theory
• Align 2 HMM by maximizing a score
– Score is log-sum-of-odds score.
• What does a path through the 2 HMMs
Represent?
• A sequence co-emitted by both HMMs
• How do we find this path?
– Use dynamic programing (Viterbi)
– Find path that maximizes log-sum-of-odds
score
Theory
• Advantages of HMM-HMM
– Improves both sensitivity and alignment quality
• Calibrate the score for additional sensitivity
– Use scoring correlation function
– Use secondary structure information
• Even sequences that are distantly homologous will have
similar secondary structures.
• This can help distinguish real homologs from chance hits
• Biologically, secondary structures diverge more slowly than
sequences
• This knowledge is utilized.
Theory
Theory
• Additionally to enhance homology detection
– Score secondary structure
– Use other available additional information (like
confidence – term covered later on in the slides)
• Tool HHPred
– Homology detection & structure prediction
– Novelty
• HMM-HMM comparison
• Scores secondary structure
• Reliability measured by
– Probability of each match being a true positive
– Used since e-values reported by most tools can be inaccurate
Theory (Log-sum-of-odds score)
• Defined as
SLSO  log
 (P (x1, ..., xL | emission on path)  P(x1,..., xL | NULL))
x1,.. xL
• Numerator
– probability that x1,…xL is co-emitted by both HMMs
along the alignment path
• Denominator
– Standard null model probability
• Summation
– Runs over all sequences of L residues that can be emitted
along the alignment path by both HMMs
Theory
• How do we apply Viterbi algorithm?
• Denote
– 2 HMMs p and q
– Probability of emitting amino acid a in match state i or j is qi(a)
and pj(a)
– Trans prob = qi(X, X’) and pj(Y,Y’)
– X or Y can belong to {M, I or D}
– f(a) = fixed background frequency
– Let Xk and Yk be states in q and p in the k’th column of the
alignment of q with p.
– i(k) and j(k) be the corresponding columns from q and p.
– qk(l)P (a) and pk(l)P(a) = emission prob from q and p.
Theory
• Ρ tr is the product of all transition probabilities
for the path through p and q
• qk(l)P (a) = qi(k) (a) for Xk = M
• qk(l)P (a) = f (a) for Xk = I
lL
 ( q
SLSO  log
l 1
x1,.. xL
20
20
L
 log  ...  q
x11
xL 1 l 1
L
20
 log  ( q
a 1
l 1

S
aa
P
k (l )
P
k (l )
P
k (l )
( xl )
( xl )
( xl )
p
p
P
k (l )
p
lL
P
k (l )
P
k (l )
( xl )  Ptr ) /( f ( xl ))
l 1
( xl )  Ptr // f ( xl )
( xl ) / f (a ))  log Ptr
(qi ( k ), pj ( k ))  log Ptr
k : XkYk  MM
20
Saa(qi, pj )  ColumnScore  log  qi (a ) pj (a ) / f (a )
a 1
Theory
• Column score properties:
– Positive when 2 distributions are similar
– Negative otherwise
– Insert states have vanishing column score
• Completely non-conserved, pj(a) = f(a)
– 1/f(a)
• Weight factor to co-emission probability
• For a rare amino acid
– f(a) will be low  1/f(a) will be high
–  Weight of rarer amino acids increases in the score
calculation as compared to common amino acids.
Theory
• Pair-wise alignment of HMMs
• Allowed transitions
• Dynamic programing matrices for Viterbi
Theory
• We use 5 DP matrices S xy one for each
pair state XY belonging to {MM, MI, IM,
DG, GD}
• SMM (i, j) = Saa(qi,pj) + max {
SMM(i-1,j-1) + log[q i-1(M,M) p j-1(M,M)],
SMI(i-1,j-1) + log[q i-1(M,M) p j-1(I,M)]
SIM(i-1,j-1) + log[q i-1(I,M) p j-1(M,M)]
SDG(i-1,j-1) + log[q i-1(D,M) p j-1(M,M)]
SGD(i-1,j-1) + log[q i-1(M,M) p j-1(D,M)] }
Theory
• SMI (i, j) = max {
SMM(i-1,j) + log[q i-1(M,M) p j-1(M,I)],
SMI(i-1,j) + log[q i-1(M,M) p j-1(I,I)] }
• SDG (i, j) = max {
SMM(i-1,j) + log[q i-1(M,D)],
SDG(i-1,j) + log[q i-1(D,D) }
• Initialize SMM(I,0) = 0 = SMM(0,j)
• S LSO = max over last row, col of S MM
• Trace back from this cell.
Theory
• Scoring correlations
– Clustering
• In an alignment of 2 homologous HMMs
– Expect high column scores in
» Clusters along the sequence
• In an alignment of non-homologous HMMs
– Do not Expect any clustering.
– The above can help
• Differentiate homologous and non-homologous alignments
– If l’th pair state of optimum path aligns columns i(l) of q and
j(l) of p
• Sl = SAA(qi(l), pj(l)) iff l’th pair state = MM, else 0.
– Auto-correlation function
g (d ) 
L d
SS
l
l 1
l  d
Theory
• Scoring correlations
– Auto-correlation function describes correlation of Sl at a
fixed sequence separation d
– Expect
• if 2 HMMs are homologous
– A Positive g(d) for small d.
– Add a correction factor
Scorr  wcorr  g ( d )
4
d 1
– wcorr is found empirically to be 0.1
– The correction factor is added after the best alignment
is found.
Theory
• Scoring secondary structure
– Allows to score predicted secondary structure against
• Another predicted secondary structure
• Or a known secondary structure
• Predicted secondary structure vs. known secondary structure.
– DSSP used to assign 1 of 7 states of observed secondary
structure
– PSIPRED used to predict secondary structure states, H, E or C.
– Predict secondary structure of all domains in SCOP (filtered to
twilight zone)
– Compare the PSIPRED predictions with DSSP
– Get the count of combination of (σ;ρ,c).
• σ belongs to {H,E,C,G,B,S,T}
• ρ belongs to {H,E,C}
• c belongs to {0,1,…,9}
Theory
• Scoring secondary structure
– Derive 10 3*7 substitution matrices (one for
each confidence value)
Mss(σ;ρ,c) = log (P (σ;ρ,c)/P(σ)P(ρ,c))
• Let
– Column i of HMM q have pred sec struct ρiq and confidence value
ciq
– Column j of HMM p have known sec struct σjp (Note: known sec
struct  secondary structure of seed seq of alignment)
– Define
• Sss(q I p j) = wss Mss(σjp;ρiq ciq)
• Empirically Wss is 1/7.
• This score is added to amino acid column score Saa(qi, pj) for use
in the Viterbi algorithm.
Theory
• Scoring secondary structure (predicted vs predicted)
M (  , c ,  , c )  log  P(  , c |  ) P(  , c |  ) P( ) / P(  , c ) P(  , c )
ss
q
q
p
p
q
q
p
p
q
q
p
p
i
i
j
j
i
i
j
j
i
i
j
j

• The above matrix informs
– How much more probable is it to get the
predictions ρiq ciq and ρjp cjp for a pair of aligned
homologous residues than to get them independently of
each other.
• Sss(q I p j) = wss Mss(ρiq ciq ρjp cjp)
– Empirically Wss is 1/7.
– This score is added to amino acid column score Saa(qi, pj)
for use in the Viterbi algorithm.
Results and Discussion
• All-against-All comparison with the following similarity search
tools:
– Sequence-Sequence
• BLAST
– Profile-Sequence
• PSI-BLAST
– HMM-Sequence
• HMMER
– Profile-Profile
• COMPASS
• PROF_SIM
• Test
– Input below the twilight zone
– Ability to detect remote homologs
– Ability to give high-quality alignments.
Results and Discussion
• Different versions of tool used for better juxtaposition
of results
– HHSearch 0
• Simple profile-profile comparison
• Gap opening penalty = -3.5, Gap Extension = -0.2
• Above used instead of transition prob log
– HHSearch 1
• Basic HMM-HMM version
– HHSearch 2
• Version 1 + inclusion of correlation score
– HHSearch 3
• Version 2 + usage of predicted vs predicted secondary
structure
– HHSearch 4
• Version 3 + usage of predicted vs known secondary structure
Results and Discussion
• SCOP (structural classification of proteins) database
with filtering for twilight zone used.
• Detection of homologs:
– Domain in SCOP
• Family or superfamily or fold or class
– Pair of domains are homologous
• If they are members of the same super family
• Domains from different classes are classified as nonhomologous
• We present a chart of TP vs FP
– TP  homologous pairs
– FP  non-homologous pairs.
Results and Discussion
• The figure shows classical sensitivity in the
benchmark test.
Results and Discussion
• Alternative definition of TP and FP
– A pair is a TP
• If the domains belong to same SCOP super-family
• Or if the seq based alignment gives structural alignment with a
“maxSub” score of at least 0.1
– A pair is a FP
• If it is from different classes and has 0 MaxSub score
– What is MaxSub score?
• Informally
– Defined such that a value > 0 occurs very rarely by chance
– It tells what fraction of the query residues can be superposed
structurally with the aligned residues from the other structure.
• Formally
– Weighted number of aligned pairs that can be superimposed with
a maximum distance per pair of 3.5 Angstrom units/number of
residues in the query sequence
– Pairs with 0 Angstorm deviation  wieght 1
– Pairs with 3.5 Angstorm deviation  wieght 0.5
Results and Discussion
• Plot of TP vs FP with new definition of TP
and FP
Results and Discussion
• Observation
– More sensitive tools which use secondary
structure (HHSearch 3, 4) improve
– Reason
• Reclassification of “harder to detect” ones as TP helps
the more sensitive tools, since they would detect these.
Results and Discussion
(Alignment quality)
•
Sequence alignment assessed by
– Looking at the spatial distances between aligned pair of residues
• upon superposition of the 3D structures
•
•
2 scores used.
maxSub score
–
Drawback
•
•
Developer’s score
–
–
–
•
S Dev = N correct/min (Lq, Lp)
N Correct = No of residue pairs that are present in the max subset identified by
maxSub
Lq and Lp = No of residues in the 2 sequences to be aligned.
Modeler’s score
–
–
–
•
Does not penalize over-prediction
S Mod = N correct / L ali
L Ali = No of aligned residue pairs in the seq alignment.
Does not penalize under-prediction.
Balanced score
–
–
S balanced = (S dev + S mod) / 2
Penalizes both under and over prediction
Results and Discussion
Results and Discussion
• HHSearch3 performs the best
– Family level
•
•
•
•
•
•
Aligns 58% of all pairs with balanced score >= 0.3
1.23 times more than COMPASS
1.28 times more than PROF_SIM
1.34 times more than HMMER
1.57 times more than PSI_BLAST
4.4 times more than BLAST
– Super family level
•
•
•
•
•
•
Aligns 27% of all pairs with balanced score >= 0.3
1.7 times more than COMPASS
1.9 times more than PROF_SIM
2.2 times more than HMMER
2.9 times more than PSI_BLAST
14 times more than BLAST
Results and Discussion
• HHSearch3 performs the best
– Fold level
•
•
•
•
•
•
Aligns 4.5% of all pairs with balanced score >= 0.3
3.3 times more than COMPASS
6.0 times more than PROF_SIM
7.3 times more than HMMER
9.4 times more than PSI_BLAST
63 times more than BLAST
– Actually 4.5% at fold level is a lot
– Pairs aligned at fold level are deemed nonhomologous by SCOP
– So we do not expect any good alignments at all
Conclusion
• A generalization of HMM – Sequence alignment
– Pairwise alignment of profile HMMs
• Algorithm to maximize log-sum-of-odds score
– Generalization of log-odds score
• Increased sensitivity of 5-10%
– Due to derivation of novel correlation score
• Statistical methods for
– Scoring predicted vs known secondary structure
– Predicted vs predicted secondary structure
– Uses confidence values of secondary structure
prediction
Conclusion
• HHPred
– New tool based on the research paper
• Benchmarking
– With 5 other homology detection tools
– Dataset in twilight zone
• Results
– Improvement in
• Sensitivity
• Alignment quality
Thank you.
Have a nice day!