Hidden Markov models for detecting remote protein homologies


Kevin Karplus, Christian Barrett,
Richard Hughey
Georgia Hadjicharalambous
Brief outline
• Description and evaluation of a new hidden
Markov model method, SAM-T98, for finding
remote homologs of protein sequences.
• Evaluation with three fold-recognition test
datasets and a curated database.
• Comparison with WU-BLASTP and DOUBLE-BLAST
• Results
1. Biology background
– Homologs: in genetics, chromosomes carrying the same genetic
loci; for proteins, sequences that share a common evolutionary ancestor.
– The structure of a protein can be predicted by using its
homology to sequences for which the structure is
known.
– Similar structures → similar functions
→ Proteins can be classified into families with similar
functions.
→ Remote-homolog detection
2. Statistics background
• Hidden Markov Models
Q = set of states = {match, insert, delete}
V = output alphabet = {20 amino acids}
π(i) = probability of being in state i at time t=0
A = transition probabilities = {a_ij},
where a_ij = Pr[entering state j at time t+1 | in state i
at time t]
B = output probabilities = {b_j(k)},
where b_j(k) = Pr[producing v_k at time t | in state j
at time t]
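A minimal sketch of how the quantities Q, V, π, A and B fit together, assuming a toy two-state model in which every state emits a symbol (a real profile HMM also has silent delete states, which this sketch does not handle). The class name, function name and all probabilities below are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class HMM:
    states: list    # Q
    alphabet: list  # V
    pi: dict        # pi[i] = Pr[in state i at time t=0]
    A: dict         # A[i][j] = Pr[state j at t+1 | state i at t]
    B: dict         # B[j][k] = Pr[producing symbol k | state j]


def forward_probability(hmm, observations):
    """Probability of the observations, summed over all state paths."""
    if not observations:
        return 1.0
    # alpha[s] = Pr[observations so far, currently in state s]
    alpha = {s: hmm.pi[s] * hmm.B[s][observations[0]] for s in hmm.states}
    for symbol in observations[1:]:
        alpha = {
            j: sum(alpha[i] * hmm.A[i][j] for i in hmm.states) * hmm.B[j][symbol]
            for j in hmm.states
        }
    return sum(alpha.values())


# Toy model (assumed numbers): two states over a two-letter alphabet.
toy = HMM(
    states=["s1", "s2"],
    alphabet=["A", "G"],
    pi={"s1": 0.5, "s2": 0.5},
    A={"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.2, "s2": 0.8}},
    B={"s1": {"A": 0.7, "G": 0.3}, "s2": {"A": 0.1, "G": 0.9}},
)
print(forward_probability(toy, ["A", "G"]))
```

Because each row of A and B sums to 1, the probabilities of all sequences of a fixed length add up to 1, which is a quick sanity check on any hand-built model.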
• HMMs as profile models
In genetics, homologs are chromosomes carrying the same genetic
loci; a diploid cell has 2 copies of each homolog, one
derived from each parent. For proteins, homologs are
sequences that share a common evolutionary ancestor.
A profile of a protein family is a labeling of the
positions of the amino acids in the secondary structure
and a probability distribution for each position.
The structure of a protein can be predicted by using its
homology to sequences for which the structure is
known. Proteins with similar structure are assumed to have
similar function → classification of proteins into families
according to their function.
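The "probability distribution for each position" in a profile can be illustrated with a small sketch that counts amino acids in each column of a multiple alignment. The function name `column_profile`, the add-one pseudocount and the toy alignment are assumptions for illustration only; this is not the estimator used by SAM.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def column_profile(alignment, pseudocount=1.0):
    """Per-column amino-acid distributions from aligned sequences.

    `alignment` is a list of equal-length strings; gap characters such
    as '-' are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment (assumed): three sequences, three columns, one gap.
profile = column_profile(["ACD", "ACE", "A-E"])
print(round(profile[0]["A"], 3), round(profile[2]["E"], 3))
```

The pseudocount keeps unseen amino acids from getting probability zero, which matters when a family is represented by only a few sequences.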
Typical profile HMM:
• A chain of match, insert and delete states.
• Specific probabilities are assigned to all transitions
between nodes, and character costs to the
match and insert states.
• BEST PATH: a single path from ‘Start’
to ‘End’ in which each character of the sequence is assigned to
a successive match or insert state
along the path.
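One standard way to find a single best path is the Viterbi algorithm; the slides do not say which procedure was used, so the following is only a sketch. It assumes a simplified model in which every state emits a character (no silent delete states, no Start/End states), and all names and probabilities are made up for illustration.

```python
import math


def viterbi(states, pi, A, B, observations):
    """Return (log probability, path) of the most probable state path.

    `observations` must be non-empty.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s] = best log probability of any path ending in state s
    score = {s: log(pi[s]) + log(B[s][observations[0]]) for s in states}
    back = []  # one dict of back-pointers per later position
    for symbol in observations[1:]:
        prev, score, pointers = score, {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] + log(A[i][j]))
            score[j] = prev[best_i] + log(A[best_i][j]) + log(B[j][symbol])
            pointers[j] = best_i
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Toy model (assumed numbers): the match state strongly prefers 'L'.
states = ["match", "insert"]
pi = {"match": 0.9, "insert": 0.1}
A = {"match": {"match": 0.8, "insert": 0.2},
     "insert": {"match": 0.6, "insert": 0.4}}
B = {"match": {"L": 0.95, "G": 0.05},
     "insert": {"L": 0.5, "G": 0.5}}
log_p, path = viterbi(states, pi, A, B, "LGL")
print(path)
```

In this toy case the unexpected 'G' is explained by a detour through the insert state, which is the behaviour the best-path description above relies on.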
Example
For this work:
• A local alignment procedure was used: it relates
part of the sequence to one contiguous path
through part of the HMM.
• An HMM is trained on sequences that are members
of a protein family; the resulting HMM
identifies the positions of amino acids
that describe the structure of the family.
→ This HMM is then used to discriminate family
members from non-members.
TEST SETS
Fold recognition datasets
• FSSP: based on a protein classification tree (Holm
and Sander, 1996, 1997) → presents a
continuously updated structural classification of 3-dimensional protein folds (sequences of 1050
leaves of the FSSP tree, 166 target sequences).
• Uses DALI to determine structural homology.
• Classification: z-score > 6 → homologs
z-score < 2 → non-homologs
2 < z-score < 6 → ambiguous (may or may not be homologs)
• Even with the best possible classifier, about
2% of the non-self pairs represent
homologies to be detected.
• At the minimum-error point for an optimal
classifier, there are about 1.4% homolog
pairs.
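The FSSP z-score rule above can be written as a small labeling function. This is a sketch: the function name and the treatment of scores exactly equal to 2 or 6 (placed in the ambiguous class) are assumptions, since the slides only give the strict inequalities.

```python
def classify_pair(z_score, homolog_cutoff=6.0, non_homolog_cutoff=2.0):
    """Label a sequence pair from its DALI z-score.

    Scores above `homolog_cutoff` count as homologs, scores below
    `non_homolog_cutoff` as non-homologs; everything in between
    (including the boundary values, by assumption) is ambiguous.
    """
    if z_score > homolog_cutoff:
        return "homolog"
    if z_score < non_homolog_cutoff:
        return "non-homolog"
    return "ambiguous"


for z in (7.5, 4.0, 1.0):
    print(z, classify_pair(z))
```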
• SCOP (Structural Classification of Proteins)
• 2 test sets (Brenner, 1996; Park et al.,
1997): identical lists for both target lists
• Database of known folds
• Homologous pair: if both sequences were in
the same SCOP superfamily.
• No 2 sequences had >40% sequence
similarity.
• Whole-chain test set: 0.6% correct
homologies
• Domain test set: the same
Sequence comparison dataset
• Pearson: curated version of the PIR
database (Barker et al., 1990).
• 12 216 sequences total
• Set of 67 target sequences
• 0.4% were considered correct
• PIR (Protein Information
Resource) families: close homologs → the Pearson test
is for close-homolog classification, NOT
remote-homolog detection.
ALGORITHMS
• WU-BLAST (BLAST: Basic Local Alignment Search Tool):
• Protein sequence search software
• E (expected number of false positives) set to 10
• Log of the P-value is reported as the score to
threshold.
• The optimum threshold never corresponded to a
P-value > 0.005.
• DOUBLE-BLAST: inspired by ISS (intermediate sequence search; Park et
al., 1997). ISS was used to recognize
remote, evolutionarily related sequence
pairs derived from the SCOP database. It
was found to increase detection compared
to FASTA.
• Two-step approach: 1. A set of close
homologs to the target sequence is found in
NRP. 2. Each homolog is used as a query
to search the final database.
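The two-step approach can be sketched as follows. This is not the DOUBLE-BLAST implementation: `double_search` is a hypothetical name, `search` is a hypothetical callable standing in for a BLAST run, and keeping the best (lowest) score per hit is an assumed way of combining results that the slides do not specify. The k-mer based `toy_search` exists only to make the example runnable.

```python
def double_search(target, search, intermediate_db, final_db):
    """Two-step search through intermediate sequences.

    `search(query, database)` is a hypothetical callable returning
    {sequence_id: score}, where lower scores are better (for example
    a log P-value). Databases map sequence_id -> sequence.
    """
    # Step 1: close homologs of the target in the intermediate database.
    close_homologs = search(target, intermediate_db)
    # Step 2: each homolog is used as a query against the final database.
    combined = {}
    for homolog_id in close_homologs:
        hits = search(intermediate_db[homolog_id], final_db)
        for seq_id, score in hits.items():
            # Assumption: keep the best score seen for each hit.
            if seq_id not in combined or score < combined[seq_id]:
                combined[seq_id] = score
    return combined


def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_search(query, database):
    """Stand-in for a real search: score = -(number of shared 3-mers)."""
    hits = {}
    for seq_id, seq in database.items():
        shared = len(kmers(query) & kmers(seq))
        if shared:
            hits[seq_id] = -shared
    return hits


intermediate = {"i1": "CCCGGG", "i2": "TTTTTT"}
final = {"f1": "GGGTTA", "f2": "WWWWWW"}
print(toy_search("AAACCC", final))                           # direct search
print(double_search("AAACCC", toy_search, intermediate, final))
```

In the toy data the target shares nothing with `f1` directly, but the intermediate sequence `i1` links the two, which is the effect the intermediate-sequence idea is meant to exploit.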
• SAM-T98: starting from a single target sequence, finds and
multiply aligns a set of homologs and creates an
HMM from that multiple alignment. The resulting
HMM is then used for database search.
• Uses the SAM package.
• When the database is small, the method is used to create an HMM
for each sequence in the database.
• For the fold-recognition tests, HMMs were created for all
sequences.
• For the Pearson test, HMMs were created only for the 67 target sequences.