Using the Fisher kernel method to detect remote protein
Download
Report
Transcript Using the Fisher kernel method to detect remote protein
Using the Fisher kernel method to
detect remote protein homologies
Tommi Jaakkola, Mark Diekhams, David
Haussler
ISMB’ 99
Talk by O, Jangmin (2001/01/16)
Abstract
Detecting remote protein homologies
Fisher kernel method
Variant of Support Vector Machines using new kernel
function
Derived from Hidden Markov Models
Introduction (1)
Detecting protein homologies (sequence-based algorithm)
BLAST, Fasta, PROBE, templates, profiles, position-specific
weight matrices, HMM
Comparison by (Brenner 1996; Park et al. 1998)
SCOP classification of protein structures
Remote protein homologies existing between protein domain in
the same structural superfamily.
Statistical models like PSI-BLAST and HMMs are better than
simple pairwise comparison methods.
Introduction (2)
Generative statistical models (HMMs)
Extracting features from protein sequences
Mapping all protein sequences to points in a Euclidean feature
space of fixed dimension.
General discriminative statistical method to classify the
points.
Improvements acquired
Over HMMs alone.
Methods
How generative models work. (HMMs)
Training examples ( sequences known to be members of protein
family ) : positive
Tuning parameters with a priori knowledge
Model assigns a probability to any given protein sequence.
The sequence from that family yield a higher probability than
that of outside family.
Log-likelihood ratio as score
L( X ) log
P ( X | H1 ) P ( H1 )
P( X | H 0 ) P( H 0 )
P ( X | H1 )
P ( H1 )
log
log
P( X | H 0 )
P( H 0 )
Discriminative approaches
Using both positive and negative examples
Parameter is tuned so that the model can optimally
discriminate members of the family from nonmembers.
When training examples are few
Likelihood ratio is optimal if generative models perfectly fit to
data but…
Discriminative methods often performs better.
Kernel methods
Discriminant function L(X)
Where { Xi, i = 1,…,n} and hypothesis class H1, H2
L( X ) log P( H1 | X ) log P( H 0 | X )
K(X , X ) K(X , X )
i: X i H1
i
i
i: X i H 0
i
i
+ : the sequence of the family, - : outside of the family
Contribution of Kernel
i : overall importance of the example Xi.
Measure of pairwise similarity : K(Xi, X)
User supplies the type of kernel for the application area!!
The Fischer kernel (1)
Deriving kernel function from generative models
Advantage 1 : handle variable length protein sequences!!
Advantage 2 : encoding of prior knowledge about protein
sequences
HMMs (difference)
Kernel function specifies a similarity score for any pair of
sequences.
Likelihood score from an HMM only measures the closeness of
the sequence to the model itself.
The Fischer kernel (2)
Sufficient statistics
Each parameter in HMM : Posterior frequencies
Of
particular transition.
Of generating one of the residues of the query sequence.
Reflects the process of generating the query sequence from
HMM.
Alterative of sufficient statistics : Fischer score
U X log P( X | H1 , )
Magnitude of the components : how each contributes to
generating the query sequence.
The Fischer kernel (3)
Kernel function used in this paper.
K(X , X ') e
1
2
T
(
U
U
)
(U X U X ' )
X
X
'
2
note that its fixed vector.
Summary
Train HMM with positive examples.
Map each new protein sequence X into a fixed vector, Fisher
score.
Calculate the kernel function
Get resulting discriminant function (SVM-Fisher)
L( X )
K ( X , X ) K ( X , X )
i: X i H1
i
i
i: X i H 0
i
i
The Fischer kernel (4)
Combination of scores
There might be more than one HMM model for the family or
superfamily of interest.
Average score
Lave ( X )
1
N
L (X )
i
i
Maximum score
Lmax ( X ) max Li ( X )
i
Experimental Methods
Methods
SVM-Fisher (this paper)
BLAST (Altshul et al. 1990; Gish & States 1993)
HMMs using SAM-T98 methodology (Park et al. 1998; Karplus,
Barrett, & Hughey 1998; Hughey & Krogh 1995l 1996)
Measurement of recognition rate for members of
superfamilies of the SCOP protein structure classification
(Hubbard et al. 1997)
Withholding all members of SCOP family
Train with the remaining members of SCOP superfamily
Test with withheld data
Question: “Could the method discover a new family of a known
superfamily?”
Overview of experiments
Database
SCOP version 1.37 PDB90 : consisting of protein domains, no
two of which have 90% of more residue identity
PDB90 eliminates redundant sequences.
Generative models
SAM-T98 HMMs
Data selection
Get 33 test families from 16 superfamilies.
Evaluation strategy
Assessing to what extent it gave better scores to the positive test
examples thant it gave to the negative test examples.
SCOP: a Structural Classification of
Proteins database
Hierachical levels
Family: clustered proteins by common evolutionary origin:
residue identities of above 30%, lower sequence identities but
very similar functions and structures
Superfamily: low sequence identities but probably common
evolutionary origin
Fold: same major secondary structure in the same arrangement
and with the same topological connections
Figure 1: Separation of the SCOP PDB90 database into training and test
sequences, shown for the G proteins test family
Multiple models used
Modeling superfamily
SAM-T98 : starts with a single sequence (the guide sequence
for the domain) and build a model
Too many sequences!
Using a subset of PDB90.
Train SVM-Fisher method using each of models in turn
Details on the training and test sets
All PDB90 sequence outside the fold of the test family
were used as either negative training or negative test
examples.
Reverse test/training allocation of negative examples, and repeat
experiments.
Fold-by-fold basis split of negative examples.
For positive examples
PDB90 sequences in the superfamily of the test family are used.
Homologs found by each individual SAM-T98 model are used.
BLAST methods
WU-BLAST version 2.0a16 (Althcshul & Gish 1996)
PDB90 database was queried with each positive training
examples, and E-values were recorded.
BLAST:SCOP-only
BLAST:SCOP+SAM-T98-homologs
Scores were combined by the maximum method
Generative HMM models
SAM-T98 method
Null model: reverse sequence model
Same data and same set of models as in the SVM-Fisher
Combined with maximum methods
Results
Metric : the rate of false positives (RFP)
RFP for a positive test sequence : the fraction of negative
test sequences that score as good of better than positive
sequence.
G-proteins
The result of the family of the nucleotide triphosphate
hydrolases SCOP superfamily
Test the ability to distinguish 8 PDB90 G proteins from 2439
sequences in other SCOP folds.
Table 1
In SVM-Fisher
5
of the 8 G proteins are better than all 2439 negative test
sequences.
Maximum RFP
Median RFP
Figure 2
RFP curve
Table 1. Rate of false positives for G proteins family. BLAST =
BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAMT-98-homologs, ST98 = SAMT-98, and SVM-F = SVM-Fisher method
Figure 2: 4 methods on the 33 test families. Curve of median RFP
Discussion
New approach
to recognition of remote protein homologies make a
discriminative method built on top of a generative model
(HMMs)
Discriminative method on top of HMM methods
Significant improvement
Combining multiple score would be improved.
Allocation problem
Different training set for tuning HMM and different training set
for discriminative model
Extend the method to identify multiple domains within
large protein sequences