Using the Fisher kernel method to detect remote protein

Download Report

Transcript Using the Fisher kernel method to detect remote protein

Using the Fisher kernel method to
detect remote protein homologies
Tommi Jaakkola, Mark Diekhams, David
Haussler
ISMB’ 99
Talk by O, Jangmin (2001/01/16)
Abstract



Detecting remote protein homologies
Fisher kernel method
Variant of Support Vector Machines using new kernel
function
 Derived from Hidden Markov Models
Introduction (1)

Detecting protein homologies (sequence-based algorithm)
 BLAST, Fasta, PROBE, templates, profiles, position-specific
weight matrices, HMM

Comparison by (Brenner 1996; Park et al. 1998)
 SCOP classification of protein structures
 Remote protein homologies existing between protein domain in
the same structural superfamily.
 Statistical models like PSI-BLAST and HMMs are better than
simple pairwise comparison methods.
Introduction (2)

Generative statistical models (HMMs)
 Extracting features from protein sequences
 Mapping all protein sequences to points in a Euclidean feature
space of fixed dimension.


General discriminative statistical method to classify the
points.
Improvements acquired
 Over HMMs alone.
Methods

How generative models work. (HMMs)
 Training examples ( sequences known to be members of protein
family ) : positive
 Tuning parameters with a priori knowledge
 Model assigns a probability to any given protein sequence.
 The sequence from that family yield a higher probability than
that of outside family.

Log-likelihood ratio as score
L( X )  log
P ( X | H1 ) P ( H1 )
P( X | H 0 ) P( H 0 )
P ( X | H1 )
P ( H1 )
 log
 log
P( X | H 0 )
P( H 0 )
Discriminative approaches



Using both positive and negative examples
Parameter is tuned so that the model can optimally
discriminate members of the family from nonmembers.
When training examples are few
 Likelihood ratio is optimal if generative models perfectly fit to
data but…
 Discriminative methods often performs better.
Kernel methods

Discriminant function L(X)
 Where { Xi, i = 1,…,n} and hypothesis class H1, H2
L( X )  log P( H1 | X )  log P( H 0 | X )

 K(X , X )   K(X , X )
i: X i H1
i
i
i: X i H 0
i
i
 + : the sequence of the family, - : outside of the family

Contribution of Kernel
 i : overall importance of the example Xi.
 Measure of pairwise similarity : K(Xi, X)

User supplies the type of kernel for the application area!!
The Fischer kernel (1)

Deriving kernel function from generative models
 Advantage 1 : handle variable length protein sequences!!
 Advantage 2 : encoding of prior knowledge about protein
sequences

HMMs (difference)
 Kernel function specifies a similarity score for any pair of
sequences.
 Likelihood score from an HMM only measures the closeness of
the sequence to the model itself.
The Fischer kernel (2)

Sufficient statistics
 Each parameter in HMM : Posterior frequencies
 Of
particular transition.
 Of generating one of the residues of the query sequence.
 Reflects the process of generating the query sequence from
HMM.

Alterative of sufficient statistics : Fischer score
U X   log P( X | H1 , )
 Magnitude of the components : how each contributes to
generating the query sequence.
The Fischer kernel (3)

Kernel function used in this paper.
K(X , X ')  e


1
2
T
(
U

U
)
(U X U X ' )
X
X
'
2
note that its fixed vector.
Summary
 Train HMM with positive examples.
 Map each new protein sequence X into a fixed vector, Fisher
score.
 Calculate the kernel function
 Get resulting discriminant function (SVM-Fisher)
L( X ) 
 K ( X , X )   K ( X , X )
i: X i H1
i
i
i: X i H 0
i
i
The Fischer kernel (4)

Combination of scores
 There might be more than one HMM model for the family or
superfamily of interest.

Average score
Lave ( X ) 

1
N
 L (X )
i
i
Maximum score
Lmax ( X )  max Li ( X )
i
Experimental Methods

Methods
 SVM-Fisher (this paper)
 BLAST (Altshul et al. 1990; Gish & States 1993)
 HMMs using SAM-T98 methodology (Park et al. 1998; Karplus,
Barrett, & Hughey 1998; Hughey & Krogh 1995l 1996)

Measurement of recognition rate for members of
superfamilies of the SCOP protein structure classification
(Hubbard et al. 1997)
 Withholding all members of SCOP family
 Train with the remaining members of SCOP superfamily
 Test with withheld data
 Question: “Could the method discover a new family of a known
superfamily?”
Overview of experiments

Database
 SCOP version 1.37 PDB90 : consisting of protein domains, no
two of which have 90% of more residue identity
 PDB90 eliminates redundant sequences.

Generative models
 SAM-T98 HMMs

Data selection
 Get 33 test families from 16 superfamilies.

Evaluation strategy
 Assessing to what extent it gave better scores to the positive test
examples thant it gave to the negative test examples.
SCOP: a Structural Classification of
Proteins database

Hierachical levels
 Family: clustered proteins by common evolutionary origin:
residue identities of above 30%, lower sequence identities but
very similar functions and structures
 Superfamily: low sequence identities but probably common
evolutionary origin
 Fold: same major secondary structure in the same arrangement
and with the same topological connections
Figure 1: Separation of the SCOP PDB90 database into training and test
sequences, shown for the G proteins test family
Multiple models used

Modeling superfamily
 SAM-T98 : starts with a single sequence (the guide sequence
for the domain) and build a model
 Too many sequences!
 Using a subset of PDB90.
 Train SVM-Fisher method using each of models in turn
Details on the training and test sets

All PDB90 sequence outside the fold of the test family
were used as either negative training or negative test
examples.
 Reverse test/training allocation of negative examples, and repeat
experiments.
 Fold-by-fold basis split of negative examples.

For positive examples
 PDB90 sequences in the superfamily of the test family are used.
 Homologs found by each individual SAM-T98 model are used.
BLAST methods

WU-BLAST version 2.0a16 (Althcshul & Gish 1996)
 PDB90 database was queried with each positive training
examples, and E-values were recorded.
 BLAST:SCOP-only
 BLAST:SCOP+SAM-T98-homologs
 Scores were combined by the maximum method
Generative HMM models

SAM-T98 method
 Null model: reverse sequence model
 Same data and same set of models as in the SVM-Fisher
 Combined with maximum methods
Results


Metric : the rate of false positives (RFP)
RFP for a positive test sequence : the fraction of negative
test sequences that score as good of better than positive
sequence.
G-proteins

The result of the family of the nucleotide triphosphate
hydrolases SCOP superfamily
 Test the ability to distinguish 8 PDB90 G proteins from 2439
sequences in other SCOP folds.

Table 1
 In SVM-Fisher
5
of the 8 G proteins are better than all 2439 negative test
sequences.
 Maximum RFP
 Median RFP

Figure 2
 RFP curve
Table 1. Rate of false positives for G proteins family. BLAST =
BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAMT-98-homologs, ST98 = SAMT-98, and SVM-F = SVM-Fisher method
Figure 2: 4 methods on the 33 test families. Curve of median RFP
Discussion

New approach
 to recognition of remote protein homologies make a
discriminative method built on top of a generative model
(HMMs)
 Discriminative method on top of HMM methods
 Significant improvement


Combining multiple score would be improved.
Allocation problem
 Different training set for tuning HMM and different training set
for discriminative model

Extend the method to identify multiple domains within
large protein sequences