Protein Domain Boundary Prediction

Which model is best?
Paul Yoo
Advanced Networks Research Group
School of Information Technologies
The University of Sydney
What is a Protein Domain?
• Domains can be seen as distinct
functional and/or structural units of a
protein.
• A domain is an independent folding unit of a polypeptide
chain that also carries a specific function.
• Domains are often identified as recurring
(sequence or structure) units, which may
exist in various contexts.
• 1IGR (1998): a structure comprising three domains
- Domain 1: L domain (Magenta)
- Domain 2: Growth factor receptor domain
(Brown)
- Domain 3: L domain (Green)
PDB: 1IGR
Introduction
• Domains provide some of the most valuable
information for the prediction of protein structure,
function, evolution and design.
• Since Anfinsen's (1973) seminal work, many models have
been proposed to predict structure from the
amino acid sequence alone.
• This study,
- Provides an overview of the modeling methods for protein
domain boundary prediction.
- Proposes a new semi-parametric model that can
outperform the existing models.
Motivation
• Accurate prediction of domain boundaries forms a
basis of many types of protein research.
- New proteins, such as chimeric proteins, can be created because
proteins are composed of multifunctional domains (Suyama & Ohara,
2003).
- The search method for templates used in comparative
modeling can also be optimized by the delineation of domain
boundaries (Contreras-Moreira & Bates, 2002).
- Domain boundary prediction can improve the performance of
threading methods by enhancing the signal-to-noise
ratio (Wheelan et al., 2000).
- Accurate identification of domain boundaries for homologous
domains plays a key role in reliable multiple sequence
alignment (Gracy & Argos, 1998).
Problem Statement
• Limitations of experimental tools
• X-ray crystallography
• Nuclear Magnetic Resonance (NMR)
• Costly, time consuming, laborious and inefficient
• 3D coordinates to 1D amino acids
• Assumption: a domain has relatively more contacts
within itself than with residues in the remainder of the
structure.
• High dimensionality of protein data
• Bias and variance dilemma of ANN models
• Long-range dependencies
• Multi-domain benchmark dataset
Outline
• High Dimensionality
• Bias and Variance Dilemma
• Introduction to Improved General Regression
Network
• Experiment 1: ML Models on Benchmark_2
Dataset
• Experiment 2: Domain Predictors on CASP7
Dataset
• Long-Range Information
• Future Work
High Dimensionality
• High dimensionality of protein sequential data
- A window of 10 amino acids represents a search space of 20^10
possibilities and requires a network with 200 inputs.
• Learning in high dimensional space
- Large network training requires large dataset of known
examples.
- Computational complexity
- Overfitting problem
• The performance of ANNs depends on their input data:
- Fewer inputs mean fewer weights to adjust, which gives better
generalization and faster training.
- Beyond a certain point, adding new features can actually lead
to a reduction in the performance of the classification system.
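The input count quoted on this slide can be checked with a small sketch. It assumes a plain one-hot (orthogonal) encoding with 20 inputs per residue, which is one common scheme and is not confirmed by the slides. The window string and the helper name `one_hot_window` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_window(window):
    """Encode a residue window as a flat 0/1 vector, 20 inputs per residue."""
    vector = []
    for residue in window:
        column = [0] * len(AMINO_ACIDS)
        column[AMINO_ACIDS.index(residue)] = 1
        vector.extend(column)
    return vector


window = "MKTAYIAKQR"  # hypothetical 10-residue window
inputs = one_hot_window(window)

# 10 residues x 20 symbols gives the 200 network inputs quoted on the slide.
print(len(inputs))

# Number of distinct 10-residue sequences, i.e. 20^10.
search_space = len(AMINO_ACIDS) ** len(window)
print(search_space)
```

Each extra residue in the window adds 20 inputs while multiplying the sequence space by 20, which is why wider windows need far more training examples.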
Bias and Variance Dilemma
• Bias: measures the extent to which the
estimation function differs from the true function.
• Variance: measures the sensitivity of the
estimation function to the data sample.
• Parametric models tend to have high bias, but
low variance (underfitting).
• Non-parametric models tend to have low bias, but
high variance (overfitting).
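For squared-error loss, the two quantities defined on this slide add up in the standard decomposition below. The notation is an assumption and does not appear on the slides: f is the true function, \hat{f}_D the estimate fitted on a data sample D, and y = f(x) + \varepsilon with zero-mean noise of variance \sigma^2.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The first term is large for models that are too rigid (underfitting), the second for models that follow each data sample too closely (overfitting).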
Bias and Variance Dilemma cont.
Bias and Variance Tradeoff
• Desirable to have both low
bias and low variance, but
they are incompatible.
• Reduce the variance at the
cost of increased bias.
• Need to find a good tradeoff
between the bias and
variance (between the states
of underfitting and overfitting)
• Semi-parametric models have
been shown theoretically to
achieve a good tradeoff.
A New Semi-Parametric Model
• Semi-Parametric modeling
- SP models make assumptions that are stronger than those
of nonparametric models but less restrictive than those of
parametric models.
- They avoid the most serious practical disadvantages of
nonparametric methods, but at the price of an increased risk of
specification error.
• Improved General Regression Network
- Find the optimal trade-off between parametric and nonparametric models.
- Low learning bias and low generalization variance.
• New decision function for IGRN is:
 ( x  xj )T ( x  xj )
 ( x  Qi ( x))T ( x  Qi ( x)) Zi
Z i exp
  exp
2 2
2 2
j 1
A New Semi-Parametric Model cont.
• Reduced Computation
- In the GRNN equation, every training data pair {x_i, y_i}
is incorporated into the architecture.
- In IGRN, each local region of the input space is represented by a
centre vector P_i.
• Semi-Parametric Approximation
- GRNN uses a spherical kernel function as a radial basis
function (Non-Parametric).
- IGRN is more dependent on the Gaussian radial basis
function (Semi-Parametric).
• Applicability of Boosting Method
- High specification error of SP model
- Boosting combines base learners to find better fit for the
training set by maintaining a set of weights over training
samples.
- No parameter tuning
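The reduced-computation idea can be sketched in a few lines. `grnn_predict` is the standard GRNN estimate, a Gaussian-weighted average over every training pair. `centre_predict` is only an assumption about how centre vectors and member counts might replace the full training set, following the usual clustered form of GRNN. It is not confirmed as the IGRN decision function. All names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, centre, sigma):
    """Spherical Gaussian kernel on the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def grnn_predict(x, samples, targets, sigma):
    """Standard GRNN: kernel-weighted average over every training pair."""
    weights = [gaussian_kernel(x, s, sigma) for s in samples]
    return sum(w * y for w, y in zip(weights, targets)) / sum(weights)


def centre_predict(x, centres, counts, centre_targets, sigma):
    """Assumed compact form: one kernel per centre, weighted by member count."""
    weights = [z * gaussian_kernel(x, c, sigma) for c, z in zip(centres, counts)]
    return sum(w * y for w, y in zip(weights, centre_targets)) / sum(weights)


# Toy 1-D data: two tight groups of training points with targets 0 and 1.
samples = [[0.0], [0.1], [1.0], [1.1]]
targets = [0.0, 0.0, 1.0, 1.0]

# Each group collapsed to its mean, with the number of members it replaces.
centres = [[0.05], [1.05]]
counts = [2, 2]
centre_targets = [0.0, 1.0]

full = grnn_predict([0.3], samples, targets, sigma=0.3)
compact = centre_predict([0.3], centres, counts, centre_targets, sigma=0.3)
print(round(full, 3), round(compact, 3))
```

The compact form evaluates one kernel per centre instead of one per training pair, so the cost per prediction scales with the number of centres.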
Experiment 1: ML Models on Benchmark_2
• New multi-domain benchmark dataset (Benchmark_2)
- Contains proteins of known structure for which three methods (CATH
(Pearl et al., 2000), SCOP (Andreeva et al., 2004) and literature) agree on
the assignment of the number of domains.
- Comprises 315 polypeptide chains
No. Domains    No. Chains
1-domain       106
2-domains      140
3-domains      54
4-domains      8
5-domains      5
6-domains      2
- Non-redundant: each combination of topologies occurs only once
- All sequences are taken from Protein Data Bank (PDB)
• Pre-processing
- Position Specific Scoring Matrix (PSSM) using PSI-BLAST
- Secondary Structure (SSpro, Pollastri et al., 2002)
- Solvent Accessibility (ACCpro, Pollastri et al., 2002)
- Domain Linker Index (DomCut, Suyama & Ohara, 2003)
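Per-residue features such as these are usually presented to a network through a sliding window centred on each residue (the experiment compares window sizes 7, 11, 19 and 27). The sketch below shows one way to build such windows. Zero-padding at the chain termini and the function name `sliding_windows` are assumptions, since the slides do not say how the ends are handled.

```python
def sliding_windows(features, size):
    """Return one flattened window per residue.

    features: list of per-residue feature lists (all the same length).
    size: odd window width, so that each window has a central residue.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    width = len(features[0])
    # Assumed convention: pad both termini with all-zero feature rows.
    pad = [[0.0] * width for _ in range(half)]
    padded = pad + features + pad
    return [
        [value for row in padded[i:i + size] for value in row]
        for i in range(len(features))
    ]


# Hypothetical 5-residue chain with 2 features per residue.
features = [[float(i), 1.0] for i in range(5)]
windows = sliding_windows(features, 3)
print(len(windows), len(windows[0]))  # one window per residue, 3 x 2 values each
```

With f features per residue and window size w, every training example has w x f inputs, so the larger windows raise the dimensionality discussed earlier.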
Experiment 1 cont.
Comparison of Prediction Accuracy and Generalization
Variance on Different Window Sizes.

                  IGRN II (%)  IGRN (%)  GRNN (%)  RBFN (%)  FFNN (%)
Win 7             68.3         66.2      67.1      65.4      65.6
Win 11            66.7         65.2      65.1      61.3      63.5
Win 19            67.1         64.5      65.4      62.1      66.6
Win 27            65.5         63.6      63.8      59.7      63.2
Overall Accuracy  67.0         64.9      65.4      62.1      64.7
St. Dev           1.16         1.01      1.36      2.40      1.64
• IGRN shows a higher learning bias than its original model
(GRNN) but a lower generalization variance.
• IGRN II achieves both low learning bias and low
generalization variance.
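The two summary rows of the window-size comparison can be recomputed from the four per-window accuracies. The sketch below assumes that "Overall Accuracy" is the plain mean over window sizes and that "St. Dev" is the sample standard deviation. Neither is stated on the slides, and values recomputed from the rounded entries need not match the reported figures exactly.

```python
import statistics

# Per-window accuracies (%) for window sizes 7, 11, 19 and 27.
accuracy = {
    "IGRN II": [68.3, 66.7, 67.1, 65.5],
    "IGRN": [66.2, 65.2, 64.5, 63.6],
    "GRNN": [67.1, 65.1, 65.4, 63.8],
    "RBFN": [65.4, 61.3, 62.1, 59.7],
    "FFNN": [65.6, 63.5, 66.6, 63.2],
}

summary = {
    model: (statistics.mean(scores), statistics.stdev(scores))
    for model, scores in accuracy.items()
}

for model, (mean, spread) in summary.items():
    # The spread across window sizes is what the slide reads as generalization variance.
    print(f"{model:8s} mean={mean:.2f} sd={spread:.2f}")
```

A small spread means the model is less sensitive to the choice of window size.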
Experiment 2: Domain Predictors on CASP7
• CASP7 Benchmark Dataset
- The most widely known benchmark dataset
- Comprises 94 polypeptide chains
No. Domains    No. Chains
1-domain       66
2-domains      26
3-domains      2
• Different Structural Information
- DOMpro: PSSM, Secondary Structure, and Solvent
Accessibility
- DomPred: Homology (PSSM) and Fold Recognition
(Secondary Structure)
- DomSSEA: Secondary Structure
- DomainDiscovery, IGRN and ML models:
- Position Specific Scoring Matrix (PSSM)
- Secondary Structure
- Solvent Accessibility
- Domain Linker Index
Experiment 2 cont.
Predictive Performance Comparison on CASP7

No of domains         DOMpro  DomPred  DomSSEA  DomainDiscovery  IGRN II  IGRN   GRNN   RBFN   FFNN
1                     84.4    85.9     80.5     80.5             89.6     85.2   88.3   85.2   87.9
2                     0       9.5      19.1     31               25.3     21.9   14.8   11.1   4.3
3                     0       33       33       29.2             7.5      7.5    0      20     0
Overall Accuracy (%)  62.64   66.28    62.06    67.34            70.10    68.1   68.52  65.79  65.69
• IGRN II outperformed the existing domain boundary predictors
on CASP7.
• The structural information used by IGRN and the other ML models was
more useful than the information used by the other predictors.
Experiment 2 cont.
• Comparison of prediction scores simulated by IGRN and
GRNN on a protein chain, CASP7 target number: T0318
(PDB code: 2HB6)
• The protein chain has two domains, and the domain boundary is at
residue 155.
Long-Range Information
• The most notable breakthrough: the exploitation of
evolutionary information (Rost and Sander 1993).
• A machine-learning prediction method (ANNs) was applied to a profile
compiled from multiple sequence alignments.
• Increased the prediction accuracy by 6 to 8 percentage points.
• Overall three-state accuracy of 70.8% for globular proteins.
• Long-range interaction also plays a key role.
• β-sheet regions involve long-range interactions
between amino acids.
• Thus, prediction accuracy for β-strands is lower than that for α-helix or coil.
• Accurate prediction of β-sheet is useful for a variety of
biological problems.
• tertiary structure prediction,
• elucidating folding pathways,
• and designing new proteins.
• The problem is:
• β-sheet formation is seen as a tertiary structure interaction
that brings two or more strands together by hydrogen bonds.
THE STRANDS CAN BE SITUATED FAR APART IN THE AMINO ACID
SEQUENCE!
Future Work
• Further improve the new semi-parametric model
to efficiently capture long-range information.
• At the same time, find or develop a new encoding
scheme or profiles that contain more structural
information.
Thank you!