
University of Tehran
College of Engineering
School of Electrical and Computer Engineering
Protein Secondary Structure
Classifier Fusion
By: Majid Kazemian
Advisor: Prof. Behzad Moshiri
Co-Advisor: Prof. Caro Lucas
April 2006
Agenda
• Introduction
• Problem statement
• Expertness assessment of protein secondary structure prediction engines
• Meta classifier
  – Definition and general architecture
  – Classifier types
  – Combining approaches
  – Decision profile
  – Accuracy measures
  – Results
• Data fusion simulation tool (SDF tool)
• Conclusion and further research
Introduction
• Protein
  – A large, complex molecule composed of amino acids.
  – Proteins are essential to the structure, function, and regulation of the body.
  – Examples are hormones, enzymes, and antibodies.
• Amino acid
  – The fundamental building block of a protein molecule.
  – There are 20 different amino acids commonly found in proteins.

[Figure: general amino acid structure: a central α-carbon (Cα) bonded to an amino group, a carboxyl group, a hydrogen atom, and a side chain R]
Introduction (cont.)
•
Proteins are complex macromolecules whose structure can be
discussed in levels of:
– Primary (the sequence of amino acids that are linked together)
– Secondary (alpha helices, beta sheets, random coils)
– Tertiary (the way the random coils, alpha helices and beta
sheets fold in respect to each other)
– Quaternary (packing of the protein subunits to form the final
protein complex)
Structure
Function
The accumulation of protein sequences especially
after the beginning of whole genome sequencing
project is in stark contrast to the much slower
experimental determination of protein structures.
4
Problem statement
• There is a variety of approaches to protein structure prediction, each with its own strengths and weaknesses.
• One of the most useful approaches is to determine secondary structural elements as a first step toward 3D structure determination. There are three main classes of secondary structural elements in proteins: alpha helices (H), beta strands (E), and irregular structures (turns or coils, shown as C), so prediction engines can be regarded as structural classifiers.
• Goals
  – Provide unified access to data for users (biologists).
  – Achieve a higher overall accuracy than any of the individual classifiers.
  – Provide a standard tool for fusion system simulation and prototyping.
Expertness assessment of protein
secondary structure prediction engines
• The first level
  – Deals with the prediction accuracy of the whole secondary structure content of proteins, regardless of the prediction accuracy of each secondary structure class.
  – The Q3 criterion is one of the most important criteria at this level.
  – This level is known as the global measurement level.
• The second level
  – Describes the prediction accuracy in each secondary structure class separately.
  – This level allows analyzing prediction methods in more detail than the previous level.
  – The QH, QE, and QC criteria are introduced at this level.
Expertness assessment of protein secondary
structure prediction engines (cont.)
• The third level
  – Evaluates the prediction accuracy of each secondary structure class based on the amino acid index.
  – Reveals more details of prediction accuracy than the second level mentioned above.

$$Q^{aa}_{index} = \frac{\text{number of correctly predicted amino acids } aa \text{ in structure } index}{\text{number of amino acids } aa \text{ in structure } index} \times 100$$

with $index \in \{\text{Helix}, \text{Beta Strand}, \text{Coil}\}$ and $aa \in$ list of 20 amino acids.

CASP (Critical Assessment of Techniques for Protein Structure Prediction) is commonly referred to as a competition for protein structure prediction taking place every two years. CASP provides research groups with an opportunity to assess the quality of their methods for protein structure prediction from the primary structure of the protein.
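As an illustration, here is a minimal Python sketch of this third-level criterion (function and variable names are hypothetical; `seq`, `obs`, and `pred` are assumed to be equal-length strings, with structures over {H, E, C}):

```python
from collections import Counter

def q_aa_index(seq, obs, pred):
    """Third-level accuracy Q^aa_index: for each (amino acid, structure)
    pair, the percentage of residues aa observed in that structure that
    were predicted correctly."""
    total, correct = Counter(), Counter()
    for aa, o, p in zip(seq, obs, pred):
        total[(aa, o)] += 1
        if o == p:
            correct[(aa, o)] += 1
    return {key: 100.0 * correct[key] / n for key, n in total.items()}

q = q_aa_index("AAGV", "HHEC", "HEEC")
print(q[("A", "H")])  # 50.0: one of the two helical alanines was predicted H
```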
Expertness assessment of protein secondary
structure prediction engines (cont.)
• Single global propensity
  – Single Global Propensity (SGP) shows the tendency or reluctance of a certain amino acid to occur in a certain secondary structure.
  – A large SGP value of an amino acid for a specific secondary structure means that it tends to take that structure in comparison with the other secondary structures.

$$SGP^{aa}_{index} = \frac{n^{aa}_{index} \,/\, n^{all}_{index}}{n^{aa}_{dataset} \,/\, n^{all}_{dataset}}$$

To calculate single global propensity, we used the Hobohm and Sander representative set of protein chains with pairwise homology less than 25%. Those structures are resolved by X-ray crystallography at a resolution better than 3 Å.
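A minimal sketch of how SGP could be computed from such a dataset (hypothetical names; `residues` is assumed to be the list of (amino acid, structure) pairs pooled over the whole dataset):

```python
def sgp(residues, aa, index):
    """SGP^aa_index = (n^aa_index / n^all_index) / (n^aa_dataset / n^all_dataset):
    the frequency of amino acid `aa` within structure `index`, relative to
    its frequency in the whole dataset."""
    n_all_dataset = len(residues)
    n_aa_dataset = sum(1 for a, _ in residues if a == aa)
    n_all_index = sum(1 for _, s in residues if s == index)
    n_aa_index = sum(1 for a, s in residues if a == aa and s == index)
    return (n_aa_index / n_all_index) / (n_aa_dataset / n_all_dataset)

# SGP > 1 means the amino acid is over-represented in that structure
data = [("A", "H"), ("A", "H"), ("G", "C"), ("V", "E"), ("A", "C")]
print(sgp(data, "A", "H"))  # 1.67: alanine favors helices in this toy set
```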
Expertness index assessment
[Figure: per-amino-acid prediction accuracy for helix (H), strand (E), and coil (C) for the CASP4 engines listed below, plotted over the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y, with the SGP_H, SGP_E, and SGP_C profiles shown for comparison]

Engine Index   Name                 Method   CASP 4 Q3
031            BioInfo.PL           BASIC    79.73
111            SAM-T99              HMM      78.14
150            Chandonia-Cohen      NN+F     72.63
151            Pred2ary/Chandonia   NN       72.74
258            PSIPRED              NN       78.49
385            Prof-server          NN       75.66

CASP4 (2000)
Expertness index assessment
[Figure: per-amino-acid prediction accuracy for helix (H), strand (E), and coil (C) for the CASP5 engines listed below, plotted over the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y, with the SGP_H, SGP_E, and SGP_C profiles shown for comparison]

Engine Index   Name              Method      CASP 5 Q3
001            SAM-T02-human     NN+HMM      79.91
020            Bujnicki-Janusz   Consensus   81.79
023            Baldi-SSpro       RNN         79.61
068            Jones-NewFold     NN+F        79.83
072            PSIPRED           NN          79.96
100            SBI               NN          79.60
108            CaspIta           Consensus   81.04
357            PROF/Rost         NN          77.72
393            MacCallum         GP+NN       79.97
517            GeneSilico        Consensus   80.77

CASP5 (2002)
Analysis results
• Less accuracy in beta strand prediction
• The harmony between SGP values and prediction accuracy
• Different prediction behavior

Structure   CASP 4 Average   CASP 4 Std. Dev.   CASP 5 Average   CASP 5 Std. Dev.
H           0.7768           0.0974             0.8287           0.0812
E           0.5916           0.1296             0.6973           0.1285
C           0.7578           0.0801             0.7654           0.0680

[Figure: the 20 amino acids (P A G V I L T Q Y M S N D C W H K E R F) arranged along a hydrophobic-to-polar axis]
Meta Classifiers
• A meta classifier combines the results of multiple classifiers at once, returning more accurate and less confusing results.
• A meta classifier can improve classification quality in many ways:
  – Accuracy and efficiency
  – A single classification result (the "one-click" paradigm)

How do we combine the individual classifiers?

Data fusion deals with the synergistic combination of information made available by various knowledge sources, such as sensors and databases, in order to provide a better understanding of a given scene or environment.
Architecture of meta-classifier
[Diagram: general meta-classifier architecture. In the current environment, data are hard to find, hard to understand, hard to reconcile, and hard to analyze. A user query is routed by a personalized knowledge dispatcher to N prediction engines (BCSE 1, BCSE 2, ..., BCSE N), each with its own pre-processing. The engines' outputs, together with statistics and expertness information, enter the fusion block; after post-processing, the result is displayed to the user, who provides feedback. A simplified version of this architecture is used here to produce biological insights.]

The eight DSSP* states are reduced to three classes:

DSSP state   H   G   I   E   B   T   S   ' '
Class used   H   H   H   E   E   L   L   L

* Dictionary of Secondary Structure of Proteins
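A minimal sketch of this eight-to-three reduction (the mapping follows the table above; the loop class is written L, as on the slide):

```python
# DSSP 8-state -> 3-state reduction: H, G, I -> helix (H);
# E, B -> strand (E); T, S, ' ' -> loop (L)
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H",
             "E": "E", "B": "E",
             "T": "L", "S": "L", " ": "L"}

def reduce_dssp(states):
    """Map a string of DSSP states to the three-class alphabet."""
    return "".join(DSSP_TO_3.get(s, "L") for s in states)

print(reduce_dssp("HGIEBTS "))  # -> "HHHEELLL"
```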
Types of Classifiers & Combination
Approaches
• Type I classification is a simple statement of the class (abstract-level outputs) → majority voting.
• Type II classification is a ranked list of probable classes (rank-level outputs) → Borda voting.
• Type III classification assigns probabilities to each class (measurement-level outputs) → any fusion technique.

Most protein secondary structure classifiers are Type I classifiers.

How do we extract measurement-level outputs from abstract-level classifications?
Measurement level from Abstract level Classifications
• Confusion matrix (a posteriori probabilities of each classification)
• A confusion matrix is a matrix in which the actual class of a datum under test is represented by the matrix row, and the classification of that particular datum is represented by the confusion matrix column.

Confusion matrix for a classifier:

                  Classified as H   Classified as E   Classified as C
Actual class H    1600              100               300
Actual class E    200               1200              600
Actual class C    100               500               1400

• The normalized values of column i of classifier Dj's confusion matrix are taken as the decision vector of classifier Dj whenever Dj outputs class i as its classification label.
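A minimal sketch of this label-to-measurement conversion using the example matrix above (hypothetical names; rows are actual classes H, E, C and columns are predicted classes):

```python
import numpy as np

# Confusion matrix of one classifier D_j: rows = actual class (H, E, C),
# columns = predicted class (H, E, C)
CM = np.array([[1600, 100, 300],
               [200, 1200, 600],
               [100, 500, 1400]])
CLASSES = "HEC"

def decision_vector(predicted_label):
    """Soft (Type III) label for an abstract (Type I) output: the
    normalized column of the confusion matrix for the predicted class."""
    col = CM[:, CLASSES.index(predicted_label)].astype(float)
    return col / col.sum()

print(decision_vector("H"))  # [0.842 0.105 0.053]: support for H, E, C
```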
Decision Profile
• Let D1, D2, D3, ..., DL be the set of our protein secondary structure classifiers and Ω = {H, E, C} be the set of class labels, which are the main secondary structural elements. The output of each secondary structure classifier can be taken as a three-dimensional vector:

$$D_i(x) = \left[ d_{i,H}(x),\; d_{i,E}(x),\; d_{i,C}(x) \right], \qquad d_{i,j}(x) \in [0, 1]$$

where $d_{i,j}(x)$ is the degree of support given by classifier Di to the hypothesis that x comes from class j. The decision profile is the matrix of soft labels:

$$DP(x) = \begin{bmatrix} d_{1,H}(x) & d_{1,E}(x) & d_{1,C}(x) \\ d_{2,H}(x) & d_{2,E}(x) & d_{2,C}(x) \\ d_{3,H}(x) & d_{3,E}(x) & d_{3,C}(x) \\ d_{4,H}(x) & d_{4,E}(x) & d_{4,C}(x) \\ d_{5,H}(x) & d_{5,E}(x) & d_{5,C}(x) \end{bmatrix}$$

The third row is the output of classifier D3(x); the second column is the support from classifiers D1, D2, ..., D5 for class E.
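For concreteness, a minimal numpy sketch of such a decision profile (support values are hypothetical):

```python
import numpy as np

# Decision profile DP(x): rows = classifiers D1..D5, columns = classes H, E, C
DP = np.array([[0.84, 0.11, 0.05],
               [0.70, 0.20, 0.10],
               [0.60, 0.25, 0.15],
               [0.75, 0.10, 0.15],
               [0.55, 0.30, 0.15]])

row_D3 = DP[2, :]      # output of classifier D3(x): [d_3H, d_3E, d_3C]
support_E = DP[:, 1]   # support from D1..D5 for class E
print(row_D3, support_E)
```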
Paradigms for designing a classifier
combination system
[Diagram: four paradigms, each shown as a combiner over classifiers 1, ..., i, ..., L]

A. Different combination schemes
B. Different classifier models
C. Different feature subsets
D. Different training sets
Classifier fusion methods
Taxonomy by type of outputs:

• Class labels: Plurality [Hansen 1990], Majority [Lam 1997], Unanimity, Naïve Bayes [Xu 1992], BKS [Huang 1995]
• Rank: Borda voting, Weighted Borda voting
• Decision profile
  – Class-indifferent: Brute force [Kuncheva 2000b], Wernecke [Wernecke 1992], Stacked generalization [Wolpert 1992], Dempster-Shafer [Rogova 1994], Decision templates [Kuncheva 2001]
  – Class-conscious
    – Linear models: fixed weights [Tresp 1995, Verikas 1999]; data-dependent weights, either soft weights (Mixture of Experts [Jordan 1995]) or crisp weights (classifier selection [Kuncheva 2000a])
    – Nonlinear models: order statistics (minimum, maximum, median), product, geometric mean [Chibelushi 1999, Hashem 1997]; fuzzy integral [Verikas 1999]; neural networks [Gader 1996]; OWA [Kuncheva 1997a]
Techniques of Data Fusion
• Redundant information (e.g., two sources observing the same quantity)
  – Conventional approaches: OWA, Bayesian approaches, Dempster-Shafer method, Kalman filter
• Complementary information
  – Knowledge-based systems and intelligent approaches: neural networks, fuzzy logic (integral operators)
Ordered Weighted Averaging (Yager 1988)

An OWA operator is a mapping $F: \mathbb{R}^n \to \mathbb{R}$:

$$OWA(x_1, x_2, \dots, x_n) = \sum_{j=1}^{n} w_j\, x_{\sigma(j)}$$

where σ is a permutation that orders the elements: $x_{\sigma(1)} \ge x_{\sigma(2)} \ge \dots \ge x_{\sigma(n)}$. The weights are all non-negative ($w_i \ge 0$) and their sum equals one ($\sum_{i=1}^{n} w_i = 1$).

• Provides a parameterized family of aggregation operators:
  – Maximum, minimum
  – Median, k-order statistics
  – Arithmetic mean
  – ...
• A parameterized way to go from the min to the max:

$$\min(x_1, \dots, x_n) \le OWA(x_1, \dots, x_n) \le \max(x_1, \dots, x_n)$$

$$orness(w_1, w_2, \dots, w_n) = \frac{1}{n-1} \sum_{i=1}^{n} (n - i)\, w_i$$
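A minimal sketch of the OWA operator and its orness in Python (function names are hypothetical):

```python
import numpy as np

def owa(x, w):
    """OWA(x_1..x_n) = sum_j w_j * x_sigma(j): the weights apply to the
    inputs sorted in descending order, not to particular sources."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return np.sort(x)[::-1] @ w

def orness(w):
    """orness = 1/(n-1) * sum_i (n-i)*w_i: 1 for the maximum, 0 for the minimum."""
    n = len(w)
    return sum((n - i) * wi for i, wi in enumerate(w, start=1)) / (n - 1)

x = [0.2, 0.9, 0.5]
print(owa(x, [1, 0, 0]), orness([1, 0, 0]))   # 0.9 1.0 -> behaves as max
print(owa(x, [0, 0, 1]), orness([0, 0, 1]))   # 0.2 0.0 -> behaves as min
print(owa(x, [1/3, 1/3, 1/3]))                # arithmetic mean
```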
Exponential class of OWA

• Satisfies a given degree of maxness (orness)
• Proposed by Filev 1998:
  – Optimistic weights:
    $w_1 = \alpha;\; w_2 = \alpha(1-\alpha);\; w_3 = \alpha(1-\alpha)^2;\; \dots;\; w_{n-1} = \alpha(1-\alpha)^{n-2};\; w_n = (1-\alpha)^{n-1}$
  – Pessimistic weights:
    $w_1 = \alpha^{n-1};\; w_2 = (1-\alpha)\alpha^{n-2};\; w_3 = (1-\alpha)\alpha^{n-3};\; \dots;\; w_{n-1} = (1-\alpha)\alpha;\; w_n = 1-\alpha$

The parameter α belongs to the unit interval [0, 1] and is related to the orness value for a given number of sources.
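A minimal sketch generating Filev's optimistic weights and checking their orness (hypothetical names; `orness` is the same helper as in the previous sketch):

```python
def filev_optimistic(alpha, n):
    """Optimistic exponential OWA weights:
    w_i = alpha * (1 - alpha)**(i - 1) for i < n, w_n = (1 - alpha)**(n - 1);
    they sum to 1 by construction (telescoping)."""
    w = [alpha * (1 - alpha) ** (i - 1) for i in range(1, n)]
    w.append((1 - alpha) ** (n - 1))
    return w

def orness(w):
    """orness = 1/(n-1) * sum_i (n-i)*w_i."""
    n = len(w)
    return sum((n - i) * wi for i, wi in enumerate(w, start=1)) / (n - 1)

w = filev_optimistic(0.7, 4)
print(w, sum(w))   # weights sum to 1
print(orness(w))   # larger alpha -> larger orness (alpha = 1 gives the max)
```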
Alpha-Orness

[Figure: the relation between the parameter α and the resulting orness value, for different numbers of sources n]
OWA operator for classifier fusion
(1) Pick L OWA coefficients $b = [b_1, \dots, b_L]^T$ such that $\sum_{i=1}^{L} b_i = 1$, where L = number of classifiers.

(2) For k = 1, 2, ..., c:
  a. Sort $d_{i,k}(x)$, i = 1, 2, ..., L, in descending order, so that $a_1 = \max_i d_{i,k}(x)$ and $a_L = \min_i d_{i,k}(x)$.
  b. Calculate the support for class $\omega_k$:

$$\mu_k(x) = \sum_{i=1}^{L} b_i\, a_i$$

Kuncheva 2001: Combining classifiers: Soft computing solutions
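A minimal sketch of this procedure over the columns of a decision profile (hypothetical names and values):

```python
import numpy as np

# Decision profile: rows = classifiers D1..D5, columns = classes H, E, C
DP = np.array([[0.84, 0.11, 0.05],
               [0.70, 0.20, 0.10],
               [0.60, 0.25, 0.15],
               [0.75, 0.10, 0.15],
               [0.55, 0.30, 0.15]])

def owa_fusion(dp, b):
    """For each class k: sort the L supports d_{i,k}(x) in descending
    order and take the b-weighted sum (step 2 of the algorithm)."""
    dp, b = np.asarray(dp, float), np.asarray(b, float)
    sorted_cols = -np.sort(-dp, axis=0)   # each column in descending order
    return b @ sorted_cols                # mu_k for k = H, E, C

mu = owa_fusion(DP, b=[0.4, 0.3, 0.2, 0.1, 0.0])
print(mu, "HEC"[int(np.argmax(mu))])      # fused supports and winning label
```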
Fuzzy Integral Operator
[Diagram: relations between various aggregation operators and the fuzzy integral: the Sugeno and Choquet integrals generalize OWA, order statistics (min, median, max), the arithmetic mean, and the weighted sum]
Fuzzy integral for classifier fusion
(1) Fix the L fuzzy densities $g^1, g^2, \dots, g^L$, e.g., by setting $g^j$ to the estimated probability of correct classification of $D_j$.

(2) Calculate λ as the root greater than −1 (λ ≠ 0) of the equation:

$$\lambda + 1 = \prod_{j=1}^{L} \left(1 + \lambda\, g^j\right)$$

(3) For a given x, sort the kth column of DP(x) to obtain $\left[ d_{i_1,k}(x),\, d_{i_2,k}(x),\, \dots,\, d_{i_L,k}(x) \right]^T$ in descending order.

(4) Sort the fuzzy densities correspondingly, i.e., $g^{i_1}, g^{i_2}, \dots, g^{i_L}$, and set $g(1) = g^{i_1}$.

(5) For t = 2 to L, calculate recursively

$$g(t) = g^{i_t} + g(t-1) + \lambda\, g^{i_t}\, g(t-1)$$

(6) Calculate the final degree of support for class $\omega_k$ by

$$\mu_k(x) = d_{i_L,k}(x) + \sum_{j=2}^{L} \left[ d_{i_{j-1},k}(x) - d_{i_j,k}(x) \right] g(j-1)$$

Kuncheva 2001: Combining classifiers: Soft computing solutions
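A minimal sketch of steps (2)-(6) (hypothetical names; finding the λ root numerically with scipy is an assumption, not part of the slide):

```python
import numpy as np
from scipy.optimize import brentq

DP = np.array([[0.84, 0.11, 0.05],   # rows = classifiers, cols = H, E, C
               [0.70, 0.20, 0.10],
               [0.60, 0.25, 0.15],
               [0.75, 0.10, 0.15],
               [0.55, 0.30, 0.15]])

def sugeno_lambda(g):
    """Root lambda > -1, lambda != 0, of: lambda + 1 = prod_j (1 + lambda*g_j)."""
    f = lambda lam: np.prod([1.0 + lam * gj for gj in g]) - lam - 1.0
    # the nontrivial root is positive when sum(g) < 1, in (-1, 0) when sum(g) > 1
    return brentq(f, 1e-9, 1e6) if sum(g) < 1 else brentq(f, -1 + 1e-9, -1e-9)

def fuzzy_integral_fusion(dp, g):
    """Support mu_k for each class via steps (3)-(6) above."""
    dp, g = np.asarray(dp, float), np.asarray(g, float)
    lam = sugeno_lambda(g)
    mu = []
    for k in range(dp.shape[1]):
        order = np.argsort(-dp[:, k])          # sort supports descending
        d, gs = dp[order, k], g[order]         # matching fuzzy densities
        G = [gs[0]]                            # g(1) = g^{i_1}
        for t in range(1, len(gs)):            # step (5) recursion
            G.append(gs[t] + G[-1] + lam * gs[t] * G[-1])
        mu.append(d[-1] + sum((d[j-1] - d[j]) * G[j-1] for j in range(1, len(d))))
    return np.array(mu)

g = [0.74, 0.75, 0.75, 0.74, 0.73]             # e.g., estimated engine accuracies
print(fuzzy_integral_fusion(DP, g))
```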
Dempster-Shafer Evidential Reasoning
• For k = 1, 2, ..., c, calculate the support for class $\omega_k$ from the Dempster combination rule:

$$\mu_k(x) = \frac{\prod_{1 \le i \le L} d_{i,k}(x)}{\prod_{1 \le i \le L} d_{i,k}(x) \;+\; \prod_{1 \le i \le L} \left(1 - d_{i,k}(x)\right)}$$
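A minimal sketch of this per-class combination as reconstructed above (treating each class against its complement; hypothetical names and values):

```python
import numpy as np

# Decision profile: rows = classifiers, columns = classes H, E, C
DP = np.array([[0.84, 0.11, 0.05],
               [0.70, 0.20, 0.10],
               [0.60, 0.25, 0.15],
               [0.75, 0.10, 0.15],
               [0.55, 0.30, 0.15]])

def dempster_fusion(dp, eps=1e-9):
    """Per-class Dempster combination:
    mu_k = prod_i d_{i,k} / (prod_i d_{i,k} + prod_i (1 - d_{i,k}))."""
    dp = np.clip(np.asarray(dp, float), eps, 1 - eps)  # avoid 0/0
    pro = dp.prod(axis=0)         # joint support for class k
    con = (1 - dp).prod(axis=0)   # joint support against class k
    return pro / (pro + con)

mu = dempster_fusion(DP)
print(mu, "HEC"[int(np.argmax(mu))])
```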
Some of the common classifiers
Protein Secondary Structure classifiers
Server       Location                                    Prediction method
APSSP2       Institute of Microbial Technology, INDIA    EBL + neural network
PROFSEC      Columbia University, USA                    Profile-based neural network
PSIPRED      University College London, UK               Neural network
SAMT99_SEC   University of California, Santa Cruz, USA   Hidden Markov model
SSPRO2       University of California, Irvine, USA       Recurrent neural network

Dataset statistics

Dataset name: EVA: Common subset 1
Date of publication: Cumulative results from 1999 to 2002/10
Number of proteins: 30
Number of residues: More than 4000
Web address: http://cubic.bioc.columbia.edu/eva/sec_2002_10/common.html#common_set1
Meta Classifier Architecture
[Diagram: meta classifier architecture. The five engines (APSSP2, PROFSEC, PSIPRED, SAMT99_SEC, SSPRO2) each classify the dataset, producing decision profiles DP(1), ..., DP(5). The fusion operator combines these into a fused DP with confidences for H, E, and C; argmax over the fused confidences gives the predicted class.]
Some of the Accuracy Measures
$$Q_3 = 100 \times \frac{1}{N_{res}} \sum_{i=1}^{3} M_{ii}$$

$$Q_i^{\%obs} = 100 \times \frac{M_{ii}}{obs_i}$$

[Diagram: 3×3 confusion matrix over classes H, E, C (classified vs. actual), with the entries a11 ... a33 annotated to show the TP, FP, FN, and TN regions used below]

$$Specificity = \frac{TN}{TN + FP} \qquad Sensitivity = \frac{TP}{TP + FN}$$

$M_{ij}$ = number of residues observed in state i and predicted in state j, where i, j ∈ {H, E, C}.
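A minimal sketch of these measures from a 3×3 confusion matrix (hypothetical names; the per-class one-vs-rest TP/FP/FN/TN convention is an assumption drawn from the annotated matrix above):

```python
import numpy as np

M = np.array([[1600, 100, 300],   # rows = observed H, E, C
              [200, 1200, 600],   # cols = predicted H, E, C
              [100, 500, 1400]])

def q3(M):
    """Q3 = 100 * sum_i M_ii / N_res: overall three-state accuracy."""
    return 100.0 * np.trace(M) / M.sum()

def q_obs(M, i):
    """Q_i^%obs = 100 * M_ii / obs_i: per-class accuracy over observed residues."""
    return 100.0 * M[i, i] / M[i, :].sum()

def sens_spec(M, i):
    """Sensitivity TP/(TP+FN) and specificity TN/(TN+FP) for class i,
    treating the other two classes as negatives."""
    tp = M[i, i]
    fn = M[i, :].sum() - tp
    fp = M[:, i].sum() - tp
    tn = M.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

print(q3(M), q_obs(M, 0), sens_spec(M, 0))  # 70.0, 80.0, (0.8, 0.925)
```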
The results of EVA1 dataset prediction by
common selected engines
             Q3      Qh%obs   Qe%obs   Qc%obs   Specificity   Sensitivity
APSSP2       74.49   78.00    65.65    77.01    87.04         74.08
PROFSEC      74.71   75.38    74.48    74.05    87.12         74.24
PSIPRED      74.78   78.53    68.25    75.67    87.18         74.36
SAMT99_SEC   74.63   82.60    63.12    75.06    87.13         74.27
SSPRO2       73.58   78.14    62.79    76.45    86.60         73.21

[Figure: bar chart (70-90%) of accuracy, specificity, and sensitivity for the five selected engines]
Results
[Figure: bar chart (0.70-0.90) of sensitivity, specificity, and accuracy for the four fusion methods and the five individual engines]

                  Sensitivity   Specificity   Accuracy
Choquet           0.767152256   0.883576128   0.7702
OWA               0.759633459   0.879816729   0.7647
Dempster-Shafer   0.757518797   0.878759398   0.7642
Majority voting   0.757753759   0.87887688    0.7568
APSSP2            0.740836466   0.870418233   0.7449
PROFSEC           0.742481203   0.871240602   0.7471
PSIPRED           0.743656015   0.871828008   0.7478
SAMT99_SEC        0.742716165   0.871358083   0.7463
SSPRO2            0.732142857   0.866071429   0.7358
Results
[Figure: bar chart (65-85%) of Q3 and per-class accuracies for the fusion methods and PSIPRED]

                  Q3      H       E       C
Choquet           77.02   84.43   72.58   73.43
OWA               76.47   84.31   73.08   71.92
Dempster-Shafer   76.42   81.12   71.52   75.81
Majority voting   75.68   80.17   66.37   77.68
PSIPRED           74.78   78.53   68.25   75.67
Sensor/Data Fusion Design Pattern
• Today it is difficult to imagine the task of system development without the help of simulation.
• Simulation is a very important means in all fields of science and engineering.
• It provides a better understanding of the function, testing and finding of critical states and regions of operation, and fast optimization of systems and control.
• Patterns help you build on the collective experience of skilled system engineers.
• They capture existing, well-proven experience in system development and help to promote good design practice.
• Every pattern deals with a specific, recurring problem in the design or implementation of a system.
Design patterns
• System-level design requirements
  – Describe system architecture
  – Model different levels
  – Model different components
  – Simulate and test
  – Re-use design
Data Fusion Design Pattern & Simulation Tool
• Motivations
  – Lack of standardization
  – Various common modules found in fusion algorithms
  – Reduce programming effort (save time and energy)
• Intent
  – Development of high-level graphical system modeling using powerful modeling environments
• Functionalities offered by this view
  – Rapid and accurate system design
  – Modularity, which means that new components can be easily added
  – Reusability and ease of use, which are the most important aims of design patterns
Data Fusion Design Pattern & Simulation Tool
• Applicability
  – Defense, robotics, medical engineering, air and space
  – Soft computing, agriculture, process control, fire control systems
  – Biology, bioinformatics
• Implementation
  – Simulink platform using the Matlab S-function technique
    – MATLAB: an interactive software package to model, simulate, and analyze dynamic systems
    – SIMULINK: a graphical, mouse-driven program that allows systems to be modeled by drawing a block diagram on the screen
Data Fusion Design Pattern & Simulation Tool
• Inter-relationship
  – Each block has one output, an input vector, and a set of parameters.
  – Each block converts the input vector to an output.
  – Each block's output can be used as an input to other blocks.

[Diagram: interactive link (wrapper) between VTB and Matlab/Simulink]
Data Fusion Design Pattern & Simulation Tool
• Structure
  – 10 functional blocks, each performing specific tasks related to data fusion
• Participants
  – Continuous-output fusion subsystem
  – Label-output fusion subsystem
• Consequences
  – A general framework for designing, developing, and applying data fusion algorithms
  – Substantially reduces the amount of time and energy needed for algorithm implementation
Conclusion
• Analysis
  – A more accurate perception of the weak points in secondary structure prediction methods can lead us to enhance these methods more objectively.
  – The results showed that the third level of expertness study gives us such perception and exposes more hidden facts about prediction methods than before.
• Fusion
  – This approach is a step toward a meta-classifier or meta-predictor concept, which can obtain better results than each individual classifier.
  – It has been shown that the performance of a meta classifier system can be better than that of each individual classifier; such systems can also provide unified access to data for users.
• SDF Tool
  – Provides a convenient and intuitive framework for designing and developing data fusion algorithms.
  – Substantially reduces the amount of time needed for algorithm implementation and avoids undetectable human errors.
Further research
• Analysis
  – The only two non-hydrophobic, non-polar amino acids, proline and glycine, showed unusual prediction behavior. It seems that extending the current evaluation criteria based on the physico-chemical properties of amino acids could yield a more objective perception of prediction methods.
  – A conceptual view of the weaker results of beta strand prediction compared with the other secondary structure classes suggests that if there were a suitable engine focused on this specific class, fusing it with the previous engines could improve the overall results.
  – Moreover, we can develop a prediction algorithm focusing on proline and glycine and then perform decision-level fusion between this engine's results and those of the other engines to obtain better overall results.
• Fusion
  – Classifier selection, in which the most expert classifier is selected to contribute the decision.
  – Providing better identifiers to convert Type I classifications into Type III classifications.
  – Incorporating the confidence of each decision separately (WOWA).
  – Publishing the meta-classifiers as an open-access web service.
• SDF Tool
  – Extending the fusion algorithms and blocks.
  – Developing standard models for data fusion.
  – Deriving the hardware design from the highest level of design (e.g., UML).
THANK YOU FOR
YOUR ATTENTION