Using Real-Valued Meta
Classifiers to Integrate Binding
Site Predictions
Yi Sun, Mark Robinson, Rod Adams, Paul Kaye,
Alistair G. Rust, Neil Davey
University of Hertfordshire, 2005
Outline
• Problem Domain
• Description of the Datasets
• Experimental Techniques
• Experiments
• Summary
Problem Domain (1)
• Understanding the regulation of gene expression is currently one of the most exciting and active areas of research in biology.
• Many of the mechanisms of gene regulation are known to act directly at the transcriptional, or sequence, level.
Problem Domain (2)
• Transcription factors bind to a number of different but related sequences, thereby effecting changes in the expression of genes.
• The current state-of-the-art algorithms for transcription factor binding site prediction are, in spite of recent advances, still severely limited in accuracy.
Description of the Datasets (1)
• The original dataset has 68,910 possible binding sites.
• A prediction result from each of 12 algorithms:
  – Single-sequence algorithms (7);
  – Coregulatory algorithms (3);
  – Comparative algorithm (1);
  – Evolutionary algorithm (1).
• The data contain two classes, labelled as either binding sites or non-binding sites, with about 93% being non-binding sites.
Description of the Datasets (2)
Fig. 1. Organisation of dataset, showing alignment of algorithmic predictions,
known information and original DNA sequence data.
Description of the Datasets (3)
Windowing
Fig. 2. The window size is set to 7 in this study. The label of the middle site of the 7 consecutive prediction sites becomes the label of the new windowed input. The length of each windowed input is now 12 × 7 = 84.
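The windowing scheme above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the 12 algorithm predictions for each position are stored row-wise in an array, and flattens each 7-position window into one 84-dimensional input labelled by the centre position.

```python
import numpy as np

def make_windows(predictions, labels, window=7):
    """predictions: (n_positions, 12) array of algorithm outputs;
    labels: (n_positions,) binary array (binding / non-binding)."""
    half = window // 2
    X, y = [], []
    for i in range(half, len(predictions) - half):
        # flatten the 12 predictions over 7 consecutive positions: 84 features
        X.append(predictions[i - half:i + half + 1].ravel())
        # the windowed input takes the label of the middle position
        y.append(labels[i])
    return np.array(X), np.array(y)

preds = np.random.rand(100, 12)
labs = np.random.randint(0, 2, 100)
X, y = make_windows(preds, labs)
print(X.shape)  # (94, 84): 3 positions lost at each end of the sequence
```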
Imbalanced Data
(93% being Non-binding Sites)
Sampling Techniques for Imbalanced Dataset Learning
• For under-sampling, we randomly select a subset of data points from the majority class.
• For over-sampling, we apply the synthetic minority over-sampling technique (SMOTE) proposed by N. V. Chawla et al.:
  – For each pattern in the minority class, we search for its K nearest neighbours within the minority class using Euclidean distance.
  – For continuous features, the difference of each feature between the pattern and its nearest neighbour is taken, multiplied by a random number between 0 and 1, and added to the corresponding feature of the pattern.
  – For binary features, majority voting over each element of the K nearest neighbours' feature vectors is employed.
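The continuous-feature branch of SMOTE described above can be sketched as below. This is a simplified illustration under assumed inputs (a small random minority class), not the paper's implementation; the binary-feature voting branch is omitted.

```python
import numpy as np

def smote(minority, k=5, n_new=None, rng=None):
    """Simplified SMOTE (after Chawla et al.) for continuous features:
    interpolate each synthetic point between a minority pattern and one
    of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n_new = n_new or len(minority)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(minority[:, None] - minority[None, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))          # pick a minority pattern
        j = nn[i, rng.integers(k)]               # pick one of its neighbours
        gap = rng.random()                       # random number in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.random.rand(20, 4)
new_pts = smote(minority, k=3, n_new=40)
print(new_pts.shape)  # (40, 4)
```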
Experimental Techniques
The Classification Techniques
• Majority Voting (MV);
• Weighted Majority Voting (WMV);
• Single Layer Networks (SLN);
• Rule Sets (C4.5-Rules);
• Support Vector Machines (SVM).
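The simplest of these meta classifiers, majority voting, can be sketched in a few lines. The data here are illustrative, not from the paper: a site is predicted as binding when more than half of the base algorithms predict it.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_sites, n_algorithms) binary array of base-algorithm
    outputs; returns the majority-vote prediction per site."""
    votes = predictions.sum(axis=1)
    return (votes > predictions.shape[1] / 2).astype(int)

preds = np.array([[1, 1, 1, 0],   # 3 of 4 say binding -> 1
                  [0, 0, 1, 0],   # 1 of 4 say binding -> 0
                  [1, 1, 0, 1]])  # 3 of 4 say binding -> 1
print(majority_vote(preds))  # [1 0 1]
```

Weighted majority voting follows the same pattern with per-algorithm weights on the votes instead of a plain sum.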
Performance Metrics

Recall = TP / (TP + FN),
Precision = TP / (TP + FP),
F-Score = 2 · Recall · Precision / (Recall + Precision),
Accuracy = (TP + TN) / (TP + FN + TN + FP),
Fp-Rate = FP / (FP + TN).

A confusion matrix:

                 predicted −   predicted +
actual −             TN            FP
actual +             FN            TP
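The five metrics can be computed directly from the confusion-matrix counts; the counts below are illustrative, not from the paper.

```python
def metrics(tp, fn, fp, tn):
    """Compute the five slide metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                       # TP / (TP + FN)
    precision = tp / (tp + fp)                    # TP / (TP + FP)
    f_score = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    fp_rate = fp / (fp + tn)                      # FP / (FP + TN)
    return recall, precision, f_score, accuracy, fp_rate

# Illustrative counts for an imbalanced problem (not from the paper):
print([round(m, 3) for m in metrics(tp=30, fn=70, fp=50, tn=850)])
# [0.3, 0.375, 0.333, 0.88, 0.056]
```

Note that on data this imbalanced, accuracy stays high (0.88) even though recall and precision are low, which is why the F-Score is the more informative summary here.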
Experiments (1)
Consistent Dataset
Table 1: Common performance metrics (%) tested on the same consistent possible binding sites with single and windowed inputs separately.

Inputs    Classifier  Recall  Precision  F-Score  Accuracy  Fp-Rate
single    Best Alg.    40.95      17.46    24.48     82.22    14.66
          MV           43.10      13.14    20.14     75.95    21.57
          WMV          41.19      17.35    24.42     82.05    14.86
          SLN          28.81      22.16    25.05     87.86     7.66
          SVM          32.14      24.46    27.78     88.23     7.52
          C4.5Rules    29.29      23.08    25.81     88.15     7.39
windowed  SLN          34.29      18.87    24.34     85.00    11.16
          SVM          38.81      20.25    26.61     84.93    11.58
          C4.5Rules    23.57      18.64    20.82     87.38     7.79
Experiments (2)
Full Dataset
Table 2: Common performance metrics (%) tested on the full test dataset with single and windowed inputs separately.

Inputs    Classifier  Recall  Precision  F-Score  Accuracy  Fp-Rate
single    Best Alg.    36.36      18.40    24.44     85.97    10.73
          MV           35.73      15.12    21.25     83.48    13.35
          WMV          34.75      20.04    25.42     87.28     9.23
          SLN          25.19      25.09    25.14     90.64     5.01
          SVM          27.91      26.97    27.43     90.79     5.03
          C4.5Rules    23.03      23.14    23.08     90.43     5.09
windowed  SLN          31.82      22.66    26.47     88.97     7.23
          SVM          36.78      23.50    28.67     88.58     7.97
          C4.5Rules    22.26      19.70    20.90     89.49     6.04
Experiments (3)
Fig. 3. ROC graph: five classifiers applied to the
consistent test set with single inputs.
Experiments (4)
Fig. 4. ROC graph: three classifiers applied to the full test set with windowed inputs.
Summary
• By integrating the 12 algorithms, we considerably improve binding site prediction, with the SVM performing best.
• Employing a ‘window’ of consecutive results in the input vector contextualises the neighbouring results, allowing the classifier to exploit the distribution of the data to further improve binding site prediction.