
A Study on Feature Selection for Toxicity Prediction*
Gongde Guo1, Daniel Neagu1 and Mark Cronin2
1 Department of Computing, University of Bradford
2 School of Pharmacy and Chemistry, Liverpool John Moores University
* EPSRC Project: PYTHIA – Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, Grant Reference: GR/T02508/01
Outline of Presentation
1. Predictive Toxicology
2. Feature Selection Methods
3. Relief Family: Relief, ReliefF
4. kNNMFS Feature Selection
5. Evaluation Criteria
6. Toxicity Dataset: Phenols
7. Evaluation I: Toxicity
8. Evaluation II: Mechanism of Action
9. Conclusions
Predictive Toxicology
• The goal of predictive toxicology is to describe the relations between the chemical structure of a molecule and biological and toxicological processes (Structure-Activity Relationship, SAR) and to use these relations to predict the behaviour of new, unknown chemical compounds.
• Predictive toxicology data mining comprises the steps of data preparation; data reduction (including feature selection); data modelling; prediction (classification, regression); and evaluation of results and further knowledge discovery tasks.
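As a rough illustration of these steps chained together, here is a minimal sketch in Python with scikit-learn, used only as a stand-in for the study's actual tools; the descriptor matrix and endpoint below are random placeholders, not the phenols data:

# Illustrative pipeline: data preparation -> feature selection -> modelling -> evaluation.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 173))   # placeholder for 173 descriptors of 250 compounds
y = rng.normal(size=250)          # placeholder for the toxicity endpoint

pipeline = Pipeline([
    ("scale", StandardScaler()),                  # data preparation
    ("select", SelectKBest(f_regression, k=20)),  # data reduction / feature selection
    ("model", LinearRegression()),                # data modelling / prediction
])

# evaluation of results via cross-validation
scores = cross_val_score(pipeline, X, y, cv=10, scoring="r2")
print(f"10-fold CV r2: {scores.mean():.4f}")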
Feature Selection Methods
Feature selection is the process of identifying and removing as much of the irrelevant and redundant information as possible.
Seven feature selection methods (Witten and Frank, 2000) are involved in our study:
1. GR – Gain Ratio feature evaluator;
2. IG – Information Gain ranking filter;
3. Chi – Chi-squared ranking filter;
4. ReliefF – ReliefF feature selection;
5. SVM – SVM feature evaluator;
6. CS – Consistency Subset evaluator;
7. CFS – Correlation-based Feature Selection.
In this work, however, we focus on the drawbacks of the ReliefF feature selection method and propose the kNNMFS feature selection method.
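For intuition about what a ranking filter of this kind computes, the sketch below scores each descriptor against the class with mutual information, a close relative of the IG filter; scikit-learn is used as a stand-in for the Weka evaluators, and the data are hypothetical:

# Rank features by mutual information with the class label,
# an analogue of the Information Gain ranking filter.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(250, 173))    # placeholder descriptor matrix
y = rng.integers(0, 4, size=250)   # placeholder class labels (e.g. 4 mechanisms of action)

scores = mutual_info_classif(X, y, random_state=1)
top20 = np.argsort(scores)[::-1][:20]   # indices of the 20 highest-ranked descriptors
print(top20)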
Relief Feature Selection Method
The Relief algorithm works by randomly sampling an instance and locating its nearest neighbours from the same and the opposite class. The feature values of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature.
[Figure: Relief with K = 1, showing the nearest hit and nearest miss of a sampled instance. Open issues: sensitivity to noise; how to set m; how to choose the m sampled instances?]
Relief Feature Selection Method
Algorithm Relief
Input: for each training instance, a vector of attribute values and the class value
Output: the vector W of estimations of the qualities of attributes

Set all weights W[A_i] = 0.0, i = 1, 2, ..., p;
for j = 1 to m do begin
    randomly select an instance X_j;
    find its nearest hit H_j and nearest miss M_j;
    for k = 1 to p do begin
        W[A_k] = W[A_k] - diff(A_k, X_j, H_j)/m + diff(A_k, X_j, M_j)/m;
    end;
end;
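A runnable rendering of this pseudocode, as a sketch assuming numeric attributes pre-scaled to [0, 1] so that diff reduces to an absolute difference:

import numpy as np

def relief(X, y, m, seed=0):
    """Relief sketch: estimate attribute qualities from m random samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.zeros(p)
    for _ in range(m):
        j = rng.integers(n)
        d = np.abs(X - X[j]).sum(axis=1)   # distances to the sampled instance
        d[j] = np.inf                      # never pick the sample itself
        hit = np.where(y == y[j], d, np.inf).argmin()    # nearest hit
        miss = np.where(y != y[j], d, np.inf).argmin()   # nearest miss
        # W[A] = W[A] - diff(A, X_j, H_j)/m + diff(A, X_j, M_j)/m
        W += (np.abs(X[j] - X[miss]) - np.abs(X[j] - X[hit])) / m
    return W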
ReliefF Feature Selection Method
[Figure: ReliefF with K = 3 nearest hits and misses, tolerating a noisy instance (X). Open issues: how to set K and m, and how to choose the m sampled instances?]
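ReliefF improves Relief's robustness to noise by averaging the difference over the K nearest hits and misses; for multi-class problems it also weights the misses of each class by its prior probability, which this two-class sketch omits. Continuing the conventions of the Relief code above (and assuming each class has more than k members):

def relieff(X, y, m, k=3, seed=0):
    """Two-class ReliefF sketch: average diff over the k nearest hits and misses."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.zeros(p)
    for _ in range(m):
        j = rng.integers(n)
        d = np.abs(X - X[j]).sum(axis=1)
        d[j] = np.inf
        hits = np.where(y == y[j], d, np.inf).argsort()[:k]     # k nearest hits
        misses = np.where(y != y[j], d, np.inf).argsort()[:k]   # k nearest misses
        W += (np.abs(X[j] - X[misses]).mean(axis=0)
              - np.abs(X[j] - X[hits]).mean(axis=0)) / m
    return W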
kNN Model-based Classification Method
(Guo et al, 2003)
The basic idea of the kNN model-based classification method is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification. kNNModel can generate a set of optimal representatives by learning inductively from the dataset.
An Example of kNNModel
Each representative di is represented as a tuple <Cls(di), Sim(di), Num(di), Rep(di)>, whose elements are, respectively: the class label of di; the similarity of di to the furthest instance among the instances covered by its neighbourhood Ni; the number of instances covered by Ni; and a representation of the instance di itself.
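In code, such a representative could be held as a small record; this is a hypothetical rendering of the tuple above, not the paper's data structure:

from dataclasses import dataclass
import numpy as np

@dataclass
class Representative:
    cls: int          # Cls(di): class label of di
    sim: float        # Sim(di): similarity of di to the furthest instance covered by Ni
    num: int          # Num(di): number of instances covered by Ni
    rep: np.ndarray   # Rep(di): the attribute vector representing di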
kNNMFS: kNN Model-based Feature Selection
kNNMFS takes the output of kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative for each class and then directly uses the inductive information of each representative generated by kNNModel for the feature weight calculation. Unlike in ReliefF, the k in our algorithm is not fixed: its value depends on the number of instances covered by each nearest representative used for the feature weight calculation. The M in kNNMFS is the number of representatives output by kNNModel.
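A sketch of this procedure, reusing the Representative record and the diff conventions above; the kNNModel construction itself is omitted, seeds is assumed to be its list of output representatives, and the paper's exact difference-function calculation may differ:

def knnmfs(X, y, seeds):
    """kNNMFS sketch: ReliefF-style weights seeded by kNNModel representatives.

    M = len(seeds) replaces the m random samples of ReliefF; for each seed,
    the neighbourhood size k is taken from Num(di), the number of instances
    the representative covers.
    """
    n, p = X.shape
    W = np.zeros(p)
    M = len(seeds)
    for s in seeds:
        d = np.abs(X - s.rep).sum(axis=1)
        k = s.num   # k varies per representative
        hits = np.where(y == s.cls, d, np.inf).argsort()[:k]
        misses = np.where(y != s.cls, d, np.inf).argsort()[:k]
        W += (np.abs(s.rep - X[misses]).mean(axis=0)
              - np.abs(s.rep - X[hits]).mean(axis=0)) / M
    return W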
Toxicity Dataset: Phenols
The phenols dataset was collected from the TETRATOX database (Schultz, 1997) and contains 250 compounds. A total of 173 descriptors was calculated for each compound using different software tools, e.g. ACD/Labs, Chem-X and TSAR. These descriptors were calculated to represent the physico-chemical, structural and topological properties relevant to toxicity. Some features are irrelevant to, or correlate poorly with, the class label:
[Scatter plots: descriptor CX-EMP20 vs. toxicity; descriptor TS_QuadXX vs. toxicity]
Evaluation Measures for Continuous Class Value Prediction
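For n test compounds with predicted values $p_i$, actual values $a_i$ and actual mean $\bar{a}$, the measures reported in Table 1 follow their standard definitions, given here for reference; CC is the Pearson correlation coefficient, and RSE is taken to be the root mean squared error:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|, \qquad
\mathrm{RSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2},$$

$$\mathrm{RAE} = \frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}, \qquad
\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}$$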
Endpoint I: Toxicity
Table 1. Performance of the linear regression algorithm on different phenols subsets
(FSM: feature selection method; NSF: number of selected features; CC: correlation coefficient; MAE: mean absolute error; RSE: root squared error; RAE: relative absolute error; RRSE: root relative squared error)

FSM       NSF   CC       MAE      RSE      RAE        RRSE
Phenols   173   0.8039   0.3993   0.5427   59.4360%   65.3601%
MostU      12   0.7543   0.4088   0.5454   60.8533%   65.6853%
GR         20   0.7722   0.4083   0.5291   60.7675%   63.7304%
IG         20   0.7662   0.3942   0.5325   58.6724%   63.1352%
Chi        20   0.7570   0.4065   0.5439   60.5101%   65.5146%
ReliefF    20   0.8353   0.3455   0.4568   51.4319%   55.0232%
SVM        20   0.8239   0.3564   0.4697   53.0501%   56.5722%
CS         13   0.7702   0.3982   0.5292   59.2748%   63.7334%
CFS         7   0.8049   0.3681   0.4908   54.7891%   59.1181%
kNNMFS     35   0.8627   0.3150   0.4226   46.8855%   50.8992%
Endpoint II: Mechanism of Action
Table 2. Performance of the wkNN algorithm on different phenols subsets
(10-fold cross validation using wkNN, k = 5; accuracy in %)

FSM       NSF   Average Accuracy   Variance   Deviation
GR         20   89.32              1.70       1.31
IG         20   89.08              1.21       1.10
Chi        20   88.68              0.50       0.71
ReliefF    20   91.40              1.32       1.15
SVM        20   91.80              0.40       0.63
CS         13   89.40              0.76       0.87
CFS         7   80.76              1.26       1.12
kNNMFS     35   93.24              0.44       0.67
Phenols   173   86.24              0.43       0.66
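wkNN here denotes a distance-weighted kNN classifier with k = 5. A minimal sketch assuming inverse-distance weighting (the exact weighting scheme used in the study is not stated in this summary; numpy as np as above):

def wknn_predict(X_train, y_train, x, k=5):
    """Distance-weighted kNN sketch: closer neighbours get larger votes."""
    d = np.abs(X_train - x).sum(axis=1)   # distances to the query compound
    nn = d.argsort()[:k]                  # indices of the k nearest neighbours
    w = 1.0 / (d[nn] + 1e-12)             # inverse-distance weights
    votes = {}
    for i, wi in zip(nn, w):
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + wi
    return max(votes, key=votes.get)      # class with the largest weighted vote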
Conclusion and Future Research Directions
• Using a kNN model as the seed selector chooses a set of more meaningful representatives to replace the original data for feature selection;
• A more reasonable difference-function calculation is presented, based on the inductive information in each representative obtained by kNNModel;
• kNNMFS obtains better performance on the subsets of the phenols dataset with both endpoints;
• The effectiveness of choosing boundary data or centre data of clusters as seeds for kNNMFS will be investigated;
• More comprehensive experiments on benchmark data will be carried out.
References
1. Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
2. Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In: Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag, pp. 986-996 (2003)
3. Schultz, T.W.: TETRATOX: The Tetrahymena Pyriformis Population Growth Impairment Endpoint – A Surrogate for Fish Lethality. Toxicol. Methods, 7, 289-309 (1997)
Thank you very much!