Models - Data Mining and Machine Learning Group


An Effective Combination based on Class-Wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining
Dr. Daniel NEAGU, UK
ADMA 2006, Xi’an, China
Dr. Gongde GUO
Dept. of Computer Science, Fujian Normal University, China
Ms. Shanshan WANG
Dept. of Computer Science, Nanjing University of Aeronautics and Astronautics, China
Bradford, UK
Bradford, West Yorkshire
National Museum of Film and Television
School of Informatics, University of Bradford
Overview (1)
• Introduction to ML applications to KDD
• Proposal of Combination Operators
• Model Construction and Classification Algorithms
• Model Library for Predictive Toxicology
  • Collection of datasets
  • Central store for models and results
    • Formal structure to speed access and improve organisation; reduce ‘misplaced’ files
  • Remote Access
    • Secure access to data from remote locations possible in the future
Overview (2)
• Comparative Studies
  • Results from UoB Model Library
  • Study of different Machine Learning techniques
  • Variety of Feature Selection techniques
  • Many datasets and endpoints
  • Large variation in accuracy of created models
  • One aim is to automatically build ensembles based on best class-wise models
• Results and Conclusions
Current Context
[Diagram labels: Hardware; SW (Algorithms); Data collection/management]
• Nowadays more scientific data is generated and flows within systems:
  • Man power/laboratories
  • Techniques and computational power (Moore’s Law)
  • Funds/Legislation
• More data is stored and available:
  • Storage technology faster and cheaper (Storage Law)
  • DBMS capable of handling bigger DB
  • Web/online access to distributed data
Consequences
• The human expert is overloaded: very little data is checked
• Knowledge Discovery is NEEDED for data understanding and use
General definitions
• Data: facts regarding things (such as people, objects, events) which can be digitally transmitted or processed.
• Information: data that have been processed and presented in a form suitable for human interpretation, with the purpose of revealing meaning (such as patterns or rules).
• Models: representations of patterns in the data.
• Knowledge: the theoretical and practical comprehension of a certain domain, which supports making decisions.
• Intelligence: the capability of learning, understanding and finding solutions for problems in a specific domain.
Example:
• 1234567.89 is data.
• "Your bank balance has jumped 80.87% to £1234567.89" is information.
• "Nobody owes me that much money" is knowledge.
• "I'd better talk to the bank before I spend it, because of what has happened to other people" is intelligence.
Source: http://foldoc.doc.ic.ac.uk
Knowledge Discovery in Databases (KDD)
[Figure: the KDD process. Labels: Data sources; Select/preprocess (Feature Selection); Data preparation; Transform; Data mining; Models; Extracted information; Interpret/Evaluate/Assimilate; Knowledge]
• The nontrivial process of identifying valid, novel, potentially useful and, ultimately, understandable patterns in data.
• Involves the following steps:
  • understanding the application domain and definition of the goals
  • selecting the target data set
  • data cleaning and pre-processing
  • data reduction and projection
  • choosing the function of data modelling and the algorithm
  • data mining
  • interpretation
  • evaluation and utilization of the discovered knowledge
Predictive Data Mining
• The processes of data classification and regression, whose goal is to obtain predictive models for a specific target, based on predictive relationships among a large number of input variables.
• Classification identifies characteristics of data and assigns a data item to one of several predefined categorical classes.
• Regression uses the existing numerical data values and maps them to a real-valued prediction (target) variable.
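As a minimal sketch of the distinction, assuming a toy k-NN learner (k-NN is one of the techniques named in this talk) and made-up two-descriptor data: the same neighbours give a class label by vote, or a real value by averaging. All function names and data values below are illustrative assumptions, not taken from the study.

```python
import math
from collections import Counter

def nearest(train, x, k):
    """Return the k training rows closest to x (Euclidean distance)."""
    return sorted(train, key=lambda row: math.dist(row[0], x))[:k]

def knn_classify(train, x, k=3):
    """Classification: majority class label among the k nearest neighbours."""
    labels = [y for _, y in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    values = [y for _, y in nearest(train, x, k)]
    return sum(values) / len(values)

# Hypothetical data: (descriptor vector, target).
train_classes = [
    ((0.10, 0.20), "low"),
    ((0.20, 0.10), "low"),
    ((0.15, 0.15), "low"),
    ((0.90, 0.80), "high"),
    ((0.80, 0.90), "high"),
]
train_values = [
    ((0.10, 0.20), 1.0),
    ((0.20, 0.10), 2.0),
    ((0.90, 0.80), 10.0),
    ((0.80, 0.90), 12.0),
]

query = (0.85, 0.85)
predicted_class = knn_classify(train_classes, query, k=3)  # a categorical class
predicted_value = knn_regress(train_values, query, k=2)    # a real-valued target
```

The only difference between the two predictors is the last step: a vote over labels for classification, a mean over numbers for regression.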
Machine Learning Applications in Data Mining: Dynamics (ISI Thomson Web of Knowledge)
References to Machine Learning techniques with applications in Predictive Data Mining:
[Line chart: number of references per year, 1995 to 2004, for ANNs, GAs, ILP, RI, DTs and k-NN; vertical axis from 0 to 3000]
[Pie chart: ANNs 55%, GAs 30%, DTs 10%, k-NN 3%, RI 1%, ILP 1%]
Multi-Classifier Systems
• Different classifiers potentially offer complementary or at least additional information about patterns to be classified
• Various approaches to classifier combinations:
  • Majority voting [4]
  • Entropy-based combination [5]
  • Dempster-Shafer theory-based combination [6], [7]
  • Bayesian classifier combination [8]
  • Similarity-based classifier combination [9]
  • Fuzzy inference [10]
  • Gating networks [11]
  • Statistical models [2]
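Majority voting, the simplest of the approaches listed above, can be sketched as follows. This is an illustrative sketch rather than the formulation of [4]; the tie-breaking rule (the first most-voted label in classifier order wins) and the example labels are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one predicted label per classifier into a single label.

    Ties are broken in favour of the most-voted label that appears
    first in `predictions` (an assumed policy).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label

# Hypothetical outputs of three classifiers for one compound.
combined = majority_vote(["toxic", "non-toxic", "toxic"])
```

Voting uses only the predicted labels, so it ignores how reliable each classifier is for a given class.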

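The Proposed Effective Combination Scheme
• We propose a hybrid classifier combination scheme which makes use of the class-wise expertise of diverse classifiers (a priori knowledge obtained from the training set) to achieve potentially better performance.
• Two operators are proposed, over the models $M_1, \dots, M_m$ and the classes $j = 1, 2, \dots, L$:

$$\phi_j = \arg\max_{i,\,M_i} \left\{ \frac{TP_j^i}{TP_j^i + FP_j^i} \;\middle|\; i = 1, 2, \dots, m \right\}, \quad j = 1, 2, \dots, L$$

$$\psi = \arg\max_{i,\,M_i} \left\{ CA_i \;\middle|\; i = 1, 2, \dots, m \right\}$$

Here $TP_j^i$ and $FP_j^i$ are the true positives and false positives of model $M_i$ for class $j$, and $CA_i$ is the classification accuracy of $M_i$. The operator $\phi_j$ selects the best model for class $j$; $\psi$ selects the best model for all classes.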
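The two operators can be sketched in code, assuming each model's predictions on the training set are available as a list of labels. The first operator picks, for every class, the model with the highest TP/(TP+FP) on that class; the second picks the model with the highest classification accuracy. The helper names, the zero score for a model that never predicts a class, and the toy labels are all assumptions made for illustration.

```python
def class_precision(y_true, y_pred, label):
    """TP / (TP + FP) for one class; 0.0 if the class is never predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
    return tp / (tp + fp) if tp + fp else 0.0

def accuracy(y_true, y_pred):
    """Fraction of instances classified correctly (CA)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def select_experts(y_true, model_preds, labels):
    """Return ({class: best model name}, best overall model name)."""
    best_per_class = {
        label: max(model_preds,
                   key=lambda m: class_precision(y_true, model_preds[m], label))
        for label in labels
    }
    best_overall = max(model_preds, key=lambda m: accuracy(y_true, model_preds[m]))
    return best_per_class, best_overall

# Hypothetical training labels and predictions of three models.
y_true = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
model_preds = {
    "M1": ["A", "B", "B", "B", "B", "C", "C", "C", "B"],
    "M2": ["A", "A", "A", "B", "A", "C", "C", "C", "A"],
    "M3": ["A", "A", "B", "B", "B", "A", "C", "C", "C"],
}
best_per_class, best_overall = select_experts(y_true, model_preds, ["A", "B", "C"])
```

In this toy data each model is the expert for a different class, while only one of them is best overall, which is the situation the scheme is designed to exploit.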
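Architecture of the Effective Multiple Classifier System
[Figure, two panels: "Model construction algorithm" and "Classification Algorithm".
Model construction: Training data → Data Pre-processing → algorithms A1, A2, A3, …, Am → Best Model for Class 1, Best Model for Class 2, …, Best Model for Class L, and Best Model for All Classes.
Classification: a testing instance x is passed to the Best Model for Class 1; if x is classified as C1, that is the Output. If not, x is passed to the Best Model for Class 2 (if x is classified as C2), and so on up to the Best Model for Class L (if x is classified as CL). Otherwise, the Best Model for All Classes gives the Output.]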
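The classification step of this architecture can be sketched as a simple cascade, under the assumption that the class-wise best models are consulted in class order and that the first one to recognise its own class decides the output. The function name, the threshold "models" and the labels are hypothetical stand-ins for trained classifiers.

```python
def classify_with_experts(x, class_experts, best_overall):
    """Cascade over class-wise best models, then fall back.

    class_experts: list of (class label, model) pairs, in class order.
    best_overall:  model used when no class-wise expert claims x.
    A model is any callable mapping an instance to a class label.
    """
    for label, model in class_experts:
        if model(x) == label:      # expert for `label` classifies x as `label`
            return label
    return best_overall(x)         # "Otherwise": best model for all classes

# Hypothetical one-descriptor models.
def expert_low(x):
    return "low" if x < 0.3 else "high"

def expert_high(x):
    return "high" if x > 0.7 else "low"

def overall(x):
    return "low" if x < 0.5 else "high"

experts = [("low", expert_low), ("high", expert_high)]
outputs = [classify_with_experts(x, experts, overall) for x in (0.1, 0.9, 0.6, 0.4)]
```

An expert is trusted only when it predicts the class it is best at; its predictions for other classes are ignored, which is how the scheme limits the influence of a model outside its area of expertise.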
ML applications for Predictive Toxicology
• The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models.
• To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted:
  • a defined endpoint
  • an unambiguous algorithm
  • a defined domain of applicability
  • appropriate measures of goodness-of-fit, robustness and predictivity
  • a mechanistic interpretation, if possible
Datasets (1)
DEMETRA*
1. LC50 96h Rainbow Trout acute toxicity (ppm): 282 compounds
2. EC50 48h Water Flea acute toxicity (ppm): 264 compounds
3. LD50 14d Oral Bobwhite Quail (mg/kg): 116 compounds
4. LC50 8d Dietary Bobwhite Quail (ppm): 123 compounds
5. LD50 48h Contact Honey Bee (μg/bee): 105 compounds
*http://www.demetra-tox.net
Datasets (2)
CSL APC* Datasets
• 5 endpoints
• A single endpoint/descriptor set used for our experiments:
  • Mallard Duck
  • LD50 toxicity value
  • 60 organophosphates
  • 248 descriptors
*http://www.csl.gov.uk
Datasets (3)
TETRATOX*/LJMU** Dataset
• Tetrahymena pyriformis
• Inhibition of growth IGC50
• Phenols data
• 250 phenolic compounds
• 187 descriptors
*http://www.vet.utk.edu/tetratox/
**http://www.ljmu.ac.uk
Descriptors
• Multiple descriptor types
• Various software packages to calculate 2D and 3D attributes*
*http://www.demetra-tox.net
Model Library
• Algorithms were chosen for their representativeness and diversity, and for easy, simple and fast access:
  • Instance-based Learning algorithm (IBL)
  • Decision Tree learning algorithm (DT)
  • Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
  • Multi-Layer Perceptrons (MLPs)
  • Support Vector Machine (SVM)
Dimensionality
[Figure: Dataset One, Dataset Two, Dataset Three and Dataset Four are each combined with Feature Selection and Algorithms; each combination has a Model, a Parameter file and a Results file]
Organisation
[Figure: hierarchy of the Model Library, listed by level]
Source: CSL APC, DEMETRA, TETRATOX/LJMU
Endpoint/Descriptors: Trout, Water Flea, Oral Quail, Dietary Quail, Bee, Mallard_Duck, PHENOLS
Feature Selection: CFS, Chi, CS, GR, IG, ReliefF, SVM, KNNMFS, Raw
File Type: Feature Subsets, Models, Parameters, Results
Files: Model 1, Model 2, Model 3, …, Model n
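One way to picture this organisation is as a fixed sequence of levels that locates any file in the library. The sketch below assumes the levels map onto nested directories, which the slides do not state; the root name, the file name and the function name are hypothetical.

```python
from pathlib import PurePosixPath

# Levels of the hierarchy, in the order shown on the Organisation slide.
LEVELS = ("source", "endpoint", "feature_selection", "file_type", "file_name")

def library_path(root, source, endpoint, feature_selection, file_type, file_name):
    """Build the location of one file from the five hierarchy levels.

    Mapping the levels to nested directories is an assumption made
    for illustration only.
    """
    return PurePosixPath(root, source, endpoint, feature_selection,
                         file_type, file_name)

# Hypothetical example: a model built on the DEMETRA Trout endpoint
# after CFS feature selection.
example = library_path("model_library", "DEMETRA", "Trout", "CFS",
                       "Models", "model_1")
```

Because every file is addressed by the same five levels, a model, its parameters and its results for one dataset and feature subset sit side by side, which is what makes ‘misplaced’ files less likely.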
Comparison of performance of combination schemes on seven data sets
MCS:
• Majority Voting-based Combination (MVC)
• Maximal Probability-based Combination (MPC)
• Average Probability-based Combination (APC)
• Classifier Combination based on Dempster Rule of Combination (DRC)
• CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers)
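Two of the baseline schemes, MPC and APC, can be sketched from per-class probabilities. The slides do not define them, so the sketch assumes the usual reading: MPC returns the class holding the single highest probability given by any classifier, and APC returns the class with the highest probability averaged over classifiers. Function names and probability values are illustrative.

```python
def max_probability_combination(prob_outputs):
    """MPC (assumed reading): class with the highest single probability."""
    best_label, best_p = None, float("-inf")
    for probs in prob_outputs:
        for label, p in probs.items():
            if p > best_p:
                best_label, best_p = label, p
    return best_label

def average_probability_combination(prob_outputs):
    """APC (assumed reading): class with the highest mean probability."""
    n = len(prob_outputs)
    averages = {
        label: sum(probs[label] for probs in prob_outputs) / n
        for label in prob_outputs[0]
    }
    return max(averages, key=averages.get)

# Hypothetical class probabilities from three classifiers for one compound.
prob_outputs = [
    {"toxic": 0.95, "non-toxic": 0.05},
    {"toxic": 0.30, "non-toxic": 0.70},
    {"toxic": 0.20, "non-toxic": 0.80},
]
mpc_label = max_probability_combination(prob_outputs)
apc_label = average_probability_combination(prob_outputs)
```

On this example the two schemes disagree: one very confident classifier decides MPC, while APC follows the two moderately confident ones.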
Conclusions
The proposed combination scheme CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers):
• not only makes use of the expertise of the best individual classifiers
• but also removes their negative influences
• therefore the results presented previously show a significant improvement in global performance
Acknowledgements
This work is part-funded by:
• EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach
  http://pythia.inf.brad.ac.uk/
• EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture
  http://www.demetra-tox.net
Special thanks also to:
• Dr. Q. Chaudhry (CSL York)
• Dr. Mark Cronin (LJMU)
and PhD students:
• Ms. Ladan Malazizi, BSc, PhD student
  Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment
• Mr. Paul Trundle, BSc, PhD student
  Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity
• Ms. Areej Shhab, BEng, MPhil
  Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining
• Mr. M. Craciun (University of Galati), BSc, MSc