Biological Data Mining - University of Portsmouth

Download Report

Transcript Biological Data Mining - University of Portsmouth

Biological Data Mining
A comparison of Neural Network and
Symbolic Techniques
http://www.cmd.port.ac.uk/biomine/
Grantholder
Professor Martyn Ford
Centre for Molecular Design
University of Portsmouth
[email protected]
Collaborators
• Dr Anthony Browne
School of Computing, Information Systems and
Mathematics, London Guildhall University.
[email protected]
• Professor Philip Picton
School of Technology and Design,
University College Northampton.
[email protected]
• Dr David Whitley
Centre for Molecular Design,
University of Portsmouth.
[email protected]
Objectives
• The project aims:
– to develop & validate techniques for extracting
explicit information from bioinformatic data
– to express this information as logical rules and
decision trees
– to apply these new procedures to a range of
scientific problems related to bioinformatics
and cheminformatics
Extracting information
• Artificial neural networks (ANNs) can be
used to identify the non-linear relationships
that underlie bioinformatic data, but . . .
– trained ANNs do not lead to a concise and
explicit model
– specifying the underlying structure is therefore
difficult
– as a result, ANNs are often regarded as ‘black
boxes’
Data Mining and Neural Networks
• Standard data mining algorithms exist (such
as ID3 or C5) so why use an ANN? It would
be advantageous if the rules extracted:
– Give a better fit to the data with the same number
of rules (i.e. explain the data more accurately);
– Give the same fit to the data with less rules (i.e.
explain the data more comprehensibly); or
– Give both a better fit to the data and use less
rules (i.e. explain the data more comprehensibly
and more accurately).
Extracting Decision Trees
• The TREPAN procedure (Craven,1996)
– extracts decision trees from ANNs
– performs better than the symbolic learning
algorithms ID3 and C5
– the current implementation is restricted to a
particular network architecture, but
– the underlying algorithm is independent of
network architecture
Trepan
• Builds a decision tree representing the
function the ANN has learnt by recursively
partitioning the input space.
• Draws query instances by taking into
account the distribution of instances in the
problem domain.
• For real-valued features uses kernel density
estimates to generate a model of the
underlying data that is used to select
instances for presentation to the network.
Trepan
• Builds the decision tree in a best-first
manner:
– as each node is added the fidelity of the
decision tree to the ANN is maximised
– this is done by examining the significance of
the distributions at consecutive levels of the
tree (Kolmogorov-Smirnoff test for real valued
features, chi-squared for discrete ones)
• Allows the user to control the size of the
final tree by selecting appropriate stopping
criteria.
Aims
• Implement the TREPAN algorithm in a
portable format, independent of network
architecture.
• Extend the algorithm to enable the
extraction of regression trees.
• Provide a Bayesian formulation for the
decision tree extraction algorithm.
• Compare the performance of these
algorithms with existing symbolic data
mining techniques (ID3/C5).
Aims
• Apply the extracted decision trees
– to searches of bioinformatic databases
• protein databases
• genomic databases
– to searches of cheminformatic databases
• chemical libraries
• natural product databases
– to investigate ligand/receptor binding
– to quantify molecular similarity/diversity
– to identify new leads and optimise properties
Case study:
ligand interaction with GPCRs
LNL
8.5
LNI
SSF
TLI SSF
IEF TLI SSF
ALM
IEL
ISI
ISI
LNL
log1/c
• 28 GPCRs
• a number of putative
interaction sites
• 3 principal properties
of amino acids (AAs)
TLI
7.5
MNP
IEF
IEF
LNL
LEL
MAV
6.5
LEL
LNI
MNV
5.5
6
7
8
9
Prediction
10
9
LSI
LSI
log1/c
• MLR results for 2
ligands
SLF
SVI
LAM
LSI
SAI
SLF
ETL
ETL
ETL
8
7
NTI
NAP
EKF
6
EKF
NTV
NTL
EKF
ETL
EKL
5
5
6
7
8
9
Prediction
10
11