Biological Data Mining

Download Report

Transcript Biological Data Mining

Biological Data Mining
A comparison of Neural Network and
Symbolic Techniques
http://www.cmd.port.ac.uk/biomine/
People
Centre for Molecular Design, University of Portsmouth
•Professor Martyn Ford
•Dr David Whitley
•Dr Shuang Cang
(Mar - Sept 2000)
•Dr Abul Azad
(Jan 2001 - )
•Dr Antony Browne, London Guildhall University.
•Professor Philip Picton, University College Northampton.
1. Objectives
• The project aims:
– to develop and validate techniques for extracting
explicit information from bioinformatic data
– to express this information as logical rules and
decision trees
– to apply these new procedures to a range of
scientific problems related to bioinformatics and
cheminformatics
2. Methods for Extracting Information
• Artificial Neural Networks
– good predictive accuracy
– hard to decipher
– often regarded as ‘black boxes’
• Decision Trees
– symbolic rules easier to interpret
– more likely to reveal relationships in the data
– allow behaviour of individual cases to be explained
3. Extracting Decision Trees
• The Trepan procedure (Craven,1996)
extracts decision trees from a neural network
and a set of training cases by recursively
partitioning the input space.
• The decision tree is built in a best-first
manner, expanding the tree at nodes where
there is greatest potential for increasing the
fidelity of the tree to the network.
4. Splitting Tests
• The splitting tests at the nodes are m-of-n
expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi
are Boolean conditions.
• Start with a set of candidate tests
– binary tests on each value for nominal features
– binary tests on thresholds for real-valued features
• Use a beam search with a beam width of two.
• Initialize the beam with the candidate test that
maximizes the information gain.
5. Splitting Tests (II)
• To each m-of-n test in the beam and each
candidate test, apply two operators:
• m-of-n+1
• m+1-of-n+1
e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3}
• Admit new tests to the beam if they increase the
information gain and are significantly different
(chi-squared) from existing tests.
6. Example: Substance P Binding
to NK1 Receptors
• Substance P is a neuropeptide with amino acid
sequence
H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2
• Wang et al. (1993) used the multipin technique to
synthesize 512 = 29 stereoisomers generated by
systematic replacement of L- by D-amino acids at
9 positions, and measured binding potencies to
central NK1 receptors.
• The objective was to identify the positions at
which stereo-chemistry affects binding strength.
7. Application of Trepan
• A series of networks with 9:9:1 architectures were
trained using 90% of the data as a training set.
• For each network a decision tree was grown using
Trepan.
• The positions identified agree with the FIRM
(Formal Inference-based Recursive Modelling)
analysis of Young and Hawkins (1999).
8. A Typical Trepan Tree
9. Future Work
• Complete the implementation of the Trepan
algorithm.
– model the distribution of the input data and generate from
this a set of query instances that are classified using the
network and used as additional training cases during
extraction of the tree.
• Extend the algorithm to enable the extraction of
regression trees.
• Provide a Bayesian formulation for the decision tree
extraction algorithm.
• Compare the performance of these algorithms with
existing symbolic data mining techniques (ID3/C5).
• Apply Trepan to ligand-receptor binding problems.