Biological Data Mining - University of Portsmouth

Download Report

Transcript Biological Data Mining - University of Portsmouth

Biological Data Mining
A comparison of Neural Network and
Symbolic Techniques
http://www.cmd.port.ac.uk/biomine/
1. Objectives
• The project aims:
– to develop and validate techniques for extracting
explicit information from bioinformatic data
– to express this information as logical rules and
decision trees
– to apply these new procedures to a range of
scientific problems related to bioinformatics and
cheminformatics
2. Extracting information
• Artificial neural networks can be trained to reproduce the
non-linear relationships underlying bioinformatic data
with good predictive accuracy
– but it is often hard to comprehend those relationships from the
internal structure of the network
– with the result that networks are often regarded as ‘black
boxes’.
• Decision trees using symbolic rules are easier to interpret
– leading to a greater likelihood of understanding the
relationships in the data
– allowing the behaviour of individual cases to be explained.
3. Extracting Decision Trees
• The Trepan procedure (Craven,1996) extracts
decision trees from a neural network and a set of
training cases by recursively partitioning the input
space.
• The decision tree is built in a best-first manner,
expanding the tree at nodes where there is greatest
potential for increasing the fidelity of the tree to the
network.
4. Splitting Tests
• The splitting tests at the nodes are m-of-n
expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi
are Boolean conditions.
• Start with a set of candidate tests
– binary tests on each value for nominal features
– binary tests on thresholds for real-valued features
• Use a beam search with a beam width of two.
• Initialize the beam with the candidate test that
maximizes the information gain.
5. Splitting Tests (II)
• To each m-of-n test in the beam and each
candidate test, apply two operators:
• m-of-n+1
• m+1-of-n+1
e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3}
• Admit new tests to the beam if they increase the
information gain and are significantly different
(chi-squared) from existing tests.
6. Example: Substance P Binding
to NK1 Receptors
• Substance P is a neuropeptide with the sequence:
H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2
• Wang et al. used the multipin technique to
synthesize 512 = 29 stereoisomers generated by
systematic replacement of L- by D-amino acids at
9 positions
• The aim was to measure binding potencies to NK1
receptors & identify the positions at which stereochemistry affects binding strength.
7. Application of Trepan
• A series of networks with 9:9:1 architectures were
trained using 90% of the data as a training set.
• For each network a decision tree was grown using
Trepan.
• The trees showed high fidelity with the networks
on a 10% test set.
8. Results
• Binding activity was determined by five
positions, viz.
– H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-MetNH2
• The positions identified agree with the FIRM
(Formal Inference-based Recursive Modelling)
analysis of Young and Hawkins
– Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput
screening data using recursive partitioning. Molecular Modelling &
Prediction of Bioactivity (ed. Gundertofte & JØrgensen).
9. A Typical Trepan Tree
10. Test set confusion matrix:
tree versus network
Network
R
1
0.70
30
12
1
1
2
0.68
27
16
0
1
3
0.64
31
9
2
2
4
0.71
29
10
2
3
5
0.71
29
9
2
4
6
0.67
30
9
3
2
True +ve True -ve False +ve False -ve
11. Test set confusion matrix:
tree versus observed
Network
R
1
0.70
29
10
2
3
2
0.68
25
10
2
7
3
0.64
28
7
5
4
4
0.71
29
10
2
3
5
0.71
29
10
2
3
6
0.67
28
7
5
4
True +ve True -ve False +ve False -ve
12. Future Work
• Complete the implementation of the Trepan
algorithm.
– model the distribution of the input data and generate a
set of query instances to be classified by the network &
used as additional training cases during tree extraction.
• Extend the algorithm to enable the extraction of
regression trees.
• Provide a Bayesian formulation for the decision
tree extraction algorithm.
13. Future Applications
• Apply Trepan to ligand-receptor binding
problems.
– compare the performance of these algorithms with
existing symbolic data mining techniques (ID3/C5).
14. References
• Wang J-X et al. (1993) Study of stereo-requirements of
substance P binding to NK1 receptors using analogues
with systematic D-amino acid replacements. Biorganic &
Medicinal Chemistry Letters, 3, 451-456.
• Young S & Hawkins D.M. (2000) Analysis of a large,
high-throughput screening data using recursive
partitioning. Molecular Modelling & Prediction of
Bioactivity (ed. Gundertofte & JØrgensen).
Grantholder
Professor Martyn Ford
Centre for Molecular Design
University of Portsmouth
[email protected]
Research Fellows
Dr Shuang Cang
Dr Abul Azad
Mar - Sept 2000
Jan 2001 -
Collaborators
• Dr Antony Browne
School of Computing, Information
Systems and Mathematics, London
Guildhall University.
[email protected]
• Professor Philip Picton
School of Technology and Design,
University College Northampton.
[email protected]
• Dr David Whitley
Centre for Molecular Design,
University of Portsmouth.
[email protected]