Machine Learning in Bioinformatics
Download
Report
Transcript Machine Learning in Bioinformatics
Machine Learning in
Bioinformatics
Simon Colton
The Computational
Bioinformatics Laboratory
Talk Overview
Our research group
– Aims, people, publications
Machine learning
– A balancing act
Bioinformatics
– Holy grails
Our bioinformatics research projects
– From small to large
A future direction
– Integration of reasoning techniques
Computational Bioinformatics
Laboratory
Our aim is to:
– Study the theory, implementation and application of
computational techniques to problems in biology and medicine
Our emphasis is on:
– Machine learning representations, algorithms and applications
Our favourite techniques are:
– ILP, SLPs, ATF, ATP, CSP, GAs, SVMs
– Kernel methods, Bayes nets, Action Languages
The (major) research tools we’ve produced are:
– Progol, HR, MetaLog (in production)
The Research Group Members
Hiroaki Watanabe (RA, BBSRC)
Alireza Tamaddoni-Nezhad (RA, DTI)
Stephen Muggleton (Professor)
Ali Hafiz (PhD)
Huma Lodhi (RA, DTI)
Simon Colton (Lecturer)
Jung-Wook Bang (RA, DTI)
(Nicos Angeloupolos, now in York) (RA, BBSRC)
Room 407
– http://www.doc.ic.ac.uk/bioinformatics
Some External Collaborators
Mike Sternberg (Biochemistry, Imperial)
Jeremy Nicholson (Biomedical Sciences, Imperial)
Steve Oliver (Biology, Manchester)
Ross King (Computing, Aberystwyth)
Doug Kell (Chemistry, Manchester)
Chris Rawlings (Oxagen)
Charlie Hodgman (GSK)
Alan Bundy (Informatics, Edinburgh)
Toby Walsh (Cork Constraint Computation Centre)
Some Departmental
Collaborators
Krysia Broda, Allesandro Russo, Oliver Ray
– Aspects of ILP and ALP
Marek Sergot
– Action Languages
Tony Kakas (Visiting professor, Cyprus)
– Abductive Logic Programming
Machine Learning Overview
Ultimately about writing programs which improve
with experience
– Experience through data
– Experience through knowledge
– Experience through experimentation (active)
Some common tasks:
– Concept learning for prediction
– Clustering
– Association rule mining
Maintaining a Balance
Predictive
tasks
Supervised
learning
Know
what you’re
looking for
Don’t know
what you’re
looking for
Descriptive
tasks
Unsupervised
learning
Don’t know
you’re even
looking
A Partial Characterisation of
Learning Tasks
Concept learning
Outlier/anomaly detection
Clustering
Concept formation
Conjecture making
Puzzle generation
Theory formation
Maintaining a Balance
in Predictive/Descriptive tasks
Predictive tasks
– From accuracy to understanding
– Need to show statistical significance
• But hypotheses generated often need to be understandable
– Difference between the stock market and biology
Descriptive tasks
– From pebbles to pearls
– Lots of rubbish produced
• Cannot rely on statistical significance
– Have to worry about notions of interestingness
• And provide tools to extract useful information from output
Maintaining a Balance
in Scientific Discovery tasks
Machine learning researchers
– Are generally not domain scientists also
Extremely important to collaborate
– To provide interesting projects
• Remembering that we are scientists not IT consultants
– To gain materials
• Data, background knowledge, heuristics,
– To assess the value of the output
Inductive Logic Programming
Concept/rule learning technique (usually)
– Hypotheses represented as Logic Programs
Search for LPs
– From general to specific or vice-versa
• One method is inverse entailment
– Use measures to guide the search
• Predictive accuracy and compression (info. theory)
– Search performed within a language bias
Produces good accuracy and understanding
– Logic programs are easier to decipher than ANNs
Our implementation: Progol (and others)
Example learned LP
fold('Four-helical up-and-down bundle',P) :helix(P,H1),
length(H1,hi),
position(P,H1,Pos),
interval(1 =< Pos =< 3),
adjacent(P,H1,H2),
helix(P,H2).
Predicting protein folds from helices
Stochastic Logic Programs
Generalisation of HMMs
Probabilistic logic programs
– More expressive language than LPs
– Quantative rather than qualitative
• Express arbitrary intervals over probability distributions
Issues in learning SLPs
– Structure estimation
– Parameter estimation
Applications
– More appropriate for biochemical networks
Automated Theory Formation
Descriptive learning technique
– Which can also be used for prediction tasks
Cycle of activity
– Form concepts, make hypotheses, explain hypotheses,
evaluate concepts, start again,…
– 15 production rules for concepts
– 7 methods to discover and extract conjectures
– Uses third party software to prove/disprove (maths)
– 25 heuristic measures of interestingness
Project: see whether this works in bioinformatics
Our implementation: HR
Other Machine Learning
Methods used in our Group
Genetic algorithms
– To perform ILP search (Alireza)
Bayes nets
– Introduction of hidden nodes (Philip)
Kernel methods
– Relational kernels for SVMs and regression (Huma)
Action Languages
– Stochastic (re)actions (Hiraoki)
Bioinformatics Overview
“Bioinformatics is the study of information content
and information flow in biological systems and
proceses” (Michael Liebman)
– Not just storage and analysis of huge DNA sequences
“Bioinformaticians have to be a Jack of all trades
and a master of one” (Charlie Hodgman, GSK)
Highly collaborative
– biology, mathematics, statistics, computer science,
biochemistry, physics, chemistry, medicine, …
From Sequence to Structure
attcgatcgatcgatcgatcaggcgcgcta
Cgagcggcgaggacctcatcatcgatcag…
MRPQAPGSLVDPNEDELRMAPWYWGRISREEA
KSILHGKPDGSFLVRDALSMKGEYTLTLMKDG
CEKLIKICHMDRKYGFIETDLFNSVVEMINYY
KENSLSMYNKTLDITLSNPIVRAREDEESQPH
GDLCLLSNEFIRTCQLLQNLEQNLENKRNSFN
AIREELQEKKLHQSVFGNTEKIFRNQIKLNES
FMKAPADA……
There is a computer program…?
Holy Grail Number One
From protein sequence to protein function
HGP data needs to be interpreted
– Genome split into genes, which code for a protein
– Biological function of protein dictated by structure
Structure of many proteins already determined
– By X-ray crystallography
Best idea so far: given a new gene sequence
– Find sequence most similar to it with known structure
• And look at the structure/function of the protein
Other alternatives
– Use ML techniques to predict where secondary structures will
occur (e.g., hairpins, alpha-helices, beta-sheets)
Holy Grail Number Two
Drug companies lose millions
– Developing drugs which turn out to be toxic
Predictive Toxicology
– Determine in advance which will be toxic
Approach 1: Mapping molecules to toxicity
– Using ML and statistical techniques
Approach 2:
– Producing metabolic explanations of toxic effects
– Using probabilistic logics to represent pathways
• And learning structures and parameters over this
Other aims of Bioinformatics
Organisation of Data
– Cross referencing
– Data integration is a massive problem
Analysing data from
– High-throughput methods for gene expression
– Ask Yike about this!
Produce Ontologies
– And get everyone to use them?
Some Current
Bioinformatics Projects
SGC
– The Substructure Server
SGC and SHM
– Discovery in medical ontologies
SHM
–
–
–
–
Studying biochemical networks (£400k, BBSRC)
Closed loop learning (£200k, EPSRC)
The Metalog project (£1.1 million, DTI)
APRIL 2 (£400k, EC)
A Substructure Server
Lesson from Automated Theorem Proving
– Best (most complex) methods not most used
• Other considerations: ease of use, stability, simplicity, e.g., Otter
Aim: provide a simple predictive toxicology program
– Via a server with a very simple interface
Sub-projects
– Find substructures in many positives, few negatives: Colton
• Simple Prolog program, writing Java version, use ILP??
– Put program on server: Anandathiyagar (MSc.)
– Distribute process over our Linux cluster: Darby (MEng.)
– Babel preprocessor (50+ repns), Rasmol back-end: ???
The Substructure Server
Using Medical Ontologies
Use Ontology and ML for database integration
– Muggleton and Tamaddoni-Nezhad
– Bridge between two disparate databases
• LIGAND (biochemical reactions)
• Enzyme classification system (EC) = ontology
Automated ontology maintenance
– Colton and Traganidas (MSc. Last year)
– Gene Ontology (big project)
– Use data to find links between GO terms
• Equivalence and implication finding using HR
Gene Ontology Discovery
55%
Studying Biochemical Networks
Use SLPs to find mappings between genomes
– Map function of pairs of homologous proteins
• E.g., mouse and human
– Homology is probabilistic
Developed SLP learning algorithms
Initial results applying them in biological networks
Work by
– Muggleton, Angeloupolos and Watanabe
Closed Loop Machine Learning
Active learning
– Information theoretic algorithm designs and chooses the most
informative and lowest cost experiments to carry out
• Implemented in the ASE-Progol system
– Learning generates hypotheses
– Being studied by Ali Hafiz (PhD)
Idea: use machine learning to guide experimentation
– using a real robot geneticist in a cyclic process
Aims of current project: determine the function of genes
Cost savings of 2 to 4 times over alternatives
Upcoming Nature article
APRIL 2
Applications of Probabalistic Relational
Induction in Logic
Aim: develop representations and learning
algorithms for probabilistic logics
Applications: bioinformatics
– Metabolic networks
– Phylo-genetics
2 RAs at Imperial (with Mike Sternberg)
– Starting in January
The Metalog Project Overview
Aim:
– Modelling disease pathways and predicting toxicity
– Gap filling: existing representations correct but incomplete
– Predict where the toxin is acting (focus)
Multi-layered problem representation
–
–
–
–
Meta-network level (Bayes nets) Philip
Network level (SLPs) Huma
Biochemical reaction level (LPs) Alireza
Problog lingua-franca developed
• to represent learned knowledge
NMR Data from metabonomics from Jeremy Nicholson
KEGG Background knowledge from Mike Sternberg
The Metalog Project Progress
Year 1 achievements (all objectives achieved)
Function predictions from LIGAND
Mapping between KEGG and metabolic networks
Initial Bayes-net model
– Drawn much interest from experts
• Agrees with KEGG, and disagrees in interesting ways
• Interaction between metabolytes which are not explained
Year 2
– Working towards abductive model for gap filling
Future Directions for Machine
Learning in Bioinformatics
In-silico modelling of complete organisms
Representation and reasoning at all levels
– From patient to the molecule
Probabalistic models
– For more complex biological processes
• Such as biochemical pathways
Biochemical Pathways
1/120th of a biochemical network
Future Directions for
My Research
Descriptive Induction meets Biology data
Most ML bioinformatics projects are predictive
– Very carefully compressed notions of interestingness
• Into a single measure: predictive accuracy
• Domain scientist not bombarded with a lot of information
• A correctly answered question can be highly revealing
Can we push this envelope slightly?
– Use descriptive induction (WARMR, CLAUDIEN, HR)
• To tell biologists something they weren’t expecting about the data they
have collated
– Have to worry hard about dull output
• Need to determine heuristics from domain scientists
More Future Directions
Put “Automated Reasoning” back together again
– Essential for scientific discovery
ML, ATP, CSP, etc., all work well individually
– Surely work better in combination…
Improve ATP to prove a different theorem?
– Make flexible using CSP and ATP
Improve ML by rationalising input concepts?
– Use ATF and ATP to find concepts and hypotheses
Improve CSP by introducing additional constraints
– Use ATF, ML to find constraints, ATP to prove them