The Challenge of
Predicting Gene Function



Ross D. King
Department of Computer Science
University of Wales, Aberystwyth
Gene Function Prediction

The most important revelation from the sequenced
genomes is that the functions of typically only
60-70% of the predicted genes are known
with any confidence.

The new science of functional genomics is dedicated
to determining the function of the genes of
unassigned function, and to further detailing the
function of genes with purported function.
Data Mining Prediction

We have developed a method for predicting the
functional class of gene products based on
ILP/Relational data mining.

The idea is to learn a reliable predictive function on
the examples of genes with products of known
function.

Then apply this function to genes where the
functional class is unknown.

We call this approach: Data Mining Prediction (DMP).
Predicting Gene Function in Yeast
We will demonstrate our approach using ORFs in yeast
(Saccharomyces cerevisiae).
● Using the MIPS functional classification scheme
● For those ORFs whose function is currently unknown
● Using 5 types of data:
1. Sequence statistics
2. Homology (sequence similarity)
3. Predicted Secondary Structure
4. Expression (microarray)
5. Phenotype
We want to map from sequence to function class
[Diagram: Sequence 1, Sequence 2, Sequence 3 and Sequence 4 mapped to Function Class 1 and Function Class 2]
Classification Schemes 1
MIPS/GeneOntology
1,0,0,0 "METABOLISM"
2,0,0,0 "ENERGY"
3,0,0,0 "CELL CYCLE AND DNA PROCESSING"
4,0,0,0 "TRANSCRIPTION"
5,0,0,0 "PROTEIN SYNTHESIS"
6,0,0,0 "PROTEIN FATE (folding, modification, destination)"
8,0,0,0 "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
10,0,0,0 "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
11,0,0,0 "CELL RESCUE, DEFENSE AND VIRULENCE"
13,0,0,0 "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
14,0,0,0 "CELL FATE"
29,0,0,0 "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
30,0,0,0 "CONTROL OF CELLULAR ORGANIZATION"
40,0,0,0 "SUBCELLULAR LOCALISATION"
62,0,0,0 "PROTEIN ACTIVITY REGULATION"
63,0,0,0 "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT "
67,0,0,0 "TRANSPORT FACILITATION"
98,0,0,0 "CLASSIFICATION NOT YET CLEAR-CUT"
99,0,0,0 "UNCLASSIFIED PROTEINS"
Classification Schemes 2
Hierarchy of classes
1,0,0,0 "METABOLISM"
1,1,0,0 "amino acid metabolism"
1,2,0,0 "nitrogen and sulfur metabolism"
1,3,0,0 "nucleotide metabolism"
1,4,0,0 "phosphate metabolism"
1,5,0,0 "C-compound and carbohydrate metabolism"
1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0 "secondary metabolism"
Classification schemes 3
Hierarchy of classes
1,0,0,0 "METABOLISM"
1,1,0,0 "amino acid metabolism"
1,1,1,0 "amino acid biosynthesis"
1,1,4,0 "regulation of amino acid metabolism"
1,1,7,0 "amino acid transport"
1,1,10,0 "amino acid degradation (catabolism)"
1,1,99,0 "other amino acid metabolism activities"
1,2,0,0 "nitrogen and sulfur metabolism"
1,3,0,0 "nucleotide metabolism"
1,4,0,0 "phosphate metabolism"
1,5,0,0 "C-compound and carbohydrate metabolism"
1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0 "secondary metabolism"
... and ORFs may have multiple functions too!
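The four-field class codes above can be handled programmatically. The following is a minimal sketch, assuming each code is a comma-separated string in which trailing zeros mean "no further sub-class"; the function names are hypothetical and not part of the MIPS scheme or the DMP software.

```python
def parse_class(code):
    """Turn a code such as '1,1,4,0' into its non-zero prefix (1, 1, 4)."""
    parts = [int(p) for p in code.split(",")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def ancestors(code):
    """Return the class and all of its parent classes, most general first."""
    path = parse_class(code)
    width = len(code.split(","))
    result = []
    for depth in range(1, len(path) + 1):
        padded = list(path[:depth]) + [0] * (width - depth)
        result.append(",".join(str(p) for p in padded))
    return result


def labels_at_level(codes, level):
    """Labels of one ORF at a given hierarchy level (an ORF may have several)."""
    labels = set()
    for code in codes:
        chain = ancestors(code)
        if len(chain) >= level:
            labels.add(chain[level - 1])
    return labels


if __name__ == "__main__":
    # An ORF annotated with two functions, one of them at level 3.
    orf_classes = ["1,1,4,0", "5,0,0,0"]
    print(ancestors("1,1,4,0"))
    print(labels_at_level(orf_classes, 1))
    print(labels_at_level(orf_classes, 2))
```

Treating every ancestor of an annotated class as a label at its own level is one simple way to obtain a separate multi-label learning problem per level of the hierarchy.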
Sequence Data
field             description                                     type
aa_rat_X          % of amino acid X in the protein                real
seq_len           length of the protein sequence                  int
aa_rat_pair_X_Y   % of the amino acids X and Y consecutively      real
mol_wt            molecular weight of the protein                 int
theo_pI           theoretical pI (isoelectric point)              real
atomic_comp_X     atomic composition of X (C,H,N,O,S)             real
aliphatic_index   aliphatic index                                 real
hydro             grand average of hydropathy                     real
strand            the DNA strand                                  'w' or 'c'
position          the number of exons (no. of start positions)    int
cai               codon adaptation index                          real
motifs            number of PROSITE motifs                        int
tmSpans           number of transmembrane spans                   int
chromosome        chromosome number                               1..16,mit
478 attributes in total
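A few of the sequence attributes listed above can be computed directly from the protein string. This is a minimal sketch, assuming percentages are taken over all residues for aa_rat_X and over all adjacent residue pairs for aa_rat_pair_X_Y; that normalisation and the function name are assumptions, not taken from the slides.

```python
from collections import Counter


def sequence_attributes(seq):
    """Compute seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two residues")
    attrs = {"seq_len": n}
    # Percentage of each amino acid in the protein.
    for aa, count in Counter(seq).items():
        attrs["aa_rat_" + aa] = 100.0 * count / n
    # Percentage of each ordered pair of consecutive amino acids
    # (assumed denominator: the n - 1 adjacent pairs).
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    for pair, count in pairs.items():
        attrs["aa_rat_pair_%s_%s" % (pair[0], pair[1])] = 100.0 * count / (n - 1)
    return attrs


if __name__ == "__main__":
    # Start of the YAL001C sequence shown on the homology slide.
    attrs = sequence_attributes("mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk")
    print(attrs["seq_len"], round(attrs["aa_rat_K"], 1))
```

With 20 amino acids this already gives 20 single-residue and up to 400 pair attributes, which is consistent with most of the 478 attributes being composition statistics.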
Homology data
YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
PSI-BLAST search against the sequence database (NRDB):
gene     organism           score
tfc      baker's yeast      0.0
sfc3     fission yeast      1.0e-18
wsv442   white spot virus   2.1
cg9463   fruit fly          2.9
f1l3     Arabidopsis        3.0
We look up the associated information from SwissProt, e.g. for sfc3:
keyword(membrane)
length(358)
dbref(prosite)
dbref(embl)
Predicted Secondary Structure Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record length and relative
positions of the secondary
structure elements.
This is relational data.
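Recording the length and position of each secondary structure element amounts to a run-length encoding of the prediction string. A minimal sketch follows, assuming the one-letter codes a (alpha), b (beta) and c (coil) shown above; the function names are hypothetical.

```python
from itertools import groupby


def structure_elements(ss):
    """Run-length encode a prediction string into (kind, start, length) tuples."""
    elements = []
    pos = 0
    for kind, run in groupby(ss):
        length = len(list(run))
        elements.append((kind, pos, length))
        pos += length
    return elements


def has_motif(elements, kinds):
    """True if the element kinds occur consecutively, e.g. 'cbc' for coil-beta-coil."""
    sequence = [kind for kind, _, _ in elements]
    wanted = list(kinds)
    return any(sequence[i:i + len(wanted)] == wanted
               for i in range(len(sequence) - len(wanted) + 1))


if __name__ == "__main__":
    elements = structure_elements("ccbbbcaaaa")
    print(elements)
    print(has_motif(elements, "cbc"))
```

Each tuple corresponds to one fact about the ORF, and neighbouring tuples give the "followed by" relations that make the data relational rather than a fixed-width table.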
Expression Data
• Microarray experiments to measure
  expression changes in yeast under a variety
  of conditions, including cell cycle, heat shock,
  diauxic shift.
• Short time series data, numerical-valued
ORF       a0      a7      a14     a21
YBR166C    0.33   -0.17    0.04   -0.07
YOR357C   -0.64   -0.38    0.32   -0.29
YLR292C   -0.23    0.19    0.36    0.14
YGL112C   -0.69   -0.89   -0.74   -0.56
...
Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998),
Gasch et al (2000, 2001), Chu et al (1998)
Phenotype Data
• Data from knockout gene growth experiments
• Many missing data
• 69 attributes x 1461 ORFs of known function
• 991 genes of unknown function
• Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
              growth medium
deleted ORF   calcofluor white   sorbitol   benomyl   H2O2   ...
YAL001C       w                  n          n         w
YAL019W       n                  s          w         w
YAL021C       n                  n          n         n
YAL029C       n                  w          w         r
s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data
What are the Machine Learning Issues?
• Large volume of data
• Missing data
• Accurate results required
• Intelligible results required
• Class hierarchy
• Multiple labels
• Relational data
Relational vs Propositional
Propositional: single table, fixed number of columns/attributes
orf       time0   time7   time14
yal001c   0.34    0.52    0.48
yal002w   0.76    0.82    0.89
yal003w   0.77    0.46    0.78
yal004c   0.38    0.50    0.49
Relational: multiple tables, multiple values
orf       SwissProtID   e-val
yal001c   p03415        2e-4
yal001c   p08640        8e-58
yal002w   p32583        6e-52
yal002w   p08775        3e-42

SwissProtID   keyword
p03415        apoptosis
p03415        repeat
p03415        zinc
p08640        membrane
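The difference can be made concrete with the two relational tables above: answering "which keywords are associated with the homologs of an ORF" needs a join, and the number of answers varies per ORF. A minimal sketch, using only the rows shown on the slide; the function name and the e-value threshold parameter are hypothetical.

```python
# Rows copied from the two relational tables on the slide.
HOMOLOGY = [
    ("yal001c", "p03415", 2e-4),
    ("yal001c", "p08640", 8e-58),
    ("yal002w", "p32583", 6e-52),
    ("yal002w", "p08775", 3e-42),
]
KEYWORDS = [
    ("p03415", "apoptosis"),
    ("p03415", "repeat"),
    ("p03415", "zinc"),
    ("p08640", "membrane"),
]


def keywords_for_orf(orf, max_evalue):
    """Join the two tables: keywords of every homolog at or below the e-value cut-off."""
    hits = {sp for o, sp, e in HOMOLOGY if o == orf and e <= max_evalue}
    return sorted({kw for sp, kw in KEYWORDS if sp in hits})


if __name__ == "__main__":
    print(keywords_for_orf("yal001c", 1e-3))
    print(keywords_for_orf("yal001c", 1e-10))
```

Because one ORF can have any number of homologs, and each homolog any number of keywords, this information cannot be stored in a single fixed-width row without first choosing which patterns to test for.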
Data Mining Prediction (DMP)
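[Diagram: DMP workflow. The entire database is split into data for rule creation (2/3) and test data (1/3). The data for rule creation is split into training data (2/3) and validation data (1/3). PolyFARM and C4.5 rule generation on the training data produce all rules; the best rules are selected on the validation data; rule accuracy is measured on the test data to give the results.]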
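The nested split in the DMP workflow can be sketched as follows. This is an illustration only, assuming a random split with the 2/3 and 1/3 fractions from the diagram applied twice; the function name and the use of a fixed seed are assumptions.

```python
import random


def dmp_split(orfs, seed=0):
    """Split ORFs into training, validation and test sets (2/3-1/3, applied twice)."""
    rng = random.Random(seed)
    shuffled = list(orfs)
    rng.shuffle(shuffled)
    # First split: 1/3 held out as test data, 2/3 kept for rule creation.
    n_test = len(shuffled) // 3
    test, rule_creation = shuffled[:n_test], shuffled[n_test:]
    # Second split: 1/3 of the rule-creation data is used to select the best rules.
    n_val = len(rule_creation) // 3
    validation, training = rule_creation[:n_val], rule_creation[n_val:]
    return training, validation, test


if __name__ == "__main__":
    orfs = ["orf%04d" % i for i in range(900)]
    training, validation, test = dmp_split(orfs)
    print(len(training), len(validation), len(test))
```

Keeping the test third untouched until the final step is what makes the reported rule accuracies estimates of performance on unseen ORFs.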
Warmr

Warmr is an ILP algorithm developed by Dehaspe et al.

It is an ILP version of the well known Apriori data
mining algorithm.

It is designed to find frequent patterns in a Datalog
database.
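For readers unfamiliar with Apriori, the level-wise idea that Warmr lifts to first-order logic can be sketched in its propositional form. This is not Warmr itself: it works on plain item sets rather than Datalog queries, and the function name and data layout are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise search for all item sets with support >= min_support.

    transactions must be a non-empty list of item collections.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidate generation: join frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
    for itemset, sup in sorted(apriori(data, 0.5).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)
```

The pruning step relies on the fact that a pattern cannot be frequent unless all of its generalisations are; the same monotonicity is what makes a level-wise search over first-order queries feasible.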
PolyFARM
• First-order association rule mining
• Finding all frequent first order patterns
  in the data
• Distributed on a Beowulf cluster
• 47,034 homology patterns, f > 5%
• 19,628 structure patterns, f > 2%
[Clare & King PADL 2003]
Example homology pattern: a close homology to a short protein in E. coli
hom(SPID, close) ^
sq_len(SPID, short) ^
classification(SPID, ecoli)
Example structure pattern: contains alpha-coil-alpha with a high overall coil distribution
struc(Pos1, a) ^
neighbour(Pos1, Pos2, c) ^
neighbour(Pos2, Pos3, a) ^
coil_dist(high)
Propositionalisation
Transforming relational data into boolean attributes
          patt1   patt2   patt3   patt4   ...   patt47034
YAL001C   0       1       0       0       ...   1
YAL002W   0       1       1       0       ...   1
YAL003W   1       0       0       1       ...   0
YAL004W   1       1       0       0       ...   1
YAL005C   0       0       0       0       ...   1
...
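Propositionalisation can be sketched as evaluating each frequent pattern against the relational facts of each ORF and recording a 0 or 1. The example below is a minimal illustration, assuming each ORF's homology hits are stored as small dictionaries; the field names, pattern names and ORF identifiers are hypothetical and do not reflect PolyFARM's actual data format.

```python
def propositionalise(facts_by_orf, patterns):
    """Build one boolean row per ORF, one column per pattern (sorted by name)."""
    names = sorted(patterns)
    table = {orf: [int(patterns[name](facts)) for name in names]
             for orf, facts in facts_by_orf.items()}
    return names, table


# Hypothetical patterns over a list of homology hits.
PATTERNS = {
    # hom(SPID, close) ^ sq_len(SPID, short) ^ classification(SPID, ecoli):
    # all three conditions must hold for the SAME homolog SPID.
    "patt1": lambda hits: any(h["similarity"] == "close"
                              and h["length"] == "short"
                              and h["organism"] == "ecoli" for h in hits),
    # hom(SPID, close): some homolog is close.
    "patt2": lambda hits: any(h["similarity"] == "close" for h in hits),
}

if __name__ == "__main__":
    facts = {
        "orf_a": [{"similarity": "close", "length": "short", "organism": "ecoli"}],
        "orf_b": [{"similarity": "close", "length": "long", "organism": "ecoli"},
                  {"similarity": "far", "length": "short", "organism": "ecoli"}],
    }
    names, table = propositionalise(facts, PATTERNS)
    print(names)
    for orf, row in table.items():
        print(orf, row)
```

The resulting table has a fixed number of columns, so a propositional learner such as C4.5 can be run on it directly.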
Dichotomic Search 1

As an alternative to the WARMR data-mining
approach, we developed a frequent pattern finding
method based on dichotomic search.

This approach uses domain-specific logics as
intermediates between propositional logic and
predicate logic.
Dichotomic Search 2

Most existing algorithms traverse the search space in
either a top-down or a bottom-up fashion. We
propose a new approach based on dichotomic search
which explores the search space in both directions,
allowing larger steps.

Dichotomic search combines completeness (w.r.t.
concepts), non-redundancy, and flexibility.

Ferre, S. & King, R.D. (2005). Fundamenta Informaticae
C4.5
Open source decision tree algorithm
•propositional learning
•commonly used
•produces interpretable rules
•reliable
•fast
•accurate
Made modifications for:
•multiple labels
•hierarchical labels
[Clare & King Bioinformatics 2002]
[Diagram: example decision tree]
aa_ratio_pair_p_y
  > 0.232: metabolism
  <= 0.232: strand
    w: transcription
    c: aa_rat_a
      <= 6.4: cell fate
      > 6.4: transport
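The slide does not say how C4.5 was modified for multiple labels. One formulation found in the multi-label decision tree literature replaces the usual entropy with a sum over classes of the entropy of "has this class" versus "does not have this class". The sketch below shows that formulation as an assumption, not as the authors' exact implementation; the function names are hypothetical.

```python
import math


def multilabel_entropy(label_sets, classes):
    """Sum over classes of -(p log2 p + q log2 q), with p the fraction of
    examples carrying the class and q = 1 - p. label_sets must be non-empty."""
    n = len(label_sets)
    total = 0.0
    for c in classes:
        p = sum(c in labels for labels in label_sets) / n
        for x in (p, 1.0 - p):
            if x > 0.0:
                total -= x * math.log2(x)
    return total


def information_gain(examples, classes, test):
    """Entropy reduction from splitting (features, label_set) pairs on a boolean test."""
    gain = multilabel_entropy([labels for _, labels in examples], classes)
    for outcome in (True, False):
        subset = [labels for feats, labels in examples
                  if bool(test(feats)) == outcome]
        if subset:
            gain -= len(subset) / len(examples) * multilabel_entropy(subset, classes)
    return gain


if __name__ == "__main__":
    examples = [
        ({"aa_rat_a": 7.0}, {"transport"}),
        ({"aa_rat_a": 5.0}, {"cell fate", "metabolism"}),
    ]
    classes = ["transport", "cell fate", "metabolism"]
    print(information_gain(examples, classes, lambda f: f["aa_rat_a"] > 6.4))
```

With single-label data and two classes this reduces to twice the ordinary binary entropy, so the choice of split is unchanged in that case; the difference only matters when examples carry several labels.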
Results


Many rules from each data type
Rules at each level of hierarchy

Some classes are much easier to predict than others
(for example "protein synthesis" at 71-93%, "energy"
at 20-47%)

Good levels of accuracy on held out test data

Many predictions for ORFs of unknown function
(some function at some level is predicted for 96% of
the ORFs of unknown function)

Some rules explainable by biology -> scientific
knowledge discovery
Clare & King (2003) Bioinformatics suppl. 2., 42-49
Accuracy Table
Datatype   Level 1   Level 2   Level 3   Level 4   all
Seq        55        55        33        0         71
Struc      49        43        0         0         58
Hom        65        38        69        20        55
Expr       42        37        35        0         75
Phen       75        40        7         0         68
Expression Data Rule
If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
and in the micro-array experiment (YPD stationary phase) the ORF expression is > 1.06
then the function of this ORF is
"pheromone response, mating type determination, sex-specific proteins"
Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
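A rule of this kind is simply a conjunction of threshold tests, and its accuracy is the fraction of covered ORFs whose known class matches the prediction. A minimal sketch, with the thresholds taken from the rule above and hypothetical dictionary keys for the three experiments:

```python
def pheromone_rule(expr):
    """Return True if an ORF's expression values satisfy the rule above.

    expr maps (hypothetical) experiment names to expression values.
    """
    return (expr["sorbitol_incubation"] > -0.25
            and expr["nitrogen_depletion"] <= -1.29
            and expr["ypd_stationary_phase"] > 1.06)


def rule_accuracy(correct, covered):
    """Percentage of covered ORFs for which the predicted class was right."""
    return 100.0 * correct / covered


if __name__ == "__main__":
    orf = {"sorbitol_incubation": 0.10,
           "nitrogen_depletion": -1.50,
           "ypd_stationary_phase": 1.20}
    print(pheromone_rule(orf))
    print(round(rule_accuracy(11, 12)), round(rule_accuracy(3, 4)))
```

The small counts (12 training and 4 test ORFs covered) show why accuracy on held-out data has to be read together with the number of examples behind it.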
Structure Rule
If true: coil (of length 3) followed by alpha (10 <= length < 14)
and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
and true: coil (of length 3) followed by alpha (3 <= length < 6)
and false: coil followed by beta followed by coil (c-b-c)
and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is
"mitochondrial transport"
• 80% accurate on test data
• Most matching ORFs belong to the Mitochondrial Carrier Family
• These have 6 long transmembrane alpha-helices of about 20-30
  amino acids
• Why do we notice alpha-helices of length 10-14?
Alignment
YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255
YJL133W SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
YMR166C
Alignment
YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251
YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241
YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310
YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271
YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250
YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246
YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261
YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239
YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300
YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242
YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302
YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255
YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310
YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300
YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364
YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325
YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310
YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303
YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312
YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289
YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360
YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293
YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359
YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305
Homology rule
If the ORF is not weakly homologous to a protein in klebsiella
and is strongly homologous to a protein in desulfurococcales
and is strongly homologous to a short protein in cyprinidae
then the function of this ORF is
"Protein fate (folding, modification, destination)"
• This rule is 100% accurate on test data
• Almost all matching ORFs are from the 20S proteasome
  subunit for degradation of proteins
• These subunits exist in archaea and eukaryotes, but only
  in one specific branch of bacteria (actinomycetes).
Application of DMP
to Bacterial Genomes

Successful for both M. tuberculosis and E. coli.

Of the ORFs with no assigned function >40% were
predicted to have a function at one or more levels of
the class hierarchy.

It was found that many of the predictive rules were
more general than possible using sequence
homology.
References
King et al. (2000) KDD 2000
King et al. (2000) Yeast (Comparative and Functional Genomics)
King et al. (2001) Bioinformatics
Example Rule (level 2 E. coli)
If the ORF is not predicted to have a b-strand of length  3 
a homologous protein from class Chytridiomycetes was found
Then its functional class is “Cell processes, Transport/binding
proteins”
12/13 (86%) correct on Test Set - probability of this
result occurring by chance is estimated at 4 x 10^-7.
24 ORFs of unknown function are predicted by the rule.
16 ORFs now with putative or
confirmed function - 93.8% accurate
predictions
Experimental Confirmation

The original bacterial ORF predictions were made
over three years ago.

In the intervening time many more ORFs have been
sequenced, making traditional homology-based prediction
methods more accurate and sensitive, and the
functions of some ORFs have been determined by wet
biology.

The E. coli genome has been re-annotated by
Monica Riley’s group.
“Wet” Biology Confirmation

A number of predictions have been confirmed or
falsified by new “wet” experimental data.

This new data is biased towards hard classes.
Despite this the results are still good:
– Level 2: 23 predictions - 47.8% accuracy
– Level 3: 23 predictions - 43.4% accuracy
This is very much better than random as there
are many classes.
Confirmation of “Wet” Predictions
ORF    Rule  Predicted Class                 Confirmed Function                                                   Result
b0805  8     Cell envelope                   Outer membrane protein                                               C
b1519  15    Degradation of small molecules  Trans-aconitate methyltransferase                                    C
b1533  43    Transport/binding proteins      Cysteine pathway metabolite transport                                C
b1981  42    Transport/binding proteins      Shikimate and dehydroshikimate transport protein                     C
b1981  56    Transport/binding proteins      Shikimate and dehydroshikimate transport protein                     C
b2210  15    Degradation of small molecules  Malate:quinone oxidoreductase                                        C
b2392  43a   Transport/binding proteins      High-affinity manganese transporter                                  C
b2392  43b   Transport/binding proteins      High-affinity manganese transporter                                  C
b2392  54    Transport/binding proteins      High-affinity manganese transporter                                  C
b2924  45    Transport/binding proteins      Component of the MscS mechanosensitive channel - “new gene family”   C
b3839  43    Transport/binding proteins      Essential component of translocase                                   C
b0103  42    Transport/binding proteins      dephospho-CoA kinase                                                 W
b0103  41    Transport/binding proteins      dephospho-CoA kinase                                                 W
b0103  43    Transport/binding proteins      dephospho-CoA kinase                                                 W
b1822  15    Degradation of small molecules  23S rRNA m1G745 methyltransferase                                    W
b2530  35    Global regulatory functions     cysteine desulfurase                                                 W
b2392  14    Degradation of small molecules  High-affinity manganese transporter                                  W
b2889  50    Energy metabolism carbon        Isopentenyl diphosphate isomerase                                    W
b3222  54    Transport/binding proteins      ManNAc kinase                                                        W
b3223  39    Ribosome constituents           ManNAc epimerase                                                     W
b3337  28    Laterally acquired elements     regulatory or redox component                                        W
b3338  39    Ribosome constituents           Periplasmic endochitinase                                            W
b3569  32    Laterally acquired elements     transcriptional regulator of xylose utilization                      W
b3955  8     Cell envelope                   Required for invasion of brain microvascular endothelial cells       EF
b3955  18    Energy metabolism carbon        Required for invasion of brain microvascular endothelial cells       EA
b3955  20    Energy metabolism carbon        Required for invasion of brain microvascular endothelial cells       EA
Extension to Arabidopsis Genome

Collaborative project with the Institute of Grassland
and Environmental Research and the University of
Nottingham.

Large increase in data: 6,000 (yeast) -> 25,000 ORFs.

Large amount of micro-array data from the Nottingham
Arabidopsis stock centre.

The increase in data (hundreds of megabytes) is a
challenge to our machine learning algorithms.
Clare, A., Karwath, A., Ougham, H. and King, RD (2006) Functional
Bioinformatics for Arabidopsis thaliana. Bioinformatics 2006 22: 1130-1136;
Results

Accuracy comparable to yeast and bacteria

Large fraction of genes of currently unknown function
are predicted.

Some rules could be interpreted in terms of known
biology
Clare, A., Karwath, A., Ougham, H. and King, RD (2006) Functional
Bioinformatics for Arabidopsis thaliana. Bioinformatics 2006 22: 1130-1136;
Gibberellin Biosynthesis Prediction





Gibberellin is an important plant hormone.
Chosen because of interesting phenotypes – often
extreme size.
Insertion of a promoter to overproduce gene product.
Result
– 2 days earlier flowering
– Average leaf number and weight increased at 21
days.
This phenotype is consistent with prediction.
Leaf number increases more rapidly in the mutant
(yellow bars) than in wildtype Landsberg erecta
(blue bars)
[Bar chart: number of leaves (y-axis, 0 to 18) against days after sowing (x-axis: 21, 24, 28, 31, 34)]
Paclobutrazol (P) (inhibitor of gibberellin) abolishes
the difference between mutant (M) and wildtype (L)
C = control
[Bar chart: "Average Leaf number at 21 days Expt 4"; treatments LC, MC, LP and MP on the x-axis, leaf number from 0.0 to 8.0 on the y-axis]
Availability
All predictions available at http://www.genepredictions.org
All rules and data available at
http://www.aber.ac.uk/compsci/Research/bio/dss/
ILP 2005 Challenge 1

Yeast function prediction data used as a community
challenge: http://www.protein-logic.com/

The intention of the challenge was to provide a real-world
data set to test how far we have progressed
in the field of ILP and multi-relational data mining.
The questions we wanted to answer were: Are the
tools up to the job? Do they scale? Do they handle
noisy, sparse and complex data?
ILP 2005 Challenge 2
A. J. Knobbe, E. K. Y. Ho, R. Malik: ILP Challenge 2005: The
Safarii MRDM environment.
C. Perlich: Approaching the ILP 2005 challenge: Class-Conditional
Bayesian Propositionalization for Genetic Classification.
J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski, H. Blockeel:
Applying Predictive Clustering Trees to the Inductive Logic
Programming 2005 Challenge Data.
F. Riguzzi: A Simple Approach to a Multi-Label Classification
Problem.
Propositional Approach

Zafer Barutcuoglu, Robert E. Schapire and Olga G.
Troyanskaya. Hierarchical multi-label prediction of
gene function. Bioinformatics (in press)

Hierarchy of SVMs.
Uses a Bayesian net to combine predictions.

Conclusions
• Data mining and machine learning are powerful
  tools for functional genomics.
• The DMP method can be successfully applied to
  different genomes (bacterial, yeast, Arabidopsis) to
  predict gene functional class.
• Micro-array data is a useful component in DMP.
• Biological insight can be extracted from DMP rules.
• The structure of gene prediction problems makes
  them an exciting test bed for machine learning
  methods.
Acknowledgements




Amanda Clare      Aberystwyth
Andreas Karwath   Freiburg (Aberystwyth)
Luc Dehaspe       PharmaDM
Helen Ougham      IGER
BBSRC
The Need for Logic to Represent
Scientific Knowledge

Logic is the best understood way to represent
knowledge.

Traditional statistics, machine learning, and data
mining are based on propositional logic.

For some problems we require a richer description
language, i.e. first-order predicate calculus.

Using logic programming (predicate calculus) we can
incorporate deduction, abduction, and induction.
Inductive Logic Programming

Inductive Logic Programming (ILP) uses logic
programs (first-order predicate calculus) to
describe examples, theories, and background
knowledge, and learns within that representation.

For certain types of problem ILP is a powerful data
analysis technique - more accurate, and more
comprehensible results than conventional methods.

Has been successfully applied to a number of
biological/chemical problems.
ILP for Science

The key advantage of ILP for scientific applications is
that it allows the application of compact relational
representations that are natural for scientists to use.
This allows domain understandable rules to be
automatically formed.

This advantage comes at a computational cost.
However, non-technical reasons are probably the
greatest barrier to adoption of ILP. For example, it is
very difficult to explain the benefits of ILP to domain
experts.
Prediction of Lethality

Instead of using microarray data to predict the
functional class of a gene, we have been using the
same approach to predict whether a gene knock-out
will be lethal (when grown in a rich medium).
If false: the function of the ORF is cell cycle
and true: the function of the ORF is rRNA transcription
and in the micro-array experiment (cell cycle) the ORF expression is > -0.79
then the knockout is lethal.
Example Rule: Test accuracy 82% (Default 21%).
Summary Results

Using voting (2 or more rules agree on a prediction)
– Level 2 :128 ORFs predicted - 87.5% accuracy
– Level 3 : 23 ORFs predicted - 91.3% accuracy

All predictions
– Level 2 :335 ORFs predicted - 64.5% accuracy
– Level 3: 204 ORFs predicted - 44.6% accuracy
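The voting scheme can be sketched as counting, for each ORF, how many rules predict each class and keeping only the classes that reach the threshold. This is a minimal illustration, assuming each rule firing is recorded as an (ORF, predicted class) pair; the function name and data layout are hypothetical.

```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Keep (ORF, class) predictions made by at least min_votes rule firings.

    predictions: iterable of (orf, predicted_class), one entry per rule that fired.
    Returns a dict mapping each ORF to the set of classes that passed the vote.
    """
    counts = Counter(predictions)
    accepted = {}
    for (orf, cls), n in counts.items():
        if n >= min_votes:
            accepted.setdefault(orf, set()).add(cls)
    return accepted


if __name__ == "__main__":
    fired = [
        ("b2392", "Transport/binding proteins"),
        ("b2392", "Transport/binding proteins"),
        ("b2392", "Degradation of small molecules"),
        ("b3222", "Transport/binding proteins"),
    ]
    print(vote(fired))
```

Requiring agreement trades coverage for accuracy, which matches the pattern in the figures above: fewer ORFs receive a prediction under voting, but a larger share of those predictions is correct.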