Hierarchical multilabel classification trees for gene function prediction

Leander Schietgat, Hendrik Blockeel, Jan Struyf
Katholieke Universiteit Leuven (Belgium)

Amanda Clare
University of Aberystwyth (Wales)

Sašo Džeroski
Jožef Stefan Institute, Ljubljana (Slovenia)

Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB)
Tuusula, Finland, 17-18 June 2006
Overview

- The application: gene function prediction
- The machine learning context: hierarchical multilabel classification (HMC)
- Decision trees for HMC: the Clus-HMC algorithm
- Experimental results
- Conclusions
Gene Function Prediction

Task
- Given a data set with descriptions of genes and the functions they have,
- learn a model that can predict for a new gene what functions it performs.

Two properties of the task
- Genes can have multiple functions.
- These functions are hierarchically organised.

[Figure: example class hierarchy with classes c1, c2, c3, where c21 and c22 are subclasses of c2]
Machine Learning

Classifier
- predicts for unseen instances the class to which they belong
- is learned from already classified training examples

Different techniques
- decision trees
- support vector machines
- Bayesian networks
- …
Hierarchical Multilabel Classification

Normal classification setting
- only predicts a single class

HMC
- predicts multiple classes at once
- classes are organized in a hierarchy

Hierarchy constraint
- instances of a class must also be instances of its superclasses (see the sketch below)
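To make the hierarchy constraint concrete, here is a minimal sketch (not from the talk) that closes a gene's label set under its superclasses; the toy parent map and class names are assumptions for illustration.

```python
# Sketch: enforcing the hierarchy constraint by closing a label set under
# its superclasses. The parent map is a toy hierarchy (c21 and c22 are
# subclasses of c2), not the real FunCat hierarchy.

PARENT = {"c21": "c2", "c22": "c2", "c1": None, "c2": None, "c3": None}

def close_under_ancestors(labels):
    """Return the label set extended with all of its superclasses."""
    closed = set()
    for c in labels:
        while c is not None:
            closed.add(c)
            c = PARENT.get(c)
    return closed

print(sorted(close_under_ancestors({"c21"})))        # ['c2', 'c21']
print(sorted(close_under_ancestors({"c22", "c3"})))  # ['c2', 'c22', 'c3']
```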
Two HMC approaches

1. Learn a model for each class and combine the predictions
   (models m1, m2, …, mn separately answer c1?, c2?, …, cn?)

Advantage
- many machine learning algorithms are available for this setting

Disadvantages
- efficiency (one model per class has to be learned)
- skewed class distributions
- the hierarchical relationships between the classes are not exploited
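A minimal sketch of this first approach, using scikit-learn decision trees as a stand-in base learner on made-up data: one binary model is trained per class, and the per-class outputs are simply concatenated (the combination step of Barutcuoglu et al. is not shown).

```python
# Sketch of approach 1: one binary classifier per class, predictions
# combined afterwards. scikit-learn trees and the toy data are stand-ins,
# not the classifiers or yeast data used in the talk.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 29], [32, 40], [19, 0], [44, 45]])  # expression features
Y = np.array([[1, 0, 0],                               # one column per class
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])

models = [DecisionTreeClassifier().fit(X, Y[:, j]) for j in range(Y.shape[1])]

x_new = np.array([[30, 35]])
prediction = [int(m.predict(x_new)[0]) for m in models]  # [c1?, c2?, c3?]
print(prediction)
```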
Two HMC approaches (continued)

2. Learn a single model M that predicts all the classes [c1, c2, …, cn] together

Advantages
- faster to learn
- easier to interpret
- hierarchy constraint automatically imposed
- selection of features relevant for all classes

Disadvantage
- may have worse predictive performance
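For contrast, a sketch of the second approach on the same toy data: a single tree predicts the whole class vector at once. scikit-learn's multi-output trees are used purely for illustration; Clus-HMC uses its own variance-based heuristic, sketched further on.

```python
# Sketch of approach 2: one model predicts the whole class vector.
# scikit-learn's multi-output trees stand in for Clus-HMC here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 29], [32, 40], [19, 0], [44, 45]])
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]])

model = DecisionTreeClassifier().fit(X, Y)   # a single model for all classes
print(model.predict([[30, 35]]))             # one vector [c1, c2, c3] per gene
```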
Related work on HMC

Gene function prediction
- Barutcuoglu et al. (2006): learn the classes separately with SVMs and combine the predictions with Naïve Bayes
- Clare (2003): extension of the C4.5 decision tree method that learns all classes together

Text classification
- a lot of work in this area
- Rousu et al. (2005) give an overview of SVM methods that learn a single model for all classes

             | Gene function prediction | Text classification
Approach 1   | Barutcuoglu et al.       | …
Approach 2   | Clare                    | …
Why decision trees?

Training examples:

Gene   ND   HS   …   MF?
G1     25   29   …   −
G2     32   40   …   +
G3     19    0   …   −
G4     44   45   …   +
…      …    …   …   …

[Figure: decision tree learned from such examples. Root test "Nitrogen depletion <= -2.74?"; the "yes" branch leads to a Positive leaf, the "no" branch to the test "Heat shock > 1.28?", whose "yes" branch is Positive and whose "no" branch is Negative.]

Decision trees are
- fast to build
- fast to use
- accurate in their predictions
- easy to interpret
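Read as code, the example tree is just two nested tests; the branch layout below follows the reconstruction given above and is illustrative only.

```python
# Sketch: the example tree written out as nested tests. The thresholds and
# branch layout follow the slide's figure as reconstructed above.

def predict_mf(nitrogen_depletion, heat_shock):
    """Predict whether a gene has the target function from two features."""
    if nitrogen_depletion <= -2.74:
        return "Positive"
    if heat_shock > 1.28:
        return "Positive"
    return "Negative"

print(predict_mf(-3.0, 0.5))  # Positive (left branch of the root)
print(predict_mf(0.1, 2.0))   # Positive (heat shock test fires)
print(predict_mf(0.1, 0.5))   # Negative
```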
Decision trees for HMC

The Clus system
- created by Jan Struyf
- propositional decision tree learner, implemented in Java
- uses ideas of C4.5 [Quinlan93] and CART [Breiman84]
- based on Predictive Clustering Trees [Blockeel98]

Heuristic for HMC
- look for the test that minimizes the intra-cluster variance (a generalisation of CART; see the sketch below)
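A minimal sketch of this heuristic, assuming each gene's class set is encoded as a 0/1 vector: the variance of a node is the mean squared distance of its class vectors to their mean, and the chosen test is the one that most reduces it. The depth-based class weighting used in Clus-HMC is omitted, and the data is a toy example.

```python
# Sketch of the HMC split heuristic: pick the test on a numeric feature
# that maximises the reduction in intra-cluster variance of the 0/1 class
# vectors. (Clus-HMC additionally weights classes by hierarchy depth.)
import numpy as np

def variance(class_vectors):
    """Mean squared Euclidean distance of the class vectors to their mean."""
    if len(class_vectors) == 0:
        return 0.0
    v = np.asarray(class_vectors, dtype=float)
    return float(np.mean(np.sum((v - v.mean(axis=0)) ** 2, axis=1)))

def best_split(feature_values, class_vectors):
    """Return (threshold, variance reduction) of the best split on one feature."""
    x = np.asarray(feature_values)
    y = np.asarray(class_vectors, dtype=float)
    total = variance(y)
    best_t, best_gain = None, 0.0
    for t in sorted(set(x))[:-1]:                       # candidate thresholds
        left, right = y[x <= t], y[x > t]
        remaining = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        if total - remaining > best_gain:
            best_t, best_gain = t, total - remaining
    return best_t, best_gain

# toy data: 4 genes, one expression feature, 3 classes
x = [0.2, -3.1, 0.5, -2.9]
Y = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
print(best_split(x, Y))   # (-2.9, 0.75): the split separates the two label patterns
```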
Decision trees for HMC (continued)

- Clus can be used for HMC (Clus-HMC), learning one tree that predicts complete class sets …
- … as well as for single-class binary classification, one tree per class (Clus-SC, comparable to CART)

[Figure: a single HMC tree whose leaves contain class sets such as {c1}, {c1, c2, c21}, {c1, c3}, {c2, c21, c22}, {c1, c21, c22}, versus n separate binary trees 1, 2, …, n answering c1?, c2?, …, cn?]
Experiments in yeast functional genomics

Organism
- Saccharomyces cerevisiae (baker's/brewer's yeast)

Class hierarchy
- MIPS FunCat hierarchy: 250 functions of yeast genes
- e.g. 1 METABOLISM, 1/1 amino acid metabolism, 1/2 nitrogen and sulfur metabolism, 2 ENERGY, 2/1 glycolysis and gluconeogenesis

12 datasets [Clare03]
- Sequence (seq)
- Phenotype growth (pheno)
- Secondary structure (struc)
- Homology search (hom)
- Microarray data: cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)
Example run

[Figure: six training genes (G1-G6), each with a name, a description of expression features, and a set of FunCat classes such as {1, 5}, {5, 5/1, 40}, {1, 5, 5/1, 3, 3/5}, {40, 40/3, 40/16}, {5, 5/1, 40, 40/3}, {40, 40/16}. The learned tree splits on tests such as nitrogen_depletion > 5 and 37C_to_25C_shock > 1.28, and each leaf collects the class sets of the genes sorted into it.]
Predictions

- Each leaf contains multiple functions: for every class, the leaf stores the proportion of its training genes that have that class.
- Which classes should a leaf predict? Problem: the classes have very different frequencies.
- Use a threshold p: predict a class if its frequency in the leaf is at least p (see the sketch below).
- Precision-recall curves summarise performance independently of a specific threshold.

Example: predicted class sets of three leaves at increasing thresholds
- p = 0%: {40, 40/3, 40/16}, {5, 5/1, 40, 40/3}, {1, 5, 5/1, 3, 3/5}
- p = 50%: {40, 40/3, 40/16}, {5, 5/1, 40}, {1, 5}
- p = 100%: {40, 40/16}, {5, 5/1, 40}, {1, 5}
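A minimal sketch of this thresholding step, with made-up leaf frequencies:

```python
# Sketch: turning a leaf's class frequencies into a predicted class set by
# thresholding. The frequencies below are invented for illustration.

def predict_classes(class_frequencies, p):
    """Return the classes whose frequency in the leaf is at least p."""
    return {c for c, f in class_frequencies.items() if f >= p}

leaf = {"40": 1.0, "40/3": 0.8, "40/16": 0.6, "5": 0.2}

for p in (0.0, 0.5, 1.0):
    print(p, sorted(predict_classes(leaf, p)))
# p = 0.0 keeps every class seen in the leaf; p = 1.0 keeps only the
# classes shared by all genes in the leaf.
```

Because a subclass can never be more frequent in a leaf than its superclass, thresholding the frequencies automatically respects the hierarchy constraint.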
Comparison of Clus-HMC with [Clare03]

- Precision = the proportion of (instance, class) predictions that is correct
- Recall = the proportion of true (instance, class) cases that is predicted

[Figure: average precision-recall curves]
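These definitions can be computed directly over (instance, class) pairs; a sketch with toy predictions (not the paper's results):

```python
# Sketch: precision and recall over (instance, class) pairs.
# true_sets[i] and pred_sets[i] are the true and predicted class sets of gene i.

def precision_recall(true_sets, pred_sets):
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correct predictions
    n_pred = sum(len(p) for p in pred_sets)                     # predictions made
    n_true = sum(len(t) for t in true_sets)                     # true pairs
    precision = tp / n_pred if n_pred else 1.0
    recall = tp / n_true if n_true else 1.0
    return precision, recall

true_sets = [{"40", "40/3"}, {"1", "5"}]
pred_sets = [{"40", "40/3", "40/16"}, {"1"}]
print(precision_recall(true_sets, pred_sets))   # (0.75, 0.75)
```

Varying the threshold p and recomputing these numbers traces out the precision-recall curve.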
Extracting rules

Example: predictions for class 40/3 in the "gasch1" dataset

IF   Nitrogen_Depletion_8_h <= -2.74
AND  Nitrogen_Depletion_2_h > -1.94
AND  1point5_mM_diamide_5_min > -0.03
AND  1M_sorbitol___45_min_ > -0.36
AND  37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3

Precision: 0.97, Recall: 0.15
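Read as code, the extracted rule is a conjunction of threshold tests over a gene's expression profile. The sketch below keeps the slide's attribute names as dictionary keys; the example expression values are invented.

```python
# Sketch: the extracted rule for class 40/3 as a predicate over a gene's
# expression profile (a dict from condition name to measured value).

def rule_40_3(gene):
    """Fire iff every condition of the extracted rule holds."""
    return (gene["Nitrogen_Depletion_8_h"] <= -2.74
            and gene["Nitrogen_Depletion_2_h"] > -1.94
            and gene["1point5_mM_diamide_5_min"] > -0.03
            and gene["1M_sorbitol___45_min_"] > -0.36
            and gene["37C_to_25C_shock___60_min"] > 1.28)

example_gene = {"Nitrogen_Depletion_8_h": -3.0, "Nitrogen_Depletion_2_h": -1.0,
                "1point5_mM_diamide_5_min": 0.1, "1M_sorbitol___45_min_": 0.0,
                "37C_to_25C_shock___60_min": 1.5}
print(rule_40_3(example_gene))   # True -> predict classes 40 and 40/3
```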
HMC vs. single classification

Tree sizes (on average)
- HMC tree: 24 nodes
- SC tree: 33 nodes (and 250 such trees are needed, one per class)

Time to grow trees
- a single SC tree is grown faster than the single HMC tree,
- but 250 single-class trees have to be built
- overall, HMC is on average 37 times faster

Predictive performance
- see next slide
HMC vs. single classification (continued)

[Figure: average precision-recall curves]
Explanation of the results

The classes are not independent.

- Different trees for different classes actually share structure
  - this explains some of the complexity reduction achieved by Clus-HMC
- One class carries information on other classes; this
  - increases the signal-to-noise ratio
  - provides better guidance when learning the tree (explaining the good predictive performance)
  - avoids overfitting (explaining a further reduction of tree size)
- This was confirmed empirically
Conclusions

HMC decision trees are a useful tool for gene function prediction
- fast to learn
- high interpretability

Compared to regular tree learning, HMC tree learning
- is even faster
- yields trees that
  - are smaller
  - are easier to interpret
  - have equal or better predictive performance
Further work

Comparison to other HMC learning algorithms
- kernel methods studied by Rousu et al. and Barutcuoglu et al.
- other suggestions are welcome!

Use a more advanced hierarchy such as the Gene Ontology
- thousands of classes, spread over 19 levels
- how to handle the part_of relationship?
  - if a function A is part_of a function B, does a gene with function A also have function B?
  - gene "has" function B vs. gene "is involved in" function B
Questions?