Transcript Talk Slides
Experiment Databases:
Towards better experimental research in
machine learning and data mining
Hendrik Blockeel
Katholieke Universiteit Leuven
Motivation
Much research in ML / DM involves
experimental evaluation
Interpreting results is more difficult than it may
seem
Typically, a few specific implementations of
algorithms, with specific parameter settings, are
compared on a few datasets, and then general
conclusions are drawn
How generalizable are these results really?
There is evidence that overly general conclusions are
often drawn
E.g., Perlich & Provost: different relative performance of
techniques depending on size of dataset
Very sparse evidence
[Figure: a few points (x) scattered over the algorithm parameter space (AP) x dataset space (DS); a few points in an N-dimensional space, where N is very large: very sparse evidence!]
An improved methodology
We here argue in favour of an improved
experimental methodology:
Perform many more experiments
Store results in an “experiment database”
Better reproducibility
Mine that database for patterns
Better coverage of the algorithm-dataset space
More advanced analysis possible
The approach shares characteristics of inductive
databases:
The database will be mined for specific kinds of
patterns: inductive queries, constraint-based mining
Classical setup of experiments
Currently, performance evaluations of algorithms
rely on few specific instantiations of algorithms
(implementations, parameters), tested on few
datasets (with specific properties), often
focusing on specific evaluation criteria, and with
a specific research question in mind
Disadvantages:
Limited generalisability (see before)
Limited reusability of experiments
If we want to test another hypothesis, we need to run new
experiments with a different setup, recording different
information
Setup of an experiment database
The ExpDB is filled with results from random
instantiations of algorithms, on random datasets
Algorithm parameters, dataset properties are
recorded
Performance criteria are measured and stored
These experiments cover the whole DS × AP space
[Diagram: choose algorithm (CART, C4.5, Ripper, ...) -> choose parameters (leaf size > 2, heuristic = gain, ...) -> generate dataset (#examples = 1000, #attr = 20, ...) -> run -> store algorithm parameters, dataset properties, and results]
Setup of an experiment database
When experimenting with 1 learner, e.g.,
C4.5:
Algorithm parameters   Dataset characteristics       Performance
MLS    heur            Ex     Attr   Compl   ...     TP    FP    RT    ...
2      gain            1000   20     17      ...     350   65    17    ...
...    ...             ...    ...    ...             ...   ...   ...
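As a concrete sketch, such a single-learner table could be created and filled as follows (column names follow the example above; the types are assumptions):

CREATE TABLE ExpDB (
  MLS   INTEGER,  -- algorithm parameter: minimal leaf size
  heur  TEXT,     -- algorithm parameter: split heuristic
  Ex    INTEGER,  -- dataset characteristic: number of examples
  Attr  INTEGER,  -- dataset characteristic: number of attributes
  Compl REAL,     -- dataset characteristic: complexity measure
  TP    INTEGER,  -- performance: true positives
  FP    INTEGER,  -- performance: false positives
  RT    REAL      -- performance: runtime
);

-- one experiment = one row, e.g. the example row shown above
INSERT INTO ExpDB (MLS, heur, Ex, Attr, Compl, TP, FP, RT)
VALUES (2, 'gain', 1000, 20, 17, 350, 65, 17);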
Setup of an experiment database
When experimenting with multiple learners:
A more complicated setting, which will not be considered in detail here
ExpDB
Alg   Inst   PI      Ex     Attr   Compl  ...   TP     FP    RT   ...
DT    C4.5   C45-1   1000   20     17     ...   1000   20    17   ...
DT    CART   CA-1    2000   50     12     ...   1000   20    17   ...
...

C4.5ParInst
PI      MLS   heur
C45-1   2     gain
...     ...   ...

CART-ParInst
PI     BS    heur
CA-1   yes   Gini
...    ...   ...
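A possible relational schema for this multi-learner case, as a sketch only (the table names C45_ParInst and CART_ParInst are hypothetical spellings of the parameter-instantiation tables shown above):

CREATE TABLE C45_ParInst (
  PI   TEXT PRIMARY KEY,  -- parameter instantiation id, e.g. 'C45-1'
  MLS  INTEGER,           -- minimal leaf size
  heur TEXT               -- split heuristic
);

CREATE TABLE CART_ParInst (
  PI   TEXT PRIMARY KEY,  -- parameter instantiation id, e.g. 'CA-1'
  BS   TEXT,              -- binary splits (yes/no)
  heur TEXT               -- split heuristic
);

CREATE TABLE ExpDB (
  Alg   TEXT,     -- algorithm family, e.g. 'DT'
  Inst  TEXT,     -- algorithm instance, e.g. 'C4.5'
  PI    TEXT,     -- refers to the parameter-instantiation table of that instance
  Ex    INTEGER,  -- dataset characteristics ...
  Attr  INTEGER,
  Compl REAL,
  TP    INTEGER,  -- performance ...
  FP    INTEGER,
  RT    REAL
);

-- Example: all C4.5 runs together with their parameter settings
SELECT e.Ex, e.Attr, p.MLS, p.heur, e.TP, e.FP, e.RT
FROM ExpDB e JOIN C45_ParInst p ON e.PI = p.PI
WHERE e.Inst = 'C4.5';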
Experimental questions and
hypotheses
Example questions:
What is the effect of parameter X on runtime?
What is the effect of the number of examples in the
dataset on TP and FP?
....
With classical methodology:
Different sets of experiments needed for each
(Unless all questions are known in advance and the
experiments are designed to answer all of them)
ExpDB approach:
Just query the ExpDB table for the answer
New question = 1 new query, not new experiments (see the example query below)
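For example, the second question above can be answered with one query (a sketch, assuming the number of examples is stored in a column Ex as in the single-learner table):

SELECT Ex, AVG(TP) AS AvgTP, AVG(FP) AS AvgFP
FROM ExpDB
GROUP BY Ex
ORDER BY Ex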
Inductive querying
To find the right patterns in the ExpDB, we
need a suitable query language
Many queries can be answered with standard
SQL, but (probably) not all (easily)
We illustrate this with some simple
examples
Investigating a simple effect
The effect of #Items on Runtime for
frequent itemset algorithms
SELECT NItems, Runtime
FROM ExpDB
ORDER BY NItems

SELECT NItems, AVG(Runtime)
FROM ExpDB
GROUP BY NItems
ORDER BY NItems
[Plot: Runtime (y-axis) against NItems (x-axis), one point per returned (NItems, Runtime) pair]
Investigating a simple effect
Note:
Setting all parameters randomly creates more
variance in the results
In the classical approach, these other parameters would
simply be kept constant
This leads to clearer, but possibly less generalisable results
This can be simulated easily in the ExpDB setting!
+ : condition is explicit in the query
- : we use only a part of the ExpDB
So the ExpDB needs to contain many experiments
SELECT NItems, Runtime
FROM ExpDB
WHERE MinSupport=0.05
ORDER BY NItems
Investigating interaction of effects
E.g., does the effect of NItems on Runtime change
with MinSupport and NTrans?
FOR a = 0.01, 0.02, 0.05, 0.1 DO
  FOR b = 10^3, 10^4, 10^5, 10^6, 10^7 DO
    PLOT
      SELECT NItems, Runtime
      FROM ExpDB
      WHERE MinSupport = $a AND
            $b <= NTrans AND NTrans < 10*$b
      ORDER BY NItems
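The looping can also be folded into a single grouped query that bins NTrans by order of magnitude (a sketch, assuming a SQL dialect with FLOOR and LOG10, e.g. MySQL):

SELECT MinSupport,
       FLOOR(LOG10(NTrans)) AS NTransMagnitude,
       NItems,
       AVG(Runtime) AS AvgRuntime
FROM ExpDB
WHERE MinSupport IN (0.01, 0.02, 0.05, 0.1)
GROUP BY MinSupport, FLOOR(LOG10(NTrans)), NItems
ORDER BY MinSupport, NTransMagnitude, NItems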
Direct questions instead of repeated
hypothesis testing (“true” data mining)
What is the algorithm parameter that has the
strongest influence on the runtime of my
decision tree learner?
SELECT ParName, Var(A)/Avg(V) AS Effect
FROM AlgorithmParameters,
  (SELECT $ParName,
          Var(Runtime) AS V,
          Avg(Runtime) AS A
   FROM ExpDB
   GROUP BY $ParName)
GROUP BY ParName
ORDER BY Effect
Not (easily) expressible in standard SQL!
(Pivoting is possible by hardcoding all attribute
names in the query, but that is not very readable
or reusable.)
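With the attribute names hardcoded, as the note suggests, the same effect measure can be computed per parameter in standard SQL, one subquery per parameter (a sketch for the two C4.5 parameters used earlier, assuming a dialect with a VARIANCE aggregate):

SELECT 'MLS' AS ParName, VARIANCE(A) / AVG(V) AS Effect
FROM (SELECT MLS, AVG(Runtime) AS A, VARIANCE(Runtime) AS V
      FROM ExpDB GROUP BY MLS) AS perMLS
UNION ALL
SELECT 'heur' AS ParName, VARIANCE(A) / AVG(V) AS Effect
FROM (SELECT heur, AVG(Runtime) AS A, VARIANCE(Runtime) AS V
      FROM ExpDB GROUP BY heur) AS perHeur
ORDER BY Effect DESC

This illustrates exactly the readability and reusability problem the note points to: every additional parameter means another copy of the subquery.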
A comparison
Classical approach
1) Experiments are goal-oriented
2) Experiments seem more convincing
than they are
3) Need to do new experiments when new
research questions pop up
4) Conditions under which results are valid
are unclear
5) Relatively simple analysis of results
6) Mostly repeated hypothesis testing, rather
than direct questions
7) Low reusability and reproducibility
ExpDB approach
1) Experiments are general-purpose
2) Experiments seem as convincing
as they are
3) No new experiments needed when
new research questions pop up
4) Conditions under which results are
valid are explicit in the query
5) Sophisticated analysis of results
possible
6) Direct questions possible, given
suitable inductive query languages
7) Better reusability and reproducibility
Summary
ExpDB approach
Is more efficient
Is more precise and trustworthy
Conditions under which the conclusions hold are explicitly
stated
Yields better documented experiments
The same set of experiments is reusable and reused
Precise information on all experiments is kept, so experiments
are reproducible
Allows more sophisticated analysis of results
Interaction of effects, true data mining capacity
Note: interesting for meta-learning!
The challenges... (*)
Good dataset generators necessary
Extensive descriptions of datasets and
algorithms
Generating truly varying datasets is not easy
Could start from real-life datasets (build variations)
Vary as many possibly relevant properties as possible
Database schema for multi-algorithm ExpDB
Suitable inductive query languages
(*) note: even without solving all these problems, some improvement
over the current situation is feasible and easy to achieve