How to do Experiments:
Empirical Methods for AI & CS
Paul Cohen
[email protected]
Ian P. Gent
Toby Walsh
[email protected]
[email protected]
Empirical Methods for CS
Can you do Empirical AI?
Can you do empirical AI?
 See if you can spot a pattern in the following real empirical data
 (326 dp (T 1 0))
 (327 dp (T 1 0))
 (328 dp (T 1 0))
 (329 dp (T 1 0))
 (330 dp (T 1 0))
 (331 dp (T 2 1))
 (332 dp (T 1 0))
 (333 dp (T 1 0))
 (334 dp (T 3 2))
 (335 dp (T 350163776 62))
 This led to an Artificial Intelligence journal paper
 Gent & Walsh, “Easy Problems are Sometimes Hard”, 1994
3
Experiments are Harder than you think!
 That pattern was pretty easy to spot but…
 To see the pattern you have to not kill the experiment in the middle of its run, assuming the pipe to the output had been lost!
 That’s what I did, but fortunately the effect occurred again.
 that instance took a week to run
4
Overview of Tutorial
 Can you do Empirical AI?
 yes!
 Experiments are Harder than you think!
 What are empirical methods?
 Experiment design
 Some Problem Issues
 Data analysis & Hypothesis Testing
 Summary
5
Supplementary Material
 How not to do it
 Case Study
 Gregory, Gao, Rosenberg & Cohen
 Eight Basic Lessons
 The t-test
 Randomization
6
Our objectives
 Outline some of the basic issues
 exploration, experimental design, data analysis, ...
 Encourage you to consider some of the pitfalls
 we have fallen into all of them!
 Raise standards
 encouraging debate
 identifying “best practice”
 Learn from your & our experiences
 experimenters get better as they get older!
7
Empirical Methods for CS
Experiments are Harder than you think!
Experiments are Harder than you think!
 Flawed problems:
 A case study from Constraint Satisfaction
 40+ experimental papers over 5 years
 papers on the nature of hard random CSPs
 Authors include … (in alphabetical order!)
 Fahiem Bacchus, Christian Bessiere, Rina Dechter, Gene
Freuder, Ian Gent, Pedro Meseguer, Patrick Prosser, Barbara
Smith, Edward Tsang, Toby Walsh, and many more
 Achlioptas et al. spotted a flaw
 asymptotically almost all problems are trivial
 brings into doubt many experimental results
• some experiments at typical sizes affected
• fortunately not many
9
Flawed random constraints?
 e.g. “Model B”, domain size d, parameters p1 and p2
 Pick exactly p1·C constraints (if there are C possible)
 For each one
• pick exactly p2·d² pairs of values as disallowed
 e.g. d=3, p2=4/9
 Constraints C1 & C2
 C2 is flawed
• it makes X=2 impossible
 For any p2 ≥ 1/d, p1 > 0
 as n → ∞, there will always be one variable with all its values removed
 asymptotically, all problems are trivial!
[Figure: 3×3 constraint tables C1 and C2 on variables X, Y with domain {1,2,3}; disallowed value pairs marked X. In C2 all three pairs involving X=2 are disallowed, so X=2 is impossible.]
10
Flawless random problems
 [Gent et al.] fix the flaw…
 introduce “flawless” model B
 choose d squares which must always be allowed
 all in different rows & columns
 choose p2·d² X’s to disallow in other squares
 For model B, I proved that these problems are not flawed asymptotically
 any p2 < ½
 so we think that we understand how to generate random problems
[Figure: flawless constraint C1; the d always-allowed squares (marked O) lie in different rows and columns, with disallowed pairs marked X.]
11
But it wasn’t that simple…
 Originally we had two different definitions of “flawless” problems
 An undergraduate student showed they were inequivalent!
 after the paper about it appeared on the web
 (the journal paper reference that follows is correct)
12
Experiments are harder than you think!
 This tripped up all constraints researchers who thought about it
 It concerned the most fundamental part of the experiments
 i.e. generating the input data
 a closely analogous flaw has turned up in SAT and in QBF
 The flaw was not found by constraints researchers
 fruitful (in the end!) interaction between theory and experiment
 experimental method justified theoretically
 Even the fix was wrong at first
 Most experiments still use “flawed” models
 (which is ok if you know what you’re doing:
 if you make a positive decision with a good reason )
13
Further reading
 D. Achlioptas, L.M. Kirousis, E. Kranakis, D. Krizanc, M. Molloy, and Y. Stamatiou, Random Constraint Satisfaction: A More Accurate Picture, Constraints, 6 (4), (2001), pp. 329-344.
 I.P. Gent, E. MacIntyre, P. Prosser, B.M. Smith and T. Walsh, Random Constraint Satisfaction: flaws and structures, Constraints, 6 (4), (2001), pp. 345-372.
 Coincidence of title and publication details not at all
coincidental
14
Empirical Methods for CS
What are Empirical Methods?
What does “empirical” mean?
 Relying on observations, data, experiments
 Empirical work should complement theoretical work
 Theories often have holes (e.g., How big is the
constant term? Is the current problem a “bad” one?)
 Theories are suggested by observations
 Theories are tested by observations
 Conversely, theories direct our empirical attention
 In addition (in this tutorial at least) empirical means
“wanting to understand behavior of complex systems”
16
Why We Need Empirical Methods
Cohen, 1990 Survey of 150 AAAI Papers
 Roughly 60% of the papers gave no evidence that the work they
described had been tried on more than a single example problem.
 Roughly 80% of the papers made no attempt to explain performance,
to tell us why it was good or bad and under which conditions it might
be better or worse.
 Only 16% of the papers offered anything that might be interpreted
as a question or a hypothesis.
 Theory papers generally had no applications or empirical work to
support them, empirical papers were demonstrations, not
experiments, and had no underlying theoretical support.
 The essential synergy between theory and empirical
work was missing
17
Theory, not Theorems
 Theory based science need not be all theorems
 otherwise science would be mathematics
 Consider theory of QED
 based on a model of behaviour of particles
 predictions accurate to 10 decimal places
 (distance from LA to NY to within 1 human hair)
 most accurate theory in the whole of science?
 success derived from accuracy of predictions
 not the depth or difficulty or beauty of theorems
 QED is an empirical theory!
18
Empirical CS/AI
 Computer programs are formal objects
 so let’s reason about them entirely formally?
 Two reasons why we can’t or won’t:
 theorems are hard
 some questions are empirical in nature
e.g. are Horn clauses adequate to represent the sort of
knowledge met in practice?
e.g. even though our problem is intractable in general, are
the instances met in practice easy to solve?
19
Empirical CS/AI
 Treat computer programs as natural objects
 like fundamental particles, chemicals, living organisms
 Build (approximate) theories about them
 construct hypotheses
e.g. greedy hill-climbing is important to GSAT
 test with empirical experiments
e.g. compare GSAT with other types of hill-climbing
 refine hypotheses and modelling assumptions
e.g. greediness not important, but hill-climbing is!
20
Empirical CS/AI
 Many advantages over other sciences
 Cost
 no need for expensive super-colliders
 Control
 unlike the real world, we often have complete command
of the experiment
 Reproducibility
 in theory, computers are entirely deterministic
 Ethics
 no ethics panels needed before you run experiments
21
Types of hypothesis
 My search program is better than yours
 not very helpful: a beauty competition?
 Search cost grows exponentially with the number of variables for this kind of problem
 better, as we can extrapolate to data not yet seen?
 Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems
 even better, as we can extrapolate to new situations?
22
A typical conference conversation
What are you up to these days?
I’m running an experiment to compare the MAC-CBJ algorithm with Forward Checking.
Why?
I want to know which is faster
Why?
Lots of people use each of these algorithms
How will these people use your result?
...
23
Keep in mind the BIG picture
What are you up to these days?
I’m running an experiment to compare the MAC-CBJ algorithm with Forward Checking.
Why?
I have this hypothesis that neither will dominate
What use is this?
A portfolio containing both algorithms will be more
robust than either algorithm on its own
24
Keep in mind the BIG picture
...
Why are you doing this?
Because many real problems are intractable in theory but
need to be solved in practice.
How does your experiment help?
It helps us understand the difference between average
and worst case results
So why is this interesting?
Intractability is one of the BIG open questions in CS!
25
Why is empirical CS/AI in vogue?
 Inadequacies of theoretical analysis
 problems often aren’t as hard in practice as theory
predicts in the worst-case
 average-case analysis is very hard (and often based on
questionable assumptions)
 Some “spectacular” successes
 phase transition behaviour
 local search methods
 theory lagging behind algorithm design
26
Why is empirical CS/AI in vogue?
 Compute power ever increasing
 even “intractable” problems coming into range
 easy to perform large (and sometimes meaningful)
experiments
 Empirical CS/AI perceived to be “easier” than theoretical
CS/AI
 often a false perception as experiments easier to mess
up than proofs
 experiments are harder than you think!
27
Empirical Methods for CS
Experiment design
Experimental Life Cycle
 Exploration
 Hypothesis construction
 Experiment
 Data analysis
 Drawing of conclusions
29
Checklist for experiment design*
 Consider the experimental procedure
 making it explicit helps to identify spurious effects and
sampling biases
 Consider a sample data table
 identifies what results need to be collected
 clarifies dependent and independent variables
 shows whether data pertain to hypothesis
 Consider an example of the data analysis
 helps you to avoid collecting too little or too much data
 especially important when looking for interactions
*From Chapter 3, “Empirical Methods for Artificial Intelligence”, Paul Cohen, MIT Press
30
Guidelines for experiment design
 Consider possible results and their interpretation
 may show that experiment cannot support/refute
hypotheses under test
 unforeseen outcomes may suggest new hypotheses
 What was the question again?
 easy to get carried away designing an experiment and
lose the BIG picture
 Run a pilot experiment to calibrate parameters (e.g.,
number of processors in Rosenberg experiment)
31
Types of experiment
 Manipulation experiment
 Observation experiment
 Factorial experiment
32
Manipulation experiment
 Independent variable, x
 x=identity of parser, size of dictionary, …
 Dependent variable, y
 y=accuracy, speed, …
 Hypothesis
 x influences y
 Manipulation experiment
 change x, record y
33
Observation experiment
 Predictor, x
 x=volatility of stock prices, …
 Response variable, y
 y=fund performance, …
 Hypothesis
 x influences y
 Observation experiment
 classify according to x, compute y
34
Factorial experiment
 Several independent variables, xi
 there may be no simple causal links
 data may come that way
e.g. individuals will have different sexes, ages, ...
 Factorial experiment
 every possible combination of xi considered
 expensive as its name suggests!
35
Designing factorial experiments
 In general, stick to 2 to 3 independent variables
 Solve same set of problems in each case
 reduces variance due to differences between problem
sets
 If this is not possible, use the same sample sizes
 simplifies statistical analysis
 As usual, default hypothesis is that no influence exists
 much easier to fail to demonstrate influence than to
demonstrate an influence
36
Empirical Methods for CS
Some Problem Issues
Some problem issues
 Control
 Ceiling and Floor effects
 Sampling Biases
38
Control
 A control is an experiment in which the hypothesised
variation does not occur
 so the hypothesized effect should not occur either
 BUT remember
 placebos cure a large percentage of patients!
39
Control: a cautionary tale
 Macaque monkeys given vaccine based on human T-cells
infected with SIV (relative of HIV)
 macaques gained immunity from SIV
 Later, macaques given uninfected human T-cells
 and macaques still gained immunity!
 Control experiment not originally done
 and not always obvious (you can’t control for all
variables)
40
Ceiling and Floor Effects
 Well designed experiments (with good controls) can still
go wrong
 What if all our algorithms do particularly well, or they all do badly?
 We’ve got little evidence to choose between them
41
Ceiling and Floor Effects
 Ceiling effects arise when test problems are insufficiently challenging
 floor effects are the opposite, arising when problems are too challenging
 A problem in AI because we often repeatedly use the
same benchmark sets
 most benchmarks will lose their challenge eventually?
 but how do we detect this effect?
42
Machine learning example
 14 datasets from UCI corpus of benchmarks
 used as mainstay of ML community
 Problem is learning classification rules
 each item is vector of features and a classification
 measure classification accuracy of method (max 100%)
 Compare C4 with 1R*, two competing
algorithms
Rob Holte, Machine Learning, vol. 3, pp. 63-91, 1993
www.site.uottawa.edu/~holte/Publications/simple_rules.ps
43
Floor effects: machine learning example
DataSet:  BC    CH    GL    G2    HD    HE    …    Mean
C4        72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*       72.5  69.2  56.4  77    78    85.1  ...  83.8

Is 1R* above the floor of performance?
How would we tell?
44
Floor effects: machine learning example
DataSet:  BC    CH    GL    G2    HD    HE    …    Mean
C4        72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*       72.5  69.2  56.4  77    78    85.1  ...  83.8
Baseline  70.3  52.2  35.5  53.4  54.5  79.4  ...  59.9
“Baseline rule” puts all items in more popular category.
1R* is above baseline on most datasets
A bit like the prime number joke?
1 is prime. 3 is prime. 5 is prime. So, baseline rule is
that all odd numbers are prime.
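The baseline rule above is trivial to compute for any labelled dataset: always predict the most popular category. A minimal sketch with made-up labels (not the UCI data):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of the baseline rule: put every item in the most popular category."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

labels = ["benign"] * 70 + ["malignant"] * 30  # made-up labels
print(baseline_accuracy(labels))  # 0.7 -- any learner must beat this floor
```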
45
Ceiling Effects: machine learning
DataSet:  BC    GL    HY    LY    MU     …    Mean
C4        72    63.2  99.1  77.5  100.0  ...  85.9
1R*       72.5  56.4  97.2  70.7  98.4   ...  83.8
 How do we know that C4 and 1R* are not near the ceiling of
performance?
 Do the datasets have enough attributes to make perfect
classification?
 Obviously for MU, but what about the rest?
46
Ceiling Effects: machine learning
DataSet:        BC    GL    HY    LY    MU     …    Mean
C4              72    63.2  99.1  77.5  100.0  ...  85.9
1R*             72.5  56.4  97.2  70.7  98.4   ...  83.8
max(C4,1R*)     72.5  63.2  99.1  77.5  100.0  ...  87.4
max([Buntine])  72.8  60.4  99.1  66.0  98.6   ...  82.0
 C4 achieves only about 2% better than 1R*
 Best of the C4/1R* achieves 87.4% accuracy
 We have only weak evidence that C4 better
 Both methods appear to be performing near the ceiling of what is possible, so comparison is hard!
47
Ceiling Effects: machine learning
 In fact 1R* only uses one feature (the best one)
 C4 uses on average 6.6 features
 5.6 extra features buy only about 2% improvement
Conclusion?
 Either real world learning problems are easy (use 1R*)
 Or we need more challenging datasets
 We need to be aware of ceiling effects in results
48
Sampling bias
 Data collection is biased against certain data
 e.g. a teacher who says “Girls don’t answer maths questions”
 observation might suggest:
 girls don’t answer many questions
 but in fact the teacher doesn’t ask them many questions
 Experienced AI researchers don’t
do that, right?
49
Sampling bias: Phoenix case study
 AI system to fight (simulated)
forest fires
 Experiments suggest that wind
speed uncorrelated with time to
put out fire
 obviously incorrect as high
winds spread forest fires
50
Sampling bias: Phoenix case study
 Wind Speed vs containment time (max 150 hours):
 3: 120 55 79 10 140 26 15 110 54 10 103
 6: 78 61 58 81 71 57 21 32
 9: 62 48 21 55 101 12 70
 What’s the problem?
51
Sampling bias: Phoenix case study
 The cut-off of 150 hours introduces sampling bias
 many high-wind fires get cut off, not many low wind
 On the remaining data, wind speed and containment time are negatively correlated (r = -0.53)
 In fact, data shows that:
 a lot of high wind fires take > 150 hours to contain
 those that don’t are similar to low wind fires
 You wouldn’t do this, right?
 you might if you had automated data analysis.
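The censoring effect is easy to reproduce in a few lines. This is a hypothetical generative model, not the Phoenix simulator: containment time is assumed to grow with wind speed, then trials over 150 hours are discarded.

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
data = []
for _ in range(2000):
    wind = rng.choice([3, 6, 9])
    hours = rng.uniform(0, 40 * wind)  # assumed: true containment time grows with wind
    data.append((wind, hours))

kept = [(w, h) for w, h in data if h <= 150]  # the 150-hour cut-off discards slow fires
print(correlation(*zip(*data)))  # clearly positive on the full data
print(correlation(*zip(*kept)))  # much weaker on the censored data
```

The cut-off removes mostly high-wind fires, so the surviving sample no longer shows the true relationship.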
52
Empirical Methods for CS
Data analysis & Hypothesis Testing
Kinds of data analysis
 Exploratory (EDA) – looking for patterns in data
 Statistical inferences from sample data
 Testing hypotheses
 Estimating parameters
 Building mathematical models of datasets
 Machine learning, data mining…
 We will introduce hypothesis testing and computer-intensive methods
54
The logic of hypothesis testing
 Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long-run behavior?) and what is your residual uncertainty?
 You say, “If the coin were fair, then eight or more heads is
pretty unlikely, so I think the coin isn’t fair.”
 Like proof by contradiction: Assert the opposite (the coin
is fair) show that the sample result (≥ 8 heads) has low
probability p, reject the assertion, with residual
uncertainty related to p.
 Estimate p with a sampling distribution.
55
Common tests
 Tests that means are equal
 Tests that samples are uncorrelated or independent
 Tests that slopes of lines are equal
 Tests that predictors in rules have predictive power
 Tests that frequency distributions (how often events happen) are equal
 Tests that classification variables such as smoking history and heart disease history are unrelated
 ...
 All follow the same basic logic
56
Probability of a sample result under a null
hypothesis
 If the coin were fair (the null hypothesis) what is the
probability distribution of r, the number of heads,
obtained in N tosses of a fair coin? Get it analytically or
estimate it by simulation (on a computer):
 Loop K times
 r := 0                          ;; r is num. heads in N tosses
 Loop N times                    ;; simulate the tosses
 • Generate a random 0 ≤ x ≤ 1.0
 • If x < p increment r          ;; p is the probability of a head
 Push r onto sampling_distribution
 Print sampling_distribution
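The pseudocode above translates directly into a runnable simulation; this is a minimal Python sketch with the slide’s K, N and p as parameters:

```python
import random
from collections import Counter

def sampling_distribution(K=10000, N=10, p=0.5, seed=0):
    """Simulate K experiments of N coin tosses; return the K head-counts."""
    rng = random.Random(seed)
    dist = []
    for _ in range(K):
        r = sum(1 for _ in range(N) if rng.random() < p)  # r = num. heads in N tosses
        dist.append(r)  # push r onto the sampling distribution
    return dist

dist = sampling_distribution()
print(sorted(Counter(dist).items()))
# Estimated probability of a result at least as extreme as the observed r = 8:
print(sum(1 for r in dist if r >= 8) / len(dist))  # close to the exact 56/1024 ≈ 0.055
```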
57
The logic of hypothesis testing
 Establish a null hypothesis: H0: the coin is fair
 Establish a statistic: r, the number of heads in N tosses
 Figure out the sampling distribution of r given H0
[Figure: sampling distribution of r, the number of heads, for r = 0..10]
 The sampling distribution will tell you the probability p of a
result at least as extreme as your sample result, r = 8
 If this probability is very low, reject H0 the null
hypothesis
 Residual uncertainty is p
58
The only tricky part is getting the sampling
distribution
 Sampling distributions can be derived...
 Exactly, e.g., binomial probabilities for coins are given by the formula
   Pr(r) = N! / (r! (N − r)!) · p^r (1 − p)^(N − r)
 Analytically, e.g., the central limit theorem tells us that
the sampling distribution of the mean approaches a
Normal distribution as samples grow to infinity
 Estimated by Monte Carlo simulation of the null
hypothesis process
59
A common statistical test: The Z test for
different means
 A sample N = 25 computer science students has mean IQ
m=135. Are they “smarter than average”?
 Population mean is 100 with standard deviation 15
 The null hypothesis, H0, is that the CS students are
“average”, i.e., the mean IQ of the population of CS
students is 100.
 What is the probability p of drawing the sample if H0 were
true? If p small, then H0 probably false.
 Find the sampling distribution of the mean of a sample of
size 25, from population with mean 100
60
The sampling distribution for the CS student
example
 If a sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis) then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation 15/√25 = 3
 The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution
 Only ~5% of a normal distribution falls more than two standard deviations away from the mean
 If the students were average, this would have a roughly zero chance of happening.
61
The Z test
[Figure: the sampling distribution of the mean (mean 100, std = 3) with the sample statistic 135, recoded as the test statistic Z (mean 0, std = 1.0, Z = 11.67)]
 Z = (x̄ − μ) / (σ/√N) = (135 − 100) / (15/√25) = 35/3 ≈ 11.67
62
Reject the null hypothesis?
 Commonly we reject the H0 when the probability of
obtaining a sample statistic (e.g., mean = 135) given the
null hypothesis is low, say < .05.
 A test statistic value, e.g. Z = 11.67, recodes the sample
statistic (mean = 135) to make it easy to find the
probability of sample statistic given H0.
 We find the probabilities by looking them up in tables, or
statistics packages provide them.
 For example, Pr(Z ≥ 1.645) = .05; Pr(Z ≥ 2.33) = .01.
 Pr(Z ≥ 11) is approximately zero, reject H0.
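The Z computation and its tail probability are mechanical. A minimal sketch (the upper-tail probability comes from the standard library’s erfc):

```python
from math import sqrt, erfc

def z_test(sample_mean, pop_mean, pop_std, n):
    """Return the Z statistic and the one-tailed Pr(Z >= z) under H0."""
    z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
    p = erfc(z / sqrt(2)) / 2  # upper tail of the standard normal
    return z, p

z, p = z_test(135, 100, 15, 25)
print(round(z, 2), p)  # Z ≈ 11.67; p is effectively zero, so reject H0
```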
63
Summary of hypothesis testing
 H0 negates what you want to demonstrate; find probability p of
sample statistic under H0 by comparing test statistic to
sampling distribution; if probability is low, reject H0 with
residual uncertainty proportional to p.
 Example: Want to demonstrate that CS graduate students are
smarter than average. H0 is that they are average. t = 2.89, p ≤
.022
 Have we proved CS students are smarter? NO!
 We have only shown that mean = 135 is unlikely if they aren’t.
We never prove what we want to demonstrate, we only reject
H0, with residual uncertainty.
 And failing to reject H0 does not prove H0, either!
64
Computer-intensive Methods
 Basic idea: Construct sampling distributions by simulating
on a computer the process of drawing samples.
 Three main methods:
 Monte carlo simulation when one knows population parameters;
 Bootstrap when one doesn’t;
 Randomization, also assumes nothing about the population.
 Enormous advantage: Works for any statistic and makes
no strong parametric assumptions (e.g., normality)
65
The Bootstrap
 Monte Carlo estimation of sampling distributions assumes you know the parameters of the population from which samples are drawn.
 What if you don’t?
 Use the sample as an estimate of the population.
 Draw samples from the sample!
 With or without replacement?
 Example: Sampling distribution of the mean; check the
results against the central limit theorem.
66
Bootstrapping the sampling distribution of the
mean
 S is a sample of size N:
 Loop K = 1000 times
 Draw a pseudosample S* of size N from S by sampling with replacement
 Calculate the mean of S* and push it on a list L
 L is the bootstrapped sampling distribution of the mean
 This procedure works for any statistic, not just the mean.
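The loop above translates directly into code. A minimal sketch (S here is just an illustrative sample):

```python
import random

def bootstrap_means(S, K=1000, seed=0):
    """Bootstrapped sampling distribution of the mean of sample S."""
    rng = random.Random(seed)
    N = len(S)
    L = []
    for _ in range(K):
        S_star = [rng.choice(S) for _ in range(N)]  # pseudosample, with replacement
        L.append(sum(S_star) / N)                   # push its mean onto L
    return L

S = [2, 4, 4, 4, 5, 5, 7, 9]  # an illustrative sample
L = bootstrap_means(S)
print(sum(L) / len(L))        # close to the sample mean, 5.0
```

Swapping `sum(S_star) / N` for a median or a trimmed mean bootstraps that statistic instead; nothing else changes.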
67
Randomization
 Used to test hypotheses that involve association between
elements of two or more groups; very general.
 Not going to explain here, but it’s very nice
 A practical example with some explanation in
 Singer, J., Gent, I.P. and Smaill, A. (2000) "Backbone
Fragility and the Local Search Cost Peak", JAIR, Volume 12,
pages 235-270.
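Although the slides don’t explain randomization, its core is short enough to sketch: shuffle the group labels many times and count how often the shuffled difference in means is at least as extreme as the observed one. This is a generic two-sample sketch, not the specific method of the cited paper:

```python
import random

def randomization_test(a, b, trials=10000, seed=0):
    """Approximate p-value for a difference in means, by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # reassign group labels at random
        a2, b2 = pooled[:len(a)], pooled[len(a):]
        if abs(sum(a2) / len(a2) - sum(b2) / len(b2)) >= observed:
            extreme += 1
    return extreme / trials

print(randomization_test([10, 12, 11, 14], [10, 11, 9, 10]))
```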
68
This is what I’d like more time for…
 Bootstrapping and Randomisation are very neat
 well worth learning about
 e.g. see Cohen’s book,
 easy in modern statistical packages, e.g. “R”
 sorry I didn’t have time to go into them
69
Empirical Methods for CS
Summary
Summary
 Empirical CS and AI are exacting sciences
 There are many ways to do experiments wrong
 We are experts in doing experiments badly
 As you perform experiments, you’ll make many mistakes
 Learn from those mistakes, and ours!
 And remember
 Experiments are Harder than you think!
71
Resources
 Web
www.cs.york.ac.uk/~tw/empirical.html (this link is correct!)
 Books
“Empirical Methods for AI”, Paul Cohen, MIT Press, 1995
 Journals
Journal of Experimental Algorithmics, www.jea.acm.org
 Conferences
Workshop on Empirical Methods in AI (IJCAI 01, future?)
Workshop on Algorithm Engineering and Experiments, ALENEX 03
72
Empirical Methods for CS
Appendix:
Supplementary Material
If I get to this bit the rest of the talk went faster than expected!
Supplementary Material
 How not to do it
 Case Study
 Gregory, Gao, Rosenberg & Cohen
 Eight Basic Lessons
 The t-test
 Randomization
74
Empirical Methods for CS
How Not To Do It
Tales from the coal face
 Those ignorant of history are doomed to repeat it
 we have committed many howlers
 We hope to help others avoid similar ones …
… and illustrate how easy it is to screw up!
 “How Not to Do It”
I Gent, S A Grant, E. MacIntyre, P Prosser, P Shaw,
B M Smith, and T Walsh
University of Leeds Research Report, May 1997
 Every howler we report committed by at least one of the
above authors!
76
How Not to Do It
 Do measure with many instruments
 in exploring hard problems, we used our best algorithms
 missed very poor performance of less good algorithms
 better algorithms will be bitten by the same effect on larger instances than we considered
 Do measure CPU time
 in exploratory code, CPU time often misleading
 but can also be very informative
e.g. heuristic needed more search but was faster
77
How Not to Do It
 Do vary all relevant factors
 Don’t change two things at once
 ascribed effects of heuristic to the algorithm
 changed heuristic and algorithm at the same time
 didn’t perform factorial experiment
 but it’s not always easy/possible to do the “right”
experiments if there are many factors
78
How Not to Do It
 Do Collect All Data Possible …. (within reason)
 one year Santa Claus had to repeat all our experiments
 ECAI/AAAI/IJCAI deadlines just after new year!
 we had collected number of branches in search tree
 performance scaled with backtracks, not branches
 all experiments had to be rerun
 Don’t Kill Your Machines
 we have got into trouble with sysadmins
… over experimental data we never used
 often the vital experiment is small and quick
79
How Not to Do It
 Do It All Again … (or at least be able to)
 e.g. storing random seeds used in experiments
 we didn’t do that and might have lost important result
 Do Be Paranoid
 “identical” implementations in C, Scheme gave different
results
 Do Use The Same Problems
 reproducibility is a key to science (c.f. cold fusion)
 can reduce variance
80
Choosing your test data
 We’ve seen the possible problem of over-fitting
 remember machine learning benchmarks?
 Two common approaches
 benchmark libraries
 random problems
 Both have potential pitfalls
81
Benchmark libraries
 +ve
 can be based on real problems
 lots of structure
 -ve
 library of fixed size
possible to over-fit algorithms to library
 problems have fixed size
so can’t measure scaling
82
Random problems
 +ve
 problems can have any size
so can measure scaling
 can generate any number of problems
hard to over-fit?
 -ve
 may not be representative of real problems
 lack structure
 easy to generate “flawed” problems
 CSP, QSAT, …
83
Prototyping your algorithm
 Often need to implement an algorithm
 usually novel algorithm, or variant of existing one
 e.g. new heuristic in existing search algorithm
 novelty of algorithm should imply extra care
 more often, encourages lax implementation
 it’s only a preliminary version
84
How Not to Do It
 Don’t Trust Yourself
 bug in innermost loop found by chance
 all experiments re-run with urgent deadline
 curiously, sometimes bugged version was better!
 Do Preserve Your Code
 Or end up fixing the same error twice
Do use version control!
85
How Not to Do It
 Do Make it Fast Enough
 emphasis on enough
 it’s often not necessary to have optimal code
 in lifecycle of experiment, extra coding time not won
back
 e.g. we have published many papers with inefficient code compared to the state of the art
• the first GSAT version was O(N²), but this really was too slow!
 Do Report Important Implementation Details
 intermediate versions produced good results
86
How Not to Do It
 Do Look at the Raw Data
 Summaries obscure important aspects of behaviour
 Many statistical measures explicitly designed to
minimise effect of outliers
 Sometimes outliers are vital
 “exceptionally hard problems” dominate mean
 we missed them until they hit us on the head
when experiments “crashed” overnight
old data on smaller problems showed clear
behaviour
87
How Not to Do It
 Do face up to the consequences of your results
 e.g. preprocessing on 450 problems
 should “obviously” reduce search
 reduced search 448 times
 increased search 2 times
 Forget algorithm, it’s useless?
 Or study in detail the two exceptional cases
 and achieve new understanding of an important
algorithm
88
Empirical Methods for CS
A Case Study:
Eight Basic Lessons
Rosenberg study
 “An Empirical Study of Dynamic
Scheduling on Rings of
Processors”
Gregory, Gao, Rosenberg &
Cohen
Proc. of 8th IEEE Symp. on
Parallel & Distributed
Processing, 1996
Linked to from
www.cs.york.ac.uk/~tw/empirical.html
90
Problem domain
 Scheduling processors on ring
network
 jobs spawned as binary
trees
 KOSO
 keep one, send one to my
left or right arbitrarily
 KOSO*
 keep one, send one to my
least heavily loaded
neighbour
91
Theory
 On complete binary trees, KOSO
is asymptotically optimal
 So KOSO* can’t be any better?
 But assumptions unrealistic
 tree not complete
 asymptotically not necessarily
the same as in practice!
Thm: Using KOSO on a ring of p processors, a binary tree of height n is executed within (2^n − 1)/p + low-order terms
92
Benefits of an empirical study
 More realistic trees
 probabilistic generator that makes shallow trees, which
are “bushy” near root but quickly get “scrawny”
 similar to trees generated when performing Trapezoid
or Simpson’s Rule calculations
 binary trees correspond to interval bisection
 Startup costs
 network must be loaded
93
Lesson 1: Evaluation begins with claims
Lesson 2: Demonstration is good, understanding better
 Hypothesis (or claim): KOSO takes longer than KOSO*
because KOSO* balances loads better
 The “because phrase” indicates a hypothesis about why
it works. This is a better hypothesis than the beauty
contest demonstration that KOSO* beats KOSO
 Experiment design
 Independent variables: KOSO v KOSO*, no. of
processors, no. of jobs, probability(job will spawn),
 Dependent variable: time to complete jobs
94
Criticism 1: This experiment design includes no
direct measure of the hypothesized effect
 Hypothesis: KOSO takes longer than KOSO* because
KOSO* balances loads better
 But experiment design includes no direct measure of load
balancing:
 Independent variables: KOSO v KOSO*, no. of
processors, no. of jobs, probability(job will spawn),
 Dependent variable: time to complete jobs
95
Lesson 3: Exploratory data analysis means looking
beneath immediate results for explanations
 T-test on time to complete jobs: t = (2825 − 2935)/587 = −0.19
 KOSO* apparently no faster than KOSO (as theory predicted)
 Why? Look more closely at the data:
[Figure: histograms of completion times for KOSO and KOSO*; both are skewed, with outliers out to 10000–20000]
 Outliers create excessive variance, so the test isn’t significant
Lesson 4: The task of empirical work is to explain
variability
Empirical work assumes the variability in a dependent variable (e.g., run
time) is the sum of causal factors and random noise. Statistical methods
assign parts of this variability to the factors and the noise.
[Diagram: run-time explained by algorithm (KOSO/KOSO*), number of processors, number of jobs, and “random noise” (e.g., outliers)]
Number of processors and number of jobs explain 74% of the variance
in run time. Algorithm explains almost none.
97
Lesson 3 (again): Exploratory data analysis means
looking beneath immediate results for explanations
 Why does the KOSO/KOSO* choice account for so little
of the variance in run time?
[Figure: queue length at processor i over time (100–300) for KOSO and KOSO*; queues stay well above zero under both algorithms]
 Unless processors starve, there will be no effect of load
balancing. In most conditions in this experiment,
processors never starved. (This is why we run pilot
experiments!)
98
Lesson 5: Of sample variance, effect size, and sample
size – control the first before touching the last
t = x̄ / (s / √N)
where x̄ is the magnitude of the effect, s the background variance
(the sample standard deviation), and N the sample size
This intimate relationship holds for all statistics
99
Lesson 5 illustrated: A variance reduction
method
Let N = num-jobs, P = num-processors, T = run time
Then T = k (N / P), or k multiples of the theoretical best time
And k = T / (N / P) = TP / N
1.61  1.4
t
 2.42, p  .02
.08
[Histograms of k(KOSO) and k(KOSO*) (x-axis k = 2 to 5); recoding run time to k greatly reduces the variance.]
100
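The recoding and the two-sample test can be sketched as follows. The function names and sample values are illustrative only, not the study's data; the t statistic uses the Welch form.

```python
import math

# Recode run time as k, multiples of the theoretical best time N/P
# (N jobs spread over P processors, assuming unit-time jobs).
def recode_k(run_time, num_jobs, num_procs):
    return run_time / (num_jobs / num_procs)

# Two-sample t statistic (Welch form) on the recoded values.
def t_stat(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

# Example: 100 jobs on 10 processors (best time 10); run time 15 gives k = 1.5
k = recode_k(15, 100, 10)
```

Because k divides out the effect of the number of jobs and processors, its background variance is far smaller than that of raw run time, which is exactly what Lesson 5 calls for.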
Where are we?
 KOSO* is significantly better than KOSO when the
dependent variable is recoded as percentage of optimal
run time
 The difference between KOSO* and KOSO explains very
little of the variance in either dependent variable
 Exploratory data analysis tells us that processors aren’t
starving so we shouldn’t be surprised
 Prediction: The effect of algorithm on run time (or k)
increases as the number of jobs decreases or the number
of processors increases
 This prediction is about interactions between factors
101
Lesson 6: Most interesting science is about
interaction effects, not simple main effects
[Plot: multiples of optimal run-time (1–3) against number of processors (3, 6, 10, 20), one curve for KOSO and one for KOSO*.]
 Data confirm prediction
 KOSO* is superior on larger
rings where starvation is an
issue
 Interaction of independent
variables
 choice of algorithm
 number of processors
 Interaction effects are
essential to explaining how
things work
102
Lesson 7: Significant and meaningful are not
synonymous. Is a result meaningful?
 KOSO* is significantly better than KOSO, but can you use the result?
 Suppose you wanted to use the knowledge that the ring is controlled by
KOSO or KOSO* for some prediction.
 Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
 Pr(trial i under KOSO has k > 1.11) = 0.57
 Pr(trial i under KOSO* has k > 1.11) = 0.43
 Predict for trial i whether its k is above or below the median:
 If it’s a KOSO* trial you’ll say “no” and make (.43 × 150) = 64.5 errors
 If it’s a KOSO trial you’ll say “yes” and make ((1 − .57) × 160) = 68.8 errors
 If you don’t know the algorithm you’ll make (.5 × 310) = 155 errors
 155 − (64.5 + 68.8) ≈ 21.7 fewer errors
 Knowing the algorithm reduces error rate from .5 to .43. Is this
enough???
103
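The error bookkeeping above can be reproduced directly; the probabilities and trial counts are the slide's, and the variable names are mine.

```python
# Probability of k exceeding the grand median under each algorithm,
# and the trial counts, as given on the slide.
p_above_koso = 0.57
p_above_koso_star = 0.43
n_koso, n_koso_star = 160, 150

# Predict "above median" for KOSO trials and "below" for KOSO* trials.
errors_koso_star = p_above_koso_star * n_koso_star  # said "below", was above
errors_koso = (1 - p_above_koso) * n_koso           # said "above", was below
informed = errors_koso_star + errors_koso

# Without knowing the algorithm, guessing is wrong half the time.
uninformed = 0.5 * (n_koso + n_koso_star)
saved = uninformed - informed  # about 21.7 fewer errors out of 310 trials
```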
Lesson 8: Keep the big picture in mind
Why are you studying this?
Load balancing is important to get good performance
out of parallel computers
Why is this important?
Parallel computing promises to tackle many of our
computational bottlenecks
How do we know this? It’s in the first paragraph of
the paper!
104
Case study: conclusions
 Evaluation begins with claims
 Demonstrations of simple main effects are
good, understanding the effects is better
 Exploratory data analysis means using your
eyes to find explanatory patterns in data
 The task of empirical work is to explain
variability
 Control variability before increasing sample
size
 Interaction effects are essential to
explanations
 Significant ≠ meaningful
 Keep the big picture in mind
105
The t test
 Same logic as the Z test, but appropriate when population
standard deviation is unknown, samples are small, etc.
 Sampling distribution is t, not normal, but approaches
normal as sample size increases
 Test statistic has very similar form but probabilities of
the test statistic are obtained by consulting tables of the
t distribution, not the normal
106
The t test
Suppose N = 5 students have mean IQ = 135, std = 27
Estimate the standard deviation of the sampling distribution using the
sample standard deviation:
t = (x̄ − μ) / (s / √N) = (135 − 100) / (27 / √5) = 35 / 12.1 ≈ 2.89
[Two sketches: the sampling distribution of the mean, std = 12.1, centered
on 100 with the sample statistic at 135; and the t distribution, std = 1.0,
centered on 0 with the test statistic at 2.89.]
107
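The computation on this slide, as a minimal Python sketch (the numbers are the slide's worked example):

```python
import math

# One-sample t: does a sample mean of 135 (N = 5, s = 27) differ from
# a population mean of 100?
n = 5
sample_mean, sample_std = 135.0, 27.0
pop_mean = 100.0

# Estimated standard deviation of the sampling distribution (~12.1).
std_err = sample_std / math.sqrt(n)

# Test statistic (~2.9); compare against the t distribution with df = 4,
# not the normal, because s estimates the unknown population std.
t = (sample_mean - pop_mean) / std_err
```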
Randomization
 Used to test hypotheses that involve association between
elements of two or more groups; very general.
 Example: Paul tosses H H H H, Carole tosses T T T T; is the
outcome independent of the tosser?
 Example: 4 women score 54 66 64 61, six men score 23 28 27
31 51 32. Is score independent of gender?
 Basic procedure: Calculate a statistic f for your sample;
randomize one factor relative to the other and calculate your
pseudostatistic f*. Compare f to the sampling distribution for
f*.
108
Example of randomization
 Four women score 54 66 64 61, six men score 23 28 27 31 51 32. Is score
independent of gender?
 f = difference of means of men’s and women’s scores: 29.25
 Under the null hypothesis of no association between gender and score, the
score 54 might equally well have been achieved by a male or a female.
 Toss all scores in a hopper, draw out four at random and without
replacement, call them female*, call the rest male*, and calculate f*, the
difference of means of female* and male*. Repeat to get a distribution of
f*. This is an estimate of the sampling distribution of f under H0: no
difference between male and female scores.
109
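The procedure above is easy to implement as a Monte Carlo approximation of the exhaustive randomization. The scores are the slide's; the trial count and seed are arbitrary choices of mine.

```python
import random

women = [54, 66, 64, 61]
men = [23, 28, 27, 31, 51, 32]

def mean(xs):
    return sum(xs) / len(xs)

# Observed statistic f: difference of the group means (29.25).
f = mean(women) - mean(men)

# Shuffle the pooled scores, relabel the first four as "female*",
# and see how often the pseudostatistic f* reaches the observed f.
random.seed(1)
pooled = women + men
hits, trials = 0, 20000
for _ in range(trials):
    random.shuffle(pooled)
    f_star = mean(pooled[:4]) - mean(pooled[4:])
    if f_star >= f:
        hits += 1

p_value = hits / trials  # estimated Pr(f* >= f) under H0
```

Here the four women's scores are so high that almost no random relabeling reproduces f, so the estimated p-value is very small and H0 (no association between gender and score) is rejected.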