Artificial Intelligence
Empirical Evaluation of AI Systems
Ian Gent
[email protected]
Artificial Intelligence
Empirical Evaluation of Computer Systems
Part I: Philosophy of Science
Part II: Experiments in AI
Part III: Basics of Experimental Design, with AI case studies
Science as Refutation
Modern view of the progress of science, based on Popper (Sir Karl Popper, that is)
A scientific theory is one that can be refuted
I.e. it should make testable predictions
If these predictions are incorrect, the theory is false
a refuted theory may still be useful, e.g. Newtonian physics
Therefore science is hypothesis testing
Artificial intelligence aspires to be a science
Empirical Science
Empirical = “Relying upon or derived from observation or experiment”
Most (all) of science is empirical.
Consider theoretical computer science
study based on Turing machines, lambda calculus, etc.
Founded on the empirical observation that computer systems developed to date are Turing-complete
Quantum computers might challenge this
if so, an empirically based theory of quantum computing will develop
Theory, not Theorems
Theory-based science need not be all theorems
otherwise science would be Mathematics
Compare the physics theory “QED” (quantum electrodynamics)
most accurate theory in the whole of science?
based on a model of the behaviour of particles
predictions accurate to many decimal places (9?)
success derived from the accuracy of its predictions
not the depth or difficulty or beauty of theorems
I.e. QED is an empirical theory
AI/CS has too many theorems and not enough theory
compare advice on how to publish in JACM
Empirical CS/AI
Computer programs are formal objects
so some researchers use only theory that can be proved by theorems
but theorems are hard
Treat computer programs as natural objects
like quantum particles, chemicals, living objects
perform empirical experiments
We have a huge advantage over other sciences
no need for supercolliders (expensive) or animal experiments (ethical problems)
we should have complete command of experiments
What are our hypotheses?
My search program is better than yours
Search cost grows exponentially with the number of variables for this kind of problem (tested in the sketch after this list)
Constraint search systems are better at handling overconstrained systems, but OR systems are better at handling underconstrained systems
My company should buy an AI search system rather than an OR one
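The second hypothesis is directly testable by experiment. A minimal sketch in Python, assuming a hypothetical run_search stand-in for a real solver: time it on increasing problem sizes and fit a least-squares line to log(cost) against the number of variables.

import math
import time

def run_search(n):
    # Hypothetical stand-in: replace with your own search procedure
    # on an n-variable problem instance.  This placeholder just
    # simulates exponential cost so the sketch runs end to end.
    time.sleep(0.001 * 2 ** (n / 4))

sizes = list(range(8, 33, 4))
log_costs = []
for n in sizes:
    start = time.perf_counter()
    run_search(n)
    log_costs.append(math.log(time.perf_counter() - start))

# Least-squares slope of log(cost) against n: a stable positive
# slope b supports cost ~ exp(b * n), i.e. exponential growth.
mean_n = sum(sizes) / len(sizes)
mean_y = sum(log_costs) / len(log_costs)
slope = (sum((n - mean_n) * (y - mean_y)
             for n, y in zip(sizes, log_costs))
         / sum((n - mean_n) ** 2 for n in sizes))
print("estimated growth rate b =", round(slope, 3))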
Why do experiments?
Too often AI experimenters might talk like this:
What is your experiment for?
is my algorithm better than his?
Why?
I want to know which is faster
Why?
Lots of people use each kind …
How will these people use your result?
?
Why do experiments?
Compare experiments on identical twins:
What is your experiment for?
I want to compare twins reared apart with twins reared together, and nonidentical twins too.
Why?
We can get estimates of the genetic and social contributors to performance
Why?
Because the role of genetics in behavior is one of the great unsolved questions.
Experiments should address research questions
otherwise they can just be “track meets”
Basic issues in Experimental Design
From Paul R Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3
Control
Ceiling and Floor effects
Sampling Biases
Control
A control is an experiment in which the hypothesised variation does not occur
so the hypothesised effect should not occur either
e.g. Macaque monkeys given a vaccine based on human T-cells infected with SIV (a relative of HIV)
macaques gained immunity from SIV
Later, macaques given uninfected human T-cells
and macaques still gained immunity!
Control experiment not originally done
and not always obvious (you can’t control for all variables)
Case Study: MYCIN
MYCIN was a medical expert system
recommended therapy for blood/meningitis infections
How to evaluate its recommendations?
Shortliffe used
10 sample problems
8 other therapy recommenders
5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher, 1 senior student
8 impartial judges gave 1 point per problem
Max score was 80
MYCIN: 65; Faculty: 40-60; Fellow: 60; Resident: 45; Student: 30
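The scoring scheme is easy to mechanise. A minimal sketch (with made-up placeholder ratings, since the real ones came from the blinded judges): 8 judges award at most 1 point per problem to each anonymised prescriber, so 10 problems give a maximum of 80.

import random

prescribers = ["MYCIN", "Faculty-1", "Resident", "Student"]  # illustrative names
judges, problems = 8, 10

# ratings[j][q][p] = 1 if judge j rated prescriber p's therapy for
# problem q acceptable, else 0 (random placeholder data here).
ratings = [[{p: random.randint(0, 1) for p in prescribers}
            for _ in range(problems)]
           for _ in range(judges)]

scores = {p: sum(ratings[j][q][p]
                 for j in range(judges)
                 for q in range(problems))
          for p in prescribers}
print(scores)  # each score is out of judges * problems = 80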
Case Study: MYCIN
What were controls?
Control for judge’s bias for/against computers
judges did not know who recommended each therapy
Control for easy problems
the medical student did badly, so the problems were not easy
Control for our standard being low
e.g. random choice should do worse
Control for factor of interest
e.g. hypothesis in MYCIN that “knowledge is power”
have groups with different levels of knowledge
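That last control generalises to any “knowledge is power” style claim: run the system with and without the hypothesised factor, and credit only the difference. A minimal sketch, where solve and evaluate are hypothetical stand-ins for the real system and its test harness:

import random

def solve(problem, knowledge):
    # Hypothetical stand-in: in this placeholder a knowledge-rich
    # solver succeeds more often; replace with the real system.
    return random.random() < (0.8 if knowledge else 0.4)

def evaluate(problems, knowledge):
    return sum(solve(p, knowledge) for p in problems) / len(problems)

problems = list(range(100))
full = evaluate(problems, knowledge=True)      # experimental group
control = evaluate(problems, knowledge=False)  # hypothesised factor removed
print(f"with knowledge: {full:.2f}  control: {control:.2f}")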
Ceiling and Floor Effects
Well designed experiments can go wrong
What if all our algorithms do particularly well (or they all do badly)?
We’ve got little evidence to choose between them
Ceiling effects arise when test problems are insufficiently challenging
floor effects are the opposite, arising when problems are too challenging
A problem in AI because we often use benchmark sets
But how do we detect the effect?
Ceiling Effects: Machine Learning
14 datasets from the UCI corpus of benchmarks
used as a mainstay of the ML community
Problem is learning classification rules
each item is a vector of features and a classification
measure the classification accuracy of each method (max 100%)
Compare C4 with 1R*, two competing algorithms:
DataSet:  BC    CH    GL    G2    HD    HE    ...   Mean
C4        72    99.2  63.2  74.3  73.6  81.2  ...   85.9
1R*       72.5  69.2  56.4  77    78    85.1  ...   83.8
Ceiling Effects
DataSet:  BC    CH    GL    G2    HD    HE    ...   Mean
C4        72    99.2  63.2  74.3  73.6  81.2  ...   85.9
1R*       72.5  69.2  56.4  77    78    85.1  ...   83.8
Max       72.5  99.2  63.2  77    78    85.1  ...   87.4
C4 achieves only about 2% better than 1R*
If we take the better of C4/1R* on each dataset, we still only achieve 87.4% mean accuracy
We have only weak evidence that C4 is better
both methods are performing near the ceiling of what is possible
The ceiling effect is that we can’t compare the two methods well, because both are achieving near the best practicable
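The check itself is mechanical: compare each method’s mean with the mean of the per-dataset best. A minimal sketch using only the six datasets shown above (the slide’s means of 85.9, 83.8 and 87.4 come from the full 14-dataset table):

c4 = {"BC": 72.0, "CH": 99.2, "GL": 63.2, "G2": 74.3, "HD": 73.6, "HE": 81.2}
r1 = {"BC": 72.5, "CH": 69.2, "GL": 56.4, "G2": 77.0, "HD": 78.0, "HE": 85.1}

def mean(xs):
    return sum(xs) / len(xs)

# Best achievable by picking the better method on each dataset.
best = {d: max(c4[d], r1[d]) for d in c4}

print("C4 mean:     ", round(mean(c4.values()), 1))
print("1R* mean:    ", round(mean(r1.values()), 1))
print("best-of mean:", round(mean(best.values()), 1))
# If even the best-of-both mean barely beats either method alone,
# both sit near the ceiling and the comparison carries little force.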
Ceiling Effects
In fact 1R* uses only one feature (the best one)
C4 uses 6.6 features on average
the extra 5.6 features buy only about 2% improvement
Conclusion?
Either real world learning problems are easy (use 1R*)
Or we need more challenging datasets
We need to be aware of ceiling effects in results
Sampling Bias
Sampling bias is when data collection is biased against certain data
e.g. a teacher who says “Girls don’t answer maths questions”
observation might suggest that indeed girls don’t answer many questions
but that is because the teacher doesn’t ask them many questions
Experienced AI researchers don’t do that, right?
Case Study: Phoenix
Phoenix = AI system to fight (simulated) forest fires
Experiments suggested that wind speed was uncorrelated with time to put out the fire
obviously incorrect (high winds spread forest fires)
Wind Speed vs containment time (max 150 hours):
3: 120 55 79 10 140 26 15 110 12 54 10 103
6: 78 61 58 81 71 57 21 32 70
9: 62 48 21 55 101
What’s the problem?
Sampling bias in Phoenix
The cut-off of 150 hours introduces sampling bias
Many high-wind fires get cut off, but few low-wind ones do
On the remaining data there is no positive correlation between wind speed and containment time (in fact r = -0.53)
In fact, data shows that:
a lot of high wind fires take > 150 hours to contain
those that don’t are similar to low wind fires
You wouldn’t do this, right?
You might if you had automated data analysis.
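To see the bias numerically, here is a minimal sketch computing the Pearson correlation on the truncated sample from the previous slide; the censoring at 150 hours removes exactly the long high-wind fires that would have produced the expected positive correlation.

import math

# Containment times (hours, all < 150) grouped by wind speed.
times = {3: [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103],
         6: [78, 61, 58, 81, 71, 57, 21, 32, 70],
         9: [62, 48, 21, 55, 101]}

xs = [w for w, ts in times.items() for _ in ts]
ys = [t for ts in times.values() for t in ts]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
print("r =", round(r, 2))  # near zero: the censored sample hides the effect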