Artificial Intelligence
Download
Report
Transcript Artificial Intelligence
Empirical Methods for AI (and CS)
CS1573: AI Application Development
(after a tutorial by Paul Cohen, Ian P. Gent, and Toby Walsh)
Overview
Introduction
What are empirical methods?
Why use them?
Case Study
Experiment design
Data analysis
2
Empirical Methods for CS
Part I :
Introduction
What does “empirical” mean?
Relying on observations, data, experiments
Empirical work should complement theoretical work
Theories often have holes
Theories are suggested by observations
Theories are tested by observations
Conversely, theories direct our empirical attention
In addition (in this course at least) empirical means “wanting
to understand behavior of complex systems”
4
Why We Need Empirical Methods
Cohen, 1990 Survey of 150 AAAI Papers
Roughly 60% of the papers gave no evidence that the work they
described had been tried on more than a single example problem.
Roughly 80% of the papers made no attempt to explain performance, to
tell us why it was good or bad and under which conditions it might be
better or worse.
Only 16% of the papers offered anything that might be interpreted as a
question or a hypothesis.
Theory papers generally had no applications or empirical work to support
them, empirical papers were demonstrations, not experiments, and had
no underlying theoretical support.
The essential synergy between theory and empirical
work was missing
5
Empirical CS/AI (Theory, not Theorems)
Computer programs are formal objects
so let’s reason about them entirely formally?
Two reasons why we can’t or won’t:
theorems are hard
some questions are empirical in nature
e.g. are finite state automata adequate to represent the sort
of knowledge met in practice?
e.g. even though our problem is intractable in general, are
the instances met in practice easy to solve?
6
Empirical CS/AI
Treat computer programs as natural objects
like fundamental particles, chemicals, living organisms
Build (approximate) theories about them
construct hypotheses
e.g. greedy search is important to chess
test with empirical experiments
e.g. compare search with planning
refine hypotheses and modelling assumptions
e.g. greediness not important, but search is!
7
Empirical CS/AI
Many advantage over other sciences
Cost
no need for expensive super-colliders
Control
unlike the real world, we often have complete command of
the experiment
Reproducibility
in theory, computers are entirely deterministic
Ethics
no ethics panels needed before you run experiments
8
Types of hypothesis
My search program is better than yours
not very helpful beauty competition?
Search cost grows exponentially with number of variables for
this kind of problem
better as we can extrapolate to data not yet seen?
Constraint systems are better at handling over-constrained
systems, but OR systems are better at handling underconstrained systems
even better as we can extrapolate to new situations?
9
A typical conference conversation
What are you up to these days?
I’m running an experiment to compare the Davis-Putnam
algorithm with GSAT?
Why?
I want to know which is faster
Why?
Lots of people use each of these algorithms
How will these people use your result?
...
10
Keep in mind the BIG picture
What are you up to these days?
I’m running an experiment to compare the Davis-Putnam
algorithm with GSAT?
Why?
I have this hypothesis that neither will dominate
What use is this?
A portfolio containing both algorithms will be more robust
than either algorithm on its own
11
Keep in mind the BIG picture
...
Why are you doing this?
Because many real problems are intractable in theory but
need to be solved in practice.
How does your experiment help?
It helps us understand the difference between average and
worst case results
So why is this interesting?
Intractability is one of the BIG open questions in CS!
12
Why is empirical CS/AI in vogue?
Inadequacies of theoretical analysis
problems often aren’t as hard in practice as theory
predicts in the worst-case
average-case analysis is very hard (and often based on
questionable assumptions)
Some “spectacular” successes
phase transition behaviour
local search methods
theory lagging behind algorithm design
13
Why is empirical CS/AI in vogue?
Compute power ever increasing
even “intractable” problems coming into range
easy to perform large (and sometimes meaningful)
experiments
Empirical CS/AI perceived to be “easier” than theoretical
CS/AI
often a false perception as experiments easier to mess up
than proofs
14
Empirical Methods for CS
Part II:
A Case Study
Case Study
Scheduling processors on ring
network
jobs spawned as binary trees
KOSO
keep one, send one to my left
or right arbitrarily
KOSO*
keep one, send one to my
least heavily loaded
neighbour
16
Theory
On complete binary trees, KOSO is
asymptotically optimal
So KOSO* can’t be any better?
But assumptions unrealistic
tree not complete
asymptotically not necessarily
the same as in practice!
17
Lesson: Evaluation begins with claims
Lesson: Demonstration is good, understanding better
Hypothesis (or claim): KOSO takes longer than KOSO*
because KOSO* balances loads better
The “because phrase” indicates a hypothesis about why it
works. This is a better hypothesis than the beauty contest
demonstration that KOSO* beats KOSO
Experiment design
Independent variables: KOSO v KOSO*, no. of
processors, no. of jobs, probability(job will spawn),
Dependent variable: time to complete jobs
18
Criticism: This experiment design includes no
direct measure of the hypothesized effect
Hypothesis: KOSO takes longer than KOSO* because
KOSO* balances loads better
But experiment design includes no direct measure of load
balancing:
Independent variables: KOSO v KOSO*, no. of
processors, no. of jobs, probability(job will spawn),
Dependent variable: time to complete jobs
19
Lesson: The task of empirical work is to explain
variability
Empirical work assumes the variability in a dependent variable (e.g., run
time) is the sum of causal factors and random noise. Statistical methods
assign parts of this variability to the factors and the noise.
Algorithm (KOSO/KOSO*)
Number of processors
run-time
Number of jobs
“random noise” (e.g., outliers)
Number of processors and number of jobs explain 74% of the variance
in run time. Algorithm explains almost none.
20
Lesson: Keep the big picture in mind
Why are you studying this?
Load balancing is important to get good performance out of
parallel computers
Why is this important?
Parallel computing promises to tackle many of our
computational bottlenecks
How do we know this? It’s in the first paragraph of the
paper!
21
Empirical Methods for CS
Part III :
Experiment design
Experimental Life Cycle
Exploration
Hypothesis construction
Experiment
Data analysis
Drawing of conclusions
23
Types of experiment designs
Manipulation experiment
Observation experiment
Factorial experiment
24
Manipulation experiment
Independent variable, x
x=identity of parser, size of dictionary, …
Dependent variable, y
y=accuracy, speed, …
Hypothesis
x influences y
Manipulation experiment
change x, record y
25
Observation experiment
Predictor, x
x=volatility of stock prices, …
Response variable, y
y=fund performance, …
Hypothesis
x influences y
Observation experiment
classify according to x, compute y
26
Factorial experiment
Several independent variables, xi
there may be no simple causal links
data may come that way
e.g. programs have different languages, algorithms, ...
Factorial experiment
every possible combination of xi considered
expensive as its name suggests!
27
Some problem issues
Control
Ceiling and Floor effects
Sampling Biases
28
Control
A control is an experiment in which the hypothesised variation
does not occur
so the hypothesized effect should not occur either
BUT remember
placebos cure a large percentage of patients!
29
Control: MYCIN case study
MYCIN was a medial expert system
recommended therapy for blood/meningitis infections
How to evaluate its recommendations?
Shortliffe used
10 sample problems, 8 therapy recommenders
5 faculty, 1 resident, 1 postdoc, 1 student
8 impartial judges gave 1 point per problem
max score was 80
Mycin 65, faculty 40-60, postdoc 60, resident 45, student 30
30
Control: MYCIN case study
What were controls?
Control for judge’s bias for/against computers
judges did not know who recommended each therapy
Control for easy problems
medical student did badly, so problems not easy
Control for our standard being low
e.g. random choice should do worse
Control for factor of interest
e.g. hypothesis in MYCIN that “knowledge is power”
have groups with different levels of knowledge
31
Ceiling and Floor Effects
Well designed experiments (with good controls) can still go
wrong
What if all our algorithms do particularly well
Or they all do badly?
We’ve got little evidence to choose between them
32
Ceiling and Floor Effects
Ceiling effects arise when test problems are insufficiently
challenging
floor effects the opposite, when problems too challenging
A problem in AI because we often repeatedly use the same
benchmark sets
most benchmarks will lose their challenge eventually?
but how do we detect this effect?
33
Ceiling Effects: machine learning
14 datasets from UCI corpus of benchmarks
used as mainstay of ML community
Problem is learning classification rules
each item is vector of features and a classification
measure classification accuracy of method (max 100%)
Compare C4 with 1R*, two competing algorithms
34
Ceiling Effects: machine learning
DataSet:
C4
1R*
Max
BC
72
72.5
72.5
CH
99.2
69.2
99.2
GL
63.2
56.4
63.2
G2
74.3
77
77
HD
73.6
78
78
HE
81.2
85.1
85.1
…
...
...
…
Mean
85.9
83.8
87.4
C4 achieves only about 2% better than 1R*
Best of the C4/1R* achieves 87.4% accuracy
We have only weak evidence that C4 better
Both methods performing near ceiling of possible so
comparison hard!
35
Ceiling Effects: machine learning
In fact 1R* only uses one feature (the best one)
C4 uses on average 6.6 features
5.6 features buy only about 2% improvement
Conclusion?
Either real world learning problems are easy (use 1R*)
Or we need more challenging datasets
We need to be aware of ceiling effects in results
36
Sampling bias
Data collection is biased against
certain data
e.g. teacher who says “Girls don’t
answer maths question”
observation might suggest:
girls don’t answer many
questions
but that the teacher doesn’t ask
them many questions
Experienced AI researchers don’t do
that, right?
37
Sampling bias: Phoenix case study
AI system to fight (simulated)
forest fires
Experiments suggest that wind
speed uncorrelated with time to
put out fire
obviously incorrect as high
winds spread forest fires
38
Sampling bias: Phoenix case study
Wind Speed vs containment time (max 150 hours):
3: 120
55
79
10
140 26
15
110
54 10
103
6: 78
61
58
81
71
57
21
32
9: 62
48
21
55
101
What’s the problem?
12
70
39
Sampling bias: Phoenix case study
The cut-off of 150 hours introduces sampling bias
many high-wind fires get cut off, not many low wind
On remaining data, there is no correlation between wind
speed and time (r = -0.53)
In fact, data shows that:
a lot of high wind fires take > 150 hours to contain
those that don’t are similar to low wind fires
You wouldn’t do this, right?
you might if you had automated data analysis.
40
Empirical Methods for CS
Part IV:
Data analysis
Kinds of data analysis
Exploratory (EDA) – looking for patterns in data
Statistical inferences from sample data
Testing hypotheses
Estimating parameters
Building mathematical models of datasets
Machine learning, data mining…
We will introduce hypothesis testing and computer-intensive
methods
42
The logic of hypothesis testing
Example: toss a coin ten times, observe eight heads. Is the
coin fair (i.e., what is it’s long run behavior?) and what is your
residual uncertainty?
You say, “If the coin were fair, then eight or more heads is
pretty unlikely, so I think the coin isn’t fair.”
Like proof by contradiction: Assert the opposite (the coin is
fair) show that the sample result (≥ 8 heads) has low probability
p, reject the assertion, with residual uncertainty related to p.
Estimate p with a sampling distribution.
43
The logic of hypothesis testing
Establish a null hypothesis: H0: p = .5, the coin is fair
Establish a statistic: r, the number of heads in N tosses
Figure out the sampling distribution of r given H0
0
1
2
3
4
5 6
7
8
9 10
The sampling distribution will tell you the probability p of a
result at least as extreme as your sample result, r = 8
If this probability is very low, reject H0 the null hypothesis
Residual uncertainty is p
44
A common statistical test: The Z test for
different means
A sample N = 25 computer science students has mean IQ
m=135. Are they “smarter than average”?
Population mean is 100 with standard deviation 15
The null hypothesis, H0, is that the CS students are “average”,
i.e., the mean IQ of the population of CS students is 100.
What is the probability p of drawing the sample if H0 were true?
If p small, then H0 probably false.
Find the sampling distribution of the mean of a sample of size
25, from population with mean 100
45
Reject the null hypothesis?
Commonly we reject the H0 when the probability of obtaining
a sample statistic (e.g., mean = 135) given the null
hypothesis is low, say < .05.
A test statistic value, e.g. Z = 11.67, recodes the sample
statistic (mean = 135) to make it easy to find the probability of
sample statistic given H0.
We find the probabilities by looking them up in tables, or
statistics packages provide them.
For example, Pr(Z ≥ 1.67) = .05; Pr(Z ≥ 1.96) = .01.
Pr(Z ≥ 11) is approximately zero, reject H0.
46
Summary of hypothesis testing
H0 negates what you want to demonstrate; find probability p of
sample statistic under H0 by comparing test statistic to sampling
distribution; if probability is low, reject H0 with residual uncertainty
proportional to p.
Example: Want to demonstrate that CS graduate students are
smarter than average. H0 is that they are average. t = 2.89, p ≤
.022
Have we proved CS students are smarter? NO!
We have only shown that mean = 135 is unlikely if they aren’t. We
never prove what we want to demonstrate, we only reject H0, with
residual uncertainty.
And failing to reject H0 does not prove H0, either!
47
Computer-intensive Methods
Basic idea: Construct sampling distributions by simulating on
a computer the process of drawing samples.
Three main methods:
Monte carlo simulation when one knows population parameters;
Bootstrap when one doesn’t;
Randomization, also assumes nothing about the population.
Enormous advantage: Works for any statistic and makes no
strong assumptions (e.g., normality)
48
Summary
Empirical CS and AI are exacting sciences
There are many ways to do experiments wrong
We are experts in doing experiments badly
As you perform experiments, you’ll make many mistakes
Learn from those mistakes, and ours!
49