Transcript slides

A Syntactic Justification of
Occam’s Razor
1
John Woodward, Jerry Swan
Foundations of Reasoning Group
University of Nottingham Ningbo, China
宁波诺丁汉大学
Email: [email protected]
Overview
2







Occam’s Razor
Definitions
Sampling of Program Spaces (Langdon)
Two Assumptions
Proof of Occam’s Razor
Further Work
Context
Occam’s Razor
3


Occam’s Razor says has been adopted by the
machine learning community to mean;
“Given two hypotheses which agree with the
observed data, pick the simplest, as this is more
likely to make the correct predictions”
Definitions
4
Program
Hypothesis: a program
(effective procedure
which allows us to
make predictions)
Size: the number of bits
needed to represent a
hypothesis.
Function
Concept: A set of
predictions (inputoutput pair)
Complexity: the size of
the smallest
program which
computes a given
function.
From Hypotheses to Concepts.
5
Langdon 1 (Foundation of Genetic
Programming)
6
1. The limiting distribution of functions is independent of program size!
There is a correlation between the frequency in the limiting distribution
and the complexity of a function.
Langdon 2 (Foundation of Genetic
Programming)
7
Hypothesis-Concept Spaces
8
Notation
9





P is the hypothesis space (i.e. a set of programs).
|P| is the size of the space (i.e. the cardinality of the set of
programs).
F is the concept space (i.e. a set of functions represented by
the programs in P).
|F| is the size of the space (i.e. the cardinality of the set of
functions).
If two programs pi and pj map to the same function (i.e. they
are interpreted as the same function, I(pi)=f=I(pj)), they
belong to the same equivalence class (i.e. pi is in [pj] ↔
I(pi)=I(pj)). The notation [p] denotes the equivalence class
which contains the program p (i.e. given I(pi)=I(pj), [pi]=[pj]).
The size of an equivalence class [p] is denoted by |[p]|.
Two assumptions
10
1.
2.



Uniformly sample the hypothesis space, probability of
sampling a given program is 1/|P|.
There are fewer hypotheses that represent complex
functions
|[p1]|>|[p2]|↔c(f1)<c(f2), where I(p1)=f1 and
I(p2)=f2.
Note that |[p1]|/|P| = p(I(p1))=p(f1), that is
|[p1]|/|P|=p(f1)
the probability of sampling a function is given by the ratio
of the size of the equivalence class containing all the
programs which are interpreted that function, divided by
the size of the hypothesis space.
Proof
11







Starting from a statement of the assumption
|[p1]|>|[p2]| ↔ c(f1) < c(f2)
Dividing the left hand side by |P|,
|[p1]|/|P|>|[p2]|/|P| ↔c(f1)< c(f2)
As |[p1]|/|P| = p(I(p1)) = p(f1), we can rewrite as
p(f1) > p(f2) ↔ c(f1) < c(f2)
Which is a statement of Occam’s razor.
Restatement of Occam’s Razor
12


Occam’s Razor is often stated as “prefer the shortest
consistent hypothesis”
Our restatement of Occam’s Razor: “The preferred
function is the one that is represented most frequently.
The equivalence class which contains the shortest
program is represented most frequently.”
Summary
13





Occam’s razor states “pick the simplest hypothesis
consistent with data”
We agree, but for a different reason.
Restatement. Pick the function that is represented most
frequently (i.e. belongs to the largest equivalence class).
Occam’s razor is concerned with probability, and we
presented a simple counting argument.
Unlike some interpretations of Occam’s razor we do not
discard more complex hypotheses we count them in [p].
We offer no reason to believe the world is simple, our
razor only gives a reason to predict using the simplest
hypothesis.
Further work
14



There are fewer hypotheses that represent complex
functions: |[p1]|>|[p2]|↔ c(f1) < c(f2),
Why are some functions represented more frequently
that other functions.
The base functions may contain functions which are:
Symmetrical i.e. f(x, y) = f(y, x), e.g. nand.
 Complementary, i.e. f(g(x))= g(f(x)) e.g. inc and dec.

Further work
15



Further work - to prove our assumptions.
Does the result depend on the primitive set?
How are the primitive functions linked together (e.g.
tree, lists, directed acyclic graphs…). Does this
make a difference to the result.
How does nature compute?
16



Heuristics such as Occam’s razor need not be
explicitly present as rules.
Random searches of an agents generating capacity
may implicitly carry heuristics.
Axiomatic reasoning probably comes late.
Thanks & Questions?
17
1)
2)
3)
4)
5)
6)
7)
Thomas M. Cover and Joy A. Thomas. Elements of information
theory. John Wiley and Sons 1991.
Michael J. Kearns and Umesh V. Vazirani. An introduction to
computational learning theory. MIT Press, 1994.
William B. Langdon. Scaling of program fitness spaces.
Evolutionary Computation, 7(4):399-428,1999.
Tom M. Mitchell. Machine Learning. McGraw-Hill 1997.
S. Russell and P. Norvig. Artificial Intelligence: A modern
approach. Prentice Hall, 1995.
G. I. Webb. Generality is more significant than complexity: Toward
an alternative to occam’s razor. In 7th Australian Joint Conference
on Artificial Intelligence – Artificial Intelligence: Sowing the Seeds
for the Future, 60-67, Singapore, 1994, World Scientific.
Ming Li and Paul Vitanyi . An Introduction to Kolmogorov
Complexity and Its Applications (2nd Ed.). Springer Verlag.