Information theory, MDL and human cognition
Nick Chater
Department of Psychology
University College London
[email protected]
Overview




Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications
Bayes and MDL: A simplified story

Shannon's coding theorem:

For a distribution Pr(A), the optimal code assigns a codeword of length -log2 Pr(A) bits to event A

MDL model selection:

Choose the model M that yields the shortest code for the data D, i.e., minimize:
-log2 Pr(D, M)
A simple equivalence

Minimize:
-log2 Pr(D, M)

Since Pr(D, M) = Pr(M|D) Pr(D), and Pr(D) does not depend on M, this is the same as:

Maximize:
Pr(M|D)

Just what Bayes recommends (if choosing a single model)

The equivalence generalizes from parametric M(θ) to 'full' Bayes, and in other ways

See Chater, 1996, for application to the simplicity and likelihood principles in perceptual organization
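To make the equivalence concrete, here is a minimal numerical sketch (the two candidate models, their priors, and their likelihoods are invented purely for illustration): choosing the model with the shortest joint code length -log2 Pr(D, M) picks out exactly the model with the highest posterior Pr(M|D).

```python
import math

# Hypothetical priors and likelihoods for two candidate models of some data D
models = {
    "M1": {"prior": 0.7, "likelihood": 0.02},  # Pr(M1), Pr(D | M1)
    "M2": {"prior": 0.3, "likelihood": 0.10},  # Pr(M2), Pr(D | M2)
}

# Joint probability Pr(D, M) = Pr(M) Pr(D | M), and its code length in bits
joint = {m: v["prior"] * v["likelihood"] for m, v in models.items()}
code_length = {m: -math.log2(p) for m, p in joint.items()}

# Posterior Pr(M | D) = Pr(D, M) / Pr(D)
pr_d = sum(joint.values())
posterior = {m: p / pr_d for m, p in joint.items()}

shortest_code = min(code_length, key=code_length.get)
highest_posterior = max(posterior, key=posterior.get)
print(code_length, posterior)
print(shortest_code == highest_posterior)  # True: MDL and Bayes agree
```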
Codes or priors? Which comes first?
1. The philosophical issue

Bayesian viewpoint: Probabilities as basic



calculus for degrees of belief (probability theory)
decision theory (probabilities meet action)
brain as a probabilistic calculation machine
(whether belief propagation, dynamic programming…)

Simplicity/MDL viewpoint: Codes as basic



Rissanen: data is all there is; distributions are a
fiction
Code structure is primary; code interpretation is
secondary
Probabilities defined over events; but “events”
are cognitive constructs

Leeuwenberg & Boselie (1988)
Codes or priors? Which comes first?
2. The practical issue

Bayesian viewpoint:


Take probabilities as
basic…
…when we know
most about
probability, e.g.,
image statistics:

MDL viewpoint:


Take codes as
basic…
…when we know
most about
representation, e.g.,
grammars
Bayesian viewpoint (e.g., Geisler et al, 2001)
Good continuation—most lines continue in
the same direction in real images
Simplicity/MDL viewpoint (e.g., Goldsmith, 2001)
S → NP VP
VP → V NP
VP → V NP PP
NP → Det Noun
NP → NP PP


"Binding constraints"

In, e.g., linguistics,
representations are given by
theory
And we can roughly assess
the complexity of grammars
(by length)
Not so clear how to set a prior directly over all grammars

(though can define a generative
process in simple cases…)
Gzip as a handy approximation!?!
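As a rough sketch of the "gzip trick" (using Python's zlib as a stand-in compressor; the toy grammar strings are invented), compressed length gives a crude, easily computed upper bound on the code length of a written-out grammar:

```python
import zlib

def approx_code_length_bits(text: str) -> int:
    """Crude code-length estimate: size in bits of the zlib-compressed text."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

# Two hypothetical grammar fragments, written out as plain text
grammar_small = "S -> NP VP\nVP -> V NP\nNP -> Det Noun\n"
grammar_large = grammar_small + "VP -> V NP PP\nNP -> NP PP\nPP -> P NP\nNP -> Pronoun\n"

print(approx_code_length_bits(grammar_small))
print(approx_code_length_bits(grammar_large))  # the bigger grammar needs a longer (approximate) code
```

Of course, a general-purpose compressor captures only superficial regularities; it stands in for the real code only as a heuristic.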


Simplicity/MDL and Bayes are closely
related
Let's now explore the simplicity perspective
Overview




Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications
The most neutral possible prior…



Suppose we want a prior so neutral that it never rules out a model
Possible, if we limit ourselves to computable models
Mixture of all (computable) priors, with weights $\lambda_i$ that decline fairly fast:

$m(x) = \sum_i \lambda_i \, p_i(x)$

Then this m multiplicatively dominates all computable priors
though neutral priors will mean slow learning
Such m(x) are "universal" priors
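A toy sketch of why the mixture dominates (a hand-picked family of Bernoulli models stands in for "all computable priors", and the weights 2^-(i+1) are one choice that declines fast enough): the mixture's code length -log2 m(x) is never worse than the i-th component's code length by more than -log2 of its weight.

```python
import math

def bernoulli_prob(theta: float, bits: str) -> float:
    """Probability of a bit string under an i.i.d. Bernoulli(theta) model."""
    ones = bits.count("1")
    return theta**ones * (1 - theta)**(len(bits) - ones)

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]                     # stand-in model family p_1..p_5
weights = [2.0**-(i + 1) for i in range(len(thetas))]  # fast-declining weights, sum < 1

x = "1101110111"
component_probs = [bernoulli_prob(t, x) for t in thetas]
m_x = sum(w * p for w, p in zip(weights, component_probs))

for i, p in enumerate(component_probs):
    # -log2 m(x) <= -log2 p_i(x) + (i + 1), since m(x) >= weights[i] * p_i(x)
    print(i + 1, round(-math.log2(p), 2), round(-math.log2(m_x), 2))
```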
The most neutral possible coding
language



Universal programming languages (Java, matlab, UTMs, etc)
K(x) = length of shortest program in Java, matlab, UTM, that
generates x (K is uncomputable)
Invariance theorem:

For any two universal languages L1 and L2 there is a constant c such that, for all x,
$|K_{L_1}(x) - K_{L_2}(x)| \le c$

Mathematically justifies talk of K(x), not K_Java(x), K_Matlab(x), …
So does this mean that choice of language
doesn’t matter?



Not quite!
c can be large
And, for any L1 and any bound c0, there are L2 and x such that
$|K_{L_1}(x) - K_{L_2}(x)| \ge c_0$
The problem of the one-instruction code for the entire data set…
But Kolmogorov complexity can be made concrete…
Compact Universal Turing machines

210 bits, λ-calculus
272 bits, combinators
Due to John Tromp, 2007
Not much room to hide here!
Neutral priors and Kolmogorov complexity

A key result:
$K(x) = -\log_2 m(x) + o(1)$
where m is a universal prior

Analogous to Shannon's source coding theorem

And for any computable q,
$K(x) \le -\log_2 q(x) + o(1)$
for typical x drawn from q(x)

Any data x that is likely under any sensible probability distribution has low K(x)
Prediction by simplicity


Find shortest ‘program/explanation’ for current ‘corpus’ (binary
string)
Predict using that program
Strictly, use a 'weighted sum' of explanations, weighted by brevity
Prediction is possible
(Solomonoff, 1978)
Summed error has finite bound:

$\sum_{j=1}^{\infty} s_j \le \frac{K(\mu)\,\log_e 2}{2}$

where s_j is the expected squared error between the predicted and true probability on item j, and μ is the true (computable) generating distribution

So prediction converges [faster than 1/(n log n)], for corpus size n
Computability assumptions only (no stationarity needed)
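A miniature, purely illustrative version of this kind of prediction (Bernoulli components and a hand-picked true rate replace arbitrary programs and an arbitrary computable source): predictions come from a brevity-weighted mixture, updated as each observation arrives, and the summed squared prediction error stays small.

```python
import random

random.seed(0)
thetas = [i / 10 for i in range(1, 10)]                # candidate models of Pr(next bit = 1)
weights = [2.0**-(i + 1) for i in range(len(thetas))]  # "shorter programs" get larger weight
true_theta = 0.8                                       # the source generating the corpus

posterior = list(weights)
errors = []
for _ in range(500):
    total = sum(posterior)
    posterior = [w / total for w in posterior]            # renormalize to avoid underflow
    pred = sum(w * t for w, t in zip(posterior, thetas))  # mixture prediction of Pr(next bit = 1)
    errors.append((pred - true_theta) ** 2)
    bit = 1 if random.random() < true_theta else 0        # observe the next bit of the corpus
    # reweight each component by how well it predicted the observed bit
    posterior = [w * (t if bit else 1 - t) for w, t in zip(posterior, thetas)]

print(round(sum(errors), 3))            # summed squared error stays bounded
print(errors[0], round(errors[-1], 6))  # late predictions are far closer to the truth
```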
Summary so far…





Simplicity/MDL - close and deep connections with
Bayes
Defines universal prior (i.e., based on simplicity)
Can be made “concrete”
General prediction results
A convenient “dual” framework to Bayes, when codes
are easier than probabilities
Li, M. & Vitányi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd edition). Springer.
Overview




Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications
A problem of model selection?
Or: why simplicity won’t go away

Where do priors come from?



Well, priors can be given by hyper-priors
And hyper-priors by hyper-hyper-priors
But it can’t go on forever!

And we need priors over models we've only just thought of

And, in some contexts, over models we haven’t yet thought of (!)

Code length in our representation language is a fixed basis

Nb. Building probabilistic models = augmenting our language with
new coding schemes
The hidden role of simplicity…




Bayesian model selection prefers
y(x) = a_2 x^2 + a_1 x + a_0
Not
y(x) = a_125 x^125 + a_124 x^124 + … + a_0
But who says how many parameters a function has got??
A trick…

Convert parameters to constants

y(x) = a_125 x^125 + a_124 x^124 + … + a_0
→ 126 parameters

y(x) = .003 x^125 + .02 x^124 + … + 3x - 24.3
→ 0 parameters

And hence it is favoured by Bayesian (and all other) model selection criteria
All the virtues of theft over honest toil
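A sketch of why the "trick" fools standard selection criteria (BIC is used here as a convenient stand-in, and the data are simulated): the same fitted curve, re-declared with its coefficients as fixed constants, pays no parameter penalty and so scores better.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**2 + x + 1 + rng.normal(0, 0.1, size=x.size)  # noisy quadratic data

def bic(residuals, k):
    """BIC = n*log(RSS/n) + k*log(n); lower is better."""
    n = residuals.size
    return n * math.log(np.sum(residuals**2) / n) + k * math.log(n)

coef = np.polyfit(x, y, 2)      # fit a quadratic
fit = np.polyval(coef, x)

bic_honest = bic(y - fit, k=3)  # honest model: three free parameters
bic_trick = bic(y - fit, k=0)   # same curve, coefficients declared constants
print(bic_honest, bic_trick)    # identical fit, but the "0-parameter" model wins
```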
Zoubin’s problem for ML

The ML Gaussian is a delta function on one point
An impressive fit!
A related problem for Bayes?

The mixture of delta functions model (!)
An even more
impressive fit!
Should the "cheating" model get a huge boost from this data?

No!
Sense of moral outrage
Model must be fitted post hoc
It would be different if I'd thought of it before the data arrived (cf. empirical Bayes)

Yes!
But order of data acquisition has no role in Bayes
Confirmation is just the same, whenever I thought of the model

The model gets a spectacular boost; but it is even more spectacularly unlikely…

So we need to take care with priors!




y = x
High prior; compact to state

y = .003 x^125 + .02 x^124 + … + 3x - 24
Low prior; not compact to state

With a different representation language, we could have the opposite bias
But we start from where we are; our actual representations
We can discover that things are simpler than we thought
(i.e., simplicity is not quite so subjective…)
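One toy way to make "compact to state" concrete (the 8-bits-per-character measure and the 2^-length prior are illustrative assumptions, not the talk's definition): weight each hypothesis by 2 to the minus its description length in our current representation language.

```python
hypotheses = [
    "y=x",
    "y=.003*x**125 + .02*x**124 + 3*x - 24",
]

def description_length_bits(h: str) -> float:
    """Toy measure: 8 bits per character of the written-out hypothesis."""
    return 8.0 * len(h)

unnormalized = {h: 2.0 ** -description_length_bits(h) for h in hypotheses}
z = sum(unnormalized.values())
priors = {h: p / z for h, p in unnormalized.items()}
print(priors)  # the compactly stated hypothesis gets a vastly higher prior
```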
Overview




Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications
There are quite a few…
Domain | Principle | References
Perceptual organization | Find grouping that minimizes cost | Koffka, 1935; Leeuwenberg, 1971; Attneave & Frost, 1969
Early vision | Efficient coding & transmission | Blakemore, 1990; Barlow, 1974; Srinivasan & Laughlin
Causal reasoning | Find minimal belief network | Wedelin
Similarity | Similarity between representations measured by code length between them | Chater & Vitányi, 2003; Hahn, Chater & Richardson, 2003
Categorization | Categorize items to find shortest code (high-level perceptual organization) | Pothos & Chater, 2002; Feldman, 2000
Memory storage | Shorter codes easier to store | Chater, 1999
Memory retrieval | Explain interference by cue-trace complexity; a rational foundation for distinctiveness models | SIMPLE: Brown, Neath & Chater, 2005
Language acquisition | Find grammar that best explains child's input | Chomsky, 1955; J. D. Fodor & Crain; Chater, 2004; Chater & Vitányi, 2005
Here:



Perceptual organization
Language Acquisition
Similarity and generalization
Long tradition of simplicity in perception
(Mach, Koffka, Leeuwenberg); e.g., Gestalt laws
[Figure: Gestalt grouping of moving elements. Grouped description: 6 + 1 vectors (six positions (x) plus one shared motion vector (v)). Ungrouped description: 6 × 2 vectors (a separate (x, v) pair for each element).]
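A back-of-the-envelope version of the comparison in the figure (the 16-bits-per-vector cost is an arbitrary illustrative assumption): the grouped description re-uses one shared motion vector and so needs fewer vectors overall.

```python
BITS_PER_VECTOR = 16  # assumed cost of encoding one 2-D vector

n_elements = 6
grouped_bits = (n_elements + 1) * BITS_PER_VECTOR    # 6 positions + 1 shared motion vector
ungrouped_bits = (n_elements * 2) * BITS_PER_VECTOR  # 6 independent (position, velocity) pairs

print(grouped_bits, ungrouped_bits)  # the grouped (Gestalt) description is shorter
```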
And language acquisition: where it helps
resolve an apparent learnability paradox



Undergeneral grammars predict that good sentences are not allowed
just wait until one turns up

Overgeneral grammars predict that bad sentences are actually OK
Need negative evidence: say a bad sentence, and get corrected
The logical problem of language acquisition
(e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)





Without negative evidence, the learner can never eliminate overgeneral grammars
"Mere" non-occurrence of sentences is not enough…
…because almost all acceptable sentences also never occur
Backed up by formal results (Gold, 1967; though Feldman, Horning et al.)
Argument for innateness?
An "ideal" learning set-up (cf. ideal observers)

Linguistic environment: positive evidence only; computability
Learning method: statistical; simplicity
Measures of learning performance
Overgeneralization Theorem
(Chater & Vitányi)

Suppose the learner has probability ε_j of erroneously guessing an ungrammatical jth word. Then:

$\sum_{j=1}^{\infty} \epsilon_j \le K(\mu)\,\log_e 2$

where μ is the true (computable) distribution generating the input

Intuitive explanation:

overgeneralization underloads the probabilities of grammatical sentences;
small probabilities imply longer code lengths
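A toy arithmetic version of this intuition (the grammar sizes and the uniform-probability assumption are mine, purely for illustration): an overgeneral grammar spreads probability over sentence types that never occur, so each observed sentence gets a longer code.

```python
import math

corpus_size = 100  # number of observed (all grammatical) sentences

p_correct = 1 / 1_000       # correct grammar: probability spread over 1,000 grammatical types
p_overgeneral = 1 / 10_000  # overgeneral grammar: also licenses 9,000 types that never occur

cost_correct = corpus_size * -math.log2(p_correct)
cost_overgeneral = corpus_size * -math.log2(p_overgeneral)
print(round(cost_correct), round(cost_overgeneral))  # the overgeneral grammar codes the corpus less compactly
```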
Absence as implicit negative evidence


Overgeneral grammars predict missing sentences
And their absence is a clue that the grammar is
wrong
Method can be “scaled-down” to consider
learnability of specific linguistic constructions
Similarity and categorization

Cognitive dissimilarity: representational “distortion”
required to get from x to y


D_U(x, y) = K(y | x)

Not symmetrical:
K(y | x) > K(x | y) when K(y) > K(x)
Deletion is easy…
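A rough, compression-based stand-in for D_U(x, y) = K(y | x) (in the spirit of compression approximations to Kolmogorov complexity; zlib and the two example descriptions are my illustrative choices, so the numbers are only suggestive): estimate K(y | x) by how many extra compressed bits y costs once x is already in hand.

```python
import zlib

def c(s: str) -> int:
    """Compressed length in bits (zlib): a crude upper bound on K(s)."""
    return 8 * len(zlib.compress(s.encode("utf-8"), 9))

def d_u(x: str, y: str) -> int:
    """Approximate D_U(x, y) = K(y | x) as C(x + y) - C(x)."""
    return c(x + y) - c(x)

rich = "a horse: four legs, mane, tail, hooves, gallops, neighs, eats grass"
sparse = "pegasus: a horse with wings"

print(d_u(rich, sparse))  # reaching the sparse description from the rich one: cheap
print(d_u(sparse, rich))  # reaching the rich description from the sparse one: costly
```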
Shepard's (1987) Universal Law

Generalization (strictly, confusability) is an exponential function of psychological "distance":

$G(a,b) \approx A e^{-B\,D_U(S_a, S_b)}$

A derivation

Shepard's generalization measure:

$G(a,b) = \left(\frac{\Pr(R_a \mid S_b)\,\Pr(R_b \mid S_a)}{\Pr(R_a \mid S_a)\,\Pr(R_b \mid S_b)}\right)^{1/2}$

So:

$\log_2 G(a,b) = \tfrac{1}{2}\bigl(\log_2 \Pr(R_a \mid S_b) + \log_2 \Pr(R_b \mid S_a) - \log_2 \Pr(R_a \mid S_a) - \log_2 \Pr(R_b \mid S_b)\bigr)$

For "typical" items:

$\log_2 G(a,b) = -\tfrac{1}{2}\bigl(K(R_a \mid S_b) + K(R_b \mid S_a)\bigr) + o(1)$

Assuming items are of roughly the same complexity:

$\log_2 G(a,b) = -D_U(S_a, S_b) + o(1)$

which gives the universal law:

$G(a,b) \approx A e^{-B\,D_U(S_a, S_b)}$
The asymmetry of similarity

What thing is this like?

And what is this like?
A heuristic measure of amount of
information: Shannon’s guessing game…
1. Pony?
2. Cow?
3. Dog?
…
345. Pegasus ✓

345 guesses!
Asymmetry of code lengths ⇒ asymmetry of similarity

Horse: guess #345 gets Pegasus. Pr(#345) is very small, so the code length -log2 Pr(#345) is large.

Pegasus: guess #2 gets Horse. Pr(#2) is large, so the code length -log2 Pr(#2) is small.

So Pegasus is more like Horse, than Horse is like Pegasus

Many other examples of asymmetry, and many measures (search times, memory confusions…), which seem to fit this pattern
Treisman & Souther (1985): a simple array vs. a complex array (visual search asymmetry)
Summary



MDL/Kolmogorov complexity: close relation with Bayes
Basis for a “universal” prior
Variety of applications to cognitive
science