Inductive inference in perception and cognition
Markov chain Monte Carlo
with people
Tom Griffiths
Department of Psychology
Cognitive Science Program
UC Berkeley
with Mike Kalish, Stephan Lewandowsky, and Adam Sanborn
Inductive problems
Learning languages from utterances
blicket toma
dax wug
blicket wug
S → X Y
X → {blicket, dax}
Y → {toma, wug}
Learning categories from instances of their members
Learning functions from (x,y) pairs
Computational cognitive science
Identify the underlying computational problem
Find the optimal solution to that problem
Compare human cognition to that solution
For inductive problems, solutions come from statistics
Statistics and inductive problems
Cognitive science → Statistics
Categorization → Density estimation
Causal learning → Graphical models
Function learning → Regression
Language → Probabilistic grammars
… → …
Statistics and human cognition
• How can we use statistics to understand cognition?
• How can cognition inspire new statistical models?
– applications of Dirichlet process and Pitman-Yor
process models to natural language
– exchangeable distributions on infinite binary matrices
via the Indian buffet process (priors on causal structure)
– nonparametric Bayesian models for relational data
Are people Bayesian?
Reverend Thomas Bayes
Bayes’ theorem

P(h | d) = P(d | h) P(h) / Σ_{h′ ∈ H} P(d | h′) P(h′)

h: hypothesis
d: data
P(h | d): posterior probability
P(d | h): likelihood
P(h): prior probability
The denominator sums over the space of hypotheses H.
People are stupid
Predicting the future
How often is Google News updated?
t = time since last update
ttotal = time between updates
What should we guess for ttotal given t?
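Under the standard analysis of this task, t is assumed to be drawn uniformly between 0 and ttotal, so p(t | ttotal) = 1/ttotal, and the optimal guess (under absolute-error loss) is the posterior median of ttotal. A minimal numerical sketch, using an illustrative scale-free power-law prior and a made-up observation t = 100:

```python
def posterior_median(t, prior_pdf, grid):
    """Posterior median of t_total given observed t, assuming
    t ~ Uniform(0, t_total), so p(t | t_total) = 1/t_total."""
    # Unnormalized posterior weight at each candidate t_total on the grid
    weights = [prior_pdf(s) / s if s >= t else 0.0 for s in grid]
    total = sum(weights)
    acc = 0.0
    for s, w in zip(grid, weights):
        acc += w
        if acc >= total / 2:   # first grid point where the CDF crosses 1/2
            return s

grid = [1.0 + i * 0.05 for i in range(200_000)]
# Scale-free power-law prior p(t_total) ~ 1/t_total: the posterior median
# comes out near 2t, recovering Gott's rule ("it lasts as long again")
guess = posterior_median(100.0, lambda s: 1.0 / s, grid)
```

Swapping in a Gaussian or Erlang prior changes the optimal guess, which is exactly the domain dependence the experiments probe.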
The effects of priors
Evaluating human predictions
• Different domains with different priors:
– a movie has made $60 million [power-law]
– your friend quotes from line 17 of a poem [power-law]
– you meet a 78 year old man [Gaussian]
– a movie has been running for 55 minutes [Gaussian]
– a U.S. congressman has served for 11 years [Erlang]
• Prior distributions derived from actual data
• Use 5 values of t for each
• People predict ttotal
[Figure: human predictions compared with Gott’s rule, the empirical prior, and the parametric prior]
A different approach…
Instead of asking whether people are rational, use
assumption of rationality to investigate cognition
If we can predict people’s responses, we can design
experiments that measure psychological variables
Two deep questions
• What are the biases that guide human learning?
– prior probability distribution on hypotheses, P(h)
• What do mental representations look like?
– distribution over objects x in category c, P(x|c)
Develop ways to sample from these distributions
Outline
Markov chain Monte Carlo
Sampling from the prior
Sampling from category distributions
Markov chains
x(0) → x(1) → x(2) → … → x(t) → …
Transition matrix
T = P(x(t+1)|x(t))
• Variables x(t+1) independent of history given x(t)
• Converges to a stationary distribution under
easily checked conditions (i.e., if it is ergodic)
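These properties can be sketched in a few lines. The 3-state transition matrix below is made up for illustration; iterating the chain from any starting distribution converges to the stationary distribution π satisfying π = πT:

```python
# A small ergodic transition matrix T[i][j] = P(x(t+1) = j | x(t) = i)
T = [[0.5, 0.4, 0.1],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def step(p, T):
    """One step of the chain: propagate the state distribution through T."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

# Starting from an arbitrary distribution, repeated transitions converge
p = [1.0, 0.0, 0.0]
for _ in range(200):
    p = step(p, T)

stationary = step(p, T)  # applying T again leaves p (numerically) unchanged
```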
Markov chain Monte Carlo
• Sample from a target distribution P(x) by
constructing Markov chain for which P(x) is
the stationary distribution
• Two main schemes:
– Gibbs sampling
– Metropolis-Hastings algorithm
Gibbs sampling
For variables x = x1, x2, …, xn and target P(x)
Draw xi(t+1) from P(xi|x-i)
x-i = x1(t+1), x2(t+1),…, xi-1(t+1), xi+1(t), …, xn(t)
Gibbs sampling
(MacKay, 2002)
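As a concrete sketch (not from the talk): Gibbs sampling for a bivariate standard normal with correlation rho, where each conditional P(xi | x-i) is itself Gaussian with mean rho times the other variable. The correlation value and seed are arbitrary choices:

```python
import math
import random

random.seed(3)
rho = 0.8  # target: bivariate standard normal with correlation rho
cond_sd = math.sqrt(1.0 - rho ** 2)  # sd of each conditional distribution

x, y, samples = 0.0, 0.0, []
for _ in range(20_000):
    # Draw each variable from its conditional given the current other one
    x = random.gauss(rho * y, cond_sd)   # P(x | y)
    y = random.gauss(rho * x, cond_sd)   # P(y | x)
    samples.append((x, y))

kept = samples[2_000:]  # discard burn-in
est_corr = sum(a * b for a, b in kept) / len(kept)  # ~ rho (unit variances)
```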
Metropolis-Hastings algorithm
(Metropolis et al., 1953; Hastings, 1970)
Step 1: propose a state (we assume a symmetric proposal)
Q(x(t+1) | x(t)) = Q(x(t) | x(t+1))
Step 2: decide whether to accept, with probability
A(x, x′) = min(1, P(x′) / P(x)) (Metropolis acceptance function)
A(x, x′) = P(x′) / (P(x′) + P(x)) (Barker acceptance function)
Metropolis-Hastings algorithm
[Animation: random-walk proposals on p(x); e.g., a downhill proposal accepted with A(x(t), x(t+1)) = 0.5, an uphill proposal accepted with A(x(t), x(t+1)) = 1]
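Both acceptance functions fit in one small sampler. This is an illustrative sketch, not the talk's code; the standard-normal target, step size, and seed are arbitrary choices:

```python
import math
import random

def mh_chain(log_p, x0, n_steps, step=1.0, rule="metropolis"):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    x, chain = x0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, step)
        ratio = math.exp(log_p(proposal) - log_p(x))
        if rule == "metropolis":
            accept = min(1.0, ratio)          # Metropolis acceptance
        else:
            accept = ratio / (1.0 + ratio)    # Barker acceptance
        if random.random() < accept:
            x = proposal
        chain.append(x)
    return chain

random.seed(0)
# Target p(x) = standard normal (log density up to an additive constant)
chain = mh_chain(lambda v: -0.5 * v * v, 0.0, 20_000, rule="barker")
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
```

The Barker rule accepts less often than Metropolis but targets the same stationary distribution, which is why it reappears below in the choice-task analysis.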
Outline
Markov chain Monte Carlo
Sampling from the prior
Sampling from category distributions
Iterated learning
(Kirby, 2001)
What are the consequences of learners
learning from other learners?
Analyzing iterated learning
[Diagram: a chain of agents, alternating learning PL(h|d) and production PP(d|h)]
PL(h|d): probability of inferring hypothesis h from data d
PP(d|h): probability of generating data d from hypothesis h
Iterated Bayesian learning
[Diagram: a chain of Bayesian agents, alternating PL(h|d) and PP(d|h)]
Assume learners sample from their posterior distribution:

PL(h | d) = PP(d | h) P(h) / Σ_{h′ ∈ H} PP(d | h′) P(h′)
Analyzing iterated learning
d0 → h1 → d1 → h2 → d2 → h3 → …
(hypotheses are sampled via PL(h|d), data via PP(d|h))
A Markov chain on hypotheses
h1 → h2 → h3 → …, with transition probability Σd PP(d|h) PL(h′|d)
A Markov chain on data
d0 → d1 → d2 → …, with transition probability Σh PL(h|d) PP(d′|h)
Stationary distributions
• Markov chain on h converges to the prior, P(h)
• Markov chain on d converges to the “prior
predictive distribution”
P(d) = Σh P(d | h) P(h)
(Griffiths & Kalish, 2005)
Explaining convergence to the prior
[Diagram: a chain of Bayesian agents, alternating PL(h|d) and PP(d|h)]
• Intuitively: data acts once, prior many times
• Formally: iterated learning with Bayesian
agents is a Gibbs sampler on P(d,h)
(Griffiths & Kalish, in press)
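The convergence-to-the-prior result can be checked by simulation. A sketch with a made-up two-hypothesis, two-datum world (all probabilities here are illustrative): each iteration produces data from the current hypothesis, then samples a new hypothesis from the posterior, which is exactly a Gibbs sweep on P(d, h).

```python
import random

random.seed(1)
H = (0, 1)
prior = (0.8, 0.2)
# Production probabilities P_P(d | h) over data d in {0, 1}
lik = ((0.9, 0.1),   # h = 0
       (0.3, 0.7))   # h = 1

h, counts = 0, [0, 0]
for _ in range(50_000):
    d = random.choices((0, 1), weights=lik[h])[0]    # teacher produces data
    post = [lik[hp][d] * prior[hp] for hp in H]      # Bayes' rule (unnormalized)
    h = random.choices(H, weights=post)[0]           # learner samples posterior
    counts[h] += 1

freq_h0 = counts[0] / sum(counts)  # converges to the prior P(h = 0) = 0.8
```

Note that the likelihoods influence how fast the chain mixes, but not the stationary frequencies, which match the prior.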
Revealing inductive biases
• Many problems in cognitive science can be
formulated as problems of induction
– learning languages, concepts, and causal relations
• Such problems are not solvable without bias
(e.g., Goodman, 1955; Kearns & Vazirani, 1994; Vapnik, 1995)
• What biases guide human inductive inferences?
If iterated learning converges to the prior, then it
may provide a method for investigating biases
Serial reproduction
(Bartlett, 1932)
• Participants see stimuli, then reproduce them from memory
• Reproductions of one participant are stimuli for the next
• Stimuli were interesting, rather than controlled
– e.g., “War of the Ghosts”
General strategy
• Use well-studied and simple stimuli for which
people’s inductive biases are known
– function learning
– concept learning
– color words
• Examine dynamics of iterated learning
– convergence to state reflecting biases
– predictable path to convergence
Iterated function learning
data
hypotheses
• Each learner sees a set of (x,y) pairs
• Makes predictions of y for new x values
• Predictions are data for the next learner
(Kalish, Griffiths, & Lewandowsky, in press)
Function learning experiments
[Figure: experiment display with stimulus, feedback, and a response slider]
Examine iterated learning with different initial data
[Figure: iterated learning chains for different initial data, iterations 1-9]
Identifying inductive biases
• Formal analysis suggests that iterated learning
provides a way to determine inductive biases
• Experiments with human learners support this idea
– when stimuli for which biases are well understood are
used, those biases are revealed by iterated learning
• What do inductive biases look like in other cases?
– continuous categories
– causal structure
– word learning
Statistics and cultural evolution
• Iterated learning for MAP learners reduces to a
form of the stochastic EM algorithm
– Monte Carlo EM with a single sample
• Provides connections between cultural evolution
and classic models used in population genetics
– MAP learning of multinomials = Wright-Fisher
• More generally, an account of how products of
cultural evolution relate to the biases of learners
Outline
Markov chain Monte Carlo
Sampling from the prior
Sampling from category distributions
Categories are central to cognition
Sampling from categories
[Figure: frog category distribution P(x|c)]
A task
Ask subjects which of two alternatives
comes from a target category
Which animal is a frog?
A Bayesian analysis of the task
Assume each alternative is equally likely a priori to be the target; then the posterior probability that x1 (rather than x2) comes from category c is
p(x1 | c) / (p(x1 | c) + p(x2 | c))
Response probabilities
If people probability match to the posterior, the response probability is equivalent to the Barker acceptance function for target distribution p(x|c)
Collecting the samples
[Trials 1, 2, 3, …: repeated two-alternative “Which is the frog?” judgments]
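A simulated probability-matching "subject" shows why this yields samples from the category: choosing between the current stimulus and a proposal with Barker probabilities is a valid MCMC chain on p(x|c). The Gaussian category, proposal width, and starting point below are all made up for illustration:

```python
import math
import random

random.seed(2)
mu, sigma = 50.0, 5.0  # hypothetical category distribution p(x | c)

def p(x):
    """Unnormalized Gaussian category density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

x, chain = 30.0, []   # start the chain away from the category mean
for _ in range(30_000):
    proposal = x + random.gauss(0.0, 2.0)   # the other alternative on this trial
    # A probability-matching subject picks the proposal with probability
    # p(proposal|c) / (p(proposal|c) + p(x|c)) -- the Barker acceptance rule
    if random.random() < p(proposal) / (p(proposal) + p(x)):
        x = proposal
    chain.append(x)

kept = chain[5_000:]  # discard burn-in
est_mean = sum(kept) / len(kept)
est_sd = (sum((v - est_mean) ** 2 for v in kept) / len(kept)) ** 0.5
```

The chain's mean and spread recover the category's parameters, which is the logic behind estimating subjective category distributions from people's choices.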
Verifying the method
Training
Subjects were shown schematic fish of different sizes
and trained on whether they came from the ocean
(uniform) or a fish farm (Gaussian)
Between-subject conditions
Choice task
Subjects judged which of the two fish came from
the fish farm (Gaussian) distribution
Examples of subject MCMC chains
Estimates from all subjects
• Estimated means and standard deviations are significantly different across groups
• Estimated means are accurate, but standard deviation estimates are too high
– result could be due to perceptual noise or response gain
Sampling from natural categories
Examined distributions for four natural categories:
giraffes, horses, cats, and dogs
Presented stimuli with nine-parameter stick figures
(Olman & Kersten, 2004)
Choice task
Samples from Subject 3
(projected onto plane from LDA)
Mean animals by subject
[Figure: mean giraffe, horse, cat, and dog for each of subjects S1-S8]
Marginal densities
(aggregated across subjects)
Giraffes are distinguished by neck length, body height, and body tilt.
Horses are like giraffes, but with shorter bodies and nearly uniform necks.
Cats have longer tails than dogs.
Relative volume of categories
Convex hull content divided by minimum enclosing hypercube content:
Giraffe: 0.00004
Horse: 0.00006
Cat: 0.00003
Dog: 0.00002
Discrimination method
(Olman & Kersten, 2004)
Parameter space for discrimination
Restricted so that most random draws were animal-like
MCMC and discrimination means
Conclusion
• Markov chain Monte Carlo provides a way to
sample from subjective probability distributions
• Many interesting questions can be framed in
terms of subjective probability distributions
– inductive biases (priors)
– mental representations (category distributions)
• Other MCMC methods may provide further
empirical methods…
– Gibbs for categories, adaptive MCMC, …
A different approach…
Instead of asking whether people are rational, use
assumption of rationality to investigate cognition
If we can predict people’s responses, we can design
experiments that measure psychological variables
Randomized algorithms ↔ Psychological experiments
From sampling to maximizing
PL(h | d) = [PP(d | h) P(h)]^r / Σ_{h′ ∈ H} [PP(d | h′) P(h′)]^r

[Figure: behavior for r = 1, r = 2, r = ∞]
From sampling to maximizing
• General analytic results are hard to obtain
– (r = ∞ is Monte Carlo EM with a single sample)
• For certain classes of languages, it is possible to
show that the stationary distribution gives each
hypothesis h probability proportional to P(h)r
– the ordering identified by the prior is preserved, but
not the corresponding probabilities
(Kirby, Dowman, & Griffiths, in press)
Implications for linguistic universals
• When learners sample from P(h|d), the distribution
over languages converges to the prior
– identifies a one-to-one correspondence between
inductive biases and linguistic universals
• As learners move towards maximizing, the
influence of the prior is exaggerated
– weak biases can produce strong universals
– cultural evolution is a viable alternative to traditional
explanations for linguistic universals
Iterated concept learning
data
hypotheses
• Each learner sees examples from a species
• Identifies species of four amoebae
• Iterated learning is run within-subjects
(Griffiths, Christian, & Kalish, in press)
Two positive examples
data (d)
hypotheses (h)
Bayesian model
(Tenenbaum, 1999; Tenenbaum & Griffiths, 2001)
P(h | d) = P(d | h) P(h) / Σ_{h′ ∈ H} P(d | h′) P(h′)

d: 2 amoebae
h: set of 4 amoebae
m: # of amoebae in the set d (= 2)
|h|: # of amoebae in the set h (= 4)

P(d | h) = 1/|h|^m if d ⊆ h, 0 otherwise

Posterior is renormalized prior:
P(h | d) = P(h) / Σ_{h′ ⊇ d} P(h′)

What is the prior?
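The renormalized-prior property is easy to verify in a toy version of the model. The 6-object domain and uniform prior below are made up for illustration; because |h| is fixed, the size-principle likelihood is constant across consistent hypotheses, so the posterior is just the prior restricted and renormalized:

```python
from itertools import combinations

# Hypothetical mini-domain: 6 amoebae, hypotheses are all 4-amoeba sets
objects = range(6)
hypotheses = [frozenset(c) for c in combinations(objects, 4)]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}  # uniform, for illustration

def posterior(d, prior):
    """Size-principle likelihood: P(d | h) = 1/|h|^m if d is a subset of h."""
    m = len(d)
    scores = {h: prior[h] / len(h) ** m if d <= h else 0.0 for h in prior}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

post = posterior(frozenset({0, 1}), prior)
consistent = [h for h, pr in post.items() if pr > 0]
# With |h| fixed at 4, the posterior spreads the prior evenly over the
# hypotheses consistent with the two observed amoebae
```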
Classes of concepts
(Shepard, Hovland, & Jenkins, 1961)
color
size
shape
Class 1
Class 2
Class 3
Class 4
Class 5
Class 6
Experiment design (for each subject)
• 6 iterated learning chains (one per class, Classes 1-6)
• 6 independent learning “chains” (one per class, Classes 1-6)
Estimating the prior
hypotheses (h)
data (d)
Estimating the prior
Estimated prior probability for each class (human subjects vs. Bayesian model):
Class 1: 0.861
Class 2: 0.087
Class 3: 0.009
Class 4: 0.002
Class 5: 0.013
Class 6: 0.028
r = 0.952
Two positive examples
(n = 20)
[Figure: class probabilities across iterations, human learners vs. Bayesian model]
Three positive examples
data (d)
hypotheses (h)
Three positive examples
(n = 20)
[Figure: class probabilities across iterations, human learners vs. Bayesian model]
Classification objects
Problems with classification objects
[Figure: Category 1 and Category 2]
Problems with classification objects
Convex hull content divided by minimum enclosing hypercube content:
Giraffe: 0.00004
Horse: 0.00006
Cat: 0.00003
Dog: 0.00002
Allowing a Wider Range of Behavior
An exponentiated choice rule results in a Markov chain whose stationary distribution is an exponentiated version of the category distribution, proportional to a power of p(x|c)
Category drift
• For fragile categories, the MCMC procedure could
influence the category representation
• Interleaved training and test blocks in the training
experiments