Transcript X - MIT

Bayesian models of human
learning and inference
(http://web.mit.edu/cocosci/Talks/nips06-tutorial.ppt)
Josh Tenenbaum
MIT
Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)
Thanks to Tom Griffiths, Charles Kemp, Vikash Mansinghka
The probabilistic revolution in AI
• Principled and effective solutions for
inductive inference from ambiguous data:
–
–
–
–
–
Vision
Robotics
Machine learning
Expert systems / reasoning
Natural language processing
• Standard view: no necessary connection to
how the human brain solves these problems.
Probabilistic inference in
human cognition?
• “People aren’t Bayesian”
– Kahneman and Tversky (1970’s-present): “heuristics and
biases” research program. 2002 Nobel Prize in
Economics.
– Slovic, Fischhoff, and Lichtenstein (1976): “It appears
that people lack the correct programs for many important
judgmental tasks.... it may be argued that we have not had
the opportunity to evolve an intellect capable of dealing
conceptually with uncertainty.”
– Stephen Jay Gould (1992): “Our minds are not built (for
whatever reason) to work by the rules of probability.”
The probability of breast cancer is 1% for a woman at 40 who
participates in a routine screening. If a woman has breast cancer,
the probability is 80% that she will have a positive mammography.
If a woman does not have breast cancer, the probability is 9.6%
that she will also have a positive mammography.
A woman in this age group had a positive mammography in a
routine screening. What is the probability that she actually has
breast cancer?
A. greater than 90%
B. between 70% and 90%
C. between 50% and 70%
D. between 30% and 50%
E. between 10% and 30%
F. less than 10%
95 out of 100 doctors
Correct answer
“Base rate
neglect”
Availability biases in
probability judgment
• How likely is that a randomly chosen word
– ends in “g”?
– ends in “ing”?
• When buying a car, how much do you
weigh your friend’s experience relative to
consumer satisfaction surveys?
Probabilistic inference in
human cognition?
• “People aren’t Bayesian”
– Kahneman and Tversky (1970’s-present): “heuristics
and biases” research program. 2002 Nobel Prize in
Economics.
• Psychology is often drawn towards the
mind’s errors and apparent irrationalities.
• But the computationally interesting question
remains: How does mind work so well?
Bayesian models of cognition
Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman,
Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...]
Language acquisition and processing [Brent, de Marken, Niyogi, Klein,
Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …]
Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording,
Kawato, Doya, Todorov, Shadmehr, …]
Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …]
Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …]
Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …]
Categorization and concept learning [Anderson, Nosfosky, Rehder, Navarro,
Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …]
Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …]
Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …]
Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker,
Goodman, Tenenbaum, …]
Learning concepts from examples
• Word learning
“horse”
“horse”
“horse”
Learning concepts from examples
“tufa”
“tufa”
“tufa”
Everyday inductive leaps
How can people learn so much about the
world . . .
–
–
–
–
–
Kinds of objects and their properties
The meanings of words, phrases, and sentences
Cause-effect relations
The beliefs, goals and plans of other people
Social structures, conventions, and rules
. . . from such limited evidence?
Contributions of Bayesian models
• Principled quantitative models of human behavior,
with broad coverage and a minimum of free
parameters and ad hoc assumptions.
• Explain how and why human learning and
reasoning works, in terms of (approximations to)
optimal statistical inference in natural
environments.
• A framework for studying people’s implicit
knowledge about the structure of the world: how it
is structured, used, and acquired.
• A two-way bridge to state-of-the-art AI and
machine learning.
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Algorithm:
Cognitive psychology
• Implementation:
Neurobiology
What about those errors?
• The human mind is not a universal Bayesian engine.
• But, the mind does appear adapted to solve
important real-world inference problems in
approximately Bayesian ways, e.g.
– Predicting everyday events
– Causal learning and reasoning
– Learning concepts from examples
• Like perceptual tasks, adults and even young
children solve these problems mostly unconsciously,
effortlessly, and successfully.
Technical themes
• Inference in probabilistic models
– Role of priors, explaining away.
• Learning in graphical models
– Parameter learning, structure learning.
• Bayesian model averaging
– Being Bayesian over network structures.
• Bayesian Occam’s razor
– Trade off model complexity against data fit.
Technical themes
• Structured probabilistic models
– Grammars, first-order logic, relational schemas.
• Hierarchical Bayesian models
– Acquire abstract knowledge, supports transfer.
• Nonparametric Bayes
– Flexible models that grow in complexity as new
data warrant.
• Tractable approximate inference
– Markov chain Monte Carlo (MCMC),
Sequential Monte Carlo (particle filtering).
Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
Basics of Bayesian inference
P ( d | h) P ( h)
• Bayes’ rule: P(h | d ) 
 P(d | hi ) P(hi )
• An example
hi H
– Data: John is coughing
– Some hypotheses:
1. John has a cold
2. John has lung cancer
3. John has a stomach flu
– Likelihood P(d|h) favors 1 and 2 over 3
– Prior probability P(h) favors 1 and 3 over 2
– Posterior probability P(h|d) favors 1 over 2 and 3
Bayesian inference in perception and
sensorimotor integration
(Weiss, Simoncelli & Adelson 2002)
(Kording & Wolpert 2004)
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)
Spacing effects
in forgetting:
Additive effects
of practice & delay:
Mean # recalled
Log memory strength
Power law of
forgetting:
Log delay
(hours)
Log delay
(seconds)
Retention interval
(days)
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)
For each item in
memory, estimate
the probability that it
will be useful in the
present context.
Use priors based on
the statistics of
natural information
sources.
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)
Spacing effects
in forgetting:
Additive effects
of practice & delay:
Log need odds
Log need odds
Power law of
forgetting:
Log # days since
last occurrence
Log # days since
last occurrence
Log # days since
last occurrence
[New York Times data; c.f. email sources, child-directed speech]
Everyday prediction problems
(Griffiths & Tenenbaum, 2006)
• You read about a movie that has made $60 million
to date. How much money will it make in total?
• You see that something has been baking in the
oven for 34 minutes. How long until it’s ready?
• You meet someone who is 78 years old. How long
will they live?
• Your friend quotes to you from line 17 of his
favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of
the train station. How many cabs in this city?
Making predictions
• You encounter a phenomenon that has
existed for tpast units of time. How long will
it continue into the future? (i.e. what’s ttotal?)
• We could replace “time” with any other
quantity that ranges from 0 to some
unknown upper limit.
Bayesian inference
P(ttotal|tpast)  P(tpast|ttotal) P(ttotal)
posterior
probability
likelihood
prior
Bayesian inference
P(ttotal|tpast)  P(tpast|ttotal) P(ttotal)
posterior
probability
likelihood

1/ttotal
prior
1/ttotal
Assume
“Uninformative”
random
prior
sample
(0 < tpast < ttotal)
(e.g., Jeffreys,
Jaynes)
Bayesian inference
P(ttotal|tpast)  1/ttotal
posterior
probability
Random
sampling
1/ttotal
“Uninformative”
prior
P(ttotal|tpast)
tpast
ttotal
Bayesian inference
P(ttotal|tpast)  1/ttotal
posterior
probability
Random
sampling
1/ttotal
“Uninformative”
prior
P(ttotal|tpast)
tpast
ttotal
Best guess for ttotal: t such that P(ttotal > t|tpast) = 0.5:
Bayesian inference
P(ttotal|tpast)  1/ttotal
posterior
probability
Random
sampling
1/ttotal
“Uninformative”
prior
P(ttotal|tpast)
tpast
ttotal
Yields Gott’s Rule: P(ttotal > t|tpast) = 0.5 when t = 2tpast
i.e., best guess for ttotal = 2tpast .
Evaluating Gott’s Rule
• You read about a movie that has made $78 million
to date. How much money will it make in total?
– “$156 million” seems reasonable.
• You meet someone who is 35 years old. How long
will they live?
– “70 years” seems reasonable.
• Not so simple:
– You meet someone who is 78 years old. How long will
they live?
– You meet someone who is 6 years old. How long will
they live?
The effects of priors
• Different kinds of priors P(ttotal) are
appropriate in different domains.
e.g., wealth,
contacts
[Gott: P(ttotal) ttotal-1 ]
e.g., height,
lifespan
The effects of priors
Evaluating human predictions
• Different domains with different priors:
–
–
–
–
–
–
A movie has made $60 million
Your friend quotes from line 17 of a poem
You meet a 78 year old man
A move has been running for 55 minutes
A U.S. congressman has served for 11 years
A cake has been in the oven for 34 minutes
• Use 5 values of tpast for each.
• People predict ttotal .
You learn that in ancient
Egypt, there was a great
flood in the 11th year of
a pharaoh’s reign. How
long did he reign?
You learn that in ancient
Egypt, there was a great
flood in the 11th year of
a pharaoh’s reign. How
long did he reign?
How long did the typical
pharaoh reign in ancient
egypt?
Summary: prediction
• Predictions about the extent or magnitude of
everyday events follow Bayesian principles.
• Contrast with Bayesian inference in perception,
motor control, memory: no “universal priors” here.
• Predictions depend rationally on priors that are
appropriately calibrated for different domains.
– Form of the prior (e.g., power-law or exponential)
– Specific distribution given that form (parameters)
– Non-parametric distribution when necessary.
• In the absence of concrete experience, priors may
be generated by qualitative background knowledge.
Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
Bayesian networks
Nodes: variables
Links: direct dependencies
Four random variables:
X1
X2
X3
X4
Each node has a conditional
probability distribution
Data: observations of X1, ..., X4
P(x4) X4
P(x1|x3, x4)
coughing
high body temperature
flu
lung cancer
X3 P(x3)
X1
X2 P(x2|x3)
Causal Bayesian networks
Nodes: variables
Links: causal mechanisms
Four random variables:
X1
X2
X3
X4
Each node has a conditional
probability distribution
Data: observations of and
interventions on X1, ..., X4
P(x4) X4
P(x1|x3, x4)
(Pearl; Glymour
& Cooper)
coughing
high body temperature
flu
lung cancer
X3 P(x3)
X1
X2 P(x2|x3)
Inference in causal graphical models
• Explaining away or “discounting” in
social reasoning (Kelley; Morris & Larrick)
B
A
C
• “Screening off” in intuitive causal reasoning
(Waldmann, Rehder & Burnett, Blok & Sloman, Gopnik & Sobel)
B
A
B
C
A
C
P(c|b) vs. P(c|b, a)
P(c|b, not a)
– Better in chains than common-cause structures; common-cause
better if mechanisms clearly independent
• Understanding and predicting the effects of
interventions (Sloman & Lagnado; Gopnik & Schulz)
Learning graphical models
• Structure learning: what causes what?
P(x4) X4
P(x1|x3, x4) X1
X3 P(x3)
X2 P(x2|x3)
• Parameter learning: how do causes work?
P(x4) X4
P(x1|x3, x4) X1
X3 P(x3)
X2 P(x2|x3)
Bayesian learning of causal structure
Data d
Causal hypotheses h
d1  X 1  1, X 2  1, X 3  1, X 4  1
d 2  X 1  1, X 2  0, X 3  0, X 4  1
X3
X4
d 3  X 1  0, X 2  1, X 3  0, X 4  1
X3
X4
X1
X2
X1
X2
d 4  X 1  1, X 2  0, X 3  1, X 4  1
1. What is the most likely
network h given
observed data d ?
2. How likely
is there to be a
link X4
X2 ?
P(d | hi )P(hi )
P(hi | d) 
 P(d | h j )P(h j )
P( X 4  X 2 | d ) 
j
 P( X
h j H
4
 X 2 | h j ) P( h j | d )
(Bayesian model averaging)
Bayesian Occam’s Razor
p(D = d | M )
M1
(MacKay, 2003;
Ghahramani
tutorials)
M2
All possible data sets d
For any model M,

p(D  d | M )  1
all d D
Law of “conservation of belief”: A model that can predict many
possible data sets must assign each of them low probability.
Learning causation from contingencies
C present C absent
(c+)
(c-)
(e+)
a
c
E absent (e-)
b
d
E present
e.g., “Does injecting
this chemical cause
mice to express a
certain gene?”
Subjects judge the extent C to which causes E
(rate on a scale from 0 to 100)
Two models of causal judgment
• Delta-P (Jenkins & Ward, 1965):
P  P(e  | c  )  P(e  | c  )
• Power PC (Cheng, 1997):
P
Powerp 
1  P (e  | c  )
Judging the probability that C
E
(Buehner & Cheng, 1997; 2003)
P
0.00
0.25
0.50
0.75 1.00
People
P
Power
• Independent effects of both P and causal power.
• At P=0, judgments decrease with base rate.
(“frequency illusion”)
Learning causal strength
(parameter learning)
Assume this causal structure:
B
C
w0
w1
E
P and causal power are maximum likelihood
estimates of the strength parameter w1, under
different parameterizations for P(E|B,C):
linear  P, Noisy-OR  causal power
Learning causal structure
(Griffiths & Tenenbaum, 2005)
• Hypotheses:
h1:
B
C
w0
w1
E
h0:
C
B
w0
E
P (d | h1 ) likelihood ratio
• Bayesian causal support: log
(Bayes factor)
P (d | h0 ) gives evidence
P(d | h0 )   P(d | w0 ) p(w0 | h0 ) dw0
1
in favor of h1
0
P(d | h1) 
 
1
1
0
0
P(d | w0 ,w1) p(w0,w1 | h1) dw0 dw1
noisy-OR
(assume uniform parameter priors, but see Yuille et al., Danks et al.)
Buehner and Cheng (1997)
People
P (r = 0.89)
Power (r = 0.88)
Support (r = 0.97)
Implicit background theory
• Injections may or may not cause gene expression,
but gene expression does not cause injections.
– No hypotheses with E
C
• Other naturally occurring processes may also
cause gene expression.
– All hypotheses include an always-present background
cause B
C
• Causes are generative, probabilistically sufficient
and independent, i.e. each cause independently
produces the effect in some proportion of cases.
– Noisy-OR parameterization
Sensitivity analysis
People
Support (Noisy-OR)
2
Support (generic parameterization)
Generativity is essential
P(e+|c+)
P(e+|c-)
100
50
0
8/8
8/8
6/8
6/8
4/8
4/8
2/8
2/8
0/8
0/8
Support
• Predictions result from “ceiling effect”
– ceiling effects only matter if you believe a cause
increases the probability of an effect
Both objects activate
the detector
Blicket detector
Object A does not
activate the detector
by itself
Chi
each
Then
they
maket
(Sobel, Gopnik,Backward
and colleagues)
Blocking Condition
Procedure used in Sobel et al. (2002), Experiment 2
e Condition
See this? It’s a
blicket machine.
Blickets
Blocking
Condition make it go.
s activate
tector
Object A does not
activate the detector
by itself
s activate
tector
Object Aactivates the
detector by itself
Both objects activate
the detector
Children are asked if
each is a blicket
Thenare asked to
they
makethe machine go
Let’s put this one
on the machine.
Children are asked if
each is a blicket
Thenare asked to
they
Object Aactivates the
detector by itself
Oooh, it’s a
blicket!
Chi
each
Then
they
maket
Both objects activate
the detector
Object A does not
activate the detector
by itself
“Backwards blocking”
el et al. (2002), Experiment 2Figure 13: Procedure used in Sobel et al. (2002), Experiment 2
Backward Blocking Condition
One-Cause Condition
Children are a
each is a blicke
Thenare asked t
they
makethe machin
(Sobel, Tenenbaum & Gopnik, 2004)
t A does not
e the detector
by itself
–
–
–
–
Aactivates the
tor by itself
A
B
Children are asked
if objects activate
Both
each is a blicket
the detector
Thenare asked to
they
makethe machine go
AB Trial
Both objects activate
Object A does not
the detector
activate the detector
by itself
A Trial
Object Aactivates the
Children are asked if
detector by itself
each is a blicket
Thenare asked to
they
makethe machine go
Children are a
each is a blicke
Thenare asked t
they
makethe machin
Initially: Nothing on detector – detector silent (A=0, B=0, E=0)
Backward Blocking Condition
Trial 1: A B on detector – detector active (A=1, B=1, E=1)
Trial 2: A on detector – detector active (A=1, B=0, E=1)
4-year-olds judge if each object is a blicket
A B
A: a blicket
(100%
say yes)
? ?
Children are asked
if
Both objects activate
Object Aactivates the
Children are asked if
each is a blicket
each is a blicket
the detector
detector by itself
Thenare asked to not a blicket (34% say yes)
they
B: probably
ThenareE
they
asked to
make
the machine go
makethe machine go
(cf. “explaining away in weight space”, Dayan & Kakade)
Possible hypotheses?
A
B
A
E
A
A
E
B
A
A
A
B
A
A
A
B
A
A
A
B
A
A
A
B
A
A
A
B
A
A
B
E
B
E
B
E
E
B
E
B
E
E
B
E
B
E
E
B
E
B
E
E
B
E
B
E
E
B
E
B
E
E
B
E
A
E
B
A
B
A
B
E
Bayesian causal learning
With a uniform prior on hypotheses,
generic parameterization:
Probability of
being a blicket:
A
B
0.32
0.32
0.34
0.34
A stronger hypothesis space
• Links can only exist from blocks to detectors.
• Blocks are blickets with prior probability q.
• Blickets always activate detectors, detectors never activate on their
own (i.e., deterministic OR parameterization, no hidden causes).
P(h00) = (1 – q)2
A
P(E=1 | A=0, B=0):
P(E=1 | A=1, B=0):
P(E=1 | A=0, B=1):
P(E=1 | A=1, B=1):
B
P(h01) = (1 – q) q
A
B
P(h10) = q(1 – q)
A
B
P(h11) = q2
A
B
E
E
E
E
0
0
0
0
0
0
1
1
0
1
0
1
0
1
1
1
Manipulating prior probability
(Tenenbaum, Sobel, Griffiths, & Gopnik)
Figure 13: Procedure used in Sobel et al. (2002), Experiment 2
One-Cause Condition
Both objects activate
the detector
Object A does not
activate the detector
by itself
Figure 13: Procedure used in Sobel et al. (2002), Experiment 2
ed
et al.
(2002), Experiment
2
et in
al.Sobel
(2002),
Experiment
2
One-Cause Condition
Figure 13: Procedure used in Sobel et al. (2002), Experiment 2
Backward Blocking Condition
One-Cause Condition
Both objects activate
the detector
A does not
A does notObject
activate the detector
he detector
by itself
y itself
Object A does not
Children are asked if
each
is a blicket
activate
the
detector
Both
objects
activate
Object Aactivates the
are asked if
Children are askedChildren
if
Then
they
are
asked
to
by
itself
the
detector
detector by itself
each is a blicket each is a blicket
Both objects activate
Object A does not
make
t
he
machine
go
Thenare asked to the detector
Thenare asked to they
they
activate the detector
makethe machine go
makethe machine go
by itself
Initial
AB Trial
A Trial
Children are ask
each is a blicket
Thenare asked to
they
makethe machine
Children are ask
each
Children are asked ifis a blicket
Thenare asked to
each is a blicketthey
make
Thenare asked to the machine
they
Learning more complex structures
• Tenenbaum et al., Griffiths & Sobel: detectors with
more than two objects and noisy mechanisms
• Steyvers et al., Sobel & Kushnir: active learning
with interventions (c.f. Tong & Koller, Murphy)
• Lagnado & Sloman: learning
from interventions on
continuous dynamical
systems
Inferring hidden causes
Common unobserved cause
4x
2x
2x
Independent unobserved causes
1x
2x
2x
2x
2x
One observed cause
2x
4x
The “stick ball”
machine
(Kushnir, Schulz, Gopnik, & Danks, 2003)
Bayesian learning with unknown
number of hidden variables
(Griffiths
et al 2006)
Common unobserved cause
Independent unobserved causes
One observed cause
a = 0.3
w = 0.8
r = 0.94
Inferring latent causes in
classical conditioning
(Courville, Daw, Gordon, Touretzky 2003)
e.g.,
A noise
X tone
B click
US shock
Training:
A US
A X
B US
Test:
X
X B
Inferring latent causes in
perceptual learning
(Orban, Fiser, Aslin, Lengyel 2006)
Learning to recognize objects and segment scenes:
Inferring latent causes in
sensory integration
(Kording et al. 2006, NIPS 06)
Coincidences
(Griffiths & Tenenbaum, in press)
• The birthday problem
– How many people do you need to have in the
room before the probability exceeds 50% that
two of them have the same birthday?
23.
• The bombing of London
How much of
a coincidence?
P(d | latent )
Bayesian coincidence factor: log
P(d | random)
Chance:
Latent common cause:
C
x
x
x
x
x
x
x
x
x
x
August
Alternative hypotheses:
proximity in date, matching
days of the month, matching
month, ....
How much of
a coincidence?
P(d | latent )
Bayesian coincidence factor: log
P(d | random)
Latent common cause:
Chance:
C
x
x
x
x
x
uniform
x
x
x
x
x
uniform
+
regularity
Summary: causal inference & learning
• Human causal induction can be explained using
core principles of graphical models.
– Bayesian inference (explaining away, screening off)
– Bayesian structure learning (Occam’s razor,
model averaging)
– Active learning with interventions
– Identifying latent causes
Summary: causal inference & learning
• Crucial constraints on hypothesis spaces come
from abstract prior knowledge, or “intuitive
theories”.
– What are the variables?
– How can they be connected?
– How are their effects parameterized?
• Big open questions…
– How can these theories be described formally?
– How can these theories be learned?
Hierarchical Bayesian framework
Abstract
Principles
Structure
Data
(Griffiths, Tenenbaum, Kemp et al.)
A theory for blickets
(c.f. PRMs, BLOG, FOPL)
Learning with a
uniform prior on
network structures:
True network
attributes (1-12)
Sample 75 observations…
patients
observed data
Learning a blockstructured prior on
network structures:
0.0 0.8 0.01
0.0 0.0 0.75
h
z
1 2
3 4
9 10
11 12
0.0 0.0 0.0
(Mansinghka et al. 2006)
True network
attributes (1-12)
Sample 75 observations…
patients
observed data
5 6
7 8
The “blessing of abstraction”
True structure of
graphical model G:
7
8
# of samples: 20
Graph G
Data D
Abstract theory Z
1
2
3
4
5
6
9
10
11
12
13
14
80
15
16
1000
edge
(G)
edge
(G)
Graph G
Data D
class
(z)
Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
“tufa”
“tufa”
“tufa”
Learning from just one or a few examples, and mostly
unlabeled examples (“semi-supervised learning”).
Simple model of concept learning
“This is a blicket.”
“Can you show me the
other blickets?”
Simple model of concept learning
“This is a blicket.”
Other blickets.
Simple model of concept learning
“This is a blicket.”
Other blickets.
Learning from just one positive example is possible if:
– Assume concepts refer to clusters in the world.
– Observe enough unlabeled data to identify clear clusters.
(c.f. Learning with mixture models and EM, Ghahramani &
Jordan, 1994; Nigam et al. 2000)
Concept learning with mixture
models in cognitive science
• Fried & Holyoak (1984)
– Modeled unsupervised and
semi-supervised categorization
as EM in a Gaussian mixture.
• Anderson (1990)
– Modeled unsupervised and semi-supervised
categorization as greedy sequential search in an
infinite (Chinese restaurant process) mixture.
Infinite (CRP) mixture models
• Construct from k-component mixtures by integrating
out mixing weights, collapsing equivalent partitions,
and taking the limit as k  .
• Does not require that we commit to a fixed – or even
finite – number of classes.
• Effective number of classes can grow with number
of data points, balancing complexity with data fit.
• Computationally much simpler than applying
Bayesian Occam’s razor or cross-validation.
• Easy to learn with standard Monte Carlo
approximations (MCMC, particle filtering),
hopefully avoiding local minima.
High school lunch room analogy
Sampling from
the CRP:
“punks”
“preppies”
“jocks”
“nerds”
Gibbs sampler (Neal):
Assign to
larger groups
Group with
similar objects
“punks”
“preppies”
“jocks”
“nerds”
A typical cognitive experiment
F1
F2
F3
F4
Label
Training stimuli:
1
1
0
0
0
1
1
0
1
0
1
0
1
1
0
0
0
1
1
0
1
0
0
1
1
1
1
0
0
0
Test stimuli:
0
1
1
1
0
0
1
1
1
0
0
0
1
0
1
0
1
0
1
1
0
0
0
1
?
?
?
?
?
?
Anderson (1990), “Rational
model of categorization”:
Greedy sequential search
in an infinite mixture
model.
Sanborn, Griffiths, Navarro
(2006), “More rational
model of categorization”:
Particle filter with a
small # of particles
Towards more natural concepts
CrossCat: Discovering multiple structures
that capture different subsets of features
(Shafto, Kemp, Mansinghka, Gordon & Tenenbaum, 2006)
Infinite relational models
(Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
concept
(c.f. Xu,
Tresp, et al.
SRL 06)
predicate
concept
Biomedical predicate data from UMLS (McCrae et al.):
– 134 concepts: enzyme, hormone, organ, disease, cell function ...
– 49 predicates: affects(hormone, organ), complicates(enzyme, cell
function), treats(drug, disease), diagnoses(procedure, disease) …
Infinite relational models
(Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
e.g., Diseases affect
Organisms
Chemicals interact
with Chemicals
Chemicals cause
Diseases
Learning from very few examples
“tufa”
• Word learning
“tufa”
“tufa”
• Property induction
Cows have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
Cows have T9 hormones.
Sheep have T9 hormones.
Goats have T9 hormones.
All mammals have T9 hormones.
All mammals have T9 hormones.
The computational problem
(c.f., semi-supervised learning)
?
Horse
Cow
Chimp
Gorilla
Mouse
Squirrel
Dolphin
Seal
Rhino
Elephant
?
?
?
?
?
?
?
?
Features
(85 features from Osherson et al., e.g., for Elephant: ‘gray’, ‘hairless’,
‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’,
‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘quadrapedal’,…)
New property
Many sources of priors
Chimps have T9 hormones.
Gorillas have T9 hormones.
Poodles can bite through wire.
Taxonomic
similarity
Jaw strength
Dobermans can bite through wire.
Salmon carry E. Spirus bacteria.
Grizzly bears carry E. Spirus bacteria.
Food web
relations
Hierarchical Bayesian Framework
(Kemp & Tenenbaum)
P(form)
F: form
P(structure | form)
Tree
mouse
squirrel
S: structure
chimp
Has T9
hormones
gorilla
F1
F2
F3
F4
P(data | structure)
D: data
mouse
squirrel
chimp
gorilla
…
?
?
?
P(D|S): How the structure constrains the
data of experience
• Define a stochastic process over structure S that
generates hypotheses h.
– For generic properties, prior should favor hypotheses that
vary smoothly over structure.
– Many properties of biological species were actually
generated by such a process (i.e., mutation + selection).
Smooth: P(h) high
Not smooth: P(h) low
P(D|S): How the structure constrains the
data of experience
S
Gaussian Process
(~ random walk,
diffusion)
~
  1 ( S )
[Zhu, Ghahramani
& Lafferty 2003]
y
Threshold
h
Structure S
Data D
Species 1
Species 2
Species 3
Species 4
Species 5
Species 6
Species 7
Species 8
Species 9
Species 10
Features
(85 features from Osherson et al., e.g., for Elephant: ‘gray’, ‘hairless’,
‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’,
‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘quadrapedal’,…)
Structure S
Data D
Species 1
Species 2
Species 3
Species 4
Species 5
Species 6
Species 7
Species 8
Species 9
Species 10
?
?
?
?
?
?
?
?
Features
(85 features from Osherson et al., e.g., for Elephant: ‘gray’, ‘hairless’,
‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’,
‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘quadrapedal’,…)
New property
Cows have property P.
Elephants have property P.
2D
Tree
Horses have property P.
Gorillas have property P.
Mice have property P.
Seals have property P.
All mammals have property P.
Reasoning about spatially
varying properties
“Native American artifacts” task
Property type
“has T9 hormones”
“can bite through wire”
Theory Structure
taxonomic tree
+ diffusion process
directed chain
+ drift process
“carry E. Spirus bacteria”
directed network
+ noisy transmission
Class D
Class A
Class B
Class D
Class A
Class A
Class F
Class C
Class D
Class E
Class C
Class C
Class E
Class B
Class G
Class E
Class B
Class F
Class G
Hypotheses
Class A
Class B
Class C
Class D
Class E
Class F
Class G
Class F
Class G
...
...
...
Herring
Tuna
Mako shark
Sand shark
Dolphin
Human
Kelp
Sand shark
Kelp
Herring
Tuna
Dolphin
Mako shark
Human
Hierarchical Bayesian Framework
F: form
S: structure
Chain
Tree
chimp
mouse
gorilla
squirrel
squirrel
chimp
Space
mouse
squirrel
gorilla
chimp
gorilla
F1
F2
F3
F4
mouse
D: data
mouse
squirrel
chimp
gorilla
Discovering structural forms
Ostrich Robin Crocodile Snake Turtle Bat
Orangutan
Discovering structural forms
“Great chain of being”
Linnaeus
Ostrich Robin Crocodile Snake Turtle Bat
Orangutan
People can discover structural forms
• Scientists
– Tree structure for living kinds (Linnaeus)
– Periodic structure for chemical elements
(Mendeleev)
• Children
–
–
–
–
Hierarchical structure of category labels
Clique structure of social groups
Cyclical structure of seasons or days of the week
Transitive structure for value
Typical structure learning algorithms
assume a fixed structural form
Flat Clusters
Line
Circle
K-Means
Mixture models
Competitive learning
Guttman scaling
Ideal point models
Circumplex models
Tree
Grid
Euclidean Space
Hierarchical clustering
Bayesian phylogenetics
Self-Organizing Map
Generative topographic
mapping
MDS
PCA
Factor Analysis
Hierarchical Bayesian Framework
F: form
Favors simplicity
mouse
squirrel
S: structure
Favors smoothness
chimp
gorilla
D: data
F1
F2
F3
F4
[Zhu et al., 2003]
mouse
squirrel
chimp
gorilla
Structural forms as graph grammars
Form
Process
Form
Process
Development of structural forms
as more data are observed
Beyond “Nativism” versus “Empiricism”
• “Nativism”: Explicit knowledge of structural
forms for core domains is innate.
– Atran (1998): The tendency to group living kinds into
hierarchies reflects an “innately determined cognitive
structure”.
– Chomsky (1980): “The belief that various systems of mind are
organized along quite different principles leads to the natural
conclusion that these systems are intrinsically determined, not
simply the result of common mechanisms of learning or
growth.”
• “Empiricism”: General-purpose learning systems
without explicit knowledge of structural form.
– Connectionist networks (e.g., Rogers and McClelland, 2004).
– Traditional structure learning in probabilistic graphical models.
Summary: concept learning
– How does abstract domain
knowledge guide learning of
new concepts?
– How can this knowledge be
represented, and how might it
be learned?
F: form
mouse
squirrel
S: structure
chimp
gorilla
F1
F2
F3
F4
Models based on
Bayesian inference over
hierarchies of structured
representations.
D: data
mouse
squirrel
chimp
gorilla
– How can probabilistic inference work together with
flexibly structured representations to model complex,
real-world learning and reasoning?
Contributions of Bayesian models
• Principled quantitative models of human behavior,
with broad coverage and a minimum of free
parameters and ad hoc assumptions.
• Explain how and why human learning and
reasoning works, in terms of (approximations to)
optimal statistical inference in natural
environments.
• A framework for studying people’s implicit
knowledge about the structure of the world: how it
is structured, used, and acquired.
• A two-way bridge to state-of-the-art AI and
machine learning.
Looking forward
• What we need to understand: the mind’s ability to build
rich models of the world from sparse data.
–
–
–
–
–
Learning about objects, categories, and their properties.
Causal inference
Language comprehension and production
Scene understanding
Understanding other people’s actions, plans, thoughts, goals
• What do we need to understand these abilities?
–
–
–
–
Bayesian inference in probabilistic generative models
Hierarchical models, with inference at all levels of abstraction
Structured representations: graphs, grammars, logic
Flexible representations, growing in response to observed data
Learning word meanings
Abstract
Principles
Whole-object principle
Shape bias
Taxonomic principle
Contrast principle
Basic-level bias
Structure
Data
(Tenenbaum & Xu)
Causal learning and reasoning
Abstract
Principles
Structure
Data
(Griffiths, Tenenbaum, Kemp et al.)
“Universal Grammar”
Hierarchical phrase structure
grammars (e.g., CFG, HPSG, TAG)
P(grammar | UG)
Grammar
P(phrase structure | grammar)
Phrase structure
P(utterance | phrase structure)
Utterance
P(speech | utterance)
Speech signal
(c.f. Chater and Manning, 2006)
S  NP VP
NP  Det [ Adj ] Noun [ RelClause ]
RelClause  [ Rel ] NP V
VP  VP NP
VP  Verb
Vision as probabilistic parsing
(Han & Zhu, 2006; c.f.,
Zhu, Yuanhao
& Yuille NIPS 06 )
Goal-directed action
(production and comprehension)
(Wolpert et al., 2003)
Open directions and challenges
• Effective methods for learning structured knowledge
– How to balance expressiveness/learnability tradeoff?
• More precise relation to psychological processes
– To what extent do mental processes implement boundedly
rational methods of approximate inference?
• Relation to neural computation
– How to implement structured representations in brains?
• Modeling individual subjects and single trials
– Is there a rational basis for probability matching?
• Understanding failure cases
– Are these simply “not Bayesian”, or are people using a
different model? How do we avoid circularity?
Want to learn more?
• Special issue of Trends in Cognitive
Sciences (TiCS), July 2006 (Vol. 10, no. 7),
on “Probabilistic models of cognition”.
• Tom Griffiths’ reading list, a/k/a
http://bayesiancognition.com
• Summer school on probabilistic models of
cognition, July 2007, Institute for Pure and
Applied Mathematics (IPAM) at UCLA.
Extra slides
Bayesian prediction
P(ttotal|tpast) 
posterior
probability
1/ttotal
Random
sampling
P(tpast)
Domain-dependent
prior
What is the best guess for ttotal?
Compute t such that P(ttotal > t|tpast) = 0.5:
P(ttotal|tpast)
ttotal
We compared the median
of the Bayesian posterior
with the median of subjects’
judgments… but what about
the distribution of subjects’
judgments?
Sources of individual differences
• Individuals’ judgments could by noisy.
• Individuals’ judgments could be optimal,
but with different priors.
– e.g., each individual has seen only a sparse
sample of the relevant population of events.
• Individuals’ inferences about the posterior
could be optimal, but their judgments could
be based on probability (or utility) matching
rather than maximizing.
Proportion of judgments below predicted value
Individual differences in prediction
P(ttotal|tpast)
ttotal
Quantile of Bayesian posterior distribution
Individual differences in prediction
P(ttotal|tpast)
ttotal
Average over all
prediction tasks:
•
•
•
•
•
•
movie run times
movie grosses
poem lengths
life spans
terms in congress
cake baking times
Individual differences in concept learning
Why probability matching?
• Optimal behavior under some
(evolutionarily natural) circumstances.
–
–
–
–
Optimal betting theory, portfolio theory
Optimal foraging theory
Competitive games
Dynamic tasks (changing probabilities or utilities)
• Side-effect of algorithms for approximating
complex Bayesian computations.
– Markov chain Monte Carlo (MCMC): instead of integrating over
complex hypothesis spaces, construct a sample of high-probability
hypotheses.
– Judgments from individual (independent) samples can on average
be almost as good as using the full posterior distribution.
Markov chain Monte Carlo
(Metropolis-Hastings algorithm)
The puzzle of coincidences
Discoveries of hidden causal structure are
often driven by noticing coincidences. . .
• Science
– Halley’s comet (1705)
(Halley, 1705)
(Halley, 1705)
The puzzle of coincidences
Discoveries of hidden causal structure are
often driven by noticing coincidences. . .
• Science
– Halley’s comet (1705)
– John Snow and the cause of cholera (1854)
Rational analysis of cognition
• Often can show that apparently irrational behavior
is actually rational.
Which cards do you have to turn over to test this rule?
“If there is an A on one side, then there is a 2 on the other side”
Rational analysis of cognition
• Often can show that apparently irrational behavior
is actually rational.
• Oaksford & Chater’s rational analysis:
– Optimal data selection based
on maximizing expected
information gain.
– Test the rule “If p, then q”
against the null hypothesis
that p and q are independent.
– Assuming p and q are rare
predicts people’s choices:
Integrating multiple forms of reasoning
(Kemp, Shafto, Berke & Tenenbaum NIPS 06)
2) Causal relations
between features
1) Taxonomic
relations
between
categories
T9 hormones cause elevated heart rates.
Elevated heart rates cause faster metabolisms.
Mice have T9 hormones.
…?
… Parameters of causal
relations vary smoothly
over the category hierarchy.
Integrating multiple forms of reasoning
Infinite relational models
(Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)
concept
(c.f. Xu,
Tresp, et al.
SRL 06)
predicate
concept
Biomedical predicate data from UMLS (McCrae et al.):
– 134 concepts: enzyme, hormone, organ, disease, cell function ...
– 49 predicates: affects(hormone, organ), complicates(enzyme, cell
function), treats(drug, disease), diagnoses(procedure, disease) …
Learning relational theories
e.g., Diseases affect
Organisms
Chemicals interact
with Chemicals
Chemicals cause
Diseases
Learning annotated hierarchies
from relational data
(Roy, Kemp, Mansinghka, Tenenbaum NIPS 06)
Learning abstract relational structures
Dominance
hierarchy
Primate troop
“x beats y”
Tree
Bush administration
“x told y”
Cliques
Prison inmates
“x likes y”
Ring
Kula islands
“x trades with y”
Bayesian inference in
neural networks
(Rao, in press)
The big problem of intelligence
• The development of intuitive theories in
childhood.
– Psychology: How do we learn to understand
others’ actions in terms of beliefs, desires,
plans, intentions, values, morals?
– Biology: How do we learn that people, dogs,
bees, worms, trees, flowers, grass, coral, moss
are alive, but chairs, cars, tricycles, computers,
the sun, Roomba, robots, clocks, rocks are not?
The big problem of intelligence
• Common sense reasoning.
Consider a man named Boris.
– Is the mother of Boris’s father his grandmother?
– Is the mother of Boris’s sister his mother?
– Is the son of Boris’s sister his son?
(Note: Boris and his family were
stranded on a desert island when he
was a young boy.)