Bayesian Models of Human
Learning and Inference
Josh Tenenbaum
MIT
Department of Brain and
Cognitive Sciences
Shiffrin Says
“Progress in science is driven by new tools,
not great insights.”
Outline
• Part I. Brief survey of Bayesian modeling
in cognitive science.
• Part II. Bayesian models of everyday
inductive leaps.
Collaborators
Tom Griffiths
Charles Kemp
Tevye Krynski
Sourabh Niyogi
Neville Sanjana
Mark Steyvers
Sean Stromsten
Fei Xu
Wheeler Ruml
Dave Sobel
Alison Gopnik
Outline
• Part I. Brief survey of Bayesian modeling
in cognitive science.
– Rational benchmark for descriptive models of
probability judgment.
– Rational analysis of cognition
– Rational tools for fitting cognitive models
Normative benchmark for
descriptive models
• How does human probability judgment
compare to the Bayesian ideal?
– Peterson & Beach, Edwards, Tversky &
Kahneman, . . . .
• Explicit probability judgment tasks
– Drawing balls from an urn, rolling dice,
medical diagnosis, . . . .
• Alternative descriptive models
– Heuristics and Biases, Support Theory, . . . .
Rational analysis of cognition
• Develop Bayesian models for core aspects of
cognition not traditionally thought of in
terms of statistical inference.
• Examples:
– Memory retrieval: Anderson; Shiffrin et al, . . . .
– Reasoning with rules: Oaksford & Chater, . . . .
Rational analysis of cognition
• Often can explain a wider range of phenomena
than previous models, with fewer free parameters.
[Figures: power laws of practice and retention; spacing effects on retention.]
Rational analysis of cognition
• Often can explain a wider range of phenomena
than previous models, with fewer free parameters.
• Anderson’s rational analysis of memory:
– For each item in memory,
estimate the probability that
it will be useful in the present
context.
– Model of need probability
inspired by library book
access. Corresponds to
statistics of natural
information sources:
Rational analysis of cognition
[Figure: Anderson's memory model — log need odds vs. log days since last occurrence, for short and long lags.]
Rational analysis of cognition
• Often can show that apparently irrational behavior
is actually rational.
Which cards do you have to turn over to test this rule?
“If there is an A on one side, then there is a 2 on the other side”
Rational analysis of cognition
• Often can show that apparently irrational behavior
is actually rational.
• Oaksford & Chater’s rational analysis:
– Optimal data selection based
on maximizing expected
information gain.
– Test the rule “If p, then q”
against the null hypothesis
that p and q are independent.
– Assuming p and q are rare
predicts people’s choices:
Rational tools for fitting
cognitive models
• Use Bayesian Occam’s Razor to solve the
problem of model selection: trade off fit to
the data with model complexity.
• Examples:
– Comparing alternative cognitive models:
Myung, Pitt, . . . .
– Fitting nested families of models of mental
representation: Lee, Navarro, . . . .
Rational tools for fitting
cognitive models
• Comparing alternative cognitive models via an
MDL approximation to the Bayesian Occam’s
Razor takes into account the functional form of a
model as well as the number of free parameters.
Rational tools for fitting
cognitive models
• Fit models of mental representation to similarity
data, e.g. additive clustering, additive trees,
common and distinctive feature models.
• Want to choose the
complexity of the model
(number of features, depth
of tree) in a principled way,
and search efficiently
through the space of nested
models. Using Bayesian
Occam’s Razor:
Outline
• Part I. Brief survey of Bayesian modeling
in cognitive science.
• Part II. Bayesian models of everyday
inductive leaps.
Rational models of cognition where
Bayesian model selection, Bayesian
Occam’s Razor play central
explanatory role.
Everyday inductive leaps
How can we learn so much about . . .
– Properties of natural kinds
– Meanings of words
– Future outcomes of a dynamic process
– Hidden causal properties of an object
– Causes of a person's action (beliefs, goals)
– Causal laws governing a domain
. . . from such limited data?
Learning concepts and words
Learning concepts and words
“tufa”
“tufa”
“tufa”
Can you pick out the tufas?
Inductive reasoning
Input:
Cows can get Hick’s disease.
Gorillas can get Hick’s disease.
(premises)
All mammals can get Hick’s disease.
(conclusion)
Task: Judge how likely conclusion is to be
true, given that premises are true.
Inferring causal relations
Input:

         Took vitamin B23   Headache
Day 1    yes                no
Day 2    yes                yes
Day 3    no                 yes
Day 4    yes                no
...      ...                ...
Does vitamin B23 cause headaches?
Task: Judge probability of a causal link
given several joint observations.
The Challenge
• How do we generalize successfully from very
limited data?
– Just one or a few examples
– Often only positive examples
• Philosophy:
– Induction is a “problem”, a “riddle”, a “paradox”,
a “scandal”, or a “myth”.
• Machine learning and statistics:
– Focus on generalization from many examples,
both positive and negative.
Rational statistical inference
(Bayes, Laplace)
Posterior probability ∝ Likelihood × Prior probability:

  p(h | d) = p(d | h) p(h) / Σ_{h′∈H} p(d | h′) p(h′)
History of Bayesian Approaches
to Human Inductive Learning
• Hunt
• Suppes
– “Observable changes of hypotheses under positive
reinforcement”, Science (1965), w/ M. Schlag-Rey.
“A tentative interpretation is that, when the set of hypotheses is
large, the subject ‘samples’ or attends to several hypotheses
simultaneously. . . . It is also conceivable that a subject might
sample spontaneously, at any time, or under stimulations other
than those planned by the experimenter. A more detailed
exploration of these ideas, including a test of Bayesian
approaches to information processing, is now being made.”
History of Bayesian Approaches
to Human Inductive Learning
• Hunt
• Suppes
• Shepard
– Analysis of one-shot stimulus generalization, to
explain the universal exponential law.
• Anderson
– Rational analysis of categorization.
Theory-Based Bayesian Models
• Explain the success of everyday inductive
leaps based on rational statistical inference
mechanisms constrained by domain theories
well-matched to the structure of the world.
• Rational statistical inference (Bayes):

  p(h | d) = p(d | h) p(h) / Σ_{h′∈H} p(d | h′) p(h′)
• Domain theories generate the necessary
ingredients: hypothesis space H, priors p(h).
Questions about theories
• What is a theory?
– Working definition: an ontology and a system of
abstract (causal) principles that generates a
hypothesis space of candidate world structures
(e.g., Newton’s laws).
• How is a theory used to learn about the
structure of the world?
• How is a theory acquired?
– Probabilistic generative model → statistical learning.
Alternative approaches to
inductive generalization
• Associative learning
• Connectionist networks
• Similarity to examples
• Toolkit of simple heuristics
• Constraint satisfaction
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Representation and algorithm:
Cognitive psychology
• Implementation:
Neurobiology
Descriptive Goals
• Principled mathematical models, with a minimum of
arbitrary assumptions.
• Close quantitative fits to behavioral data.
• Unified models of cognition across domains.
Explanatory Goals
• How do we reliably acquire knowledge about the
structure of the world, from such limited experience?
• Which processing models work, and why?
• New views on classic questions in cognitive science:
– Symbols (rules, logic, hierarchies, relations) versus Statistics.
– Theory-based inference versus Similarity-based inference.
– Domain-specific knowledge versus Domain-general
mechanisms.
• Provides a route to studying people’s hidden (implicit
or unconscious) knowledge about the world.
The plan
• Basic causal learning
• Inferring number concepts
• Reasoning with biological properties
• Acquisition of domain theories
– Intuitive biology: Taxonomic structure
– Intuitive physics: Causal law
Learning a single causal relation
Given a random sample of mice:

                    Injected with X   Not injected with X
Expressed Y               45                  30
Did not express Y         15                  30
• “To what extent does chemical X cause gene Y
to be expressed?”
• Or, “What is the probability that X causes Y?”
Associative models of causal
strength judgment
                        c+ (injected with X)   c− (not injected with X)
e+ (expressed Y)                 a                       c
e− (did not express Y)           b                       d

• Delta-P (or asymptotic Rescorla-Wagner):

  ΔP = P(e+ | c+) − P(e+ | c−) = a/(a+b) − c/(c+d)

• Power PC (Cheng, 1997):

  power = ΔP / (1 − P(e+ | c−)) = ΔP / (d/(c+d))
Some behavioral data (Buehner & Cheng, 1997)
[Figure: people's causal judgments vs. the predictions of ΔP and Power PC.]
• Independent effects of both causal power and ΔP.
• Neither theory explains the trend at ΔP = 0.
Bayesian causal inference
• Hypotheses:
  h1: B → E (strength w0) and C → E (strength w1)
  h0: B → E (strength w0) only
  w0, w1: strength parameters for B, C
Bayesian causal inference
• Hypotheses:
  h1: B → E (strength w0) and C → E (strength w1)
  h0: B → E (strength w0) only
  (Background cause B is unobserved and always present: B = 1.)
  w0, w1: strength parameters for B, C
• Probabilistic model: "noisy-OR"

  C  B   h1: P(E=1 | C, B; w1, w0)   h0: P(E=1 | C, B; w0)
  0  0   0                            0
  1  0   w1                           0
  0  1   w0                           w0
  1  1   w1 + w0 − w1·w0              w0
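The noisy-OR table above can equivalently be written as a closed-form expression; a minimal sketch (the particular strength values are made up):

```python
# Noisy-OR: each present cause independently produces E, so
# P(E=1 | C, B; w1, w0) = 1 - (1 - w1)**C * (1 - w0)**B.

def noisy_or(c, b, w1, w0):
    """c, b in {0, 1} indicate whether each cause is present."""
    return 1 - (1 - w1) ** c * (1 - w0) ** b

w0, w1 = 0.5, 0.8   # illustrative strengths
table = {(c, b): noisy_or(c, b, w1, w0) for c in (0, 1) for b in (0, 1)}
# table[(1, 1)] equals w1 + w0 - w1*w0, matching the last row of the table.
```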
Inferring structure versus
estimating strength
• Hypotheses:
  h1: B → E (w0) and C → E (w1)
  h0: B → E (w0) only
• Both causal power and ΔP correspond to maximum likelihood estimates of the strength parameter w1, under different parameterizations for p(E | B, C):
  – linear → ΔP;  noisy-OR → causal power
• Causal support model: people are judging the
probability that a causal link exists, rather than
assuming it exists and estimating its strength.
Role of domain theory
(c.f. PRMs, ILP, Knowledge-based model construction)
Generates hypothesis space of causal graphical
models:
• Causally relevant attributes of objects:
– Constrains random variables (nodes).
• Causally relevant relations between attributes:
– Constrains dependence structure of variables (arcs).
• Causal mechanisms – how effects depend
functionally on their causes:
– Constrains local probability distribution for each variable
conditioned on its direct causes (parents).
Role of domain theory
• Injections may or may not cause gene expression,
but gene expression does not cause injections.
– No hypotheses with an E → C link.
• Other naturally occurring processes may also
cause gene expression.
– All hypotheses include an always-present background cause B → E.
• Causes are probabilistically sufficient and
independent (Cheng): Each cause independently
produces the effect in some proportion of cases.
– “Noisy-OR” causal mechanism
• Hypotheses:
  h1: B → E (w0) and C → E (w1)
  h0: B → E (w0) only
• Bayesian causal inference:

  P(h1 | data) = P(data | h1) P(h1) / [ P(data | h1) P(h1) + P(data | h0) P(h0) ]

  P(data | h0) = ∫₀¹ P(data | w0) p(w0 | h0) dw0

  P(data | h1) = ∫₀¹ ∫₀¹ P(data | w0, w1) p(w0, w1 | h1) dw0 dw1   (noisy-OR)

Assume all priors uniform....
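With uniform priors, the two marginal likelihoods can be approximated on a grid. A rough numerical sketch for the mouse data (45/60 expressed Y with the chemical, 30/60 without); the grid resolution is an illustrative choice:

```python
# Causal support: posterior probability of h1 (a C -> E link exists),
# marginalizing strengths over uniform priors by midpoint-grid integration.

def lik(k1, n1, k0, n0, w1, w0):
    """P(data | w1, w0) under noisy-OR, with B always present (B = 1)."""
    p1 = w1 + w0 - w1 * w0   # P(E=1 | C=1, B=1)
    p0 = w0                  # P(E=1 | C=0, B=1)
    return (p1 ** k1 * (1 - p1) ** (n1 - k1) *
            p0 ** k0 * (1 - p0) ** (n0 - k0))

grid = [(i + 0.5) / 50 for i in range(50)]   # midpoints on (0, 1)

m1 = sum(lik(45, 60, 30, 60, w1, w0) for w1 in grid for w0 in grid) / 50 ** 2
m0 = sum(lik(45, 60, 30, 60, 0.0, w0) for w0 in grid) / 50

support = m1 / (m1 + m0)   # with P(h1) = P(h0) = 1/2
# The clear contingency (0.75 vs. 0.50) pushes support well above 1/2.
```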
Bayesian Occam’s Razor
[Figure: P(data | model) over all possible data sets (ordered by increasing ΔP), for h0 (B → E only) and for h1 (B, C → E) with low vs. high w1.]
Bayesian Occam’s Razor
[Figure: the same curves, with the data P(e+ | c+) = 80/100, P(e+ | c−) = 20/100 marked.]
Bayesian Occam’s Razor
[Figure: the same curves, with the data P(e+ | c+) = 80/100, P(e+ | c−) = 77/100 marked.]
[Figure: Buehner & Cheng (1997) data — people vs. ΔP, Power PC, and Bayes.]
Sensitivity analysis
• How much work does the domain theory do?
  – Alternative model: Bayes with arbitrary P(E | B, C) ("Bayes without noisy-OR theory").
• How much work does Bayes do?
  – Alternative model: χ² measure of independence.
[Figure: people vs. ΔP, Power PC (MLE w/ noisy-OR), Bayes w/ noisy-OR theory, Bayes without noisy-OR theory, and χ².]
Varying number of observations
[Figure: people vs. Bayes, for n = 8 and n = 60 observations.]
Data for inhibitory causes
[Figure: people vs. ΔP, Power PC (MLE w/ noisy-AND-NOT), and Bayes w/ noisy-AND-NOT.]
Causal inference with rates
[Figure: judgments from rates Rate(e | c+) vs. Rate(e | c−) — people vs. ΔR, Power PC (N = 150), and Bayes with a Poisson parameterization.]
Causal induction: summary
• People’s judgments closely reflect optimal
Bayesian model selection, constrained by a
minimal domain theory.
• Beyond elemental causal induction:
– More complex inferences, with causal
networks, hidden variables, active learning.
– Stronger inferences, with richer prior
knowledge.
– Discovery of causal domain theories.
The plan
• Basic causal learning
• Inferring number concepts
• Reasoning with biological properties
• Acquisition of domain theories
– Intuitive biology: Taxonomic structure
– Intuitive physics: Causal law
The number game
• Program input: number between 1 and 100
• Program output: “yes” or “no”
The number game
• Learning task:
– Observe one or more positive (“yes”) examples.
– Judge whether other numbers are “yes” or “no”.
The number game
Examples of
“yes” numbers
Generalization
judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule:
“multiples of 10”
60 52 57 55
Focused similarity:
numbers near 50-60
The number game
Examples of
“yes” numbers
Generalization
judgments (N = 20)
16
Diffuse similarity
16 8 2 64
Rule:
“powers of 2”
16 23 19 20
Focused similarity:
numbers near 20
The number game
60
Diffuse similarity
60 80 10 30
Rule:
“multiples of 10”
60 52 57 55
Focused similarity:
numbers near 50-60
Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
– Learning from just a few positive examples.
Rule/similarity hybrid models
• Category learning
– Nosofsky, Palmeri et al.: RULEX
– Erickson & Kruschke: ATRIUM
Divisions into “rule” and
“similarity” subsystems
• Category learning
– Nosofsky, Palmeri et al.: RULEX
– Erickson & Kruschke: ATRIUM
• Language processing
– Pinker, Marcus et al.: Past tense morphology
• Reasoning
– Sloman
– Rips
– Nisbett, Smith et al.
Rule/similarity hybrid models
• Why two modules?
• Why do these modules work the way that they do,
and interact as they do?
• How do people infer a rule or similarity metric
from just a few positive examples?
Bayesian model
• H: Hypothesis space of possible concepts.
  – h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} ("even numbers")
  – h2 = {10, 20, 30, 40, …, 90, 100} ("multiples of 10")
  – h3 = {2, 4, 8, 16, 32, 64} ("powers of 2")
  – h4 = {50, 51, 52, …, 59, 60} ("numbers between 50 and 60")
  – ...
• Representational interpretations for H:
  – Candidate rules
  – Features for similarity
  – "Consequential subsets" (Shepard, 1987)
Three hypothesis subspaces for
number concepts
• Mathematical properties (24 hypotheses):
– Odd, even, square, cube, prime numbers
– Multiples of small integers
– Powers of small integers
• Raw magnitude (5050 hypotheses):
– All intervals of integers with endpoints between
1 and 100.
• Approximate magnitude (10 hypotheses):
– Decades (1-10, 10-20, 20-30, …)
Hypothesis spaces and theories
• Why a hypothesis space is like a domain
theory:
– Represents one particular way of classifying
entities in a domain.
– Not just an arbitrary collection of hypotheses,
but a principled system.
• What’s missing?
– Explicit representation of the principles.
– [Causality.]
• Hypothesis space is generated by theory.
Bayesian model
• H: Hypothesis space of possible concepts.
– Mathematical properties: even, odd, square, prime, . . . .
– Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
– Raw magnitude: all intervals between 1 and 100.
• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:

  p(h | X) = p(X | h) p(h) / Σ_{h′∈H} p(X | h′) p(h′)

  – p(h) ["prior"]: domain knowledge, pre-existing biases.
  – p(X | h) ["likelihood"]: statistical information in examples.
  – p(h | X) ["posterior"]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
likelihood, and exponentially more so as n increases.
  p(X | h) = (1 / size(h))^n   if x1, …, xn ∈ h
           = 0                 if any xi ∉ h

• Follows from the assumption of randomly sampled examples.
• Captures the intuition of a representative sample.
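The size principle can be sketched directly; a minimal example using two hypotheses from the number game:

```python
# Size-principle likelihood: p(X | h) = (1/|h|)^n if all of X lies in h,
# else 0, assuming examples are drawn uniformly at random from the concept.

def likelihood(X, h):
    if not all(x in h for x in X):
        return 0.0
    return (1.0 / len(h)) ** len(X)

even = set(range(2, 101, 2))       # "even numbers", size 50
mult10 = set(range(10, 101, 10))   # "multiples of 10", size 10

# One example favors the smaller hypothesis by a factor of 5;
# four examples favor it by 5**4 = 625.
r1 = likelihood([60], mult10) / likelihood([60], even)
r4 = likelihood([60, 80, 10, 30], mult10) / likelihood([60, 80, 10, 30], even)
```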
Illustrating the size principle
[Figure: grid of the even numbers 2–100, with hypotheses h1 (even numbers, size 50) and h2 (multiples of 10, size 10) outlined.]
Illustrating the size principle
[Figure: the same grid, with observed examples marked.]
Data slightly more of a coincidence under h1
Illustrating the size principle
[Figure: the same grid, with more observed examples marked.]
Data much more of a coincidence under h1
Bayesian Occam’s Razor
[Figure: p(D = d | M) over all possible data sets d, for a simple model M1 and a more complex model M2.]

For any model M:  Σ_{d∈D} p(D = d | M) = 1
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
A domain-general approach to priors?
• Start with a base set of regularities R and combination
operators C.
• Hypothesis space = closure of R under C.
– C = {and, or}: H = unions and intersections of regularities in R (e.g.,
“multiples of 10 between 30 and 70”).
– C = {and-not}: H = regularities in R with exceptions (e.g., “multiples
of 10 except 50 and 70”).
• Two qualitatively similar priors:
  – Description length: number of combinations from C needed to generate the hypothesis from R.
  – Bayesian Occam's Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior.
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.
• p(h) encodes relative plausibility of alternative theories:
  – Mathematical properties:  p(h) ~ 1
  – Approximate magnitude:    p(h) ~ 1/10
  – Raw magnitude:            p(h) ~ 1/50 (on average)
• Also degrees of plausibility within a theory, e.g., for magnitude intervals of size s:

  p(s) ∝ (s/σ²) e^(−s/σ),   σ = 10
Posterior:

  p(h | X) = p(X | h) p(h) / Σ_{h′∈H} p(X | h′) p(h′)
• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even
numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
10 except 50 and 20”? p(h).
• Why does a good generalization need both high
prior and high likelihood? p(h|X) ~ p(X|h) p(h)
Bayesian Occam’s Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
Generalizing to new objects
Given p(h | X), how do we compute p(y ∈ C | X), the probability that C applies to some new stimulus y?
Generalizing to new objects
Hypothesis averaging:
Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h | X):

  p(y ∈ C | X) = Σ_{h∈H} p(y ∈ C | h) p(h | X)

where p(y ∈ C | h) = 1 if y ∈ h, and 0 otherwise. Hence:

  p(y ∈ C | X) = Σ_{h ⊇ {y} ∪ X} p(h | X)
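Putting the prior, the size-principle likelihood, and hypothesis averaging together gives a runnable sketch of the whole model. The three-hypothesis space and uniform prior below are a drastic simplification, for illustration only:

```python
# Bayesian generalization in the number game: p(y in C | X) is the total
# posterior mass of hypotheses that contain both the examples X and y.

H = {
    "even": set(range(2, 101, 2)),
    "mult10": set(range(10, 101, 10)),
    "pow2": {2, 4, 8, 16, 32, 64},
}
prior = {name: 1 / len(H) for name in H}   # uniform, for illustration

def generalize(X, y):
    post = {}
    for name, h in H.items():
        lik = (1 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
        post[name] = lik * prior[name]
    z = sum(post.values())
    return sum(p for name, p in post.items() if y in H[name]) / z

# {16} leaves broad uncertainty; {16, 8, 2, 64} sharpens to "powers of 2".
p_broad = generalize([16], 10)            # moderate
p_sharp = generalize([16, 8, 2, 64], 10)  # near zero: 10 is not a power of 2
```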
[Figures: human generalization vs. the Bayesian model for the example sets {16}, {16, 8, 2, 64}, {16, 23, 19, 20}, {60}, {60, 80, 10, 30}, and {60, 52, 57, 55}.]
Summary of the Bayesian model
• How do the statistics of the examples interact with
prior knowledge to guide generalization?
posterior ∝ likelihood × prior
• Why does generalization appear rule-based or
similarity-based?
hypothesis averaging + size principle
  broad p(h|X) → similarity gradient
  narrow p(h|X) → all-or-none rule
Summary of the Bayesian model
• How do the statistics of the examples interact with
prior knowledge to guide generalization?
posterior ∝ likelihood × prior
• Why does generalization appear rule-based or
similarity-based?
hypothesis averaging + size principle
  Many h of similar size → broad p(h|X)
  One h much smaller → narrow p(h|X)
Alternative models
• Neural networks
– Supervised learning inapplicable.
– Simple unsupervised learning not sufficient:
[Figure: network mapping inputs 60, 80, 10, 30 onto feature units "even", "multiple of 10", "multiple of 3", "power of 2".]
Alternative models
• Neural networks
• Similarity to exemplars
  – Average similarity:  p(y ∈ C | X) = (1/|X|) Σ_{xj∈X} sim(y, xj)
[Figure: data vs. model (r = 0.80) for the example sets {60}, {60, 80, 10, 30}, {60, 52, 57, 55}.]
Alternative models
• Neural networks
• Similarity to exemplars
  – Average similarity
  – Max similarity:  p(y ∈ C | X) = max_{xj∈X} sim(y, xj)
[Figure: data vs. model (r = 0.64) for the example sets {60}, {60, 80, 10, 30}, {60, 52, 57, 55}.]
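The two exemplar rules differ only in how they aggregate over examples; a sketch with a hypothetical exponential-distance similarity function (not the one fit to the behavioral data):

```python
import math

def sim(y, x):
    return math.exp(-abs(y - x) / 10.0)   # hypothetical similarity metric

def avg_sim(y, X):
    return sum(sim(y, x) for x in X) / len(X)

def max_sim(y, X):
    return max(sim(y, x) for x in X)

X = [60, 52, 57, 55]
# Max-sim is driven entirely by the nearest exemplar, so it is always
# at least as large as the average similarity.
a56, m56 = avg_sim(56, X), max_sim(56, X)
```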
Alternative models
• Neural networks
• Similarity to exemplars
– Average similarity
– Max similarity
– Flexible similarity? Bayes.
Explaining similarity
• Hypothesis: A principal function of
similarity is generalization.
• A theory of generalization can thus explain
(some aspects of) similarity:
– The similarity of X to Y is to a significant
degree determined by the probability of
generalizing from X to Y, or from Y to X, or
both.
• Opposite of traditional approach: similarity
explains generalization.
Explaining similarity
• Spatial models
– Why exponential decay with distance?
• Common feature models
– Why additive measure?
– What determines feature weights, and why?
  • Specificity
  • Relational preference
  • Diagnosticity
  • Context-sensitivity
• Contrast model
– Why (and when) are both common & distinctive features
relevant?
– When is similarity asymmetric?
Alternative models
• Neural networks
• Similarity to exemplars
– Average similarity
– Max similarity
– Flexible similarity? Bayes.
• Toolbox of simple heuristics
– 60: “general” similarity
– 60 80 10 30: most specific rule (“subset principle”).
– 60 52 57 55: similarity in magnitude
Why these heuristics? When to use which heuristic? Bayes.
Numbers: Summary
• Theory-based statistical inference explains
inductive generalization from one or a few
examples.
• Explains the dynamics of both rule-like and
similarity-like generalization through the
interaction of:
– Structure of domain-specific knowledge.
– Domain-general principles of rational
inference.
Limitations of the number game
• No sense in which the theory is the “right”
or “wrong” description of world structure.
– Number game is conventional, not natural.
• Purely logical structure of the theory does
much of the work, with statistics just
selecting among hypotheses.
– Theory itself is not probabilistic.
• Theory just amounts to a systematization
for a set of hypotheses.
– No causal mechanisms.
The plan
• Basic causal learning
• Inferring number concepts
• Reasoning with biological properties
• Acquisition of domain theories
– Intuitive biology: Taxonomic structure
– Intuitive physics: Causal law
Which argument is stronger?
Horses have biotinic acid in their blood
Cows have biotinic acid in their blood
Rhinos have biotinic acid in their blood
All mammals have biotinic acid in their blood
Squirrels have biotinic acid in their blood
Dolphins have biotinic acid in their blood
Rhinos have biotinic acid in their blood
All mammals have biotinic acid in their blood
Osherson, Smith, Wilkie, Lopez, Shafir (1990):
• 20 subjects rated the strength of 45 arguments:
X1 have property P.
X2 have property P.
X3 have property P.
All mammals have property P.
• 40 different subjects rated the similarity of all
pairs of 10 mammals.
Similarity-based models
(Osherson et al.)
strength("all mammals" | X) ∝ Σ_{i∈mammals} sim(i, X)

[Figure: mammal categories and example items in similarity space.]
Similarity-based models
(Osherson et al.)
strength("all mammals" | X) ∝ Σ_{i∈mammals} sim(i, X)

• Sum-Similarity:  sim(i, X) = Σ_{j∈X} sim(i, j)

[Figure: mammal categories and example items in similarity space.]
Similarity-based models
(Osherson et al.)
strength("all mammals" | X) ∝ Σ_{i∈mammals} sim(i, X)

• Max-Similarity:  sim(i, X) = max_{j∈X} sim(i, j)

[Figure: mammal categories and example items in similarity space.]
Sum-Sim versus Max-Sim
• Two models appear functionally similar:
– Both increase monotonically as new examples
are observed.
• Reasons to prefer sum-sim:
– Standard form of exemplar models of
categorization, memory, and object recognition.
– Analogous to kernel density estimation
techniques in statistical pattern recognition.
• Reasons to prefer max-sim:
– Fit to generalization judgments . . . .
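The contrast between the two rules can be sketched with a toy similarity matrix (the values are hypothetical, not the Osherson et al. measurements): sum-sim rewards sets of typical examples, while max-sim rewards coverage of the category.

```python
# Sum-similarity vs. max-similarity argument strength over a tiny,
# made-up set of mammals and pairwise similarities.

sims = {
    ("horse", "cow"): 0.9, ("horse", "dolphin"): 0.2, ("horse", "squirrel"): 0.4,
    ("cow", "dolphin"): 0.2, ("cow", "squirrel"): 0.4, ("dolphin", "squirrel"): 0.1,
}

def s(i, j):
    return 1.0 if i == j else sims.get((i, j), sims.get((j, i)))

mammals = ["horse", "cow", "dolphin", "squirrel"]

def strength(X, agg):
    # strength("all mammals" | X) proportional to the sum over mammals i
    # of an aggregate (sum or max) of sim(i, j) over examples j in X.
    return sum(agg(s(i, j) for j in X) for i in mammals)

diverse = ["horse", "dolphin", "squirrel"]   # covers the category
close = ["horse", "cow", "squirrel"]         # typical, clustered examples
# Max-sim favors the diverse set; sum-sim favors the typical set.
```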
Data vs. models
[Scatter plots: human argument strength (data) vs. model predictions.]
Each point represents one argument:
  X1 have property P.
  X2 have property P.
  X3 have property P.
  All mammals have property P.
Three data sets
[Figure: max-sim and sum-sim predictions vs. data for three data sets:]
  Conclusion kind:     "all mammals"   "horses"   "horses"
  Number of examples:  3               2          1, 2, or 3
Explaining similarity
• Why does max-sim fit so well?
• Why does sum-sim fit so poorly?
• Are there cases where max-sim will fail?
Marr’s Three Levels of Analysis
• Computation:
“What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out?”
• Representation and algorithm:
Max-Sim, Sum-Sim
• Implementation:
Neurobiology
Scientific theory of biology
• Species generated by an evolutionary
branching process.
– A tree-structured taxonomy of species.
• Features generated by stochastic mutation
process and passed on to descendants.
– Similarity a function of distance in tree.
An intuitive theory of biology
• Species generated by an evolutionary
branching process.
– A tree-structured taxonomy of species.
• Features generated by stochastic mutation
process and passed on to descendants.
– Similarity a function of distance in tree.
Sources:
Cognitive anthropology: Atran, Medin
Cognitive development: Keil, Carey
A model of theory-based induction
1. Reconstruct intuitive taxonomy from
similarity judgments:
A model of theory-based induction
2. Hypothesis space H: each taxonomic
cluster is a possible hypothesis for the
extension of a novel feature.
[Figure: taxonomic tree over mammals, with clusters h0 ("all mammals"), h1, h3, h6, h17, … as hypotheses.]
h0: "all mammals"

  p(all mammals | X) = p(X | h0) p(h0) / Σ_{h∈H} p(X | h) p(h)

  p(h): uniform

  p(X | h) = (1 / size(h))^n if x1, …, xn ∈ h;  0 if any xi ∉ h
[Figure: Bayes (taxonomic), max-sim, and sum-sim predictions vs. data:]
  Conclusion kind:     "all mammals"   "horses"   "horses"
  Number of examples:  3               2          1, 2, or 3
Two arguments with conclusion kind "all mammals" (3 examples each):

  Cows have property P.
  Dolphins have property P.
  Squirrels have property P.
  All mammals have property P.

  Seals have property P.
  Dolphins have property P.
  Squirrels have property P.
  All mammals have property P.

[Figure: predictions of Bayes (taxonomic), max-sim, and sum-sim for the two arguments, with the taxonomic tree and h0 ("all mammals") shown.]
Scientific theory of biology
• Species generated by an evolutionary
branching process.
– A tree-structured taxonomy of species.
• Features generated by stochastic mutation
process and passed on to descendants.
– Similarity a function of distance in tree.
– Novel features can appear anywhere in tree, but
some distributions are more likely than others.
A model of theory-based induction
2. Generate hypotheses for novel feature F via a (Poisson arrival) mutation process over branches b:

  p(F develops along b) = 1 − e^(−λ·|b|)
A model of theory-based induction
2. Generate hypotheses for novel feature F via a (Poisson arrival) mutation process over branches b:

  p(F develops along b) = 1 − e^(−λ·|b|)
Induced prior p(h):
• Every subset of objects
is a possible hypothesis
• Prior p(h) depends on
the number and length
of branches needed to
span h.
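The induced prior can be sketched from the branch formula; the tiny example below assumes a hypothetical rate λ and treats a hypothesis's prior weight as the product of mutation probabilities over the branches needed to span it (a simplification of the full model):

```python
import math

lam = 0.5   # hypothetical mutation rate

def p_mutation(length):
    """Probability the feature develops along a branch: 1 - exp(-lam*length)."""
    return 1 - math.exp(-lam * length)

def prior_weight(branch_lengths):
    """Unnormalized prior for a hypothesis spanned by the given branches."""
    w = 1.0
    for length in branch_lengths:
        w *= p_mutation(length)
    return w

# One long branch (monophyletic) outweighs two short disjoint branches
# (polyphyletic), even with the same total length:
one_long = prior_weight([2.0])
two_short = prior_weight([1.0, 1.0])
```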
Bayesian Occam’s Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
Induced prior p(h)
• Monophyletic properties more likely than
polyphyletic properties:
p( {horse, cow, elephant, rhino} )
>
p( {chimp, gorilla, elephant, rhino} )
Induced prior p(h)
• Novel properties more likely to occur on long
branches than on short branches:
p( {dolphin, seal} )
>
p( {horse, cow} )
h0: “all mammals”
  p(all mammals | X) = p(X | h0) p(h0) / Σ_{h∈H} p(X | h) p(h)

  p(h): "evolutionary" process (mutation + inheritance)

  p(X | h) = (1 / size(h))^n if x1, …, xn ∈ h;  0 if any xi ∉ h
[Figure: Bayes (taxonomic), max-sim, and sum-sim predictions vs. data:]
  Conclusion kind:     "all mammals"   "horses"   "horses"
  Number of examples:  3               2          1, 2, or 3
[Figure: Bayes (taxonomy + mutation), max-sim, and sum-sim predictions vs. data:]
  Conclusion kind:     "all mammals"   "horses"   "horses"
  Number of examples:  3               2          1, 2, or 3
Model variants
• Version 1:
Simple taxonomic hypothesis space instead of
full hypothesis space with prior based on
mutation process.
• Version 2:
Simple taxonomic hypothesis space with Hebbian
learning instead of Bayesian inference.
• Version 3:
Taxonomy based on actual evolutionary tree
rather than psychological similarity.
[Figure: fits of Bayes (taxonomic), Hebb (taxonomic), and Bayes (actual evolutionary tree) across the three data sets (conclusion kinds "all mammals", "horses", "horses"; 3, 2, and 1, 2, or 3 examples); reported correlations include r = 0.51, 0.41, 0.90, −0.41, 0.88, 0.45, 0.40, 0.60, 0.61.]
Mutation principle versus pure
Occam’s Razor
• Mutation principle provides a version of
Occam’s Razor, by favoring hypotheses that
span fewer disjoint clusters.
• Could we use a more generic Bayesian
Occam’s Razor, without the biological
motivation of mutation?
A model of theory-based induction
2. Generate hypotheses for novel feature F via a (Poisson arrival) mutation process over branches b:

  p(F develops along b) = 1 − e^(−λ·|b|)
Induced prior p(h):
• Every subset of objects
is a possible hypothesis
• Prior p(h) depends on
the number and length
of branches needed to
span h.
A model of theory-based induction
2. Generate hypotheses for novel feature F via a mutation process over branches b, with probability independent of branch length:

  p(F develops along b) = ε (a constant)
Induced prior p(h):
• Every subset of objects
is a possible hypothesis
• Prior p(h) depends on
the number and length
of branches needed to
span h.
[Figure: Bayes (taxonomy + Occam) predictions vs. data.]
Premise typicality effect (Rips, 1975; Osherson et al., 1990):

Strong:
  Horses have property P.
  All mammals have property P.

Weak:
  Seals have property P.
  All mammals have property P.

[Figure: predictions of max-sim, sum-sim, and Bayes (taxonomy + mutation); conclusion kind "all mammals", 1 example.]
Intuitive versus scientific
theories of biology
• Same structure for how species are related.
– Tree-structured taxonomy.
• Same probabilistic model for traits
– Small probability of occurring along any branch
at any time, plus inheritance.
• Different features
– Scientist: genes
– People: coarse anatomy and behavior
[Figure: Bayes (taxonomy + mutation), max-sim, and sum-sim predictions vs. data:]
  Conclusion kind:     "all mammals"   "horses"   "horses"
  Number of examples:  3               2          1, 2, or 3
Explaining similarity
• Why does max-sim fit so well?
• Why does sum-sim fit so poorly?
• Are there cases where max-sim will fail?
Explaining similarity
• Why does max-sim fit so well?
– An efficient and accurate approximation to
Bayesian (evolution) model.
Correlation with Bayes on three-premise general arguments, over 100 simulated tree structures: mean r = 0.94.
[Figure: histogram of correlations r.]
Explaining similarity
• Why does max-sim fit so well?
– Approximation is domain specific.
cf. the number game, where max-sim fit poorly (r = 0.64): example sets {60}, {60, 80, 10, 30}, {60, 52, 57, 55}.
Explaining similarity
• Why does sum-sim fit so poorly?
– Prefers sets of the most typical examples, which
are not representative of category as a whole.
Correlation with Bayes on three-premise general arguments, over 100 simulated tree structures: mean r = −0.26.
[Figure: histogram of correlations r.]
Explaining similarity
• Are there cases where max-sim will fail?
– An example from Medin et al. (in press):
  Brown bears have property P.
  Horses have property P.

  Brown bears have property P.
  Polar bears have property P.
  Grizzly bears have property P.
  Horses have property P.

The Bayesian model makes the correct prediction, due to the size principle (assumption of examples sampled randomly from the concept).
A more systematic test of the
Size Principle
Biology: Summary
• Theory-based statistical inference explains
taxonomic inductive reasoning in folk biology.
• Reveals essential principles of domain theory.
– Category structure: taxonomic tree.
– Feature distribution: stochastic mutation process +
inheritance.
• Clarifies processing-level models.
– Why max-sim over sum-sim?
– When is max-sim a good heuristic approximation
to full Bayesian inference?