
Computational Psycholinguistics
Lecture 2: surprisal, incremental syntactic processing,
and approximate surprisal
Florian Jaeger & Roger Levy
LSA 2011 Summer Institute
Boulder, CO
12 July 2011
Comprehension: Theoretical Desiderata
• Realistic models of human sentence
comprehension must account for:
• Robustness to arbitrary input
• Accurate disambiguation
• Inference on basis of incomplete input
(Tanenhaus et al 1995, Altmann and Kamide 1999,
Kaiser and Trueswell 2004)
how to get from here…
…to here?
the boy will eat…
• Processing difficulty is differential and localized
Review
• Garden-pathing under Jurafsky 1996
• Scoring relative probability of incremental trees
• An incremental tree is a fully connected sequence of nodes
from the root category (typically, S) to all the terminals
(words) that have been seen so far
• Nodes on the right frontier of an incremental tree are still
“open” (could accrue further daughters)
• What kind of uncertainty does the Jurafsky 1996 model
of garden-pathing deal with?
• Uncertainty about what has already been said
Generalizing incremental disambiguation
• Another type of uncertainty
The old man stopped and stared at the … (statue? dog? view? woman?)
The squirrel stored some nuts in the tree
• This is uncertainty about what has not yet been said
• Reading-time (Ehrlich & Rayner, 1981) and EEG
(Kutas & Hillyard, 1980, 1984) evidence shows this
affects processing rapidly
• A good model should account for expectations about
how this uncertainty will be resolved
Non-probabilistic complexity
• On the traditional view, resource limitations,
especially memory, drive processing
complexity
• Gibson 1998, 2000 (DLT): multiple and/or
more distant dependencies are harder to
process
the reporter who attacked the senator → easy to process
the reporter who the senator attacked → hard to process
Probabilistic complexity: surprisal
• Hale (2001) proposed that a word’s complexity in
sentence comprehension is determined by its surprisal
• This idea can actually be traced back (at least) to
Mandelbrot (1953)
• (Cognitive science in the 1950s was extremely interesting -- many ideas to be mined!)
The surprisal graph
[Figure: surprisal (-log P), from 0 to 4, plotted against probability, from 0 to 1; surprisal rises steeply as probability approaches 0]
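To make the definition concrete, here is a minimal sketch (not from the original slides; the per-word probabilities below are invented) of computing surprisal as the negative log probability of a word in context:

```python
import math

def surprisal(prob, base=2):
    """Surprisal of an event with probability `prob` (in bits when base=2)."""
    return -math.log(prob, base)

# Hypothetical conditional probabilities P(w_i | w_1..i-1) for "the boy will eat ..."
word_probs = [("the", 0.30), ("boy", 0.02), ("will", 0.10), ("eat", 0.05)]

for word, p in word_probs:
    print(f"{word:>4}: P = {p:.2f}  surprisal = {surprisal(p):.2f} bits")
```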
Garden-pathing under surprisal
• Another type of local syntactic ambiguity
When the dog scratched the vet and his new assistant removed the muzzle.
• Compare with:
When the dog scratched, the vet and his new assistant removed the muzzle.
When the dog scratched its owner the vet and his new assistant removed the muzzle.
A small PCFG for this sentence type
Two incremental trees
Surprisal for the two variants
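The PCFG and the two incremental trees on these slides are figures not preserved in the transcript. As a stand-in, here is a hedged sketch of the calculation they illustrate: instead of a parser, it enumerates a tiny invented set of complete sentences with made-up probabilities (a real model would derive these from the slide's PCFG), and computes the surprisal of the disambiguating word "removed" by marginalizing over the sentences consistent with the prefix. The garden-path variant gives "removed" much less probability than the disambiguated variant.

```python
from math import log2

# Toy distribution over complete sentences (probabilities and continuations invented)
sentences = {
    ("when", "the", "dog", "scratched", "the", "vet", "and", "his", "new",
     "assistant", "removed", "the", "muzzle"): 0.10,      # NP is the main-clause subject
    ("when", "the", "dog", "scratched", "the", "vet", "and", "his", "new",
     "assistant", "the", "muzzle", "came", "off"): 0.40,  # NP is the object of "scratched"
    ("when", "the", "dog", "scratched", "its", "owner", "the", "vet", "and",
     "his", "new", "assistant", "removed", "the", "muzzle"): 0.30,
    ("when", "the", "dog", "scratched", "its", "owner", "the", "vet", "and",
     "his", "new", "assistant", "left"): 0.20,
}

def prefix_prob(prefix):
    """P(prefix): total probability of all sentences beginning with the prefix."""
    prefix = tuple(prefix)
    return sum(p for s, p in sentences.items() if s[:len(prefix)] == prefix)

def next_word_surprisal(prefix, word):
    """Surprisal (bits) of `word` after `prefix`, marginalizing over analyses."""
    return -log2(prefix_prob(list(prefix) + [word]) / prefix_prob(prefix))

garden_path   = "when the dog scratched the vet and his new assistant".split()
disambiguated = "when the dog scratched its owner the vet and his new assistant".split()

print("garden-path variant:  ", round(next_word_surprisal(garden_path, "removed"), 2), "bits")
print("disambiguated variant:", round(next_word_surprisal(disambiguated, "removed"), 2), "bits")
```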
Expectations versus memory
• Suppose you know that some event class X has to happen in the future, but you don’t know:
1. When X is going to occur
2. Which member of X it’s going to be
• The things W you see before X can give you hints about (1) and (2)
• If expectations facilitate processing, then seeing W should generally speed processing of X
• But you also have to keep W in memory and retrieve it at X
• This could slow processing at X
Study 1: Verb-final domains
• Konieczny 2000 looked at reading times at
German final verbs in a self-paced reading expt
Er hat die Gruppe geführt
He has the group led
“He led the group”
Er hat die Gruppe auf den Berg geführt
He has the group to the mountain led
“He led the group to the mountain”
Er hat die Gruppe auf den SEHR SCHÖNEN Berg geführt
He has the group to the VERY BEAUTIFUL mtn. led
“He led the group to the very beautiful mountain”
Locality predictions and empirical results
• Locality-based models (Gibson 1998) predict
difficulty for longer clauses
• But Konieczny found that final verbs were
read faster in longer clauses
Er hat die Gruppe geführt (“He led the group”): predicted easy, read slowly
Er hat die Gruppe auf den Berg geführt (“He led the group to the mountain”): predicted hard, read fast
...die Gruppe auf den sehr schönen Berg geführt (“...led the group to the very beautiful mountain”): predicted hard, read fastest
Predictions of surprisal
[Figure: reading time at the final verb (450-520 ms) and its negative log probability (14.8-16.2) for Er hat die Gruppe (auf den (sehr schönen) Berg) geführt, plotted against locality-based difficulty (ordinal): No PP, Short PP, Long PP. Both reading time and surprisal decrease from No PP to Long PP; locality-based models (e.g., Gibson 1998, 2000) would violate this monotonicity.]
Levy 2008
Deriving Konieczny’s results
• Seeing more = having more information
• More information = more accurate expectations
[Figure: incremental tree for Er hat die Gruppe auf den Berg ...: S with NP (Er) and VP containing Vfin (hat), NP (die Gruppe), PP (auf den Berg), and the still-unseen V (geführt); categories still predicted: NP? PP-goal? PP-loc? Verb? ADVP?]
• Once we’ve seen a PP goal we’re unlikely to see another
• So the expectation of seeing anything else goes up
• pi(w) obtained via a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
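As a hedged sketch of just the estimation step mentioned above, the snippet below does relative-frequency (maximum-likelihood) estimation of PCFG rule probabilities from a tiny invented treebank; the trees and category labels are illustrative, not NEGRA's actual annotation scheme.

```python
from collections import Counter

# Toy "treebank": each tree is a nested (label, children...) tuple (structure invented).
toy_treebank = [
    ("S", ("NP", ("PRON", "er")),
          ("VP", ("Vfin", "hat"),
                 ("NP", ("DET", "die"), ("N", "Gruppe")),
                 ("V", "geführt"))),
    ("S", ("NP", ("PRON", "er")),
          ("VP", ("Vfin", "hat"),
                 ("NP", ("DET", "die"), ("N", "Gruppe")),
                 ("PP", ("P", "auf"), ("NP", ("DET", "den"), ("N", "Berg"))),
                 ("V", "geführt"))),
]

rule_counts = Counter()
lhs_counts = Counter()

def collect(node):
    """Count LHS -> RHS rule occurrences in a tree."""
    if isinstance(node, str):        # terminal word
        return
    lhs, children = node[0], node[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for c in children:
        collect(c)

for tree in toy_treebank:
    collect(tree)

# Relative-frequency (maximum-likelihood) estimates of rule probabilities
for (lhs, rhs), n in sorted(rule_counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {n / lhs_counts[lhs]:.2f}")
```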
Study 2: Final verbs, effect of dative
...daß der Freund DEM Kunden das Auto verkaufte
...that the friend the client the car sold
‘...that the friend sold the client a car...’
...daß der Freund DES Kunden das Auto verkaufte
...that the friend the client the car sold
‘...that the friend of the client sold a car...’
Locality: final verb read faster in DES condition
Observed: final verb read faster in DEM condition
(Konieczny & Döring 2003)
Next:
[Figure: incremental tree for ...daß der Freund DEM Kunden das Auto ... verkaufte: SBAR → COMP (daß) S, with NPnom (der Freund), NPdat (DEM Kunden), NPacc (das Auto), and V (verkaufte) in the VP; predicted next categories include NPnom, NPacc, NPdat, PP, ADVP, Verb]
Next:
[Figure: incremental tree for ...daß der Freund DES Kunden das Auto ... verkaufte: here DES Kunden is an NPgen inside the subject NP; predicted next categories include NPnom, NPacc, NPdat, PP, ADVP, Verb]
Model results
Condition | Reading time (ms) | P(wi): word probability | Locality-based prediction
dem Kunden (dative) | 555 | 8.38 × 10^-8 | slower
des Kunden (genitive) | 793 | 6.35 × 10^-8 | faster
~30% greater expectation in the dative condition
Locality-based predictions: once again, wrong monotonicity
Theoretical bases for surprisal
• So far, we have simply stipulated that complexity ~
surprisal
• To a mathematician, surprisal is a natural cost metric
• But as a cognitive scientist, it would be nice to derive
surprisal from prior principles
• I’ll present three derivations of surprisal in this section
(1) Surprisal as relative entropy
• Relative entropy: a fundamental information-theoretic measure of the distance between two probability distributions
• Intuitively, the penalty paid by encoding one distribution with a different one
• It turns out that the relative entropy between the interpretation distributions before and after wi equals log 1/Pi(wi) (surprisal!)
• Surprisal can thus be thought of as reranking cost
• Relative entropy independently proposed as a measure of surprise in visual scene perception (Itti & Baldi 2005)
Levy 2008
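A small numerical check of this identity (toy numbers, not from the lecture): if each complete interpretation deterministically fixes the next word, the relative entropy between the posterior and prior over interpretations equals the surprisal of that word.

```python
from math import log2

# Prior over complete interpretations after w_1..i-1 (probabilities invented),
# plus the word each interpretation deterministically produces next.
prior     = {"T1": 0.5, "T2": 0.3, "T3": 0.2}
next_word = {"T1": "geführt", "T2": "geführt", "T3": "gesehen"}

w = "geführt"                                   # the word actually observed

# P_i(w) = total prior mass on interpretations consistent with w
p_w = sum(p for t, p in prior.items() if next_word[t] == w)

# Posterior = renormalized restriction of the prior to consistent interpretations
posterior = {t: (p / p_w if next_word[t] == w else 0.0) for t, p in prior.items()}

# Relative entropy D(posterior || prior)
kl = sum(q * log2(q / prior[t]) for t, q in posterior.items() if q > 0)

print(f"KL(posterior || prior) = {kl:.3f} bits")
print(f"surprisal -log2 P(w)   = {-log2(p_w):.3f} bits")   # identical
```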
(2) Surprisal as optimal discrimination
• Many theories of reading posit lexical access as key bottleneck
• E-Z Reader (Reichle et al., 1998); SWIFT (Engbert et al., 2005)
• Same bottleneck should hold for auditory comprehension as well
• Norris (2006)’s Bayesian Reader: lexical access involves a
probabilistic judgment about the word’s identity from noisy input
• Certainty takes a “random walk” in probability space, and
surprisal determines starting point of the walk
• Connections with diffusion
model (Ratcliff 1978) and
MSPRT (Baum & Veeravalli 1994)
• Also connections w/ cortical
decision-process models (e.g.,
Usher & McClelland 2001)
[Figure: random walks of certainty toward a decision threshold]
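A toy simulation of this linking idea (all parameters invented; this is not the Bayesian Reader or the diffusion model itself): the walk starts at the word's prior log-odds, so lower-probability words start farther from the decision threshold and take longer, on average, to reach it.

```python
import random
from math import log

def steps_to_threshold(start, threshold=5.0, drift=0.2, noise=1.0, rng=None):
    """Biased random walk in certainty (log-odds) space; returns steps to threshold."""
    x, t = start, 0
    while x < threshold:
        x += drift + rng.gauss(0.0, noise)
        t += 1
    return t

rng = random.Random(0)
for p in (0.5, 0.1, 0.01):
    start = log(p / (1 - p))          # assumed linking rule: start at prior log-odds
    runs = 1000
    mean_t = sum(steps_to_threshold(start, rng=rng) for _ in range(runs)) / runs
    print(f"P(word) = {p:<4}  start = {start:6.2f}  mean steps to decision = {mean_t:5.1f}")
```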
(3) Surprisal as optimal preparation
• Are all RT differences best modeled as discrimination?
• Intuitively, it makes sense to prepare for events you
expect to happen
• Such preparation allows increased avg. response speed
• Smith & Levy (2008) formalize this intuition as an
optimization of response speed against (fixed)
preparation costs:
• Let the brain choose response times, but faster is
costlier
• + scale-freeness: a unit’s processing cost is sum of costs
of its subunits
• = surprisal, under very general conditions
Smith & Levy, 2008
Is probabilistic facilitation logarithmic?
• What I’ve shown you so far:
• More expected = faster
• What the theoretical derivations I’ve shown promised:
• More expected = faster, on a logarithmic (log-probability) scale
• Established for frequency, not for probability
• Focused look at subtleties of specific constructions
may not be the best way to investigate this issue
• highly refined probability distributions are challenging to
estimate
• we need a lot of data to get a good view of the picture
• Solution: broad-coverage model, reading over free text
Smith & Levy, 2008
Log-probability: methods
• Dataset
• the Dundee Corpus (Kennedy et al., 2003)
• 50K words of British newspaper text, read by 10
speakers
• Measures of interest:
• “Frontier” fixations (all fixations beyond the farthest
fixation thus far)
• First fixations (frontier fixations falling on a new word)
[Figure: fixation sequence over “... fox jumped over the lazy dog”, marking frontier fixations and first fixations]
Deconfounding frequency & probability
• Major confound: log-frequency, widely recognized to have a linear effect on RT
• Unfortunately, freq & prob are heavily correlated (r = 0.8)
• Fortunately, there’s still a big cloud of data to help us discriminate between the two (N≈200,000)
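A minimal sketch of the deconfounding logic on synthetic data (the real analysis used nonparametric regression on the Dundee data; everything below, including the coefficients, is invented): generate log-frequency and log-probability with correlation around 0.8, then fit both jointly and recover their separate linear effects on RT.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Correlated predictors (rho ~ 0.8), mimicking the frequency/probability confound
log_freq = rng.normal(0.0, 1.0, n)
log_prob = 0.8 * log_freq + 0.6 * rng.normal(0.0, 1.0, n)

# Synthetic reading times with separate linear effects of each predictor
rt = 300.0 - 10.0 * log_freq - 15.0 * log_prob + rng.normal(0.0, 30.0, n)

# Joint least-squares fit: with enough data, the two effects can be teased apart
X = np.column_stack([np.ones(n), log_freq, log_prob])
coef, *_ = np.linalg.lstsq(X, rt, rcond=None)
print("intercept, log-freq slope, log-prob slope:", np.round(coef, 2))
```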
Log-probability: results
• Facilitation is essentially
linear in log-probability
• True even after controlling
conservatively for frequency
and word-length effects
[Figure: frontier-fixation RTs vs. log-probability, with a nonparametric regression curve and binned median log-probs and RTs]
Aggregation across words & spillover
[Figures: eye-tracking and self-paced reading panels]
When ambiguity facilitates comprehension
• Sometimes, ambiguity seems to facilitate processing:
The daughteri of the colonelj who shot himself*i/j (slower)
The daughteri of the colonelj who shot herselfi/*j (slower)
The soni of the colonelj who shot himselfi/j (faster)
• Argued to be problematic for parallel constraint-based competition models (MacDonald, Pearlmutter, & Seidenberg 1994)
• (though see rebuttal by Green & Mitchell 2006)
(Traxler et al. 1998; Van Gompel et al. 2001, 2005)
Traditional account: stochastic race model
[Figure: NP “the daughter of the colonel” with the RC “who shot…himself” able to attach low (to “the colonel”, inside the PP) or high (to “the daughter”)]
• Sometimes the reader attaches the RC low...
• and everything’s OK
• But sometimes the reader attaches the RC high…
• and the continuation is anomalous
• So we’re seeing garden-pathing ‘some’ of the time
(Traxler et al. 1998; Van Gompel et al. 2001, 2005)
Surprisal as a parallel alternative
• Surprisal marginalizes over possible syntactic structures
[Figure: two incremental trees for “the daughter of the colonel who shot…”: one with the RC attached high (to “the daughter”) and one with it attached low (to “the colonel”)]
pi(w) = Σ_T pi(T) p(w | T)
• assume a generative model where the choice between herself and himself is determined only by the antecedent’s gender
Abbreviate x_low = Pi(RC_low) P(self | RC_low) and y_low = P(himself | self, RC_low), and likewise x_high, y_high for the high attachment. Then:
Pi(himself) = Pi(RC_low) P(self | RC_low) P(himself | self, RC_low) + Pi(RC_high) P(self | RC_high) P(himself | self, RC_high)
            = x_low · y_low + x_high · y_high
For “the daughter of the colonel”: y_low = 1 (the low antecedent, the colonel, is masculine), y_high = 0
For “the son of the colonel”: y_low = 1 and y_high = 1 (both antecedents are masculine)
Ambiguity reduces the surprisal
• daughter…who shot… can’t contribute probability mass to himself, but son…who shot… can
pi(himself | daughter) = x_high · 0 + x_low · 1 = x_low
pi(himself | son) = x_high · 1 + x_low · 1 = x_high + x_low
so pi(himself | daughter) < pi(himself | son)
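The same marginalization written out as a short sketch (the attachment weights are invented; the structure follows the equations above): P(himself) sums over high and low attachment, and only the head noun's gender changes between conditions.

```python
from math import log2

# Invented attachment weights: x_high = Pi(RC_high)P(self | RC_high), x_low likewise
x_high, x_low = 0.4, 0.6

def p_himself(head_noun):
    """Pi(himself), marginalizing over high vs. low attachment of the RC.
    The reflexive's gender is assumed to depend only on its antecedent."""
    y_high = 1.0 if head_noun == "son" else 0.0   # high attachment: antecedent = head noun
    y_low = 1.0                                   # low attachment: antecedent = the colonel
    return x_high * y_high + x_low * y_low

for noun in ("daughter", "son"):
    p = p_himself(noun)
    print(f"{noun:>8}: P(himself) = {p:.2f}  surprisal = {-log2(p):.2f} bits")
```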
Ambiguity/surprisal conclusion
• Cases where ambiguity reduces difficulty aren’t
problematic for parallel constraint satisfaction
• Although they may be problematic for
competition
• Surprisal can be thought of as a revision of
constraint-based theories with competition
• Same: a variety of constraints immediately
brought to bear on syntactic comprehension
• Different: linking hypothesis from probabilistic
constraints to behavioral observables
Competition versus surprisal: speculation
• Swets et al. (submitted): question type can affect
behavioral responses to ambiguous RCs:
“Did the colonel get shot?”
• Asking about RC slowed RC reading time across the
board
• And speed of response interacted with question type
• RC questions answered slowest in ambiguous condition
• Speculation:
• Comprehension is generally parallel & surprisal-based
• Competition emerges when comprehender is forced into
a serial channel
Memory constraints: a theoretical puzzle
• # Logically possible analyses grows at best exponentially
in sentence length
• Exact probabilistic inference with context-free grammars
can be done efficiently in O(n3)
• But…
• Requires probabilistic locality, limiting conditioning context
• Human parsing is linear—that is, O(n)—anyway
• So we must be restricting attention to some subset of
analyses
• Puzzle: how to choose and manage this subset?
• Previous efforts: k-best beam search
• Here, we’ll explore the particle filter as a model of limited-parallel approximate inference
Levy, Reali, & Griffiths, 2009, NIPS
The particle filter: general picture
• Sequential Monte Carlo for incremental observations
• Let xi be observed data, zi be unobserved states
• For parsing: xi are words, zi are incremental structures
• Suppose that after n-1 observations we have the
distribution over interpretations P(zn-1|x1…n-1)
• After next observation xn, represent the next
distribution P(zn|x1…n) inductively:
• Approximate P(zi|x1…i) by samples
• Sample zn from P(zn|zn-1), and reweight by P(xn|zn)
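A generic sequential Monte Carlo skeleton, as a hedged sketch of the recipe above (this is not the parsing model itself; the two-state example at the bottom is invented just to show the interface):

```python
import random

def particle_filter(observations, init, transition, likelihood, n_particles=1000, seed=0):
    """Minimal sequential Monte Carlo sketch: propagate each particle through the
    transition model, reweight by the likelihood of the new observation, and resample."""
    rng = random.Random(seed)
    particles = [init(rng) for _ in range(n_particles)]
    for x in observations:
        # 1. Sample z_n ~ P(z_n | z_{n-1}) for every particle
        particles = [transition(z, rng) for z in particles]
        # 2. Reweight by P(x_n | z_n)
        weights = [likelihood(x, z) for z in particles]
        if sum(weights) == 0:
            raise RuntimeError("all particles have zero weight (analysis failure)")
        # 3. Resample particles in proportion to their weights
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return particles

# Toy usage (invented): a two-state hidden process emitting 'a'/'b'
emit = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}
posterior = particle_filter(
    observations=["a", "a", "b"],
    init=lambda rng: rng.choice(["S1", "S2"]),
    transition=lambda z, rng: z if rng.random() < 0.9 else ("S2" if z == "S1" else "S1"),
    likelihood=lambda x, z: emit[z][x],
)
print("estimated P(state = S1 | a a b):", posterior.count("S1") / len(posterior))
```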
Particle filter with probabilistic grammars
S → NP VP 1.0
NP → N 0.8
NP → N RRC 0.2
RRC → Part N 1.0
VP → V N 1.0
N → women 0.7
N → sandwiches 0.3
V → brought 0.4
V → broke 0.3
V → tripped 0.3
Part → brought 0.1
Part → broken 0.7
Part → tripped 0.2
Adv → quickly 1.0
[Figure: two particles (incremental parses) for “women brought sandwiches tripped”: a main-verb analysis (women = N 0.7, brought = V 0.4, sandwiches = N 0.3), which cannot incorporate “tripped”, and a reduced-relative analysis (women = N 0.7, brought = Part 0.1, sandwiches = N 0.3, tripped = V 0.3), which can]
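The lopsidedness of the two analyses can be reproduced by hand from the grammar above. This is a simplified sketch (in the actual filter, structure is sampled incrementally and only word probabilities enter the weights; here I simply multiply out each derivation's rule probabilities, structural rules included):

```python
# Rule probabilities from the small PCFG above
rules = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("N",)): 0.8, ("NP", ("N", "RRC")): 0.2,
    ("RRC", ("Part", "N")): 1.0, ("VP", ("V", "N")): 1.0,
    ("N", ("women",)): 0.7, ("N", ("sandwiches",)): 0.3,
    ("V", ("brought",)): 0.4, ("V", ("broke",)): 0.3, ("V", ("tripped",)): 0.3,
    ("Part", ("brought",)): 0.1, ("Part", ("broken",)): 0.7, ("Part", ("tripped",)): 0.2,
}

def weight(derivation):
    """Product of the probabilities of the rules a particle has used so far."""
    w = 1.0
    for r in derivation:
        w *= rules[r]
    return w

# The two incremental analyses of "women brought sandwiches ..." (rule lists written by hand)
main_verb = [("S", ("NP", "VP")), ("NP", ("N",)), ("N", ("women",)),
             ("VP", ("V", "N")), ("V", ("brought",)), ("N", ("sandwiches",))]
reduced_rel = [("S", ("NP", "VP")), ("NP", ("N", "RRC")), ("N", ("women",)),
               ("RRC", ("Part", "N")), ("Part", ("brought",)), ("N", ("sandwiches",))]

w_mv, w_rr = weight(main_verb), weight(reduced_rel)
print(f"main-verb particle:        {w_mv:.4f}")
print(f"reduced-relative particle: {w_rr:.4f}")
print(f"relative weight of RR:     {w_rr / (w_mv + w_rr):.3f}")  # small: RR rarely survives
```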
Resampling in the particle filter
• With the naïve particle filter, inferences are highly
dependent on initial choices
• Most particles wind up with small weights
• Region of dense posterior poorly explored
• Especially bad for parsing
• Space of possible parses grows (at best)
exponentially with input length
• We handle this by resampling at each input word
Simple garden-path sentences
The woman brought the sandwich from the kitchen tripped
MAIN VERB (it was the woman who brought the sandwich)
REDUCED RELATIVE (the woman was brought the sandwich)
• Posterior initially misled away from ultimately correct interpretation
• With finite # of particles, recovery is not always successful
Solving a puzzle
A-S Tom heard the gossip wasn’t true.
A-L Tom heard the gossip about the neighbors wasn’t true.
U-S Tom heard that the gossip wasn’t true.
U-L Tom heard that the gossip about the neighbors wasn’t
true.
• Previous empirical finding: ambiguity induces
difficulty…
• …but so does the length of the ambiguous region
• Our linking hypothesis:
Proportion of parse failures at the disambiguating region
should increase with sentence difficulty
Frazier & Rayner, 1982; Tabor & Hutchins, 2004
Another example (Tabor & Hutchins 2004)
As the author wrote the essay the book grew.
As the author wrote the book grew.
As the author wrote the essay the book describing Babylon grew.
As the author wrote the book describing Babylon grew.
Resampling-induced drift
• In ambiguous region, observed words aren’t strongly
informative (P(xi|zi) similar across different zi)
• But due to resampling, P(zi|xi) will drift
• One of the interpretations may be lost
• The longer the ambiguous region, the more likely this
is
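A toy simulation of this drift (particle counts and probabilities invented): during the ambiguous region both interpretations fit the input equally well, so resampling is pure sampling noise, and the chance that the ultimately correct interpretation loses all of its particles grows with the length of the region.

```python
import random

def correct_lost(region_length, p_correct=0.3, n_particles=20, rng=None):
    """Toy drift simulation: within the ambiguous region the words favor neither
    interpretation (equal likelihoods), so resampling at each word is pure sampling
    noise. Returns True if the ultimately correct interpretation loses all particles."""
    particles = ["correct" if rng.random() < p_correct else "garden-path"
                 for _ in range(n_particles)]
    for _ in range(region_length):
        particles = rng.choices(particles, k=n_particles)   # equal-weight resampling
    return "correct" not in particles

rng = random.Random(0)
runs = 2000
for length in (2, 5, 10):
    failures = sum(correct_lost(length, rng=rng) for _ in range(runs))
    print(f"ambiguous region of {length:2d} words: correct analysis lost "
          f"in {failures / runs:.1%} of runs")
```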
Model Results
Ambiguity matters…
But the length of the ambiguous region also matters!
Human results (offline rating study)