Information-theoretic models


Day 4: Reranking/Attention shift; surprisal-based sentence processing
Roger Levy
University of Edinburgh
&
University of California – San Diego
Overview for the day
• Reranking & Attention shift
• Crash course in information theory
• Surprisal-based sentence processing
Reranking & Attention shift
• Suppose an input prefix w1…i determines a ranked set of incremental structural analyses, call it Struct(w1…i)
• In general, adding a new word wi+1 to the input will determine a new ranked set of analyses Struct(w1…i+1)
• A reranking theory attributes processing
difficulty to some function comparing the
structural analyses
• An attention shift theory is a special case where difficulty is predicted only when the highest-ranked analysis differs between Struct(w1…i) and Struct(w1…i+1) (see the sketch below)
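A minimal sketch of the distinction in Python; the function names, data structures, and probabilities below are illustrative assumptions of mine, not anything from the lecture:

```python
# Sketch: a ranked analysis set Struct(w1...i) as a list of (analysis, P(T | w1...i))
# pairs sorted by probability. A reranking theory scores some comparison of the two
# ranked sets; attention shift is the special case that only checks the top analysis.

def attention_shift_difficulty(struct_old, struct_new):
    """Predict difficulty iff the highest-ranked analysis changes."""
    return struct_old[0][0] != struct_new[0][0]

# Hypothetical probabilities for "The warehouse fires ..." before and after "many":
struct_before = [("NN compound", 0.7), ("N V clause", 0.3)]   # Struct(w1...i)
struct_after  = [("N V clause", 0.8), ("NN compound", 0.2)]   # Struct(w1...i+1)

print(attention_shift_difficulty(struct_before, struct_after))  # True -> difficulty predicted
```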
Conceptual issues
• Granularity: what precisely is specified in an
incremental structural analysis?
• Ranking metric: how are analyses ranked?
• e.g. in terms of conditional probabilities P( T | w1…i)
• Degree of parallelism: how many (and which)
analyses are retained in Struct(w1…i)?
Attention shift: an example
• Parallel comprehension: two or more analyses
entertained simultaneously
The warehouse fires many workers each spring…
• Disambiguation comes at the following context, “many workers…”
• There is an extra cost paid (reading is slower) at
disambiguating context
• Eye-tracking (Frazier and Rayner 1987)
• Self-paced reading (MacDonald 1993)
Pruning isn’t enough
• Jurafsky analyzed NN/NV ambiguity for
“warehouse fires” and concluded no pruning
could happen
[Slide shows the relevant probability ratios: 267 : 1 and 3.8 : 1]
Idea of attention shift
• Suppose that a change in the top-ranked
candidate induces empirically-observed
“difficulty”
• Not the same as serial parsing, which doesn’t
even entertain alternate parses unless the current
parse breaks down
• Why would this happen?
• People could be gathering more information about
the preferred parse, and need extra time to do this
when the preferred parse changes
• People could simply be surprised, and this could
interrupt “normal reading processes”
Crocker & Brants 2000
• Adopt an attention-shift linking hypothesis
• (page 660; unfortunately not stated very explicitly)
• Architectural aspects of their system:
• Bottom-up, incremental parsing architecture
• Some pruning at every “layer” from bottom on up
• No lexicalization in the grammar
• Skip other details…
N/V ambiguity under attention shift
• Crocker & Brants 2000: relative strength of each
interpretation changes from word to word
N/V attention shift: which probs?
• This analysis relies on lexical & syntactic
probabilities
• P(fires|NN) is higher than P(fires|VBZ)
• P(NP -> Det NN NN) is low, and putting “many” after
a subject NP is low-probability
• Is this a satisfactory analysis? (c.f. day 1!)
[Slide diagram: “The corporation fires many workers each spring”]
• MacDonald 1993 found no disambiguating-context difficulty when the noun (corporation) doesn’t support the noun-compound analysis
• These are, at the least, bilexical affinities
Results from MacDonald 1993
• Difficulty only with “warehouse fires”, not “corporation fires”
• Observed difficulty delayed a bit (spillover)
[Figure: relative difficulty in the ambiguous case]
How to estimate parse probs
• In an attention-shift model, conditional
probabilities are of primary interest
• “warehouse fires” vs. “corporation fires” creates a
practical problem
• Model should include P(fires|warehouse,{NN,NV})
and P(fires|corporation,{NN,NV})
• But no parsed corpus even contains “fires” in the
same sentence with either of these words
• What do we do here?
How to estimate parse probs (2)
• MacDonald 1993’s approach: collect relevant quantitative norm data and correlate with RTs
• warehouse: head vs. modifying noun frequency
• corresponds to P(NN|warehouse)
• fires: noun/verb ambiguous word usage
• corresponds (indirectly) to P(fires|NN)
• warehouse fires: modifier+head cooccurrence rate
• corresponds to P(fires|warehouse,NN)
• warehouse fires: plausibility ratings as NN vs. as NV
• “how plausible is it to have a fire in a warehouse?”
• “how plausible is it to have a warehouse fire someone?”
How to estimate parse probs (3)
• In the era of gigantic corpora (e.g., the Web),
another approach: the counting method
• To estimate P(NN|the warehouse fires), simply collect a sample of “the warehouse fires” and count how many of them are NN usages (see the sketch below)
• Many pitfalls!
• often can’t hold external sentence context constant
• vulnerable to undisclosed workings of search engines
• hand-filtering the results is imperative
• assumes human prob. estimates will match corpus freqs
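As a concrete illustration of the counting method, a minimal sketch; everything here is my own code, and the only grounded number is the 21 all-NN hits reported on the next slide:

```python
# Relative-frequency estimate of P(NN | "the warehouse fires") from hand-filtered,
# hand-labeled search hits -- subject to all the pitfalls listed above.

def estimate_p_nn(labels):
    """labels: hand-assigned 'NN' or 'NV' tags for each usable hit."""
    return labels.count("NN") / len(labels)

hits = ["NN"] * 21             # the 21 usable "warehouse fires" hits, all NN
print(estimate_p_nn(hits))     # 1.0 -- so P(NN | warehouse, fires) is estimated near 1
```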
How to estimate parse probs (4)
• Crude method: we’ll use a corpus search (Google)
to estimate P(NN|warehouse,fires)
• 21 instances (excluding psycholinguistics hits!) of
“warehouse fires” found; all were NN
• two of these were potentially NV contexts
I heard an interview on NPR of a Vieux Carre (French Quarter) native who explained how the warehouse fires started...
Not all the warehouse fires were so devastating, ...
• At least some evidence that P(NN|warehouse,fires)
is above 0.5
• Supports attention-shift analysis
Attention shift in MV/RR ambiguity?
• McRae et al. 1998 also has an attention-shift
interpretation (pursued by Narayanan & Jurafsky
2002)
[Figure: “the {crook/cop} …” — points where attention shifts to the RR analysis for good patients vs. for good agents]
Reranking/Attention shift summary
• Reranking attributes difficulty to changes in the
ranking over interpretations caused by a given
word
• Attention shift is a special form in which changes
in the highest-ranked candidate matter
Overview for the day
• Reranking & Attention shift
• Tiny introduction to information theory
• Surprisal-based sentence processing
Tiny intro to information theory
• Shannon information content, or surprisal, of an event (sometimes called the entropy of event x):
  h(x) = log2(1/P(x)) = −log2 P(x)
• Example: a bent coin with P(heads)=0.4
  h(heads) = log2(1/0.4) ≈ 1.32
  h(tails) = log2(1/0.6) ≈ 0.74
• A loaded die with P(1)=0.4 also has h(1)=1.32
Tiny intro to information theory (2)
• The entropy of a discrete probability distribution is the expected value of its Shannon information content:
  H(X) = Σx P(x) log2(1/P(x))
• Example: the entropy of a fair coin is
  H(X) = 0.5 log2(1/0.5) + 0.5 log2(1/0.5) = log2 2 = 1
• Our bent P(heads)=0.4 coin has entropy less than 1:
  H(X) = 0.4 log2(1/0.4) + 0.6 log2(1/0.6) = 0.53 + 0.44 = 0.97
[Figure: entropy h2(p) of a loaded coin as a function of p, for p from 0 to 1]
Tiny intro to information theory (3)
• Our loaded die with P(1)=0.4 doesn’t have its entropy completely determined yet. [Slide shows two example distributions consistent with P(1)=0.4]
• A fair die has entropy of 2.58 (checked numerically in the sketch below)
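These quantities are easy to check numerically; a small Python sketch (my own code, just reproducing the slides’ arithmetic):

```python
import math

def surprisal(p):
    """Shannon information content of an event with probability p, in bits."""
    return -math.log2(p)

def entropy(dist):
    """Entropy of a discrete distribution, given as a list of probabilities."""
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(0.4))          # h(heads) for the bent coin: ~1.32 bits
print(surprisal(0.6))          # h(tails): ~0.74 bits
print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.4, 0.6]))     # bent coin: ~0.97 bits
print(entropy([1/6] * 6))      # fair die: ~2.58 bits
```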
Overview for the day
• Reranking & Attention shift
• Crash course in information theory
• Surprisal-based sentence processing
Hale 2001, Levy 2005: surprisal
• Let the difficulty of a word be its surprisal given its context:
  difficulty(wi) ∝ −log2 P(wi | w1…i−1, context)
• Captures the expectation intuition: the more we
expect an event, the easier it is to process
• Many probabilistic formalisms, including PCFGs
(Jelinek & Lafferty 1991, Stolcke 1995), can give us word
surprisals
Intuitions for surprisal & PCFGs
• Consider the following PCFG
P(S → NP VP)      = 1.0       P(DT → the)       = 0.3
P(NP → DT N)      = 0.4       P(VP → V)         = 0.3
P(NP → DT N N)    = 0.3       P(VP → V NP)      = 0.4
P(NP → DT Adj N)  = 0.3       P(VP → V PP)      = 0.1
P(N → warehouse)  = 0.03      P(V → fires)      = 0.05
P(N → fires)      = 0.02      P(V → destroyed)  = 0.04
• Calculate surprisal at destroyed in these sentences (a worked sketch follows below):
  the warehouse fires destroyed the neighborhood.
  the fires destroyed the neighborhood.
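One way to work this example is to enumerate the incremental analyses of each prefix by hand and take the ratio of prefix probabilities. The sketch below does that under a simplifying assumption of my own, not the slide’s: the rules not shown give every open constituent a completion probability of 1.

```python
import math

# Rule probabilities from the toy PCFG above
the, warehouse_N, fires_N = 0.3, 0.03, 0.02      # P(DT->the), P(N->warehouse), P(N->fires)
fires_V, destroyed_V = 0.05, 0.04                # P(V->fires), P(V->destroyed)
np_dt_n, np_dt_n_n = 0.4, 0.3                    # P(NP->DT N), P(NP->DT N N)
vp_starts_with_v = 0.3 + 0.4 + 0.1               # P(VP->V) + P(VP->V NP) + P(VP->V PP)

def surprisal(p_prefix_plus_word, p_prefix):
    """-log2 of the conditional probability of the word given the prefix."""
    return -math.log2(p_prefix_plus_word / p_prefix)

# Prefix "the warehouse fires": noun-compound analysis vs. "fires" as the main verb
nn = np_dt_n_n * the * warehouse_N * fires_N                       # [NP the warehouse fires] ...
nv = np_dt_n * the * warehouse_N * vp_starts_with_v * fires_V      # [NP the warehouse] [VP fires ...
# Continuing with "destroyed" is only possible on the NN analysis
p_wf, p_wfd = nn + nv, nn * vp_starts_with_v * destroyed_V

# Prefix "the fires": NP complete (DT N) or still awaiting a second noun (DT N N)
p_f = np_dt_n * the * fires_N + np_dt_n_n * the * fires_N
p_fd = (np_dt_n * the * fires_N) * vp_starts_with_v * destroyed_V

print(surprisal(p_wfd, p_wf))   # ~6.8 bits after "the warehouse fires"
print(surprisal(p_fd, p_f))     # ~5.8 bits after "the fires"
```

Under these assumptions, destroyed is more surprising after the warehouse fires than after the fires, because the verb reading of fires has soaked up some of the prefix’s probability mass.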
Connection with reranking models
• Levy 2005 shows that surprisal is a special form
of reranking model
• In particular, if reranking cost is taken as the KL
divergence* between old & new parse
distributions…
• …then reranking cost turns out to be equivalent to the surprisal of the new word wi+1 (a sketch of the derivation follows below)
• Thus representation neutrality is an interesting
consequence of the surprisal theory
*a measure of the penalty incurred by encoding one probability distribution with another
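A compressed sketch of why the identity holds (my own rendering of the argument, not the slides’): for any full structure T whose yield begins with w1…i+1, conditioning on the longer prefix adds nothing beyond T itself, so P(T, w1…i+1) = P(T, w1…i) and the log ratio is the same constant for every T in the support:

```latex
D_{\mathrm{KL}}\!\left(P(T \mid w_{1\ldots i+1}) \,\|\, P(T \mid w_{1\ldots i})\right)
  = \sum_T P(T \mid w_{1\ldots i+1}) \log_2 \frac{P(T \mid w_{1\ldots i+1})}{P(T \mid w_{1\ldots i})}
  = \sum_T P(T \mid w_{1\ldots i+1}) \log_2 \frac{P(T, w_{1\ldots i+1})\, P(w_{1\ldots i})}{P(w_{1\ldots i+1})\, P(T, w_{1\ldots i})}
  = \log_2 \frac{P(w_{1\ldots i})}{P(w_{1\ldots i+1})}
  = -\log_2 P(w_{i+1} \mid w_{1\ldots i})
```

i.e. exactly the surprisal of wi+1.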
Levy 2006: syntactically constrained contexts
• In many cases, you know that you have to
encounter a particular category C
• But you don’t know when you’ll encounter it, or
which member of C will actually appear
• Call these syntactically constrained contexts
• In these contexts, the more information related
to C you obtain, the sharper your expectations
about C generally turn out to be
• Interesting contrast to some non-probabilistic
theories that say holding onto the related
information is hard
Constrained contexts: final verbs
• Konieczny 2000 looked at reading times
at German final verbs
Er hat die Gruppe geführt
He has the group led
“He led the group”
Er hat die Gruppe auf den Berg geführt
He has the group to the mountain led
“He led the group to the mountain”
Er hat die Gruppe auf den SEHR SCHÖNEN
Berg geführt
He has the group to the VERY BEAUTIFUL mtn. led
“He led the group to the very beautiful mountain”
Surprisal’s predictions
Er hat die Gruppe (auf den (sehr schönen) Berg) geführt
[Figure: “Reading time at final verb” — reading time (ms) and negative log probability plotted for the No PP, Short PP, and Long PP conditions]
Deriving Konieczny’s results
• Seeing more = having more information
• More information = more accurate expectations
[Tree diagram: partial parse of “Er hat die Gruppe auf den Berg geführt” (S → NP Vfin VP; VP → NP PP V), with the remaining expectations marked: NP? PP-goal? PP-loc? Verb? ADVP?]
• Once we’ve seen a PP goal we’re unlikely to see another
• So the expectation of seeing anything else goes up (toy illustration below)
• For pi(w), used a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA corpus)
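As a toy illustration of the mechanism (all category names and numbers below are hypothetical, not NEGRA estimates): once the PP-goal slot is filled, its probability mass is reallocated to the remaining continuations, so the final verb becomes less surprising.

```python
import math

# Hypothetical distribution over what comes next after "Er hat die Gruppe ..."
before_pp = {"PP-goal": 0.30, "PP-loc": 0.15, "NP": 0.05, "ADVP": 0.10, "Verb": 0.40}

# After "auf den Berg" the PP-goal expectation is (essentially) used up;
# renormalizing over the remaining continuations sharpens the verb expectation.
remaining = {c: p for c, p in before_pp.items() if c != "PP-goal"}
total = sum(remaining.values())
after_pp = {c: p / total for c, p in remaining.items()}

print(-math.log2(before_pp["Verb"]))   # verb surprisal before the PP: ~1.32 bits
print(-math.log2(after_pp["Verb"]))    # verb surprisal after the PP:  ~0.81 bits
```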
Facilitative ambiguity and surprisal
• Review of when ambiguity facilitates processing:
harder:
  The daughteri of the colonelj who shot himself*i/j
  The daughteri of the colonelj who shot herselfi/*j
easier:
  The soni of the colonelj who shot himselfi/j
(Traxler et al. 1998; Van Gompel et al. 2001)
Traditional account:
probabilistic serial disambiguation
[Tree diagram: [NP [NP the daughter] [PP of [NP the colonel [RC who shot…himself]]]] — the RC attached low, to “the colonel”]
• Sometimes the reader attaches the RC low...
• and everything’s OK
• But sometimes the reader attaches the RC high…
• and the continuation is anomalous
• So we’re seeing garden-pathing ‘some’ of the
time
Surprisal as a parallel alternative
• Surprisal marginalizes over possible syntactic structures
[Tree diagrams: “the daughter of the colonel who shot…” with the RC attached high (to “the daughter”) and low (to “the colonel”)]
  pi(w) = Σ_T pi(T) · p(w | T)
• assume a generative model where the choice between herself and himself is determined only by the antecedent’s gender
Writing xlow = pi(T_RC_low), ylow = p(“self” | T_RC_low), and likewise xhigh, yhigh for the high attachment:
  pi(himself) = pi(T_RC_low) · p(“self” | T_RC_low) · p(himself | “self”, T_RC_low)
              + pi(T_RC_high) · p(“self” | T_RC_high) · p(himself | “self”, T_RC_high)
where p(himself | “self”, T_RC_low) = 1 (the low antecedent, the colonel, is masculine), and p(himself | “self”, T_RC_high) is 0 for daughter but 1 for son
Ambiguity reduces the surprisal: daughter…who shot… can’t contribute probability mass to himself, but son…who shot… can (illustrated numerically below):
  pi(himself | daughter) = xhigh · yhigh · 0 + xlow · ylow · 1
  pi(himself | son)      = xhigh · yhigh · 1 + xlow · ylow · 1
  ⇒ pi(himself | daughter) < pi(himself | son)
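Plugging in illustrative numbers (assumed for the sketch, not estimated from anything) makes the prediction concrete:

```python
import math

# x = probability of the RC attachment at this point, y = p("...self" reflexive | attachment)
x_low, y_low = 0.4, 0.5      # RC attached low, to "the colonel" (hypothetical values)
x_high, y_high = 0.6, 0.5    # RC attached high, to "the daughter"/"the son" (hypothetical)

# p(himself | "self", attachment) is 1 for a masculine antecedent, 0 otherwise
p_himself_daughter = x_high * y_high * 0 + x_low * y_low * 1
p_himself_son      = x_high * y_high * 1 + x_low * y_low * 1

print(-math.log2(p_himself_daughter))   # ~2.32 bits: himself is more surprising after "daughter"
print(-math.log2(p_himself_son))        # 1.00 bit:  ambiguity lowers the surprisal after "son"
```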
Ambiguity/surprisal conclusion
• Cases where ambiguity reduces difficulty
aren’t problematic for parallel constraint
satisfaction
• Although they are problematic for
competition
• Attributing difficulty to surprisal rather
than competition is a satisfactory revision
of constraint-based theories
Surprisal and garden paths: theory
• Revisiting the horse raced past the barn fell
• After the horse raced past the barn, assume 2 parses: the main-verb (MV) analysis and the reduced-relative (RR) analysis
• Jurafsky 1996 estimated the probability ratio of
these parses as 82:1
• The surprisal differential of fell in reduced versus unreduced conditions should thus be log2 83 ≈ 6.4 bits* (computed below)
*(assuming independence between RC reduction and main verb)
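The arithmetic behind the 6.4-bit figure, as a one-line sketch that just re-does the calculation stated above:

```python
import math

# Jurafsky (1996): after "the horse raced past the barn", the main-verb analysis
# outweighs the reduced-relative analysis by about 82:1.
p_rr = 1 / (82 + 1)             # share of prefix probability on the RR parse
print(-math.log2(p_rr))         # ~6.4 bits: predicted surprisal differential at "fell"
```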
Surprisal and garden paths: practice
• An unlexicalized PCFG (from the Brown corpus) gets the right monotonicity of surprisals at the disambiguating word “fell”
• But there are some unwanted results too
[Figure: word-by-word surprisals, annotated “these are way too high!” and “this is right but diff. is small”]
Surprisal and garden paths
• raced has high surprisal because the grammar is
unlexicalized – no connection with horse
• Unfortunately, lexicalization in practice wouldn’t
help: race as a verb never co-occurs with horse in
Penn Treebank!
• surprisal differential at fell is small for the same
reason
• failure to account for lexical preferences of raced
means that probability of RR alternative is likely
overestimated
• Is surprisal a plausible source of explanation for
most dramatic garden-path effects? Still seems
unclear.
Surprisal summary
• Motivation: expectations affect processing
• When people encounter something unexpected,
they are surprised
• Translates into slower reading (=processing
difficulty?)
• This intuition can be captured and formalized
using tools from probability theory, information
theory, and statistical NLP
Tomorrow
• Other information-theoretic approaches to on-line
sentence processing
• Brief look at connectionist approaches to
sentence processing
• General discussion & course wrap-up