Overview of PCFGs and the
inside/outside algorithms
based on the book by E. Charniak, 1993
Presented by Jeff Bilmes
Outline
• Shift/Reduce & Chart Parsing
• PCFGs
• Inside/Outside
– inside
– outside
– use for EM training.
Shift-Reduce Parsing
• Top-down parsing
– start at non-terminal symbols and “recurse” down, trying to find a parse of a given sentence.
– Can get stuck in loops (left-recursion)
• Bottom-up parsing
– start at the terminal level and build up non-terminal constituents as we go along.
– shift-reduce parsing
• (used often for machine languages, which are designed to be as unambiguous as possible) is quite useful
• Find sequences of terminals that match the right-hand side of CFG productions, and reduce them to
non-terminals. Do the same for sequences of terminals/non-terminals to reduce to non-terminals.
Example: Arithmetic
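To make the shift/reduce mechanics concrete, here is a minimal sketch in Python (the toy arithmetic grammar and the parse helper are my own illustration, not from the slides): it shifts tokens onto a stack, tries every reduction whose right-hand side matches the top of the stack, and backtracks from dead ends.

```python
# A minimal sketch of bottom-up shift-reduce parsing with backtracking for a
# tiny arithmetic grammar (my own toy example, not from the slides):
#   E -> E + T | T      T -> T * F | F      F -> num
GRAMMAR = [
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("num",)),
]

def parse(stack, tokens):
    """Return True if (stack, remaining tokens) can be reduced to a lone 'E'."""
    if not tokens and stack == ["E"]:
        return True
    # Try every reduction whose right-hand side matches the top of the stack.
    for lhs, rhs in GRAMMAR:
        n = len(rhs)
        if tuple(stack[-n:]) == rhs:
            if parse(stack[:-n] + [lhs], tokens):   # reduce, then recurse
                return True
    # Otherwise shift the next input token onto the stack (if any remain).
    if tokens and parse(stack + [tokens[0]], tokens[1:]):
        return True
    return False                                    # dead end: backtrack

print(parse([], ["num", "+", "num", "*", "num"]))   # True
print(parse([], ["num", "+", "*", "num"]))          # False
```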
Shift-Reduce Parsing
• This is great for compiler writers (grammars are often unambiguous, and can sometimes be
designed to be).
• Problems: ambiguity. One rule might suggest a shift while another might suggest a reduce.
– S → if E then S | if E then S else S
• On stack: if E then S
• Input: else
– Should we shift (getting the 2nd rule) or reduce (getting the 1st rule)?
– Can get reduce-reduce conflicts as well, given two production rules of the form A → γ and B → γ
(two rules with the same right-hand side).
• Solution: backtracking (a general search method)
– In ambiguous situations, choose one option (using a heuristic)
– whenever we reach a situation where we can’t parse, we backtrack to the location of the ambiguity
Chart Parsing vs. Shift-Reduce
• Shift-reduce parsing: computation is wasted if back-tracking occurs:
– the long NP in “Big brown funny dogs are hungry …” remains parsed while we search for the
next constituent, but it gets undone if a category below it on the stack needs to be reparsed
• Chart parsing avoids reparsing constituents that have already been found grammatical by
storing all grammatical substrings for the duration of the parse.
Chart Parsing
• Uses three main data structures:
– 1) a key list: a list of the next constituents to insert into the chart (e.g., terminals or non-terminals)
– 2) a chart: a triangular table that keeps track of the starting position (x-axis) and length (y-axis)
of each constituent
– 3) a set of “edges”: markers of positions on the chart, the rules/productions involved, and how far
along in each rule the parse currently is (it is this last piece of information that helps avoid reparsing).
• We can use chart-parsing for producing (and
summing) parses for PCFGs and for producing
inside/outside probabilities. But what are these?
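As a rough illustration of the chart idea (not the book’s algorithm), here is a CKY-style sketch in Python for a toy grammar in Chomsky normal form: chart[(start, length)] records every non-terminal already found to dominate that span, so no substring is ever reparsed.

```python
from collections import defaultdict

# A minimal CKY-style chart sketch (toy grammar of my own, not the book's),
# assuming Chomsky normal form.  chart[(i, length)] stores every non-terminal
# already found to dominate words[i : i + length], so no span is re-parsed.
LEXICON = {"big": {"ADJ"}, "brown": {"ADJ"}, "funny": {"ADJ"},
           "dogs": {"NP"},          # treat the bare noun as a full NP to keep the toy grammar small
           "are": {"V"}, "hungry": {"ADJ"}}
BINARY = [("NP", "ADJ", "NP"), ("VP", "V", "ADJ"), ("S", "NP", "VP")]

def cky_chart(words):
    chart = defaultdict(set)
    for i, w in enumerate(words):                     # length-1 spans
        chart[(i, 1)] |= LEXICON.get(w, set())
    for length in range(2, len(words) + 1):           # longer spans, bottom up
        for i in range(len(words) - length + 1):
            for split in range(1, length):            # where to cut the span
                for lhs, b, c in BINARY:
                    if b in chart[(i, split)] and c in chart[(i + split, length - split)]:
                        chart[(i, length)].add(lhs)
    return chart

chart = cky_chart("big brown funny dogs are hungry".split())
print(chart[(0, 4)])          # {'NP'}: the long NP stays in the chart
print("S" in chart[(0, 6)])   # True
```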
PCFGs
• A CFG with a P in front.
• A (typically inherently ambiguous) grammar with
probabilities associated with each production rule
for a non-terminal symbol.
PCFGs
• Example parse:
[Figure: example parse tree, with the probability of each production used shown on the tree]
• Prob of this parse is:
– 0.8 × 0.2 × 0.4 × 0.05 × 0.45 × 0.3 × 0.4 × 0.4 × 0.5 ≈ 3.5e-5
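The principle behind the example: the probability of a particular parse is the product of the probabilities of all productions used in it. A small sketch (the rule probabilities below are made-up placeholders, not the ones from the figure):

```python
# Sketch of the principle above: the probability of a particular parse is the
# product of the probabilities of every production used in it.  The rule
# probabilities here are made-up placeholders, not the ones from the figure.
RULE_PROB = {("S", ("NP", "VP")): 0.8,
             ("NP", ("ADJ", "NP")): 0.2,
             ("NP", ("dogs",)): 0.05,
             ("ADJ", ("big",)): 0.4,
             ("ADJ", ("hungry",)): 0.45,
             ("V", ("are",)): 0.4,
             ("VP", ("V", "ADJ")): 0.3}

def parse_prob(tree):
    """tree = (non-terminal label, [children]); children are trees or terminal strings."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]            # probability of this production
    for c in children:
        if not isinstance(c, str):         # recurse into non-terminal children
            p *= parse_prob(c)
    return p

tree = ("S", [("NP", [("ADJ", ["big"]), ("NP", ["dogs"])]),
              ("VP", [("V", ["are"]), ("ADJ", ["hungry"])])])
print(parse_prob(tree))   # 0.8 * 0.2 * 0.4 * 0.05 * 0.3 * 0.4 * 0.45 ≈ 1.7e-4
```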
PCFGs
• Gives probabilities for a string of words: P(w_{1:n} | G) = Σ_t P(w_{1:n}, t | G), where t varies
over all possible parse trees for the word string w_{1:n}
Domination & Notation
• Non-terminal dominating terminal symbols
• So in general, N^j_{k,l} means that the j-th non-terminal (N^j) dominates the terminals starting at
k and ending at l (inclusive).
Probability Rules
• For a constituent starting at k and ending at l, we get productions of the form N^j_{k,l} → X Y … Z,
for any k, l and any {m, n, …, q : k ≤ m < n ≤ … ≤ q < l}
• X, Y, …, Z can correspond to any valid set of terminals or non-terminals that partition the
terminals from k to l at the split points m, n, …, q.
Chomsky Hierarchy (reminder)
• Type-0: Unrestricted grammars; recursively enumerable languages; Turing machine
• Type-1: Context-sensitive grammars; context-sensitive languages; linear-bounded automaton
• Type-2: Context-free grammars; context-free languages; pushdown automaton
• Type-3: Regular grammars; regular languages; finite-state automaton
Regular grammar (reminder)
• Generates regular languages (e.g., regular expressions without bells/whistles, FSAs, etc.)
• A → a, where A is a non-terminal and a is a terminal.
• A → aB, where A and B are non-terminals and a is a terminal.
• A → ε, where A is a non-terminal.
• Ex: {a^n b^m | m, n > 0} can be generated by:
– S → aA
– A → aA
– A → bB
– B → bB
– B → ε
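A quick sanity check of the grammar above (my own sketch): derive random strings from the rules and confirm they match a+b+.

```python
import random, re

# Sanity check (my own sketch) that the regular grammar above really generates
# { a^n b^m | m, n > 0 }: derive random strings and compare against a+b+.
RULES = {"S": [["a", "A"]],
         "A": [["a", "A"], ["b", "B"]],
         "B": [["b", "B"], []]}          # [] is the empty production B -> ε

def derive(symbol="S"):
    out = []
    for s in random.choice(RULES[symbol]):
        out.append(derive(s) if s in RULES else s)
    return "".join(out)

for _ in range(5):
    s = derive()
    print(s, bool(re.fullmatch(r"a+b+", s)))   # always True
```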
Context-Free (reminder)
• Generates context-free languages
• The left-hand side of a production rule can be only one non-terminal symbol
• Every rule is of the form:
– A → β
– where β is a string of terminals or non-terminals.
• “Context free” since the non-terminal A can always be replaced by β regardless of the context
in which A occurs.
Context-Free (reminder)
• Example: generate {a^n b^n : n ≥ 0}
– A → aAb
– A → ε
• The above language can’t be generated by a regular grammar (the language is not regular)
• Often used for compilers
• No probabilities (yet).
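The derivation A ⇒ aAb ⇒ aaAbb ⇒ … ⇒ a^n b^n can be mimicked directly (a tiny sketch of my own):

```python
# Mimic the derivation A => aAb => ... => a^n b^n (my own illustration).
def gen(n):
    return "" if n == 0 else "a" + gen(n - 1) + "b"

for n in range(4):
    print(repr(gen(n)))   # '', 'ab', 'aabb', 'aaabbb'
```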
Context-Sensitive (reminder)
• Generates context-sensitive languages
• Generated by rules of the form:
– αAβ → αγβ
– where A is a non-terminal, α is any string of terminals/non-terminals, β is any string of
terminals/non-terminals, and γ is any non-empty string of terminals/non-terminals.
Context-Sensitive (reminder)
• Example: {a^n b^n c^n : n ≥ 1}
– A → abc
– A → aABc
– cB → Bc
– bB → bb
• Example 2: { a^n : n prime }
Type-0 (unrestricted) grammar
• All languages that can be recognized by a Turing machine (i.e., ones for which the TM can say
yes and then halt).
• Known as recursively enumerable languages.
• More general than context-sensitive.
• Example (I think):
– { a^n b^m : n prime, m is the prime following n }
– aabbb, aaabbbbb, aaaaabbbbbbb, etc.
Independence Assumptions
• PCFGs are CFGs with probabilities, but there are also statistical independence assumptions
made about the random events.
• Probabilities are unaffected by certain parts of the tree given certain other non-terminals.
Example
• The following parse tree has an equation:
[Figure: a parse tree in which A_{1,5} is above C_{4,5}, and B_{1,3} is outside C_{4,5}]
Chomsky vs. Conditional Independence
• The fact that the grammar is context-free means only that certain production rules can occur in
the grammar specification. The context-free property determines the set of possible trees, not
their probabilities.
• This alone does not mean that the probabilities cannot be influenced by parses outside a given
non-terminal. This is where we need the conditional independence assumptions over probabilistic
events.
• The “events” are parse events, or just dominance events. E.g., without independence assumptions,
p(w_4, w_5 | C_{4,5}, B_{2,3}) might not be the same as p(w_4, w_5 | C_{4,5}, B_{1,3}), or could
even change depending on how we parse some other set of constituents.
Issues
• We want to be able to compute p(w_{1:n} | G) = Σ_t p(w_{1:n}, t | G), summing over parse trees t.
• But the number of parses of a given sentence can grow exponentially in the length of the
sentence for a given grammar, so naïve summing won’t work.
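To see how fast this grows, consider a Chomsky-normal-form grammar in which every split of every span is possible: the number of binary tree shapes over n words is the Catalan number C(n-1). A small illustration (my own, not from the slides):

```python
from math import comb

# Small illustration of why naive summing over parses fails: the number of
# distinct binary tree shapes over n words is the Catalan number C(n-1),
# which grows exponentially in n.
def catalan(k):
    return comb(2 * k, k) // (k + 1)

for n in (5, 10, 20, 40):
    print(n, catalan(n - 1))   # 5 -> 14, 10 -> 4862, 20 -> 1767263190, ...
```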
Problems to Solve
• 1) Compute the probability of a sentence for a given grammar, p(w_{1:n} | G).
• 2) Find the most likely parse for a given sentence.
• 3) Train the grammar probabilities in some good way (e.g., maximum likelihood).
• These can all be solved with the inside-outside algorithm.
A Note on comparisons to HMMs
• The text says that HMMs and probabilistic regular grammars (PRGs) assign different probabilities
to strings.
– A PRG is such that summing the probabilities of *all* sentences of all lengths should be one.
– An HMM says that summing the probabilities of all sentences of a given length T should be one.
• But HMMs have an implicit conditioning on T. The HMM probability is P(w_{1:T} | T), with
Σ_{w_{1:T}} P(w_{1:T} | T) = 1 for each T.
A Note on comparisons to HMMs
• But HMMs can be defined to have an explicit length distribution: P(w_{1:T}) = P(T) P(w_{1:T} | T).
• This can be done implicitly in an HMM by having a start and a stop state, thus defining P(T).
• We can alternatively explicitly make P(T) equal to the PRG probability of summing over all
sentences of length T.
Example: PRG & its parses
forward/backward in HMMs
• Forward (alpha) recursion: α_t(j) = P(w_{1:t}, q_t = j) = Σ_i α_{t-1}(i) a_{ij} b_j(w_t)
• Backward (beta) recursion: β_t(i) = P(w_{t+1:T} | q_t = i) = Σ_j a_{ij} b_j(w_{t+1}) β_{t+1}(j)
• Probability of a sentence: P(w_{1:T}) = Σ_i α_t(i) β_t(i), for any t
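A minimal sketch of these recursions (the 2-state transition and emission tables are made-up toys, not from the slides):

```python
# Minimal sketch of the HMM forward (alpha) and backward (beta) recursions
# above; the 2-state transition/emission tables are made-up toys.
A  = [[0.7, 0.3], [0.4, 0.6]]            # A[i][j] = P(state j | state i)
B  = [[0.9, 0.1], [0.2, 0.8]]            # B[j][w] = P(word w | state j)
pi = [0.5, 0.5]                          # initial state distribution
obs = [0, 1, 1]                          # an observation sequence (word ids)
S, T = len(pi), len(obs)

alpha = [[0.0] * S for _ in range(T)]
beta  = [[0.0] * S for _ in range(T)]
for j in range(S):                       # alpha base case
    alpha[0][j] = pi[j] * B[j][obs[0]]
for t in range(1, T):                    # forward recursion
    for j in range(S):
        alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(S)) * B[j][obs[t]]
for i in range(S):                       # beta base case
    beta[T-1][i] = 1.0
for t in range(T - 2, -1, -1):           # backward recursion
    for i in range(S):
        beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(S))

# Sentence probability: the same value from any time slice t.
for t in range(T):
    print(sum(alpha[t][i] * beta[t][i] for i in range(S)))
```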
Association of α/β with outside/inside
• For a PRG, the backwards β probability is like the stuff below a given non-terminal, or perhaps
“inside” the non-terminal.
• For a PRG, the forwards α probability is like the stuff above a given non-terminal, or perhaps
“outside” the non-terminal.
Inside/Outside (β/α) probability defs.
• Inside probability: the stuff that is inside (or that is
dominated by) a given non-terminal,
corresponding to the terminals in the range k
through l.
• Outside probability: the stuff that is outside the
terminal range k through l.
Inside/Outside probability defs.
• Outside: α_j(k,l) = P(w_{1:k-1}, N^j_{k,l}, w_{l+1:n} | G)
• Inside: β_j(k,l) = P(w_{k:l} | N^j_{k,l}, G)
[Figure: the region inside N^j_{k,l} (the words w_k … w_l) vs. the region outside it]
Inside/Outside probability
• As in HMMs, we can get sentence probabilities given the inside & outside probabilities for any
non-terminal.
Inside probability computation
• For Chomsky normal form:
– for the more general case, see today’s handout from the Charniak text; we need to use details
about edges in the chart.
• Base case (to then go from a terminal on up the parse tree): β_j(k,k) = P(N^j → w_k | G)
• Next case: consider the range k through l, and all non-terminals that could generate w_{k:l}.
Since we are in Chomsky normal form, we have two non-terminals N^p and N^q, but we must
consider all possible terminal ranges within k through l so that these two non-terminals jointly
dominate k through l.
Inside probability computation
β_j(k,l) = P(w_{k:l} | N^j_{k,l}, G)
= Σ_{p,q} Σ_{m=k}^{l-1} P(w_{k:m}, N^p_{k,m}, w_{m+1:l}, N^q_{m+1,l} | N^j_{k,l}, G)
  [sum over all pairs of non-terminals p, q and location of the split m]
= Σ_{p,q} Σ_{m=k}^{l-1} P(N^p_{k,m}, N^q_{m+1,l} | N^j_{k,l}, G) P(w_{k:m} | N^p_{k,m}, G) P(w_{m+1:l} | N^q_{m+1,l}, G)
  [by the chain rule of probability and PCFG conditional independence]
= Σ_{p,q} Σ_{m=k}^{l-1} P(N^j → N^p N^q) β_p(k,m) β_q(m+1,l)
  [rename]
Inside probability computation
• m non-terminal symbols (so up to m^3 distinct binary rules N^j → N^p N^q)
• n = length of the string
• computational complexity is:
– O(n^3 m^3)
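Here is a sketch of the inside recursion for a toy PCFG in Chomsky normal form (the grammar, probabilities, and names are my own illustration; spans are 1-based and inclusive):

```python
from collections import defaultdict

# Sketch of the inside (beta) recursion for a toy PCFG in Chomsky normal form.
# beta[(j, k, l)] = P(w_{k:l} | N^j dominates positions k..l); 1-based, inclusive.
LEX = {("NP", "dogs"): 0.05, ("V", "are"): 0.4,
       ("ADJ", "hungry"): 0.45, ("ADJ", "big"): 0.4}
BIN = {("S", "NP", "VP"): 0.8, ("NP", "ADJ", "NP"): 0.2, ("VP", "V", "ADJ"): 0.3}

def inside(words):
    n = len(words)
    beta = defaultdict(float)
    for k, w in enumerate(words, start=1):            # base case: N^j -> w_k
        for (nt, word), p in LEX.items():
            if word == w:
                beta[(nt, k, k)] += p
    for span in range(2, n + 1):                      # longer spans, bottom up
        for k in range(1, n - span + 2):
            l = k + span - 1
            for (j, p_nt, q_nt), p in BIN.items():
                for m in range(k, l):                 # split point
                    beta[(j, k, l)] += p * beta[(p_nt, k, m)] * beta[(q_nt, m + 1, l)]
    return beta

words = "big dogs are hungry".split()
beta = inside(words)
print(beta[("S", 1, len(words))])   # P(sentence | G): inside prob of S over 1..n
```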
Outside probability
• Outside probability: α_j(k,l) = P(w_{1:k-1}, N^j_{k,l}, w_{l+1:n} | G)
• It can also be used to compute the word probability, as follows:
P(w_{1:n} | G) = Σ_j α_j(k,k) P(N^j → w_k), for any k
• How do we get this?
Outside probability computation
P(w_{1:n} | G) = Σ_j P(w_{1:k-1}, N^j_{k,k}, w_k, w_{k+1:n} | G)
  [un-marginalizing over all non-terminals j: each word w_k is dominated by some non-terminal N^j_{k,k}]
= Σ_j P(w_{1:k-1}, N^j_{k,k}, w_{k+1:n} | G) P(w_k | w_{1:k-1}, N^j_{k,k}, w_{k+1:n}, G)
  [chain rule]
= Σ_j α_j(k,k) P(N^j → w_k)
  [conditional independence and rename]
• Note again: this is true for any k at all.
• So how do we compute outside probabilities?
• Inside probabilities are calculated bottom up,
and (not surprisingly), outside probabilities
are calculated top down, so we start at the
top:
Outside probability computation
• Starting at the top: α_1(1,n) = 1 for the start symbol N^1, and α_j(1,n) = 0 for j ≠ 1.
• Next, to calculate the outside probability for N^j_{k,l} (and considering we’ve got Chomsky
normal form), there are two possibilities for how this could have come from higher constituents:
N^j is either on the left or on the right of its sibling:
These two events are exclusive and exhaustive, so we’ll need to sum the probabilities of the
two cases.
Outside probability computation
α_j(k,l) = Σ_{f,g} Σ_{e=l+1}^{n} α_f(k,e) P(N^f → N^j N^g) β_g(l+1,e)   [N^j on the left of its sibling N^g]
         + Σ_{f,g} Σ_{e=1}^{k-1} α_f(e,l) P(N^f → N^g N^j) β_g(e,k-1)   [N^j on the right of its sibling N^g]
Outside probability computation
• m non-terminal symbols
• n = length of the string
• computational complexity is:
– O(n^3 m^3)
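Continuing the inside() sketch above (same toy grammar and names, my own illustration), here is the outside recursion, with a sanity check that Σ_j α_j(k,k) P(N^j → w_k) recovers the sentence probability for any k:

```python
from collections import defaultdict

# Sketch of the outside (alpha) recursion, reusing BIN, LEX, beta and words
# from the inside() sketch above; "S" is assumed to be the start symbol.
def outside(words, beta):
    n = len(words)
    alpha = defaultdict(float)
    alpha[("S", 1, n)] = 1.0                       # base case: start symbol spans 1..n
    for span in range(n - 1, 0, -1):               # top down, shrinking spans
        for k in range(1, n - span + 2):
            l = k + span - 1
            for (f, p_nt, q_nt), p in BIN.items():
                # N^j is the left child (p_nt): parent spans k..e, sibling spans l+1..e
                for e in range(l + 1, n + 1):
                    alpha[(p_nt, k, l)] += alpha[(f, k, e)] * p * beta[(q_nt, l + 1, e)]
                # N^j is the right child (q_nt): parent spans e..l, sibling spans e..k-1
                for e in range(1, k):
                    alpha[(q_nt, k, l)] += alpha[(f, e, l)] * p * beta[(p_nt, e, k - 1)]
    return alpha

alpha = outside(words, beta)
# Sanity check: sum_j alpha_j(k,k) * P(N^j -> w_k) gives P(sentence | G) for any k.
k = 2
print(sum(alpha[(nt, k, k)] * p for (nt, w), p in LEX.items() if w == words[k - 1]))
```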
Other notes:
• We can form products of inside and outside values (but they have a slightly different meaning
than with HMMs):
α_j(k,l) β_j(k,l) = P(w_{1:n}, N^j_{k,l} | G)   [follows again from conditional independence]
• So the final sum is the probability of the word sequence *and* of some non-terminal constituent
spanning k through l (different from the HMM case, which gives just the observations). We get:
Σ_j α_j(k,l) β_j(k,l) = P(w_{1:n}, some non-terminal spans k through l | G)
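Continuing the same sketches, the product check for one span:

```python
# Continuing the sketches above: sum_j alpha_j(k,l) * beta_j(k,l) is the joint
# probability of the whole sentence and of *some* non-terminal spanning k..l.
k, l = 1, 2
print(sum(alpha[(nt, k, l)] * beta[(nt, k, l)]
          for nt in ("S", "NP", "VP", "V", "ADJ")))   # the NP over "big dogs"
```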
Training PCFG probabilities
• inside/outside procedure can be used to compute
expected sufficient statistics (e.g., posteriors over
production rules) so that we can use these in an
EM iterative update procedure.
– this is very useful when we have a database of text but we don’t have a database of observed
parses for it (i.e., when we don’t have a treebank). We can use EM & inside/outside to find a
maximum likelihood solution for the probability values.
• All we need are equations that give us expected counts.
– Note: the book’s notation mentions ratios of actual counts C, but these are really expected
counts E[count], which are sums of probabilities, as in normal EM. Recall: the expected value
of an indicator is its probability:
• E[1{event}] = Pr(event)
Training PCFG probabilities
• First, we need the definition of the posterior as a ratio of expected counts for a production:
P̂(N^j → N^p N^q) = E[count(N^j → N^p N^q)] / E[count(N^j used)]
• Next we need the expected counts:
E[count(N^j → N^p N^q)] = (1 / P(w_{1:n} | G)) Σ_{1 ≤ k ≤ m < l ≤ n} α_j(k,l) P(N^j → N^p N^q) β_p(k,m) β_q(m+1,l)
Training PCFG probabilities
• Lastly, we need the terminal-node expected counts, i.e., that non-terminal N^i produced
vocabulary word w^j:
E[count(N^i → w^j)] = (1 / P(w_{1:n} | G)) Σ_{k : w_k = w^j} α_i(k,k) P(N^i → w^j)
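Putting the pieces together, here is a sketch of one EM re-estimation step from a single sentence, reusing words, LEX, BIN, beta, and alpha from the sketches above (all names are my own illustration; with a corpus you would sum the counts over sentences):

```python
from collections import defaultdict

# Sketch of one EM (inside-outside) re-estimation step, reusing words, LEX,
# BIN, beta and alpha from the sketches above.
def expected_counts(words, alpha, beta):
    n = len(words)
    sent_prob = beta[("S", 1, n)]       # P(w_{1:n} | G), with start symbol "S"
    rule_count = defaultdict(float)     # E[count(N^j -> N^p N^q)]
    lex_count = defaultdict(float)      # E[count(N^i -> w)]
    nt_count = defaultdict(float)       # E[count(N^j used at all)]
    for (j, p_nt, q_nt), p in BIN.items():
        for k in range(1, n):
            for l in range(k + 1, n + 1):
                for m in range(k, l):
                    c = alpha[(j, k, l)] * p * beta[(p_nt, k, m)] * beta[(q_nt, m + 1, l)] / sent_prob
                    rule_count[(j, p_nt, q_nt)] += c
                    nt_count[j] += c
    for (nt, w), p in LEX.items():
        for k in range(1, n + 1):
            if words[k - 1] == w:
                c = alpha[(nt, k, k)] * p / sent_prob
                lex_count[(nt, w)] += c
                nt_count[nt] += c
    return rule_count, lex_count, nt_count

rule_count, lex_count, nt_count = expected_counts(words, alpha, beta)
# M-step: new rule probability = expected rule count / expected uses of its LHS.
new_prob = {r: c / nt_count[r[0]] for r, c in rule_count.items() if nt_count[r[0]]}
print(new_prob)
```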
Summary
• PCFGs are powerful
• There are good training algorithms for them.
• But are they enough? Context-sensitive grammars were really developed for natural (written
and spoken) language, but algorithms for them are not efficient.