CPE 641
Natural Language Processing
Lecture 7: More on Parsing
Asst. Prof. Nuttanart Facundes, Ph.D.
1
Grammars
• Before you can parse you need a grammar.
• So where do grammars come from?
 Grammar Engineering
 Lovingly hand-crafted decades-long efforts by humans to write
grammars (typically in some particular grammar formalism of interest
to the linguists developing the grammar).
 TreeBanks
 Semi-automatically generated sets of parse trees for the sentences in
some corpus. Typically in a generic lowest common denominator
formalism (of no particular interest to any modern linguist).
2
TreeBank Grammars
• Reading off the grammar…
• The grammar is the set of rules (local
subtrees) that occur in the annotated
corpus
• They tend to avoid recursion (and
elegance and parsimony)
 I.e., they tend to be flat and redundant
• Penn TreeBank (III) has about 17500
grammar rules under this definition.
3
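As a rough illustration of "reading off the grammar": a minimal Python sketch that extracts the local-subtree rules from a Penn-style bracketed tree and counts them. The example sentence and the (label, children) tuple representation are made up for illustration; a real treebank reader would also have to deal with traces, empty elements, and the rest of the annotation.

```python
from collections import Counter

def parse_tree(s):
    """Parse a Penn-style bracketed string such as
    '(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))' into nested
    (label, children) tuples; leaves are plain word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def collect_rules(tree, counts):
    """One rule per local subtree: parent label -> labels of its children."""
    label, children = tree
    rhs = []
    for child in children:
        if isinstance(child, tuple):
            rhs.append(child[0])
            collect_rules(child, counts)
        else:
            rhs.append(child)         # a terminal: the word itself
    counts[(label, tuple(rhs))] += 1

counts = Counter()
collect_rules(parse_tree("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))"), counts)
for (lhs, rhs), n in counts.most_common():
    print(f"{lhs} -> {' '.join(rhs)}    [{n}]")
```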
TreeBanks
4
TreeBanks
5
Sample Rules
6
Example
7
TreeBanks
• TreeBanks provide a grammar (of a sort).
• As we’ll see they also provide the training data for various ML
approaches to parsing.
• But they can also provide useful data for more purely linguistic
pursuits.
 You might have a theory about whether or not something can happen
in a particular language.
 Or a theory about the contexts in which something can happen.
 TreeBanks can give you the means to explore those theories, if you
can formulate the questions in the right way and get the data you
need.
8
Tgrep
• You might, for example, like to grep through
a file filled with trees.
9
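The Tgrep tool itself and its pattern language are not shown in this transcript. As a stand-in, here is a small Python sketch of the same idea using NLTK (assumed to be installed): find every NP that immediately dominates a PP, roughly what a tgrep-style pattern along the lines of "NP < PP" would ask for. The example tree is invented.

```python
# Rough Python stand-in for grepping through a file of trees.
import nltk

trees = [
    nltk.Tree.fromstring(
        "(S (NP (NNS workers)) (VP (VBD dumped) (NP (NP (NNS sacks)) "
        "(PP (IN into) (NP (DT a) (NN bin))))))"
    ),
]

for tree in trees:
    for np in tree.subtrees(lambda st: st.label() == "NP"):
        # keep only NPs with at least one PP among their immediate children
        if any(isinstance(child, nltk.Tree) and child.label() == "PP"
               for child in np):
            print(np)
```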
TreeBanks
• Finally, you should have noted a bit of a
circular argument here.
• Treebanks provide a grammar because we
can read the rules of the grammar out of
the treebank.
• But how did the trees get in there in the
first place? There must have been a
grammar theory in there someplace…
10
TreeBanks
• Typically, not all of the sentences are
hand-annotated by humans.
• They’re automatically parsed and then
hand-corrected.
11
Parsing
• CKY
• Earley
• Both are dynamic programming solutions
that run in O(n**3) time.
 CKY is bottom-up
 Earley is top-down
12
Sample Grammar
13
Dynamic Programming
• DP methods fill tables with partial results
and
 Do not do too much avoidable repeated work
 Solve exponential problems in polynomial
time (sort of)
 Efficiently store ambiguous structures with
shared sub-parts.
14
CKY Parsing
• First we’ll limit our grammar to epsilon-free, binary rules (more later)
• Consider the rule A -> BC
 If there is an A in the input then there must
be a B followed by a C in the input.
 If the A spans from i to j in the input, then
there must be some k s.t. i < k < j
 I.e., the B splits from the C someplace.
15
CKY
• So let’s build a table so that an A spanning
from i to j in the input is placed in cell [i,j] in
the table.
• So a non-terminal spanning an entire
string will sit in cell [0, n]
• If we build the table bottom up we’ll know
that the parts of the A must go from i to k
and from k to j
16
CKY
• Meaning that for a rule like A -> B C we
should look for a B in [i,k] and a C in [k,j].
• In other words, if we think there might be
an A spanning i,j in the input… AND
• A -> B C is a rule in the grammar THEN
• There must be a B in [i,k] and a C in [k,j]
for some i<k<j
17
CKY
• So to fill the table, loop over the cell [i,j]
values in some systematic way
 What constraint should we put on that?
 For each cell loop over the appropriate k
values to search for things to add.
18
CKY Table
19
CKY Algorithm
20
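The algorithm figure from this slide is not reproduced in the transcript. Below is a minimal CKY recognizer sketch in Python, assuming a CNF grammar split into a lexical map (word to possible non-terminals) and a binary-rule map ((B, C) to the non-terminals A with a rule A -> B C); the toy grammar and the table representation are illustrative, not the slide's.

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """Minimal CKY recognizer for a CNF grammar.  table[(i, j)] holds the
    non-terminals that can span words[i:j]."""
    n = len(words)
    table = defaultdict(set)
    for j in range(1, n + 1):                 # columns left to right
        table[(j - 1, j)] |= lexical.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):        # rows bottom to top
            for k in range(i + 1, j):         # all split points i < k < j
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        table[(i, j)] |= binary.get((B, C), set())
    return start in table[(0, n)]

# Toy CNF grammar for "book that flight" (illustrative only)
lexical = {"book": {"V"}, "that": {"Det"}, "flight": {"N"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"S"}}
print(cky_recognize("book that flight".split(), lexical, binary))   # True
```

Filling column by column, bottom to top, matches the ordering discussed on the later "Note" slide: everything a cell needs is to its left or below it.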
CKY Parsing
• Is that really a parser?
21
Note
• We arranged the loops to fill the table a
column at a time, from left to right, bottom
to top.
 This assures us that whenever we’re filling a
cell, the parts needed to fill it are already in
the table (to the left and below)
22
Example
23
Other Ways to Do It?
• Are there any other sensible ways to fill
the table that still guarantee that the cells
we need are already filled?
24
Other Ways to Do It?
25
Sample Grammar
26
Problem
• What if your grammar isn’t binary?
 As in the case of the TreeBank grammar?
• Convert it to binary… any arbitrary CFG can
be rewritten into Chomsky-Normal Form
automatically.
• What does this mean?
 The resulting grammar accepts (and rejects) the
same set of strings as the original grammar.
 But the resulting derivations (trees) are different.
27
Problem
• More specifically, rules have to be of the
form
A -> B C
or
A -> w
That is, rules can expand either to two non-terminals or to a single terminal.
28
Binarization Intuition
• Eliminate chains of unit productions.
• Introduce new intermediate non-terminals into the
grammar that distribute rules with length > 2 over
several rules. So…
S -> A B C
 Turns into
S -> X C
X -> A B
Where X is a symbol that doesn’t occur anywhere else in the
grammar.
29
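A minimal sketch of the length-reduction step in that intuition (long right-hand sides only; unit productions and terminal handling are not shown). The (lhs, rhs) rule representation is an assumption for illustration.

```python
def binarize(rules):
    """Split every rule whose right-hand side is longer than 2 by
    introducing fresh intermediate non-terminals, mirroring the
    S -> A B C  =>  X -> A B, S -> X C  example above.
    A rule is an (lhs, rhs_tuple) pair."""
    out, fresh = [], 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            fresh += 1
            new_nt = f"X{fresh}"                 # not used elsewhere
            out.append((new_nt, (rhs[0], rhs[1])))
            rhs = (new_nt,) + rhs[2:]
        out.append((lhs, rhs))
    return out

print(binarize([("S", ("A", "B", "C"))]))
# [('X1', ('A', 'B')), ('S', ('X1', 'C'))]
```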
CNF Conversion
30
CKY Algorithm
31
Example
Filling column 5
32
Example
33
Example
34
Example
35
Example
36
CKY Notes
• Since it’s bottom up, CKY populates the table
with a lot of phantom constituents.
 Segments that by themselves are constituents but
cannot really occur in the context in which they are
being suggested.
 To avoid this we can switch to a top-down control
strategy or
 We can add some kind of filtering that blocks
constituents where they cannot happen in a final
analysis.
37
Earley Parsing
• Allows arbitrary CFGs
• Top-down control
• Fills a table in a single sweep over the input
words
 Table is length N+1; N is number of words
 Table entries represent
 Completed constituents and their locations
 In-progress constituents
 Predicted constituents
38
States
• The table-entries are called states and are
represented with dotted-rules.
S -> · VP             A VP is predicted
NP -> Det · Nominal   An NP is in progress
VP -> V NP ·          A VP has been found
39
States/Locations
• S -> · VP [0,0]: a VP is predicted at the start of the sentence
• NP -> Det · Nominal [1,2]: an NP is in progress; the Det goes from 1 to 2
• VP -> V NP · [0,3]: a VP has been found starting at 0 and ending at 3
40
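One way to make the state representation concrete: a small Python sketch of a dotted-rule state carrying its span. The class and field names are assumptions for illustration, not code from the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EarleyState:
    lhs: str       # left-hand side, e.g. "NP"
    rhs: tuple     # right-hand side symbols, e.g. ("Det", "Nominal")
    dot: int       # how many rhs symbols have been matched so far
    start: int     # where the constituent begins in the input
    end: int       # how far into the input it has been matched

    def is_complete(self):
        return self.dot == len(self.rhs)

    def next_symbol(self):
        return None if self.is_complete() else self.rhs[self.dot]

# The three states from the slide:
print(EarleyState("S", ("VP",), 0, 0, 0))               # a VP is predicted
print(EarleyState("NP", ("Det", "Nominal"), 1, 1, 2))   # an NP is in progress
print(EarleyState("VP", ("V", "NP"), 2, 0, 3))          # a VP has been found
```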
Graphically
41
Earley
• As with most dynamic programming
approaches, the answer is found by
looking in the table in the right place.
• In this case, there should be an S state in
the final column that spans from 0 to n and
is complete.
• If that’s the case you’re done.
 S -> α · [0,n]
42
Earley
• So sweep through the table from 0 to n…
 New predicted states are created by starting
top-down from S
 New incomplete states are created by
advancing existing states as new constituents
are discovered
 New complete states are created in the same
way.
43
Earley
• More specifically…
1. Predict all the states you can upfront
2. Read a word
   1. Extend states based on matches
   2. Generate new predictions
   3. Go to step 2
3. Look at n to see if you have a winner
44
Earley Code
45
Earley Code
46
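The pseudocode figures on the two "Earley Code" slides are not in this transcript. As a stand-in, here is a minimal Earley recognizer sketch with the usual predictor / scanner / completer operations; the grammar and lexicon formats, the dummy GAMMA start rule, and the tuple state representation are all assumptions for illustration (and epsilon rules are not handled).

```python
def earley_recognize(words, grammar, lexicon, start="S"):
    """Minimal Earley recognizer.  A state is (lhs, rhs, dot, origin);
    chart[i] holds the states ending at position i.  grammar maps a
    non-terminal to its right-hand sides; lexicon maps a word to its
    possible parts of speech (the preterminals)."""
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    chart[0].add(("GAMMA", (start,), 0, 0))          # dummy start state

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:                   # PREDICTOR
                    for prod in grammar[nxt]:
                        state = (nxt, prod, 0, i)
                        if state not in chart[i]:
                            chart[i].add(state)
                            agenda.append(state)
                elif i < n and nxt in lexicon.get(words[i], set()):
                    # SCANNER: the next input word can be this preterminal
                    chart[i + 1].add((nxt, (words[i],), 1, i))
            else:                                    # COMPLETER
                for lhs2, rhs2, dot2, origin2 in list(chart[origin]):
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        state = (lhs2, rhs2, dot2 + 1, origin2)
                        if state not in chart[i]:
                            chart[i].add(state)
                            agenda.append(state)
    return ("GAMMA", (start,), 1, 0) in chart[n]     # completed S over 0..n

# Toy grammar for "book that flight" (illustrative only)
grammar = {"S": [("VP",)], "VP": [("V", "NP")],
           "NP": [("Det", "Nominal")], "Nominal": [("N",)]}
lexicon = {"book": {"V"}, "that": {"Det"}, "flight": {"N"}}
print(earley_recognize("book that flight".split(), grammar, lexicon))  # True
```

As the later "Details" slide notes, this only recognizes; backpointers would be needed to recover the actual parses.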
Example
• Book that flight
• We should find… an S from 0 to 3 that is a
completed state…
47
Example
48
Add To Chart
49
Example
50
Example
51
Efficiency
• For such a simple example, there seems
to be a lot of useless stuff in there.
• Why?
• It’s predicting things that aren’t consistent
with the input
• That’s the flip side of the CKY problem.
52
Details
• As with CKY, that isn’t a parser until we
add the backpointers so that each state
knows where it came from.
53
Back to Ambiguity
• Did we solve it?
54
Ambiguity
55
Ambiguity
• No…
 Both CKY and Earley will result in multiple S
structures for the [0,n] table entry.
 They both efficiently store the sub-parts that
are shared between multiple parses.
 And they obviously avoid re-deriving those
sub-parts.
 But neither can tell us which one is right.
56
Ambiguity
• In most cases, humans don’t notice
incidental ambiguity (lexical or syntactic).
It is resolved on the fly and never noticed.
• We’ll try to model that with probabilities.
• But note something odd and important
about the Groucho Marx example…
57
Heads
• To do that we’re going to make use of the
notion of the head of a phrase
 The head of an NP is its noun
 The head of a VP is its verb
 The head of a PP is its preposition
(It’s really more complicated than that but this
will do.)
58
Example (right)
Attribute grammar
59
Example (wrong)
60
How?
• We used to have
 VP -> V NP PP
P(rule|VP)
 That’s the count of this rule divided by the number
of VPs in a treebank
• Now we have
 VP(dumped) -> V(dumped) NP(sacks) PP(into)
 P(r|VP ^ dumped is the verb ^ sacks is the
head of the NP ^ into is the head of the PP)
 Not likely to have significant counts in any
treebank
61
Declare Independence
• When stuck, exploit independence and
collect the statistics you can…
• We’ll focus on capturing two things
 Verb subcategorization
 Particular verbs have affinities for particular VPs
 Objects’ affinities for their predicates (mostly
their mothers and grandmothers)
 Some objects fit better with some predicates than
others
62
Subcategorization
• Condition particular VP rules on their head… so
r15: VP -> V NP PP P(r|VP)
Becomes
P(r15 | VP ^ dumped)
What’s the count?
How many times was this rule used with dumped, divided
by the number of VPs in which dumped appears in total.
63
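A tiny sketch of that count in Python; the treebank counts below are invented purely to show the arithmetic.

```python
from collections import Counter

# Invented counts of (head verb, VP rule) pairs read off a treebank.
vp_rule_counts = Counter({
    ("dumped", ("V", "NP", "PP")): 6,
    ("dumped", ("V", "NP")): 9,
    ("dumped", ("V", "PP")): 1,
})

def p_rule_given_head(rule, head):
    """P(rule | VP, head) = count(rule used with this head verb)
                            / count(all VPs headed by that verb)."""
    total = sum(c for (h, _), c in vp_rule_counts.items() if h == head)
    return vp_rule_counts[(head, rule)] / total if total else 0.0

# P(r15: VP -> V NP PP | VP ^ dumped) under these toy counts
print(p_rule_given_head(("V", "NP", "PP"), "dumped"))    # 6 / 16 = 0.375
```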
Preferences
• Verb subcategorization captures the
affinity between VP heads (verbs) and the
VP rules they go with.
 That is the affinity between a node and one of
its daughter nodes.
• What about the affinity between VP heads
and the heads of the other daughters of
the VP?
• Back to our examples…
64
Example (right)
65
Example (wrong)
66
Preferences
• The issue here is the attachment of the PP. So
the affinities we care about are the ones
between dumped and into vs. sacks and into.
• So count the places where dumped is the head
of a constituent that has a PP daughter with into
as its head and normalize
• Vs. the situation where sacks is a constituent
with into as the head of a PP daughter.
67
Preferences (2)
• Consider the VPs
 Ate spaghetti with gusto
 Ate spaghetti with marinara
• Here the heads of the PPs are the same (with) so that
won’t help.
• But the affinity of gusto for eat is much larger than its
affinity for spaghetti
• On the other hand, the affinity of marinara for spaghetti
is much higher than its affinity for ate (we hope).
68
Preferences (2)
• Note the relationship here is more
distant and doesn’t involve a headword
since gusto and marinara aren’t the
heads of the PPs.
[Figure: the two attachments, sketched as bracketed trees]
[VP(ate) [VP(ate) [V ate] [NP spaghetti]] [PP(with) with gusto]]
[VP(ate) [V ate] [NP(spaghetti) [NP spaghetti] [PP(with) with marinara]]]
69
Note
• In case someone hasn’t pointed this out
yet, this lexicalization stuff is a thinly veiled
attempt to incorporate semantics into the
syntactic parsing process…
 Duh… Picking the right parse requires the
use of semantics.
70
Rule Rewriting
• An alternative to using these kinds of
probabilistic lexical dependencies is to
rewrite the grammar so that the rules do
capture the regularities we want.
 By splitting and merging the non-terminals in
the grammar.
 Example: split NPs into different classes…
71
NPs
• Our CFG rules for NPs don’t condition on
where the rule is applied (they’re context-free, remember)
• But we know that not all the rules occur
with equal frequency in all contexts.
72
Other Examples
• Lots of other examples like this in the
TreeBank
 Many at the part of speech level
 Recall that many decisions made in
annotation efforts are directed towards
improving annotator agreement, not towards
doing the right thing.
 Often this involves conflating distinct classes into a
larger class
• TO, IN, Det, etc.
73
Rule Rewriting
• Three approaches
 Use linguistic intuitions to directly rewrite rules
 The NP_Obj and NP_Subj approach
 Automatically rewrite the rules using context
to capture some of what we want
 I.e., incorporate context into a context-free approach
 Search through the space of rewrites for the
grammar that maximizes the probability of the
training set
74
Local Context Approach
• Condition the rules based on their parent
nodes
 This splitting based on tree-context captures
some of the linguistic intuitions
75
Parent Annotation
• Now we have non-terminals NP^S and NP^VP
that should capture the subject/object and
pronoun/full NP cases.
76
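A small sketch of what parent annotation does to the treebank trees, using the (label, children) tuple representation assumed in the earlier sketches. In practice only selected categories are split; here every non-terminal gets annotated for simplicity.

```python
def annotate_parents(tree, parent=None):
    """Append the parent's label to every non-terminal, so an NP under S
    becomes NP^S and an NP under VP becomes NP^VP.  Leaves (words) are
    left alone."""
    label, children = tree
    new_label = f"{label}^{parent}" if parent else label
    new_children = [annotate_parents(child, label) if isinstance(child, tuple)
                    else child
                    for child in children]
    return (new_label, new_children)

tree = ("S", [("NP", ["he"]),
              ("VP", [("V", ["booked"]),
                      ("NP", [("Det", ["a"]), ("N", ["flight"])])])])
print(annotate_parents(tree))
# ('S', [('NP^S', ['he']), ('VP^S', [('V^VP', ['booked']),
#        ('NP^VP', [('Det^NP', ['a']), ('N^NP', ['flight'])])])])
```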
Parent Annotation
• Recall what’s going on here. We’re in effect rewriting the
treebank, thus rewriting the grammar.
• And changing the probabilities since they’re being
derived from different counts…
 And if we’re splitting, what’s happening to the counts?
77
Auto Rewriting
• If this is such a good idea we may as well
apply a learning approach to it.
• Start with a grammar (perhaps a treebank
grammar)
• Search through the space of splits/merges
for the grammar that in some sense
maximizes parsing performance on the
training/development set.
78
Auto Rewriting
• Basic idea…
 Split every non-terminal into two new non-terminals
across the entire grammar (X becomes X1 and X2).
 Duplicate all the rules of the grammar that use X,
dividing the probability mass of the original rule
almost equally.
 Run EM to readjust the rule probabilities
 Perform a merge step to back off the splits that look
like they don’t really do any good.
79
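A sketch of just the split step from that recipe (the EM re-estimation and the merge step are not shown). The rule representation, the X1/X2 naming, and the "uppercase means non-terminal" convention are all assumptions for illustration.

```python
import random

def split_nonterminals(rules):
    """Split every non-terminal X into X1 and X2, duplicate each rule over
    all combinations of the split symbols, and divide the original rule's
    probability mass almost equally (a little noise breaks the symmetry so
    EM has something to work with).  A rule is (lhs, rhs_tuple, prob);
    uppercase symbols are non-terminals, lowercase are terminals."""
    def variants(sym):
        return [sym + "1", sym + "2"] if sym[0].isupper() else [sym]

    out = []
    for lhs, rhs, p in rules:
        rhs_combos = [[]]
        for sym in rhs:
            rhs_combos = [c + [v] for c in rhs_combos for v in variants(sym)]
        for new_lhs in variants(lhs):
            for combo in rhs_combos:
                share = (p / len(rhs_combos)) * random.uniform(0.95, 1.05)
                out.append((new_lhs, tuple(combo), share))
    return out

for rule in split_nonterminals([("S", ("NP", "VP"), 1.0)]):
    print(rule)
# S1 and S2 each get four rules over {NP1, NP2} x {VP1, VP2},
# each with probability close to 0.25
```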
Last Point
• Statistical parsers are getting quite good,
but it’s still quite silly to expect them to
come up with the correct parse given only
statistically massaged syntactic information.
• But it’s not so crazy to think that they can
come up with the right parse among the
top-N parses.
• Lots of current work on
 Re-ranking to make the top-N list even better.
80
Evaluation
• So if it’s unreasonable to expect these
probabilistic parsers to get the right answer,
what can we expect from them, and how do
we measure it?
• Look at the content of the trees rather than
the entire trees.
 Assuming that we have gold standard trees for
test sentences
81
Evaluation
• Precision
 What fraction of the sub-trees in our parse
matched corresponding sub-trees in the
reference answer?
 How much of what we’re producing is right?
• Recall
 What fraction of the sub-trees in the reference
answer did we actually get?
 How much of what we should have gotten did we
get?
82
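A minimal sketch of computing those two numbers over labeled constituents, using the (label, children) tuple trees assumed earlier. The toy hypothesis/reference pair and its labels are invented, and real PARSEVAL scoring has extra conventions (such as ignoring preterminals) that are skipped here.

```python
def spans(tree, start=0, out=None):
    """Collect (label, start, end) for every constituent and return the end
    position of this subtree; each leaf word advances the position by one."""
    if out is None:
        out = []
    label, children = tree
    pos = start
    for child in children:
        pos = spans(child, pos, out) if isinstance(child, tuple) else pos + 1
    out.append((label, start, pos))
    return pos

def precision_recall(hypothesis, reference):
    hyp, ref = [], []
    spans(hypothesis, 0, hyp)
    spans(reference, 0, ref)
    correct = len(set(hyp) & set(ref))       # matching labeled spans
    return correct / len(hyp), correct / len(ref)

# Invented 3-word example: the parser groups the first two words,
# the reference groups the last two.
hyp = ("S", [("X", [("A", ["a"]), ("B", ["b"])]), ("C", ["c"])])
ref = ("S", [("A", ["a"]), ("Y", [("B", ["b"]), ("C", ["c"])])])
print(precision_recall(hyp, ref))            # (0.8, 0.8)
```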
Evaluation
• Crossing brackets
Parser hypothesis
((A B) C)
Reference answer
(A (B C))
83
Example
84