Lexicalized & Probabilistic Parsing

Download Report

Transcript Lexicalized & Probabilistic Parsing

Chapter 12: Probabilistic Parsing
and Treebanks
Heshaam Faili
[email protected]
University of Tehran
Motivation and Outline

Previously, we used CFGs to parse with, but:



Some ambiguous sentences could not be
disambiguated, and we would like to know the
most likely parse
How do we get such grammars? Do we write
them ourselves? Maybe we could use a corpus …
Where we’re going:



Probabilistic Context-Free Grammars (PCFGs)
Lexicalized PCFGs
Dependency Grammars
2
Statistical Parsing

Basic idea

Start with a treebank




a collection of sentences with syntactic annotation, i.e.,
already-parsed sentences
Examine which parse trees occur frequently
Extract grammar rules corresponding to those
parse trees, estimating the probability of the
grammar rule based on its frequency
That is, we’ll have a CFG augmented with
probabilities
3
Probabilistic Context-Free
Grammars (PCFGs)

Definition of a CFG:





Set of non-terminals (N)
Set of terminals (T)
Set of rules/productions (P), of the form Α  β
Designated start symbol (S)
Definition of a PCFG:


Same as a CFG, but with one more function, D
D assigns probabilities to each rule in P
4
Probabilities

The function D gives probabilities for a non-terminal
A to be expanded to a sequence β.



The idea is that, given A as the mother non-terminal
(LHS), what is the likelihood that β is the correct RHS


Written as P(A  β)
or as P(A  β|A)
Note that Σi (A  βi | A) = 1
For example, we would augment a CFG with these
probabilities:



P(S  NP VP | S) = .80
P(S  Aux NP VP | S) = .15
P(S  VP | S) = .05
5
Estimating Probabilities using a
Treebank

Given a corpus of sentences annotated with syntactic
annotation (e.g., the Penn Treebank)






Consider all parse trees
(1) Each time you have a rule of the form Aβ applied in a
parse tree, increment a counter for that rule
(2) Also count the number of times A is on the left hand side
of a rule
Divide (1) by (2)
P(Aβ|A) = Count(Aβ)/Count(A)
If you don’t have annotated data, parse the corpus
(as we’ll describe next) and estimate the probabilities
… which are then used to re-parse.
6
An Example
7
Using Probabilities to Parse


P(T): Probability of a particular parse tree
P(T,S) = ΠnєT p( r(n) ) =
P(T).P(S|T)
but P(S|T) = 1 ?
P(T) = ΠnєT p( r(n) )
i.e., the product of the probabilities of all
the rules r used to expand each node n in
the parse tree
Example: given the probabilities on p. 449,
compute the probability of the parse tree
on the right

8
Computing probabilities

We have the following rules and probabilities
(adapted from Figure 12.1):







S  VP
VP  V NP
NP  Det N
V  book
Det  that
N  flight
.05
.40
.20
.30
.05
.25
P(T) = P(SVP)*P(VPV NP)*…*P(Nflight)
= .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 x 10-5
9
Using probabilities

So, the probability for that parse is 0.000015. What’s
the big deal?



Whereas we couldn’t decide between two parses
using a regular CFG, we now can.
For example, TWA flights is ambiguous between being
two separate NPs (cf. I gave [NP John] [NP money]) or
one NP:



Probabilities are useful for comparing with other probabilities
A: [book [TWA] [flights]]
B: [book [TWA flights]]
Probabilities allows us to choose choice B (see figure
12.2)
10
11
Obtaining the best parse

Call the best parse T(S), where S is
your sentence



Get the tree which has the highest
probability, i.e.
T(S) = argmaxTєparse-trees(S) P(T)
Can use the Cocke-Younger-Kasami
(CYK) algorithm to calculate best parse


CYK is a form of dynamic programming
CYK is a chart parser, like the Earley parser12
The CYK algorithm

Base case



Add words to the chart
Store P(A  wi) for every category A in the chart
Recursive case  makes this dynamic programming
because we only calculate B and C once


Rules must be of the form A  BC, i.e., exactly two items on
the RHS (we call this Chomsky Normal Form (CNF))
Get the probability for A at this node by multiplying the
probabilities for B and for C by P(A  BC)


P(B)*P(C)*P(A  BC)
For a given A, only keep the maximum probability
(again, this is dynamic programming)
13
PCYK pseudo-code
14
Example: The flight includes a meal
15
Problems with PCFGs

It’s still only a CFG, so dependencies on nonCFG info not captured


e.g., Pronouns are more likely to be subjects than
objects:
P[(NPPronoun) | NP=subj] >> P[(NPPronoun) | NP =obj]
16
Problems with PCFGs

Ignores lexical information (statistics), which
is usually crucial for disambiguation


(T1) America sent [[250,000 soldiers] [into Iraq]]
(T2) America sent [250,000 soldiers] [into Iraq]


send with into-PP always attached high (T2) in PTB!
To handle lexical information, we’ll turn to
lexicalized PCFGs
17
Ignore lexical information
VP  VBD NP PP
VP  VBD NP
NP  NP PP
18
Lexicalized Grammars

Remember how we talked about head information
being passed up in a syntactic analysis?





e.g., VP[head *1]  V[head *1] NP
Well, if you follow this down all the way to the
bottom of a tree, you wind up with a head word
In some sense, we can say that Book that flight is
not just an S, but an S rooted in book
Thus, book is the headword of the whole sentence
By adding headword information to nonterminals, we
19
wind up with a lexicalized grammar
Lexicalized Grammars

Best Results until now,


Collins Parser
Charniak Parser
20
Lexicalized PCFGs

Lexicalized Parse Trees


Each PCFG rule in a tree
is augmented to identify
one RHS constituent to be
the head daughter
The headword for a node
is set to the head word of
its head daughter
[book]
[book]
[flight]
[flight]
21
Incorporating Head
Probabilities: Wrong Way

Simply adding headword w to node won’t work:



The probabilities are too small, i.e., we don’t have a
big enough corpus to calculate these probabilities



So, the node A becomes A[w]
e.g., P(A[w]β|A) =Count(A[w]β)/Count(A)
VP(dumped)  VBD(dumped) NP(sacks) PP(into) 3x10-10
VP(dumped)  VBD(dumped) NP(cats) PP(into) 8x10-11
These probabilities are tiny, and others will never
occur
22
23
Incorporating head
probabilities: Right way

Previously, we conditioned on the mother
node (A):


Now, we can condition on the mother node
and the headword of A (h(A)):


P(Aβ|A)
P(Aβ|A, h(A))
We’re no longer conditioning on simply the
mother category A, but on the mother
category when h(A) is the head

e.g., P(VPVBD NP PP | VP, dumped)
24
Calculating rule probabilities

We’ll write the probability more generally as:



P(r(n) | n, h(n))
where n = node, r = rule, and h = headword
We calculate this by comparing how many
times the rule occurs with h(n) as the
headword versus how many times the
mother/headword combination appear in
total:
P(VP  VBD NP PP | VP, dumped)
= C(VP(dumped)  VBD NP PP)/ Σβ
C(VP(dumped)  β)
25
Adding info about word-word
dependencies

We want to take into account one other factor: the
probability of being a head word (in a given context)


We condition this probability on two things:
1. the category of the node (n), and
2. the headword of the mother (h(m(n)))




P(h(n)=word | …)
P(h(n)=word | n, h(m(n))), shortened as:
P(h(n) | n, h(m(n)))
P(sacks | NP, dumped)
What we’re really doing is factoring in how words
relate to each other
We will call this a dependency relation later: sacks is
26
dependent on dumped, in this case
Putting it all together

See p. 459 for an example lexicalized parse
tree for
workers dumped sacks into a bin

For rules r, category n, head h, mother m
P(T) =
ΠnєT
p(r(n)| n, h(n))
e.g., P(VP VBD NP PP |VP, dumped)
subcategorization info
*
p(h(n) | n, h(m(n)))
e.g. P(sacks | NP, dumped)
dependency info between words
27
Dependency Grammar




Capturing relations between words (e.g. dumped and
sacks) is moving in the direction of dependency
grammar (DG)
In DG, there is no such thing as constituency
The structure of a sentence is purely the binary
relations between words
John loves Mary is represented as:



LOVE  JOHN
LOVE  MARY
where A  B means that B depends on A
28
Evaluating Parser Output


Dependency relations are also useful for comparing
parser output to a treebank
Traditional measures of parser accuracy:

Labeled bracketing precision:
# correct constituents in parse/# constituents in parse

Labeled bracketing recall:
# correct constituents in parse/# (correct) constituents in
treebank parse

There are known problems with these measures, so
people are trying to use dependency-based measures
instead

How many dependency relations did the parse get correct?
29