
Probabilistic Parsing and Treebanks
L545, Spring 2000
Motivation and Outline
• Previously, we used CFGs to parse with, but:
- Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse
- We did not discuss how to obtain such grammars
• Where we're going:
- Probabilistic Context-Free Grammars (PCFGs)
  + some discussion of treebanks
- Lexicalized PCFGs
• We'll only cover PCFGs in very broad strokes; L645 offers more details
Statistical Parsing
• Basic idea
- Start with a treebank
  • a collection of sentences with syntactic annotation, i.e., already-parsed sentences
- Examine which parse trees occur frequently
- Extract grammar rules corresponding to those parse trees, estimating the probability of each rule based on its frequency (see the sketch below)
• Result: a CFG augmented with probabilities
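As a minimal sketch of what "extracting grammar rules" from already-parsed sentences can look like, here is a hypothetical encoding of trees as nested (label, children...) tuples (not any particular treebank format) and a helper that reads off the rules used in a tree:

# Hypothetical tree encoding: (label, child, child, ...), with bare strings as
# the words at the leaves.
def productions(tree):
    """Return the list of rules (LHS, RHS-tuple) used in a parse tree."""
    label, children = tree[0], tree[1:]
    rules = [(label, tuple(c if isinstance(c, str) else c[0] for c in children))]
    for child in children:
        if not isinstance(child, str):      # recurse into non-leaf daughters
            rules.extend(productions(child))
    return rules

tree = ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight"))))
for lhs, rhs in productions(tree):
    print(lhs, "->", " ".join(rhs))         # S -> VP, VP -> V NP, V -> book, ...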
Probabilistic Context-Free Grammars (PCFGs)
• Definition of a CFG (review):
- Set of non-terminals (N)
- Set of terminals (T)
- Set of rules/productions (P), of the form A → β
- Designated start symbol (S)
• Definition of a PCFG:
- Same as a CFG, but with one more function, D
- D assigns probabilities to each rule in P
Probabilities
• The function D gives probabilities for a non-terminal A to be expanded to a sequence β
- Written as P(A → β)
- or as P(A → β | A)
• Idea: given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS?
- Note that Σ_i P(A → β_i | A) = 1
  • i.e., these are all the ways of expanding A
- For example, we would augment a CFG with these probabilities:
  • P(S → NP VP | S) = .80
  • P(S → Aux NP VP | S) = .15
  • P(S → VP | S) = .05
Estimating Probabilities using a Treebank
• Given a corpus of sentences with syntactic annotation (e.g., the Penn Treebank)
- Consider all parse trees
- (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule
- (2) Also count the number of times A is on the left-hand side of a rule
- Divide (1) by (2)
  • P(A → β | A) = Count(A → β) / Count(A)
• If you don't have annotated data, parse the corpus (as we'll describe next) and estimate the probabilities from those parses … which are then used to re-parse and re-estimate
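A rough illustration of the count-and-divide estimate above; the input format (one list of (LHS, RHS) rules per treebank tree, e.g. as read off by the earlier sketch) is an assumption for illustration:

from collections import Counter

def estimate_pcfg(treebank_rules):
    """MLE: P(A -> beta | A) = Count(A -> beta) / Count(A on a left-hand side)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for rules in treebank_rules:            # one list of (LHS, RHS) rules per tree
        for lhs, rhs in rules:
            rule_counts[(lhs, rhs)] += 1    # (1) count each rule application
            lhs_counts[lhs] += 1            # (2) count A as a left-hand side
    return {(lhs, rhs): count / lhs_counts[lhs]          # divide (1) by (2)
            for (lhs, rhs), count in rule_counts.items()}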
Using Probabilities to Parse
• P(T): probability of a particular parse tree
• P(T) = Π_{n ∈ T} p(r(n)), i.e., the product of the probabilities of the rules r used to expand each node n in the parse tree
Computing probabilities
• We have the following rules and probabilities (adapted from Figure 14.1):
- S → VP          .05
- VP → V NP       .40
- NP → Det N      .20
- V → book        .30
- Det → that      .05
- N → flight      .25
• P(T) = P(S → VP) * P(VP → V NP) * … * P(N → flight)
  = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 × 10^-5
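A quick check of this product; the rule probabilities are the ones listed above, and listing the rules used in the tree by hand stands in for walking the actual parse tree:

import math

# Rule probabilities as listed above (adapted from Figure 14.1).
probs = {
    "S -> VP": .05, "VP -> V NP": .40, "NP -> Det N": .20,
    "V -> book": .30, "Det -> that": .05, "N -> flight": .25,
}

# The rules used to expand each node of the parse tree for "book that flight".
rules_in_tree = ["S -> VP", "VP -> V NP", "V -> book",
                 "NP -> Det N", "Det -> that", "N -> flight"]

# P(T) is the product of the rule probabilities over all nodes of the tree.
print(math.prod(probs[r] for r in rules_in_tree))   # ≈ 1.5e-05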
Using probabilities
• So, the probability for that parse is 0.000015. What does this mean?
- Probabilities are useful for comparing with other probabilities
• Whereas we couldn't decide between two parses using a regular CFG, we now can
• For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP:
- A: [book [TWA] [flights]]
- B: [book [TWA flights]]
• Probabilities allow us to choose parse B
Obtaining the best parse
• Call the best parse T(S), where S is your sentence
- Get the tree which has the highest probability, i.e.,
- T(S) = argmax_{T ∈ parse-trees(S)} P(T)
• Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse
The CYK algorithm
• Base case
- Add words to the chart
- Store P(A → w_i) for every category A in the chart
• Recursive case
- Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C)
  • P(B) * P(C) * P(A → B C)
- Only calculate this once
- Rules must be of the form A → B C, i.e., exactly two items on the RHS (Chomsky Normal Form (CNF))
• For a given A, only keep the maximum probability (see the sketch below)
- Previously, we kept A, but without any probabilities
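A compact sketch of this probabilistic CYK computation, under stated assumptions: the grammar encoding is hypothetical (separate tables for lexical and binary rules), and the unary rules of the earlier example have been folded into a single hypothetical CNF rule S → V NP with probability .05 * .40 = .02 so that the grammar fits the A → B C requirement:

def cyk(words, lexical, binary, start="S"):
    """Return the maximum parse-tree probability for `words` rooted in `start`.

    lexical[(A, word)] = P(A -> word); binary[(A, B, C)] = P(A -> B C).
    """
    n = len(words)
    chart = {}                                    # chart[(i, j)]: category -> best prob for words[i:j]
    for i, w in enumerate(words):                 # base case: add the words
        chart[(i, i + 1)] = {A: p for (A, word), p in lexical.items() if word == w}
    for span in range(2, n + 1):                  # recursive case, smaller spans first
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):             # split point between B and C
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = chart[(i, k)][B] * chart[(k, j)][C] * p   # P(B)*P(C)*P(A -> B C)
                        if prob > cell.get(A, 0.0):                      # keep only the max for A
                            cell[A] = prob
            chart[(i, j)] = cell
    return chart[(0, n)].get(start, 0.0)

lexical = {("V", "book"): .30, ("Det", "that"): .05, ("N", "flight"): .25}
binary = {("S", "V", "NP"): .02, ("NP", "Det", "N"): .20}    # hypothetical CNF rules
print(cyk("book that flight".split(), lexical, binary))      # ≈ 1.5e-05, as in the earlier example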
Problems with PCFGs
• It's still only a CFG, so dependencies on non-CFG information are not captured
- e.g., pronouns are more likely to be subjects than objects:
- P(NP → Pronoun | NP = subject) >> P(NP → Pronoun | NP = object)
• Ignores lexical information (statistics), which is usually crucial for disambiguation
- (T1) America sent [[250,000 soldiers] [into Iraq]]
- (T2) America sent [250,000 soldiers] [into Iraq]
  • send with an into-PP is the correct analysis (T2), because they "go well" together
• To handle lexical information, we'll turn to lexicalized PCFGs
Lexicalized Grammars
• Remember how head information is passed up in a syntactic analysis?
- e.g., VP[head [1]] → V[head [1]] NP
- If you follow this down all the way to the bottom of a tree, you wind up with a head word
• In some sense, we can say that Book that flight is not just an S, but an S rooted in book
- Thus, book is the headword of the whole sentence
• By adding headword information to nonterminals, we wind up with a lexicalized grammar (see the sketch below)
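A minimal sketch of following head daughters down to a head word. The head-rules table is a hypothetical toy (real head-finding rules are considerably richer), and the tuple tree encoding is the one assumed in the earlier sketches:

HEAD_DAUGHTER = {"S": "VP", "VP": "V", "NP": "N"}    # hypothetical: mother category -> head daughter

def headword(tree):
    """Return the head word of a tuple-encoded tree (label, child, child, ...)."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):                 # a preterminal: its word is its head
        return children[0]
    head_child = next(c for c in children if c[0] == HEAD_DAUGHTER[label])
    return headword(head_child)

tree = ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight"))))
print(headword(tree))    # book -- so this S is "rooted in book"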
Lexicalized PCFGs
• Lexicalized Parse Trees
- Each PCFG rule in a tree is augmented to identify one RHS constituent to be the head daughter
- The headword for a node is set to the head word of its head daughter
(Figure: a lexicalized parse tree, with nodes annotated with their headwords, e.g. [book] and [flight])
Incorporating Head Probabilities: Wrong Way
• Simply adding headword w to a node won't work:
- So, the node A becomes A[w]
- e.g., P(A[w] → β | A) = Count(A[w] → β) / Count(A)
• The probabilities are too small, i.e., we don't have a big enough corpus to calculate these probabilities
- VP(dumped) → VBD(dumped) NP(sacks) PP(into)   3 × 10^-10
- VP(dumped) → VBD(dumped) NP(cats) PP(into)    8 × 10^-11
• These probabilities are tiny, and others will never occur
Incorporating Head Probabilities: Right Way
• Previously, we conditioned on the mother node (A):
- P(A → β | A)
• Now, we can condition on the mother node and the headword of A (h(A)):
- P(A → β | A, h(A))
• We're no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head
- e.g., P(VP → VBD NP PP | VP, dumped)
- The likelihood of VP expanding to VBD NP PP when dumped is the head is different than when ate is the head
Calculating rule probabilities
• We'll write the probability more generally as:
- P(r(n) | n, h(n))
- where n = node, r = rule, and h = headword
• We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total (see the sketch below):
  P(VP → VBD NP PP | VP, dumped)
    = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β)
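A sketch of this estimate, assuming the relevant events have already been read off a lexicalized treebank as hypothetical (category, RHS, headword) triples:

from collections import Counter

def estimate_lexicalized_rules(events):
    """MLE for P(r(n) | n, h(n)): events are (LHS category, RHS tuple, headword) triples."""
    rule_counts, lhs_head_counts = Counter(), Counter()
    for lhs, rhs, head in events:
        rule_counts[(lhs, rhs, head)] += 1    # e.g. C(VP(dumped) -> VBD NP PP)
        lhs_head_counts[(lhs, head)] += 1     # e.g. sum over beta of C(VP(dumped) -> beta)
    return {(lhs, rhs, head): c / lhs_head_counts[(lhs, head)]
            for (lhs, rhs, head), c in rule_counts.items()}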
Adding info about word-word dependencies
• We want to take into account one other factor: the probability of being a head word (in a given context)
- P(h(n) = word | …)
• We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n)))
- P(h(n) = word | n, h(m(n))), shortened as P(h(n) | n, h(m(n)))
- P(sacks | NP, dumped)
• What we're really doing is factoring in how words relate to each other
• These are dependency relations: sacks is dependent on dumped, in this case
Putting it all together
• See sec. 14.6 for an example lexicalized parse tree for workers dumped sacks into a bin
• For rules r, category n, head h, mother m:
  P(T) = Π_{n ∈ T} p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))
- p(r(n) | n, h(n)): subcategorization info, e.g. P(VP → VBD NP PP | VP, dumped)
- p(h(n) | n, h(m(n))): dependency info between words, e.g. P(sacks | NP, dumped)
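A sketch of that product, assuming the lexicalized tree has been flattened into hypothetical per-node records and that both probability tables have already been estimated (e.g. as in the earlier sketches):

# Each node record is (category, RHS tuple, headword, headword of the mother);
# for the root, the mother headword can be a dummy symbol such as "TOP".
def lexicalized_tree_probability(nodes, rule_probs, head_probs):
    p = 1.0
    for cat, rhs, head, mother_head in nodes:
        p *= rule_probs[(cat, rhs, head)]           # p(r(n) | n, h(n))    -- subcategorization
        p *= head_probs[(head, cat, mother_head)]   # p(h(n) | n, h(m(n))) -- word-word dependency
    return p

# e.g. a hypothetical record for the object NP of "dumped" would contribute
#   rule_probs[("NP", ("NNS",), "sacks")]   i.e. P(NP -> NNS | NP, sacks)
# * head_probs[("sacks", "NP", "dumped")]   i.e. P(sacks | NP, dumped)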
Evaluating Parser Output
• Traditional measures of parser accuracy (see the sketch below):
- Labeled bracketing precision: # correct constituents in parse / # constituents in parse
- Labeled bracketing recall: # correct constituents in parse / # constituents in the treebank (gold) parse
• There are known problems with these measures, so people are trying to use dependency-based measures instead
- How many dependency relations did the parse get correct?
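A minimal sketch of labeled bracketing precision and recall, treating each constituent as a hypothetical (label, start, end) span over word positions; a parse constituent counts as correct if the same labeled span appears in the treebank (gold) parse:

def precision_recall(parse_spans, gold_spans):
    """Labeled bracketing precision and recall over (label, start, end) spans."""
    correct = len(set(parse_spans) & set(gold_spans))
    precision = correct / len(parse_spans)   # correct constituents / constituents in parse
    recall = correct / len(gold_spans)       # correct constituents / constituents in treebank parse
    return precision, recall

parse = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]   # hypothetical spans
gold  = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("PP", 2, 3)]
print(precision_recall(parse, gold))   # (0.75, 0.75)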